Unix Technical Forum

LIKE search and performance

This is a discussion on LIKE search and performance within the Pgsql Performance forums, part of the PostgreSQL category; --> Hi, I have a table with varchar and text columns, and I have to search through these text in ...


Go Back   Unix Technical Forum > Database Server Software > PostgreSQL > Pgsql Performance

FAQ Members List Calendar Search Today's Posts Mark Forums Read
  #1 (permalink)  
Old 04-19-2008, 09:50 AM
Andy
 
Posts: n/a
Default LIKE search and performance

Hi,

I have a table with varchar and text columns, and I have to search through
these text in the whole table.

An example would be:
SELECT * FROM table
WHERE name like '%john%' or street like '%srt%'

Anyway, the query planner always does seq scan on the whole table and that
takes some time. How can this be optimized or made in another way to be
faster?

I tried to make indexes on the columns but no success.

PG 8.2

Regards,
Andy.

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #2 (permalink)  
Old 04-19-2008, 09:50 AM
Richard Huxton
 
Posts: n/a
Default Re: LIKE search and performance

Andy wrote:
> SELECT * FROM table
> WHERE name like '%john%' or street like '%srt%'
>
> Anyway, the query planner always does seq scan on the whole table and that
> takes some time. How can this be optimized or made in another way to be
> faster?
>
> I tried to make indexes on the columns but no success.


None of the normal indexes will work for finding text in the middle of a
string. If you do think of a simple way of solving this, drop a short
letter explaining your idea to your local patent office followed by the
Nobel prize committee.

However, one of the contrib packages is "tsearch2" which is designed to
do keyword searches on text for you. It'll also handle stemming (e.g.
"search" will match "searching" etc.) with the right choice of
dictionary. Loads of other clever stuff in it too.

It's one of the optional packages with most Linux packaging systems and
on the Windows one too. If you install from source see the contrib/
directory for details.

--
Richard Huxton
Archonet Ltd

---------------------------(end of broadcast)---------------------------
TIP 4: Have you searched our list archives?

http://archives.postgresql.org

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #3 (permalink)  
Old 04-19-2008, 09:50 AM
Alexander Staubo
 
Posts: n/a
Default Re: LIKE search and performance

On 5/23/07, Andy <frum@ar-sd.net> wrote:
> An example would be:
> SELECT * FROM table
> WHERE name like '%john%' or street like '%srt%'
>
> Anyway, the query planner always does seq scan on the whole table and that
> takes some time. How can this be optimized or made in another way to be
> faster?


There's no algorithm in existence that can "index" arbitrary
substrings the way you think. The only rational way to accomplish this
is to first break the text into substrings using some algorithm (eg.,
words delimited by whitespace and punctuation), and index the
substrings individually.

You can do this using vanilla PostgreSQL, and you can use Tsearch2
and/or its GIN indexes to help speed up the searches. The simplest
solution would be to put all the substrings in a table, one row per
substring, along with an attribute referencing the source table/row.

At this point you have effectively reduced your search space: you can
use a query to isolate the words in your "dictionary" that contain the
substrings. So a query might be:

select ... from persons where id in (
select person_id from person_words
where word like '%john%';
)

The "like" search, even though it will use a sequential scan, is bound
to be much faster on these small words than searching for substrings
through large gobs of text in the persons table.

Note that PostgreSQL *can* exploit the index for *prefix* matching, if
you tell it to use the right operator class:

create index persons_name_index on persons (name text_pattern_ops);

or, if you're using varchars (though there is rarely any reason to):

create index persons_name_index on persons (name varchar_pattern_ops);

(These two may be identical internally. Check the docs.)

Now you can do:

select ... from persons where name like 'john%';

which will yield a query plan such as this:

Index Scan using persons_name_index on persons (cost=0.00..8.27
rows=1 width=29) (actual time=0.184..0.373 rows=51 loops=1)
Index Cond: ((name ~>=~ 'john'::text) AND (name ~<~ 'joho'::text))
Filter: (title ~~ 'john%'::text)

Alexander.

---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #4 (permalink)  
Old 04-19-2008, 09:50 AM
Guido Neitzer
 
Posts: n/a
Default Re: LIKE search and performance

Am 23.05.2007 um 09:08 schrieb Andy:

> I have a table with varchar and text columns, and I have to search
> through these text in the whole table.
>
> An example would be:
> SELECT * FROM table
> WHERE name like '%john%' or street
> like '%srt%'
>
> Anyway, the query planner always does seq scan on the whole table
> and that takes some time. How can this be optimized or made in
> another way to be faster?


The problem is that normal indexes cannot be used for "contains"
queries.

If you need fulltext search capabilities you have to take a look at
tsearch2 or an external search engine like Lucene.

cug

---------------------------(end of broadcast)---------------------------
TIP 2: Don't 'kill -9' the postmaster

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #5 (permalink)  
Old 04-19-2008, 09:50 AM
Rigmor Ukuhe
 
Posts: n/a
Default Re: LIKE search and performance

Andy wrote:
> Hi,
>
> I have a table with varchar and text columns, and I have to search
> through these text in the whole table.
>
> An example would be:
> SELECT * FROM table
> WHERE name like '%john%' or street like '%srt%'
>
> Anyway, the query planner always does seq scan on the whole table and
> that takes some time. How can this be optimized or made in another way
> to be faster?


Use tsearch2 (http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/) for full text indexing.

Rigmor

>
> I tried to make indexes on the columns but no success.
>
> PG 8.2
>
> Regards,
> Andy.





---------------------------(end of broadcast)---------------------------
TIP 4: Have you searched our list archives?

http://archives.postgresql.org

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #6 (permalink)  
Old 04-19-2008, 09:50 AM
Andy
 
Posts: n/a
Default Re: LIKE search and performance

Thank you all for the answers.
I will try your suggestions and see what that brings in terms of
performance.

Andy.

> -----Original Message-----
> From: pgsql-performance-owner@postgresql.org
> [mailtogsql-performance-owner@postgresql.org] On Behalf Of
> Rigmor Ukuhe
> Sent: Wednesday, May 23, 2007 6:52 PM
> Cc: pgsql-performance@postgresql.org
> Subject: Re: [PERFORM] LIKE search and performance
>
> Andy wrote:
> > Hi,
> >
> > I have a table with varchar and text columns, and I have to search
> > through these text in the whole table.
> >
> > An example would be:
> > SELECT * FROM table
> > WHERE name like '%john%' or

> street like '%srt%'
> >
> > Anyway, the query planner always does seq scan on the whole

> table and
> > that takes some time. How can this be optimized or made in

> another way
> > to be faster?

>
> Use tsearch2
> (http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/) for
> full text indexing.
>
> Rigmor
>
> >
> > I tried to make indexes on the columns but no success.
> >
> > PG 8.2
> >
> > Regards,
> > Andy.

>
>
>
>
> ---------------------------(end of
> broadcast)---------------------------
> TIP 4: Have you searched our list archives?
>
> http://archives.postgresql.org
>
>



---------------------------(end of broadcast)---------------------------
TIP 5: don't forget to increase your free space map settings

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #7 (permalink)  
Old 04-19-2008, 09:50 AM
James Mansion
 
Posts: n/a
Default Re: LIKE search and performance

Alexander Staubo wrote:
> On 5/23/07, Andy <frum@ar-sd.net> wrote:
>> An example would be:
>> SELECT * FROM table
>> WHERE name like '%john%' or street like
>> '%srt%'
>>
>> Anyway, the query planner always does seq scan on the whole table and
>> that
>> takes some time. How can this be optimized or made in another way to be
>> faster?

>
> There's no algorithm in existence that can "index" arbitrary
> substrings the way you think. The only rational way to accomplish this
> is to first break the text into substrings using some algorithm (eg.,
> words delimited by whitespace and punctuation), and index the
> substrings individually.

That seems rather harsh. If I'd put an index on each of these colomns
I'd certainly
expect it to use the indices - and I'm pretty sure that Sybase would.
I'd expect
it to scan the index leaf pages instead of the table itself - they
should be much
more compact and also likely to be hot in cache.

Why *wouldn't* the planner do this?

James


---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
choose an index scan if your joining column's datatypes do not
match

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #8 (permalink)  
Old 04-19-2008, 09:50 AM
Magnus Hagander
 
Posts: n/a
Default Re: LIKE search and performance

James Mansion wrote:
> Alexander Staubo wrote:
>> On 5/23/07, Andy <frum@ar-sd.net> wrote:
>>> An example would be:
>>> SELECT * FROM table
>>> WHERE name like '%john%' or street like
>>> '%srt%'
>>>
>>> Anyway, the query planner always does seq scan on the whole table and
>>> that
>>> takes some time. How can this be optimized or made in another way to be
>>> faster?

>>
>> There's no algorithm in existence that can "index" arbitrary
>> substrings the way you think. The only rational way to accomplish this
>> is to first break the text into substrings using some algorithm (eg.,
>> words delimited by whitespace and punctuation), and index the
>> substrings individually.

> That seems rather harsh. If I'd put an index on each of these colomns
> I'd certainly
> expect it to use the indices - and I'm pretty sure that Sybase would.
> I'd expect
> it to scan the index leaf pages instead of the table itself - they
> should be much
> more compact and also likely to be hot in cache.


If Sybase is still like SQL Server (or the other way around), it *may*
end up scanning the index *IFF* the index is a clustered index. If it's
a normal index, it will do a sequential scan on the table.

Yes, the leaf page of the index is more compact, but you also have to
scan the intermediate pages to get to the leaf pages. But again, it can
be a win. On such a system.

It's not a win on PostgreSQL, because of our MVCC implementation. We
need to scan *both* index *and* data pages if we go down that route, in
which case it's a lot faster to just scan the data pages alone.

I don't really know how MSSQL deals with this now that they have
MVCC-ish behavior, and I have no idea at all if sybase has anything like
MVCC.


> Why *wouldn't* the planner do this?


The question should be why the optimizer doesn't consider it, and the
executor uses it. The planner really doesn't decide that part :-)
Hopefully, the answer can be found above.

//Magnus

---------------------------(end of broadcast)---------------------------
TIP 3: Have you checked our extensive FAQ?

http://www.postgresql.org/docs/faq

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #9 (permalink)  
Old 04-19-2008, 09:50 AM
James Mansion
 
Posts: n/a
Default Re: LIKE search and performance


> If Sybase is still like SQL Server (or the other way around), it *may*
> end up scanning the index *IFF* the index is a clustered index. If it's
> a normal index, it will do a sequential scan on the table.
>
>

Are you sure its not covered? Have to check at work - but I'm off next
week so it'll have to wait.

> It's not a win on PostgreSQL, because of our MVCC implementation. We
> need to scan *both* index *and* data pages if we go down that route, in
> which case it's a lot faster to just scan the data pages alone.
>
>

Why do you need to go to all the data pages - doesn't the index
structure contain all the keys so
you prefilter and then check to see if the *matched* items are still in
view? I'll be first to admit I
know zip about Postgres, but it seems odd - doesn't the index contain
copies of the key values?.

I suspect that I mis-spoke with 'leaf'. I really just mean 'all index
pages with data', since the scan
does not even need to be in index order, just a good way to get at the
data in a compact way.



---------------------------(end of broadcast)---------------------------
TIP 1: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #10 (permalink)  
Old 04-19-2008, 09:50 AM
Mark Lewis
 
Posts: n/a
Default Re: LIKE search and performance

On Thu, 2007-05-24 at 21:54 +0100, James Mansion wrote:
> > If Sybase is still like SQL Server (or the other way around), it *may*
> > end up scanning the index *IFF* the index is a clustered index. If it's
> > a normal index, it will do a sequential scan on the table.
> >
> >

> Are you sure its not covered? Have to check at work - but I'm off next
> week so it'll have to wait.
>
> > It's not a win on PostgreSQL, because of our MVCC implementation. We
> > need to scan *both* index *and* data pages if we go down that route, in
> > which case it's a lot faster to just scan the data pages alone.
> >
> >

> Why do you need to go to all the data pages - doesn't the index
> structure contain all the keys so
> you prefilter and then check to see if the *matched* items are still in
> view? I'll be first to admit I
> know zip about Postgres, but it seems odd - doesn't the index contain
> copies of the key values?.
>
> I suspect that I mis-spoke with 'leaf'. I really just mean 'all index
> pages with data', since the scan
> does not even need to be in index order, just a good way to get at the
> data in a compact way.


PG could scan the index looking for matches first and only load the
actual rows if it found a match, but that could only be a possible win
if there were very few matches, because the difference in cost between a
full index scan and a sequential scan would need to be greater than the
cost of randomly fetching all of the matching data rows from the table
to look up the visibility information.

So yes it would be possible, but the odds of it being faster than a
sequential scan are small enough to make it not very useful.

And since it's basically impossible to know the selectivity of this kind
of where condition, I doubt the planner would ever realistically want to
choose that plan anyway because of its poor worst-case behavior.

-- Mark

---------------------------(end of broadcast)---------------------------
TIP 5: don't forget to increase your free space map settings

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
Reply


Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are On
Forum Jump


All times are GMT. The time now is 03:31 PM.


Powered by vBulletin® Version 3.6.5
Copyright ©2000 - 2008, Jelsoft Enterprises Ltd.
SEO by vBSEO 3.2.0
www.UnixAdminTalk.com