This is a discussion on LIKE search and performance within the Pgsql Performance forums, part of the PostgreSQL category; --> Hi, I have a table with varchar and text columns, and I have to search through these text in ...
| |||||||
| FAQ | Members List | Calendar | Search | Today's Posts | Mark Forums Read |
| ||||
| Hi, I have a table with varchar and text columns, and I have to search through these text in the whole table. An example would be: SELECT * FROM table WHERE name like '%john%' or street like '%srt%' Anyway, the query planner always does seq scan on the whole table and that takes some time. How can this be optimized or made in another way to be faster? I tried to make indexes on the columns but no success. PG 8.2 Regards, Andy. |
| |||
| Andy wrote: > SELECT * FROM table > WHERE name like '%john%' or street like '%srt%' > > Anyway, the query planner always does seq scan on the whole table and that > takes some time. How can this be optimized or made in another way to be > faster? > > I tried to make indexes on the columns but no success. None of the normal indexes will work for finding text in the middle of a string. If you do think of a simple way of solving this, drop a short letter explaining your idea to your local patent office followed by the Nobel prize committee. However, one of the contrib packages is "tsearch2" which is designed to do keyword searches on text for you. It'll also handle stemming (e.g. "search" will match "searching" etc.) with the right choice of dictionary. Loads of other clever stuff in it too. It's one of the optional packages with most Linux packaging systems and on the Windows one too. If you install from source see the contrib/ directory for details. -- Richard Huxton Archonet Ltd ---------------------------(end of broadcast)--------------------------- TIP 4: Have you searched our list archives? http://archives.postgresql.org |
| |||
| On 5/23/07, Andy <frum@ar-sd.net> wrote: > An example would be: > SELECT * FROM table > WHERE name like '%john%' or street like '%srt%' > > Anyway, the query planner always does seq scan on the whole table and that > takes some time. How can this be optimized or made in another way to be > faster? There's no algorithm in existence that can "index" arbitrary substrings the way you think. The only rational way to accomplish this is to first break the text into substrings using some algorithm (eg., words delimited by whitespace and punctuation), and index the substrings individually. You can do this using vanilla PostgreSQL, and you can use Tsearch2 and/or its GIN indexes to help speed up the searches. The simplest solution would be to put all the substrings in a table, one row per substring, along with an attribute referencing the source table/row. At this point you have effectively reduced your search space: you can use a query to isolate the words in your "dictionary" that contain the substrings. So a query might be: select ... from persons where id in ( select person_id from person_words where word like '%john%'; ) The "like" search, even though it will use a sequential scan, is bound to be much faster on these small words than searching for substrings through large gobs of text in the persons table. Note that PostgreSQL *can* exploit the index for *prefix* matching, if you tell it to use the right operator class: create index persons_name_index on persons (name text_pattern_ops); or, if you're using varchars (though there is rarely any reason to): create index persons_name_index on persons (name varchar_pattern_ops); (These two may be identical internally. Check the docs.) Now you can do: select ... from persons where name like 'john%'; which will yield a query plan such as this: Index Scan using persons_name_index on persons (cost=0.00..8.27 rows=1 width=29) (actual time=0.184..0.373 rows=51 loops=1) Index Cond: ((name ~>=~ 'john'::text) AND (name ~<~ 'joho'::text)) Filter: (title ~~ 'john%'::text) Alexander. ---------------------------(end of broadcast)--------------------------- TIP 6: explain analyze is your friend |
| |||
| Am 23.05.2007 um 09:08 schrieb Andy: > I have a table with varchar and text columns, and I have to search > through these text in the whole table. > > An example would be: > SELECT * FROM table > WHERE name like '%john%' or street > like '%srt%' > > Anyway, the query planner always does seq scan on the whole table > and that takes some time. How can this be optimized or made in > another way to be faster? The problem is that normal indexes cannot be used for "contains" queries. If you need fulltext search capabilities you have to take a look at tsearch2 or an external search engine like Lucene. cug ---------------------------(end of broadcast)--------------------------- TIP 2: Don't 'kill -9' the postmaster |
| |||
| Andy wrote: > Hi, > > I have a table with varchar and text columns, and I have to search > through these text in the whole table. > > An example would be: > SELECT * FROM table > WHERE name like '%john%' or street like '%srt%' > > Anyway, the query planner always does seq scan on the whole table and > that takes some time. How can this be optimized or made in another way > to be faster? Use tsearch2 (http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/) for full text indexing. Rigmor > > I tried to make indexes on the columns but no success. > > PG 8.2 > > Regards, > Andy. ---------------------------(end of broadcast)--------------------------- TIP 4: Have you searched our list archives? http://archives.postgresql.org |
| |||
| Thank you all for the answers. I will try your suggestions and see what that brings in terms of performance. Andy. > -----Original Message----- > From: pgsql-performance-owner@postgresql.org > [mailto > Rigmor Ukuhe > Sent: Wednesday, May 23, 2007 6:52 PM > Cc: pgsql-performance@postgresql.org > Subject: Re: [PERFORM] LIKE search and performance > > Andy wrote: > > Hi, > > > > I have a table with varchar and text columns, and I have to search > > through these text in the whole table. > > > > An example would be: > > SELECT * FROM table > > WHERE name like '%john%' or > street like '%srt%' > > > > Anyway, the query planner always does seq scan on the whole > table and > > that takes some time. How can this be optimized or made in > another way > > to be faster? > > Use tsearch2 > (http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/) for > full text indexing. > > Rigmor > > > > > I tried to make indexes on the columns but no success. > > > > PG 8.2 > > > > Regards, > > Andy. > > > > > ---------------------------(end of > broadcast)--------------------------- > TIP 4: Have you searched our list archives? > > http://archives.postgresql.org > > ---------------------------(end of broadcast)--------------------------- TIP 5: don't forget to increase your free space map settings |
| |||
| Alexander Staubo wrote: > On 5/23/07, Andy <frum@ar-sd.net> wrote: >> An example would be: >> SELECT * FROM table >> WHERE name like '%john%' or street like >> '%srt%' >> >> Anyway, the query planner always does seq scan on the whole table and >> that >> takes some time. How can this be optimized or made in another way to be >> faster? > > There's no algorithm in existence that can "index" arbitrary > substrings the way you think. The only rational way to accomplish this > is to first break the text into substrings using some algorithm (eg., > words delimited by whitespace and punctuation), and index the > substrings individually. That seems rather harsh. If I'd put an index on each of these colomns I'd certainly expect it to use the indices - and I'm pretty sure that Sybase would. I'd expect it to scan the index leaf pages instead of the table itself - they should be much more compact and also likely to be hot in cache. Why *wouldn't* the planner do this? James ---------------------------(end of broadcast)--------------------------- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match |
| |||
| James Mansion wrote: > Alexander Staubo wrote: >> On 5/23/07, Andy <frum@ar-sd.net> wrote: >>> An example would be: >>> SELECT * FROM table >>> WHERE name like '%john%' or street like >>> '%srt%' >>> >>> Anyway, the query planner always does seq scan on the whole table and >>> that >>> takes some time. How can this be optimized or made in another way to be >>> faster? >> >> There's no algorithm in existence that can "index" arbitrary >> substrings the way you think. The only rational way to accomplish this >> is to first break the text into substrings using some algorithm (eg., >> words delimited by whitespace and punctuation), and index the >> substrings individually. > That seems rather harsh. If I'd put an index on each of these colomns > I'd certainly > expect it to use the indices - and I'm pretty sure that Sybase would. > I'd expect > it to scan the index leaf pages instead of the table itself - they > should be much > more compact and also likely to be hot in cache. If Sybase is still like SQL Server (or the other way around), it *may* end up scanning the index *IFF* the index is a clustered index. If it's a normal index, it will do a sequential scan on the table. Yes, the leaf page of the index is more compact, but you also have to scan the intermediate pages to get to the leaf pages. But again, it can be a win. On such a system. It's not a win on PostgreSQL, because of our MVCC implementation. We need to scan *both* index *and* data pages if we go down that route, in which case it's a lot faster to just scan the data pages alone. I don't really know how MSSQL deals with this now that they have MVCC-ish behavior, and I have no idea at all if sybase has anything like MVCC. > Why *wouldn't* the planner do this? The question should be why the optimizer doesn't consider it, and the executor uses it. The planner really doesn't decide that part :-) Hopefully, the answer can be found above. //Magnus ---------------------------(end of broadcast)--------------------------- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq |
| |||
| > If Sybase is still like SQL Server (or the other way around), it *may* > end up scanning the index *IFF* the index is a clustered index. If it's > a normal index, it will do a sequential scan on the table. > > Are you sure its not covered? Have to check at work - but I'm off next week so it'll have to wait. > It's not a win on PostgreSQL, because of our MVCC implementation. We > need to scan *both* index *and* data pages if we go down that route, in > which case it's a lot faster to just scan the data pages alone. > > Why do you need to go to all the data pages - doesn't the index structure contain all the keys so you prefilter and then check to see if the *matched* items are still in view? I'll be first to admit I know zip about Postgres, but it seems odd - doesn't the index contain copies of the key values?. I suspect that I mis-spoke with 'leaf'. I really just mean 'all index pages with data', since the scan does not even need to be in index order, just a good way to get at the data in a compact way. ---------------------------(end of broadcast)--------------------------- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to majordomo@postgresql.org so that your message can get through to the mailing list cleanly |
| ||||
| On Thu, 2007-05-24 at 21:54 +0100, James Mansion wrote: > > If Sybase is still like SQL Server (or the other way around), it *may* > > end up scanning the index *IFF* the index is a clustered index. If it's > > a normal index, it will do a sequential scan on the table. > > > > > Are you sure its not covered? Have to check at work - but I'm off next > week so it'll have to wait. > > > It's not a win on PostgreSQL, because of our MVCC implementation. We > > need to scan *both* index *and* data pages if we go down that route, in > > which case it's a lot faster to just scan the data pages alone. > > > > > Why do you need to go to all the data pages - doesn't the index > structure contain all the keys so > you prefilter and then check to see if the *matched* items are still in > view? I'll be first to admit I > know zip about Postgres, but it seems odd - doesn't the index contain > copies of the key values?. > > I suspect that I mis-spoke with 'leaf'. I really just mean 'all index > pages with data', since the scan > does not even need to be in index order, just a good way to get at the > data in a compact way. PG could scan the index looking for matches first and only load the actual rows if it found a match, but that could only be a possible win if there were very few matches, because the difference in cost between a full index scan and a sequential scan would need to be greater than the cost of randomly fetching all of the matching data rows from the table to look up the visibility information. So yes it would be possible, but the odds of it being faster than a sequential scan are small enough to make it not very useful. And since it's basically impossible to know the selectivity of this kind of where condition, I doubt the planner would ever realistically want to choose that plan anyway because of its poor worst-case behavior. -- Mark ---------------------------(end of broadcast)--------------------------- TIP 5: don't forget to increase your free space map settings |