Unix Technical Forum

text_position worst case runtime

This is a discussion on text_position worst case runtime within the pgsql Hackers forums, part of the PostgreSQL category; --> Tom Lane wrote: > Greg Stark <gsstark@mit.edu> writes: > >>Tom Lane <tgl@sss.pgh.pa.us> writes: >> >>>And how much code would ...


Go Back   Unix Technical Forum > Database Server Software > PostgreSQL > pgsql Hackers

FAQ Members List Calendar Search Today's Posts Mark Forums Read
  #11 (permalink)  
Old 04-12-2008, 02:30 AM
Mark Dilger
 
Posts: n/a
Default Re: text_position worst case runtime

Tom Lane wrote:
> Greg Stark <gsstark@mit.edu> writes:
>
>>Tom Lane <tgl@sss.pgh.pa.us> writes:
>>
>>>And how much code would those take? The bottom line here is that we
>>>don't have a pile of complaints about the performance of text_position,
>>>so it's difficult to justify making it much more complicated than it
>>>is now.

>
>
>>It seems somewhat contrary to the Postgres design philosophy to assume that
>>all strings are small.

>
>
> That is a straw-man argument. If we try to optimize every single
> function in the system to the Nth degree, we'll end up with a system
> that is unmaintainable (and likely unusably buggy as well). We've got
> to set limits on the amount of complexity we're willing to accept in
> the core code.
>
> Note that I have not said "you can't put Boyer-Moore into core".
> What I've said is that the case to justify doing that hasn't been made.
> And handwaving about "design philosophy" isn't the kind of case I'm
> looking for --- common applications in which it makes a real performance
> difference are what I'm looking for.
>
> At this point we haven't even been shown any evidence that text_position
> itself is what to optimize if you need to do searches in large text
> strings. It seems entirely likely to me that the TOAST mechanisms would
> be the bottleneck, instead. And one should also consider other approaches
> entirely, like indexes (tsearch2 anyone?).


In case anyone is following this thread specifically for the biological sequence
data aspect of it, I should mention that I wrote a GiST index for the dna and
protein sequence datatypes. The performance of the index was inconsistent. For
certain data, I could get about two orders of magnitude speed increase on
selects, where the select was based on a limited regular expression approximate
match against the data. But if you change the regular expression (or to a
degree, if you change the data) the performance can drop off to roughly tied
with a sequential scan. And of course, inserts are far more expensive because
the index has to be kept up to date.

If anyone wants specifics, send me an email and I'll put something together.

mark

---------------------------(end of broadcast)---------------------------
TIP 4: Have you searched our list archives?

http://archives.postgresql.org

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #12 (permalink)  
Old 04-12-2008, 02:30 AM
Tom Lane
 
Posts: n/a
Default Re: text_position worst case runtime

Hannu Krosing <hannu@skype.net> writes:
> I guess our regex implementation already uses boyer-moore or similar.
> Why not just expose the match position of substring('text' in 'regex')
> using some function, called match_position(int searched_text, int
> regex, int matchnum) ?


If it did that might be a nice solution, but I'm not sure that it does
use B-M ... I can't find either "Boyer" or "Moore" in its source code.

There's no particular reason to suppose offhand that a regex engine
would be faster than the naive code for fixed patterns.

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
choose an index scan if your joining column's datatypes do not
match

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #13 (permalink)  
Old 04-12-2008, 02:30 AM
Hannu Krosing
 
Posts: n/a
Default Re: text_position worst case runtime

Ühel kenal päeval, R, 2006-05-19 kell 18:18, kirjutas Tom Lane:
> Hannu Krosing <hannu@skype.net> writes:
> > I guess our regex implementation already uses boyer-moore or similar.
> > Why not just expose the match position of substring('text' in 'regex')
> > using some function, called match_position(int searched_text, int
> > regex, int matchnum) ?

>
> If it did that might be a nice solution, but I'm not sure that it does
> use B-M ... I can't find either "Boyer" or "Moore" in its source code.


Ok, maybe it is not optimised for finding longish strings inside even
longers trings.

I had a (false ?) memory that we used some variant of pcre, and that
pcre uses BM. I may be false on both accounts. (I know that python
borrowed its re module from pcre).

> There's no particular reason to suppose offhand that a regex engine
> would be faster than the naive code for fixed patterns.


if naive code is O(n*m), then starting from some values of n and m it is
probably faster if it is based on somewhat optimised regex engine, the
question is, what is the threasold and dataset for fasterness

--
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me: callto:hkrosing
Get Skype for free: http://www.skype.com



---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #14 (permalink)  
Old 04-12-2008, 02:30 AM
Hannu Krosing
 
Posts: n/a
Default Re: text_position worst case runtime

Ühel kenal päeval, L, 2006-05-20 kell 01:34, kirjutas Hannu Krosing:
> Ühel kenal päeval, R, 2006-05-19 kell 18:18, kirjutas Tom Lane:
> > Hannu Krosing <hannu@skype.net> writes:
> > > I guess our regex implementation already uses boyer-moore or similar.
> > > Why not just expose the match position of substring('text' in 'regex')
> > > using some function, called match_position(int searched_text, int
> > > regex, int matchnum) ?

> >
> > If it did that might be a nice solution, but I'm not sure that it does
> > use B-M ... I can't find either "Boyer" or "Moore" in its source code.

>
> Ok, maybe it is not optimised for finding longish strings inside even
> longers trings.
>
> I had a (false ?) memory that we used some variant of pcre, and that
> pcre uses BM. I may be false on both accounts. (I know that python
> borrowed its re module from pcre).


http://www.mcabee.org/lists/snort-us.../msg00026.html

seems to imply that PCRE uses BM at least for some case, so I might not
have been wrong in case 2


> > There's no particular reason to suppose offhand that a regex engine
> > would be faster than the naive code for fixed patterns.

>
> if naive code is O(n*m), then starting from some values of n and m it is
> probably faster if it is based on somewhat optimised regex engine, the
> question is, what is the threasold and dataset for fasterness
>

--
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me: callto:hkrosing
Get Skype for free: http://www.skype.com



---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #15 (permalink)  
Old 04-12-2008, 02:30 AM
Alvaro Herrera
 
Posts: n/a
Default Re: text_position worst case runtime

Hannu Krosing wrote:

> I had a (false ?) memory that we used some variant of pcre, and that
> pcre uses BM. I may be false on both accounts. (I know that python
> borrowed its re module from pcre).


Our code is a derivative from Henry Spencer's code found in Tcl. It
certainly isn't Boyer Moore, because it processes chars one at a time,
left to right (BM processes right to left).

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

---------------------------(end of broadcast)---------------------------
TIP 4: Have you searched our list archives?

http://archives.postgresql.org

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #16 (permalink)  
Old 04-12-2008, 02:31 AM
Greg Stark
 
Posts: n/a
Default Re: text_position worst case runtime


Tom Lane <tgl@sss.pgh.pa.us> writes:

> If it did that might be a nice solution, but I'm not sure that it does
> use B-M ... I can't find either "Boyer" or "Moore" in its source code.
>
> There's no particular reason to suppose offhand that a regex engine
> would be faster than the naive code for fixed patterns.


Well even a lame regexp implementation ought to be O(n+m). The factors will be
less than Boyer-Moore which can skip over substantial sections of the search
space without even looking at the characters. But there's no way it would be
O(n*m) for simple patterns unless the implementation was seriously deficient.

Of course your statement could still be true for particular usage patterns
like searching many different short strings with many different patterns where
the setup time of the regexp tables may dominate.

--
greg


---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
Reply


Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are On
Forum Jump


All times are GMT. The time now is 03:56 PM.


Powered by vBulletin® Version 3.6.5
Copyright ©2000 - 2008, Jelsoft Enterprises Ltd.
SEO by vBSEO 3.2.0
www.UnixAdminTalk.com