Unix Technical Forum

Speeding up the Postgres lexer

This is a discussion on Speeding up the Postgres lexer within the pgsql Hackers forums, part of the PostgreSQL category; --> I was doing some profiles recently that showed that on simple statements (like INSERT with a few column values) ...


Go Back   Unix Technical Forum > Database Server Software > PostgreSQL > pgsql Hackers

FAQ Members List Calendar Search Today's Posts Mark Forums Read
  #1 (permalink)  
Old 04-11-2008, 04:02 AM
Tom Lane
 
Posts: n/a
Default Speeding up the Postgres lexer

I was doing some profiles recently that showed that on simple statements
(like INSERT with a few column values) the basic parser (flex/bison) was
taking up a noticeable percentage of the total CPU time. We already
have done some things to speed up the lexer, like use -CF option for
large-but-fast tables. I read in the flex manual about how you can make
it faster yet if you can avoid the need for backup states. It's not
terribly well documented, but I eventually worked out how to do it, and
find that it seems to make the lexer about a third faster. Basically
what you have to do is provide alternate rules that allow the lexer to
match *some* rule on any prefix of a complete token. For instance the
rule for {real} implies that given
123.456E+
if the next character is not a digit, then the {real} rule fails, and
the lexer has to backtrack to
123.456
which it can recognize as a {decimal}. You can add a rule that does
match in this situation, throws back the two unwanted characters with
yyless(), and returns the number as the {decimal} rule would do.
The next call will carry on lexing the "E" as an identifier, etc.
(Note: this is going to end up being a syntax error because the SQL
grammar never allows a number to immediately precede an identifier;
but that's not the lexer's concern, and so I don't think that we should
make the extra rule throw an error immediately.)

What I'm wondering is whether this is really worth doing or not.
There are currently just two parts of the lexer rules that are affected
--- the {real} rule illustrated above, and the rules that allow quoted
strings to be split across lines as the SQL spec requires. But the
patches are still pretty ugly, and what's really annoying is that there
doesn't seem to be any way to get flex to complain if someone later
makes a change that breaks the no-backup-cases property again.

A prototype patch (completely undocumented :-() is attached. Any
thoughts about whether it's worth pressing forward with?

regards, tom lane

*** src/backend/parser/scan.l.orig Fri Mar 11 21:15:20 2005
--- src/backend/parser/scan.l Mon May 23 01:05:54 2005
***************
*** 148,163 ****
* validate the contents.
*/
xbstart [bB]{quote}
! xbstop {quote}
xbinside [^']*
xbcat {quote}{whitespace_with_newline}{quote}

/* Hexadecimal number
*/
xhstart [xX]{quote}
! xhstop {quote}
xhinside [^']*
xhcat {quote}{whitespace_with_newline}{quote}

/* National character
*/
--- 148,165 ----
* validate the contents.
*/
xbstart [bB]{quote}
! xbstop {quote}{whitespace}*
xbinside [^']*
xbcat {quote}{whitespace_with_newline}{quote}
+ xbconfused {quote}{whitespace}*"-"

/* Hexadecimal number
*/
xhstart [xX]{quote}
! xhstop {quote}{whitespace}*
xhinside [^']*
xhcat {quote}{whitespace_with_newline}{quote}
+ xhconfused {quote}{whitespace}*"-"

/* National character
*/
***************
*** 169,180 ****
*/
quote '
xqstart {quote}
! xqstop {quote}
xqdouble {quote}{quote}
xqinside [^\\']+
xqescape [\\][^0-7]
xqoctesc [\\][0-7]{1,3}
xqcat {quote}{whitespace_with_newline}{quote}

/* $foo$ style quotes ("dollar quoting")
* The quoted string starts with $foo$ where "foo" is an optional string
--- 171,183 ----
*/
quote '
xqstart {quote}
! xqstop {quote}{whitespace}*
xqdouble {quote}{quote}
xqinside [^\\']+
xqescape [\\][^0-7]
xqoctesc [\\][0-7]{1,3}
xqcat {quote}{whitespace_with_newline}{quote}
+ xqconfused {quote}{whitespace}*"-"

/* $foo$ style quotes ("dollar quoting")
* The quoted string starts with $foo$ where "foo" is an optional string
***************
*** 185,190 ****
--- 188,194 ----
dolq_start [A-Za-z\200-\377_]
dolq_cont [A-Za-z\200-\377_0-9]
dolqdelim \$({dolq_start}{dolq_cont}*)?\$
+ dolqfailed \${dolq_start}{dolq_cont}*
dolqinside [^$]+

/* Double quote
***************
*** 247,253 ****

integer {digit}+
decimal (({digit}*\.{digit}+)|({digit}+\.{digit}*))
! real ((({digit}*\.{digit}+)|({digit}+\.{digit}*)|({digi t}+))([Ee][-+]?{digit}+))

param \${integer}

--- 251,259 ----

integer {digit}+
decimal (({digit}*\.{digit}+)|({digit}+\.{digit}*))
! real ({integer}|{decimal})[Ee][-+]?{digit}+
! realfail1 ({integer}|{decimal})[Ee]
! realfail2 ({integer}|{decimal})[Ee][-+]

param \${integer}

***************
*** 310,315 ****
--- 316,325 ----
/* ignore */
}

+ <xc>\*+ {
+ /* ignore */
+ }
+
<xc><<EOF>> { yyerror("unterminated /* comment"); }

{xbstart} {
***************
*** 329,334 ****
--- 339,350 ----
yylval.str = litbufdup();
return BCONST;
}
+ <xb>{xbconfused} {
+ yyless(yyleng-1);
+ BEGIN(INITIAL);
+ yylval.str = litbufdup();
+ return BCONST;
+ }
<xh>{xhinside} |
<xb>{xbinside} {
addlit(yytext, yyleng);
***************
*** 356,361 ****
--- 372,383 ----
yylval.str = litbufdup();
return XCONST;
}
+ <xh>{xhconfused} {
+ yyless(yyleng-1);
+ BEGIN(INITIAL);
+ yylval.str = litbufdup();
+ return XCONST;
+ }
<xh><<EOF>> { yyerror("unterminated hexadecimal string literal"); }

{xnstart} {
***************
*** 385,390 ****
--- 407,418 ----
yylval.str = litbufdup();
return SCONST;
}
+ <xq>{xqconfused} {
+ yyless(yyleng-1);
+ BEGIN(INITIAL);
+ yylval.str = litbufdup();
+ return SCONST;
+ }
<xq>{xqdouble} {
addlitchar('\'');
}
***************
*** 413,418 ****
--- 441,447 ----
BEGIN(xdolq);
startlit();
}
+ {dolqfailed} { yyerror("invalid token"); }
<xdolq>{dolqdelim} {
if (strcmp(yytext, dolqstart) == 0)
{
***************
*** 435,440 ****
--- 464,472 ----
<xdolq>{dolqinside} {
addlit(yytext, yyleng);
}
+ <xdolq>{dolqfailed} {
+ addlit(yytext, yyleng);
+ }
<xdolq>. {
/* This is only needed for $ inside the quoted text */
addlitchar(yytext[0]);
***************
*** 576,582 ****
yylval.str = pstrdup(yytext);
return FCONST;
}
!

{identifier} {
const ScanKeyword *keyword;
--- 608,623 ----
yylval.str = pstrdup(yytext);
return FCONST;
}
! {realfail1} {
! yyless(yyleng-1);
! yylval.str = pstrdup(yytext);
! return FCONST;
! }
! {realfail2} {
! yyless(yyleng-2);
! yylval.str = pstrdup(yytext);
! return FCONST;
! }

{identifier} {
const ScanKeyword *keyword;

---------------------------(end of broadcast)---------------------------
TIP 6: Have you searched our list archives?

http://archives.postgresql.org

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #2 (permalink)  
Old 04-11-2008, 04:02 AM
Bruce Momjian
 
Posts: n/a
Default Re: Speeding up the Postgres lexer


This seems fine. I don't think the lexer changes enough for us to have
issues with new cases. I think adding some comments to explain why we
are doing it is enough, and perhaps a test case that can be reproduced
later for testing.

---------------------------------------------------------------------------

Tom Lane wrote:
> I was doing some profiles recently that showed that on simple statements
> (like INSERT with a few column values) the basic parser (flex/bison) was
> taking up a noticeable percentage of the total CPU time. We already
> have done some things to speed up the lexer, like use -CF option for
> large-but-fast tables. I read in the flex manual about how you can make
> it faster yet if you can avoid the need for backup states. It's not
> terribly well documented, but I eventually worked out how to do it, and
> find that it seems to make the lexer about a third faster. Basically
> what you have to do is provide alternate rules that allow the lexer to
> match *some* rule on any prefix of a complete token. For instance the
> rule for {real} implies that given
> 123.456E+
> if the next character is not a digit, then the {real} rule fails, and
> the lexer has to backtrack to
> 123.456
> which it can recognize as a {decimal}. You can add a rule that does
> match in this situation, throws back the two unwanted characters with
> yyless(), and returns the number as the {decimal} rule would do.
> The next call will carry on lexing the "E" as an identifier, etc.
> (Note: this is going to end up being a syntax error because the SQL
> grammar never allows a number to immediately precede an identifier;
> but that's not the lexer's concern, and so I don't think that we should
> make the extra rule throw an error immediately.)
>
> What I'm wondering is whether this is really worth doing or not.
> There are currently just two parts of the lexer rules that are affected
> --- the {real} rule illustrated above, and the rules that allow quoted
> strings to be split across lines as the SQL spec requires. But the
> patches are still pretty ugly, and what's really annoying is that there
> doesn't seem to be any way to get flex to complain if someone later
> makes a change that breaks the no-backup-cases property again.
>
> A prototype patch (completely undocumented :-() is attached. Any
> thoughts about whether it's worth pressing forward with?
>
> regards, tom lane
>
> *** src/backend/parser/scan.l.orig Fri Mar 11 21:15:20 2005
> --- src/backend/parser/scan.l Mon May 23 01:05:54 2005
> ***************
> *** 148,163 ****
> * validate the contents.
> */
> xbstart [bB]{quote}
> ! xbstop {quote}
> xbinside [^']*
> xbcat {quote}{whitespace_with_newline}{quote}
>
> /* Hexadecimal number
> */
> xhstart [xX]{quote}
> ! xhstop {quote}
> xhinside [^']*
> xhcat {quote}{whitespace_with_newline}{quote}
>
> /* National character
> */
> --- 148,165 ----
> * validate the contents.
> */
> xbstart [bB]{quote}
> ! xbstop {quote}{whitespace}*
> xbinside [^']*
> xbcat {quote}{whitespace_with_newline}{quote}
> + xbconfused {quote}{whitespace}*"-"
>
> /* Hexadecimal number
> */
> xhstart [xX]{quote}
> ! xhstop {quote}{whitespace}*
> xhinside [^']*
> xhcat {quote}{whitespace_with_newline}{quote}
> + xhconfused {quote}{whitespace}*"-"
>
> /* National character
> */
> ***************
> *** 169,180 ****
> */
> quote '
> xqstart {quote}
> ! xqstop {quote}
> xqdouble {quote}{quote}
> xqinside [^\\']+
> xqescape [\\][^0-7]
> xqoctesc [\\][0-7]{1,3}
> xqcat {quote}{whitespace_with_newline}{quote}
>
> /* $foo$ style quotes ("dollar quoting")
> * The quoted string starts with $foo$ where "foo" is an optional string
> --- 171,183 ----
> */
> quote '
> xqstart {quote}
> ! xqstop {quote}{whitespace}*
> xqdouble {quote}{quote}
> xqinside [^\\']+
> xqescape [\\][^0-7]
> xqoctesc [\\][0-7]{1,3}
> xqcat {quote}{whitespace_with_newline}{quote}
> + xqconfused {quote}{whitespace}*"-"
>
> /* $foo$ style quotes ("dollar quoting")
> * The quoted string starts with $foo$ where "foo" is an optional string
> ***************
> *** 185,190 ****
> --- 188,194 ----
> dolq_start [A-Za-z\200-\377_]
> dolq_cont [A-Za-z\200-\377_0-9]
> dolqdelim \$({dolq_start}{dolq_cont}*)?\$
> + dolqfailed \${dolq_start}{dolq_cont}*
> dolqinside [^$]+
>
> /* Double quote
> ***************
> *** 247,253 ****
>
> integer {digit}+
> decimal (({digit}*\.{digit}+)|({digit}+\.{digit}*))
> ! real ((({digit}*\.{digit}+)|({digit}+\.{digit}*)|({digi t}+))([Ee][-+]?{digit}+))
>
> param \${integer}
>
> --- 251,259 ----
>
> integer {digit}+
> decimal (({digit}*\.{digit}+)|({digit}+\.{digit}*))
> ! real ({integer}|{decimal})[Ee][-+]?{digit}+
> ! realfail1 ({integer}|{decimal})[Ee]
> ! realfail2 ({integer}|{decimal})[Ee][-+]
>
> param \${integer}
>
> ***************
> *** 310,315 ****
> --- 316,325 ----
> /* ignore */
> }
>
> + <xc>\*+ {
> + /* ignore */
> + }
> +
> <xc><<EOF>> { yyerror("unterminated /* comment"); }
>
> {xbstart} {
> ***************
> *** 329,334 ****
> --- 339,350 ----
> yylval.str = litbufdup();
> return BCONST;
> }
> + <xb>{xbconfused} {
> + yyless(yyleng-1);
> + BEGIN(INITIAL);
> + yylval.str = litbufdup();
> + return BCONST;
> + }
> <xh>{xhinside} |
> <xb>{xbinside} {
> addlit(yytext, yyleng);
> ***************
> *** 356,361 ****
> --- 372,383 ----
> yylval.str = litbufdup();
> return XCONST;
> }
> + <xh>{xhconfused} {
> + yyless(yyleng-1);
> + BEGIN(INITIAL);
> + yylval.str = litbufdup();
> + return XCONST;
> + }
> <xh><<EOF>> { yyerror("unterminated hexadecimal string literal"); }
>
> {xnstart} {
> ***************
> *** 385,390 ****
> --- 407,418 ----
> yylval.str = litbufdup();
> return SCONST;
> }
> + <xq>{xqconfused} {
> + yyless(yyleng-1);
> + BEGIN(INITIAL);
> + yylval.str = litbufdup();
> + return SCONST;
> + }
> <xq>{xqdouble} {
> addlitchar('\'');
> }
> ***************
> *** 413,418 ****
> --- 441,447 ----
> BEGIN(xdolq);
> startlit();
> }
> + {dolqfailed} { yyerror("invalid token"); }
> <xdolq>{dolqdelim} {
> if (strcmp(yytext, dolqstart) == 0)
> {
> ***************
> *** 435,440 ****
> --- 464,472 ----
> <xdolq>{dolqinside} {
> addlit(yytext, yyleng);
> }
> + <xdolq>{dolqfailed} {
> + addlit(yytext, yyleng);
> + }
> <xdolq>. {
> /* This is only needed for $ inside the quoted text */
> addlitchar(yytext[0]);
> ***************
> *** 576,582 ****
> yylval.str = pstrdup(yytext);
> return FCONST;
> }
> !
>
> {identifier} {
> const ScanKeyword *keyword;
> --- 608,623 ----
> yylval.str = pstrdup(yytext);
> return FCONST;
> }
> ! {realfail1} {
> ! yyless(yyleng-1);
> ! yylval.str = pstrdup(yytext);
> ! return FCONST;
> ! }
> ! {realfail2} {
> ! yyless(yyleng-2);
> ! yylval.str = pstrdup(yytext);
> ! return FCONST;
> ! }
>
> {identifier} {
> const ScanKeyword *keyword;
>
> ---------------------------(end of broadcast)---------------------------
> TIP 6: Have you searched our list archives?
>
> http://archives.postgresql.org
>


--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073

---------------------------(end of broadcast)---------------------------
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #3 (permalink)  
Old 04-11-2008, 04:02 AM
Simon Riggs
 
Posts: n/a
Default Re: Speeding up the Postgres lexer

On Mon, 2005-05-23 at 12:31 -0400, Tom Lane wrote:
> doesn't seem to be any way to get flex to complain if someone later
> makes a change that breaks the no-backup-cases property again.


After some digging, there is a -b option will generate a file called
lex.backup if any backup-states exist. The file is said to contain
information that would help you remove backup-states.

It seems straightforward to test for the existence of that file in the
build process? Or perhaps add a special test state --enable-flextest to
perform the test during the build.

Best Regards, Simon Riggs


---------------------------(end of broadcast)---------------------------
TIP 9: the planner will ignore your desire to choose an index scan if your
joining column's datatypes do not match

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #4 (permalink)  
Old 04-11-2008, 04:02 AM
Andrew Dunstan
 
Posts: n/a
Default Re: Speeding up the Postgres lexer



Tom Lane wrote:

>[snip - flex is slowed down by backtracking - how to fix ]
>
>What I'm wondering is whether this is really worth doing or not.
>There are currently just two parts of the lexer rules that are affected
>--- the {real} rule illustrated above, and the rules that allow quoted
>strings to be split across lines as the SQL spec requires. But the
>patches are still pretty ugly, and what's really annoying is that there
>doesn't seem to be any way to get flex to complain if someone later
>makes a change that breaks the no-backup-cases property again.
>
>
>



I would be more concerned if there were not reasonable alternatives to
many bulk parsed inserts (COPY, prepared statement).

But I do think it's worth it, even so ... not all client interfaces
support prepared statements (notoriously PHP, although I understand KL
has sent patches to fix that) and not all inserts are suitable for COPY.

cheers

andrew

---------------------------(end of broadcast)---------------------------
TIP 7: don't forget to increase your free space map settings

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #5 (permalink)  
Old 04-11-2008, 04:03 AM
Christopher Kings-Lynne
 
Posts: n/a
Default Re: Speeding up the Postgres lexer

> What I'm wondering is whether this is really worth doing or not.
> There are currently just two parts of the lexer rules that are affected
> --- the {real} rule illustrated above, and the rules that allow quoted
> strings to be split across lines as the SQL spec requires. But the
> patches are still pretty ugly, and what's really annoying is that there
> doesn't seem to be any way to get flex to complain if someone later
> makes a change that breaks the no-backup-cases property again.


I was just thinking that if there's not a switch, it's prone to error again.

However, the lexer isn't touched anywhere near as much as the grammar is
right? So just put a large comment/warning/reminder at the top to
test for non-backup states.

I'm definitely in favour of a 1/3 speedup of the lexer.

Chris

---------------------------(end of broadcast)---------------------------
TIP 8: explain analyze is your friend

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #6 (permalink)  
Old 04-11-2008, 04:03 AM
Christopher Kings-Lynne
 
Posts: n/a
Default Re: Speeding up the Postgres lexer

> But I do think it's worth it, even so ... not all client interfaces
> support prepared statements (notoriously PHP, although I understand KL
> has sent patches to fix that) and not all inserts are suitable for COPY.


There is now pg_prepare/pg_execute/pg_query_params in PHP, however you
could always have just used straight SQL PREPARE and EXECUTE commands.

Chris

---------------------------(end of broadcast)---------------------------
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to majordomo@postgresql.org)

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #7 (permalink)  
Old 04-11-2008, 04:03 AM
Tom Lane
 
Posts: n/a
Default Re: Speeding up the Postgres lexer

Simon Riggs <simon@2ndquadrant.com> writes:
> On Mon, 2005-05-23 at 12:31 -0400, Tom Lane wrote:
>> doesn't seem to be any way to get flex to complain if someone later
>> makes a change that breaks the no-backup-cases property again.


> After some digging, there is a -b option will generate a file called
> lex.backup if any backup-states exist. The file is said to contain
> information that would help you remove backup-states.


Right, reading that file is how I learned to fix it ...

> It seems straightforward to test for the existence of that file in the
> build process? Or perhaps add a special test state --enable-flextest to
> perform the test during the build.


Well, the problem is that the "success" state isn't that the file
doesn't exist or is empty or anything so easy as that. What you
want is for it to say

No backing up.

and it seems a tad too ugly to code such a test into the makefiles.
I'd not want to bet that the success message is spelled the same way
in every flex version, or is locale-independent.

Based on the comments so far in this thread, I'll go ahead and commit
the patch, with some comments attached of course --- in particular a big
head comment to run flex with -b and see that lex.backup says something
to this effect.

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #8 (permalink)  
Old 04-11-2008, 04:03 AM
Christopher Kings-Lynne
 
Posts: n/a
Default Re: Speeding up the Postgres lexer

> Based on the comments so far in this thread, I'll go ahead and commit
> the patch, with some comments attached of course --- in particular a big
> head comment to run flex with -b and see that lex.backup says something
> to this effect.


Add it to the release check-list.

Chris

---------------------------(end of broadcast)---------------------------
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #9 (permalink)  
Old 04-11-2008, 04:03 AM
Tom Lane
 
Posts: n/a
Default Re: Speeding up the Postgres lexer

Christopher Kings-Lynne <chriskl@familyhealth.com.au> writes:
> However, the lexer isn't touched anywhere near as much as the grammar is
> right?


Yeah --- if you look at the CVS history, changes that affect the flex
rules (and not just the text of the C-code actions) are really rare
these days. If there were a lot of churn I'd think this proposal was
completely impractical, but there's not much anymore.

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 5: Have you checked our extensive FAQ?

http://www.postgresql.org/docs/faq

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #10 (permalink)  
Old 04-11-2008, 04:03 AM
Simon Riggs
 
Posts: n/a
Default Re: Speeding up the Postgres lexer

On Mon, 2005-05-23 at 22:43 -0400, Tom Lane wrote:
> Based on the comments so far in this thread, I'll go ahead and commit
> the patch, with some comments attached of course --- in particular a big
> head comment to run flex with -b and see that lex.backup says something
> to this effect.


flex counts the number of times it needs to back up in a variable called
num_backing_up. It tests this to see whether to generate code
differently. ISTM you could have a -x switch to make it generate an
error if num_backing_up > 0.

Would we use the -x switch if we had it? If so I'll submit a patch here
first, then on to GNU. It'll take a while before it comes back round to
us though.

Best Regards, Simon Riggs


---------------------------(end of broadcast)---------------------------
TIP 5: Have you checked our extensive FAQ?

http://www.postgresql.org/docs/faq

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
Reply


Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are On
Forum Jump


All times are GMT. The time now is 03:58 PM.


Powered by vBulletin® Version 3.6.5
Copyright ©2000 - 2008, Jelsoft Enterprises Ltd.
SEO by vBSEO 3.2.0
www.UnixAdminTalk.com