Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>
>> do { (t)++; (tlen)--} while ((*(t) & 0xC0) == 0x80 && tlen > 0)
>
> The while *must* test those two conditions in the other order.
> (Don't laugh --- we've had reproducible bugs before in which the backend
> dumped core because of running off the end of memory due to this type
> of mistake.)
>
>> In fact, I'm wondering if that might make the other UTF8 stuff redundant
>> - the whole point of what we're doing is to avoid expensive calls to
>> NextChar;
>
> +1 I think. This test will be approximately the same expense as what
> the outer loop would otherwise be (tlen > 0 and *t != firstpat), and
> doing it this way removes an entire layer of intellectual complexity.
> Even though the code is hardly different, we are no longer dealing in
> misaligned pointers anywhere in the match algorithm.

OK, here is a patch that I think incorporates all the ideas discussed
(including part of Mark Mielke's suggestion about optimising %_). There
is now no special treatment of UTF8 other than its use of a faster
NextChar macro.

cheers

andrew

---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
choose an index scan if your joining column's datatypes do not match
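The ordering point Tom raises can be illustrated with a small sketch. This is not the macro from the patch: `skip_utf8_char` is a hypothetical helper written as a function for readability, and it assumes the caller guarantees at least one byte remains on entry. The remaining length is tested first, so the pointer is never dereferenced once it has stepped past the end of the buffer.

```c
#include <stddef.h>

/*
 * Advance past one UTF-8 character: step one byte, then skip any
 * continuation bytes (bit pattern 10xxxxxx).
 *
 * Hypothetical helper for illustration only; the caller must
 * guarantee *len > 0 on entry.
 */
static void
skip_utf8_char(const unsigned char **t, size_t *len)
{
    do
    {
        (*t)++;
        (*len)--;
        /* Length check comes first: short-circuit evaluation means
         * **t is only read while bytes remain. */
    } while (*len > 0 && (**t & 0xC0) == 0x80);
}
```

With the two conditions in the opposite order, the final iteration would read one byte beyond the buffer before noticing that the length had reached zero.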
| |||
Andrew Dunstan <andrew@dunslane.net> wrote:

> OK, here is a patch that I think incorporates all the ideas discussed
> (including part of Mark Mielke's suggestion about optimising %_). There
> is now no special treatment of UTF8 other than its use of a faster
> NextChar macro.

This is a benchmark result of 1000 loops of
  SELECT count(*) INTO cnt FROM item WHERE i_title LIKE '%BABABABABARIBA%'
on the table with 10000 rows.

         | SQL_ASCII | LATIN1 | UTF8  | EUC_JP
---------+-----------+--------+-------+---------
 HEAD    |      8017 |   8029 | 16928 |  18213
 Patched |      7899 |   7887 |  9985 |  10370 [ms]

It improved the performance not only for UTF8, but also for other
multi-byte encodings and a bit for single-byte encodings.

Thanks for the good work
---
ITAGAKI Takahiro
NTT Open Source Software Center
| |||
ITAGAKI Takahiro wrote:
> Andrew Dunstan <andrew@dunslane.net> wrote:
>
>> OK, here is a patch that I think incorporates all the ideas discussed
>> (including part of Mark Mielke's suggestion about optimising %_). There
>> is now no special treatment of UTF8 other than its use of a faster
>> NextChar macro.
>
> This is a benchmark result of 1000 loops of
>   SELECT count(*) INTO cnt FROM item WHERE i_title LIKE '%BABABABABARIBA%'
> on the table with 10000 rows.
>
>          | SQL_ASCII | LATIN1 | UTF8  | EUC_JP
> ---------+-----------+--------+-------+---------
>  HEAD    |      8017 |   8029 | 16928 |  18213
>  Patched |      7899 |   7887 |  9985 |  10370 [ms]
>
> It improved the performance not only for UTF8, but also for other
> multi-byte encodings and a bit for single-byte encodings.

Interesting. I infer from these results that the biggest bang here comes
from abandoning CHAREQ and doing all comparisons byte-wise.

cheers

andrew
| |||
Andrew Dunstan <andrew@dunslane.net> writes:
> OK, here is a patch that I think incorporates all the ideas discussed
> (including part of Mark Mielke's suggestion about optimising %_). There
> is now no special treatment of UTF8 other than its use of a faster
> NextChar macro.

Looks mostly pretty good. I would suggest replacing tests "tlen == 0"
and "plen == 0" with "<= 0", just so the code doesn't go completely
insane if presented with invalidly-encoded data that causes it to step
beyond the end of data. Also, this comment is not really good enough:

> !         /*
> !          * It is safe to use NextByte instead of NextChar here, even for
> !          * multi-byte character sets, because we are not following
> !          * immediately after a wildcard character.
> !          */
> !         NextByte(t, tlen);
> !         NextByte(p, plen);
>       }

I'd suggest adding something like "If we are in the middle of a
multibyte character, we must already have matched at least one byte of
the character from both text and pattern; so we cannot get out-of-sync
on character boundaries. And we know that no backend-legal encoding
allows ASCII characters such as '%' to appear as non-first bytes of
characters, so we won't mistakenly detect a new wildcard."

regards, tom lane
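To make the reasoning in this message concrete, here is a much-simplified sketch of a byte-wise LIKE matcher. It is not the committed PostgreSQL code: `like_match`, `NEXT_BYTE` and `NEXT_CHAR` are hypothetical names, escape characters are not handled, and the text is assumed to be UTF-8. Literal bytes are compared and advanced one byte at a time, whole characters are stepped over only for `_` and when retrying after `%`, and the end-of-data tests use `<= 0` rather than `== 0`.

```c
#include <stddef.h>

/* Advance one byte (hypothetical stand-in for NextByte). */
#define NEXT_BYTE(p, n) ((p)++, (n)--)

/* Advance one UTF-8 character; the length is tested before the
 * pointer is dereferenced (hypothetical stand-in for NextChar). */
#define NEXT_CHAR(p, n) \
    do { (p)++; (n)--; } while ((n) > 0 && (*(p) & 0xC0) == 0x80)

/* Return 1 if text t matches pattern p, else 0.  No escape handling. */
static int
like_match(const unsigned char *t, int tlen,
           const unsigned char *p, int plen)
{
    while (tlen > 0 && plen > 0)
    {
        if (*p == '%')
        {
            /* Collapse a run of % wildcards. */
            while (plen > 0 && *p == '%')
                NEXT_BYTE(p, plen);
            if (plen <= 0)
                return 1;       /* trailing % matches the rest */

            /* Retry at each character boundary of the text. */
            while (tlen > 0)
            {
                if (like_match(t, tlen, p, plen))
                    return 1;
                NEXT_CHAR(t, tlen);
            }
            return 0;
        }
        else if (*p == '_')
        {
            /* _ consumes one whole character of text. */
            NEXT_CHAR(t, tlen);
            NEXT_BYTE(p, plen);
        }
        else
        {
            /* Plain byte comparison.  Stepping a single byte is safe:
             * if we are inside a multibyte character, its earlier
             * bytes already matched in both text and pattern. */
            if (*t != *p)
                return 0;
            NEXT_BYTE(t, tlen);
            NEXT_BYTE(p, plen);
        }
    }

    if (tlen > 0)
        return 0;               /* text left over, pattern exhausted */

    /* Only trailing % may remain in the pattern. */
    while (plen > 0 && *p == '%')
        NEXT_BYTE(p, plen);
    return plen <= 0;
}
```

The lengths are signed `int` values in this sketch so that a `<= 0` test still terminates the loops if invalid input ever drives a length negative.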
| |||
Andrew Dunstan <andrew@dunslane.net> writes:
> ITAGAKI Takahiro wrote:
>>          | SQL_ASCII | LATIN1 | UTF8  | EUC_JP
>> ---------+-----------+--------+-------+---------
>>  HEAD    |      8017 |   8029 | 16928 |  18213
>>  Patched |      7899 |   7887 |  9985 |  10370 [ms]
>>
>> It improved the performance not only for UTF8, but also for other
>> multi-byte encodings and a bit for single-byte encodings.

> Interesting. I infer from these results that the biggest bang here comes
> from abandoning CHAREQ and doing all comparisons byte-wise.

It looks like CHAREQ and NextChar are both pretty expensive, no doubt
due to having to drill down through the MB encoding vectoring mechanism
to find out what to do. A technique we might want to apply in future
patches is to have an API whereby we can get a direct function pointer
to the appropriate mblen or other encoding-dependent function, and then
call directly to the right place in the inner loops instead of having to
go through the intermediate vectoring function every time.

regards, tom lane
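The function-pointer idea floated here might look roughly like the following. Everything in this sketch is hypothetical (the `demo_encoding` enum, `get_mblen_fn`, and both `mblen_*` functions) and none of it is an existing PostgreSQL API. The encoding-specific length function is looked up once before the loop and then called directly on each iteration, rather than going through a dispatcher every time.

```c
#include <stddef.h>

/* Signature of an encoding-specific "length of this character" function. */
typedef int (*mblen_fn) (const unsigned char *s);

/* Hypothetical encodings for the sketch. */
enum demo_encoding { DEMO_SINGLE_BYTE, DEMO_UTF8 };

static int
mblen_single(const unsigned char *s)
{
    (void) s;
    return 1;
}

static int
mblen_utf8(const unsigned char *s)
{
    if ((*s & 0x80) == 0)
        return 1;
    if ((*s & 0xE0) == 0xC0)
        return 2;
    if ((*s & 0xF0) == 0xE0)
        return 3;
    if ((*s & 0xF8) == 0xF0)
        return 4;
    return 1;                   /* invalid lead byte: step one byte */
}

/* Hypothetical lookup, done once outside any inner loop. */
static mblen_fn
get_mblen_fn(enum demo_encoding enc)
{
    return enc == DEMO_UTF8 ? mblen_utf8 : mblen_single;
}

/* Count characters, calling the resolved function directly. */
static int
count_chars(const unsigned char *s, int len, enum demo_encoding enc)
{
    mblen_fn    mblen = get_mblen_fn(enc);  /* resolved once */
    int         n = 0;

    while (len > 0)
    {
        int         l = mblen(s);

        if (l > len)
            l = len;            /* truncated final character */
        s += l;
        len -= l;
        n++;
    }
    return n;
}
```

Whether this saves anything measurable depends on how expensive the dispatch step is; the sketch shows only the shape of the idea.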
| |||
Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>
>> OK, here is a patch that I think incorporates all the ideas discussed
>> (including part of Mark Mielke's suggestion about optimising %_). There
>> is now no special treatment of UTF8 other than its use of a faster
>> NextChar macro.
>
> Looks mostly pretty good. I would suggest replacing tests "tlen == 0"
> and "plen == 0" with "<= 0", just so the code doesn't go completely
> insane if presented with invalidly-encoded data that causes it to step
> beyond the end of data. Also, this comment is not really good enough:
>
>> !         /*
>> !          * It is safe to use NextByte instead of NextChar here, even for
>> !          * multi-byte character sets, because we are not following
>> !          * immediately after a wildcard character.
>> !          */
>> !         NextByte(t, tlen);
>> !         NextByte(p, plen);
>>       }
>
> I'd suggest adding something like "If we are in the middle of a
> multibyte character, we must already have matched at least one byte of
> the character from both text and pattern; so we cannot get out-of-sync
> on character boundaries. And we know that no backend-legal encoding
> allows ASCII characters such as '%' to appear as non-first bytes of
> characters, so we won't mistakenly detect a new wildcard."

Done, and committed.

cheers

andrew
| ||||
Andrew Dunstan wrote:
> > I'd suggest adding something like "If we are in the middle of a
> > multibyte character, we must already have matched at least one byte of
> > the character from both text and pattern; so we cannot get out-of-sync
> > on character boundaries. And we know that no backend-legal encoding
> > allows ASCII characters such as '%' to appear as non-first bytes of
> > characters, so we won't mistakenly detect a new wildcard."
>
> Done, and committed.

Woohoo, that's a big one off the plate!

--
  Bruce Momjian  <bruce@momjian.us>          http://momjian.us
  EnterpriseDB                               http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +
|
|