This is a discussion on Re: UTF8MatchText within the Pgsql Patches forums, part of the PostgreSQL category; --> I do not understand this patch. You have defined two functions, UTF8MatchText() and UTF8MatchTextIC(), and the difference between them ...
| |||||||
| Register | FAQ | Members List | Calendar | Search | Today's Posts | Mark Forums Read |
| ||||
| I do not understand this patch. You have defined two functions, UTF8MatchText() and UTF8MatchTextIC(), and the difference between them is that one calls CHAREQ and the other calls ICHAREQ, but just above those two functions you define the macros identically: #define CHAREQ(p1, p2) wchareq(p1, p2) #define ICHAREQ(p1, p2) wchareq(p1, p2) Why are there two functions? Also, can't you use one function and just pass a boolean to indicate whether case it be ignored? --------------------------------------------------------------------------- ITAGAKI Takahiro wrote: > "Andrew - Supernews" <andrew@supernews.net> wrote: > > > ITAGAKI> I think all "safe ASCII-supersets" encodings are comparable > > ITAGAKI> by bytes, not only UTF-8. > > > > This is false, particularly for EUC. > > Umm, I see. I updated the optimization to be used only for UTF8 case. > I also added some inlining hints that are useful on my machine (Pentium 4). > > > x1000 of LIKE '%foo% on 10000 rows tables [ms] > encoding | HEAD | P1 | P2 | P3 > -----------+-------+-------+-------+------- > SQL_ASCII | 7094 | 7120 | 7063 | 7031 > LATIN1 | 7083 | 7130 | 7057 | 7031 > UTF8 | 17974 | 10859 | 10839 | 9682 > EUC_JP | 17032 | 17557 | 17599 | 15240 > > - P1: UTF8MatchText() > - P2: P1 + __inline__ GenericMatchText() > - P3: P2 + __inline__ wchareq() > (The attached patch is P3.) > > Regards, > --- > ITAGAKI Takahiro > NTT Open Source Software Center > [ Attachment, skipping... ] > > ---------------------------(end of broadcast)--------------------------- > TIP 7: You can help support the PostgreSQL project by donating at > > http://www.postgresql.org/about/donate -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. + ---------------------------(end of broadcast)--------------------------- TIP 5: don't forget to increase your free space map settings |
| |||
| Bruce Momjian wrote: > > I do not understand this patch. You have defined two functions, > UTF8MatchText() and UTF8MatchTextIC(), and the difference between them > is that one calls CHAREQ and the other calls ICHAREQ, but just above > those two functions you define the macros identically: > > #define CHAREQ(p1, p2) wchareq(p1, p2) > #define ICHAREQ(p1, p2) wchareq(p1, p2) > > Why are there two functions? Also, can't you use one function and just > pass a boolean to indicate whether case it be ignored? Sorry, typo: Why are there two functions? Also, can't you use one function and just pass a boolean to indicate whether case should be ignored? ------ > > --------------------------------------------------------------------------- > > ITAGAKI Takahiro wrote: > > "Andrew - Supernews" <andrew@supernews.net> wrote: > > > > > ITAGAKI> I think all "safe ASCII-supersets" encodings are comparable > > > ITAGAKI> by bytes, not only UTF-8. > > > > > > This is false, particularly for EUC. > > > > Umm, I see. I updated the optimization to be used only for UTF8 case. > > I also added some inlining hints that are useful on my machine (Pentium 4). > > > > > > x1000 of LIKE '%foo% on 10000 rows tables [ms] > > encoding | HEAD | P1 | P2 | P3 > > -----------+-------+-------+-------+------- > > SQL_ASCII | 7094 | 7120 | 7063 | 7031 > > LATIN1 | 7083 | 7130 | 7057 | 7031 > > UTF8 | 17974 | 10859 | 10839 | 9682 > > EUC_JP | 17032 | 17557 | 17599 | 15240 > > > > - P1: UTF8MatchText() > > - P2: P1 + __inline__ GenericMatchText() > > - P3: P2 + __inline__ wchareq() > > (The attached patch is P3.) > > > > Regards, > > --- > > ITAGAKI Takahiro > > NTT Open Source Software Center > > > > [ Attachment, skipping... ] > > > > > ---------------------------(end of broadcast)--------------------------- > > TIP 7: You can help support the PostgreSQL project by donating at > > > > http://www.postgresql.org/about/donate > > -- > Bruce Momjian <bruce@momjian.us> http://momjian.us > EnterpriseDB http://www.enterprisedb.com > > + If your life is a hard drive, Christ can be your backup. + > > ---------------------------(end of broadcast)--------------------------- > TIP 5: don't forget to increase your free space map settings -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. + ---------------------------(end of broadcast)--------------------------- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match |
| |||
| Itagaki, I find this still fairly unclean. It certainly took me some time to get me head around what's going on. ISTM we should generate all these match functions from one body of code plus some #define magic. As I understand it, we have three possible encoding switches: Single Byte, UTF8 and other Multi Byte Charsets, and two possible case settings: case Sensitive and Case Insensitive. That would make for a total of six functions, but in the case of both UTF8 and other MBCS we don't need a special Case Insensitive function - instead we downcase both the string and the pattern and then use the Case Sensitive function. That leaves a total of four functions. What is not clear to me is why the UTF8 optimisation work, and why it doesn't apply to other MBCS. At the very least we need a comment on that. I also find the existing function naming convention somewhat annoying - having foo() and MB_foo() is less than clear. I'd rather have SB_foo() and MB_foo(). That's not your fault, of course. If you supply me with some explanation on the UTF8 optimisation issue, I'll prepare a revised patch along these lines. cheers andrew ITAGAKI Takahiro wrote: > Bruce Momjian <bruce@momjian.us> wrote: > > >>> I do not understand this patch. You have defined two functions, >>> UTF8MatchText() and UTF8MatchTextIC(), and the difference between them >>> is that one calls CHAREQ and the other calls ICHAREQ, but just above >>> those two functions you define the macros identically: >>> >> Why are there two functions? Also, can't you use one function and just >> pass a boolean to indicate whether case should be ignored? >> > > The same is true of MBMatchText() and MBMatchTextIC(). > Now, I'll split the patch into two changes. > > 1. DropMBMatchTextIC.patch > Drop MBMatchTextIC() and use MBMatchText() instead. > > 2. UTF8MatchText.patch > Add UTF8MatchText() as a specialized version of MBMatchText(). > > > As a future work, it might be good to research the performance of rewriting > "col ILIKE 'pattern'" to "lower(col) LIKE lower('pattern')" in planner so that > we can avoid to call lower() for constant pattern in the right-hand side and > can use functional indexes (lower(col)). I think we never need MBMatchTextIC() > in the future unless we move to wide-character server encoding > > Regards, > --- > ITAGAKI Takahiro > NTT Open Source Software Center > > > ------------------------------------------------------------------------ > > > ---------------------------(end of broadcast)--------------------------- > TIP 9: In versions below 8.0, the planner will ignore your desire to > choose an index scan if your joining column's datatypes do not > match > ---------------------------(end of broadcast)--------------------------- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to majordomo@postgresql.org so that your message can get through to the mailing list cleanly |
| |||
| I wrote: > > > ISTM we should generate all these match functions from one body of > code plus some #define magic. > > As I understand it, we have three possible encoding switches: Single > Byte, UTF8 and other Multi Byte Charsets, and two possible case > settings: case Sensitive and Case Insensitive. That would make for a > total of six functions, but in the case of both UTF8 and other MBCS we > don't need a special Case Insensitive function - instead we downcase > both the string and the pattern and then use the Case Sensitive > function. That leaves a total of four functions. > > What is not clear to me is why the UTF8 optimisation work, and why it > doesn't apply to other MBCS. At the very least we need a comment on that. > > I also find the existing function naming convention somewhat annoying > - having foo() and MB_foo() is less than clear. I'd rather have > SB_foo() and MB_foo(). That's not your fault, of course. > > If you supply me with some explanation on the UTF8 optimisation issue, > I'll prepare a revised patch along these lines. > > Ok, I have studied some more and I think I understand what's going on. AIUI, we are switching from some expensive char-wise comparisons to cheap byte-wise comparisons in the UTF8 case because we know that in UTF8 the magic characters ('_', '%' and '\') aren't a part of any other character sequence. Is that putting it too mildly? Do we need stronger conditions than that? If it's correct, are there other MBCS for which this is true? cheers andrew ---------------------------(end of broadcast)--------------------------- TIP 6: explain analyze is your friend |
| |||
| Andrew Dunstan <andrew@dunslane.net> writes: > Ok, I have studied some more and I think I understand what's going on. > AIUI, we are switching from some expensive char-wise comparisons to > cheap byte-wise comparisons in the UTF8 case because we know that in > UTF8 the magic characters ('_', '%' and '\') aren't a part of any other > character sequence. Is that putting it too mildly? Do we need stronger > conditions than that? If it's correct, are there other MBCS for which > this is true? I don't think this is a correct analysis. If it were correct then we could use the optimization for all backend charsets because none of them allow MB characters to contain non-high-bit-set bytes. But it was stated somewhere upthread that that doesn't actually work. Clearly it's a necessary property that we not falsely detect the magic pattern characters, but that's not sufficient. I think the real issue is that UTF8 has disjoint representations for first-bytes and not-first-bytes of MB characters, and thus it is impossible to make a false match in which an MB pattern character is matched to the end of one data character plus the start of another. In character sets without that property, we have to use the slow way to ensure we don't make out-of-sync matches. regards, tom lane ---------------------------(end of broadcast)--------------------------- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to majordomo@postgresql.org so that your message can get through to the mailing list cleanly |
| |||
| Wait a second ... I just thought of a counterexample that destroys the entire concept. Consider the pattern 'A__B', which clearly is supposed to match strings of four *characters*. With the proposed patch in place, it would match strings of four *bytes*. Which is not the correct behavior. regards, tom lane ---------------------------(end of broadcast)--------------------------- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to majordomo@postgresql.org so that your message can get through to the mailing list cleanly |
| |||
| Tom Lane wrote: > UTF8 has disjoint representations for > first-bytes and not-first-bytes of MB characters, and thus it is > impossible to make a false match in which an MB pattern character is > matched to the end of one data character plus the start of another. > In character sets without that property, we have to use the slow way to > ensure we don't make out-of-sync matches. > > > Thanks. I will include this info in the comments. cheers andrew ---------------------------(end of broadcast)--------------------------- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate |
| |||
| Tom Lane wrote: > Wait a second ... I just thought of a counterexample that destroys the > entire concept. Consider the pattern 'A__B', which clearly is supposed > to match strings of four *characters*. With the proposed patch in > place, it would match strings of four *bytes*. Which is not the correct > behavior. > > From what I can see the code is quite careful about when it calls NextByte vs NextChar, and after _ it calls NextChar. So I'll test for this, but I think it's safe. cheers andrew ---------------------------(end of broadcast)--------------------------- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq |
| |||
| Andrew Dunstan <andrew@dunslane.net> writes: > Tom Lane wrote: >> Wait a second ... I just thought of a counterexample that destroys the >> entire concept. Consider the pattern 'A__B', which clearly is supposed >> to match strings of four *characters*. With the proposed patch in >> place, it would match strings of four *bytes*. Which is not the correct >> behavior. > From what I can see the code is quite careful about when it calls > NextByte vs NextChar, and after _ it calls NextChar. Except that the entire point of this patch is to dumb down NextChar to be the same as NextByte for UTF8 strings. regards, tom lane ---------------------------(end of broadcast)--------------------------- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to majordomo@postgresql.org so that your message can get through to the mailing list cleanly |
| ||||
| Tom Lane wrote: > Andrew Dunstan <andrew@dunslane.net> writes: > >> Tom Lane wrote: >> >>> Wait a second ... I just thought of a counterexample that destroys the >>> entire concept. Consider the pattern 'A__B', which clearly is supposed >>> to match strings of four *characters*. With the proposed patch in >>> place, it would match strings of four *bytes*. Which is not the correct >>> behavior. >>> > > >> From what I can see the code is quite careful about when it calls >> NextByte vs NextChar, and after _ it calls NextChar. >> > > Except that the entire point of this patch is to dumb down NextChar to > be the same as NextByte for UTF8 strings. > > > That's not what I see in (what I think is) the latest submission, which includes this snippet: + /* Set up for utf8 characters */ + #define CHAREQ(p1, p2) wchareq(p1, p2) + #define NextChar(p, plen) \ + do { int __l = pg_utf_mblen(p); (p) +=__l; (plen) -=__l; } while (0) + + /* + * UTF8MatchText -- specialized version of MBMatchText for UTF8 + */ + static int + UTF8MatchText(char *t, int tlen, char *p, int plen) Am I looking at the wrong thing? This is from around April 9th I think. cheers andrew ---------------------------(end of broadcast)--------------------------- TIP 6: explain analyze is your friend |