Unix Technical Forum

Re: UTF8MatchText

This is a discussion on Re: UTF8MatchText within the Pgsql Patches forums, part of the PostgreSQL category; --> I do not understand this patch. You have defined two functions, UTF8MatchText() and UTF8MatchTextIC(), and the difference between them ...


Go Back   Unix Technical Forum > Database Server Software > PostgreSQL > Pgsql Patches

Register FAQ Members List Calendar Search Today's Posts Mark Forums Read
  #1 (permalink)  
Old 04-18-2008, 09:53 AM
Bruce Momjian
 
Posts: n/a
Default Re: UTF8MatchText


I do not understand this patch. You have defined two functions,
UTF8MatchText() and UTF8MatchTextIC(), and the difference between them
is that one calls CHAREQ and the other calls ICHAREQ, but just above
those two functions you define the macros identically:

#define CHAREQ(p1, p2) wchareq(p1, p2)
#define ICHAREQ(p1, p2) wchareq(p1, p2)

Why are there two functions? Also, can't you use one function and just
pass a boolean to indicate whether case it be ignored?

---------------------------------------------------------------------------

ITAGAKI Takahiro wrote:
> "Andrew - Supernews" <andrew@supernews.net> wrote:
>
> > ITAGAKI> I think all "safe ASCII-supersets" encodings are comparable
> > ITAGAKI> by bytes, not only UTF-8.
> >
> > This is false, particularly for EUC.

>
> Umm, I see. I updated the optimization to be used only for UTF8 case.
> I also added some inlining hints that are useful on my machine (Pentium 4).
>
>
> x1000 of LIKE '%foo% on 10000 rows tables [ms]
> encoding | HEAD | P1 | P2 | P3
> -----------+-------+-------+-------+-------
> SQL_ASCII | 7094 | 7120 | 7063 | 7031
> LATIN1 | 7083 | 7130 | 7057 | 7031
> UTF8 | 17974 | 10859 | 10839 | 9682
> EUC_JP | 17032 | 17557 | 17599 | 15240
>
> - P1: UTF8MatchText()
> - P2: P1 + __inline__ GenericMatchText()
> - P3: P2 + __inline__ wchareq()
> (The attached patch is P3.)
>
> Regards,
> ---
> ITAGAKI Takahiro
> NTT Open Source Software Center
>


[ Attachment, skipping... ]

>
> ---------------------------(end of broadcast)---------------------------
> TIP 7: You can help support the PostgreSQL project by donating at
>
> http://www.postgresql.org/about/donate


--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

---------------------------(end of broadcast)---------------------------
TIP 5: don't forget to increase your free space map settings

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #2 (permalink)  
Old 04-18-2008, 09:53 AM
Bruce Momjian
 
Posts: n/a
Default Re: UTF8MatchText

Bruce Momjian wrote:
>
> I do not understand this patch. You have defined two functions,
> UTF8MatchText() and UTF8MatchTextIC(), and the difference between them
> is that one calls CHAREQ and the other calls ICHAREQ, but just above
> those two functions you define the macros identically:
>
> #define CHAREQ(p1, p2) wchareq(p1, p2)
> #define ICHAREQ(p1, p2) wchareq(p1, p2)
>
> Why are there two functions? Also, can't you use one function and just
> pass a boolean to indicate whether case it be ignored?


Sorry, typo:

Why are there two functions? Also, can't you use one function and just
pass a boolean to indicate whether case should be ignored?
------

>
> ---------------------------------------------------------------------------
>
> ITAGAKI Takahiro wrote:
> > "Andrew - Supernews" <andrew@supernews.net> wrote:
> >
> > > ITAGAKI> I think all "safe ASCII-supersets" encodings are comparable
> > > ITAGAKI> by bytes, not only UTF-8.
> > >
> > > This is false, particularly for EUC.

> >
> > Umm, I see. I updated the optimization to be used only for UTF8 case.
> > I also added some inlining hints that are useful on my machine (Pentium 4).
> >
> >
> > x1000 of LIKE '%foo% on 10000 rows tables [ms]
> > encoding | HEAD | P1 | P2 | P3
> > -----------+-------+-------+-------+-------
> > SQL_ASCII | 7094 | 7120 | 7063 | 7031
> > LATIN1 | 7083 | 7130 | 7057 | 7031
> > UTF8 | 17974 | 10859 | 10839 | 9682
> > EUC_JP | 17032 | 17557 | 17599 | 15240
> >
> > - P1: UTF8MatchText()
> > - P2: P1 + __inline__ GenericMatchText()
> > - P3: P2 + __inline__ wchareq()
> > (The attached patch is P3.)
> >
> > Regards,
> > ---
> > ITAGAKI Takahiro
> > NTT Open Source Software Center
> >

>
> [ Attachment, skipping... ]
>
> >
> > ---------------------------(end of broadcast)---------------------------
> > TIP 7: You can help support the PostgreSQL project by donating at
> >
> > http://www.postgresql.org/about/donate

>
> --
> Bruce Momjian <bruce@momjian.us> http://momjian.us
> EnterpriseDB http://www.enterprisedb.com
>
> + If your life is a hard drive, Christ can be your backup. +
>
> ---------------------------(end of broadcast)---------------------------
> TIP 5: don't forget to increase your free space map settings


--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
choose an index scan if your joining column's datatypes do not
match

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #3 (permalink)  
Old 04-18-2008, 10:05 AM
Andrew Dunstan
 
Posts: n/a
Default Re: UTF8MatchText


Itagaki,

I find this still fairly unclean. It certainly took me some time to get
me head around what's going on.

ISTM we should generate all these match functions from one body of code
plus some #define magic.

As I understand it, we have three possible encoding switches: Single
Byte, UTF8 and other Multi Byte Charsets, and two possible case
settings: case Sensitive and Case Insensitive. That would make for a
total of six functions, but in the case of both UTF8 and other MBCS we
don't need a special Case Insensitive function - instead we downcase
both the string and the pattern and then use the Case Sensitive
function. That leaves a total of four functions.

What is not clear to me is why the UTF8 optimisation work, and why it
doesn't apply to other MBCS. At the very least we need a comment on that.

I also find the existing function naming convention somewhat annoying -
having foo() and MB_foo() is less than clear. I'd rather have SB_foo()
and MB_foo(). That's not your fault, of course.

If you supply me with some explanation on the UTF8 optimisation issue,
I'll prepare a revised patch along these lines.

cheers

andrew



ITAGAKI Takahiro wrote:
> Bruce Momjian <bruce@momjian.us> wrote:
>
>
>>> I do not understand this patch. You have defined two functions,
>>> UTF8MatchText() and UTF8MatchTextIC(), and the difference between them
>>> is that one calls CHAREQ and the other calls ICHAREQ, but just above
>>> those two functions you define the macros identically:
>>>

>> Why are there two functions? Also, can't you use one function and just
>> pass a boolean to indicate whether case should be ignored?
>>

>
> The same is true of MBMatchText() and MBMatchTextIC().
> Now, I'll split the patch into two changes.
>
> 1. DropMBMatchTextIC.patch
> Drop MBMatchTextIC() and use MBMatchText() instead.
>
> 2. UTF8MatchText.patch
> Add UTF8MatchText() as a specialized version of MBMatchText().
>
>
> As a future work, it might be good to research the performance of rewriting
> "col ILIKE 'pattern'" to "lower(col) LIKE lower('pattern')" in planner so that
> we can avoid to call lower() for constant pattern in the right-hand side and
> can use functional indexes (lower(col)). I think we never need MBMatchTextIC()
> in the future unless we move to wide-character server encoding
>
> Regards,
> ---
> ITAGAKI Takahiro
> NTT Open Source Software Center
>
>
> ------------------------------------------------------------------------
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 9: In versions below 8.0, the planner will ignore your desire to
> choose an index scan if your joining column's datatypes do not
> match
>


---------------------------(end of broadcast)---------------------------
TIP 1: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #4 (permalink)  
Old 04-18-2008, 10:05 AM
Andrew Dunstan
 
Posts: n/a
Default Re: UTF8MatchText


I wrote:
>
>
> ISTM we should generate all these match functions from one body of
> code plus some #define magic.
>
> As I understand it, we have three possible encoding switches: Single
> Byte, UTF8 and other Multi Byte Charsets, and two possible case
> settings: case Sensitive and Case Insensitive. That would make for a
> total of six functions, but in the case of both UTF8 and other MBCS we
> don't need a special Case Insensitive function - instead we downcase
> both the string and the pattern and then use the Case Sensitive
> function. That leaves a total of four functions.
>
> What is not clear to me is why the UTF8 optimisation work, and why it
> doesn't apply to other MBCS. At the very least we need a comment on that.
>
> I also find the existing function naming convention somewhat annoying
> - having foo() and MB_foo() is less than clear. I'd rather have
> SB_foo() and MB_foo(). That's not your fault, of course.
>
> If you supply me with some explanation on the UTF8 optimisation issue,
> I'll prepare a revised patch along these lines.
>
>


Ok, I have studied some more and I think I understand what's going on.
AIUI, we are switching from some expensive char-wise comparisons to
cheap byte-wise comparisons in the UTF8 case because we know that in
UTF8 the magic characters ('_', '%' and '\') aren't a part of any other
character sequence. Is that putting it too mildly? Do we need stronger
conditions than that? If it's correct, are there other MBCS for which
this is true?

cheers

andrew



---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #5 (permalink)  
Old 04-18-2008, 10:05 AM
Tom Lane
 
Posts: n/a
Default Re: UTF8MatchText

Andrew Dunstan <andrew@dunslane.net> writes:
> Ok, I have studied some more and I think I understand what's going on.
> AIUI, we are switching from some expensive char-wise comparisons to
> cheap byte-wise comparisons in the UTF8 case because we know that in
> UTF8 the magic characters ('_', '%' and '\') aren't a part of any other
> character sequence. Is that putting it too mildly? Do we need stronger
> conditions than that? If it's correct, are there other MBCS for which
> this is true?


I don't think this is a correct analysis. If it were correct then we
could use the optimization for all backend charsets because none of them
allow MB characters to contain non-high-bit-set bytes. But it was
stated somewhere upthread that that doesn't actually work. Clearly
it's a necessary property that we not falsely detect the magic pattern
characters, but that's not sufficient.

I think the real issue is that UTF8 has disjoint representations for
first-bytes and not-first-bytes of MB characters, and thus it is
impossible to make a false match in which an MB pattern character is
matched to the end of one data character plus the start of another.
In character sets without that property, we have to use the slow way to
ensure we don't make out-of-sync matches.

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 1: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #6 (permalink)  
Old 04-18-2008, 10:05 AM
Tom Lane
 
Posts: n/a
Default Re: UTF8MatchText

Wait a second ... I just thought of a counterexample that destroys the
entire concept. Consider the pattern 'A__B', which clearly is supposed
to match strings of four *characters*. With the proposed patch in
place, it would match strings of four *bytes*. Which is not the correct
behavior.

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 1: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #7 (permalink)  
Old 04-18-2008, 10:05 AM
Andrew Dunstan
 
Posts: n/a
Default Re: UTF8MatchText



Tom Lane wrote:
> UTF8 has disjoint representations for
> first-bytes and not-first-bytes of MB characters, and thus it is
> impossible to make a false match in which an MB pattern character is
> matched to the end of one data character plus the start of another.
> In character sets without that property, we have to use the slow way to
> ensure we don't make out-of-sync matches.
>
>
>


Thanks. I will include this info in the comments.

cheers

andrew

---------------------------(end of broadcast)---------------------------
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #8 (permalink)  
Old 04-18-2008, 10:05 AM
Andrew Dunstan
 
Posts: n/a
Default Re: UTF8MatchText



Tom Lane wrote:
> Wait a second ... I just thought of a counterexample that destroys the
> entire concept. Consider the pattern 'A__B', which clearly is supposed
> to match strings of four *characters*. With the proposed patch in
> place, it would match strings of four *bytes*. Which is not the correct
> behavior.
>
>


From what I can see the code is quite careful about when it calls
NextByte vs NextChar, and after _ it calls NextChar.

So I'll test for this, but I think it's safe.

cheers

andrew



---------------------------(end of broadcast)---------------------------
TIP 3: Have you checked our extensive FAQ?

http://www.postgresql.org/docs/faq

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #9 (permalink)  
Old 04-18-2008, 10:05 AM
Tom Lane
 
Posts: n/a
Default Re: UTF8MatchText

Andrew Dunstan <andrew@dunslane.net> writes:
> Tom Lane wrote:
>> Wait a second ... I just thought of a counterexample that destroys the
>> entire concept. Consider the pattern 'A__B', which clearly is supposed
>> to match strings of four *characters*. With the proposed patch in
>> place, it would match strings of four *bytes*. Which is not the correct
>> behavior.


> From what I can see the code is quite careful about when it calls
> NextByte vs NextChar, and after _ it calls NextChar.


Except that the entire point of this patch is to dumb down NextChar to
be the same as NextByte for UTF8 strings.

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 1: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #10 (permalink)  
Old 04-18-2008, 10:06 AM
Andrew Dunstan
 
Posts: n/a
Default Re: UTF8MatchText



Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>
>> Tom Lane wrote:
>>
>>> Wait a second ... I just thought of a counterexample that destroys the
>>> entire concept. Consider the pattern 'A__B', which clearly is supposed
>>> to match strings of four *characters*. With the proposed patch in
>>> place, it would match strings of four *bytes*. Which is not the correct
>>> behavior.
>>>

>
>
>> From what I can see the code is quite careful about when it calls
>> NextByte vs NextChar, and after _ it calls NextChar.
>>

>
> Except that the entire point of this patch is to dumb down NextChar to
> be the same as NextByte for UTF8 strings.
>
>
>


That's not what I see in (what I think is) the latest submission, which
includes this snippet:

+ /* Set up for utf8 characters */
+ #define CHAREQ(p1, p2) wchareq(p1, p2)
+ #define NextChar(p, plen) \
+ do { int __l = pg_utf_mblen(p); (p) +=__l; (plen) -=__l; } while (0)
+
+ /*
+ * UTF8MatchText -- specialized version of MBMatchText for UTF8
+ */
+ static int
+ UTF8MatchText(char *t, int tlen, char *p, int plen)

Am I looking at the wrong thing? This is from around April 9th I think.


cheers

andrew


---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
Reply


Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are On
Forum Jump


All times are GMT. The time now is 01:45 PM.


Powered by vBulletin® Version 3.6.5
Copyright ©2000 - 2008, Jelsoft Enterprises Ltd.
SEO by vBSEO 3.2.0
www.UnixAdminTalk.com