Unix Technical Forum

Re: Patch for collation using ICU

#1
04-11-2008, 04:14 AM
John Hansen

> -----Original Message-----
> From: Palle Girgensohn [mailto:girgen@pingpong.net]
> Sent: Saturday, March 26, 2005 1:10 PM
> To: pgsql-hackers@postgresql.org
> Cc: John Hansen; Andrew Dunstan
> Subject: Re: [HACKERS] Patch for collation using ICU
>
> --On Friday, March 25, 2005 00.40.04 +0100 Palle Girgensohn
> <girgen@pingpong.net> wrote:
>
> > Hi!
> >
> > I've put together a patch for using IBM's ICU package for collation.
> >
> > If your OS does not have full support for collation or
> > uppercase/lowercase in multibyte locales, this might be useful. If you
> > are using a multibyte character encoding in your database and want
> > collation, i.e. ORDER BY, and also lower(), upper() and initcap() to
> > work properly, this patch will do just that.
> >
> > This patch is needed for FreeBSD, since this OS has no support for
> > collation of, for example, Unicode locales (that is, wcscoll(3) does not
> > do what you expect if you set LC_ALL=sv_SE.UTF-8). AFAIK
> > the patch is *not* necessary for Linux, although IBM claims ICU
> > collation to be about twice as fast as glibc for simple western locales.
> >
> > It adds a configure switch, `--with-icu', which will set up the code
> > to use ICU instead of wchar_t and wcscoll.
> >
> > This has been tested only on FreeBSD-4.11 & FreeBSD-5-stable, where it
> > seems to run well. I've not had the time to do any comparative
> > performance tests yet, but it seems it is at least not slower than
> > using LATIN1 with sv_SE.ISO8859-1 locale, perhaps even faster.
> >
> > I'd be delighted if some more experienced PostgreSQL hackers would
> > review this stuff. The patch is pretty compact, so it's fast reading.
> > I'm planning to add this patch as an option (tagged
> > "experimental") to FreeBSD's PostgreSQL port. Any ideas about whether
> > this is a good idea or not?
> >
> > Any thoughts or ideas are welcome!
> >
> > Cheers,
> > Palle
> >
> > Patch at:
> >

> > <http://people.freebsd.org/~girgen/po...-icu-2005-03-14.diff>
> >
> > ICU at sourceforge: <http://icu.sf.net/>

>
>
> Hi!
>
> There's a new patch to fix some reported problems.
>
> <http://people.freebsd.org/~girgen/po...u/pg-801-icu-2005-03-26.diff>
>
> This version uses the DatabaseEncoding and sets the ICU
> encoding at the same time. I had to create a conversion table
> from PostgreSQL's own, somewhat odd and non-standard, names
> of encodings, into the preferred IANA names. One or two of the
> more odd ones might be slightly incorrect, hopefully not too
> far off anyway?
>
> I've noticed a couple of things about using the ICU patch vs. pristine
> pg-8.0.1:
>
> - ORDER BY is case insensitive when using ICU. This might
> break the SQL standard (?), but sure is nice


This would mean that indexes are also case insensitive, right?
Which makes it a Bad Thing(tm).

> - When the database is initialized using the C locale,
> upper() and lower() normally do not work at all for
> non-ASCII characters even if the database's encoding is, say,
> LATIN1 or UNICODE. (They do not work for me anyway, on FreeBSD,
> and this is probably correct since the locale is still `C', I
> believe?) The ICU patch changes nothing for the LATIN1 case,
> since it does not act on single byte encodings, but for the
> UNICODE representation, it works and does what I expect it
> to, namely upper() and lower() neatly
> upper- or lowercase diacritical characters, e.g. lower('ÅÄÖ')
> -> 'åäö'.
> This is a good thing, although I'm surprised that upper/lower
> is dragged along with the LC_COLLATE fixation at initdb. I
> never run initdb in the C locale, but only now do I realize
> how broken that really is if you need to store anything other
> than English :-)


That is what I would have expected. However, it probably won't work for the more exotic cases, like the Turkish I, which depends on the locale.
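
To make the C-locale behaviour described above concrete, here is a minimal Python sketch. It is not PostgreSQL or ICU code; `ascii_only_lower` is a hypothetical helper that only approximates what a byte-wise, C-locale tolower() would do to UTF-8 data. The point is that the multibyte sequences for Å, Ä and Ö contain no bytes in the ASCII A-Z range, so a byte-wise conversion passes them through untouched, while a Unicode-aware conversion maps them.

```python
def ascii_only_lower(data: bytes) -> bytes:
    """Lowercase only the ASCII letters A-Z, one byte at a time.

    Hypothetical stand-in for a C-locale tolower() applied to raw bytes.
    """
    return bytes(b + 32 if 0x41 <= b <= 0x5A else b for b in data)


text = "ÅÄÖ ABC"
utf8 = text.encode("utf-8")

# Byte-wise conversion: the multibyte characters are left unchanged.
bytewise = ascii_only_lower(utf8).decode("utf-8")

# Unicode-aware conversion: decode first, then map each character.
unicode_aware = utf8.decode("utf-8").lower()

print(bytewise)       # ÅÄÖ abc
print(unicode_aware)  # åäö abc
```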

>
> I'd be delighted to get more feedback about this stuff.
>
> Thanks,
> Palle
>
>
>


---------------------------(end of broadcast)---------------------------
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly

#2
04-11-2008, 04:14 AM
Palle Girgensohn

--On Saturday, March 26, 2005 13.59.19 +1100 John Hansen <john@geeknet.com.au>
wrote:

>> - ORDER BY is case insensitive when using ICU. This might
>> break the SQL standard (?), but sure is nice

>
> This would mean that indexes are also case insensitive, right?
> Which makes it a Bad Thing(tm).


Well, no, not really. Indices use collation rules, yes, but upper and lower
case strings are not considered *equal*, just "more closely related". In
collation, characters are compared at four levels. See [1] for a good
explanation. This means that indices will use a case insensitive sort
order, but equality is unaffected, so it shouldn't break anything.
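
The idea of comparing at several levels can be sketched in a few lines of Python. This is a simplified illustration with only two levels, not ICU's actual algorithm; `level_key` is a hypothetical helper, and putting lowercase before uppercase in the tie-break is an assumption made for the example. Letters are compared first ignoring case, and case is consulted only to break ties, so "apple" and "Apple" sort next to each other but still produce different keys.

```python
def level_key(s: str):
    """Build a two-level sort key (hypothetical, much simpler than ICU)."""
    # First level: compare the letters with case ignored.
    primary = s.casefold()
    # Later level: case only breaks ties; False (lowercase) sorts first.
    case_level = tuple(ch.isupper() for ch in s)
    return (primary, case_level)


words = ["banana", "Apple", "apple", "Banana", "cherry"]

# Code point order, similar to the C locale: all uppercase words first.
by_code_point = sorted(words)

# Multi-level order: case matters only as a tie-breaker.
by_levels = sorted(words, key=level_key)

print(by_code_point)  # ['Apple', 'Banana', 'apple', 'banana', 'cherry']
print(by_levels)      # ['apple', 'Apple', 'banana', 'Banana', 'cherry']

# The two spellings sort together but are still not equal.
print(level_key("apple") == level_key("Apple"))  # False
```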

>> - When the database is initialized using the C locale,
>> upper() and lower() normally do not work at all for
>> non-ASCII characters even if the database's encoding is, say,
>> LATIN1 or UNICODE. (They do not work for me anyway, on FreeBSD,
>> and this is probably correct since the locale is still `C', I
>> believe?) The ICU patch changes nothing for the LATIN1 case,
>> since it does not act on single byte encodings, but for the
>> UNICODE representation, it works and does what I expect it
>> to, namely upper() and lower() neatly
>> upper- or lowercase diacritical characters, e.g. lower('ÅÄÖ')
>> -> 'åäö'.
>> This is a good thing, although I'm surprised that upper/lower
>> is dragged along with the LC_COLLATE fixation at initdb. I
>> never run initdb in the C locale, but only now do I realize
>> how broken that really is if you need to store anything other
>> than English :-)

>
> That is what I would have expected. However, it probably won't work for
> the more exotic cases, like the Turkish I, which depends on the locale.


Nope, Turkish must of course have its own locale to handle, for example,
its special capital "i". Let's just say it is less broken.
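
For readers unfamiliar with the Turkish case, the following Python sketch shows why a locale-independent mapping cannot be right for every language. It is only an illustration, not how the patch or ICU implements it; `lower_turkish` is a hypothetical helper written for this example. In Turkish, capital I lowercases to dotless ı (U+0131), and dotted capital İ (U+0130) lowercases to plain i, whereas the default mapping turns I into i.

```python
DOTTED_CAPITAL_I = "\u0130"  # İ
DOTLESS_SMALL_I = "\u0131"   # ı


def lower_default(s: str) -> str:
    """Locale-independent lowercasing, as Python's str.lower() does it."""
    return s.lower()


def lower_turkish(s: str) -> str:
    """Hypothetical helper: apply the Turkish I rules, then the default mapping."""
    s = s.replace(DOTTED_CAPITAL_I, "i")  # İ -> i
    s = s.replace("I", DOTLESS_SMALL_I)   # I -> ı
    return s.lower()


print(lower_default("I"))      # i
print(lower_turkish("I"))      # ı
print(lower_turkish("IĞDIR"))  # ığdır
print(lower_turkish("İZMİR"))  # izmir
```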

/Palle


[1]
<http://icu.sourceforge.net/userguide/Collate_Concepts.html#Comparison_Levels>


All times are GMT.

