This is a discussion on "Patch for collation using ICU" within the pgsql Hackers forums, part of the PostgreSQL category.
Hi!

I've put together a patch for using IBM's ICU package for collation.

If your OS does not have full support for collation or uppercase/lowercase conversion in multibyte locales, this might be useful. If you are using a multibyte character encoding in your database and want collation (i.e. ORDER BY) as well as lower(), upper() and initcap() to work properly, this patch will do just that.

This patch is needed for FreeBSD, since that OS has no support for collation in, for example, Unicode locales (that is, wcscoll(3) does not do what you expect if you set LC_ALL=sv_SE.UTF-8). AFAIK the patch is *not* necessary for Linux, although IBM claims ICU collation to be about twice as fast as glibc for simple western locales.

It adds a configure switch, `--with-icu', which sets up the code to use ICU instead of wchar_t and wcscoll.

This has been tested only on FreeBSD-4.11 and FreeBSD-5-stable, where it seems to run well. I've not had the time to do any comparative performance tests yet, but it seems to be at least not slower than using LATIN1 with the sv_SE.ISO8859-1 locale, perhaps even faster.

I'd be delighted if some more experienced PostgreSQL hackers would review this stuff. The patch is pretty compact, so it's fast reading. I'm planning to add this patch as an option (tagged "experimental") to FreeBSD's postgresql port. Any ideas about whether this is a good idea or not?

Any thoughts or ideas are welcome!

Cheers,
Palle

Patch at: <http://people.freebsd.org/~girgen/postgresql-icu/pg-801-icu-2005-03-14.diff>

ICU at sourceforge: <http://icu.sf.net/>

---------------------------(end of broadcast)---------------------------
TIP 7: don't forget to increase your free space map settings
| |||
On Friday, March 25, 2005 00:40:04 +0100, Palle Girgensohn <girgen@pingpong.net> wrote:

> I've put together a patch for using IBM's ICU package for collation.

Hi!

There's a new patch to fix some reported problems.

<http://people.freebsd.org/~girgen/postgresql-icu/pg-801-icu-2005-03-26.diff>

This version uses the DatabaseEncoding and sets the ICU encoding at the same time. I had to create a conversion table from PostgreSQL's own, somewhat odd and non-standard, encoding names to the preferred IANA names. One or two of the odder ones might be slightly incorrect, hopefully not too far off anyway?

I've noticed a couple of things about using the ICU patch vs. pristine pg-8.0.1:

- ORDER BY is case insensitive when using ICU. This might break the SQL standard (?), but sure is nice.

- When the database is initialized using the C locale, upper() and lower() normally do not work at all for non-ASCII characters, even if the database's encoding is, say, LATIN1 or UNICODE. (They do not work for me anyway, on FreeBSD, and this is probably correct since the locale is still `C', I believe?) The ICU patch changes nothing for the LATIN1 case, since it does not act on single-byte encodings, but for the UNICODE representation it works and does what I expect it to: upper() and lower() neatly upper- or lowercase diacritical characters, i.e. lower('ÅÄÖ') -> 'åäö'. This is a good thing, although I'm surprised that upper/lower is dragged along with the LC_COLLATE fixation at initdb. I never run initdb in the C locale, but only now do I realize how broken that really is if you need to store anything other than English :-)

I'd be delighted to get more feedback about this stuff.

Thanks,
Palle
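The upper()/lower() behaviour described in this message can be illustrated with a small sketch. This is not code from the patch: `ascii_lower` is a hypothetical helper that only imitates what a C-locale case mapping does (it touches A-Z and nothing else), and Python's built-in `str.lower` merely stands in for a Unicode-aware mapping such as the one ICU provides.

```python
def ascii_lower(text: str) -> str:
    """Lowercase only A-Z, leaving every other character untouched.

    Hypothetical stand-in for case mapping under the C locale.
    """
    return "".join(
        chr(ord(ch) + 32) if "A" <= ch <= "Z" else ch
        for ch in text
    )


def unicode_lower(text: str) -> str:
    """Unicode-aware lowercasing (stand-in for an ICU-backed lower())."""
    return text.lower()


if __name__ == "__main__":
    sample = "ÅÄÖ ABC"
    print(ascii_lower(sample))    # non-ASCII letters are left as they were
    print(unicode_lower(sample))  # all letters are lowercased
```

The point of the comparison is only that a mapping limited to ASCII cannot handle 'ÅÄÖ', whatever the database encoding is.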
| |||
On Sat, 26 Mar 2005, Palle Girgensohn wrote:

> I've noticed a couple of things about using the ICU patch vs. pristine
> pg-8.0.1:
>
> - ORDER BY is case insensitive when using ICU. This might break the SQL
>   standard (?), but sure is nice

Err, I think if your system implements strcoll correctly, 8.0.1 can do this if the chosen collation is set up that way (or at least naive tests I've done seem to imply that). Or are you speaking about the C locale?
| |||
On Saturday, March 26, 2005 08:16:01 -0800, Stephan Szabo <sszabo@megazone.bigpanda.com> wrote:

> On Sat, 26 Mar 2005, Palle Girgensohn wrote:
>> - ORDER BY is case insensitive when using ICU. This might break the SQL
>>   standard (?), but sure is nice
>
> Err, I think if your system implements strcoll correctly 8.0.1 can do this
> if the chosen collation is set up that way (or at least naive tests I've
> done seem to imply that). Or are you speaking about C locale?

No, I doubt this.

Example: set up a cluster:

    $ initdb -E LATIN1 --locale=sv_SE.ISO8859-1
    $ createdb foo
    CREATE DATABASE
    $ psql foo
    foo=# create table bar (val text);
    CREATE TABLE
    foo=# insert into bar values ('aaa');
    INSERT 18354409 1
    foo=# insert into bar values ('BBB');
    INSERT 18354412 1
    foo=# select val from bar order by val;
     val
    -----
     BBB
     aaa
    (2 rows)

Order by is not case insensitive. It shouldn't be for any system, AFAIK. As John Hansen noted, this might be a bad thing. I'm not sure about that, though...

As for general collation of Unicode, the reason for me to use ICU is that my system does not support strcoll correctly for multibyte locales, as I mentioned earlier. I also noted that even for systems that do handle strcoll correctly for Unicode, ICU claims to be a couple of magnitudes faster, so this patch might be useful for other systems (read Linux) as well. See previous emails for details.

Regards,
Palle
| |||
On Sun, 27 Mar 2005, Palle Girgensohn wrote:

> foo=# select val from bar order by val;
>  val
> -----
>  BBB
>  aaa
> (2 rows)
>
> Order by is not case insensitive. It shouldn't be for any system, AFAIK.

It is on my machine... for the same test:

    foo=# select val from bar order by val;
     val
    -----
     aaa
     BBB
    (2 rows)

I think this just implies even greater breakage of either the collation or strcoll on the system you're trying on. That is a reasonable reason to offer an alternative, especially if it's generically useful.
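The two result orders seen in this exchange can be reproduced in a minimal sketch. It assumes nothing about how glibc, FreeBSD or ICU actually implement collation: plain code-point comparison stands in for the "capital letters first" behaviour, and a `casefold` sort key stands in for a collation that ignores case at the first level. The function names are made up for illustration.

```python
def sort_codepoint(values):
    """Sort by raw code point, so 'B' (66) comes before 'a' (97)."""
    return sorted(values)


def sort_ignoring_case(values):
    """Sort by case-folded text, using the original text as a tie-breaker."""
    return sorted(values, key=lambda s: (s.casefold(), s))


if __name__ == "__main__":
    rows = ["aaa", "BBB"]
    print(sort_codepoint(rows))      # capital letters first
    print(sort_ignoring_case(rows))  # 'aaa' before 'BBB'
```

Both orders are deterministic for a given rule; the disagreement in the thread comes from the two systems applying different rules under the same locale name.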
| |||
On Saturday, March 26, 2005 17:40:01 -0800, Stephan Szabo <sszabo@megazone.bigpanda.com> wrote:

> On Sun, 27 Mar 2005, Palle Girgensohn wrote:
>> Order by is not case insensitive. It shouldn't be for any system, AFAIK.
>
> It is on my machine... for the same test:
>
> foo=# select val from bar order by val;
>  val
> -----
>  aaa
>  BBB
> (2 rows)
>
> I think this just implies even greater breakage of either the collation or
> strcoll on the system you're trying on.

Interesting! Indeed, I just tried it on an old Red Hat Linux system... BTW, that's pretty odd for a Unix system. "ls -l" sorts aaa before BBB; I've never seen the likes of it! Call me old-fashioned if you like ;-)

Still, as you say, FreeBSD sorts capital letters first and does not handle collation for Unicode locales, so I need an alternative. Perhaps the best way would be to inject ICU into BSD instead :-)

/Palle
| |||
On Sat, 2005-03-26 at 03:09 +0100, Palle Girgensohn wrote:

> I've noticed a couple of things about using the ICU patch vs. pristine
> pg-8.0.1:
>
> - ORDER BY is case insensitive when using ICU. This might break the SQL
>   standard (?), but sure is nice

How does your patch interact with the ability to use indexes for anchored LIKE or regex (i.e. can "name LIKE 'start%'" still use an index)?

--
Hannu Krosing <hannu@tm.ee>
| |||
On Sunday, March 27, 2005 04:34:03 +0300, Hannu Krosing <hannu@tm.ee> wrote:

> On Sat, 2005-03-26 at 03:09 +0100, Palle Girgensohn wrote:
>> - ORDER BY is case insensitive when using ICU. This might break the SQL
>>   standard (?), but sure is nice

Just a comment: ORDER BY *is* already case insensitive on Linux, since its strcoll ignores case. I doubt very much it violates SQL standards.

> How does your patch interact with the ability to use indexes for
> anchored LIKE or regex (i.e. can "name LIKE 'start%'" still use index) ?

I don't think it matters. You still need to use the special non-locale index functions described in the handbook to make anchored LIKE queries use indexes. My ICU patch does not alter this. ICU is "injected" where strings are compared and the database cluster was initialized with a multibyte character encoding. The problem, AFAIK, has to do with the nature of (some) locales, not with a specific implementation of collation.

Regards,
Palle
| |||
Palle Girgensohn wrote:

> Just a comment: ORDER BY *is* already case insensitive on Linux, since
> its strcoll ignores case. I doubt very much it violates SQL
> standards.

The behavior of collation sequences is implementation-defined. So as long as you can put the behavior in words, it should be OK. It would seem, however, that the behavior of a certain locale name should be the same with or without ICU, so perhaps some locale renaming might be needed, but that is speculation on my part.

> > How does your patch interact with the ability to use indexes for
> > anchored LIKE or regex (i.e. can "name LIKE 'start%'" still use
> > index) ?

> The problem, AFAIK, has to do with the nature of (some) locales, not
> with a specific implementation of collation.

Yeah, pretty much the whole point of that code is to avoid collating stuff.

--
Peter Eisentraut
http://developer.postgresql.org/~petere/
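Why an anchored LIKE can use an index only when the index order is plain byte or code-point order can be sketched as follows. This is an illustration of the general idea, not PostgreSQL's planner code: a prefix pattern such as 'start%' is turned into a range whose upper bound is the prefix with its last character incremented, and that range is only contiguous if entries are sorted without locale-specific collation. The helper names `prefix_upper_bound` and `prefix_scan` are hypothetical, and a sorted Python list stands in for the index.

```python
from bisect import bisect_left


def prefix_upper_bound(prefix: str) -> str:
    """Return the smallest string sorting after every string with this prefix.

    Assumes a non-empty prefix whose last character is not the highest
    code point; valid only for code-point ordering.
    """
    if not prefix:
        raise ValueError("prefix must not be empty")
    return prefix[:-1] + chr(ord(prefix[-1]) + 1)


def prefix_scan(sorted_names, prefix):
    """Emulate an index range scan for LIKE 'prefix%' over a sorted list."""
    lo = bisect_left(sorted_names, prefix)
    hi = bisect_left(sorted_names, prefix_upper_bound(prefix))
    return sorted_names[lo:hi]


if __name__ == "__main__":
    # The list is sorted by code point, like an index built without collation.
    names = sorted(["Start", "start", "started", "starting", "staru", "stop"])
    print(prefix_scan(names, "start"))
```

Under a locale-aware ordering the entries matching a prefix need not sit between those two bounds, which is consistent with the remark that the purpose of that code is to avoid collating.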
| ||||
Is this patch ready for application?

http://people.freebsd.org/~girgen/po...-05-06.diff.gz

The web site is:

http://people.freebsd.org/~girgen/po...cu/readme.html

I do have a few questions:

Why don't you use the lc_ctype_is_c() part of this test?

    if (pg_database_encoding_max_length() > 1 && !lc_ctype_is_c())

Why is so much code added, for example, in lower()? The existing multibyte code is much smaller, and lots of code is added in other places too.

Why do you need to add a mapping of encoding names from IANA to our names?

---------------------------------------------------------------------------

Palle Girgensohn wrote:

> I've put together a patch for using IBM's ICU package for collation.

--
Bruce Momjian
http://candle.pha.pa.us