This is a discussion on Proof of concept COLLATE support with patch within the pgsql Hackers forums, part of the PostgreSQL category; --> Supports any glibc platform and possibly Win32. Adds: SELECT ... ORDER BY expr COLLATE 'locale' CREATE INDEX locale_index ON ...
| |||||||
| FAQ | Members List | Calendar | Search | Today's Posts | Mark Forums Read |
| ||||
| Supports any glibc platform and possibly Win32. Adds: SELECT ... ORDER BY expr COLLATE 'locale' CREATE INDEX locale_index ON table(expr COLLATE 'locale') Index scan used when COLLATE order permits This is just a proof of concept patch. I didn't send it to -patches because as Tom pointed out, there's no hope of it getting in due to platform dependant behaviour. This patch does not use setlocale and is completely orthoganal to any locale support already in the backend. As it turns out, meaningful locale support only needs a handful of support functions to work. These are listed at the bottom. My patch only uses the first two, but the third will be needed at some stage. The use of the last one depends on how the backend ends up support locales. Both glibc and wine32 have locale sensetive versions of many functions including: toupper_l, tolower_l, strfmon_l, strtoul_l, strtof_l, strftime_l, is*_l A windows function list is at: http://msdn2.microsoft.com/library/w...us,vs.80).aspx Patch available here: http://svana.org/kleptog/pgsql/collate1.patch Implementation notes follow and table of functions is at the bottom. I hope this helps whenever someone gets around to full COLLATE support. Have a nice day, Notes: * It works by replacing (expr COLLATE 'locale') with pg_strxfrm(expr, pg_findlocale(locale)) in the parsetree. pg_findlocale returns an opaque pointer to the locale. It is STRICT IMMUTABLE and is optimised away in the final query. pg_strxfrm takes the string and the locale and returns a bytea. bytea comparison uses memcmp so is safe from other locale effects in the backend. * Use of COLLATE for an index will probably double the diskspace required for that index due to the strxfrm. * I had to add the functions to pg_proc.h because CREATE FUNCTION couldn't find them. So they have OIDs I made up. You may need to initdb, I'm not sure. You can compile pg_xlocale.c as an shared object and load them that way too if you want to avoid the initdb. * Internally they are defined as taking and returning "internal". CREATE FUNCTION doesn't like that so specify opaque or oid instead. The declarations are: create function pg_findlocale(text) returns oid as 'pg_findlocale' language internal strict immutable; create function pg_strxfrm(text,oid) returns bytea as 'pg_strxfrm' language internal strict immutable; * The clause ORDER BY 1 COLLATE 'en_AU' breaks, it treats the 1 like a constant. I couldn't quickly work out how to reference the columns the right way. Long term that code should be in the sorting code anyway. * The locale needs to be in quotes, otherwise the parser converts it to lower-case. Locale names are case-sensetive on many systems. * There is a text function strcoll_l for testing collation: create function pg_strcoll_l(text,text,text) returns int4 as 'pg_strcoll_l' language internal strict immutable; * Yes this is the easy way out, implementing the inheritence of the COLLATE attribute will be much more invasive. This gives most people what they want though. * Although these functions are documented on Windows, they are not for glibc, so it is an unstable insterface. Function Needed glibc Win32 --------------------------------------------------------------------- Function returing opaque newlocale _create_locale pointer to locale data strxfrm with locale parameter strxfrm_l _strxfrm_l Method finding encoding for nl_langinfo_l ??? locale strcoll with locale parameter strcoll_l _strcoll_l -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a > tool for doing 5% of the work and then sitting around waiting for someone > else to do the other 95% so you can sue them. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.0.6 (GNU/Linux) Comment: For info see http://www.gnupg.org iD8DBQFDGE3UIB7bNG8LQkwRAnixAJ4gqdbXsozw5rxr1E3ETD pQz4X/EgCfQ7LH gYZ5cGFCB8DTAl2f7oIcR6w= =rfAZ -----END PGP SIGNATURE----- |
| |||
| Martijn van Oosterhout <kleptog@svana.org> writes: > Supports any glibc platform and possibly Win32. > > Adds: > SELECT ... ORDER BY expr COLLATE 'locale' > CREATE INDEX locale_index ON table(expr COLLATE 'locale') > Index scan used when COLLATE order permits > > This is just a proof of concept patch. I didn't send it to -patches > because as Tom pointed out, there's no hope of it getting in due to > platform dependant behaviour. > > This patch does not use setlocale and is completely orthoganal to any > locale support already in the backend. I still doesn't get where the hostility towards this functionality comes from. Just because some platforms provide a better interface than others doesn't mean Postgres shouldn't do the best it can with what's available. If there were an autoconf test for the *_l functions and a failover to calling setlocale (safely protected) then it's just an issue that the feature will be faster on some platforms than others. It'll still be the same behaviour on all platforms. So there's no actual platform dependent Postgres behaviour. Should readline support be ripped out because not every platform will have readline? Or O_DIRECT support? Or unix domain socket support? -- greg ---------------------------(end of broadcast)--------------------------- TIP 4: Have you searched our list archives? http://archives.postgresql.org |
| |||
| Greg Stark <gsstark@mit.edu> writes: > I still doesn't get where the hostility towards this functionality comes from. We're not really willing to say "here is a piece of syntax REQUIRED BY THE SQL SPEC which we only support on some platforms". readline, O_DIRECT, and the like are a completely inappropriate analogy, because those are inherently platform-dependent (and not in the spec). The objection is fundamentally that a platform-specific implementation cannot be our long-term goal, and so expending effort on creating one seems like a diversion. If there were a plan put forward showing how this is just a useful way-station, and we could see how we'd later get rid of the glibc dependency without throwing away the work already done, then it would be a different story. regards, tom lane ---------------------------(end of broadcast)--------------------------- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq |
| |||
| On Fri, Sep 02, 2005 at 03:04:20PM +0200, Martijn van Oosterhout wrote: > Supports any glibc platform and possibly Win32. MacOS X [1] supports this also apparently. And for glibc it appears to have been accepted as part of the API since 2.3.2 and formally accepted into LSB3.0. Win32 claims to have supported this since '98. But even though the MacOS X manpage says "BSD Library Functions" at the top of the page, neither FreeBSD or OpenBSD doesn't appear to have it at all. Not really a lot of chance that we could pull portions of the Darwin libc into PostgreSQL, huh? Maybe the easiest thing would be to download the libc locale support of one of the BSDs, remove the global variable and use that... [1] http://www.hmug.org/man/3/newlocale.php Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a > tool for doing 5% of the work and then sitting around waiting for someone > else to do the other 95% so you can sue them. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.0.6 (GNU/Linux) Comment: For info see http://www.gnupg.org iD8DBQFDGHVXIB7bNG8LQkwRAjHpAJwMfVKpOYurWhXO5tmI86 2mjWtOEQCfc92j WvrKnLDJoEq+wWfyHso7E58= =RJAH -----END PGP SIGNATURE----- |
| |||
| Martijn van Oosterhout <kleptog@svana.org> writes: > [1] http://www.hmug.org/man/3/newlocale.php Hmm, the more general page seems to be http://www.hmug.org/man/3/xlocale.php This seems to be pretty much exactly what we want, at least API-wise. Now, if we can find an implementation of this with a BSD license ;-) ... [ I don't recall at the moment whether Apple publishes all of Darwin under a straight BSD license, but that would surely be a good place to look first. ] regards, tom lane ---------------------------(end of broadcast)--------------------------- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match |
| |||
| On Fri, Sep 02, 2005 at 12:44:00PM -0400, Tom Lane wrote: > > Hmm, the more general page seems to be > > http://www.hmug.org/man/3/xlocale.php > > This seems to be pretty much exactly what we want, at least API-wise. > Now, if we can find an implementation of this with a BSD license ;-) ... Yes it is, it's exactly the same interface as glibc. Windows has them all with an underscore prefix. > [ I don't recall at the moment whether Apple publishes all of Darwin > under a straight BSD license, but that would surely be a good place to > look first. ] libc is listed as APSL licence, whatever that means. Something with that many clauses can't be BSD compatable. What I wonder is how come Apple implemented all this in their version yet none of the BSDs got around to it. I've looked around for Citrus, it appears that NetBSD contains the latest version and while there's a lot of stuff for LC_CTYPE and charset conversion, LC_COLLATE didn't appear to be high on their priorities. I especially liked these fragments from the OpenBSD and NetBSD CVS repositories. Tom, you've comvinced me, relying on the platform is silly. We have platforms that don't support LC_COLLATE in one locale, let alone multiple. FreeBSD thankfully does support it. http://cvsweb.netbsd.org/bsdweb.cgi/...oll.c?rev=HEAD http://www.openbsd.org/cgi-bin/cvswe...oll.c?rev=HEAD --- snip --- /* * Compare strings according to LC_COLLATE category of current locale. */ int strcoll(s1, s2) const char *s1, *s2; { _DIAGASSERT(s1 != NULL); _DIAGASSERT(s2 != NULL); /* LC_COLLATE is unimplemented, hence always "C" */ return (strcmp(s1, s2)); } -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a > tool for doing 5% of the work and then sitting around waiting for someone > else to do the other 95% so you can sue them. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.0.6 (GNU/Linux) Comment: For info see http://www.gnupg.org iD8DBQFDGIYlIB7bNG8LQkwRAj9mAJ9yyAqU4+x28fL9HYFFOE WYaDFPEgCfbHwU 9rfl2+RN6pXnwhHJSFyhot0= =ijEF -----END PGP SIGNATURE----- |
| |||
| The sources can be found here: http://darwinsource.opendarwin.org/1...cale/xlocale.c The Apple License *is* necessarily compatible with the BSD License. http://www.gnu.org/philosophy/apsl.html On Sep 2, 2005, at 11:44 AM, Tom Lane wrote: > Martijn van Oosterhout <kleptog@svana.org> writes: > >> [1] http://www.hmug.org/man/3/newlocale.php >> > > Hmm, the more general page seems to be > > http://www.hmug.org/man/3/xlocale.php > > This seems to be pretty much exactly what we want, at least API-wise. > Now, if we can find an implementation of this with a BSD > license ;-) ... > > [ I don't recall at the moment whether Apple publishes all of Darwin > under a straight BSD license, but that would surely be a good place to > look first. ] > > regards, tom lane |-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|- AgentM agentm@themactionfaction.com |-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|- ---------------------------(end of broadcast)--------------------------- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to majordomo@postgresql.org so that your message can get through to the mailing list cleanly |
| |||
| AgentM wrote: > The sources can be found here: > http://darwinsource.opendarwin.org/1...cale/xlocale.c > > The Apple License *is* necessarily compatible with the BSD License. > http://www.gnu.org/philosophy/apsl.html Does compatibile mean our combined work is still BSD licensed? --------------------------------------------------------------------------- > > On Sep 2, 2005, at 11:44 AM, Tom Lane wrote: > > > Martijn van Oosterhout <kleptog@svana.org> writes: > > > >> [1] http://www.hmug.org/man/3/newlocale.php > >> > > > > Hmm, the more general page seems to be > > > > http://www.hmug.org/man/3/xlocale.php > > > > This seems to be pretty much exactly what we want, at least API-wise. > > Now, if we can find an implementation of this with a BSD > > license ;-) ... > > > > [ I don't recall at the moment whether Apple publishes all of Darwin > > under a straight BSD license, but that would surely be a good place to > > look first. ] > > > > regards, tom lane > > |-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|- > AgentM > agentm@themactionfaction.com > |-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|- > > > ---------------------------(end of broadcast)--------------------------- > TIP 1: if posting/reading through Usenet, please send an appropriate > subscribe-nomail command to majordomo@postgresql.org so that your > message can get through to the mailing list cleanly > -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073 ---------------------------(end of broadcast)--------------------------- TIP 5: don't forget to increase your free space map settings |
| |||
| On Fri, Sep 02, 2005 at 04:49:03PM -0400, Bruce Momjian wrote: > AgentM wrote: > > The sources can be found here: > > http://darwinsource.opendarwin.org/1...cale/xlocale.c > > > > The Apple License *is* necessarily compatible with the BSD License. > > http://www.gnu.org/philosophy/apsl.html > > Does compatibile mean our combined work is still BSD licensed? No, because of clause 2.2 (c) of the APSL, at least. (Must distribute source code if modified.) -- Alvaro Herrera -- Valdivia, Chile Architect, www.EnterpriseDB.com Dios hizo a Adán, pero fue Eva quien lo hizo hombre. ---------------------------(end of broadcast)--------------------------- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq |
| ||||
| On Fri, Sep 02, 2005 at 11:42:21AM -0400, Tom Lane wrote: > The objection is fundamentally that a platform-specific implementation > cannot be our long-term goal, and so expending effort on creating one > seems like a diversion. If there were a plan put forward showing how > this is just a useful way-station, and we could see how we'd later get > rid of the glibc dependency without throwing away the work already done, > then it would be a different story. Well, my patch showed that useful locale work can be acheived with precisely two functions: newlocale and strxfrm_l. I'm going to talk about two things: one, the code from Apple. Two, how we present locale support to users. --- Now, it would be really nice to take Apple's implementation in Darwin and use that. What I don't understand is the licence of the code in Darwin. My interpretation is that stuff in: http://darwinsource.opendarwin.org/1...bc-391/locale/ is Apple stuff under APSL, useless to us. And that stuff in: http://darwinsource.opendarwin.org/1...ocale/FreeBSD/ are just patches to FreeBSD and this under the normal BSD license (no big header claiming the licence change). The good news is that the majority of what we need is in patch form. The bad news is that the hub of the good stuff (newlocale, duplocale, freelocale) is under a big fat APSL licence. Does anyone know if this code can be used at all by BSD projects or did they blanket relicence everything? --- Now, I want to bring up some points relating to including a locale library in PostgreSQL. Given that none of the BSDs seem really interested in fixing the issue we'll have to do it ourselves (I don't see anyone else doing it). We can save ourselves effort by basing it on FreeBSDs locale code, because then we can use their datafiles, which we *definitly* don't want to maintain ourselves. Now: 1. FreeBSDs locale list is short, some 48 compared with glibc's 217. Hopefully Apple can expand on that in a way we can use. But given the difference we should probably give people a way of falling back to the system libraries in case there's a locale we don't support. On the other hand, lots of locales are similar so maybe people can find ones close enough to work. No, glibc and FreeBSD use different file formats, so you can't copy them. Do we want this locale data just for collation, or do we want to be able to use it for formatting monetary amounts too? This is even more info to store. Lots of languages use ISO/IEC 14651 for order. 2. Locale data needs to be combined with a charset and compiled to work with the library. PostgreSQL supports at least 15 charsets but we don't want to ship compiled versions of all of these (Debian learnt that the hard way). So, how do we generate the files people need. a. Auto-compile on demand. First time a locale is referenced spawn the compiler to create the locale, then continue. (Ugh) b. Add a CREATE LOCALE english AS 'en_US' WITH CHARSET 'utf8'. Then require the COLLATE clause to refer to this identifier. This has some appeal, seperating the system names from the PostgreSQL names. It also gives some info regarding charsets. c. Should users be allowed to define new locales? d. Should admins be required to create the external files using a program, say pg_createlocale. Remember, if you use a latin1 locale to sort utf8 you'll get the wrong result, so we want to avoid that. 3. Compiled locale files are large. One UTF-8 locale datafile can exceed a megabyte. Do we want the option of disabling it for small systems? 4. Do we want the option of running system locale in parallel with the internal ones? 5. I think we're going to have to deal with the very real possibility that our locale database will not be as good as some of the system provided ones. The question is how. This is quite unlike timezones which are quite standardized and rarely change. That database is quite well maintained. Would people object to a configure option that selected: --with-locales=internal (use pg database) --with-locales=system (use system database for win32, glibc or MacOS X) --with-locales=none (what we support now, which is neither) I don't think it will be much of an issue to support this, all the functions take the same parameters and have almost the same names. 6. Locales for SQL_ASCII. Seems to me you have two options, either reject COLLATE altogether unless they specify a charset, or don't care and let the user shoot themselves in the foot if they wish... BTW, this MacOS locale supports seems to be new for 10.4.2 according to the CVS log info, can anyone confirm this? Anyway, I hope this post didn't bore too much. Locale support has been one of those things that has bugged me for a long time and it would be nice if there could be some real movement. Have a nice weekend, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a > tool for doing 5% of the work and then sitting around waiting for someone > else to do the other 95% so you can sue them. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.0.6 (GNU/Linux) Comment: For info see http://www.gnupg.org iD8DBQFDGgjYIB7bNG8LQkwRAr6OAJ9uqNDKDQKWUAY4KiPAaz HJ1TsVWwCeJ7sq 7hcILjdgZTQ4LjyPAhWnJwQ= =5MT9 -----END PGP SIGNATURE----- |