vBulletin Search Engine Optimization
| |||||||
| Register | FAQ | Members List | Calendar | Search | Today's Posts | Mark Forums Read |
| ||||
| I've been going through the docs and list archives trying to get clear on encoding issues, but still have a few question. Do I have these statements correct? - LC_COLLATE is set on the cluster at initdb time. From that point on all database text is sorted based on that *regardless* of the encoding set on an individual database. - So for lc_collate="C" sorting is just based on the byte values, and if lc_collate="en_US" then sorting is based on the us_US order, and the bytes are assumed to be 8859-1 (if that matters). - To clarify the first point, if the database is encoded utf-8 and lc_collate is en_US then Postgresql does NOT try to convert utf-8 to 8859-1 before sorting. - If the "client encoding" and the database encoding differ then Postgresql will convert between the two encodings during I/O. - The exception is if *either* the client or the server's encoding is "SQL_ASCII" then no client<=>server conversion is done. Sound about right? 1) What else is the database's encoding used for besides to determine how to convert text in input and output based on the client encoding? 2) What client encoding is used if the client does not specify one? For example, I'm looking through Perl's DBD::Pg and I don't see any place where it calls PQsetClientEncoding(). http://search.cpan.org/src/DBDPG/DBD-Pg-1.41/ http://www.postgresql.org/docs/7.4/i...NSLATION-TABLE 3) The vast majority of my utf-8 encoded text that I need to display sorted probably maps to 8859-1 characters. I think I already answered this above, but: Am I correct that Postgresql is *not* converting text from the database encoding to the cluster encoding before sorting? That is with "C" it's just sorting in byte order, and with en_US it's just assuming that the bytes are 8859-1 and ignoring that it's really utf-8? That is, if I have text that's in utf-8 but includes characters that would map to 8859-1 (say accented chars), that sorting will not be correct because it's not converted to 8859-1 when sorting? 4) If the above is true, then if I wanted my utf-8 encoded text to be sorted correctly then I'd need to re-initdb using --encoding=en_US.UTF-8, correct? 5) I suppose there's not way to answer this, short of running benchmarks, but any ideas what using a lc_collate with utf-8 would do to performance? Is it a big hit? Not related to Postgresql, but testing some of this is confusing due to my environment. How do I get my xterm to work with utf8? Does ssh do something with encoding? If I have a utf8 xterm window open on my machine, then ssh to the server running postgresql where the default locale is "POSIX" Then running: LANG=en_US.utf8 psql utf8test utf8test=> \encoding UNICODE utf8test=> select first_name from person where last_name = 'Anderson'; Then I see: Zo<C3><AB> But, if on that same remote machine I run a unicode xterm (uxterm in Debian) then in that xterm window I do: utf8test=> \encoding UNICODE utf8test=> select first_name from person where last_name = 'Anderson'; Zoë (correct) It's must slower running xterm remotely than using my local xterm and ssh, so it would be nice to be able to display the utf8. -- Bill Moseley moseley@hank.org ---------------------------(end of broadcast)--------------------------- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to majordomo@postgresql.org so that your message can get through to the mailing list cleanly |
| |||
| Bill Moseley wrote: > Do I have these statements correct? yes > 1) What else is the database's encoding used for besides to determine > how to convert text in input and output based on the client encoding? nothing > 2) What client encoding is used if the client does not specify one? the server encoding > 3) The vast majority of my utf-8 encoded text that I need to display > sorted probably maps to 8859-1 characters. probably not > That is, if I have text that's in utf-8 but includes characters that > would map to 8859-1 (say accented chars), that sorting will not be > correct because it's not converted to 8859-1 when sorting? right > 4) If the above is true, then if I wanted my utf-8 encoded text to be > sorted correctly then I'd need to re-initdb using > --encoding=en_US.UTF-8, correct? right > 5) I suppose there's not way to answer this, short of running > benchmarks, but any ideas what using a lc_collate with utf-8 would do > to performance? Is it a big hit? I don't know why that would be a problem. > Not related to Postgresql, but testing some of this is confusing > due to my environment. How do I get my xterm to work with utf8? > Does ssh do something with encoding? I don't use xterm so I'll skip the rest. -- Peter Eisentraut http://developer.postgresql.org/~petere/ ---------------------------(end of broadcast)--------------------------- TIP 6: explain analyze is your friend |
| |||
| On Sat, Feb 18, 2006 at 05:20:19PM +0100, Peter Eisentraut wrote: > > 2) What client encoding is used if the client does not specify one? > > the server encoding What's the server encoding? The environment when the cluster is started? How do you find out what it's running as? Does that mean if the encoding is anything other than "C" then Postgresql will convert? That is, if my database is utf8 and the server is en_US then text will be sent to the client as 8859-1? Not, that's not correct as I'm not seeing that. So I guess I'm not clear on that point. > > 5) I suppose there's not way to answer this, short of running > > benchmarks, but any ideas what using a lc_collate with utf-8 would do > > to performance? Is it a big hit? > > I don't know why that would be a problem. Just that sorting utf8 is a bit more work that sorting raw bytes. Thanks for the help, -- Bill Moseley moseley@hank.org ---------------------------(end of broadcast)--------------------------- TIP 6: explain analyze is your friend |
| |||
| Bill Moseley wrote: > What's the server encoding? When you say "My database is in utf8", then "utf8" is the server encoding. > Does that mean if the encoding is anything other than "C" C is a locale, not an encoding. > Just that sorting utf8 is a bit more work that sorting raw bytes. Sorting in C locale is certainly faster, but for anything else, there won't be any noticeable difference I would think. -- Peter Eisentraut http://developer.postgresql.org/~petere/ ---------------------------(end of broadcast)--------------------------- TIP 6: explain analyze is your friend |
| |||
| Bill Moseley <moseley@hank.org> writes: > - To clarify the first point, if the database is encoded utf-8 and > lc_collate is en_US then Postgresql does NOT try to convert utf-8 to > 8859-1 before sorting. Basically, this is a horribly bad idea and you should never do it. The database encoding should always match what the locale assumes for its character set (unless the locale is "C", which doesn't care). We'd enforce that you never do it if we knew a portable way to determine the character set assumed by an LC_COLLATE setting. regards, tom lane ---------------------------(end of broadcast)--------------------------- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to majordomo@postgresql.org so that your message can get through to the mailing list cleanly |
| |||
| On Sat, Feb 18, 2006 at 01:40:09PM -0500, Tom Lane wrote: > Bill Moseley <moseley@hank.org> writes: > > - To clarify the first point, if the database is encoded utf-8 and > > lc_collate is en_US then Postgresql does NOT try to convert utf-8 to > > 8859-1 before sorting. > > Basically, this is a horribly bad idea and you should never do it. > The database encoding should always match what the locale assumes > for its character set (unless the locale is "C", which doesn't care). What's a bad idea? Having a lc_collate on the cluster that doesn't support the encodings in the databases? > We'd enforce that you never do it if we knew a portable way to determine > the character set assumed by an LC_COLLATE setting. Again, not sure what "it" is, but I do find it confusing when the cluster can have only one lc_collate, but the databases on that cluster can have more than one encoding. That's why I was asking how postgresql handles (possibly) different encodings. Are you saying that if a database is encoded as utf8 then the cluster should be initiated with something like en_US.utf8? And then all databaes on that cluster should be encoded the same? I suspect I don't understand how LC_COLLATE works that well. I thought the locale defines the order of the characters, but not the encoding of those characters. Maybe that's not correct. I assumed the same locale should sort the same chars represented in different encodings the same way. Maybe that's not the case: $ LC_ALL=en_US.UTF-8 locale charmap UTF-8 $ LC_ALL=en_US locale charmap ISO-8859-1 $ LC_ALL=C locale charmap ANSI_X3.4-1968 -- Bill Moseley moseley@hank.org ---------------------------(end of broadcast)--------------------------- TIP 5: don't forget to increase your free space map settings |
| |||
| Bill Moseley wrote: > What's a bad idea? Having a lc_collate on the cluster that doesn't > support the encodings in the databases? Exactly > Again, not sure what "it" is, but I do find it confusing when the > cluster can have only one lc_collate, but the databases on that > cluster can have more than one encoding. It is confusing, so don't do it. > That's why I was asking > how postgresql handles (possibly) different encodings. It doesn't. > Are you saying that if a database is encoded as utf8 then the cluster > should be initiated with something like en_US.utf8? And then all > databaes on that cluster should be encoded the same? Yes > I thought the locale defines the order of the characters, but not the > encoding of those characters. In theory, they are independent concepts. But in practice, the C library gets a bunch bytes from the application (in this case, the PostgreSQL server) and is asked to sort them. So it needs to know what these bytes are supposed to mean. By design of the POSIX locale facilities, the C library is told that by way of the locale. It would be much simpler for everyone if there was a function strcmp(string1, string2, collation, encoding), but there isn't. -- Peter Eisentraut http://developer.postgresql.org/~petere/ ---------------------------(end of broadcast)--------------------------- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to majordomo@postgresql.org so that your message can get through to the mailing list cleanly |
| |||
| Bill Moseley <moseley@hank.org> writes: > $ LC_ALL=en_US.UTF-8 locale charmap > UTF-8 > > $ LC_ALL=en_US locale charmap > ISO-8859-1 > > $ LC_ALL=C locale charmap > ANSI_X3.4-1968 Unfortunately Postgres only supports a single collation cluster-wide. So depending on which collation you use of the ones above you would really have to select either UTF-8 ISO-8859-1 or SQL_ASCII (ie ANSI_X3.4-1968). Anything else and the collation just won't work properly. It will be expecting UTF-8 and be fed ISO-8859-1 strings, resulting in weird and sometimes inconsistent sort orders. There's a certain amount of feeling that using any locale other than C is probably not ever the right thing given the current functionality. Just about any database has some strings in it that are really just ascii strings like char(1) primary keys and other internal database strings. You may not want them being subject to the locale's collation for comparison purposes and you may not want the overhead of variable width character encodings. Those of us in this camp are defining all our databases using C locale and then using the pg_strxfrm() function that's been floating around the list for a while to handle sorting strings that need to be sorted in various locales. This has performs acceptably (but not spectacularly) under glibc but it's not clear which other libc implementations it works well under. It also doesn't solve the whole problem since functions like substr() or LIKE are locale sensitive too. If you need an encoding like UTF-8 and you're stuck either pushing all your string manipulations into the client or going ahead with a non-C locale and UTF-8 even for the strings that are really just ascii strings. -- greg ---------------------------(end of broadcast)--------------------------- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match |
| |||
| On Sat, Feb 18, 2006 at 09:31:27PM -0500, Greg Stark wrote: > Anything else and the collation just won't work properly. It will be > expecting UTF-8 and be fed ISO-8859-1 strings, resulting in weird > and sometimes inconsistent sort orders. So if I have utf8 encoded text and the lc_collate is anything but utf8 then sorting will be all wrong for any chars that don't map to ASCII (>127). Kind of a mess. > There's a certain amount of feeling that using any locale other than C is > probably not ever the right thing given the current functionality. Just about > any database has some strings in it that are really just ascii strings like > char(1) primary keys and other internal database strings. You may not want > them being subject to the locale's collation for comparison purposes and you > may not want the overhead of variable width character encodings. Is the Holy Grail encoding and lc_collate settings per column? Changing topics, but I'm going to play with different cluster settings for collate. If I create a cluster in given directory is there any problems with moving that cluster (renaming the directory)? Thanks for your comments, Greg. -- Bill Moseley moseley@hank.org ---------------------------(end of broadcast)--------------------------- TIP 4: Have you searched our list archives? http://archives.postgresql.org |
| ||||
| On Sat, Feb 18, 2006 at 08:16:07PM -0800, Bill Moseley wrote: > Is the Holy Grail encoding and lc_collate settings per column? Well yes. I've been trying to create a system where you can handle multiple collations in the same database. I posted the details to -hackers and got part of the way, but it's a lot of work. As for encodings, to be honest, I'm not sure whether it's a great idea to support multiple encodings simultaneously. Things become a lot easier if you know everything is the same encoding. If you set the client_encoding automatically on startup it has pretty much the same effect as having the server always use that encoding. It's just a bit of time wasted in conversion, but the client doesn't need to care. By way of example, see ICU which is an internationalisation library we're considering to get consistant locale support over all platforms. It supports one encoding, namely UTF-16. It has various functions to convert other encodings to or from that, but internally it's all UTF-16. So if we do use that, then all encodings (except native UTF-16) will need to conversion all the time, so you don't buy anything by having the server in some random encoding. The problem ofcourse being that the SQL standard requires some encoding support. No-one has really come up with a proposal for that yet. IMHO, that's a parser issue more than anything else. Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a > tool for doing 5% of the work and then sitting around waiting for someone > else to do the other 95% so you can sue them. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.1 (GNU/Linux) iD8DBQFD+GhyIB7bNG8LQkwRAvfiAJ9eJ4n4cuMI4Hh5kH99qR lby3SZ/wCfVNe5 iLHToMvqjNemVaDCiifS1uk= =OEis -----END PGP SIGNATURE----- |