String encoding during connection "handshake" (pgsql Hackers forums, PostgreSQL category)
On Wednesday 28 November 2007, Trevor Talbot wrote:
> I'm not entirely sure how that's supposed to solve the client
> authentication issue though. Demanding that clients present auth data
> in UTF-8 is no different than demanding they present it in the
> encoding it was entered in originally...

Oh no, it's a big difference: PREDICTABILITY!
Why must I guess the encoding used by the administrator? What if he's Chinese?
Instead, I know the cluster's encoding, just as I know the server name and the TCP port. And the connection handshake carries on without misunderstandings (read: wrong encoding).

---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend
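To make the authentication concern concrete, here is a minimal Python sketch of why the encoding matters when a password is hashed. It does not reproduce PostgreSQL's actual handshake; the password value and the `md5_hex` helper are made up for illustration. The only point is that MD5 works on bytes, so the same text serialized under two encodings gives two different digests.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Hex MD5 digest of raw bytes. MD5 never sees 'characters', only bytes."""
    return hashlib.md5(data).hexdigest()


# Hypothetical password containing one non-ASCII character (a-umlaut).
password = "p\u00e4ssword"

# The same text serialized under two different encodings.
as_utf8 = password.encode("utf-8")      # a-umlaut becomes two bytes: C3 A4
as_latin1 = password.encode("latin-1")  # a-umlaut becomes one byte: E4

digest_utf8 = md5_hex(as_utf8)
digest_latin1 = md5_hex(as_latin1)

# Different bytes in, different digest out, even though both sides
# "typed the same password".
print(len(as_utf8), len(as_latin1))     # 9 8
print(digest_utf8 == digest_latin1)     # False
```

With an ASCII-only password both encodings produce identical bytes, which is presumably why the problem only shows up for some users.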
| |||
On Wednesday 28 November 2007, Alvaro Herrera wrote:
> sulfinu@gmail.com wrote:
> > Martijn,
> >
> > understood well the problem. After all, it's just that C has a very
> > outdated notion of "char"s (and no notion of Unicode). I was naively
> > under the impression that "char"s have evolved in nowadays C.
>
> This is not the language's fault in any way. We support plenty of
> encodings beyond UTF-8.

Yes, you support (and worry about) encodings simply because of a C limitation dating from 1974, if I recall correctly...
In Java, for example, a "char" is a very well defined datum, namely a Unicode point. While in C it can be some char or another (or an error!) depending on what encoding was used. The only definition that stands up is that a "char" is a byte. Its interpretation is unsure and unsafe (see my original problem).

On Wednesday 28 November 2007, Martijn van Oosterhout wrote:
> On Wed, Nov 28, 2007 at 05:54:05PM +0200, sulfinu@gmail.com wrote:
> > Regarding the problem of "One True Encoding", the answer seems obvious to
> > me: use only one encoding per database cluster, either UTF-8 or UTF-16 or
> > another Unicode-aware scheme, whichever yields a statistically smaller
> > database for the languages employed by the users in their data. This
> > encoding should be a one time choice! De facto, this is already happening
> > now, because one cannot change collation rules after a cluster has been
> > created.
>
> Umm, each database in a cluster can have a different encoding, so there
> is no such thing as the "cluster's encoding".

I implied that a cluster should have a single encoding that covers the whole Unicode set. That would certainly satisfy everybody.

> You can certainly argue
> that it should be a one time choice, but I doubt you'll get people to
> remove the possibilities we have now. In fact, if anything we'd probably
> go the other way, allow you to select the collation on a per
> database/table/column level (SQL compliance requires this).

The collation order is implemented in close relationship with the byte representation of strings, but conceptually depends on the locale solely and has nothing to do with the encoding.

> This has nothing to do with C by the way. C has many features that
> allow you to work with different encodings. It just doesn't force you
> to use any particular one.

Yes, my point exactly! C forces you to worry about encoding. I mean, if you're not an ASCII-only user.
Think of it this way: if I give you a Java String you will perfectly know what I meant; if I send you a C char* you don't know what it is in the absence of extra information - you can even use it as a uint8*, as it is actually done in md5.c.

I consider this matter closed from my point of view and I have modified the JDBC driver according to my needs.
Thank you all for the help.
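The "char* without extra information" point can be sketched in a few lines of Python, where a `bytes` object plays the role of the C buffer. The two-byte value below is an arbitrary example chosen for illustration; nothing here comes from the PostgreSQL sources.

```python
# Two raw bytes, as they might sit behind a C char*.
raw = bytes([0xC3, 0xA9])

# Read as UTF-8, the pair is a single character: e-acute (U+00E9).
as_utf8 = raw.decode("utf-8")

# Read as Latin-1, the same pair is two characters: U+00C3 followed by U+00A9.
as_latin1 = raw.decode("latin-1")

# Read as plain unsigned integers (the uint8* view), no encoding is
# involved at all.
as_ints = list(raw)

print(len(as_utf8), len(as_latin1))  # 1 2
print(as_ints)                       # [195, 169]
```

All three readings are valid; which one is "right" depends on information that travels outside the buffer.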
| |||
On 11/28/07, sulfinu@gmail.com <sulfinu@gmail.com> wrote:
> On Wednesday 28 November 2007, Trevor Talbot wrote:
> > I'm not entirely sure how that's supposed to solve the client
> > authentication issue though. Demanding that clients present auth data
> > in UTF-8 is no different than demanding they present it in the
> > encoding it was entered in originally...
> Oh no, it's a big difference: PREDICTABILITY!
> Why must I guess the encoding used by the administrator? What if he's Chinese?
> Instead, I know the cluster's encoding, just as I know the server name and
> the TCP port. And the connection handshake carries on without
> misunderstandings (read wrong encoding).

What if the user and client program are Chinese too? Not everything is developed in an environment where UTF-8 support is easily available. Either way, it is a demand on the client, and not necessarily a simple one.
| |||
<sulfinu@gmail.com> writes:
> Yes, you support (and worry about) encodings simply because of a C limitation
> dating from 1974, if I recall correctly...
> In Java, for example, a "char" is a very well defined datum, namely a Unicode
> point. While in C it can be some char or another (or an error!) depending on
> what encoding was used.

No, you're being confused by C's idiosyncratic terminology. "char" in C just means 1-byte integral data type. If you want to store a Unicode code point you use a different data type.

Incidentally, I'm not sure, but I don't think it's true that "char" in Java stores a Unicode code point. I thought Java used UTF-16 internally for strings, and strings stored arrays of chars. In which case "char" in Java stores two bytes of a UTF-16 encoded string, which is pretty analogous to storing UTF-8 encoded strings in C, where each "char" stores one byte of a UTF-8 encoded string.

> Think of it this way: if I give you a Java String you will perfectly know what
> I meant; if I send you a C char* you don't know what it is in the absence of
> extra information - you can even use it as a uint8*, as it is actually done
> in md5.c.

That's because you're comparing apples to oranges. In C you don't even know if a char* is a string at all. It's a pointer to some bytes and those could contain anything.

And think about what happens in Java if you have to deal with UTF-8 encoded strings or Big5 encoded strings. They aren't "strings" in the Java object hierarchy, so when someone passes you a "MyString" you have the same problem of needing to know what encoding was used. Presumably you would put that in a member variable of the MyString class, but that just goes to how the data structures in C are laid out and what you're considering "extra information".

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's Slony Replication support!
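The code unit versus code point distinction can be checked with a short Python sketch. The helper names are made up for this illustration. As far as I know, the UTF-16 code unit count is what Java's `String.length()` reports, and the UTF-8 byte count is what `strlen()` would see on a UTF-8 buffer in C.

```python
def code_points(text: str) -> int:
    """Number of Unicode code points (Python 3 strings index by code point)."""
    return len(text)


def utf16_code_units(text: str) -> int:
    """Number of 16-bit units needed to store the text as UTF-16."""
    return len(text.encode("utf-16-le")) // 2


def utf8_bytes(text: str) -> int:
    """Number of bytes needed to store the text as UTF-8."""
    return len(text.encode("utf-8"))


samples = {
    "ASCII letter a": "a",
    "e-acute (U+00E9)": "\u00e9",
    "musical G clef (U+1D11E)": "\U0001d11e",  # outside the BMP
}

for label, text in samples.items():
    print(label, code_points(text), utf16_code_units(text), utf8_bytes(text))
# ASCII letter a 1 1 1
# e-acute (U+00E9) 1 1 2
# musical G clef (U+1D11E) 1 2 4
```

The last row is the surrogate pair case: one code point, but two 16-bit units, so a 16-bit "char" cannot hold it on its own.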
| |||
On 11/28/07, sulfinu@gmail.com <sulfinu@gmail.com> wrote:
> Yes, you support (and worry about) encodings simply because of a C limitation
> dating from 1974, if I recall correctly...
> In Java, for example, a "char" is a very well defined datum, namely a Unicode
> point. While in C it can be some char or another (or an error!) depending on
> what encoding was used. The only definition that stands up is that a "char"
> is a byte. Its interpretation is unsure and unsafe (see my original problem).

It's not really that simple. Java, for instance, does not actually support Unicode characters / codepoints at the base level; it merely deals in UTF-16 code units. (The critical difference is in surrogate pairs.) You're still stuck dealing with a specific encoding even in many modern languages.

PostgreSQL's encoding support is not just about languages though, it's also about client convenience. It could simply choose a single encoding and parrot data to and from the client, but it also does on-the-fly conversion when a client requests it. It's a very useful feature, and many mature networked applications support similar things. An easy example is the World Wide Web itself.

> I implied that a cluster should have a single encoding that covers the whole
> Unicode set. That would certainly satisfy everybody.

Note that it might not. Unicode does not encode *every* character, and in some cases there is no round-trip mapping between it and other character sets. The result could be a loss of semantic data.

I suspect it actually would satisfy everyone in PostgreSQL's case, but it's not something you can assume without checking.

> > This has nothing to do with C by the way. C has many features that
> > allow you to work with different encodings. It just doesn't force you
> > to use any particular one.
> Yes, my point exactly! C forces you to worry about encoding. I mean, if you're
> not an ASCII-only user

For a networked application, you're stuck worrying about the encoding regardless of language. UTF-8 is the most common Internet transport, for instance, but that's not the native internal encoding used by Java and most other Unicode processing platforms to date. That's fairly simple since it's still only a single character set, but if your application domain predates Unicode, you can't avoid dealing with the legacy encodings at some level anyway.

As I implied earlier, I do think it would be worthwhile for PostgreSQL to move toward handling it better, so I'm not saying this is a bad idea. It's just that it's a much more complex topic than it might seem at first glance.

I'm glad you got something working for you.
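As a rough illustration of on-the-fly conversion, the following Python sketch transcodes stored bytes into an encoding a client asked for. This is not PostgreSQL's conversion code; the `transcode` helper and the sample strings are assumptions made for the example. It also shows the case where the conversion cannot succeed because the target character set lacks the character.

```python
def transcode(data: bytes, source: str, target: str) -> bytes:
    """Decode bytes from the source encoding and re-encode them in the target."""
    return data.decode(source).encode(target)


# Text stored as UTF-8, requested by a Latin-1 client: every character
# exists in both sets, so the conversion succeeds.
stored = "na\u00efve".encode("utf-8")
for_latin1_client = transcode(stored, "utf-8", "latin-1")
print(for_latin1_client)  # b'na\xefve'

# Text with a CJK character has no Latin-1 representation, so the
# conversion fails instead of silently dropping data.
cjk = "\u4e2d".encode("utf-8")
try:
    transcode(cjk, "utf-8", "latin-1")
    conversion_failed = False
except UnicodeEncodeError:
    conversion_failed = True
print(conversion_failed)  # True
```

Whether a failed conversion should raise an error or substitute a placeholder is a policy decision; this sketch simply lets the error surface.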
| |||
On Wed, 28 Nov 2007, sulfinu@gmail.com wrote:
> I consider this matter closed from my point of view and I have modified the
> JDBC driver according to my needs.

Could you explain in more detail what you've done to the JDBC driver in case it is generally useful or other people have the same problem?

Kris Jurka
| ||||
On Wed, 2007-11-28 at 09:38 -0800, Trevor Talbot wrote:
> PostgreSQL's problem is that it (and AFAICT POSIX) conflates encoding
> with locale, when the two are entirely separate concepts.

In what way does PostgreSQL conflate encoding with locale?

-Neil