Unix Technical Forum

String encoding during connection "handshake"

This is a discussion on String encoding during connection "handshake" within the pgsql Hackers forums, part of the PostgreSQL category.


  #11 (permalink)  
Old 04-15-2008, 10:35 PM
sulfinu@gmail.com
 
Re: String encoding during connection "handshake"

On Wednesday 28 November 2007, Trevor Talbot wrote:
> I'm not entirely sure how that's supposed to solve the client
> authentication issue though. Demanding that clients present auth data
> in UTF-8 is no different than demanding they present it in the
> encoding it was entered in originally...

Oh no, it's a big difference: PREDICTABILITY!
Why must I guess the encoding used by the administrator? What if he's Chinese?
Instead, I know the cluster's encoding, just as I know the server name and
the TCP port. And the connection handshake carries on without
misunderstandings (read: wrong encoding).



---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend

  #12 (permalink)  
Old 04-15-2008, 10:35 PM
sulfinu@gmail.com
 

On Wednesday 28 November 2007, Alvaro Herrera wrote:
> sulfinu@gmail.com wrote:
> > Martijn,
> >
> > don't take it personally, I am just trying to obtain confirmation that I
> > understood the problem well. After all, it's just that C has a very
> > outdated notion of "char"s (and no notion of Unicode). I was naively
> > under the impression that "char"s have evolved in present-day C.
>
> This is not the language's fault in any way. We support plenty of
> encodings beyond UTF-8.

Yes, you support (and worry about) encodings simply because of a C limitation
dating from 1974, if I recall correctly...
In Java, for example, a "char" is a very well defined datum, namely a Unicode
point. While in C it can be some char or another (or an error!) depending on
what encoding was used. The only definition that stands up is that a "char"
is a byte. Its interpretation is unsure and unsafe (see my original problem).

On Wednesday 28 November 2007, Martijn van Oosterhout wrote:
> On Wed, Nov 28, 2007 at 05:54:05PM +0200, sulfinu@gmail.com wrote:
> > Regarding the problem of "One True Encoding", the answer seems obvious to
> > me: use only one encoding per database cluster, either UTF-8 or UTF-16 or
> > another Unicode-aware scheme, whichever yields a statistically smaller
> > database for the languages employed by the users in their data. This
> > encoding should be a one time choice! De facto, this is already happening
> > now, because one cannot change collation rules after a cluster has been
> > created.
>
> Umm, each database in a cluster can have a different encoding, so there
> is no such thing as the "cluster's encoding".

I implied that a cluster should have a single encoding that covers the whole
Unicode set. That would certainly satisfy everybody.

> You can certainly argue
> that it should be a one time choice, but I doubt you'll get people to
> remove the possibilities we have now. In fact, if anything we'd probably
> go the other way and allow you to select the collation on a per
> database/table/column level (SQL compliance requires this).

The collation order is implemented in close relationship with the byte
representation of strings, but conceptually it depends solely on the locale
and has nothing to do with the encoding.
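
A minimal sketch of the distinction, assuming nothing about PostgreSQL's actual collation code: it sorts the same two strings by their raw bytes under two encodings. The two sample characters are chosen for illustration only. Byte-wise order changes with the encoding, which is why a meaningful collation has to come from locale rules rather than from the byte representation.

```python
# Two sample strings (illustrative choice): U+FF5E and U+1F600.
low = "\uFF5E"        # inside the Basic Multilingual Plane
high = "\U0001F600"   # outside the BMP, needs a surrogate pair in UTF-16

words = [high, low]

# Python compares str values by code point.
by_code_point = sorted(words)

# Byte-wise comparison of the UTF-8 form keeps code point order.
by_utf8_bytes = sorted(words, key=lambda s: s.encode("utf-8"))

# Byte-wise comparison of the UTF-16 (big-endian) form does not:
# the surrogate lead byte 0xD8 sorts before 0xFF.
by_utf16_bytes = sorted(words, key=lambda s: s.encode("utf-16-be"))

print(by_code_point == by_utf8_bytes)
print(by_code_point == by_utf16_bytes)
```

Neither byte order is a collation in the linguistic sense; both are artifacts of the encoding.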

> This has nothing to do with C by the way. C has many features that
> allow you to work with different encodings. It just doesn't force you
> to use any particular one.

Yes, my point exactly! C forces you to worry about encoding. I mean, if you're
not an ASCII-only user.

Think of it this way: if I give you a Java String you will know perfectly well
what I meant; if I send you a C char* you don't know what it is in the absence
of extra information - you can even use it as a uint8*, as it is actually done
in md5.c.
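
A minimal sketch of why the byte interpretation matters for a hashed handshake. It assumes a response of the form "md5" + md5(md5(password + user) + salt); the helper names, the sample credentials and the salt are hypothetical. The hash only ever sees bytes, so the same password typed on two clients with different encodings produces two different responses.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Hex digest of raw bytes; the hash has no notion of characters."""
    return hashlib.md5(data).hexdigest()


def md5_auth_response(password: str, user: str, salt: bytes, encoding: str) -> str:
    """Hypothetical helper: "md5" + md5(md5(password + user) + salt).

    `encoding` is the encoding the client happens to use when turning
    the typed password into bytes.
    """
    inner = md5_hex((password + user).encode(encoding))
    return "md5" + md5_hex(inner.encode("ascii") + salt)


salt = b"\x01\x02\x03\x04"  # illustrative salt, not a real server value

# Non-ASCII password: the bytes differ per encoding, so the responses differ.
utf8_response = md5_auth_response("p\u00e4ssword", "alice", salt, "utf-8")
latin1_response = md5_auth_response("p\u00e4ssword", "alice", salt, "latin-1")
print(utf8_response == latin1_response)

# ASCII-only password: both encodings yield the same bytes and the same response.
ascii_utf8 = md5_auth_response("password", "alice", salt, "utf-8")
ascii_latin1 = md5_auth_response("password", "alice", salt, "latin-1")
print(ascii_utf8 == ascii_latin1)
```

This is the ASCII-only caveat in practice: the mismatch appears only once the password contains characters outside ASCII.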

I consider this matter closed from my point of view and I have modified the
JDBC driver according to my needs.
Thank you all for the help.

  #13 (permalink)  
Old 04-15-2008, 10:35 PM
Trevor Talbot
 

On 11/28/07, sulfinu@gmail.com <sulfinu@gmail.com> wrote:
> On Wednesday 28 November 2007, Trevor Talbot wrote:
> > I'm not entirely sure how that's supposed to solve the client
> > authentication issue though. Demanding that clients present auth data
> > in UTF-8 is no different than demanding they present it in the
> > encoding it was entered in originally...
>
> Oh no, it's a big difference: PREDICTABILITY!
> Why must I guess the encoding used by the administrator? What if he's Chinese?
> Instead, I know the cluster's encoding, just as I know the server name and
> the TCP port. And the connection handshake carries on without
> misunderstandings (read: wrong encoding).

What if the user and client program are Chinese too? Not everything is
developed in an environment where UTF-8 support is easily available.
Either way, it is a demand on the client, and not necessarily a simple
one.

  #14 (permalink)  
Old 04-15-2008, 10:35 PM
Gregory Stark
 

<sulfinu@gmail.com> writes:

> Yes, you support (and worry about) encodings simply because of a C limitation
> dating from 1974, if I recall correctly...
> In Java, for example, a "char" is a very well defined datum, namely a Unicode
> point. While in C it can be some char or another (or an error!) depending on
> what encoding was used.


No, you're being confused by C's idiosyncratic terminology. "char" in C just
means a 1-byte integral data type. If you want to store a Unicode code point
you use a different data type.

Incidentally, I'm not sure, but I don't think it's true that "char" in Java
stores a Unicode code point. I thought Java used UTF-16 internally for strings
and strings stored arrays of chars. In which case "char" in Java stores two
bytes of a UTF-16 encoded string, which is pretty analogous to storing UTF-8
encoded strings in C, where each "char" stores one byte of a UTF-8 encoded
string.
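
The analogy can be sketched by counting units, here in Python rather than Java or C. The function name is hypothetical. A UTF-16 code unit plays the role of a Java "char" and a UTF-8 byte the role of a C "char"; neither count is the number of code points once the text leaves the simplest ranges.

```python
def unit_counts(text: str) -> dict:
    """Count code points, UTF-16 code units and UTF-8 bytes for `text`."""
    return {
        "code_points": len(text),                               # Python str is a sequence of code points
        "utf16_code_units": len(text.encode("utf-16-le")) // 2,  # what a Java "char" array would hold
        "utf8_bytes": len(text.encode("utf-8")),                 # what a C char array would hold
    }


for sample in ["A", "\u00e9", "\U0001D11E"]:  # ASCII letter, e-acute, musical G clef
    print(repr(sample), unit_counts(sample))
```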

> Think of it this way: if I give you a Java String you will know perfectly well
> what I meant; if I send you a C char* you don't know what it is in the absence
> of extra information - you can even use it as a uint8*, as it is actually done
> in md5.c.


That's because you're comparing apples to oranges. In C you don't even know if
a char* is a string at all. It's a pointer to some bytes and those could
contain anything.

And think about what happens in Java if you have to deal with UTF-8 encoded
strings or Big5 encoded strings. They aren't "strings" in the Java object
hierarchy, so when someone passes you a "MyString" you have the same problem
of needing to know what encoding was used. Presumably you would put that in a
member variable of the MyString class, but that just goes to how the data
structures in C are laid out and what you're considering "extra information".

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's Slony Replication support!

  #15 (permalink)  
Old 04-15-2008, 10:35 PM
Trevor Talbot
 

On 11/28/07, sulfinu@gmail.com <sulfinu@gmail.com> wrote:

> Yes, you support (and worry about) encodings simply because of a C limitation
> dating from 1974, if I recall correctly...
> In Java, for example, a "char" is a very well defined datum, namely a Unicode
> point. While in C it can be some char or another (or an error!) depending on
> what encoding was used. The only definition that stands up is that a "char"
> is a byte. Its interpretation is unsure and unsafe (see my original problem).


It's not really that simple. Java, for instance, does not actually
support Unicode characters / code points at the base level; it merely
deals in UTF-16 code units. (The critical difference is in surrogate
pairs.) You're still stuck dealing with a specific encoding even in
many modern languages.
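
A short sketch of the surrogate-pair arithmetic, written in Python for illustration; the function name is hypothetical. A code point above U+FFFF does not fit in one 16-bit unit, so UTF-16 splits it into a high and a low surrogate.

```python
from typing import Tuple


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a supplementary code point into its UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = code_point - 0x10000          # 20 significant bits remain
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


# Cross-check against Python's own UTF-16 encoder for U+1D11E.
high, low = to_surrogate_pair(0x1D11E)
encoded = "\U0001D11E".encode("utf-16-be")
print(hex(high), hex(low), encoded.hex())
```

Code that indexes a string by 16-bit unit can therefore land in the middle of a character, which is the same hazard as indexing UTF-8 by byte.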

PostgreSQL's encoding support is not just about languages though, it's
also about client convenience. It could simply choose a single
encoding and parrot data to and from the client, but it also does
on-the-fly conversion when a client requests it. It's a very useful
feature, and many mature networked applications support similar
things. An easy example is the World Wide Web itself.
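
A rough sketch of what on-the-fly conversion amounts to, assuming the server stores UTF-8 and the client asks for another encoding (in PostgreSQL the client states this through the client_encoding setting). The function name is hypothetical and Python codec names stand in for the server's own encoding names.

```python
def transcode(payload: bytes, server_encoding: str, client_encoding: str) -> bytes:
    """Re-encode stored bytes for a client that requested another encoding.

    Raises UnicodeEncodeError when the client encoding cannot represent
    a character that is present in the data.
    """
    return payload.decode(server_encoding).encode(client_encoding)


stored = "caf\u00e9".encode("utf-8")                 # bytes as the server holds them
for_latin1_client = transcode(stored, "utf-8", "latin-1")
print(stored.hex(), for_latin1_client.hex())

# Conversion is not always possible: Latin-1 has no CJK characters.
try:
    transcode("\u65e5\u672c".encode("utf-8"), "utf-8", "latin-1")
    conversion_failed = False
except UnicodeEncodeError:
    conversion_failed = True
print(conversion_failed)
```

The failure case is the cost of the convenience: the conversion can only succeed when the requested encoding covers every character in the result.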

> I implied that a cluster should have a single encoding that covers the whole
> Unicode set. That would certainly satisfy everybody.


Note that it might not. Unicode does not encode *every* character, and
in some cases there is no round-trip mapping between it and other
character sets. The result could be a loss of semantic data. I suspect
it actually would satisfy everyone in PostgreSQL's case, but it's not
something you can assume without checking.

> > This has nothing to do with C by the way. C has many features that
> > allow you to work with different encodings. It just doesn't force you
> > to use any particular one.


> Yes, my point exactly! C forces you to worry about encoding. I mean, if you're
> not an ASCII-only user.


For a networked application, you're stuck worrying about the encoding
regardless of language. UTF-8 is the most common Internet transport,
for instance, but that's not the native internal encoding used by Java
and most other Unicode processing platforms to date. That's fairly
simple since it's still only a single character set, but if your
application domain predates Unicode, you can't avoid dealing with the
legacy encodings at some level anyway.


As I implied earlier, I do think it would be worthwhile for PostgreSQL
to move toward handling it better, so I'm not saying this is a bad
idea. It's just that it's a much more complex topic than it might seem
at first glance.

I'm glad you got something working for you.

  #16 (permalink)  
Old 04-15-2008, 10:35 PM
Kris Jurka
 



On Wed, 28 Nov 2007, sulfinu@gmail.com wrote:

> I consider this matter closed from my point of view and I have modified the
> JDBC driver according to my needs.


Could you explain in more detail what you've done to the JDBC driver in
case it is generally useful or other people have the same problem?

Kris Jurka

  #17 (permalink)  
Old 04-15-2008, 10:35 PM
Neil Conway
 

On Wed, 2007-11-28 at 09:38 -0800, Trevor Talbot wrote:
> PostgreSQL's problem is that it (and AFAICT POSIX) conflates encoding
> with locale, when the two are entirely separate concepts.


In what way does PostgreSQL conflate encoding with locale?

-Neil


