Unix Technical Forum

UTF8 on Debian

This is a discussion on UTF8 on Debian within the pgsql Hackers forums, part of the PostgreSQL category.


  #1 (permalink)  
Old 04-15-2008, 10:28 PM
Gregory Stark
 
Posts: n/a
Default UTF8 on Debian


Something very strange is going on on my machine with UTF8:

postgres=# show server_encoding;
server_encoding
-----------------
UTF8
(1 row)

postgres=# select length(convert_from(E'\343\203\251\343\202\244\343\202\273\343\203\263','utf8'));
length
--------
8
(1 row)

postgres=# select 'substring(s,'||i||',1)',convert_to(substring(s,i,1),'utf8') from (select convert_from(E'\343\203\251\343\202\244\343\202\273\343\203\263','utf8') as s)a, (select generate_series(1,8) as i)b;
?column? | convert_to
------------------+------------
substring(s,1,1) | \343
substring(s,2,1) | \203\251
substring(s,3,1) | \343
substring(s,4,1) | \202\244
substring(s,5,1) | \343
substring(s,6,1) | \202\273
substring(s,7,1) | \343
substring(s,8,1) | \203\263
(8 rows)

I believe this is in fact only four katakana characters. (Namely U+30E9 U+30A4
U+30BB U+30F3) \343 is merely the first byte of each three-byte encoding for
the individual characters.
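
As a rough illustration of the expected answer (not part of the original post), the character count of a valid UTF-8 string can be obtained by counting every byte that is not a continuation byte. The names `utf8_char_count` and `katakana` are made up for this sketch.

```c
#include <stddef.h>

/* Count UTF-8 characters by counting bytes that are NOT continuation
 * bytes (continuation bytes have the bit pattern 10xxxxxx).
 * Assumes the input is valid UTF-8. */
static size_t utf8_char_count(const unsigned char *s, size_t nbytes)
{
    size_t count = 0;
    for (size_t i = 0; i < nbytes; i++) {
        if ((s[i] & 0xC0) != 0x80)
            count++;
    }
    return count;
}

/* The twelve bytes from the query above, in the same octal notation. */
static const unsigned char katakana[] =
    "\343\203\251\343\202\244\343\202\273\343\203\263";
```

Calling `utf8_char_count(katakana, sizeof(katakana) - 1)` on these twelve bytes gives four characters, which is what length() should have reported.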

Dave doesn't see the same behaviour on his three machines, so I think it's
something unique to my machine. Possibly not a Postgres bug at all but some
kind of install gotcha.

I'm running Debian unstable with glibc 2.6.1-4 so it is a bit bleeding edge.
But as I understand it the utf8 decoding is all our code anyways so I can't
quite figure out how it could be glibc's fault.

Does anybody else see anything like this?

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com

---------------------------(end of broadcast)---------------------------
TIP 1: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly

  #2 (permalink)  
Old 04-15-2008, 10:28 PM
Gregory Stark
 
Posts: n/a
Default Re: UTF8 on Debian

"Gregory Stark" <stark@enterprisedb.com> writes:

> Something very strange is going on on my machine with UTF8:
>
> postgres=# show server_encoding;
> server_encoding
> -----------------
> UTF8
> (1 row)


Hm. Well this doesn't look right:

(gdb) p pg_wchar_table[PG_UTF8]
$9 = {mb2wchar_with_len = 0x8364cff <pg_mule2wchar_with_len>,
mblen = 0x8364eda <pg_mule_mblen>, dsplen = 0x8364f60 <pg_mule_dsplen>,
mbverify = 0x83655cd <pg_mule_verifier>, maxmblen = 4}

This looks like version skew around the file that Tom recently whacked around
to restore binary compatibility with 8.2. But I can't see how -- I don't have
any old header files in /usr/include or anything like that so I can't see how
I could have picked up an out-of-date header file.
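
A rough sketch of why a MULE-style length rule in the UTF8 slot would produce the eight pieces seen in the first post. The byte range used in `mule_like_mblen` is an assumption made for illustration, not a copy of wchar.c, and all function names here are hypothetical.

```c
#include <stddef.h>

/* Simplified stand-in for a MULE-internal-style length rule.
 * Assumption: lead bytes 0x81..0x8d start a 2-byte character and
 * every other byte is treated as a 1-byte character. */
static int mule_like_mblen(unsigned char c)
{
    return (c >= 0x81 && c <= 0x8d) ? 2 : 1;
}

/* UTF-8 length rule derived from the lead byte (assumes a valid lead byte). */
static int utf8_mblen(unsigned char c)
{
    if ((c & 0x80) == 0x00) return 1;
    if ((c & 0xE0) == 0xC0) return 2;
    if ((c & 0xF0) == 0xE0) return 3;
    return 4;
}

/* Walk the buffer with the given length rule and count the pieces. */
static size_t count_chars(const unsigned char *s, size_t nbytes,
                          int (*mblen_fn)(unsigned char))
{
    size_t count = 0;
    size_t i = 0;
    while (i < nbytes) {
        i += (size_t) mblen_fn(s[i]);
        count++;
    }
    return count;
}

/* Same twelve bytes as in the first post. */
static const unsigned char sample[] =
    "\343\203\251\343\202\244\343\202\273\343\203\263";
```

With `utf8_mblen` the walk yields four characters. With `mule_like_mblen` each \343 is taken as a single byte and each \203 or \202 starts a two-byte unit, giving eight pieces split the same way as the substring() output.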

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com

---------------------------(end of broadcast)---------------------------
TIP 2: Don't 'kill -9' the postmaster

  #3 (permalink)  
Old 04-15-2008, 10:28 PM
Gregory Stark
 
Posts: n/a
Default Re: UTF8 on Debian


The problem was simply that wchar.c has an array indexed by server encoding
number. This puts the order back in sync. I haven't looked at the
surrounding code enough to know if there aren't more problems like this
lurking or if there's a way to fix it so it isn't such a pain to maintain.

Index: src/backend/utils/mb/wchar.c
===================================================================
RCS file: /home/stark/src/REPOSITORY/pgsql/src/backend/utils/mb/wchar.c,v
retrieving revision 1.64
diff -u -r1.64 wchar.c
--- src/backend/utils/mb/wchar.c 18 Sep 2007 17:41:17 -0000 1.64
+++ src/backend/utils/mb/wchar.c 15 Oct 2007 21:53:06 -0000
@@ -1315,6 +1315,7 @@
{pg_euccn2wchar_with_len, pg_euccn_mblen, pg_euccn_dsplen, pg_euccn_verifier, 2}, /* 2; PG_EUC_CN */
{pg_euckr2wchar_with_len, pg_euckr_mblen, pg_euckr_dsplen, pg_euckr_verifier, 3}, /* 3; PG_EUC_KR */
{pg_euctw2wchar_with_len, pg_euctw_mblen, pg_euctw_dsplen, pg_euctw_verifier, 4}, /* 4; PG_EUC_TW */
+ {pg_eucjp2wchar_with_len, pg_eucjp_mblen, pg_eucjp_dsplen, pg_eucjp_verifier, 3}, /* 33; PG_EUC_JIS_2004 */
{pg_utf2wchar_with_len, pg_utf_mblen, pg_utf_dsplen, pg_utf8_verifier, 4}, /* 5; PG_UTF8 */
{pg_mule2wchar_with_len, pg_mule_mblen, pg_mule_dsplen, pg_mule_verifier, 4}, /* 6; PG_MULE_INTERNAL */
{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* 7; PG_LATIN1 */
@@ -1343,13 +1344,12 @@
{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* 30; PG_WIN1254 */
{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* 31; PG_WIN1255 */
{pg_latin12wchar_with_len, pg_latin1_mblen, pg_latin1_dsplen, pg_latin1_verifier, 1}, /* 32; PG_WIN1257 */
- {pg_eucjp2wchar_with_len, pg_eucjp_mblen, pg_eucjp_dsplen, pg_eucjp_verifier, 3}, /* 33; PG_EUC_JIS_2004 */
{0, pg_sjis_mblen, pg_sjis_dsplen, pg_sjis_verifier, 2}, /* 34; PG_SJIS */
{0, pg_big5_mblen, pg_big5_dsplen, pg_big5_verifier, 2}, /* 35; PG_BIG5 */
{0, pg_gbk_mblen, pg_gbk_dsplen, pg_gbk_verifier, 2}, /* 36; PG_GBK */
{0, pg_uhc_mblen, pg_uhc_dsplen, pg_uhc_verifier, 2}, /* 37; PG_UHC */
- {pg_johab2wchar_with_len, pg_johab_mblen, pg_johab_dsplen, pg_johab_verifier, 3}, /* 38; PG_JOHAB */
{0, pg_gb18030_mblen, pg_gb18030_dsplen, pg_gb18030_verifier, 4}, /* 39; PG_GB18030 */
+ {pg_johab2wchar_with_len, pg_johab_mblen, pg_johab_dsplen, pg_johab_verifier, 3}, /* 38; PG_JOHAB */
{0, pg_sjis_mblen, pg_sjis_dsplen, pg_sjis_verifier, 2} /* 40; PG_SHIFT_JIS_2004 */
};
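
On the question of making this less painful to maintain, one possible approach is to tie each row to its enum value with designated initializers, so that the order of rows in the source no longer matters. This is only a sketch: it assumes a C99 compiler, which may not have been an option for the codebase at the time, and the enum, struct, and table names are all hypothetical.

```c
/* Hypothetical, cut-down encoding list; the names are illustrative only. */
typedef enum {
    ENC_EUC_TW,
    ENC_EUC_JIS_2004,   /* newly inserted in the middle of the enum */
    ENC_UTF8,
    ENC_MULE_INTERNAL,
    ENC_COUNT           /* number of encodings, must stay last */
} demo_encoding;

typedef struct {
    const char *name;
    int maxmblen;
} demo_enc_info;

/* Each row names the enum value it belongs to, so rows can appear in any
 * order and inserting a new enum member cannot shift the others. */
static const demo_enc_info demo_table[ENC_COUNT] = {
    [ENC_UTF8]          = {"UTF8", 4},
    [ENC_EUC_TW]        = {"EUC_TW", 4},
    [ENC_MULE_INTERNAL] = {"MULE_INTERNAL", 4},
    [ENC_EUC_JIS_2004]  = {"EUC_JIS_2004", 3},
};
```

The trade-off is that a forgotten row is silently zero-filled rather than rejected, so some check for missing entries would still be needed.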


--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com

---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
choose an index scan if your joining column's datatypes do not
match

  #4 (permalink)  
Old 04-15-2008, 10:28 PM
Tom Lane
 
Posts: n/a
Default Re: UTF8 on Debian

Gregory Stark <stark@enterprisedb.com> writes:
> Something very strange is going on on my machine with UTF8:


Argh, I missed rearranging a couple of tables in the recent patch to fix
up the encoding IDs. Will fix.

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 5: don't forget to increase your free space map settings

  #5 (permalink)  
Old 04-15-2008, 10:28 PM
Tom Lane
 
Posts: n/a
Default Re: UTF8 on Debian

Gregory Stark <stark@enterprisedb.com> writes:
> ... This puts the order back in sync. I haven't looked at the
> surrounding code enough to know if there aren't more problems like this
> lurking or if there's a way to fix it so it isn't such a pain to maintain.


The real problem is I believed the comment in pg_wchar.h that told me
the only table that needed fixed was pg_enc2name :-(

A check of Tatsuo's last couple of encoding-related commits confirms
that there are only two order-sensitive tables, though, so I think
we're good now. Also this time I troubled to run the src/test/mb/
tests, which I really should have tried earlier. Ah well.
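
A cheap sanity check for an order-sensitive table could look something like the sketch below: each row records the slot it is meant to occupy, and a loop reports the first row whose stored id disagrees with its position. This is not taken from the PostgreSQL sources; the types, tables, and the `first_misplaced` function are hypothetical.

```c
#include <stddef.h>

typedef enum { ENC_A, ENC_B, ENC_C, ENC_COUNT } demo_encoding;

typedef struct {
    demo_encoding id;   /* the slot this row is supposed to occupy */
    const char *name;
} demo_row;

/* Return the index of the first misplaced row, or -1 if all rows
 * sit at the index given by their own id. */
static int first_misplaced(const demo_row *table, size_t nrows)
{
    for (size_t i = 0; i < nrows; i++) {
        if ((size_t) table[i].id != i)
            return (int) i;
    }
    return -1;
}

/* A table in enum order, and one with two rows swapped. */
static const demo_row good_table[ENC_COUNT] = {
    {ENC_A, "a"}, {ENC_B, "b"}, {ENC_C, "c"}
};
static const demo_row skewed_table[ENC_COUNT] = {
    {ENC_A, "a"}, {ENC_C, "c"}, {ENC_B, "b"}
};
```

Run once at startup or in a regression test, `first_misplaced` returns -1 for `good_table` and 1 for `skewed_table`, pointing at the first row that is out of step with the enum.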

regards, tom lane


  #6 (permalink)  
Old 04-15-2008, 10:28 PM
Gregory Stark
 
Posts: n/a
Default Re: UTF8 on Debian

"Tom Lane" <tgl@sss.pgh.pa.us> writes:

> Gregory Stark <stark@enterprisedb.com> writes:
>> Something very strange is going on on my machine with UTF8:

>
> Argh, I missed rearranging a couple of tables in the recent patch to fix
> up the encoding IDs. Will fix.


Hm, now things work correctly within the server if I play with text data
generated with convert_from(). But I can't seem to send utf-8 text directly in
psql any more:

postgres=# select 'ライセン';
?column?
----------
'
(1 row)

postgres=# show client_encoding;
client_encoding
-----------------
UTF8
(1 row)

postgres=# show server_encoding;
server_encoding
-----------------
UTF8
(1 row)

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com

---------------------------(end of broadcast)---------------------------
TIP 3: Have you checked our extensive FAQ?

http://www.postgresql.org/docs/faq

  #7 (permalink)  
Old 04-15-2008, 10:28 PM
Gregory Stark
 
Posts: n/a
Default Re: UTF8 on Debian


"Gregory Stark" <stark@enterprisedb.com> writes:

> Hm, now things work correctly within the server if I play with text data
> generated with convert_from(). But I can't seem to send utf-8 text directly in
> psql any more:


Ah, figured it out, nothing to do with the server. I didn't have the locale
set to a UTF8 locale before starting up psql. I'm a bit puzzled about how the
locale psql is started in factors into this, though. It seems to confuse
psql's lexer at the least.

But it goes to all this trouble to track client_encoding, shouldn't it be
using that? Or at least warning you if the encoding it's run under doesn't
match it?
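
The kind of warning suggested here could be sketched roughly as follows. This is not how psql behaves; it is a hypothetical check that compares the codeset of the environment's locale with a client encoding name. `setlocale` and `nl_langinfo(CODESET)` are standard POSIX calls, while `same_encoding_name` and `check_locale_matches` are names invented for this example, and the name comparison is deliberately naive.

```c
#include <ctype.h>
#include <langinfo.h>
#include <locale.h>
#include <stdio.h>

/* Compare two encoding names, ignoring case and '-' / '_' separators,
 * so that "UTF-8" and "utf8" are treated as the same encoding. */
static int same_encoding_name(const char *a, const char *b)
{
    for (;;) {
        while (*a == '-' || *a == '_') a++;
        while (*b == '-' || *b == '_') b++;
        if (tolower((unsigned char) *a) != tolower((unsigned char) *b))
            return 0;
        if (*a == '\0')
            return 1;
        a++;
        b++;
    }
}

/* Hypothetical helper: warn on stderr if the locale's codeset differs
 * from the client encoding name. Returns 1 on a match, 0 otherwise. */
static int check_locale_matches(const char *client_encoding)
{
    setlocale(LC_CTYPE, "");                 /* adopt the environment's locale */
    const char *codeset = nl_langinfo(CODESET);
    if (same_encoding_name(codeset, client_encoding))
        return 1;
    fprintf(stderr, "warning: locale codeset \"%s\" does not match "
                    "client_encoding \"%s\"\n", codeset, client_encoding);
    return 0;
}
```

A real implementation would need a proper mapping between locale codeset names and PostgreSQL encoding names, since the two do not always differ only in punctuation and case.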


--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com

---------------------------(end of broadcast)---------------------------
TIP 4: Have you searched our list archives?

http://archives.postgresql.org

  #8 (permalink)  
Old 04-15-2008, 10:28 PM
Tom Lane
 
Posts: n/a
Default Re: UTF8 on Debian

Gregory Stark <stark@enterprisedb.com> writes:
> Hm, now things work correctly within the server if I play with text data
> generated with convert_from(). But I can't seem to send utf-8 text directly in
> psql any more:


Did you rebuild libpq.so? It's got the bogus table embedded in it too.

regards, tom lane


All times are GMT.