Unix Technical Forum

Re: Unicode problems on IRC

This is a discussion on Re: Unicode problems on IRC within the pgsql Hackers forums, part of the PostgreSQL category.

#1
04-11-2008, 03:22 AM
John Hansen

>On 2005-04-10, Tom Lane <tgl ( at ) sss ( dot ) pgh ( dot ) pa ( dot ) us> wrote:
>> Andrew - Supernews <andrew+nonews ( at ) supernews ( dot ) com> writes:
>>> I think you will find that this impression is actually false. Or that at
>>> the very least, _correct_ verification of UTF-8 sequences will still
>>> catch essentially all cases of non-utf-8 input mislabelled as utf-8
>>> while allowing the full range of Unicode codepoints.
>>
>> Yeah? Cool. Does John's proposed patch do it "correctly"?
>>
>> http://candle.pha.pa.us/mhonarc/patches2/msg00076.html
>
>It looks correct to me. The only thing I think that code will let through
>incorrectly are encoded surrogates; those could be fixed by adding one line:
>
> switch (*source) {
> /* no fall-through in this inner switch */
> case 0xE0: if (a < 0xA0) return false; break;
>+ case 0xED: if (a > 0x9F) return false; break;
> case 0xF0: if (a < 0x90) return false; break;
> case 0xF4: if (a > 0x8F) return false; break;
>


That's right, I don't know how I missed that one, but it looks correct to
me, and it is in line with the code in ConvertUTF.c from unicode.org, on
which I based the patch (extended to support 6-byte UTF-8 characters).
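
The second-byte check discussed above can be sketched as a standalone C function. This is only an illustration under stated assumptions, not the patch itself: the names `second_byte_ok` and `check_examples` are hypothetical, the lead byte is assumed to have been classified already, and only sequences of up to 4 bytes are considered.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper: given the lead byte of a multi-byte UTF-8 sequence
 * and its second byte `a`, report whether `a` is acceptable. The lead byte
 * itself is NOT validated here; that is assumed to happen elsewhere. */
bool second_byte_ok(unsigned char lead, unsigned char a)
{
    /* Every continuation byte must lie in 0x80..0xBF. */
    if (a < 0x80 || a > 0xBF)
        return false;

    switch (lead) {
        case 0xE0: return a >= 0xA0; /* below this is an overlong 3-byte form */
        case 0xED: return a <= 0x9F; /* above this encodes U+D800..U+DFFF */
        case 0xF0: return a >= 0x90; /* below this is an overlong 4-byte form */
        case 0xF4: return a <= 0x8F; /* above this exceeds U+10FFFF */
        default:   return true;
    }
}

/* A few boundary cases, written as assertions. */
void check_examples(void)
{
    assert(second_byte_ok(0xED, 0x9F));   /* ED 9F BF is U+D7FF, allowed */
    assert(!second_byte_ok(0xED, 0xA0));  /* ED A0 80 would be U+D800 */
    assert(!second_byte_ok(0xE0, 0x9F));  /* overlong */
    assert(second_byte_ok(0xF4, 0x8F));   /* F4 8F BF BF is U+10FFFF */
    assert(!second_byte_ok(0xF4, 0x90));  /* beyond U+10FFFF */
}
```

The 0xED case is the one-line addition from the quoted diff; the other three cases mirror the lines that were already there.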

>(Accepting encoded surrogates in utf-8 was always forbidden by most
>specifications that used utf-8, though the Unicode specs originally were
>not absolute about it (but forbade generating them). Current Unicode
>specifications define those sequences as malformed. Surrogates are the
>code points from 0xD800 - 0xDFFF, which are used in UTF-16 to encode
>characters 0x10000 - 0x10FFFF as two 16-bit values; UTF-8 requires that
>such characters are encoded directly rather than via surrogate pairs.)
>
>--
>Andrew, Supernews
>http://www.supernews.com - individual and corporate NNTP services
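
To make the surrogate explanation quoted above concrete, here is a small C sketch of the arithmetic. It assumes nothing beyond the ranges already given in the text; the function names (`is_surrogate`, `to_surrogate_pair`, `to_utf8_4`) are hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Code points reserved for UTF-16 surrogates; never valid on their own. */
bool is_surrogate(uint32_t cp)
{
    return cp >= 0xD800 && cp <= 0xDFFF;
}

/* UTF-16: split a code point in 0x10000..0x10FFFF into two 16-bit values. */
void to_surrogate_pair(uint32_t cp, uint16_t *hi, uint16_t *lo)
{
    assert(cp >= 0x10000 && cp <= 0x10FFFF);
    uint32_t v = cp - 0x10000;               /* 20 significant bits remain */
    *hi = (uint16_t)(0xD800 + (v >> 10));    /* top 10 bits */
    *lo = (uint16_t)(0xDC00 + (v & 0x3FF));  /* bottom 10 bits */
}

/* UTF-8: encode the same code point directly as one 4-byte sequence,
 * with no surrogates involved. */
void to_utf8_4(uint32_t cp, unsigned char out[4])
{
    assert(cp >= 0x10000 && cp <= 0x10FFFF);
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
}
```

For U+10000, for example, `to_surrogate_pair` yields 0xD800 and 0xDC00, while `to_utf8_4` yields the bytes F0 90 80 80. Encoding 0xD800 and 0xDC00 separately as two 3-byte UTF-8 sequences is the malformed case the validator needs to reject.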


.... John

#2
04-11-2008, 03:23 AM
Andrew - Supernews

On 2005-04-10, "John Hansen" <john@geeknet.com.au> wrote:
> That's right, I don't know how I missed that one, but it looks correct to
> me, and it is in line with the code in ConvertUTF.c from unicode.org, on
> which I based the patch (extended to support 6-byte UTF-8 characters).


Frankly, you should probably de-extend it back down to 4 bytes. That's
enough to encode the Unicode range of 0x000000 - 0x10FFFF, and enough
other stuff would break if anyone allocated a character outside that
range that I don't think it is worth worrying about. (Even the ISO
people have agreed to conform to that limitation.) Even if insanity
struck simultaneously at both standards bodies, 4 bytes is enough to
go to 0x1FFFFF so there is still substantial slack. (A number of other
specifications based on utf-8 have removed the 5 and 6 byte sequences
too, so there is substantial precedent for this.)
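
The 0x1FFFFF figure follows from counting payload bits. In the original 1 to 6 byte scheme, the lead byte of an n-byte sequence (n >= 2) carries 7 - n payload bits and each continuation byte carries 6. A rough C sketch of that arithmetic, with hypothetical function names:

```c
#include <assert.h>
#include <stdint.h>

/* Number of payload bits in an n-byte sequence of the original
 * 1..6 byte UTF-8 scheme. */
unsigned payload_bits(int n)
{
    assert(n >= 1 && n <= 6);
    if (n == 1)
        return 7;                           /* 0xxxxxxx */
    return (unsigned)((7 - n) + 6 * (n - 1));
}

/* Largest value representable in an n-byte sequence. */
uint32_t max_code_point(int n)
{
    return (UINT32_C(1) << payload_bits(n)) - 1;
}
```

With n = 4 this gives 3 + 18 = 21 bits, so `max_code_point(4)` is 0x1FFFFF, comfortably above the 0x10FFFF Unicode limit.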

--
Andrew, Supernews
http://www.supernews.com - individual and corporate NNTP services