Unix Technical Forum

EOL characters and multibyte encodings

This is a discussion on EOL characters and multibyte encodings within the pgsql Hackers forums, part of the PostgreSQL category.


#1
04-12-2008, 10:12 AM
Joe Conway

I was finally able to get PL/R to compile and run on Windows recently. This has
led to people using a Windows-based client (typically pgAdmin III) to
create PL/R functions. Immediately I started to receive reports of
failures that turned out to be due to the carriage return (\r) used in
standard Win32 EOLs (\r\n). It seems that the R parser only accepts
newlines (\n), even on Win32 (confirmed on the r-devel list with a core
developer).

My first thought on fixing this issue was to simply replace all
instances of '\r' in pg_proc.prosrc with '\n' prior to sending it to the
R parser. As far as I know, any instances of '\r' embedded in a
syntactically valid R statement must be escaped (i.e. literally the
characters "\" and "r"), so that should not be a problem. But I am
concerned about how this potentially plays against multibyte characters.
Is it safe to do this, or do I need to use a mb-aware replace algorithm?

Thanks,

Joe



#2
04-12-2008, 10:12 AM
Andrew Dunstan



Joe Conway wrote:
> I was finally able to get PL/R to compile and run on Windows recently. This
> has led to people using a Windows-based client (typically pgAdmin
> III) to create PL/R functions. Immediately I started to receive
> reports of failures that turned out to be due to the carriage return
> (\r) used in standard Win32 EOLs (\r\n). It seems that the R parser
> only accepts newlines (\n), even on Win32 (confirmed on r-devel list
> with a core developer).
>
> My first thought on fixing this issue was to simply replace all
> instances of '\r' in pg_proc.prosrc with '\n' prior to sending it to
> the R parser. As far as I know, any instances of '\r' embedded in a
> syntactically valid R statement must be escaped (i.e. literally the
> characters "\" and "r"), so that should not be a problem. But I am
> concerned about how this potentially plays against multibyte
> characters. Is it safe to do this, or do I need to use a mb-aware
> replace algorithm?
>
>


Didn't we just settle that all the server-side encodings have to be
ASCII supersets? In which case, just removing the CRs should be quite safe.

cheers

andrew

#3
04-12-2008, 10:12 AM
Tom Lane

Joe Conway <mail@joeconway.com> writes:
> I was finally able to get PL/R to compile and run on Windows recently. This has
> led to people using a Windows-based client (typically pgAdmin III) to
> create PL/R functions. Immediately I started to receive reports of
> failures that turned out to be due to the carriage return (\r) used in
> standard Win32 EOLs (\r\n). It seems that the R parser only accepts
> newlines (\n), even on Win32 (confirmed on r-devel list with a core
> developer).


> My first thought on fixing this issue was to simply replace all
> instances of '\r' in pg_proc.prosrc with '\n' prior to sending it to the
> R parser. As far as I know, any instances of '\r' embedded in a
> syntactically valid R statement must be escaped (i.e. literally the
> characters "\" and "r"), so that should not be a problem. But I am
> concerned about how this potentially plays against multibyte characters.
> Is it safe to do this, or do I need to use a mb-aware replace algorithm?


It's safe, because you'll be dealing with prosrc inside the backend,
therefore using a backend-legal encoding, and those don't have any ASCII
aliasing problems (all bytes of an MB character must have high bit set).

However I dislike doing it exactly that way because line numbers in the
R script will all get doubled. Unless R never reports errors in terms
of line numbers, you'd be better off to either delete the \r characters
or replace them with spaces.

regards, tom lane
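
To illustrate the line-number concern raised above, here is a minimal Python sketch (not code from PL/R, whose handler is written in C; the function names are made up for this example). It compares turning every \r into \n with the two alternatives suggested: deleting the \r or replacing it with a space.

```python
def cr_to_lf(src: str) -> str:
    """Replace every CR with LF: each CRLF pair becomes two line breaks."""
    return src.replace("\r", "\n")


def drop_cr(src: str) -> str:
    """Delete every CR: each CRLF pair collapses to a single LF."""
    return src.replace("\r", "")


def cr_to_space(src: str) -> str:
    """Replace every CR with a space: line count and string length are kept."""
    return src.replace("\r", " ")


# A two-line R function body as a Windows client might send it.
windows_src = "x <- 1\r\ny <- x + 1\r\n"

# The original has 2 line breaks; cr_to_lf yields 4, so an error on the
# second statement would be reported on line 3 instead of line 2.
doubled = cr_to_lf(windows_src).count("\n")
kept = drop_cr(windows_src).count("\n")
```

Note that neither drop_cr nor cr_to_space copes with a source that uses a lone \r as its only line ending, since both would leave the whole script on one line.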

#4
04-12-2008, 10:12 AM
Joe Conway

Tom Lane wrote:
> Joe Conway <mail@joeconway.com> writes:
>> My first thought on fixing this issue was to simply replace all
>> instances of '\r' in pg_proc.prosrc with '\n' prior to sending it to the
>> R parser. As far as I know, any instances of '\r' embedded in a
>> syntactically valid R statement must be escaped (i.e. literally the
>> characters "\" and "r"), so that should not be a problem. But I am
>> concerned about how this potentially plays against multibyte characters.
>> Is it safe to do this, or do I need to use a mb-aware replace algorithm?

>
> It's safe, because you'll be dealing with prosrc inside the backend,
> therefore using a backend-legal encoding, and those don't have any ASCII
> aliasing problems (all bytes of an MB character must have high bit set).


Great -- I wasn't sure about that.

> However I dislike doing it exactly that way because line numbers in the
> R script will all get doubled. Unless R never reports errors in terms
> of line numbers, you'd be better off to either delete the \r characters
> or replace them with spaces.


Good point. But I need to be able to deal with Apple EOLs too -- IIRC
those can be *only* '\r'. So I guess I need to do a look-ahead whenever
I run into '\r', see if it is followed by '\n', and then munge the
string accordingly.

Joe
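
The look-ahead described above could be sketched roughly as follows. This is an illustrative Python version under stated assumptions, not the actual PL/R change: the name normalize_eol is hypothetical, and it works on raw bytes to mirror what a handler would see in prosrc. A \r followed by \n collapses to one \n, and a lone \r becomes \n, so the number of lines is unchanged for all three EOL conventions.

```python
CR = 0x0D
LF = 0x0A


def normalize_eol(src: bytes) -> bytes:
    """Convert CRLF and lone CR line endings to LF, one LF per line ending."""
    out = bytearray()
    i = 0
    n = len(src)
    while i < n:
        b = src[i]
        if b == CR:
            # Emit a single LF for this line ending.
            out.append(LF)
            # Look ahead: if the CR is followed by LF, the pair is one
            # Windows-style line ending, so skip the LF as well.
            if i + 1 < n and src[i + 1] == LF:
                i += 1
        else:
            out.append(b)
        i += 1
    return bytes(out)


mixed = b"a <- 1\r\nb <- 2\rc <- 3\n"
normalized = normalize_eol(mixed)
```

A byte-wise scan like this relies on the point made earlier in the thread: in a backend encoding, 0x0D and 0x0A never occur inside a multibyte character.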

#5
04-12-2008, 10:13 AM
William ZHANG


"Joe Conway" <mail@joeconway.com> wrote:
> Tom Lane wrote:
>> Joe Conway <mail@joeconway.com> writes:
>>> My first thought on fixing this issue was to simply replace all
>>> instances of '\r' in pg_proc.prosrc with '\n' prior to sending it to the
>>> R parser. As far as I know, any instances of '\r' embedded in a
>>> syntactically valid R statement must be escaped (i.e. literally the
>>> characters "\" and "r"), so that should not be a problem. But I am
>>> concerned about how this potentially plays against multibyte characters.
>>> Is it safe to do this, or do I need to use a mb-aware replace algorithm?

>>
>> It's safe, because you'll be dealing with prosrc inside the backend,
>> therefore using a backend-legal encoding, and those don't have any ASCII
>> aliasing problems (all bytes of an MB character must have high bit set).


The lower byte of some characters in BIG5, GBK, and GB18030 may be less than
0x7F and so does not have the high bit set. Fortunately, these encodings
don't use 0x0D or 0x0A (CR and LF) for it.

Regards,
William ZHANG

> Great -- I wasn't sure about that.
>




#6
04-12-2008, 10:13 AM
Andrew Dunstan



William ZHANG wrote:
>>>
>>> It's safe, because you'll be dealing with prosrc inside the backend,
>>> therefore using a backend-legal encoding, and those don't have any ASCII
>>> aliasing problems (all bytes of an MB character must have high bit set).
>>>

>
> The lower byte of some characters in BIG5, GBK, and GB18030 may be less than
> 0x7F and so does not have the high bit set. Fortunately, these encodings
> don't use 0x0D or 0x0A (CR and LF) for it.
>
>
>


Those are client-only encodings, precisely for this sort of reason, and
thus not relevant to the present discussion. As Tom points out above,
when the language handler gets the code it will be encoded in the
relevant backend encoding which can't be any of these.

(Side note: the restriction by the R parser to Unix-only line endings is
a dreadful piece of design. As Jon Postel rightly said, the best rule is
"Be liberal in what you accept and conservative in what you send." Just
about every parser for every language has been able to handle this, so
why must R be different?)

cheers

andrew
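
The byte-level claims in this exchange can be checked with a short Python sketch. It uses Python's own codecs (big5, gbk, gb18030, utf-8, euc_jp) as stand-ins for the PostgreSQL encodings of the same names, which is an assumption, and the helper names are invented for this example. The scan is limited to the CJK Unified Ideographs block (U+4E00 to U+9FFF), so it is a spot check rather than a proof.

```python
def low_bytes(text: str, encoding: str) -> bytes:
    """Return the bytes below 0x80 in the encoded form of text."""
    return bytes(b for b in text.encode(encoding) if b < 0x80)


def crlf_inside_multibyte(encoding: str, first: int = 0x4E00,
                          last: int = 0x9FFF) -> bool:
    """True if any multibyte character in the range contains 0x0D or 0x0A."""
    for cp in range(first, last + 1):
        try:
            encoded = chr(cp).encode(encoding)
        except UnicodeEncodeError:
            continue  # character not representable in this encoding
        if len(encoded) > 1 and (0x0D in encoded or 0x0A in encoded):
            return True
    return False


# A character whose BIG5 trailing byte falls in the ASCII range.
sample = "\u529f"

big5_low = low_bytes(sample, "big5")   # non-empty: ASCII aliasing is possible
utf8_low = low_bytes(sample, "utf-8")  # empty: every byte has the high bit set
```

With these helpers, crlf_inside_multibyte("big5"), crlf_inside_multibyte("gbk"), and crlf_inside_multibyte("gb18030") can be used to confirm, for the scanned range, that CR and LF never appear inside a multibyte character even in the client-only encodings.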
