Unix Technical Forum

\x96 in column value?

This is a discussion on \x96 in column value? within the MySQL General forum forums, part of the MySQL category; --> Why is the character \x96 not allowed as a value in an INSERT query? I've SET NAMES utf8, the ...


Go Back   Unix Technical Forum > Database Server Software > MySQL > MySQL General forum

Register FAQ Members List Calendar Search Today's Posts Mark Forums Read
  #1 (permalink)  
Old 04-17-2008, 04:09 PM
Jerry Schwartz
 
Posts: n/a
Default \x96 in column value?

Why is the character \x96 not allowed as a value in an INSERT query? I've SET
NAMES utf8, the table collation is utf8_general_ci, the default character set
for the table is utf8. I get this error message from my application:

ERROR: INSERT INTO prod
(prod_title,prod_published,prod_pub_prod_id,prod_c ontent_info,prod_samp_doc,prod_toc_doc,prod_type,p rod_ready,prod_discont,prod_info_type,prod_id,prod _num,pub_id,prod_added,prod_updated,prod_export,pr od_changed)
VALUES ('Disposable Paper Products ? Uruguay','2007-12-01','DPP2007UY','Pages:
44','\\\\xa20\\d\\publishers\\eo\\fullreports\\DPP 07_UY.pdf','\\\\xa20\\d\\publishers\\eo\\content\\ DPP07_UY-LOC.doc','S',1,0,'E','Y8F1kqasoQVE90F',65642,'BG3g 33kv5CDkim7',NOW(),NOW(),1,1);
Incorrect string value: '\x96 Urug...' for column 'prod_title' at row 1

That bizarre character in the middle of 'Disposable Paper Products ? Uruguay'
is, as you can see, \x96. MySQL certainly doesn't seem to like it. The data is
being read from a file. Curiously, if I copy the line directly from the input
file it looks like this:

Disposable Paper Products - Uruguay

You may not be able to see it, but that is actually an n-dash (\x96).

Regards,

Jerry Schwartz
The Infoshop by Global Information Incorporated
195 Farmington Ave.
Farmington, CT 06032

860.674.8796 / FAX: 860.674.8341

www.the-infoshop.com
www.giiexpress.com
www.etudes-marche.com




Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #2 (permalink)  
Old 04-17-2008, 04:09 PM
Jerry Schwartz
 
Posts: n/a
Default RE: \x96 in column value?

I'm running afoul of the UTF8 character set somehow:

mysql> select convert(char(0x96) using utf8);
+----------------------------------+
| convert(char(0x96) using utf8) |
+----------------------------------+
| NULL |
+----------------------------------+
1 row in set, 1 warning (0.00 sec)

mysql> show warnings;
+-------+------+-------------------------------------+
| Level | Code | Message |
+-------+------+-------------------------------------+
| Error | 1300 | Invalid utf8 character string: '96' |
+-------+------+-------------------------------------+
1 row in set (0.00 sec)

On top of my other problems, I've discovered that pasting the UTF8 character
represented by 0x96 into the MySQL CLI (Windows) somehow converts the
character to 0x2D (a normal dash); so a lot of my testing has been wasted.
Pasting it into a Windows-based editor preserves the character as 0x96.


Regards,

Jerry Schwartz
The Infoshop by Global Information Incorporated
195 Farmington Ave.
Farmington, CT 06032

860.674.8796 / FAX: 860.674.8341

www.the-infoshop.com
www.giiexpress.com
www.etudes-marche.com

>-----Original Message-----
>From: Jerry Schwartz [mailto:jschwartz@the-infoshop.com]
>Sent: Wednesday, April 16, 2008 12:49 PM
>To: 'Mysql'
>Subject: \x96 in column value?
>
>Why is the character \x96 not allowed as a value in an INSERT query?
>I've SET
>NAMES utf8, the table collation is utf8_general_ci, the default
>character set
>for the table is utf8. I get this error message from my application:
>
>ERROR: INSERT INTO prod
>(prod_title,prod_published,prod_pub_prod_id,prod_ content_info,prod_samp_
>doc,prod_toc_doc,prod_type,prod_ready,prod_discon t,prod_info_type,prod_i
>d,prod_num,pub_id,prod_added,prod_updated,prod_ex port,prod_changed)
>VALUES ('Disposable Paper Products ? Uruguay','2007-12-
>01','DPP2007UY','Pages:
>44','\\\\xa20\\d\\publishers\\eo\\fullreports\\DP P07_UY.pdf','\\\\xa20\\
>d\\publishers\\eo\\content\\DPP07_UY-
>LOC.doc','S',1,0,'E','Y8F1kqasoQVE90F',65642,'BG3 g33kv5CDkim7',NOW(),NOW
>(),1,1);
>Incorrect string value: '\x96 Urug...' for column 'prod_title' at row 1
>
>That bizarre character in the middle of 'Disposable Paper Products ?
>Uruguay'
>is, as you can see, \x96. MySQL certainly doesn't seem to like it. The
>data is
>being read from a file. Curiously, if I copy the line directly from the
>input
>file it looks like this:
>
>Disposable Paper Products - Uruguay
>
>You may not be able to see it, but that is actually an n-dash (\x96).
>
>Regards,
>
>Jerry Schwartz
>The Infoshop by Global Information Incorporated
>195 Farmington Ave.
>Farmington, CT 06032
>
>860.674.8796 / FAX: 860.674.8341
>
>www.the-infoshop.com
>www.giiexpress.com
>www.etudes-marche.com
>
>
>
>
>
>--
>MySQL General Mailing List
>For list archives: http://lists.mysql.com/mysql
>To unsubscribe: http://lists.mysql.com/mysql?unsub=jschwartz@the-
>infoshop.com





Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #3 (permalink)  
Old 04-17-2008, 04:09 PM
Tim McDaniel
 
Posts: n/a
Default RE: \x96 in column value?

On Wed, 16 Apr 2008, Jerry Schwartz <jschwartz@the-infoshop.com> wrote:
> I'm running afoul of the UTF8 character set somehow:
>
> mysql> select convert(char(0x96) using utf8);
> +----------------------------------+
> | convert(char(0x96) using utf8) |
> +----------------------------------+
> | NULL |
> +----------------------------------+
> 1 row in set, 1 warning (0.00 sec)
>
> mysql> show warnings;
> +-------+------+-------------------------------------+
> | Level | Code | Message |
> +-------+------+-------------------------------------+
> | Error | 1300 | Invalid utf8 character string: '96' |
> +-------+------+-------------------------------------+
> 1 row in set (0.00 sec)
>
> On top of my other problems, I've discovered that pasting the UTF8
> character represented by 0x96 into the MySQL CLI (Windows) somehow
> converts the character to 0x2D (a normal dash); so a lot of my
> testing has been wasted. Pasting it into a Windows-based editor
> preserves the character as 0x96.


In an earlier note, he wrote
> You may not be able to see it, but that is actually an n-dash
> (\x96).


Actually, \x96 is not an en-dash.
<http://www.unicode.org/charts/PDF/U0080.pdf> says that it's
"START OF GUARDED AREA". x96 is in the middle of a block of control
characters from the unnamed control character at \x80 through
APPLICATION PROGRAM COMMAND at \x9F (or arguably NO-BREAK SPACE at
\xa0).

Microsoft, in some of their Windows code pages, assigned meanings to
those values that differ from the Unicode and ISO-8859-1 standards
(quelle suprise), assigning many of them uses as printable characters.
I think it's the Windows 1250 code page, at
<http://www.microsoft.com/globaldev/reference/sbcs/1250.mspx>.
As that page and
<http://www.microsoft.com/typography/developers/fdsspec/punc2.htm>
note, the Unicode standard value for an en-dash is U+2013 (which
appears to be in hex).

As to whether this affects the problem I don't know. Since x96 is a
valid character, whether Microsoft or real Unicode, I would not expect
it to be a problem per se. I just wanted to point out what it might
not mean.

--
Tim McDaniel, tmcd@panix.com
Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #4 (permalink)  
Old 04-17-2008, 04:09 PM
Jerry Schwartz
 
Posts: n/a
Default RE: \x96 in column value?

>-----Original Message-----
>From: Tim McDaniel [mailto:tmcd@panix.com]
>Sent: Wednesday, April 16, 2008 2:32 PM
>Cc: 'Mysql'
>Subject: RE: \x96 in column value?
>
>On Wed, 16 Apr 2008, Jerry Schwartz <jschwartz@the-infoshop.com> wrote:
>> I'm running afoul of the UTF8 character set somehow:
>>
>> mysql> select convert(char(0x96) using utf8);
>> +----------------------------------+
>> | convert(char(0x96) using utf8) |
>> +----------------------------------+
>> | NULL |
>> +----------------------------------+
>> 1 row in set, 1 warning (0.00 sec)
>>
>> mysql> show warnings;
>> +-------+------+-------------------------------------+
>> | Level | Code | Message |
>> +-------+------+-------------------------------------+
>> | Error | 1300 | Invalid utf8 character string: '96' |
>> +-------+------+-------------------------------------+
>> 1 row in set (0.00 sec)
>>
>> On top of my other problems, I've discovered that pasting the UTF8
>> character represented by 0x96 into the MySQL CLI (Windows) somehow
>> converts the character to 0x2D (a normal dash); so a lot of my
>> testing has been wasted. Pasting it into a Windows-based editor
>> preserves the character as 0x96.

>
>In an earlier note, he wrote
>> You may not be able to see it, but that is actually an n-dash
>> (\x96).

>
>Actually, \x96 is not an en-dash.
><http://www.unicode.org/charts/PDF/U0080.pdf> says that it's
>"START OF GUARDED AREA". x96 is in the middle of a block of control
>characters from the unnamed control character at \x80 through
>APPLICATION PROGRAM COMMAND at \x9F (or arguably NO-BREAK SPACE at
>\xa0).


[JS] Right you are. This whole business gives me an extreme headache. When
working in PHP, I assume my Windows-generated input is cp1252 and I convert
that to UTF-8. Aside from that, we always work in UTF-8 (database and web)
because I have to handle Chinese. (I have no idea if I'm doing that right, I
can't read the results. ;<)

In Microsoft's code page 1252, 0x96 is indeed an n-dash. I think this might
be my clue.

Although our web pages specify UTF-8, I found an article in MSDN that seems
to say that IE interprets UTF-8 pages using a code page in the cp1200
"family", whatever that means. That must be why our data looks correct going
end-to-end.

I also found http://effbot.org/zone/unicode-gremlins.htm, which gives a bit
of Python code to translate some cp1252 bits to their Unicode equivalents.
It also give you a nice list of the problem characters. There are also
examples in the PHP documentation of the iconv() function, but there is also
a comment that 0x96 breaks iconv.

I need to chew on this some more. PHP doesn't really handle multi-byte
characters until 6.x.

>
>Microsoft, in some of their Windows code pages, assigned meanings to
>those values that differ from the Unicode and ISO-8859-1 standards
>(quelle suprise), assigning many of them uses as printable characters.
>I think it's the Windows 1250 code page, at
><http://www.microsoft.com/globaldev/reference/sbcs/1250.mspx>.
>As that page and
><http://www.microsoft.com/typography/developers/fdsspec/punc2.htm>
>note, the Unicode standard value for an en-dash is U+2013 (which
>appears to be in hex).
>
>As to whether this affects the problem I don't know. Since x96 is a
>valid character, whether Microsoft or real Unicode, I would not expect
>it to be a problem per se. I just wanted to point out what it might
>not mean.
>
>--
>Tim McDaniel, tmcd@panix.com
>
>--
>MySQL General Mailing List
>For list archives: http://lists.mysql.com/mysql
>To unsubscribe: http://lists.mysql.com/mysql?unsub=jschwartz@the-
>infoshop.com





Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
Reply


Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are On
Forum Jump


All times are GMT. The time now is 03:07 AM.


Powered by vBulletin® Version 3.6.5
Copyright ©2000 - 2008, Jelsoft Enterprises Ltd.
SEO by vBSEO 3.2.0
www.UnixAdminTalk.com