Unix Technical Forum

#1
02-28-2008, 10:32 AM
firewoodtim@yahoo.com
mixed encodings - how to manage

I have a table in MySQL that is presently entirely encoded with latin1
charset. Several of the varchar fields are used as CSS styling values
and, as such, do not need to be encoded with utf8. However, two of
the fields become visible content in the HTML page, and I want to
change the encoding for those two fields to utf8.

What kind of header() and meta tag specifiers should I use in a mixed
encoding situation like that? If I specify "UTF-8" will the browser
be able to automatically distinguish the situations that require
reading one byte (latin1) from those requiring more bytes (utf8)?
#2
02-28-2008, 10:32 AM
firewoodtim@yahoo.com
Re: mixed encodings - how to manage

On Sun, 20 Jan 2008 20:23:01 +0100, "J.O. Aho" <user@example.net>
wrote:

>firewoodtim@yahoo.com wrote:
>
>> This is what I thought was likely, and it certainly makes sense.
>> However, what threw me was concern about how a browser set to read
>> an html file using utf8 would be able to recognize a 1 byte character,

>
>1-byte characters have an ASCII value of 127 or less.
>
>> when it was also expecting 2 or 3 or even 4 byte characters as well.

>
>Those start with a byte value of 128 or higher.


So just to be sure, let me see if I understand this correctly.

I have a PHP script running that takes latin1 data from a variety of
MySQL columns and uses them as CSS values in "style" attributes. These
do not have to be converted to utf8 encoding, because latin1 values
are a subset of utf8 and any conversion would result in the same 1
byte characters that I started with anyway.

In the case of the utf8 fields that will be displayed visibly in the
browser, not used as html styling values, these will be correctly
interpreted by the browser, since if there are any non-latin1
characters, they will be recognized by the browser as utf8 from their
encoding values (128 or higher) and displayed correctly.

This means that if I specify utf8 using the header() function and
again in the meta tags, and then use latin1 and utf8 characters in the
script to form the styling and visible content respectively,
everything should turn out OK.

Am I right?
#3
02-28-2008, 10:32 AM
Kees Nuyt
Re: mixed encodings - how to manage

On Sun, 20 Jan 2008 20:10:28 +0100, "J.O. Aho"
<user@example.net> wrote:

>Kees Nuyt wrote:
>> On Sun, 20 Jan 2008 13:12:32 -0500, firewoodtim@yahoo.com
>> wrote:
>>


[snip]

>> Using UTF-8 won't hurt.
>> For the ASCII part (0-9, A-Z, a-z, almost all punctuation)
>> the bytes are exactly the same.
>> Only special characters need a 2 or sometimes 3 byte
>> sequence.

>
>The characters with ASCII value 127 and lower are the same, everything else
>is different.
>
>
>> Mixed encoding within a web page isn't really possible.
>> The whole page uses whatever is declared in the encoding
>> header, the DOCUMENT TYPE or in the meta header.
>> You can't switch to another encoding halfway.

>
>If the data in the database is mixed, then the charset has to be unified
>before it is injected into the "page", or else characters may not be displayed
>correctly. Trying to show iso-8859 characters as UTF-8 will result in quite a
>few question marks, and the other way around will result in strange characters.


I agree.
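
To make the two failure modes concrete, here is a minimal sketch. It is written in Python purely for illustration (the thread itself is about PHP), and the sample word is made up. It encodes the same text both ways and then decodes each result with the wrong charset:

```python
# Illustration only: the sample text is an assumption, not data from the thread.
text = "café"

latin1_bytes = text.encode("latin-1")   # 'é' is the single byte 0xE9
utf8_bytes = text.encode("utf-8")       # 'é' is the two bytes 0xC3 0xA9

# latin1 bytes read as UTF-8: 0xE9 announces a multi-byte sequence that never
# arrives, so the decoder substitutes U+FFFD (browsers often draw it as "?").
latin1_as_utf8 = latin1_bytes.decode("utf-8", errors="replace")

# UTF-8 bytes read as latin1: every byte becomes a character of its own.
utf8_as_latin1 = utf8_bytes.decode("latin-1")

print(repr(latin1_as_utf8))
print(repr(utf8_as_latin1))
```

The first case loses information (the replacement character cannot be turned back into 'é'), while the second is merely displayed wrongly.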
--
( Kees
)
c[_] There is only one boss, the customer. And he can fire
everybody in the company from the chairman on down,
simply by spending his money somewhere else. (Sam Walton) (#35)
#4
02-28-2008, 10:32 AM
Kees Nuyt
Re: mixed encodings - how to manage

On Sun, 20 Jan 2008 14:08:23 -0500, firewoodtim@yahoo.com
wrote:

>On Sun, 20 Jan 2008 19:51:45 +0100, Kees Nuyt <k.nuyt@nospam.demon.nl>
>wrote:
>
>>On Sun, 20 Jan 2008 13:12:32 -0500, firewoodtim@yahoo.com
>>wrote:
>>
>>>I have a table in MySQL that is presently entirely encoded with latin1
>>>charset. Several of the varchar fields are used as CSS styling values
>>>and, as such, do not need to be encoded with utf8. However, two of
>>>the fields become visible content in the HTML page, and I want to
>>>change the encoding for those two fields to utf8.
>>>
>>>What kind of header() and meta tag specifiers should I use in a mixed
>>>encoding situation like that? If I specify "UTF-8" will the browser
>>>be able to automatically distinguish the situations that require
>>>reading one byte (latin1) from those requiring more bytes (utf8)?

>>
>>Using UTF-8 won't hurt.
>>For the ASCII part (0-9, A-Z, a-z, almost all punctuation)
>>the bytes are exactly the same.
>>Only special characters need a 2 or sometimes 3 byte
>>sequence.
>>
>>Mixed encoding within a web page isn't really possible.
>>The whole page uses whatever is declared in the encoding
>>header, the DOCUMENT TYPE or in the meta header.
>>You can't switch to another encoding halfway.
>>
>>Again, the single byte non-diacritic characters are mostly
>>the same between ISO-8859-1 and UTF-8, only the encoding
>>of special ones is different.
>>You have to choose one and stick to it. If ISO-8859-1
>>misses a few symbols you need, you'd better use UTF-8
>>everywhere. That way you cover many many more symbols than
>>in any 1 byte encoding.
>>
>>(Anyone please correct me if I'm wrong).

>
>This is what I thought was likely, and it certainly makes sense.
>However, what threw me was concern about how a browser set to read
>an html file using utf8 would be able to recognize a 1 byte character,
>when it was also expecting 2 or 3 or even 4 byte characters as well.
>Maybe I don't have to worry about this and can proceed on faith alone,
>but it would be both interesting and reassuring to know the answer.
>Does anyone know the way this is accomplished?


Here is how UTF-8 works:
http://en.wikipedia.org/wiki/UTF-8#Description
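
In short, the first byte of every UTF-8 sequence says how many bytes belong to the character, so a decoder never has to guess. The following is a rough sketch of that rule in Python (chosen only for illustration; the function names are hypothetical and the input is assumed to be valid UTF-8):

```python
from typing import List


def utf8_sequence_length(lead_byte: int) -> int:
    """Return how many bytes the UTF-8 sequence starting with lead_byte occupies."""
    if lead_byte < 0x80:            # 0xxxxxxx: plain ASCII, one byte
        return 1
    if 0xC2 <= lead_byte <= 0xDF:   # 110xxxxx: two-byte sequence
        return 2
    if 0xE0 <= lead_byte <= 0xEF:   # 1110xxxx: three-byte sequence
        return 3
    if 0xF0 <= lead_byte <= 0xF4:   # 11110xxx: four-byte sequence
        return 4
    # 10xxxxxx is a continuation byte; the rest are never valid lead bytes.
    raise ValueError(f"not a valid UTF-8 lead byte: {lead_byte:#04x}")


def split_utf8(data: bytes) -> List[bytes]:
    """Split valid UTF-8 data into one bytes object per character."""
    chunks = []
    i = 0
    while i < len(data):
        n = utf8_sequence_length(data[i])
        chunks.append(data[i:i + n])
        i += n
    return chunks


sample = "a\u00e9\u20ac".encode("utf-8")   # 'a', 'é', '€'
print(split_utf8(sample))
```

Bytes below 128 always stand alone, which is why pure ASCII text is byte-for-byte identical in latin1 and UTF-8.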
--
( Kees
)
c[_] Why is the alphabet in that order? Is it because of that song? (#495)
#5
02-28-2008, 10:32 AM
Axel Schwenke
Re: mixed encodings - how to manage

Kees Nuyt wrote:
> On Sun, 20 Jan 2008 13:12:32 -0500, firewoodtim@yahoo.com
> wrote:
>
>> I have a table in MySQL that is presently entirely encoded with latin1
>> charset. Several of the varchar fields are used as CSS styling values
>> and, as such, do not need to be encoded with utf8. However, two of
>> the fields become visible content in the HTML page, and I want to
>> change the encoding for those two fields to utf8.
>>
>> What kind of header() and meta tag specifiers should I use in a mixed
>> encoding situation like that? If I specify "UTF-8" will the browser
>> be able to automatically distinguish the situations that require
>> reading one byte (latin1) from those requiring more bytes (utf8)?

>
> Using UTF-8 won't hurt.


I strongly disagree.

utf8 adds significant overhead to string processing and in many cases MySQL
has to reserve 3 bytes per utf8 character (e.g. for index or record size).

The right approach is to use the "minimal" charset that can encode the
characters in question. So in the aforementioned table only the two fields
should be changed to use the utf8 encoding.

> Mixed encoding within a web page isn't really possible.


Right.

MySQL solved this problem with the introduction of the client-related
charset settings in 4.1. Now you can store data in, e.g., latin1, but all data
the client sends to or retrieves from the database can be, e.g., utf8 (and
will automagically be converted).
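
Conceptually, that conversion is just a re-encoding of each value between the column charset and the connection charset. A minimal sketch of the idea, in Python for illustration only (the helper name and sample value are hypothetical, and Python's "latin-1" codec is used as a stand-in for MySQL's latin1):

```python
def convert_for_client(stored: bytes, column_charset: str, connection_charset: str) -> bytes:
    """Re-encode a stored value from the column charset to the connection charset."""
    return stored.decode(column_charset).encode(connection_charset)


# A value as it would sit in a latin1 column: one byte per character.
stored_value = "M\u00fcller".encode("latin-1")

# What a client that asked for utf8 would receive: 'ü' now takes two bytes.
sent_to_client = convert_for_client(stored_value, "latin-1", "utf-8")

print(len(stored_value), len(sent_to_client))
```

The text is the same on both sides; only its byte representation changes at the connection boundary.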

So the final suggestion is:
- declare utf8 encoding in the HTTP header
- use 'SET NAMES utf8' to declare all client <-> database traffic to use the
utf8 encoding
- store data in the appropriate encoding, use utf8 only where necessary


XL
--
Axel Schwenke, Support Engineer, MySQL AB

Online User Manual: http://dev.mysql.com/doc/refman/5.0/en/
MySQL User Forums: http://forums.mysql.com/
#6
02-28-2008, 10:32 AM
firewoodtim@yahoo.com
Re: mixed encodings - how to manage

On Mon, 21 Jan 2008 18:19:52 +0100, Axel Schwenke
<axel.schwenke@gmx.de> wrote:

>Kees Nuyt wrote:
>> On Sun, 20 Jan 2008 13:12:32 -0500, firewoodtim@yahoo.com
>> wrote:
>>
>>> I have a table in MySQL that is presently entirely encoded with latin1
>>> charset. Several of the varchar fields are used as CSS styling values
>>> and, as such, do not need to be encoded with utf8. However, two of
>>> the fields become visible content in the HTML page, and I want to
>>> change the encoding for those two fields to utf8.
>>>
>>> What kind of header() and meta tag specifiers should I use in a mixed
>>> encoding situation like that? If I specify "UTF-8" will the browser
>>> be able to automatically distinguish the situations that require
>>> reading one byte (latin1) from those requiring more bytes (utf8)?

>>
>> Using UTF-8 won't hurt.

>
>I strongly disagree.
>
>utf8 adds significant overhead to string processing and in many cases MySQL
>has to reserve 3 bytes per utf8 character (e.g. for index or record size).
>
>The right approach is to use the "minimal" charset that can encode the
>characters in question. So in the aforementioned table only the two fields
>should be changed to use the utf8 encoding.
>
>> Mixed encoding within a web page isn't really possible.

>
>Right.
>
>MySQL solved this problem with the introduction of the client-related
>charset settings in 4.1. Now you can store data in, e.g., latin1, but all data
>the client sends to or retrieves from the database can be, e.g., utf8 (and
>will automagically be converted).
>
>So the final suggestion is:
>- declare utf8 encoding in the HTTP header
>- use 'SET NAMES utf8' to declare all client <-> database traffic to use the
>utf8 encoding
>- store data in the appropriate encoding, use utf8 only where necessary
>
>
>XL


So, using this suggestion, I could just do the following (using the
same table and fields as in my original description):
1. ALTER TABLE table_name MODIFY varchar_column VARCHAR(50) CHARACTER SET
utf8 NOT NULL;
2. ALTER TABLE table_name MODIFY text_column TEXT CHARACTER SET utf8 NOT
NULL;
I issue a similar ALTER command for each column in each table that I
want to convert to utf8.

Steps 1 & 2 will take care of converting the data from latin1 to utf8
and will set the default for those columns only to utf8 encoding,
leaving the table encoding default intact at latin1.

3. In my php file, right after connecting to the db, add the line:
mysql_query("SET NAMES 'utf8' ");
4. Set the html header using the line:
header('Content-Type:text/html; charset=UTF-8');
5. Set a meta tag in all my scripts to:
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

Will that take care of it correctly? Are there any other steps to
take?
#7
02-28-2008, 10:32 AM
Axel Schwenke
Re: mixed encodings - how to manage

Hi,

firewoodtim@yahoo.com wrote:
> On Mon, 21 Jan 2008 18:19:52 +0100, Axel Schwenke
> <axel.schwenke@gmx.de> wrote:
>
>> So the final suggestion is:
>> - declare utf8 encoding in the HTTP header
>> - use 'SET NAMES utf8' to declare all client <-> database traffic to use the
>> utf8 encoding
>> - store data in the appropriate encoding, use utf8 only where necessary

>
> So, using this suggestion, I could just do the following (using the
> same table and fields as in my original description):
> 1. ALTER TABLE table_name MODIFY varchar_column VARCHAR(50) CHARACTER SET
> utf8 NOT NULL;
> 2. ALTER TABLE table_name MODIFY text_column TEXT CHARACTER SET utf8 NOT
> NULL;
> I issue a similar ALTER command for each column in each table that I
> want to convert to utf8.


You can put several MODIFY COLUMN in one ALTER TABLE statement. This is
recommended as it will reduce execution time.
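
As an illustration, a small helper that assembles such a combined statement could look like the sketch below. It is written in Python only to keep the example self-contained; the function name is hypothetical, the table and column names are the placeholders used in this thread, and no identifier escaping is attempted. It places CHARACTER SET directly after the data type, which as far as I know is where MySQL expects it:

```python
from typing import List, Tuple


def build_alter_table(table: str,
                      columns: List[Tuple[str, str, str]],
                      charset: str = "utf8") -> str:
    """Build one ALTER TABLE statement with a MODIFY clause per column.

    Each column is (name, data_type, constraints), e.g.
    ("varchar_column", "VARCHAR(50)", "NOT NULL").
    """
    clauses = [
        f"  MODIFY `{name}` {data_type} CHARACTER SET {charset} {constraints}".rstrip()
        for name, data_type, constraints in columns
    ]
    return f"ALTER TABLE `{table}`\n" + ",\n".join(clauses) + ";"


statement = build_alter_table(
    "table_name",
    [
        ("varchar_column", "VARCHAR(50)", "NOT NULL"),
        ("text_column", "TEXT", "NOT NULL"),
    ],
)
print(statement)
```

With one statement the table is rebuilt once instead of once per column.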

> Steps 1 & 2 will take care of converting the data from latin1 to utf8


Exactly.

> and will set the default for those columns only to utf8 encoding,
> leaving the table encoding default intact at latin1.


On column level this is not a default, but an exact setting. All string-
type columns have an attribute "collation" and - implicitly for the
collation - "encoding" (in MySQL terminology: "charset").

Tables, databases and the server have a *default* setting. This default
setting is used when a new database object is created without explicit
specification of charset and/or collation.

Example: a new column CHAR(10) is added to table `foo` in database `bar`. In
absence of a charset specification MySQL checks if table `foo` has a default
set and if yes, uses it for the new column. Otherwise the default for
database `bar` or - finally - the server default is used.
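
The lookup order described above can be written down as a tiny function. This is only a sketch in Python to illustrate the fallback chain, not MySQL's actual implementation; the function name is hypothetical and "latin1" as server default is just an example value:

```python
from typing import Optional


def resolve_charset(column: Optional[str] = None,
                    table_default: Optional[str] = None,
                    database_default: Optional[str] = None,
                    server_default: str = "latin1") -> str:
    """Pick the charset for a new column: explicit setting first, then defaults."""
    for candidate in (column, table_default, database_default):
        if candidate is not None:
            return candidate
    return server_default


# New CHAR(10) column in table `foo`, database `bar`, no explicit charset:
print(resolve_charset(table_default=None, database_default="utf8"))
# Explicit column charset wins over every default:
print(resolve_charset(column="utf8", table_default="latin1"))
```

Note that the defaults only matter at creation time; an existing column keeps its own charset when a table default changes later.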

> 3. In my php file, right after connecting to the db, add the line:
> mysql_query("SET NAMES 'utf8' ");


This will make all PHP <-> MySQL conversation use utf8. Note: the MySQL
client can automatically reconnect in case of connection loss. Make sure
this statement is sent also in that case.

> 4. Set the html header using the line:
> header('Content-Type:text/html; charset=UTF-8');
> 5. Set a meta tag in all my scripts to:
> <meta http-equiv="Content-Type" content="text/html;
> charset=UTF-8" />


You need only one of those. Technically 4. is the better solution. Meta tags
in HTML have been added for cases where one cannot set HTTP headers (e.g.
HTML files that are delivered by Apache directly).

> Will that take care of it correctly? Are there any other steps to
> take?


Of course everything else - especially the non-code parts in your .php files
- that is delivered to the HTTP client, has to use utf8 encoding. And if
there is another data source besides MySQL, it has to either deliver utf8 or
your PHP code has to explicitly convert data from that data source to utf8.


XL
--
Axel Schwenke, Support Engineer, MySQL AB

Online User Manual: http://dev.mysql.com/doc/refman/5.0/en/
MySQL User Forums: http://forums.mysql.com/
#8
02-28-2008, 10:32 AM
firewoodtim@yahoo.com
Re: mixed encodings - how to manage


>This will make all PHP <-> MySQL conversation use utf8. Note: the MySQL
>client can automatically reconnect in case of connection loss. Make sure
>this statement is sent also in that case.


How do I resend SET NAMES in case of an automatic reconnect? Is that
part of my PHP scripting?

>> Will that take care of it correctly? Are there any other steps to
>> take?

>
>Of course everything else - especially the non-code parts in your .php files
>- that is delivered to the HTTP client, has to use utf8 encoding. And if
>there is another data source besides MySQL, it has to either deliver utf8 or
>your PHP code has to explicitly convert data from that data source to utf8.


I am confused. What do you mean by "non-code" parts of my .php files?
Aren't php files all code? I use Dreamweaver for writing code, and,
to the best of my knowledge, all the scripts have ASCII encoding.
Perhaps you are including here the possibility that I would send
non-ASCII literal strings to the client, like Japanese characters or
odd symbols. If that is what you are referring to, I do not; I use
strictly the symbols on my (English) keyboard.

Finally, my understanding of your position that fields encoded with
utf8 should be kept to a minimum is based on the possibility that
fields that will only ever hold ASCII characters might be used for
indexing. That would be wasteful, if they were encoded using utf8.
That makes sense. But if there were no such fields in the db, that
is, if all indexing on string-type fields were on fields that had to
be utf-encoded anyway, there would be no way around large index files,
and utf8 encoding would not be any more wasteful if applied to the db
as a whole than if we just use utf8 encoding on select fields. Is
that correct?
#9
02-28-2008, 10:32 AM
firewoodtim@yahoo.com
Re: mixed encodings - how to manage

Axel,
Thanks for all the help you are giving. I would not know how to do
any of this without someone lending me a hand.

Perhaps some of the confusion I am having is a result of not
understanding terminology. Let me see if I understand some of the
terms you have used.

MySQL client = the mysql program one uses on the command line to send
commands and receive data directly from mysqld. This, of course, has
nothing to do with the php scripts that I write. If this is what you
mean, do I use .my.cnf file to automatically send "SET NAMES 'utf8' "
on reconnect?

HTTP client = the browser

#10
02-28-2008, 10:32 AM
Axel Schwenke
Re: mixed encodings - how to manage

firewoodtim@yahoo.com wrote:
>> This will make all PHP <-> MySQL conversation use utf8. Note: the MySQL
>> client can automatically reconnect in case of connection loss. Make sure
>> this statement is sent also in that case.

>
> How do I resend SET NAMES in case of an automatic reconnect? Is that
> part of my PHP scripting?


This depends on which MySQL connector and which DB abstraction layer (if
any) you use. For example, mysqli has the MYSQLI_INIT_COMMAND option
(http://php.net/manual/en/function.mysqli-options.php).

Or you can just forbid automatic reconnect and (re)connect explicitly in
your PHP code.
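
The second option, reconnecting explicitly and sending SET NAMES every time a connection is opened, can be sketched as follows. This is a language-neutral illustration written in Python; FakeConnection, Utf8Session and connect are hypothetical stand-ins, not part of any real MySQL driver:

```python
class FakeConnection:
    """Stand-in for a real driver connection; records the statements it receives."""

    def __init__(self):
        self.statements = []
        self.alive = True

    def execute(self, sql):
        if not self.alive:
            raise ConnectionError("connection lost")
        self.statements.append(sql)


class Utf8Session:
    """Wrap a connection factory so SET NAMES is sent on every (re)connect."""

    def __init__(self, connect):
        self._connect = connect
        self._open()

    def _open(self):
        self._conn = self._connect()
        self._conn.execute("SET NAMES 'utf8'")

    def execute(self, sql):
        try:
            self._conn.execute(sql)
        except ConnectionError:
            self._open()              # explicit reconnect: charset is set again
            self._conn.execute(sql)


opened = []


def connect():
    conn = FakeConnection()
    opened.append(conn)
    return conn


session = Utf8Session(connect)
session.execute("SELECT 1")
opened[0].alive = False               # simulate a dropped connection
session.execute("SELECT 2")
print(opened[1].statements)
```

The point of the design is that no code path can obtain a connection without the charset statement having been sent on it first.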

>> Of course everything else - especially the non-code parts in your .php files
>> - that is delivered to the HTTP client, has to use utf8 encoding. And if
>> there is another data source besides MySQL, it has to either deliver utf8 or
>> your PHP code has to explicitly convert data from that data source to utf8.

>
> I am confused. What do you mean by "non-code" parts of my .php files?
> Aren't php files all code?


Nope. Only the text enclosed in <?php ... ?> tags is PHP code. For example,
the following is a perfectly legal (and IMHO nice) .php file:

<HTML>
...
<?php $object = do_something(); ?>

<TABLE>
<?php foreach ($object->things() as $thing) { ?>
<TR>
<TD>
<?php print(foo($thing)); ?>
</TD>
<TD>
<?php print(bar($thing)); ?>
</TD>
</TR>
<?php } ?>
</TABLE>
...
</HTML>

Here everything outside the <?php ... ?> tags is sent verbatim to the HTTP
client. If there are any characters outside the ASCII range, they must be
encoded in utf8.


> I use Dreamweaver for writing code, and,
> to the best of my knowledge, all the scripts have ASCII encoding.


This would be OK as long as there are only characters from the ASCII range.
Something like

<?php $foo="German Umlauts are äöü, ÄÖÜ and ß";
print($foo); ?>

would already break it. I have seen a lot of strange "HTML" from web
designers, like nonstandard spaces, bullet marks, nonstandard quotation marks
and the like.
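
If you are unsure what your editor actually saved, you can check the raw bytes of a file. Here is a minimal sketch in Python (used only for illustration; the function name and the sample snippets are hypothetical) that classifies a byte string as pure ASCII, valid UTF-8, or neither:

```python
def classify_source(data: bytes) -> str:
    """Return 'ascii', 'utf-8' or 'not utf-8' for the given file contents."""
    try:
        data.decode("ascii")
        return "ascii"              # safe under any ASCII-compatible charset
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")
        return "utf-8"              # non-ASCII present and correctly encoded
    except UnicodeDecodeError:
        return "not utf-8"          # e.g. saved as latin1: will break the page


umlauts = '<?php $foo="\u00e4\u00f6\u00fc"; ?>'   # contains the umlauts from above

print(classify_source(b"<?php print('hello'); ?>"))
print(classify_source(umlauts.encode("utf-8")))
print(classify_source(umlauts.encode("latin-1")))
```

To check a real file, read it in binary mode (open(path, "rb").read()) and pass the result to the function.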

> Perhaps you are including here the possibility that I would send
> non-ASCII literal strings to the client, like Japanese characters or
> odd symbols. If that is what you are referring to, I do not; I use
> strictly the symbols on my (English) keyboard.


So what do you need utf8 for anyway? Is *all* your page content coming from
the database? No (semi)static menus or footers?

> Finally, my understanding of your position that fields encoded with
> utf8 should be kept to a minimum is based on the possibility that
> fields that will only ever hold ASCII characters might be used for
> indexing. That would be wasteful, if they were encoded using utf8.
> That makes sense. But if there were no such fields in the db, that
> is, if all indexing on string-type fields were on fields that had to
> be utf-encoded anyway, there would be no way around large index files,
> and utf8 encoding would not be any more wasteful if applied to the db
> as a whole than if we just use utf8 encoding on select fields. Is
> that correct?


There is more to that. MySQL has both fixed length and variable length data
types. For the latter there would be no extra storage required if utf8
encoding is used but only ASCII characters are stored. But for fixed length
fields MySQL will always use 3 bytes per utf8 character.

But there are some implicit constraints too. Example:

CREATE TABLE t1 (
c1 VARCHAR(200),
c2 VARCHAR(200),
INDEX (c1, c2)
)

this will not work with utf8 because indexes are restricted to 1000 bytes
and (200+200)*3 > 1000. Even if you intend to store only ASCII in those
columns, MySQL has still to plan for the worst case. There is a similar
limit of 64K for the row size. You just cannot have 5 VARCHAR(5000) columns
if you use utf8. You can with latin1.
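
The arithmetic behind both limits, as a small sketch. It is written in Python for illustration; the limits are the figures quoted above, the names are hypothetical, and per-column length bytes and other row overhead are ignored:

```python
MAX_BYTES_PER_CHAR = {"latin1": 1, "utf8": 3}
INDEX_LIMIT_BYTES = 1000    # index key limit quoted above
ROW_LIMIT_BYTES = 65535     # the "64K" row size limit quoted above


def worst_case_bytes(char_lengths, charset):
    """Bytes MySQL has to plan for, given column lengths in characters."""
    return sum(char_lengths) * MAX_BYTES_PER_CHAR[charset]


index_columns = [200, 200]          # INDEX (c1, c2) on two VARCHAR(200)
row_columns = [5000] * 5            # five VARCHAR(5000) columns

for charset in ("latin1", "utf8"):
    index_bytes = worst_case_bytes(index_columns, charset)
    row_bytes = worst_case_bytes(row_columns, charset)
    print(charset,
          index_bytes, index_bytes <= INDEX_LIMIT_BYTES,
          row_bytes, row_bytes <= ROW_LIMIT_BYTES)
```

With latin1 both cases fit (400 and 25000 bytes); with utf8 both exceed the limit (1200 and 75000 bytes).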

When MySQL allocates buffers for utf8 strings or records containing utf8, it
will always reserve 3 bytes per utf8 char. This includes read_buffer,
sort_buffer, net_buffer, implicit temporary tables etc. Using utf8 where
you don't need it is wasting memory!

Also *all* string operations are more expensive on utf8. Just walking a
utf8 string char by char means *interpreting* the content. For latin1
strings this is just incrementing a pointer. In many cases I would rather
use ucs2 than utf8 to *store* data. Communication to the outside world is
a completely different topic. utf8 is fine there.
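
To illustrate the difference, here is a rough sketch (in Python for readability, not how MySQL implements it; the function names are hypothetical and the utf8 input is assumed to be valid) of finding the byte offset of the n-th character:

```python
def byte_offset_latin1(data: bytes, char_index: int) -> int:
    """Fixed width: the n-th character starts at byte n."""
    return char_index


def byte_offset_utf8(data: bytes, char_index: int) -> int:
    """Variable width: every byte on the way has to be inspected."""
    offset = 0
    seen = 0
    while seen < char_index:
        offset += 1
        # Skip continuation bytes (bit pattern 10xxxxxx) of the current char.
        while offset < len(data) and (data[offset] & 0xC0) == 0x80:
            offset += 1
        seen += 1
    return offset


text = "\u00e4\u00f6\u00fc\u00df!"            # 'äöüß!'
as_latin1 = text.encode("latin-1")
as_utf8 = text.encode("utf-8")

print(byte_offset_latin1(as_latin1, 4))       # position of '!'
print(byte_offset_utf8(as_utf8, 4))
```

A fixed-width two-byte encoding such as ucs2 gets the cheap lookup back (offset = 2 * index) at the cost of doubling the size of ASCII text.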


XL
--
Axel Schwenke, Support Engineer, MySQL AB

Online User Manual: http://dev.mysql.com/doc/refman/5.0/en/
MySQL User Forums: http://forums.mysql.com/