I have a table in MySQL that is presently entirely encoded with latin1 charset. Several of the varchar fields are used as CSS styling values and, as such, do not need to be encoded with utf8. However, two of the fields become visible content in the HTML page, and I want to change the encoding for those two fields to utf8.

What kind of header() and meta tag specifiers should I use in a mixed encoding situation like that? If I specify "UTF-8" will the browser be able to automatically distinguish the situations that require reading one byte (latin1) from those requiring more bytes (utf8)?
| |||
On Sun, 20 Jan 2008 20:23:01 +0100, "J.O. Aho" <user@example.net> wrote:

>firewoodtim@yahoo.com wrote:
>
>> This is what I thought was likely, and it certainly makes sense.
>> However, what threw me was concern about how a browser set to read
>> an html file using utf8 would be able to recognize a 1 byte character,
>
>1 byte characters have an ASCII value of 127 or less.
>
>> when it was also expecting 2 or 3 or even 4 byte characters as well.
>
>Those start with a byte value of 128 or higher.

So just to be sure, let me see if I understand this correctly.

I have a PHP script running that takes latin1 data from a variety of MySQL columns and uses them as CSS values in "style" attributes. These do not have to be converted to utf8 encoding, because latin1 values are a subset of utf8 and any conversion would result in the same 1 byte characters that I started with anyway.

In the case of the utf8 fields that will be displayed visibly in the browser, not used as html styling values, these will be correctly interpreted by the browser, since if there are any non-latin1 characters, they will be recognized by the browser as utf8 from their encoding values (128 or higher) and displayed correctly.

This means that if I specify utf8 using the header() function and again in the meta tags, and then use latin1 and utf8 characters in the script to form the styling and visible content respectively, everything should turn out OK. Am I right?
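The "same 1 byte characters" assumption in the post above can be checked with a short Python sketch. It is only an illustration, not something from the thread, and the sample strings are made up. As far as the encodings themselves go, only the ASCII range (0-127) produces identical bytes in latin1 and UTF-8; latin1 characters from 128 upward become two bytes in UTF-8.

```python
# Illustration only: compare the bytes latin1 and UTF-8 produce for the same text.
# The sample strings are hypothetical CSS values and content, not data from the thread.

def same_bytes(text):
    """Return True if latin1 and UTF-8 encode `text` to identical bytes."""
    return text.encode("latin-1") == text.encode("utf-8")

samples = ["#ff0000", "12px", "caf\u00e9", "\u00df"]

for text in samples:
    print(repr(text),
          text.encode("latin-1"),
          text.encode("utf-8"),
          "identical" if same_bytes(text) else "different")
```

So pure-ASCII CSS values such as colour codes pass through unchanged, while anything with an accented letter needs a real conversion.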
| |||
On Sun, 20 Jan 2008 20:10:28 +0100, "J.O. Aho" <user@example.net> wrote:

>Kees Nuyt wrote:
>> On Sun, 20 Jan 2008 13:12:32 -0500, firewoodtim@yahoo.com
>> wrote:
>> [snip]
>> Using UTF-8 won't hurt.
>> For the ASCII part (0-9, A-Z, a-z, almost all punctuation)
>> the bytes are exactly the same.
>> Only special characters need a 2 or sometimes 3 byte
>> sequence.
>
>The characters with ASCII value 127 and lower are the same, everything else
>is different.
>
>> Mixed encoding within a web page isn't really possible.
>> The whole page uses whatever is declared in the encoding
>> header, the DOCUMENT TYPE or in the meta header.
>> You can't switch to another encoding halfway.
>
>If the data in the database is mixed, then the charset has to be unified
>before it is injected into the "page", or else characters may not be displayed
>correctly. Trying to show iso-8859 characters as UTF-8 will result in quite
>a lot of question marks, and the other way around will result in strange characters.

I agree.
--
( Kees )
c[_] There is only one boss, the customer. And he can fire everybody in the company from the chairman on down, simply by spending his money somewhere else. (Sam Walton) (#35)
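The two failure modes described above (question marks one way, strange characters the other) can be reproduced with a small Python sketch. This is an illustration under the assumption that a browser substitutes a replacement character for invalid UTF-8, which Python's errors="replace" imitates; the sample text is made up.

```python
# Illustration only: what happens when bytes are read with the wrong charset.
text = "\u00e4\u00f6\u00fc"                  # the umlauts a, o, u

latin1_bytes = text.encode("latin-1")        # one byte per character
utf8_bytes = text.encode("utf-8")            # two bytes per character

# latin1 bytes read as UTF-8: the bytes do not form valid sequences, so each
# one is replaced by U+FFFD (usually drawn as a question mark symbol).
latin1_read_as_utf8 = latin1_bytes.decode("utf-8", errors="replace")

# UTF-8 bytes read as latin1: every byte is a valid latin1 character, so
# nothing fails, but each umlaut turns into two unrelated characters.
utf8_read_as_latin1 = utf8_bytes.decode("latin-1")

print(latin1_read_as_utf8)
print(utf8_read_as_latin1)
```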
| |||
On Sun, 20 Jan 2008 14:08:23 -0500, firewoodtim@yahoo.com wrote:

>On Sun, 20 Jan 2008 19:51:45 +0100, Kees Nuyt <k.nuyt@nospam.demon.nl>
>wrote:
>
>>On Sun, 20 Jan 2008 13:12:32 -0500, firewoodtim@yahoo.com
>>wrote:
>>
>>>I have a table in MySQL that is presently entirely encoded with latin1
>>>charset. Several of the varchar fields are used as CSS styling values
>>>and, as such, do not need to be encoded with utf8. However, two of
>>>the fields become visible content in the HTML page, and I want to
>>>change the encoding for those two fields to utf8.
>>>
>>>What kind of header() and meta tag specifiers should I use in a mixed
>>>encoding situation like that? If I specify "UTF-8" will the browser
>>>be able to automatically distinguish the situations that require
>>>reading one byte (latin1) from those requiring more bytes (utf8)?
>>
>>Using UTF-8 won't hurt.
>>For the ASCII part (0-9, A-Z, a-z, almost all punctuation)
>>the bytes are exactly the same.
>>Only special characters need a 2 or sometimes 3 byte
>>sequence.
>>
>>Mixed encoding within a web page isn't really possible.
>>The whole page uses whatever is declared in the encoding
>>header, the DOCUMENT TYPE or in the meta header.
>>You can't switch to another encoding halfway.
>>
>>Again, the single byte non-diacritic characters are mostly
>>the same between ISO-8859-1 and UTF-8, only the encoding
>>of special ones is different.
>>You have to choose one and stick to it. If ISO-8859-1
>>misses a few symbols you need, you'd better use UTF-8
>>everywhere. That way you cover many many more symbols than
>>in any 1 byte encoding.
>>
>>(Anyone please correct me if I'm wrong).
>
>This is what I thought was likely, and it certainly makes sense.
>However, what threw me was concern about how a browser set to read
>an html file using utf8 would be able to recognize a 1 byte character,
>when it was also expecting 2 or 3 or even 4 byte characters as well.
>Maybe I don't have to worry about this and can proceed on faith alone,
>but it would be both interesting and reassuring to know the answer.
>Does anyone know the way this is accomplished?

Here is how UTF-8 works:
http://en.wikipedia.org/wiki/UTF-8#Description
--
( Kees )
c[_] Why is the alphabet in that order? Is it because of that song? (#495)
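In short, the first byte of every UTF-8 sequence announces how long the sequence is, so a reader never has to guess. A minimal Python sketch of that rule follows. The function names are made up for illustration, and the sketch does no validation beyond the lead byte (a real decoder also rejects overlong and out-of-range sequences).

```python
# Illustration only: how a UTF-8 reader knows where each character ends.

def sequence_length(lead_byte):
    """Return the length in bytes of the UTF-8 sequence that starts with lead_byte."""
    if lead_byte < 0x80:              # 0xxxxxxx: plain ASCII, one byte
        return 1
    if 0xC0 <= lead_byte < 0xE0:      # 110xxxxx: two-byte sequence
        return 2
    if 0xE0 <= lead_byte < 0xF0:      # 1110xxxx: three-byte sequence
        return 3
    if 0xF0 <= lead_byte < 0xF8:      # 11110xxx: four-byte sequence
        return 4
    # 10xxxxxx is a continuation byte and can never start a character.
    raise ValueError("byte cannot start a UTF-8 sequence")

def split_characters(data):
    """Split UTF-8 encoded bytes into one chunk of bytes per character."""
    chunks = []
    i = 0
    while i < len(data):
        n = sequence_length(data[i])
        chunks.append(data[i:i + n])
        i += n
    return chunks

data = "a\u00e9\u20ac\U0001F600".encode("utf-8")   # 1-, 2-, 3- and 4-byte characters
print(split_characters(data))
```

This is also why "walking" a UTF-8 string character by character costs more than stepping through a single-byte string: every lead byte has to be inspected.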
| |||
Kees Nuyt wrote:
> On Sun, 20 Jan 2008 13:12:32 -0500, firewoodtim@yahoo.com
> wrote:
>
>> I have a table in MySQL that is presently entirely encoded with latin1
>> charset. Several of the varchar fields are used as CSS styling values
>> and, as such, do not need to be encoded with utf8. However, two of
>> the fields become visible content in the HTML page, and I want to
>> change the encoding for those two fields to utf8.
>>
>> What kind of header() and meta tag specifiers should I use in a mixed
>> encoding situation like that? If I specify "UTF-8" will the browser
>> be able to automatically distinguish the situations that require
>> reading one byte (latin1) from those requiring more bytes (utf8)?
>
> Using UTF-8 won't hurt.

I strongly disagree.

utf8 adds significant overhead to string processing, and in many cases MySQL has to reserve 3 bytes per utf8 character (e.g. for index or record size).

The right approach is to use the "minimal" charset that can encode the characters in question. So in the aforementioned table only the two fields should be changed to use the utf8 encoding.

> Mixed encoding within a web page isn't really possible.

Right.

MySQL solved this problem with the introduction of the client-related charset settings in 4.1. Now you can store data in e.g. latin1, but all data the client sends to or retrieves from the database can be e.g. utf8 (and will automagically be converted).

So the final suggestion is:
- declare utf8 encoding in the HTTP header
- use 'SET NAMES utf8' to declare all client <-> database traffic to use the utf8 encoding
- store data in the appropriate encoding, use utf8 only where necessary

XL
--
Axel Schwenke, Support Engineer, MySQL AB
Online User Manual: http://dev.mysql.com/doc/refman/5.0/en/
MySQL User Forums: http://forums.mysql.com/
| |||
On Mon, 21 Jan 2008 18:19:52 +0100, Axel Schwenke <axel.schwenke@gmx.de> wrote:

>Kees Nuyt wrote:
>> On Sun, 20 Jan 2008 13:12:32 -0500, firewoodtim@yahoo.com
>> wrote:
>>
>>> I have a table in MySQL that is presently entirely encoded with latin1
>>> charset. Several of the varchar fields are used as CSS styling values
>>> and, as such, do not need to be encoded with utf8. However, two of
>>> the fields become visible content in the HTML page, and I want to
>>> change the encoding for those two fields to utf8.
>>>
>>> What kind of header() and meta tag specifiers should I use in a mixed
>>> encoding situation like that? If I specify "UTF-8" will the browser
>>> be able to automatically distinguish the situations that require
>>> reading one byte (latin1) from those requiring more bytes (utf8)?
>>
>> Using UTF-8 won't hurt.
>
>I strongly disagree.
>
>utf8 adds significant overhead to string processing, and in many cases MySQL
>has to reserve 3 bytes per utf8 character (e.g. for index or record size).
>
>The right approach is to use the "minimal" charset that can encode the
>characters in question. So in the aforementioned table only the two fields
>should be changed to use the utf8 encoding.
>
>> Mixed encoding within a web page isn't really possible.
>
>Right.
>
>MySQL solved this problem with the introduction of the client-related
>charset settings in 4.1. Now you can store data in e.g. latin1, but all data
>the client sends to or retrieves from the database can be e.g. utf8 (and
>will automagically be converted).
>
>So the final suggestion is:
>- declare utf8 encoding in the HTTP header
>- use 'SET NAMES utf8' to declare all client <-> database traffic to use the
>utf8 encoding
>- store data in the appropriate encoding, use utf8 only where necessary
>
>
>XL

So, using this suggestion, I could just do the following (using the same table and fields as in my original description):

1. ALTER TABLE table_name MODIFY varchar_column VARCHAR(50) CHARACTER SET utf8 NOT NULL;
2. ALTER TABLE table_name MODIFY text_column TEXT CHARACTER SET utf8 NOT NULL;

I issue a similar ALTER command for each column in each table that I want to convert to utf8. Steps 1 & 2 will take care of converting the data from latin1 to utf8 and will set the default for those columns only to utf8 encoding, leaving the table encoding default intact at latin1.

3. In my php file, right after connecting to the db, add the line:
   mysql_query("SET NAMES 'utf8' ");
4. Set the html header using the line:
   header('Content-Type:text/html; charset=UTF-8');
5. Set a meta tag in all my scripts to:
   <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

Will that take care of it correctly? Are there any other steps to take?
| |||
Hi,

firewoodtim@yahoo.com wrote:
> On Mon, 21 Jan 2008 18:19:52 +0100, Axel Schwenke
> <axel.schwenke@gmx.de> wrote:
>
>> So the final suggestion is:
>> - declare utf8 encoding in the HTTP header
>> - use 'SET NAMES utf8' to declare all client <-> database traffic to use the
>> utf8 encoding
>> - store data in the appropriate encoding, use utf8 only where necessary
>
> So, using this suggestion, I could just do the following (using the
> same table and fields as in my original description):
> 1. ALTER TABLE table_name MODIFY varchar_column VARCHAR(50)
> CHARACTER SET utf8 NOT NULL;
> 2. ALTER TABLE table_name MODIFY text_column TEXT CHARACTER
> SET utf8 NOT NULL;
> I issue a similar ALTER command for each column in each table that I
> want to convert to utf8.

You can put several MODIFY COLUMN clauses in one ALTER TABLE statement. This is recommended as it will reduce execution time.

> Steps 1 & 2 will take care of converting the data from latin1 to utf8

Exactly.

> and will set the default for those columns only to utf8 encoding,
> leaving the table encoding default intact at latin1.

On column level this is not a default, but an exact setting. All string-type columns have an attribute "collation" and - implicitly for the collation - "encoding" (in MySQL terminology: "charset").

Tables, databases and the server have a *default* setting. This default setting is used when a new database object is created without explicit specification of charset and/or collation.

Example: a new column CHAR(10) is added to table `foo` in database `bar`. In the absence of a charset specification, MySQL checks if table `foo` has a default set and, if yes, uses it for the new column. Otherwise the default for database `bar` or - finally - the server default is used.

> 3. In my php file, right after connecting to the db, add the line:
> mysql_query("SET NAMES 'utf8' ");

This will make all PHP <-> MySQL conversation use utf8. Note: the MySQL client can automatically reconnect in case of connection loss. Make sure this statement is also sent in that case.

> 4. Set the html header using the line:
> header('Content-Type:text/html; charset=UTF-8');
> 5. Set a meta tag in all my scripts to:
> <meta http-equiv="Content-Type" content="text/html;
> charset=UTF-8" />

You need only one of those. Technically 4. is the better solution. Meta tags in HTML were added for cases where one cannot set HTTP headers (e.g. HTML files that are delivered by Apache directly).

> Will that take care of it correctly? Are there any other steps to
> take?

Of course everything else that is delivered to the HTTP client - especially the non-code parts in your .php files - has to use utf8 encoding. And if there is another data source besides MySQL, it has to either deliver utf8 or your PHP code has to explicitly convert data from that data source to utf8.

XL
--
Axel Schwenke, Support Engineer, MySQL AB
Online User Manual: http://dev.mysql.com/doc/refman/5.0/en/
MySQL User Forums: http://forums.mysql.com/
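The fallback order described above (explicit column setting, then table default, then database default, then server default) can be modelled in a few lines of Python. This is a simplified sketch, not MySQL's actual implementation: the function and parameter names are invented, collations are ignored, and the server default of latin1 is an assumption.

```python
# Illustration only: a simplified model of how a new column gets its charset.

def effective_charset(column=None, table_default=None, database_default=None,
                      server_default="latin1"):
    """Return the charset a new string column ends up with.

    An explicit column charset wins; otherwise the nearest default is
    inherited, checking table, then database, then server.
    """
    for candidate in (column, table_default, database_default):
        if candidate is not None:
            return candidate
    return server_default

# A column declared with CHARACTER SET utf8 in a table whose default is latin1:
print(effective_charset(column="utf8", table_default="latin1"))

# A new CHAR(10) with no charset given, added to table `foo` in database `bar`:
print(effective_charset(table_default="latin1", database_default="utf8"))
```

Under this model, converting two columns to utf8 leaves every other column, and any column added later without an explicit charset, on the table's latin1 default.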
| |||
>This will make all PHP <-> MySQL conversation use utf8. Note: the MySQL
>client can automatically reconnect in case of connection loss. Make sure
>this statement is also sent in that case.

How do I resend SET NAMES in case of an automatic reconnect? Is that part of my PHP scripting?

>> Will that take care of it correctly? Are there any other steps to
>> take?
>
>Of course everything else that is delivered to the HTTP client - especially
>the non-code parts in your .php files - has to use utf8 encoding. And if
>there is another data source besides MySQL, it has to either deliver utf8 or
>your PHP code has to explicitly convert data from that data source to utf8.

I am confused. What do you mean by "non-code" parts of my .php files? Aren't php files all code? I use Dreamweaver for writing code, and, to the best of my knowledge, all the scripts have ASCII encoding. Perhaps you are including here the possibility that I would send non-ASCII literal strings to the client, like Japanese characters or odd symbols. If that is what you are referring to, I do not; I use strictly the symbols on my (English) keyboard.

Finally, my understanding of your position that fields encoded with utf8 should be kept to a minimum is based on the possibility that fields that will only ever hold ASCII characters might be used for indexing. That would be wasteful, if they were encoded using utf8. That makes sense. But if there were no such fields in the db, that is, if all indexing on string-type fields were on fields that had to be utf8-encoded anyway, there would be no way around large index files, and utf8 encoding would not be any more wasteful if applied to the db as a whole than if we just use utf8 encoding on select fields. Is that correct?
| |||
Axel,

Thanks for all the help you are giving. I would not know how to do any of this without someone lending me a hand.

Perhaps some of the confusion I am having is a result of not understanding terminology. Let me see if I understand some of the terms you have used.

MySQL client = the mysql program one uses on the command line to send commands and receive data directly from mysqld. This, of course, has nothing to do with the php scripts that I write. If this is what you mean, do I use .my.cnf file to automatically send "SET NAMES 'utf8' " on reconnect?

HTTP client = the browser
| ||||
firewoodtim@yahoo.com wrote:
>> This will make all PHP <-> MySQL conversation use utf8. Note: the MySQL
>> client can automatically reconnect in case of connection loss. Make sure
>> this statement is also sent in that case.
>
> How do I resend SET NAMES in case of an automatic reconnect? Is that
> part of my PHP scripting?

This depends on which MySQL connector and which DB abstraction layer (if any) you use. E.g. mysqli knows of the MYSQLI_INIT_COMMAND option (http://php.net/manual/en/function.mysqli-options.php).

Or you can just forbid automatic reconnect and (re)connect explicitly in your PHP code.

>> Of course everything else that is delivered to the HTTP client - especially
>> the non-code parts in your .php files - has to use utf8 encoding. And if
>> there is another data source besides MySQL, it has to either deliver utf8 or
>> your PHP code has to explicitly convert data from that data source to utf8.
>
> I am confused. What do you mean by "non-code" parts of my .php files?
> Aren't php files all code?

Nope. Only the text enclosed in <?php ... ?> tags is PHP code. E.g. the following is a perfectly legal (and IMHO nice) .php file:

    <HTML>
    ...
    <?php $object= do_something() ?>
    <TABLE>
    <?php foreach ($object->things() as $thing) { ?>
    <TR>
    <TD> <?php print(foo($thing)) ?> </TD>
    <TD> <?php print(bar($thing)) ?> </TD>
    </TR>
    <?php } ?>
    </TABLE>
    ...
    </HTML>

Here everything outside the <?php ... ?> tags is sent verbatim to the HTTP client. If there are any characters outside the ASCII range, they must be encoded in utf8.

> I use Dreamweaver for writing code, and,
> to the best of my knowledge, all the scripts have ASCII encoding.

This would be OK as long as there are only characters from the ASCII range. Something like

    <?php $foo="German Umlauts are äöü, ÄÖÜ and ß"; print($foo); ?>

would already break it. I have seen a lot of strange "HTML" from web designers, like nonstandard spaces, bullet marks, nonstandard quotation marks and the like.

> Perhaps you are including here the possibility that I would send
> non-ASCII literal strings to the client, like Japanese characters or
> odd symbols. If that is what you are referring to, I do not; I use
> strictly the symbols on my (English) keyboard.

So what do you need utf8 for anyway? Is *all* your page content coming from the database? No (semi)static menus or footers?

> Finally, my understanding of your position that fields encoded with
> utf8 should be kept to a minimum is based on the possibility that
> fields that will only ever hold ASCII characters might be used for
> indexing. That would be wasteful, if they were encoded using utf8.
> That makes sense. But if there were no such fields in the db, that
> is, if all indexing on string-type fields were on fields that had to
> be utf8-encoded anyway, there would be no way around large index files,
> and utf8 encoding would not be any more wasteful if applied to the db
> as a whole than if we just use utf8 encoding on select fields. Is
> that correct?

There is more to that. MySQL has both fixed length and variable length data types. For the latter there would be no extra storage required if utf8 encoding is used but only ASCII characters are stored. But for fixed length fields MySQL will always use 3 bytes per utf8 character.

But there are some implicit constraints too. Example:

    CREATE TABLE t1 (
      c1 VARCHAR(200),
      c2 VARCHAR(200),
      INDEX (c1, c2)
    )

This will not work with utf8 because indexes are restricted to 1000 bytes and (200+200)*3 > 1000. Even if you intend to store only ASCII in those columns, MySQL still has to plan for the worst case.

There is a similar limit of 64K for the row size. You just cannot have 5 VARCHAR(5000) columns if you use utf8. You can with latin1.

When MySQL allocates buffers for utf8 strings or records containing utf8, it will always reserve 3 bytes per utf8 char. This includes read_buffer, sort_buffer, net_buffer, implicit temporary tables etc. Using utf8 where you don't need it is wasting memory!

Also *all* string operations are more expensive on utf8. Just walking a utf8 string char by char means *interpreting* the content. For latin1 strings this is just incrementing a pointer.

In many cases I would rather use ucs2 than utf8 to *store* data. Communication to the outside world is a completely different topic. utf8 is fine there.

XL
--
Axel Schwenke, Support Engineer, MySQL AB
Online User Manual: http://dev.mysql.com/doc/refman/5.0/en/
MySQL User Forums: http://forums.mysql.com/
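The worst-case arithmetic behind the two limits mentioned above can be written out as a short Python sketch. It is only an illustration: the helper name is made up, the 1000-byte index limit and the 3 bytes per utf8 character are taken from the post, 65535 is assumed as the exact value of the "64K" row limit, and per-column length bytes and other overhead are ignored.

```python
# Illustration only: worst-case size planning for latin1 vs utf8 columns.
BYTES_PER_CHAR = {"latin1": 1, "utf8": 3}   # utf8 figure as stated in the post
MAX_KEY_BYTES = 1000                        # index limit quoted in the post
MAX_ROW_BYTES = 65535                       # assumed exact value of the "64K" limit

def worst_case_bytes(char_lengths, charset):
    """Bytes that must be reserved for columns of the given character lengths."""
    return sum(char_lengths) * BYTES_PER_CHAR[charset]

index_columns = [200, 200]     # INDEX (c1, c2) on two VARCHAR(200) columns
row_columns = [5000] * 5       # five VARCHAR(5000) columns

for charset in ("latin1", "utf8"):
    key = worst_case_bytes(index_columns, charset)
    row = worst_case_bytes(row_columns, charset)
    print(charset,
          "index:", key, "fits" if key <= MAX_KEY_BYTES else "too large",
          "| row:", row, "fits" if row <= MAX_ROW_BYTES else "too large")
```

The same columns that fit comfortably with latin1 exceed both limits once each character is budgeted at 3 bytes, regardless of what is actually stored in them.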