This is a discussion on Storing Japanese characters in UTF-8 database within the DB2 forums, part of the Database Server Software category; --> I am having some data reject on an insert of Japanese characters to a Varchar column in my DB2 ...
| |||||||
| FAQ | Members List | Calendar | Search | Today's Posts | Mark Forums Read |
| ||||
| I am having some data reject on an insert of Japanese characters to a Varchar column in my DB2 UTF-8 database. I am trying to understand how big to make the Varchar column for the inesert to work successfully for all of my data. I would also like ot understand if Vargraphic is a better approach and what ramifications that might have. This data is sourced from UTF-8 Oracle (using byte semantics) which defines a columns as Varchar2(255). I had defined my DB2 columns Varchar(255) but the insert rejects some rows. I notice that for Oracle : if I apply the length function to a specific column , I get a value of 8 for a specified key. However, for this same key that sucecssfully loaded on the DB2 side, I get a length of 33. Sometimes, the length on the DB2 side is more than 5 times the Oracle side. I do not understand how this can be. Also, Can someone clarify how much storage DB2 allocates in bytes, for the following (assume non-nullable) ? My assumptions are below - CHAR(1) - assume 1 byte CHAR(100) - assume 100 bytes VARCHAR(1) - assume length + 1 byte = 2 bytes VARCHAR(100) - assume lenght + 100 bytes = 101 bytes at most VARGRAPHIC(100) - ? Thank you in advance for your assistance. |
| |||
| "mike_dba" <michaelaaldrich@yahoo.com> wrote in message news:1144856686.027653.142990@i39g2000cwa.googlegr oups.com... >I am having some data reject on an insert of Japanese characters to a > Varchar column in my DB2 UTF-8 database. I am trying to understand how > big to make the Varchar column for the inesert to work successfully for > all of my data. I would also like ot understand if Vargraphic is a > better approach and what ramifications that might have. > > This data is sourced from UTF-8 Oracle (using byte semantics) which > defines a columns as > Varchar2(255). I had defined my DB2 columns Varchar(255) but the > insert rejects some rows. > > I notice that for Oracle : if I apply the length function to a specific > column , I get a value of 8 for a specified key. However, for this > same key that sucecssfully loaded on the DB2 side, I get a length of > 33. Sometimes, the length on the DB2 side is more than 5 times the > Oracle side. I do not understand how this can be. > > Also, Can someone clarify how much storage DB2 allocates in bytes, for > the following (assume non-nullable) ? My assumptions are below - > > CHAR(1) - assume 1 byte > CHAR(100) - assume 100 bytes > VARCHAR(1) - assume length + 1 byte = 2 bytes > VARCHAR(100) - assume lenght + 100 bytes = 101 bytes at most > VARGRAPHIC(100) - ? > > Thank you in advance for your assistance. > I would triple the size of the VARCHAR in DB2. You can go up to about 32K (not quite) with VARCHAR in DB2. DB2 uses 2 extra bytes for the length, but that does not reduce the amount of space for the data. If it is nullable, it uses one additional byte. |
| |||
| Thank you for your response. The triple size will work in most cases. However, we have noticed the following scenario - I have a varchar(255) in Oracle. For a particular data item on Oracle, I apply the Oracle function for length and get 86 character back. I also apply the Oracle function for byte size and get back 238. Inserting this same data into DB2 varchar(255) rejects. Inserting into DB2 varchar(2000) works. DB2 length function returns 466 (this should be bytes, correct ?). How could this be ? |
| |||
| I don't know how Oracle measures the length of a UTF-8 varchar field or how it physically stores it but UDB stores UTF-8 data as 1-4 bytes for each character. Characters in pictographic languages, especially Japanese, Chinese, and other languages used in that part of the world, are more often the longer three and four byte ones. A varchar definition specifies the number of bytes, not the number of characters to be stored. The data capacity of the column can't be precisely determined because each character may occupy a different number of bytes. A single UTF-8 character occupies a maximum of four bytes. Specify a varchar length four times the number of characters you expect to store and you'll always have enough space. Unfortunately, this will NOT prevent storing more characters than you want if each character is shorter than the 4/character worst case. Philip Sherman mike_dba wrote: > Thank you for your response. The triple size will work in most cases. > However, we have noticed the following scenario - > > I have a varchar(255) in Oracle. For a particular data item on Oracle, > I apply the Oracle function for length and get 86 character back. I > also apply the Oracle function for byte size and get back 238. > > Inserting this same data into DB2 varchar(255) rejects. Inserting > into DB2 varchar(2000) works. DB2 length function returns 466 (this > should be bytes, correct ?). How could this be ? > |
| |||
| Is there any way in which you can change the character semantics for the field? In particular, I know in oracle you can define a fields data type to either use byte or character length semantics: A) varchar2(120 BYTE) B) varchar2(120 CHAR) Where A is 120 bytes and B is 120 characters? Also, just out of curiosity if I were to attempt to load a UTF-16 character into a field in a UTF-8 database should I expect DB2 to use 8 bytes for this? |
| |||
| Is there any way in which you can change the character semantics for the field? In particular, I know in oracle you can define a fields data type to either use byte or character length semantics: A) varchar2(120 BYTE) B) varchar2(120 CHAR) Where A is 120 bytes and B is 120 characters? Also, just out of curiosity if I were to attempt to load a UTF-16 character into a field in a UTF-8 database should I expect DB2 to use 8 bytes for this? |
| |||
| "mike_dba" <michaelaaldrich@yahoo.com> wrote in message news:1144859559.846256.327460@t31g2000cwb.googlegr oups.com... > Thank you for your response. The triple size will work in most cases. > However, we have noticed the following scenario - > > I have a varchar(255) in Oracle. For a particular data item on Oracle, > I apply the Oracle function for length and get 86 character back. I > also apply the Oracle function for byte size and get back 238. > > Inserting this same data into DB2 varchar(255) rejects. Inserting > into DB2 varchar(2000) works. DB2 length function returns 466 (this > should be bytes, correct ?). How could this be ? > You obviously have too much time on your hands. |
| |||
| Hi.. I have a table which I use to store names and other information(English). Now, the requirement is that the table needs to store all local language like Japanese, Chinese, Hindhi etc. What all things do I have to change in the Database.The databse is a DB2 UTF-8 database. Also, how can I test if the data is being Inserted and how to retreive the data. I dont have a clue. Can some one guide me.. Thanks in advance RaInDeEr. |
| |||
| mike_dba wrote: > Thank you for your response. The triple size will work in most cases. > However, we have noticed the following scenario - > > I have a varchar(255) in Oracle. For a particular data item on Oracle, > I apply the Oracle function for length and get 86 character back. I > also apply the Oracle function for byte size and get back 238. > > Inserting this same data into DB2 varchar(255) rejects. Inserting > into DB2 varchar(2000) works. DB2 length function returns 466 (this > should be bytes, correct ?). How could this be ? > The following functions have been introduced in Viper ------- * CHARACTER_LENGTH * OCTET_LENGTH * POSITION * SUBSTRING The modified existing functions include: * LENGTH * LOCATE These functions process strings along character boundaries rather than along byte or double-byte boundaries. Each function (except OCTET_LENGTH) accepts an argument specifying the code unit, or string length unit of the result: * CODEUNITS16 specifies that the result is to be expressed in 16-bit UTF-16 code units * CODEUNITS32 specifies that the result is to be expressed in 32-bit UTF-32 code units * OCTETS specifies that the result is to be expressed in bytes This argument is optional for the existing functions. ---------- More information is needed to figure out the issue on bind-in. But the fact that the byte length is notes > 400 gives some hint. Unicode has an n-m relationship between glyphs and code points. It could be that in your DB2 App you end up with longer code points for the same glyph. Just a wild guess. You could compare the HEX() values between the two. Cheers Serge -- Serge Rielau DB2 Solutions Development IBM Toronto Lab |
| ||||
| After accepting the fact that for a table with a varchar(255) in Oracle, I will need to define a DB2 table as varchar (3*255), I find that I exceed the 32K page size (as someone pointed out earlier) . So how come Oracle can have a table defined exactly the same as DB2 and yet fit the data in a 32K page and DB2 cannot ? |