Javability (Java, Zaurus, Linux, Live) by Jean-Marc Autexier, Saarland/Germany
cat /dev/www | egrep 'Java|Linux|Zaurus|ITnews|Live' > blog

17.9.04 19:28 Oracle variable multibyte Unicode and Java

Today I discovered that Oracle's variable multibyte support isn't as good as expected.

Background: when you run Oracle with a Unicode character set (for example UTF-8), you can define the length of a VARCHAR2 column not only in bytes but also in number of characters:

create table MYTABLE
(
  COLUMN1 VARCHAR2(200 char)
);

This will create a table MYTABLE with a column COLUMN1 which will accept 200 characters, independent of the character set used (remember that, depending on the chosen character set, a character can take more or fewer bytes).

This works fine until you reach 4000 bytes, which is the maximum a VARCHAR2 field can hold. Note that this limit is in bytes, not characters. This means that with a character set which takes 2 bytes per character (such as UTF-16), you can insert 2000 characters.
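The 2-bytes-per-character arithmetic can be reproduced on the Java side. This is only a small sketch to illustrate the calculation; the class and method names are made up for this example, and it says nothing about how Oracle stores the data internally.

```java
import java.nio.charset.StandardCharsets;

class Utf16Capacity {
    // Bytes needed for the text in UTF-16 (big-endian variant, so no byte order mark is added).
    static int utf16Length(String text) {
        return text.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        // 6 characters, 2 bytes each
        System.out.println(utf16Length("Zaurus"));
        // characters that fit into 4000 bytes at 2 bytes per character
        System.out.println(4000 / 2);
    }
}
```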

For UTF-8 it is more difficult, because UTF-8 is a variable-length encoding. A character takes between one and 3 bytes (4 bytes for supplementary characters outside the Basic Multilingual Plane), so the number of characters that fit depends on the data. In the best case (characters with 1 byte), you can enter 4000 characters. In the worst case considered here (characters with 3 bytes), you only have 1333 characters.
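To make the byte counts concrete, here is a small sketch that encodes one character from each size class and prints how many UTF-8 bytes it needs. The class name Utf8Lengths and the helper utf8Length are hypothetical names chosen for this example.

```java
import java.nio.charset.StandardCharsets;

class Utf8Lengths {
    // Number of bytes the given text occupies when encoded as UTF-8.
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("a"));       // ASCII letter: 1 byte
        System.out.println(utf8Length("\u00e4"));  // a with diaeresis: 2 bytes
        System.out.println(utf8Length("\u20ac"));  // euro sign: 3 bytes
        System.out.println(4000 / 3);              // 3-byte characters that fit into 4000 bytes: 1333
    }
}
```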

If you want to insert into or update such a column, be sure to first check the length of your data in UTF-8 bytes and not only count the number of characters (the two may differ, also because you might use a completely different character set inside your application).

Here is a short Java UTF-8 check:

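// needs: import java.nio.charset.StandardCharsets;
String value = ... ;
// a Java char needs at most 3 UTF-8 bytes, so only check above 1333
if ( (value != null) && (value.length() > 1333) )
{
    // get bytes of the string in UTF-8 format
    byte [] utf8Bytes = value.getBytes(StandardCharsets.UTF_8) ;

    if (utf8Bytes.length > 4000)
    {
        // Keep at most the first 4000 bytes, stepping back over continuation
        // bytes (10xxxxxx) so that no multibyte character is cut in half ...
        int end = 4000 ;
        while ( (end > 0) && ((utf8Bytes[end] & 0xC0) == 0x80) ) end-- ;
        // ... and build the String from UTF-8
        value = new String(utf8Bytes, 0, end, StandardCharsets.UTF_8) ;
    }
}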


Of course this is only a crude hack: it cuts the original String, and all data beyond 4000 UTF-8 bytes is lost.
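If cutting is acceptable at all, another way to do it is to let a CharsetEncoder write into a buffer of limited size: encoding stops before the first character that no longer fits, so no multibyte character is split at the cut. The following is only a sketch under that assumption; the class Utf8Truncate and the method truncateUtf8 are hypothetical names, not part of any library.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class Utf8Truncate {
    // Returns the longest prefix of value whose UTF-8 encoding fits into maxBytes.
    static String truncateUtf8(String value, int maxBytes) {
        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        ByteBuffer out = ByteBuffer.allocate(maxBytes);
        CharBuffer in = CharBuffer.wrap(value);
        // Encoding stops once the next character would not fit into the buffer.
        encoder.encode(in, out, true);
        // in.position() is the number of Java chars that were consumed.
        return value.substring(0, in.position());
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 1334; i++) {
            sb.append('\u20ac'); // euro sign, 3 bytes each in UTF-8
        }
        String cut = truncateUtf8(sb.toString(), 4000);
        System.out.println(cut.length());
        System.out.println(cut.getBytes(StandardCharsets.UTF_8).length);
    }
}
```

The data behind the cut is still lost, so this only makes the truncation cleaner, not safer.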

posted by Jean-Marc Autexier

This is a personal web page. Things said here do not represent the position of my employer.