Unix Technical Forum

Selecting on non ASCII varchars

  #1 (permalink)  
Old 04-15-2008, 11:39 PM
Jeremy LaCivita
 
Posts: n/a
Default Selecting on non ASCII varchars

Hi,

I have a Unicode database. Inserting Unicode strings works fine.
Selecting data based on int columns works fine too.

However, I am unable to select based on varchar columns when the
select contains non-ASCII characters.

The same select works in Aqua Data Studio, just not from Java.
Am I setting up my connections or prepared statements wrong?

/* begin example code */
// Needs java.sql.Connection, java.sql.PreparedStatement and java.sql.ResultSet imported.
javax.naming.InitialContext ctx = new javax.naming.InitialContext();
javax.sql.DataSource ref1 = (javax.sql.DataSource) ctx.lookup("java:/PostgresDS");
Connection conn = ref1.getConnection();
// ~* is PostgreSQL's case-insensitive regular-expression match operator.
PreparedStatement pst = conn.prepareStatement(
    "SELECT * from mytable m where m.title ~* ?");
pst.setString(1, myString);
ResultSet rs = pst.executeQuery();
/* end example code */

mytable.title is a varchar(300).
myString is a java.lang.String that was loaded from a Unicode XML
stream.

Whenever myString contains accented or Chinese characters, for
example, the result set is empty even though there are records
in the database that should match. Running the same query manually in
Aqua Data Studio works fine.

I'm using PostgreSQL 8.0.3.

Any ideas?

-Jeremy

---------------------------(end of broadcast)---------------------------
TIP 3: Have you checked our extensive FAQ?

http://www.postgresql.org/docs/faq

  #2 (permalink)  
Old 04-15-2008, 11:39 PM
Oliver Jowett
 
Posts: n/a
Default Re: Selecting on non ASCII varchars

Jeremy LaCivita wrote:

> PreparedStatement pst = conn.prepareStatement("SELECT * from mytable m
> where m.title ~* ?");


If you use direct equality (=), does it work?

There have been comments on pgsql-bugs recently that some areas of the
backend code (case insensitive comparison and regexp) do not work
correctly in all cases when multibyte encodings are used. You might want
to repost to -bugs if basic equality works correctly.

Do you have a self-contained test case we can try? In particular we need
to know the actual column values and regexp patterns you have problems with.
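
One way to report the exact pattern unambiguously is to list the code points the String actually holds, since accented characters often get mangled in mail. The following is only a sketch; the class name CodePointDump and the sample strings are made up for illustration and do not come from the original poster's application.

```java
class CodePointDump {
    // Render each code point of s as U+XXXX, separated by spaces.
    static String dump(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
            i += Character.charCount(cp); // advance by 1 or 2 chars
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "cafe" with an acute accent on the e, stored correctly: 4 code points.
        System.out.println(dump("caf\u00e9"));       // U+0063 U+0061 U+0066 U+00E9
        // The same word after a wrong decode: the last letter became two chars.
        System.out.println(dump("caf\u00c3\u00a9")); // U+0063 U+0061 U+0066 U+00C3 U+00A9
    }
}
```

If the dump of myString shows pairs such as U+00C3 U+00A9 where a single accented letter was expected, the String was probably garbled before it ever reached the driver.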

-O


  #3 (permalink)  
Old 04-15-2008, 11:40 PM
Vadim Nasardinov
 
Posts: n/a
Default Re: Selecting on non ASCII varchars

On Tuesday 04 October 2005 16:16, Jeremy LaCivita wrote:
> Hmmm
>
> so it turns out if i take all my Strings and do this:
>
> str = new String(str.getBytes(), "utf-8");
>
> then it works.
>
> Correct me if i'm wrong, but that says to me that the Strings were
> in UTF-8 already, but Java didn't know it, so it couldn't send them
> to postgres properly.


It's meaningless to ask what encoding a String has. Strings are
sequences of chars -- they don't have an encoding. The notion of
"encoding" comes into play only when you have to represent a String as
a sequence of bytes.
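
To make that concrete, here is a small sketch that turns one and the same String into bytes under three encodings. The class name EncodeDemo and the hex helper are illustrative only, and java.nio.charset.StandardCharsets assumes Java 7 or later, which is newer than the JVMs in use when this thread was written.

```java
import java.nio.charset.StandardCharsets;

class EncodeDemo {
    // Render a byte array as space-separated lowercase hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "\u00e8"; // e with grave accent, a single char
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_8)));      // c3 a8
        System.out.println(hex(s.getBytes(StandardCharsets.ISO_8859_1))); // e8
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_16BE)));   // 00 e8
    }
}
```

The String is identical in all three calls; only the byte representation differs.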

So, if this returns true for you:

str.equals(new String(str.getBytes(), "utf-8"));

that means your default encoding is either utf-8 or a subset of utf-8,
at least for the characters found in str.
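
Conversely, if that call changes the String, as the quoted fix suggests it did, one possible explanation (not confirmed anywhere in this thread) is that UTF-8 bytes were decoded with a single-byte charset somewhere upstream. The sketch below simulates that with ISO-8859-1 standing in for the unknown wrong charset; MojibakeDemo, misdecode and repair are hypothetical names, and StandardCharsets assumes Java 7 or later.

```java
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    // Simulate the suspected bug: UTF-8 bytes decoded as ISO-8859-1.
    static String misdecode(String original) {
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Undo it: recover the original bytes, then decode them as UTF-8.
    static String repair(String garbled) {
        byte[] raw = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "caf\u00e9";
        String garbled = misdecode(original);
        System.out.println(original.length());                // 4
        System.out.println(garbled.length());                 // 5, the accented letter became two chars
        System.out.println(repair(garbled).equals(original)); // true
    }
}
```

Naming both charsets explicitly keeps the repair independent of the platform default, but the cleaner fix is to decode the input stream correctly in the first place.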

String#getBytes() uses the default encoding, which may be specified via
the environment variable LANG on Unix-like systems.

So, if my default encoding is UTF-8, I get this:

| $ echo $LANG
| en_US.UTF-8
| $ bsh2
| BeanShell 2.0-0.b1.7jpp - by Pat Niemeyer (pat@pat.net)
| bsh % print(System.getProperty("file.encoding"));
| UTF-8
| bsh % str = "Funny char: \u00e8";
| bsh % print(str);
| Funny char: è
| bsh % print(str.equals(new String(str.getBytes(), "utf-8")));
| true
| bsh %

If I change the default encoding to ISO-8859-1, I get this:

| $ env LANG=en_US.iso88591 bsh2
| BeanShell 2.0-0.b1.7jpp - by Pat Niemeyer (pat@pat.net)
| bsh % print(System.getProperty("file.encoding"));
| ISO-8859-1
| bsh % str = "Funny char: \u00e8";
| bsh % print(str);
| Funny char: è
| bsh % print(str.equals(new String(str.getBytes(), "utf-8")));
| false
| bsh %
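
The two sessions above can be reproduced without touching LANG by passing the charset explicitly. This is only a sketch: the platform parameter stands in for whatever the default encoding would be, RoundTripDemo and survives are made-up names, and StandardCharsets assumes Java 7 or later.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class RoundTripDemo {
    // Encode with `platform` (standing in for the default encoding),
    // decode as UTF-8, and report whether the String came back unchanged.
    static boolean survives(String s, Charset platform) {
        return s.equals(new String(s.getBytes(platform), StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        String str = "Funny char: \u00e8";
        System.out.println(survives(str, StandardCharsets.UTF_8));                // true
        System.out.println(survives(str, StandardCharsets.ISO_8859_1));           // false
        System.out.println(survives("plain ascii", StandardCharsets.ISO_8859_1)); // true
    }
}
```

The last line shows the "subset of utf-8" case: pure ASCII text encodes to the same bytes in both charsets, so the round trip succeeds even with the "wrong" default.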


  #4 (permalink)  
Old 04-15-2008, 11:40 PM
Marc Herbert
 
Posts: n/a
Default Re: Selecting on non ASCII varchars

Vadim Nasardinov <vadimn@redhat.com> writes:

> On Tuesday 04 October 2005 16:16, Jeremy LaCivita wrote:


>> Correct me if i'm wrong, but that says to me that the Strings were
>> in UTF-8 already, but Java didn't know it, so it couldn't send them
>> to postgres properly.

>
> It's meaningless to ask what encoding a String has. Strings are
> sequences of chars -- they don't have an encoding.


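Actually, they are encoded using UTF-16:

<http://java.sun.com/developer/technicalArticles/Intl/Supplementary/>

Granted, this is the no-brainer "same value" encoding, where one char
holds one code point, as long as the code point is no greater than
U+FFFF. Code points above that take two chars (a surrogate pair).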
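
A short sketch of where the UTF-16 representation becomes visible. The class and method names are illustrative, and the two code points were picked arbitrarily: U+4E2D lies inside the Basic Multilingual Plane, U+20000 outside it. All calls used here have been in java.lang since Java 5.

```java
class SurrogateDemo {
    // Number of UTF-16 chars needed to hold a single code point.
    static int charsFor(int codePoint) {
        return new String(Character.toChars(codePoint)).length();
    }

    public static void main(String[] args) {
        String supplementary = new String(Character.toChars(0x20000));
        System.out.println(charsFor(0x4E2D));                                         // 1
        System.out.println(charsFor(0x20000));                                        // 2
        System.out.println(supplementary.codePointCount(0, supplementary.length()));  // 1
        System.out.println(Character.isHighSurrogate(supplementary.charAt(0)));       // true
    }
}
```

For the accented and common Chinese characters discussed in this thread, one char per character still holds, so the surrogate case is a side note rather than a likely cause of the empty result set.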


