Unix Technical Forum

SEO

vBulletin Search Engine Optimization


Go Back   Unix Technical Forum > Database Server Software > PostgreSQL > Pgsql General

Register FAQ Members List Calendar Search Today's Posts Mark Forums Read
  #1 (permalink)  
Old 04-09-2008, 08:27 AM
Bill Moseley
 
Posts: n/a
Default Mixing different LC_COLLATE and database encodings



I've been going through the docs and list archives trying to get
clear on encoding issues, but still have a few question.

Do I have these statements correct?

- LC_COLLATE is set on the cluster at initdb time. From that point
on all database text is sorted based on that *regardless* of the
encoding set on an individual database.

- So for lc_collate="C" sorting is just based on the byte values, and
if lc_collate="en_US" then sorting is based on the us_US order, and
the bytes are assumed to be 8859-1 (if that matters).

- To clarify the first point, if the database is encoded utf-8 and
lc_collate is en_US then Postgresql does NOT try to convert utf-8 to
8859-1 before sorting.

- If the "client encoding" and the database encoding differ then
Postgresql will convert between the two encodings during I/O.

- The exception is if *either* the client or the server's encoding is
"SQL_ASCII" then no client<=>server conversion is done.


Sound about right?


1) What else is the database's encoding used for besides to determine
how to convert text in input and output based on the client encoding?

2) What client encoding is used if the client does not specify one?
For example, I'm looking through Perl's DBD::Pg and I don't see any
place where it calls PQsetClientEncoding().


http://search.cpan.org/src/DBDPG/DBD-Pg-1.41/
http://www.postgresql.org/docs/7.4/i...NSLATION-TABLE


3) The vast majority of my utf-8 encoded text that I need to display
sorted probably maps to 8859-1 characters.

I think I already answered this above, but:

Am I correct that Postgresql is *not* converting text from the
database encoding to the cluster encoding before sorting? That is
with "C" it's just sorting in byte order, and with en_US it's just
assuming that the bytes are 8859-1 and ignoring that it's really
utf-8?

That is, if I have text that's in utf-8 but includes characters that
would map to 8859-1 (say accented chars), that sorting will not be
correct because it's not converted to 8859-1 when sorting?



4) If the above is true, then if I wanted my utf-8 encoded text to be
sorted correctly then I'd need to re-initdb using --encoding=en_US.UTF-8,
correct?



5) I suppose there's not way to answer this, short of running
benchmarks, but any ideas what using a lc_collate with utf-8 would do
to performance? Is it a big hit?




Not related to Postgresql, but testing some of this is confusing
due to my environment. How do I get my xterm to work with utf8?
Does ssh do something with encoding?

If I have a utf8 xterm window open on my machine, then ssh to
the server running postgresql where the default locale is "POSIX"
Then running:

LANG=en_US.utf8 psql utf8test
utf8test=> \encoding
UNICODE

utf8test=> select first_name from person where last_name = 'Anderson';

Then I see:

Zo<C3><AB>

But, if on that same remote machine I run a unicode xterm (uxterm in
Debian) then in that xterm window I do:

utf8test=> \encoding
UNICODE
utf8test=> select first_name from person where last_name = 'Anderson';
Zoë (correct)

It's must slower running xterm remotely than using my local xterm and
ssh, so it would be nice to be able to display the utf8.



--
Bill Moseley
moseley@hank.org


---------------------------(end of broadcast)---------------------------
TIP 1: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #2 (permalink)  
Old 04-09-2008, 08:27 AM
Peter Eisentraut
 
Posts: n/a
Default Re: Mixing different LC_COLLATE and database encodings

Bill Moseley wrote:
> Do I have these statements correct?


yes

> 1) What else is the database's encoding used for besides to determine
> how to convert text in input and output based on the client encoding?


nothing

> 2) What client encoding is used if the client does not specify one?


the server encoding

> 3) The vast majority of my utf-8 encoded text that I need to display
> sorted probably maps to 8859-1 characters.


probably not

> That is, if I have text that's in utf-8 but includes characters that
> would map to 8859-1 (say accented chars), that sorting will not be
> correct because it's not converted to 8859-1 when sorting?


right

> 4) If the above is true, then if I wanted my utf-8 encoded text to be
> sorted correctly then I'd need to re-initdb using
> --encoding=en_US.UTF-8, correct?


right

> 5) I suppose there's not way to answer this, short of running
> benchmarks, but any ideas what using a lc_collate with utf-8 would do
> to performance? Is it a big hit?


I don't know why that would be a problem.

> Not related to Postgresql, but testing some of this is confusing
> due to my environment. How do I get my xterm to work with utf8?
> Does ssh do something with encoding?


I don't use xterm so I'll skip the rest.

--
Peter Eisentraut
http://developer.postgresql.org/~petere/

---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #3 (permalink)  
Old 04-09-2008, 08:27 AM
Bill Moseley
 
Posts: n/a
Default Re: Mixing different LC_COLLATE and database encodings

On Sat, Feb 18, 2006 at 05:20:19PM +0100, Peter Eisentraut wrote:
> > 2) What client encoding is used if the client does not specify one?

>
> the server encoding


What's the server encoding? The environment when the cluster is
started? How do you find out what it's running as?

Does that mean if the encoding is anything other than "C" then
Postgresql will convert? That is, if my database is utf8 and the
server is en_US then text will be sent to the client as 8859-1? Not,
that's not correct as I'm not seeing that. So I guess I'm not clear
on that point.

> > 5) I suppose there's not way to answer this, short of running
> > benchmarks, but any ideas what using a lc_collate with utf-8 would do
> > to performance? Is it a big hit?

>
> I don't know why that would be a problem.


Just that sorting utf8 is a bit more work that sorting raw bytes.

Thanks for the help,

--
Bill Moseley
moseley@hank.org


---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #4 (permalink)  
Old 04-09-2008, 08:27 AM
Peter Eisentraut
 
Posts: n/a
Default Re: Mixing different LC_COLLATE and database encodings

Bill Moseley wrote:
> What's the server encoding?


When you say "My database is in utf8", then "utf8" is the server
encoding.

> Does that mean if the encoding is anything other than "C"


C is a locale, not an encoding.

> Just that sorting utf8 is a bit more work that sorting raw bytes.


Sorting in C locale is certainly faster, but for anything else, there
won't be any noticeable difference I would think.

--
Peter Eisentraut
http://developer.postgresql.org/~petere/

---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #5 (permalink)  
Old 04-09-2008, 08:27 AM
Tom Lane
 
Posts: n/a
Default Re: Mixing different LC_COLLATE and database encodings

Bill Moseley <moseley@hank.org> writes:
> - To clarify the first point, if the database is encoded utf-8 and
> lc_collate is en_US then Postgresql does NOT try to convert utf-8 to
> 8859-1 before sorting.


Basically, this is a horribly bad idea and you should never do it.
The database encoding should always match what the locale assumes
for its character set (unless the locale is "C", which doesn't care).

We'd enforce that you never do it if we knew a portable way to determine
the character set assumed by an LC_COLLATE setting.

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 1: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #6 (permalink)  
Old 04-09-2008, 08:27 AM
Bill Moseley
 
Posts: n/a
Default Re: Mixing different LC_COLLATE and database encodings

On Sat, Feb 18, 2006 at 01:40:09PM -0500, Tom Lane wrote:
> Bill Moseley <moseley@hank.org> writes:
> > - To clarify the first point, if the database is encoded utf-8 and
> > lc_collate is en_US then Postgresql does NOT try to convert utf-8 to
> > 8859-1 before sorting.

>
> Basically, this is a horribly bad idea and you should never do it.
> The database encoding should always match what the locale assumes
> for its character set (unless the locale is "C", which doesn't care).


What's a bad idea? Having a lc_collate on the cluster that doesn't
support the encodings in the databases?

> We'd enforce that you never do it if we knew a portable way to determine
> the character set assumed by an LC_COLLATE setting.


Again, not sure what "it" is, but I do find it confusing when the
cluster can have only one lc_collate, but the databases on that
cluster can have more than one encoding. That's why I was asking
how postgresql handles (possibly) different encodings.

Are you saying that if a database is encoded as utf8 then the cluster
should be initiated with something like en_US.utf8? And then all
databaes on that cluster should be encoded the same?



I suspect I don't understand how LC_COLLATE works that well.

I thought the locale defines the order of the characters, but not the
encoding of those characters. Maybe that's not correct. I assumed the
same locale should sort the same chars represented in different
encodings the same way. Maybe that's not the case:

$ LC_ALL=en_US.UTF-8 locale charmap
UTF-8

$ LC_ALL=en_US locale charmap
ISO-8859-1

$ LC_ALL=C locale charmap
ANSI_X3.4-1968

--
Bill Moseley
moseley@hank.org


---------------------------(end of broadcast)---------------------------
TIP 5: don't forget to increase your free space map settings

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #7 (permalink)  
Old 04-09-2008, 08:27 AM
Peter Eisentraut
 
Posts: n/a
Default Re: Mixing different LC_COLLATE and database encodings

Bill Moseley wrote:
> What's a bad idea? Having a lc_collate on the cluster that doesn't
> support the encodings in the databases?


Exactly

> Again, not sure what "it" is, but I do find it confusing when the
> cluster can have only one lc_collate, but the databases on that
> cluster can have more than one encoding.


It is confusing, so don't do it.

> That's why I was asking
> how postgresql handles (possibly) different encodings.


It doesn't.

> Are you saying that if a database is encoded as utf8 then the cluster
> should be initiated with something like en_US.utf8? And then all
> databaes on that cluster should be encoded the same?


Yes

> I thought the locale defines the order of the characters, but not the
> encoding of those characters.


In theory, they are independent concepts. But in practice, the C
library gets a bunch bytes from the application (in this case, the
PostgreSQL server) and is asked to sort them. So it needs to know what
these bytes are supposed to mean. By design of the POSIX locale
facilities, the C library is told that by way of the locale. It would
be much simpler for everyone if there was a function strcmp(string1,
string2, collation, encoding), but there isn't.

--
Peter Eisentraut
http://developer.postgresql.org/~petere/

---------------------------(end of broadcast)---------------------------
TIP 1: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #8 (permalink)  
Old 04-09-2008, 08:27 AM
Greg Stark
 
Posts: n/a
Default Re: Mixing different LC_COLLATE and database encodings


Bill Moseley <moseley@hank.org> writes:

> $ LC_ALL=en_US.UTF-8 locale charmap
> UTF-8
>
> $ LC_ALL=en_US locale charmap
> ISO-8859-1
>
> $ LC_ALL=C locale charmap
> ANSI_X3.4-1968


Unfortunately Postgres only supports a single collation cluster-wide. So
depending on which collation you use of the ones above you would really have
to select either UTF-8 ISO-8859-1 or SQL_ASCII (ie ANSI_X3.4-1968). Anything
else and the collation just won't work properly. It will be expecting UTF-8
and be fed ISO-8859-1 strings, resulting in weird and sometimes inconsistent
sort orders.

There's a certain amount of feeling that using any locale other than C is
probably not ever the right thing given the current functionality. Just about
any database has some strings in it that are really just ascii strings like
char(1) primary keys and other internal database strings. You may not want
them being subject to the locale's collation for comparison purposes and you
may not want the overhead of variable width character encodings.

Those of us in this camp are defining all our databases using C locale and
then using the pg_strxfrm() function that's been floating around the list for
a while to handle sorting strings that need to be sorted in various locales.
This has performs acceptably (but not spectacularly) under glibc but it's not
clear which other libc implementations it works well under.

It also doesn't solve the whole problem since functions like substr() or LIKE
are locale sensitive too. If you need an encoding like UTF-8 and you're stuck
either pushing all your string manipulations into the client or going ahead
with a non-C locale and UTF-8 even for the strings that are really just ascii
strings.

--
greg


---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
choose an index scan if your joining column's datatypes do not
match

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #9 (permalink)  
Old 04-09-2008, 08:27 AM
Bill Moseley
 
Posts: n/a
Default Re: Mixing different LC_COLLATE and database encodings

On Sat, Feb 18, 2006 at 09:31:27PM -0500, Greg Stark wrote:
> Anything else and the collation just won't work properly. It will be
> expecting UTF-8 and be fed ISO-8859-1 strings, resulting in weird
> and sometimes inconsistent sort orders.


So if I have utf8 encoded text and the lc_collate is anything but
utf8 then sorting will be all wrong for any chars that don't map to
ASCII (>127). Kind of a mess.


> There's a certain amount of feeling that using any locale other than C is
> probably not ever the right thing given the current functionality. Just about
> any database has some strings in it that are really just ascii strings like
> char(1) primary keys and other internal database strings. You may not want
> them being subject to the locale's collation for comparison purposes and you
> may not want the overhead of variable width character encodings.


Is the Holy Grail encoding and lc_collate settings per column?


Changing topics, but I'm going to play with different cluster
settings for collate. If I create a cluster in given directory
is there any problems with moving that cluster (renaming the
directory)?

Thanks for your comments, Greg.


--
Bill Moseley
moseley@hank.org


---------------------------(end of broadcast)---------------------------
TIP 4: Have you searched our list archives?

http://archives.postgresql.org

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #10 (permalink)  
Old 04-09-2008, 08:27 AM
Martijn van Oosterhout
 
Posts: n/a
Default Re: Mixing different LC_COLLATE and database encodings

On Sat, Feb 18, 2006 at 08:16:07PM -0800, Bill Moseley wrote:
> Is the Holy Grail encoding and lc_collate settings per column?


Well yes. I've been trying to create a system where you can handle
multiple collations in the same database. I posted the details to
-hackers and got part of the way, but it's a lot of work.

As for encodings, to be honest, I'm not sure whether it's a great idea
to support multiple encodings simultaneously. Things become a lot
easier if you know everything is the same encoding. If you set the
client_encoding automatically on startup it has pretty much the same
effect as having the server always use that encoding. It's just a bit
of time wasted in conversion, but the client doesn't need to care.

By way of example, see ICU which is an internationalisation library
we're considering to get consistant locale support over all platforms.
It supports one encoding, namely UTF-16. It has various functions to
convert other encodings to or from that, but internally it's all
UTF-16. So if we do use that, then all encodings (except native UTF-16)
will need to conversion all the time, so you don't buy anything by
having the server in some random encoding.

The problem ofcourse being that the SQL standard requires some encoding
support. No-one has really come up with a proposal for that yet. IMHO,
that's a parser issue more than anything else.

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.


-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.1 (GNU/Linux)

iD8DBQFD+GhyIB7bNG8LQkwRAvfiAJ9eJ4n4cuMI4Hh5kH99qR lby3SZ/wCfVNe5
iLHToMvqjNemVaDCiifS1uk=
=OEis
-----END PGP SIGNATURE-----

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
Reply


Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are On
Forum Jump


All times are GMT. The time now is 08:47 AM.


Powered by vBulletin® Version 3.6.5
Copyright ©2000 - 2008, Jelsoft Enterprises Ltd.
SEO by vBSEO 3.2.0
UnixAdminTalk.com

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493