Hello

I am testing fulltext.

1. I am not able to use fulltext with latin2 encoding (the doc mentions only utf8 dictionaries).

2. with hunspell dictionaries (fresh copy from open office) I got different and wrong results.

Original (old) result:

ts=# select * from ts_debug('Příliš žluťoučký kůň se napil žluté vody');
    ts_name    | tok_type | description |   token   |     dict_name      |  tsvector
---------------+----------+-------------+-----------+--------------------+-------------
 default_czech | word     | Word        | Příliš    | {cz_ispell,simple} | 'příliš'
 default_czech | word     | Word        | žluťoučký | {cz_ispell,simple} | 'žluťoučký'
 default_czech | word     | Word        | kůň       | {cz_ispell,simple} | 'kůň'
 default_czech | lword    | Latin word  | se        | {cz_ispell,simple} |
 default_czech | lword    | Latin word  | napil     | {cz_ispell,simple} | 'napít'
 default_czech | word     | Word        | žluté     | {cz_ispell,simple} | 'žlutý'
 default_czech | lword    | Latin word  | vody      | {cz_ispell,simple} | 'voda'
(7 řádek)

New results:

postgres=# create Text search dictionary cspell(template=ispell, dictfile=czech, afffile=czech, stopwords=czech);
CREATE TEXT SEARCH DICTIONARY
postgres=# CREATE text search configuration cs (copy=english);
CREATE TEXT SEARCH CONFIGURATION
postgres=# alter text search configuration cs alter mapping for word, lword with cspell, simple;
ALTER TEXT SEARCH CONFIGURATION
postgres=# select * from ts_debug('cs','Příliš žluťoučký kůň se napil žluté vody');
 Alias |  Description  |   Token   |  Dictionaries   |    Lexized token
-------+---------------+-----------+-----------------+---------------------
 word  | Word          | Příliš    | {cspell,simple} | cspell: {příliš}
 blank | Space symbols |           | {}              |
 word  | Word          | žluťoučký | {cspell,simple} | cspell: {žluťoučký}
 blank | Space symbols |           | {}              |
 word  | Word          | kůň       | {cspell,simple} | cspell: {kůň}
 blank | Space symbols |           | {}              |
 lword | Latin word    | se        | {cspell,simple} | cspell: {}
 blank | Space symbols |           | {}              |
 lword | Latin word    | napil     | {cspell,simple} | simple: {napil}
 blank | Space symbols |           | {}              |
 word  | Word          | žluté     | {cspell,simple} | simple: {žluté}
 blank | Space symbols |           | {}              |
 lword | Latin word    | vody      | {cspell,simple} | simple: {vody}
(13 rows)

This query returned true in 8.2 and now:

postgres=# select to_tsvector('cs','Příliš žlutý kůň se napil žluté vody') @@ to_tsquery('cs','napít');
 ?column?
----------
 f
(1 row)

Regards
Pavel Stehule

---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend
| |||
Pavel,

I can't read your posting. Can you use plain text format?

Oleg

On Mon, 3 Sep 2007, Pavel Stehule wrote:

> Hello
> I am testing fulltext.
> 1. I am not able to use fulltext with latin2 encoding
>
> 2. with hunspell dictionaries (fresh copy from open office) I got different and wrong results.
> Original (old) result
> ts=# select * from ts_debug('P??li? ?lu?ou?k? k?? se napil ?lut? vody');
>     ts_name    | tok_type | description |   token   |     dict_name      |  tsvector
> ---------------+----------+-------------+-----------+--------------------+-------------
>  default_czech | word     | Word        | P??li?    | {cz_ispell,simple} | 'p??li?'
>  default_czech | word     | Word        | ?lu?ou?k? | {cz_ispell,simple} | '?lu?ou?k?'
>  default_czech | word     | Word        | k??       | {cz_ispell,simple} | 'k??'
>  default_czech | lword    | Latin word  | se        | {cz_ispell,simple} |
>  default_czech | lword    | Latin word  | napil     | {cz_ispell,simple} | 'nap?t'
>  default_czech | word     | Word        | ?lut?     | {cz_ispell,simple} | '?lut?'
>  default_czech | lword    | Latin word  | vody      | {cz_ispell,simple} | 'voda'
> (7 ??dek)
> New results
> postgres=# alter text search configuration cs alter mapping for word, lword with cspell, simple;
> ALTER TEXT SEARCH CONFIGURATION
> postgres=# select * from ts_debug('cs','P??li? ?lu?ou?k? k?? se napil ?lut? vody');
>  Alias |  Description  |   Token   |  Dictionaries   |    Lexized token
> -------+---------------+-----------+-----------------+---------------------
>  word  | Word          | P??li?    | {cspell,simple} | cspell: {p??li?}
>  blank | Space symbols |           | {}              |
>  word  | Word          | ?lu?ou?k? | {cspell,simple} | cspell: {?lu?ou?k?}
>  blank | Space symbols |           | {}              |
>  word  | Word          | k??       | {cspell,simple} | cspell: {k??}
>  blank | Space symbols |           | {}              |
>  lword | Latin word    | se        | {cspell,simple} | cspell: {}
>  blank | Space symbols |           | {}              |
>  lword | Latin word    | napil     | {cspell,simple} | simple: {napil}
>  blank | Space symbols |           | {}              |
>  word  | Word          | ?lut?     | {cspell,simple} | simple: {?lut?}
>  blank | Space symbols |           | {}              |
>  lword | Latin word    | vody      | {cspell,simple} | simple: {vody}
> (13 rows)
> This query returned true in 8.2 and now:
> postgres=# select to_tsvector('cs','P??li? ?lut? k?? se napil ?lut? vody') @@ to_tsquery('cs','nap?t');
>  ?column?
> ----------
>  f
> (1 row)
> Regards
> Pavel Stehule

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
| |||
> 1. I am not able to use fulltext with latin2 encoding
> (the doc mentions only utf8 dictionaries).

You can use any server encoding, but the dictionary's files should be in utf8 - the dictionary will convert utf8 files into the server encoding.

> 2. with hunspell dictionaries (fresh copy from open office) I got
> different and wrong results.
> postgres=# select to_tsvector('cs','Příliš žlutý kůň se napil žluté
> vody') @@ to_tsquery('cs','napít');
>  ?column?
> ----------
>  f
> (1 row)

Please post the output of:

select ts_lexize('cspell','napil');
select to_tsvector('cs','Příliš žlutý kůň se napil žluté vody');

--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
| |||
2007/9/3, Teodor Sigaev <teodor@sigaev.ru>:
> > 1. I am not able to use fulltext with latin2 encoding
> > (the doc mentions only utf8 dictionaries).
> You can use any server encoding, but the dictionary's files should be in utf8 -
> the dictionary will convert utf8 files into the server encoding.
>
> > 2. with hunspell dictionaries (fresh copy from open office) I got
> > different and wrong results.
> > postgres=# select to_tsvector('cs','Příliš žlutý kůň se napil žluté
> > vody') @@ to_tsquery('cs','napít');
> >  ?column?
> > ----------
> >  f
> > (1 row)
>
> Please post the output of:
> select ts_lexize('cspell','napil');
> select to_tsvector('cs','Příliš žlutý kůň se napil žluté vody');

postgres=# select ts_lexize('cspell','napil');
 ts_lexize
-----------

(1 row)

postgres=# select to_tsvector('cs','Příliš žlutý kůň se napil žluté vody');
                        to_tsvector
-----------------------------------------------------------
 'vody':7 'kůň':3 'napil':5 'žluté':6 'žlutý':2 'příliš':1
(1 row)

There is a difference:

8.2.x
postgres=# select lexize('cz_ispell','jablka');
  lexize
----------
 {jablko}
(1 row)

8.3
postgres=# select ts_lexize('cspell','jablka');
 ts_lexize
-----------

(1 row)

postgres=# select ts_lexize('cspell','jablko');
 ts_lexize
-----------
 {jablko}
(1 row)

Pavel Stehule
| |||
Pavel Stehule wrote:
> There is a difference:
> 8.2.x
> postgres=# select lexize('cz_ispell','jablka');
>   lexize
> ----------
>  {jablko}
> (1 row)
> 8.3
> postgres=# select ts_lexize('cspell','jablka');
>  ts_lexize
> -----------
>
> (1 row)
> postgres=# select ts_lexize('cspell','jablko');
>  ts_lexize
> -----------
>  {jablko}
> (1 row)

Can you post a link to the ispell dictionary file you're using so I and others can reproduce that?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
| |||
I used dictionaries from the fedora core package

hunspell-cs-20060303-5.fc7.i386.rpm

then I converted it to utf8 with iconv

Pavel

2007/9/4, Heikki Linnakangas <heikki@enterprisedb.com>:
> Can you post a link to the ispell dictionary file you're using so I and
> others can reproduce that?
>
> --
> Heikki Linnakangas
> EnterpriseDB http://www.enterprisedb.com
| |||
Pavel Stehule wrote:
> I used dictionaries from the fedora core package
>
> hunspell-cs-20060303-5.fc7.i386.rpm
>
> then I converted it to utf8 with iconv

Ok, thanks.

Apparently it's a bug I introduced when I refactored spell.c to use the readline function for reading and recoding the input file. I didn't notice that some calls to STRNCMP used the non-lowercased version of the input line. Patch attached.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
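
For readers who want to see the class of bug described above, here is a minimal sketch in C. It is not the code from spell.c and not the attached patch: the STRNCMP definition, the helper lower_copy, the two is_suffix_rule_* functions, and the choice of the "sfx" keyword are all assumptions made for illustration. The thread does not say which keywords were affected. The idea is only that a prefix test written against lowercase keywords silently fails when it is handed the raw line instead of the lowercased copy, which would be consistent with 'jablka' no longer being reduced to 'jablko'.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Prefix comparison in the style the post mentions. The real macro in
 * spell.c may be defined differently (assumption). */
#define STRNCMP(s, p) strncmp((s), (p), strlen(p))

/* Hypothetical helper: write an ASCII-lowercased copy of src into dst,
 * truncating if dst is too small. */
static void lower_copy(char *dst, size_t dstsize, const char *src)
{
    size_t i;

    assert(dstsize > 0);
    for (i = 0; i + 1 < dstsize && src[i] != '\0'; i++)
        dst[i] = (char) tolower((unsigned char) src[i]);
    dst[i] = '\0';
}

/* Buggy variant: tests the raw input line, so a line starting with
 * "SFX" never matches the lowercase keyword "sfx". */
static int is_suffix_rule_buggy(const char *line)
{
    return STRNCMP(line, "sfx") == 0;
}

/* Fixed variant: tests the lowercased copy of the line. */
static int is_suffix_rule_fixed(const char *line)
{
    char lowered[256];

    lower_copy(lowered, sizeof lowered, line);
    return STRNCMP(lowered, "sfx") == 0;
}
```

With an uppercase affix line such as "SFX A Y 2", is_suffix_rule_buggy returns 0 and is_suffix_rule_fixed returns 1. Both return 1 for an already lowercase line, which is why a bug of this kind can go unnoticed with some dictionary files.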
| ||||
2007/9/4, Heikki Linnakangas <heikki@enterprisedb.com>:
> Apparently it's a bug I introduced when I refactored spell.c to use the
> readline function for reading and recoding the input file. I didn't
> notice that some calls to STRNCMP used the non-lowercased version of the
> input line. Patch attached.

It works

Thank you
Pavel