Fixes the following bugs:

- ispell initialization crashed on empty dictionary file
- ispell initialization crashed on affix file with prefixes but no suffixes
- stop words file was run through pg_verify_mbstr, with database encoding, but it's later interpreted as being UTF-8. Now verifies that it's UTF-8, regardless of database encoding.

Other changes:

- readstopwords now sorts the stop words after loading them. Removed the separate sortstopwords function.
- readstopwords calls recode_and_lowerstr directly, instead of using the "wordop" function pointer in StopList struct. All callers used recode_and_lowerstr anyway, so this simplifies the code a little bit. Are there any external dictionary implementations that would require different behavior?
- bunch of comments added, typos fixed, and other cleanup

The code still needs lots of love, but it's a start...

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

---------------------------(end of broadcast)---------------------------
TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to majordomo@postgresql.org so that your message can get through to the mailing list cleanly
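For readers following along without the patch: the point of sorting the stop words once at load time is that later lookups can use binary search. The sketch below is only an illustration of that idea, not the code from the patch. DemoStopList, demo_sort_stopwords and demo_is_stopword are hypothetical names, and the struct is a simplified stand-in, not the real StopList definition.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified stand-in for StopList (not the real struct). */
typedef struct
{
    int          len;   /* number of entries in stop */
    const char **stop;  /* array of NUL-terminated words */
} DemoStopList;

/* Comparator for qsort/bsearch: both arguments point at array elements. */
int
demo_cmp_words(const void *a, const void *b)
{
    return strcmp(*(const char *const *) a, *(const char *const *) b);
}

/* Sort once, right after loading, so lookups can use binary search. */
void
demo_sort_stopwords(DemoStopList *s)
{
    assert(s != NULL);
    if (s->len > 1)
        qsort(s->stop, (size_t) s->len, sizeof(const char *), demo_cmp_words);
}

/* Binary search; only valid after demo_sort_stopwords has been called. */
bool
demo_is_stopword(const DemoStopList *s, const char *word)
{
    if (s->len == 0)
        return false;
    return bsearch(&word, s->stop, (size_t) s->len,
                   sizeof(const char *), demo_cmp_words) != NULL;
}
```

Doing the sort inside the loader means no caller can forget it, which is presumably why a separate sort function becomes unnecessary.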
| |||
"Heikki Linnakangas" <heikki@enterprisedb.com> writes:
> - readstopwords calls recode_and_lowerstr directly, instead of using the
> "wordop" function pointer in StopList struct. All callers used
> recode_and_lowerstr anyway, so this simplifies the code a little bit. Are
> there any external dictionary implementations that would require
> different behavior?

I don't think eliminating wordop altogether is such a hot idea; some dictionary could possibly want to do different processing than that.

Something that was annoying me yesterday was that it was not clear whether we had fixed every single place that uses a tsearch config file to assume that the file is in UTF8 and should be converted to database encoding. So I was thinking of hardwiring the "recode" part into readstopwords, and using wordop just for the "lowercase" part, which seemed to me like a saner division of labor. That is, UTF8 is a policy that we want to enforce globally, but lowercasing maybe not, and this still leaves the door open for more processing besides lowercasing.

Oleg, Teodor, what do you think about this?

regards, tom lane
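The division of labor described above can be sketched roughly as follows: the recode step always runs, while the per-dictionary hook is optional. This is a hedged illustration only. DemoWordOp, demo_recode, demo_lower_ascii and demo_load_word are made-up names, and demo_recode merely copies the string where real code would convert UTF-8 to the database encoding.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical per-dictionary hook: returns a newly malloc'd, processed copy. */
typedef char *(*DemoWordOp)(const char *word);

/*
 * Stand-in for the globally enforced step. Real code would convert from
 * UTF-8 to the database encoding here; this sketch only copies the string.
 */
char *
demo_recode(const char *utf8_word)
{
    size_t  len = strlen(utf8_word) + 1;
    char   *copy = malloc(len);

    if (copy != NULL)
        memcpy(copy, utf8_word, len);
    return copy;
}

/* Example hook: ASCII-only lowercasing (not locale- or multibyte-aware). */
char *
demo_lower_ascii(const char *word)
{
    char *copy = demo_recode(word);     /* reused here purely as a strdup */

    if (copy != NULL)
        for (char *p = copy; *p != '\0'; p++)
            *p = (char) tolower((unsigned char) *p);
    return copy;
}

/* Always recode; apply wordop only if the dictionary supplied one. */
char *
demo_load_word(const char *utf8_word, DemoWordOp wordop)
{
    char *recoded;
    char *result;

    assert(utf8_word != NULL);
    recoded = demo_recode(utf8_word);
    if (recoded == NULL || wordop == NULL)
        return recoded;
    result = wordop(recoded);
    free(recoded);
    return result;
}
```

A dictionary that must preserve case would pass NULL (or its own hook) and still get the recoding for free.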
| |||
Tom Lane wrote:
> "Heikki Linnakangas" <heikki@enterprisedb.com> writes:
>> - readstopwords calls recode_and_lowerstr directly, instead of using the
>> "wordop" function pointer in StopList struct. All callers used
>> recode_and_lowerstr anyway, so this simplifies the code a little bit. Are
>> there any external dictionary implementations that would require
>> different behavior?
>
> I don't think eliminating wordop altogether is such a hot idea; some
> dictionary could possibly want to do different processing than that.

Ok.

> Something that was annoying me yesterday was that it was not clear
> whether we had fixed every single place that uses a tsearch config file
> to assume that the file is in UTF8 and should be converted to database
> encoding.

I'm afraid there are still a lot of inconsistencies in that. I'm just looking at dict_synonym, and it looks like it has the same problem I patched in readstopwords; it's using pg_verifymbstr, with database encoding, to verify the input file. It also seems to be calling pg_mblen, which depends on database encoding, against UTF-8 encoded strings. I'll look at those more closely.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
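To see why stepping through UTF-8 data with an encoding-dependent length function goes wrong, here is a small standalone sketch. It does not use pg_mblen itself; demo_utf8_mblen, demo_latin1_mblen and demo_count_chars are hypothetical helpers, with a fixed one-byte-per-character function standing in for what a length function would report in a single-byte database encoding.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical signature for an encoding-specific character length function. */
typedef int (*DemoMblen)(const unsigned char *s);

/* Byte length of a UTF-8 sequence judged by its lead byte (no validation). */
int
demo_utf8_mblen(const unsigned char *s)
{
    assert(s != NULL);
    if (*s < 0x80)
        return 1;
    if ((*s & 0xE0) == 0xC0)
        return 2;
    if ((*s & 0xF0) == 0xE0)
        return 3;
    if ((*s & 0xF8) == 0xF0)
        return 4;
    return 1;                   /* invalid lead byte: step a single byte */
}

/* In a single-byte encoding such as LATIN1, every character is one byte. */
int
demo_latin1_mblen(const unsigned char *s)
{
    (void) s;
    return 1;
}

/* Count characters in a NUL-terminated string by stepping with mblen. */
int
demo_count_chars(const unsigned char *s, DemoMblen mblen)
{
    int n = 0;

    while (*s != '\0')
    {
        int len = mblen(s);

        /* Never step past the terminator, even on truncated input. */
        while (len > 0 && *s != '\0')
        {
            s++;
            len--;
        }
        n++;
    }
    return n;
}
```

With the UTF-8 bytes for "café" as input, the UTF-8 length function sees four characters while the single-byte one sees five, splitting the two-byte "é" in the middle.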
| |||
On Thu, 23 Aug 2007, Tom Lane wrote:

> "Heikki Linnakangas" <heikki@enterprisedb.com> writes:
>> - readstopwords calls recode_and_lowerstr directly, instead of using the
>> "wordop" function pointer in StopList struct. All callers used
>> recode_and_lowerstr anyway, so this simplifies the code a little bit. Are
>> there any external dictionary implementations that would require
>> different behavior?
>
> I don't think eliminating wordop altogether is such a hot idea; some
> dictionary could possibly want to do different processing than that.
>
> Something that was annoying me yesterday was that it was not clear
> whether we had fixed every single place that uses a tsearch config file
> to assume that the file is in UTF8 and should be converted to database
> encoding. So I was thinking of hardwiring the "recode" part into
> readstopwords, and using wordop just for the "lowercase" part, which
> seemed to me like a saner division of labor. That is, UTF8 is a policy
> that we want to enforce globally, but lowercasing maybe not, and this
> still leaves the door open for more processing besides lowercasing.
>
> Oleg, Teodor, what do you think about this?
>

I agree with utf-8 recoding and please, don't lowercase. Dictionaries are very different.

> regards, tom lane
>

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
| |||
Tom Lane wrote:
> Something that was annoying me yesterday was that it was not clear
> whether we had fixed every single place that uses a tsearch config file
> to assume that the file is in UTF8 and should be converted to database
> encoding. So I was thinking of hardwiring the "recode" part into
> readstopwords, and using wordop just for the "lowercase" part, which
> seemed to me like a saner division of labor. That is, UTF8 is a policy
> that we want to enforce globally, but lowercasing maybe not, and this
> still leaves the door open for more processing besides lowercasing.

I think we also want to always run input files through pg_verify_mbstr. We do it for stopwords, and synonym files (though incorrectly), but not for thesaurus files or ispell files. It's probably best to do that within the recode-function as well.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
| |||
Heikki Linnakangas wrote:
> Tom Lane wrote:
>> Something that was annoying me yesterday was that it was not clear
>> whether we had fixed every single place that uses a tsearch config file
>> to assume that the file is in UTF8 and should be converted to database
>> encoding. So I was thinking of hardwiring the "recode" part into
>> readstopwords, and using wordop just for the "lowercase" part, which
>> seemed to me like a saner division of labor. That is, UTF8 is a policy
>> that we want to enforce globally, but lowercasing maybe not, and this
>> still leaves the door open for more processing besides lowercasing.
>
> I think we also want to always run input files through pg_verify_mbstr.
> We do it for stopwords, and synonym files (though incorrectly), but not
> for thesaurus files or ispell files. It's probably best to do that
> within the recode-function as well.

Ok, here's an updated version of the patch.

- ispell initialization crashed on empty dictionary file
- ispell initialization crashed on affix file with prefixes but no suffixes
- stop words file was run through pg_verify_mbstr, with database encoding, but it's later interpreted as being UTF-8. Now verifies that it's UTF-8, regardless of database encoding.
- introduces new t_readline function that reads a line from a file, verifies that it's valid UTF-8, and converts it to database encoding. Modified all places that read tsearch config files to use this function instead of fgets directly.
- readstopwords now sorts the stop words after loading them. Removed the separate sortstopwords function.
- moved the wordop-input parameter from StopList struct to a direct argument to readstopwords. Seems cleaner to me that way, the struct is now purely an output of readstopwords, not mixed input/output. readstopwords now recodes the input implicitly using t_readline.
- bunch of comments added, typos fixed, and other cleanup

PS. It's a bank holiday here in the UK on Monday, so I won't be around until Tuesday if something comes up.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
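As a rough illustration of the "read a line, verify that it's valid UTF-8" part of what t_readline is described as doing, here is a standalone sketch. It is not the function from the patch: demo_utf8_valid and demo_readline are hypothetical names, the conversion to database encoding is left out entirely, and a fixed caller-supplied buffer is assumed, so a line longer than the buffer may be split mid-character and reported as invalid.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Check that s[0..len) is well-formed UTF-8 (no overlongs, no surrogates). */
bool
demo_utf8_valid(const unsigned char *s, size_t len)
{
    size_t i = 0;

    while (i < len)
    {
        unsigned char c = s[i];
        size_t        n;    /* continuation bytes expected */
        unsigned long cp;   /* decoded code point */
        unsigned long min;  /* smallest code point allowed at this length */

        if (c < 0x80)
        {
            i++;
            continue;
        }
        else if ((c & 0xE0) == 0xC0) { n = 1; cp = c & 0x1F; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { n = 2; cp = c & 0x0F; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { n = 3; cp = c & 0x07; min = 0x10000; }
        else
            return false;   /* stray continuation byte or invalid lead byte */

        if (len - i <= n)
            return false;   /* sequence truncated at end of input */

        for (size_t k = 1; k <= n; k++)
        {
            if ((s[i + k] & 0xC0) != 0x80)
                return false;
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return false;
        i += n + 1;
    }
    return true;
}

/*
 * Read one line into buf. Returns 1 for a valid UTF-8 line, 0 at end of
 * file, -1 if the bytes read are not valid UTF-8. Real code would convert
 * the line to the database encoding before returning it.
 */
int
demo_readline(FILE *fp, char *buf, size_t size)
{
    assert(fp != NULL && buf != NULL && size > 1);
    if (fgets(buf, (int) size, fp) == NULL)
        return 0;
    if (!demo_utf8_valid((const unsigned char *) buf, strlen(buf)))
        return -1;
    return 1;
}
```

Funnelling every config-file read through one such function is what makes the UTF-8 policy hard to get wrong in individual dictionaries.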
| |||
And here's the attachment I forgot.

Heikki Linnakangas wrote:
> Heikki Linnakangas wrote:
>> Tom Lane wrote:
>>> Something that was annoying me yesterday was that it was not clear
>>> whether we had fixed every single place that uses a tsearch config file
>>> to assume that the file is in UTF8 and should be converted to database
>>> encoding. So I was thinking of hardwiring the "recode" part into
>>> readstopwords, and using wordop just for the "lowercase" part, which
>>> seemed to me like a saner division of labor. That is, UTF8 is a policy
>>> that we want to enforce globally, but lowercasing maybe not, and this
>>> still leaves the door open for more processing besides lowercasing.
>> I think we also want to always run input files through pg_verify_mbstr.
>> We do it for stopwords, and synonym files (though incorrectly), but not
>> for thesaurus files or ispell files. It's probably best to do that
>> within the recode-function as well.
>
> Ok, here's an updated version of the patch.
>
> - ispell initialization crashed on empty dictionary file
> - ispell initialization crashed on affix file with prefixes but no suffixes
> - stop words file was run through pg_verify_mbstr, with database
> encoding, but it's later interpreted as being UTF-8. Now verifies that
> it's UTF-8, regardless of database encoding.
>
> - introduces new t_readline function that reads a line from a file,
> verifies that it's valid UTF-8, and converts it to database encoding.
> Modified all places that read tsearch config files to use this function
> instead of fgets directly.
>
> - readstopwords now sorts the stop words after loading them. Removed the
> separate sortstopwords function.
>
> - moved the wordop-input parameter from StopList struct to a direct
> argument to readstopwords. Seems cleaner to me that way, the struct is
> now purely an output of readstopwords, not mixed input/output.
> readstopwords now recodes the input implicitly using t_readline.
>
> - bunch of comments added, typos fixed, and other cleanup
>
> PS. It's a bank holiday here in the UK on Monday, so I won't be around
> until Tuesday if something comes up.
>

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
| |||
"Heikki Linnakangas" <heikki@enterprisedb.com> writes:
> Ok, here's an updated version of the patch.

I haven't actually read this patch yet, but the description all sounds like the Right Thing now. Will review and commit today.

Also, I believe there's consensus to rename the standard Snowball dictionaries to "english_stem" etc, so I'll make that happen too.

regards, tom lane
| ||||
"Heikki Linnakangas" <heikki@enterprisedb.com> writes:
> Ok, here's an updated version of the patch.

Applied, with a few trivial additional cleanups I noticed while reading the patch. I included your HeadlineText de-duplication too.

regards, tom lane