Unix Technical Forum

Latin vs non-Latin words in text search parsing

#1
04-15-2008, 10:28 PM
Tom Lane

If I am reading the state machine in wparser_def.c correctly, the
three classifications of words that the default parser knows are

    lword    Composed entirely of ASCII letters
    nlword   Composed entirely of non-ASCII letters
             (where "letter" is defined by iswalpha())
    word     Entirely alphanumeric (per iswalnum()), but not above cases

This classification is probably sane enough for dealing with mixed
Russian/English text --- IIUC, Russian words will come entirely from
the Cyrillic alphabet which has no overlap with ASCII letters. But
I'm thinking it'll be quite inconvenient for other European languages
whose alphabets include the base ASCII letters plus other stuff such
as accented letters. They will have a lot of words that fall into
the catchall "word" category, which will mean they have to index
mixed alpha-and-number words in order to catch all native words.
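
To make the problem concrete, here is a minimal Python sketch of the classification as described above. It is an illustration only, not the code in wparser_def.c: the function name `classify_current` is hypothetical, and Python's `str.isalpha()` / `str.isalnum()` are assumed stand-ins for the locale-dependent `iswalpha()` / `iswalnum()`.

```python
def classify_current(token: str) -> str:
    """Classify a token per the thread's reading of the existing parser.

    Assumption: str.isalpha()/str.isalnum() approximate iswalpha()/iswalnum().
    """
    if not token or not all(c.isalnum() for c in token):
        return "other"   # not a word-like token at all
    all_letters = all(c.isalpha() for c in token)
    if all_letters and all(c.isascii() for c in token):
        return "lword"   # entirely ASCII letters
    if all_letters and not any(c.isascii() for c in token):
        return "nlword"  # entirely non-ASCII letters
    return "word"        # catchall: mixed ASCII/non-ASCII letters, or digits


# "car\u00e1cter" is Spanish "carácter"; the third sample is Russian "слово".
samples = ["caracteres", "car\u00e1cter", "\u0441\u043b\u043e\u0432\u043e", "beta1"]
for tok in samples:
    print(f"{tok!a}: {classify_current(tok)}")
```

Under these assumptions an accented word such as "carácter" lands in the same catchall category as "beta1", which is the inconvenience described above.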

ISTM that perhaps a more generally useful definition would be

    lword    Only ASCII letters
    nlword   Entirely letters per iswalpha(), but not lword
    word     Entirely alphanumeric per iswalnum(), but not nlword
             (hence, includes at least one digit)
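
A rough Python sketch of this proposed definition, for illustration only. The name `classify_proposed` is hypothetical, and `str.isalpha()` / `str.isalnum()` are assumed to behave like `iswalpha()` / `iswalnum()` in a Unicode-aware locale.

```python
def classify_proposed(token: str) -> str:
    """Classify a token under the proposed lword/nlword/word definitions.

    Assumption: str.isalpha()/str.isalnum() approximate iswalpha()/iswalnum().
    """
    if not token or not all(c.isalnum() for c in token):
        return "other"  # not a word-like token at all
    if all(c.isalpha() for c in token):
        # All letters: pure ASCII is lword, anything else is nlword.
        return "lword" if token.isascii() else "nlword"
    # Alphanumeric but not all letters, so it contains at least one digit.
    return "word"


# "car\u00e1cter" is Spanish "carácter"; the third sample is Russian "слово".
samples = ["caracteres", "car\u00e1cter", "\u0441\u043b\u043e\u0432\u043e", "beta1"]
for tok in samples:
    print(f"{tok!a}: {classify_proposed(tok)}")
```

With this split, accented native words fall under nlword, and only tokens containing digits remain in word.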

However, I am no linguist and maybe I'm missing something.

Comments?

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 1: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly

#2
04-15-2008, 10:28 PM
Alvaro Herrera
Re: Latin vs non-Latin words in text search parsing

Tom Lane wrote:

> ISTM that perhaps a more generally useful definition would be
>
> lword Only ASCII letters
> nlword Entirely letters per iswalpha(), but not lword
> word Entirely alphanumeric per iswalnum(), but not nlword
> (hence, includes at least one digit)
>
> However, I am no linguist and maybe I'm missing something.


I tend to agree with the need to redefine the categories. I am not sure
I agree with this particular definition though. I would think that a
"latin word" should include ASCII letters and accented letters, and a
non-latin word would be one that included only non-ASCII chars.

alvherre=# select * from ts_debug('spanish', 'añadido añadió añadidura');
 Alias |  Description  |   Token   |  Dictionaries  |      Lexized token
-------+---------------+-----------+----------------+--------------------------
 word  | Word          | añadido   | {spanish_stem} | spanish_stem: {añad}
 blank | Space symbols |           | {}             |
 word  | Word          | añadió    | {spanish_stem} | spanish_stem: {añad}
 blank | Space symbols |           | {}             |
 word  | Word          | añadidura | {spanish_stem} | spanish_stem: {añadidur}
(5 rows)

I would think those would all fit in the "latin word" category. This
example is more interesting because it shows a word categorized
differently just because the plural loses the accent:

alvherre=# select * from ts_debug('spanish', 'caracteres carácter');
 Alias |  Description  |   Token    |  Dictionaries  |      Lexized token
-------+---------------+------------+----------------+--------------------------
 lword | Latin word    | caracteres | {spanish_stem} | spanish_stem: {caracter}
 blank | Space symbols |            | {}             |
 word  | Word          | carácter   | {spanish_stem} | spanish_stem: {caract}
(3 rows)

I am not sure if there are any Western European languages where words can
only be formed with non-ASCII chars. At least in Spanish, accents tend
to be rare. However, I would think this is also wrong:

alvherre=# select * from ts_debug('french', 'à');
 Alias  |  Description   | Token | Dictionaries  |  Lexized token
--------+----------------+-------+---------------+-----------------
 nlword | Non-latin word | à     | {french_stem} | french_stem: {}
(1 row)

I don't think this is much of a problem, this particular word being
(most likely) a stopword.

So, how about

    lword    Entirely letters per iswalpha, with at least one ASCII
    nlword   Entirely letters per iswalpha
    word     Entirely alphanumeric per iswalnum, but not nlword
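
As a rough illustration, this alternative could be sketched in Python as follows. The name `classify_alt` is hypothetical, and `str.isalpha()` / `str.isalnum()` are assumed stand-ins for `iswalpha` / `iswalnum`; none of this is actual parser code.

```python
def classify_alt(token: str) -> str:
    """Classify a token where lword means 'all letters, at least one ASCII'.

    Assumption: str.isalpha()/str.isalnum() approximate iswalpha/iswalnum.
    """
    if not token or not all(c.isalnum() for c in token):
        return "other"  # not a word-like token at all
    if all(c.isalpha() for c in token):
        # One ASCII letter is enough to make the whole token an lword.
        return "lword" if any(c.isascii() for c in token) else "nlword"
    return "word"  # alphanumeric with at least one digit


# "car\u00e1cter" is "carácter"; "\u00e0" is the French word "à".
for tok in ["caracteres", "car\u00e1cter", "\u00e0", "beta1"]:
    print(f"{tok!a}: {classify_alt(tok)}")
```

Here "caracteres" and "carácter" both come out as lword, while a word made only of non-ASCII letters, such as "à", is still an nlword.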

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

#3
04-15-2008, 10:28 PM
Tom Lane
Re: Latin vs non-Latin words in text search parsing

Alvaro Herrera <alvherre@commandprompt.com> writes:
> Tom Lane wrote:
>> ISTM that perhaps a more generally useful definition would be
>>
>> lword Only ASCII letters
>> nlword Entirely letters per iswalpha(), but not lword
>> word Entirely alphanumeric per iswalnum(), but not nlword


> ... how about


> lword Entirely letters per iswalpha, with at least one ASCII
> nlword Entirely letters per iswalpha
> word Entirely alphanumeric per iswalnum, but not nlword


Hmm. Then we have no category for "entirely ASCII", which is an
interesting category at least from the English standpoint, and I think
also in a lot of computer-oriented contexts. I think you may be putting
too much emphasis on the "Latin" aspect of the category name, which I
find to be a bit historical. I'm not sure if it's too late to consider
renaming the categories; if we were willing to do that I'd propose
categories "aword", "naword", "word", defined as above.

Another thing that bothers me about your suggestion is that (at least in
some locales) iswalpha will return true for things that are neither
ASCII letters nor accented versions of them, eg Cyrillic letters.
So I'm not sure the surprise factor is any less with your approach
than mine: you could still get "lword" for something decidedly not
Latin-derived.

regards, tom lane

#4
04-15-2008, 10:28 PM
Heikki Linnakangas
Re: Latin vs non-Latin words in text search parsing

Alvaro Herrera wrote:
> Tom Lane wrote:
>
>> ISTM that perhaps a more generally useful definition would be
>>
>> lword Only ASCII letters
>> nlword Entirely letters per iswalpha(), but not lword
>> word Entirely alphanumeric per iswalnum(), but not nlword
>> (hence, includes at least one digit)

> ...
> I am not sure if there are any Western European languages where words can
> only be formed with non-ASCII chars.


There are some, at least in Swedish: "ö" (island) and "å" (river). They're
both a bit special because they're just one letter each.

> lword Entirely letters per iswalpha, with at least one ASCII
> nlword Entirely letters per iswalpha
> word Entirely alphanumeric per iswalnum, but not nlword


I don't like this categorization much more than the original. The
distinction between lword and nlword is useless for most European
languages.

I suppose that Tom's argument that it's useful to distinguish words made
of purely ASCII characters in computer-oriented stuff is valid, though I
can't immediately think of a use case. For things like parsing a
programming language, that's not really enough, so you'd probably end up
writing your own parser anyway. I'm also not clear what the use case for
the distinction between words with digits or not is. I don't think
there are any natural languages where a word can contain digits, so it
must be a computer-oriented thing as well.

I like the "aword" name more than "lword", BTW. If we change the meaning
of the classes, surely we can change the name as well, right?

Note that the default parser is useless for languages like Japanese,
where words are not separated by whitespace, anyway.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#5
04-15-2008, 10:28 PM
Tatsuo Ishii
Re: Latin vs non-Latin words in text search parsing

> Alvaro Herrera wrote:
> > Tom Lane wrote:
> >
> >> ISTM that perhaps a more generally useful definition would be
> >>
> >> lword Only ASCII letters
> >> nlword Entirely letters per iswalpha(), but not lword
> >> word Entirely alphanumeric per iswalnum(), but not nlword
> >> (hence, includes at least one digit)

> > ...
> > I am not sure if there are any Western European languages where words can
> > only be formed with non-ASCII chars.

>
> There are some, at least in Swedish: "ö" (island) and "å" (river). They're
> both a bit special because they're just one letter each.
>
> > lword Entirely letters per iswalpha, with at least one ASCII
> > nlword Entirely letters per iswalpha
> > word Entirely alphanumeric per iswalnum, but not nlword

>
> I don't like this categorization much more than the original. The
> distinction between lword and nlword is useless for most European
> languages.
>
> I suppose that Tom's argument that it's useful to distinguish words made
> of purely ASCII characters in computer-oriented stuff is valid, though I
> can't immediately think of a use case. For things like parsing a
> programming language, that's not really enough, so you'd probably end up
> writing your own parser anyway. I'm also not clear what the use case for
> the distinction between words with digits or not is. I don't think
> there are any natural languages where a word can contain digits, so it
> must be a computer-oriented thing as well.
>
> I like the "aword" name more than "lword", BTW. If we change the meaning
> of the classes, surely we can change the name as well, right?
>
> Note that the default parser is useless for languages like Japanese,
> where words are not separated by whitespace, anyway.


The above is true, but that does not necessarily mean that Tsearch is not
used for Japanese at all. I overcome the problem above by doing a
pre-processing step which separates Japanese sentences into words divided by
white space. I wish I could write a new parser which could do the
job for 8.4 or later...

Please change the word definition very carefully.
--
Tatsuo Ishii
SRA OSS, Inc. Japan

#6
04-15-2008, 10:28 PM
Gregory Stark
Re: Latin vs non-Latin words in text search parsing

"Heikki Linnakangas" <heikki@enterprisedb.com> writes:

> Alvaro Herrera wrote:
>> Tom Lane wrote:
>>
>>> ISTM that perhaps a more generally useful definition would be
>>>
>>> lword Only ASCII letters
>>> nlword Entirely letters per iswalpha(), but not lword
>>> word Entirely alphanumeric per iswalnum(), but not nlword
>>> (hence, includes at least one digit)

>> ...
>> I am not sure if there are any Western European languages where words can
>> only be formed with non-ASCII chars.

>
> There are some, at least in Swedish: "ö" (island) and "å" (river). They're
> both a bit special because they're just one letter each.


For what it's worth I did the same search last night and found three French
words including "çà" -- which admittedly is likely to be a noise word. Other
dictionaries such as Italian and Irish also have one-letter words like this.
The only other with multi-letter words is actually Faroese with "Ã*ð" and "óð".

> I like the "aword" name more than "lword", BTW. If we change the meaning
> of the classes, surely we can change the name as well, right?


I'm not very familiar with the use case here. Is there a good reason to want
to abbreviate these names? I think I would expect "ascii", "word", and "token"
for the three categories Tom describes.

> Note that the default parser is useless for languages like Japanese,
> where words are not separated by whitespace, anyway.


I also wonder about languages like Arabic and Hindi, which do have words, but
I'm not sure whether they use white space as simply as Latin-script languages do.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com

#7
04-15-2008, 10:28 PM
Tom Lane
Re: Latin vs non-Latin words in text search parsing

"Heikki Linnakangas" <heikki@enterprisedb.com> writes:
> Alvaro Herrera wrote:
>> lword Entirely letters per iswalpha, with at least one ASCII
>> nlword Entirely letters per iswalpha
>> word Entirely alphanumeric per iswalnum, but not nlword


> I don't like this categorization much more than the original. The
> distinction between lword and nlword is useless for most European
> languages.


Right. That's not an objection in itself, since you can just add the
same dictionary mappings to both token types, but the question is when
would such a distinction actually be useful? AFAICS the only case where
it'd make sense to put different mappings on lword and nlword with the
above definitions is when dealing with Russian or similar languages,
where the entire alphabet is non-ASCII. However, my proposal (pure
ASCII vs not pure ASCII) seems to work just as well for that case as
this proposal does.

> ... I'm also not clear what the use case for
> the distinction between words with digits or not is. I don't think
> there are any natural languages where a word can contain digits, so it
> must be a computer-oriented thing as well.


Well, that's exactly why we *should* distinguish words-with-digits;
it's unlikely that any standard dictionary will do sane things with
them, so if you want to index them they need to go down a different
dictionary chain.
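
The idea of a separate dictionary chain per token type can be pictured with a small Python sketch. Everything in it is hypothetical: `toy_stem` is a crude stand-in for a real stemming dictionary, `simple` only lowercases, and the token-type names assume the proposed definitions, in which "word" always contains a digit.

```python
from typing import Callable, Dict, List, Optional

# A dictionary returns a list of lexemes, or None if it does not
# recognize the token (the next dictionary in the chain is then tried).
Dictionary = Callable[[str], Optional[List[str]]]


def toy_stem(token: str) -> Optional[List[str]]:
    """Hypothetical stand-in for a language stemmer: strips one trailing 's'."""
    if not token.isalpha():
        return None  # nothing sane to do with digits
    lowered = token.lower()
    if len(lowered) > 3 and lowered.endswith("s"):
        return [lowered[:-1]]
    return [lowered]


def simple(token: str) -> Optional[List[str]]:
    """Accepts any token and only lowercases it."""
    return [token.lower()]


# Token type -> dictionary chain, consulted in order.
CHAINS: Dict[str, List[Dictionary]] = {
    "lword": [toy_stem],
    "nlword": [toy_stem],
    "word": [simple],  # words with digits bypass the stemmer entirely
}


def lexize(token_type: str, token: str) -> List[str]:
    """Return lexemes from the first dictionary that recognizes the token."""
    for dictionary in CHAINS.get(token_type, []):
        lexemes = dictionary(token)
        if lexemes is not None:
            return lexemes
    return []  # unmapped token type or unrecognized token: not indexed


print(lexize("lword", "Parsers"))
print(lexize("word", "Beta1"))
```

Because the mapping is per token type, the same chain can be attached to lword and nlword when the distinction is not wanted, while word gets its own treatment.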

A more drastic change would be to not treat a string like "beta1"
as a single token at all, so that the alphanumeric-word category
would go away entirely. However I'm disinclined to tinker with
the parser that much. It's seen enough use in the contrib module
that I'm prepared to grant that the design is generally useful.
I'm just worried that the subcategories of "word" need a bit of
adjustment for languages other than Russian and English.

regards, tom lane

#8
04-15-2008, 10:29 PM
Tom Lane
Re: Latin vs non-Latin words in text search parsing

Gregory Stark <stark@enterprisedb.com> writes:
> "Heikki Linnakangas" <heikki@enterprisedb.com> writes:
>> I like the "aword" name more than "lword", BTW. If we change the meaning
>> of the classes, surely we can change the name as well, right?


> I'm not very familiar with the use case here. Is there a good reason to want
> to abbreviate these names? I think I would expect "ascii", "word", and "token"
> for the three categories Tom describes.


Please look at the first nine rows of the table here:
http://developer.postgresql.org/pgdo...h-parsers.html
It's not clear to me where we'd go with the names for the
hyphenated-word and hyphenated-word-part categories. Also, ISTM that
we should use related names for these three categories, since they are
all considered valid parts of hyphenated words.

Another point: "token" is probably unreasonably confusing as a name for
a token type. "Is that a token token or a word token?"

Maybe "aword", "word", and "numword"?

regards, tom lane

#9
04-15-2008, 10:29 PM
Tom Lane
Re: Latin vs non-Latin words in text search parsing

I wrote:
> Maybe "aword", "word", and "numword"?


Does the lack of response mean people are satisfied with that?

Fleshing the proposal out to include the hyphenated-word categories:

    aword            All ASCII letters
    word             All letters according to iswalpha()
    numword          Mixed letters and digits (all iswalnum())

    ahword           Hyphenated word, all ASCII letters
    hword            Hyphenated word, all letters
    numhword         Hyphenated word, mixed letters and digits

    apart_hword      Part of hyphenated word, all ASCII letters
    part_hword       Part of hyphenated word, all letters
    numpart_hword    Part of hyphenated word, mixed letters and digits

(As an example, "foo-beta1" is a numhword, with component tokens
"foo" an aword and "beta1" a numword. This is how it works now
modulo the redefinition of the base categories.)

I'm not totally thrilled with these short names for the hyphenation
categories, but they will seem at least somewhat familiar to users
of contrib/tsearch2, and it's probably not worth changing them just
to make them look prettier.

regards, tom lane

#10
04-15-2008, 10:29 PM
Tom Lane
Re: Latin vs non-Latin words in text search parsing

I wrote:
> (As an example, "foo-beta1" is a numhword, with component tokens
> "foo" an aword and "beta1" a numword. This is how it works now
> modulo the redefinition of the base categories.)


Argh... need more caffeine. Obviously the component tokens would
be apart_hword and numpart_hword. They'd be the others only if
they were *not* part of a hyphenated word.
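
The corrected "foo-beta1" example can be checked with a small Python sketch. The names `part_kind` and `classify_hyphenated` are hypothetical, `str.isalpha()` / `str.isalnum()` are assumed stand-ins for `iswalpha()` / `iswalnum()`, and the rule that the whole token takes the most general category among its parts is an assumption generalized from this single example. The sketch also simplifies by treating an all-digit part as mixed.

```python
from typing import List, Optional, Tuple


def part_kind(part: str) -> Optional[str]:
    """Return 'ascii', 'alpha' or 'num' for one component, or None if invalid."""
    if part and all(c.isalpha() for c in part):
        return "ascii" if part.isascii() else "alpha"
    if part and all(c.isalnum() for c in part):
        return "num"  # simplification: all-digit parts also land here
    return None


def classify_hyphenated(token: str) -> Tuple[str, List[Tuple[str, str]]]:
    """Classify a hyphenated token and each of its components."""
    parts = token.split("-")
    kinds = [part_kind(p) for p in parts]
    if len(parts) < 2 or None in kinds:
        raise ValueError(f"not a hyphenated word: {token!r}")
    # Assumed rule: the whole takes the most general category of its parts.
    if "num" in kinds:
        whole = "numhword"
    elif "alpha" in kinds:
        whole = "hword"
    else:
        whole = "ahword"
    names = {"ascii": "apart_hword", "alpha": "part_hword", "num": "numpart_hword"}
    return whole, [(p, names[k]) for p, k in zip(parts, kinds)]


print(classify_hyphenated("foo-beta1"))
```

For "foo-beta1" this yields numhword for the whole token, with "foo" as apart_hword and "beta1" as numpart_hword, matching the correction above.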

regards, tom lane
