Unix Technical Forum

Why not keeping positions in GIN?

#1  04-12-2008, 08:50 AM
Hitoshi Harada

Hi,

I have been walking through the GIN access method source code these days,
and found that it has only posting lists, with no positions associated with
them.

The reason is that I am trying to implement an n-gram text search index on
GIN myself. As you know, Japanese is not like English or other European
languages. If you index Japanese (or other unsegmented) text by n-gram, each
entry should carry positions as well as a posting list, because you must
know whether the split query keys are adjacent to each other in the data. To
know this, the positions must be there.
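
To make the adjacency point concrete, here is a minimal sketch in Python (not GIN code). It builds a bigram inverted index that keeps positions, then compares a posting-list-only lookup with a positional one. The names `ngrams`, `build_index`, `candidates` and `phrase_match`, and the two sample documents, are assumptions made up for illustration.

```python
from collections import defaultdict

def ngrams(text, n=2):
    """Return (position, gram) pairs for every overlapping n-gram in text."""
    return [(i, text[i:i + n]) for i in range(len(text) - n + 1)]

def build_index(docs, n=2):
    """Map each n-gram to {doc_id: [positions]}.

    The doc_id keys alone play the role of a plain posting list;
    the position lists are the extra information discussed above.
    """
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, gram in ngrams(text, n):
            index[gram][doc_id].append(pos)
    return index

def candidates(index, query, n=2):
    """Posting-list-only search: documents containing every query n-gram."""
    grams = [g for _, g in ngrams(query, n)]
    sets = [set(index.get(g, {})) for g in grams]
    return set.intersection(*sets) if sets else set()

def phrase_match(index, query, n=2):
    """Positional search: query n-grams must occur at consecutive offsets."""
    grams = ngrams(query, n)
    hits = set()
    for doc_id in candidates(index, query, n):
        first = grams[0][1]
        for start in index[first][doc_id]:
            if all(start + off in index[g][doc_id] for off, g in grams):
                hits.add(doc_id)
                break
    return hits

# Hypothetical sample data: both documents contain the bigrams of the
# query, but only document 1 contains them next to each other.
docs = {1: "東京都庁", 2: "京都と東京"}
index = build_index(docs)
print(candidates(index, "東京都"))    # {1, 2}
print(phrase_match(index, "東京都"))  # {1}
```

Without positions, document 2 can only be rejected by fetching and rechecking the original text.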

It's not only about Japanese. When you search for a "phrase" in English
text, the same logic is needed. I haven't studied tsearch2, but isn't this a
problem there too? Also, in some cases an int-array inverted index needs the
entry positions as well, I guess. Keeping positions along with posting lists
is "general" enough for GIN, isn't it?

Is there any future plan around it?


Regards,

Hitoshi Harada



---------------------------(end of broadcast)---------------------------
TIP 4: Have you searched our list archives?

http://archives.postgresql.org

#2  04-12-2008, 08:50 AM
Guillaume Smet

On 5/25/07, Hitoshi Harada <hitoshi_harada@forcia.com> wrote:
> It's not only about Japanese. When you search for a "phrase" in English
> text, the same logic is needed. I haven't studied tsearch2, but isn't this a
> problem there too? Also, in some cases an int-array inverted index needs the
> entry positions as well, I guess. Keeping positions along with posting lists
> is "general" enough for GIN, isn't it?
>
> Is there any future plan around it?


We talked about this with Oleg and Teodor when I worked on GIN for
pg_trgm. I know there is a long-term plan to solve this issue (and
especially to improve ranking in full text search).

I'm not sure positions are general enough. What I'd like to have is
the ability to add metadata. For example, in the case of pg_trgm, I'd
like to have the length of the original string, as it's a strong factor
in the similarity calculation. Currently, we get a lot of results which
are rechecked after the first index pass: it's not very efficient.
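
A rough Python sketch of why the length matters. It assumes that similarity is the number of shared trigrams divided by the number of trigrams in the union, and it pads a single word with two leading blanks and one trailing blank. This is a simplification of what pg_trgm does, and the function names are made up for illustration.

```python
def trigrams(s):
    """Set of trigrams of a single word, padded with two leading blanks
    and one trailing blank (a simplified, assumed model of pg_trgm)."""
    padded = "  " + s.lower() + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def similarity(a, b):
    """Shared trigrams divided by trigrams in the union; needs both strings."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)

def similarity_from_counts(shared, n_query, n_indexed):
    """Same ratio from counts only: shared / (n_query + n_indexed - shared).

    `shared` is what an index scan can count; `n_indexed` is the kind of
    per-entry metadata asked for above.
    """
    return shared / (n_query + n_indexed - shared)

# Both indexed strings share 4 trigrams with the query "word", so the
# index alone cannot tell them apart; their trigram counts can.
print(similarity_from_counts(4, 5, 6))    # "words":          4/7
print(similarity_from_counts(4, 5, 15))   # "wordprocessing": 0.25
```

With the trigram count stored next to each entry, the ratio could be computed during the index pass itself instead of rechecking every candidate row.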

--
Guillaume

#3  04-12-2008, 08:51 AM
Oleg Bartunov

On Fri, 25 May 2007, Hitoshi Harada wrote:

> Hi,
>
> I have been walking through the GIN access method source code these days,
> and found that it has only posting lists, with no positions associated with
> them.
>
> The reason is that I am trying to implement an n-gram text search index on
> GIN myself. As you know, Japanese is not like English or other European
> languages. If you index Japanese (or other unsegmented) text by n-gram, each
> entry should carry positions as well as a posting list, because you must
> know whether the split query keys are adjacent to each other in the data. To
> know this, the positions must be there.


FYI, Tatsuo uses tsearch2 for indexing Japanese documents. But I agree,
an n-gram index would be more universal for Asian languages.

>
> It's not only about Japanese. When you search for a "phrase" in English
> text, the same logic is needed. I haven't studied tsearch2, but isn't this a
> problem there too? Also, in some cases an int-array inverted index needs the
> entry positions as well, I guess. Keeping positions along with posting lists
> is "general" enough for GIN, isn't it?
>
> Is there any future plan around it?


Yes, we do have plans. See our todo, http://www.sai.msu.su/~megera/wiki/todo
You may also read the FTSBOOK, http://www.sai.msu.su/~megera/postgres/fts/doc
and the slides from PGCon2007, http://www.sai.msu.su/~megera/postgr...-pgcon2007.pdf

Regards,
Oleg
__________________________________________________ ___________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

#4  04-12-2008, 08:51 AM
Hitoshi Harada

> FYI, Tatsuo uses tsearch2 for indexing Japanese documents. But I agree,
> an n-gram index would be more universal for Asian languages.

Yeah, I know, but with the tsearch2 sample for Japanese you must use external
morphological analysis libraries to separate words. It is powerful, but I
need a more "lightweight" approach. Also, especially when you search
non-document data (such as titles, names, or patterns in a genome), that
approach is not so useful.
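
As a small illustration of the "lightweight" idea, the Python sketch below contrasts splitting on whitespace with taking overlapping character n-grams. The function names and sample strings are assumptions for illustration, not part of tsearch2 or GIN.

```python
def whitespace_tokens(text):
    """Word-style tokenization: only works when words are separated."""
    return text.split()

def char_ngrams(text, n=2):
    """Overlapping character n-grams: needs no dictionary or analyzer."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

title = "東京都庁舎"
print(whitespace_tokens(title))  # one token: the whole string
print(char_ngrams(title))        # bigrams, including "都庁"

# The same splitting applies to non-document data such as a sequence.
print(char_ngrams("GATTACA", 3))  # ['GAT', 'ATT', 'TTA', 'TAC', 'ACA']
```

The n-gram keys need no external library, but as discussed earlier in the thread, a search over them still has to confirm that the matching keys are adjacent.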

As I mentioned, GIN is also powerful for array data type search, so I
really hope it will be able to carry additional information.

Anyway, thanks a lot for all the information. I will try to read it.

Regards,

Hitoshi Harada

#5  04-12-2008, 08:51 AM
Oleg Bartunov

Hitoshi,

there is no problem writing an n-gram dictionary for tsearch2! The problem
is how to define word boundaries.

Oleg

On Sat, 26 May 2007, Hitoshi Harada wrote:

>> FYI, Tatsuo uses tsearch2 for indexing Japanese documents. But I agree,
>> an n-gram index would be more universal for Asian languages.

> Yeah, I know, but with the tsearch2 sample for Japanese you must use external
> morphological analysis libraries to separate words. It is powerful, but I
> need a more "lightweight" approach. Also, especially when you search
> non-document data (such as titles, names, or patterns in a genome), that
> approach is not so useful.
