Unix Technical Forum

Re: [GENERAL] Fragments in tsearch2 headline

This is a discussion on Re: [GENERAL] Fragments in tsearch2 headline within the pgsql Hackers forums, part of the PostgreSQL category; --> [moved to -hackers, because talk is about implementation details] > I've ported the patch of Sushant Sinha for fragmented ...


Go Back   Unix Technical Forum > Database Server Software > PostgreSQL > pgsql Hackers

FAQ Members List Calendar Search Today's Posts Mark Forums Read
  #1 (permalink)  
Old 05-24-2008, 07:11 AM
Teodor Sigaev
 
Posts: n/a
Default Re: [GENERAL] Fragments in tsearch2 headline

[moved to -hackers, because talk is about implementation details]

> I've ported the patch of Sushant Sinha for fragmented headlines to pg8.3.1
> (http://archives.postgresql.org/pgsql...1/msg00508.php)

Thank you.

1 > diff -Nrub postgresql-8.3.1-orig/contrib/tsearch2/tsearch2.c
now contrib/tsearch2 is compatibility layer for old applications - they don't
know about new features. So, this part isn't needed.

2 solution to compile function (ts_headline_with_fragments) into core, but
using it only from contrib module looks very odd. So, new feature can be used
only with compatibility layer for old release

3 headline_with_fragments() is hardcoded to use default parser, but what will be
in case when configuration uses another parser? For example, for japanese language.

4 I would prefer the signature ts_headline( [regconfig,] text, tsquery [,text] )
and function should accept 'NumFragments=>N' for default parser. Another parsers
may use another options.

5 it just doesn't work correctly, because new code doesn't care of parser
specific type of lexemes.
contrib_regression=# select headline_with_fragments('english', 'wow asd-wow
wow', 'asd', '');
headline_with_fragments
----------------------------------
...wow asd-wow<b>asd</b>-wow wow
(1 row)


So, I incline to use existing framework/infrastructure although it may be a
subject to change.

Some description:
1 ts_headline defines a correct parser to use
2 it calls hlparsetext to split text into structure suitable for both goals:
find the best fragment(s) and concatenate that fragment(s) back to the text
representation
3 it calls parser specific method prsheadline which works with preparsed text
(parse was done in hlparsetext). Method should mark a needed
words/parts/lexemes etc.
4 ts_headline glues fragments into text and returns that.

We need a parser's headline method because only parser knows all about its lexemes.


--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #2 (permalink)  
Old 05-29-2008, 10:42 AM
Sushant Sinha
 
Posts: n/a
Default Re: [GENERAL] Fragments in tsearch2 headline

Now I understand the code much better. A few more questions on headline
generation that I was not able to get from the code:

1. Why is hlparsetext used to parse the document rather than the
parsetext function? Since words to be included in the headline will be
marked afterwords, it seems more reasonable to just use the parsetext
function.

The main difference I see is the use of hlfinditem and marking whether
some word is repeated.

The reason this is important is that hlparsetext does not seem to be
storing word positions which parsetext does. The word positions are
important for generating headline with fragments.

2.
> I would prefer the signature ts_headline( [regconfig,] text, tsquery
>[,text] )and function should accept 'NumFragments=>N' for default
>parser. Another parsers may use another options.


Does this mean we want a unified function ts_headline and we trigger the
fragments if NumFragments is specified? It seems that introducing a new
function which can take configuration OID, or name is complex as there
are so many functions handling these issues in wparser.c.

If this is true then we need to just add marking of headline words in
prsd_headline. Otherwise we will need another prsd_headline_with_covers
function.

3. In many cases people may already have TSVector for a given document
(for search operation). Would it be faster to pass TSVector to headline
function when compared to computing TSVector each time? If that is the
case then should we have an option to pass TSVector to headline
function?

-Sushant.

On Sat, 2008-05-24 at 07:57 +0400, Teodor Sigaev wrote:
> [moved to -hackers, because talk is about implementation details]
>
> > I've ported the patch of Sushant Sinha for fragmented headlines to pg8.3.1
> > (http://archives.postgresql.org/pgsql...1/msg00508.php)

> Thank you.
>
> 1 > diff -Nrub postgresql-8.3.1-orig/contrib/tsearch2/tsearch2.c
> now contrib/tsearch2 is compatibility layer for old applications - they don't
> know about new features. So, this part isn't needed.
>
> 2 solution to compile function (ts_headline_with_fragments) into core, but
> using it only from contrib module looks very odd. So, new feature can be used
> only with compatibility layer for old release
>
> 3 headline_with_fragments() is hardcoded to use default parser, but what will be
> in case when configuration uses another parser? For example, for japanese language.
>
> 4 I would prefer the signature ts_headline( [regconfig,] text, tsquery [,text] )
> and function should accept 'NumFragments=>N' for default parser. Another parsers
> may use another options.
>
> 5 it just doesn't work correctly, because new code doesn't care of parser
> specific type of lexemes.
> contrib_regression=# select headline_with_fragments('english', 'wow asd-wow
> wow', 'asd', '');
> headline_with_fragments
> ----------------------------------
> ...wow asd-wow<b>asd</b>-wow wow
> (1 row)
>
>
> So, I incline to use existing framework/infrastructure although it may be a
> subject to change.
>
> Some description:
> 1 ts_headline defines a correct parser to use
> 2 it calls hlparsetext to split text into structure suitable for both goals:
> find the best fragment(s) and concatenate that fragment(s) back to the text
> representation
> 3 it calls parser specific method prsheadline which works with preparsed text
> (parse was done in hlparsetext). Method should mark a needed
> words/parts/lexemes etc.
> 4 ts_headline glues fragments into text and returns that.
>
> We need a parser's headline method because only parser knows all about its lexemes.
>
>
> --
> Teodor Sigaev E-mail: teodor@sigaev.ru
> WWW: http://www.sigaev.ru/
>
>



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #3 (permalink)  
Old 05-29-2008, 10:42 AM
Pierre-Yves Strub
 
Posts: n/a
Default Re: [GENERAL] Fragments in tsearch2 headline

On Sat, May 24, 2008 at 11:18 PM, Sushant Sinha <sushant354@gmail.com> wrote:

> Does this mean we want a unified function ts_headline and we trigger the
> fragments if NumFragments is specified? It seems that introducing a new
> function which can take configuration OID, or name is complex as there
> are so many functions handling these issues in wparser.c.
>
> If this is true then we need to just add marking of headline words in
> prsd_headline. Otherwise we will need another prsd_headline_with_covers
> function.


I think that we could merge down the two functions using a default
cover of the whole text when NumCovers if missing. I started writing
the function. The only missing information is, as you stated, the
missing posititions in HeadlineParsedText.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #4 (permalink)  
Old 05-29-2008, 10:43 AM
Teodor Sigaev
 
Posts: n/a
Default Re: [GENERAL] Fragments in tsearch2 headline

Hi!

> 1. Why is hlparsetext used to parse the document rather than the
> parsetext function? Since words to be included in the headline will be
> marked afterwords, it seems more reasonable to just use the parsetext
> function.
> The main difference I see is the use of hlfinditem and marking whether
> some word is repeated.

hlparsetext preserves any kind of lexeme - not indexed, spaces etc. parsetext
doesn't.
hlparsetext preserves original form of lexemes. parsetext doesn't.

>
> The reason this is important is that hlparsetext does not seem to be
> storing word positions which parsetext does. The word positions are
> important for generating headline with fragments.

Doesn't needed - hlparsetext preserves the whole text, so, position is a number
of array.

>
> 2.
>> I would prefer the signature ts_headline( [regconfig,] text, tsquery
>> [,text] )and function should accept 'NumFragments=>N' for default
>> parser. Another parsers may use another options.

>
> Does this mean we want a unified function ts_headline and we trigger the
> fragments if NumFragments is specified?


Trigger should be inside parser-specific function (pg_ts_parser.prsheadline).
Another parsers might not recognize that option.

> It seems that introducing a new
> function which can take configuration OID, or name is complex as there
> are so many functions handling these issues in wparser.c.

No, of course - ts_headline takes care about finding configuration and calling
correct parser.

>
> If this is true then we need to just add marking of headline words in
> prsd_headline. Otherwise we will need another prsd_headline_with_covers
> function.

Yeah, pg_ts_parser.prsheadline should mark the lexemes to. It even can change
an array of HeadlineParsedText.

>
> 3. In many cases people may already have TSVector for a given document
> (for search operation). Would it be faster to pass TSVector to headline
> function when compared to computing TSVector each time? If that is the
> case then should we have an option to pass TSVector to headline
> function?

As I mentioned above, tsvector doesn;t contain whole information about text.

--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #5 (permalink)  
Old 06-02-2008, 01:36 PM
Sushant Sinha
 
Posts: n/a
Default Re: [GENERAL] Fragments in tsearch2 headline

I have attached a new patch with respect to the current cvs head. This
produces headline in a document for a given query. Basically it
identifies fragments of text that contain the query and displays them.

DESCRIPTION

HeadlineParsedText contains an array of actual words but not
information about the norms. We need an indexed position vector for each
norm so that we can quickly evaluate a number of possible fragments.
Something that tsvector provides.

So this patch changes HeadlineParsedText to contain the norms
(ParsedText). This field is updated while parsing in hlparsetext. The
position information of the norms corresponds to the position of words
in HeadlineParsedText (not to the norms positions as is the case in
tsvector). This works correctly with the current parser. If you think
there may be issues with other parsers please let me know.

This approach does not change any other interface and fits nicely with
the overall framework.

The norms are converted into tsvector and a number of covers are
generated. The best covers are then chosen to be in the headline. The
covers are separated using a hardcoded coversep. Let me know if you want
to expose this as an option.

Covers that overlap with already chosen covers are excluded.

Some options like ShortWord and MinWords are not taken care of right
now. MaxWords are used as maxcoversize. Let me know if you would like to
see other options for fragment generation as well.

Let me know any more changes you would like to see.

-Sushant.

On Tue, 2008-05-27 at 13:30 +0400, Teodor Sigaev wrote:
> Hi!
>
> > 1. Why is hlparsetext used to parse the document rather than the
> > parsetext function? Since words to be included in the headline will be
> > marked afterwords, it seems more reasonable to just use the parsetext
> > function.
> > The main difference I see is the use of hlfinditem and marking whether
> > some word is repeated.

> hlparsetext preserves any kind of lexeme - not indexed, spaces etc. parsetext
> doesn't.
> hlparsetext preserves original form of lexemes. parsetext doesn't.
>
> >
> > The reason this is important is that hlparsetext does not seem to be
> > storing word positions which parsetext does. The word positions are
> > important for generating headline with fragments.

> Doesn't needed - hlparsetext preserves the whole text, so, position is a number
> of array.
>
> >
> > 2.
> >> I would prefer the signature ts_headline( [regconfig,] text, tsquery
> >> [,text] )and function should accept 'NumFragments=>N' for default
> >> parser. Another parsers may use another options.

> >
> > Does this mean we want a unified function ts_headline and we trigger the
> > fragments if NumFragments is specified?

>
> Trigger should be inside parser-specific function (pg_ts_parser.prsheadline).
> Another parsers might not recognize that option.
>
> > It seems that introducing a new
> > function which can take configuration OID, or name is complex as there
> > are so many functions handling these issues in wparser.c.

> No, of course - ts_headline takes care about finding configuration and calling
> correct parser.
>
> >
> > If this is true then we need to just add marking of headline words in
> > prsd_headline. Otherwise we will need another prsd_headline_with_covers
> > function.

> Yeah, pg_ts_parser.prsheadline should mark the lexemes to. It even can change
> an array of HeadlineParsedText.
>
> >
> > 3. In many cases people may already have TSVector for a given document
> > (for search operation). Would it be faster to pass TSVector to headline
> > function when compared to computing TSVector each time? If that is the
> > case then should we have an option to pass TSVector to headline
> > function?

> As I mentioned above, tsvector doesn;t contain whole information about text.
>



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #6 (permalink)  
Old 06-04-2008, 04:27 PM
Teodor Sigaev
 
Posts: n/a
Default Re: [GENERAL] Fragments in tsearch2 headline

> I have attached a new patch with respect to the current cvs head. This
> produces headline in a document for a given query. Basically it
> identifies fragments of text that contain the query and displays them.

New variant is much better, but...

> HeadlineParsedText contains an array of actual words but not
> information about the norms. We need an indexed position vector for each
> norm so that we can quickly evaluate a number of possible fragments.
> Something that tsvector provides.


Why do you need to store norms? The single purpose of norms is identifying words
from query - but it's already done by hlfinditem. It sets
HeadlineWordEntry->item to corresponding QueryOperand in tsquery.
Look, headline function is rather expensive and your patch adds a lot of extra
work - at least in memory usage. And if user calls with NumFragments=0 the that
work is unneeded.

> This approach does not change any other interface and fits nicely with
> the overall framework.

Yeah, it's a really big step forward. Thank you. You are very close to
committing except: Did you find a hlCover() function which produce a cover from
original HeadlineParsedText representation? Is any reason to do not use it?

>
> The norms are converted into tsvector and a number of covers are
> generated. The best covers are then chosen to be in the headline. The
> covers are separated using a hardcoded coversep. Let me know if you want
> to expose this as an option.



>
> Covers that overlap with already chosen covers are excluded.
>
> Some options like ShortWord and MinWords are not taken care of right
> now. MaxWords are used as maxcoversize. Let me know if you would like to
> see other options for fragment generation as well.

ShortWord, MinWords and MaxWords should store their meaning, but for each
fragment, not for the whole headline.


>
> Let me know any more changes you would like to see.


if (num_fragments == 0)
/* call the default headline generator */
mark_hl_words(prs, query, highlight, shortword, min_words, max_words);
else
mark_hl_fragments(prs, query, highlight, num_fragments, max_words);


Suppose, num_fragments < 2?

--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #7 (permalink)  
Old 06-04-2008, 04:28 PM
Sushant Sinha
 
Posts: n/a
Default Re: [GENERAL] Fragments in tsearch2 headline

Efficiency: I realized that we do not need to store all norms. We need
to only store store norms that are in the query. So I moved the addition
of norms from addHLParsedLex to hlfinditem. This should add very little
memory overhead to existing headline generation.

If this is still not acceptable for default headline generation, then I
can push it into mark_hl_fragments. But I think any headline marking
function will benefit by having the norms corresponding to the query.

Why we need norms?

hlCover does the exact thing that Cover in tsrank does which is to find
the cover that contains the query. However hlcover has to go through
words that do not match the query. Cover on the other hand operates on
position indexes for just the query words and so it should be faster.

The main reason why I would I like it to be fast is that I want to
generate all covers for a given query. Then choose covers with smallest
length as they will be the one that will best explain relation of a
query to a document. Finally stretch those covers to the specified size.

In my understanding, the current headline generation tries to find the
biggest cover for display in the headline. I personally think that such
a cover does not explain the context of a query in a document. We may
differ on this and thats why we may need both options.

Let me know what you think on this patch and I will update the patch to
respect other options like MinWords and ShortWord.

NumFragments < 2:
I wanted people to use the new headline marker if they specify
NumFragments >= 1. If they do not specify the NumFragments or put it to
0 then the default marker will be used. This becomes a bit of tricky
parameter so please put in any idea on how to trigger the new marker.

On an another note I found that make_tsvector crashes if it receives a
ParsedText with curwords = 0. Specifically uniqueWORD returns curwords
as 1 even when it gets 0 words. I am not sure if this is the desired
behavior.

-Sushant.


On Mon, 2008-06-02 at 18:10 +0400, Teodor Sigaev wrote:
> > I have attached a new patch with respect to the current cvs head. This
> > produces headline in a document for a given query. Basically it
> > identifies fragments of text that contain the query and displays them.

> New variant is much better, but...
>
> > HeadlineParsedText contains an array of actual words but not
> > information about the norms. We need an indexed position vector for each
> > norm so that we can quickly evaluate a number of possible fragments.
> > Something that tsvector provides.

>
> Why do you need to store norms? The single purpose of norms is identifying words
> from query - but it's already done by hlfinditem. It sets
> HeadlineWordEntry->item to corresponding QueryOperand in tsquery.
> Look, headline function is rather expensive and your patch adds a lot of extra
> work - at least in memory usage. And if user calls with NumFragments=0 the that
> work is unneeded.
>
> > This approach does not change any other interface and fits nicely with
> > the overall framework.

> Yeah, it's a really big step forward. Thank you. You are very close to
> committing except: Did you find a hlCover() function which produce a cover from
> original HeadlineParsedText representation? Is any reason to do not use it?
>
> >
> > The norms are converted into tsvector and a number of covers are
> > generated. The best covers are then chosen to be in the headline. The
> > covers are separated using a hardcoded coversep. Let me know if you want
> > to expose this as an option.

>
>
> >
> > Covers that overlap with already chosen covers are excluded.
> >
> > Some options like ShortWord and MinWords are not taken care of right
> > now. MaxWords are used as maxcoversize. Let me know if you would like to
> > see other options for fragment generation as well.

> ShortWord, MinWords and MaxWords should store their meaning, but for each
> fragment, not for the whole headline.
>
>
> >
> > Let me know any more changes you would like to see.

>
> if (num_fragments == 0)
> /* call the default headline generator */
> mark_hl_words(prs, query, highlight, shortword, min_words, max_words);
> else
> mark_hl_fragments(prs, query, highlight, num_fragments, max_words);
>
>
> Suppose, num_fragments < 2?
>



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #8 (permalink)  
Old 06-04-2008, 04:28 PM
Teodor Sigaev
 
Posts: n/a
Default Re: [GENERAL] Fragments in tsearch2 headline

> Why we need norms?

We don't need norms at all - all matched HeadlineWordEntry already marked by
HeadlineWordEntry->item! If it equals to NULL then this word isn't contained in
tsquery.

> hlCover does the exact thing that Cover in tsrank does which is to find
> the cover that contains the query. However hlcover has to go through
> words that do not match the query. Cover on the other hand operates on
> position indexes for just the query words and so it should be faster.

Cover, by definition, is a minimal continuous text's piece matched by query. May
be a several covers in text and hlCover will find all of them. Next,
prsd_headline() (for now) tries to define the best one. "Best" means: cover
contains a lot of words from query, not less that MinWords, not greater than
MaxWords, hasn't words shorter that ShortWord on the begin and end of cover etc.
>
> The main reason why I would I like it to be fast is that I want to
> generate all covers for a given query. Then choose covers with smallest

hlCover generates all covers.

> Let me know what you think on this patch and I will update the patch to
> respect other options like MinWords and ShortWord.


As I understand, you very wish to call Cover() function instead of hlCover() -
by design, they should be identical, but accepts different document's
representation. So, the best way is generalize them: develop a new one which can
be called with some kind of callback or/and opaque structure to use it in both
rank and headline.

>
> NumFragments < 2:
> I wanted people to use the new headline marker if they specify
> NumFragments >= 1. If they do not specify the NumFragments or put it to

Ok, but if you unify cover generation and NumFragments == 1 then result for old
and new algorithms should be the same...


> On an another note I found that make_tsvector crashes if it receives a
> ParsedText with curwords = 0. Specifically uniqueWORD returns curwords
> as 1 even when it gets 0 words. I am not sure if this is the desired
> behavior.

In all places there is a check before call of make_tsvector.

--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #9 (permalink)  
Old 06-04-2008, 04:28 PM
Sushant Sinha
 
Posts: n/a
Default Re: [GENERAL] Fragments in tsearch2 headline

My main argument for using Cover instead of hlCover was that Cover will
be faster. I tested the default headline generation that uses hlCover
with the current patch that uses Cover. There was not much difference.
So I think you are right in that we do not need norms and we can just
use hlCover.

I also compared performance of ts_headline with my first patch to
headline generation (one which was a separate function and took tsvector
as input). The performance was dramatically different. For one query
ts_headline took roughly 200 ms while headline_with_fragments took just
70 ms. On an another query ts_headline took 76 ms while
headline_with_fragments took 24 ms. You can find 'explain analyze' for
the first query at the bottom of the page.

These queries were run multiple times to ensure that I never hit the
disk. This is a m/c with 2.0 GhZ Pentium 4 CPU and 512 MB RAM running
Linux 2.6.22-gentoo-r8.

A couple of caveats:

1. ts_headline testing was done with current cvs head where as
headline_with_fragments was done with postgres 8.3.1.

2. For headline_with_fragments, TSVector for the document was obtained
by joining with another table.

Are these differences understandable?

If you think these caveats are the reasons or there is something I am
missing, then I can repeat the entire experiments with exactly the same
conditions.

-Sushant.


Here is 'explain analyze' for both the functions:


ts_headline
------------

lawdb=# explain analyze SELECT ts_headline('english', doc, q, '')
FROM docraw, plainto_tsquery('english', 'freedom of
speech') as q
WHERE docraw.tid = 125596;
QUERY
PLAN

Nested Loop (cost=0.00..8.31 rows=1 width=497) (actual
time=199.692..200.207 rows=1 loops=1)
-> Index Scan using docraw_pkey on docraw (cost=0.00..8.29 rows=1
width=465) (actual time=0.041..0.065 rows=1 loops=1)
Index Cond: (tid = 125596)
-> Function Scan on q (cost=0.00..0.01 rows=1 width=32) (actual
time=0.010..0.014 rows=1 loops=1)
Total runtime: 200.311 ms


headline_with_fragments
-----------------------

lawdb=# explain analyze SELECT headline_with_fragments('english',
docvector, doc, q, 'MaxWords=40')
FROM docraw, docmeta, plainto_tsquery('english', 'freedom
of speech') as q
WHERE docraw.tid = 125596 and docmeta.tid=125596;
QUERY
PLAN
----------------------
Nested Loop (cost=0.00..16.61 rows=1 width=883) (actual
time=70.564..70.949 rows=1 loops=1)
-> Nested Loop (cost=0.00..16.59 rows=1 width=851) (actual
time=0.064..0.094 rows=1 loops=1)
-> Index Scan using docraw_pkey on docraw (cost=0.00..8.29
rows=1 width=454) (actual time=0.040..0.044 rows=1 loops=1)
Index Cond: (tid = 125596)
-> Index Scan using docmeta_pkey on docmeta (cost=0.00..8.29
rows=1 width=397) (actual time=0.017..0.040 rows=1 loops=1)
Index Cond: (docmeta.tid = 125596)
-> Function Scan on q (cost=0.00..0.01 rows=1 width=32) (actual
time=0.012..0.016 rows=1 loops=1)
Total runtime: 71.076 ms
(8 rows)


On Tue, 2008-06-03 at 22:53 +0400, Teodor Sigaev wrote:
> > Why we need norms?

>
> We don't need norms at all - all matched HeadlineWordEntry already marked by
> HeadlineWordEntry->item! If it equals to NULL then this word isn't contained in
> tsquery.
>
> > hlCover does the exact thing that Cover in tsrank does which is to find
> > the cover that contains the query. However hlcover has to go through
> > words that do not match the query. Cover on the other hand operates on
> > position indexes for just the query words and so it should be faster.

> Cover, by definition, is a minimal continuous text's piece matched by query. May
> be a several covers in text and hlCover will find all of them. Next,
> prsd_headline() (for now) tries to define the best one. "Best" means: cover
> contains a lot of words from query, not less that MinWords, not greater than
> MaxWords, hasn't words shorter that ShortWord on the begin and end of cover etc.
> >
> > The main reason why I would I like it to be fast is that I want to
> > generate all covers for a given query. Then choose covers with smallest

> hlCover generates all covers.
>
> > Let me know what you think on this patch and I will update the patch to
> > respect other options like MinWords and ShortWord.

>
> As I understand, you very wish to call Cover() function instead of hlCover() -
> by design, they should be identical, but accepts different document's
> representation. So, the best way is generalize them: develop a new one which can
> be called with some kind of callback or/and opaque structure to use it in both
> rank and headline.
>
> >
> > NumFragments < 2:
> > I wanted people to use the new headline marker if they specify
> > NumFragments >= 1. If they do not specify the NumFragments or put it to

> Ok, but if you unify cover generation and NumFragments == 1 then result for old
> and new algorithms should be the same...
>
>
> > On an another note I found that make_tsvector crashes if it receives a
> > ParsedText with curwords = 0. Specifically uniqueWORD returns curwords
> > as 1 even when it gets 0 words. I am not sure if this is the desired
> > behavior.

> In all places there is a check before call of make_tsvector.
>



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #10 (permalink)  
Old 06-23-2008, 11:53 AM
Sushant Sinha
 
Posts: n/a
Default Re: [GENERAL] Fragments in tsearch2 headline

I have an attached an updated patch with following changes:

1. Respects ShortWord and MinWords
2. Uses hlCover instead of Cover
3. Does not store norm (or lexeme) for headline marking
4. Removes ts_rank.h
5. Earlier it was counting even NONWORDTOKEN in the headline. Now it
only counts the actual words and excludes spaces etc.

I have also changed NumFragments option to MaxFragments as there may not
be enough covers to display NumFragments.

Another change that I was thinking:

Right now if cover size > max_words then I just cut the trailing words.
Instead I was thinking that we should split the cover into more
fragments such that each fragment contains a few query words. Then each
fragment will not contain all query words but will show more occurrences
of query words in the headline. I would like to know what your opinion
on this is.

-Sushant.

On Thu, 2008-06-05 at 20:21 +0400, Teodor Sigaev wrote:
> > A couple of caveats:
> >
> > 1. ts_headline testing was done with current cvs head where as
> > headline_with_fragments was done with postgres 8.3.1.
> > 2. For headline_with_fragments, TSVector for the document was obtained
> > by joining with another table.
> > Are these differences understandable?

>
> That is possible situation because ts_headline has several criterias of 'best'
> covers - length, number of words from query, good words at the begin and at the
> end of headline while your fragment's algorithm takes care only on total number
> of words in all covers. It's not very good, but it's acceptable, I think.
> Headline (and ranking too) hasn't any formal rules to define is it good or bad?
> Just a people's opinions.
>
> Next possible reason: original algorithm had a look on all covers trying to find
> the best one while your algorithm tries to find just the shortest covers to fill
> a headline.
>
> But it's very desirable to use ShortWord - it's not very comfortable for user if
> one option produces unobvious side effect with another one.
> `
>
> > If you think these caveats are the reasons or there is something I am
> > missing, then I can repeat the entire experiments with exactly the same
> > conditions.

>
> Interesting for me test is a comparing hlCover with Cover in your patch, i.e.
> develop a patch which uses hlCover instead of Cover and compare old patch with
> new one.



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
Reply


Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are On
Forum Jump


All times are GMT. The time now is 02:47 AM.


Powered by vBulletin® Version 3.6.5
Copyright ©2000 - 2008, Jelsoft Enterprises Ltd.
SEO by vBSEO 3.2.0
www.UnixAdminTalk.com