Unix Technical Forum

SEO

vBulletin Search Engine Optimization


Go Back   Unix Technical Forum > Database Server Software > PostgreSQL > Pgsql Patches

Register FAQ Members List Calendar Search Today's Posts Mark Forums Read
  #1 (permalink)  
Old 04-18-2008, 09:35 AM
Teodor Sigaev
 
Posts: n/a
Default Re: First implementation of GIN for pg_trgm

> From a previous discussion with Teodor, it would be better to store an
> int in the index instead of a text (it takes less space and is
> faster). I couldn't find any example so if anyone has an advice to fix
> that, it's welcome (mostly how to pack the trigram into an int instead
> of a text).


Something like that:
trg = generate_trgm(VARDATA(text), VARSIZE(text) - VARHDRSZ);
nentries = ARRNELEM(trg);
if ( nentries > 0 )
{
*entries = palloc(sizeof(Datum)*nentries);
for(i=0;i<nentries;i++)
{
int tmp=0;
trgm *ptr = GETARR(trg)+i;

CPTRGM(&tmp, ptr);
tmp >>= 8;
entries[i] = Int32GetDatum(tmp);
}
}
Do not forget to change CREATE OPERATOR CLASS accordingly.

>
> The last problem is that similarity calculated in the GIN index is
> higher than the one with GIST so I have to set the trgm_limit quite
> high to have decent results (a limit of 0.8 instead of 0.3 seems to be
> quite good).
> AFAICS, it comes from the fact that I couldn't find any way to get the
> length of the indexed trigram which is taken into account with GIST so
> we're not exactly filtering the results in the same way.
> Does anyone have an idea on how to fix this point?


For calculating similarity, you should have three value: length of first word
(let it be a indexed text) in trigrams, length of second word (query word) and
number of the same trigrams on both words. It's a pity, but during index scan we
don't know length of indexed text. So, in index scan (consistent function) we
could not compute exact value of similarity, but we can compute lower limit.
For example, if our query has 50 trigrams and only one of them is a common for
indexed value and query we can conclude that indexed value can not be similar to
our query. So, our consistent function should say FALSE when indexed value is
not similar to query exactly and TRUE in opposite case. Let lquery is a length
of query and ncommon is a number of common trigrams (look at cnt_sml()
function), and consistent function should be:
#ifdef DIVUNION
/* original formula is: count/(lvalue+lquery-lcommon), so
with any lvalue > 0 resulting similarity is smaller than
computed below */
return ( count/(lquery-lcommon) > limit ) ? TRUE : FALSE;
#else
/* original formula is: count/max(lvalue,lquery) - the same discourse */
return ( count/lquery > limit ) ? TRUE : FALSE;
#endif

Now consistent function doesn't guarantee exact result, so we should mark '%'
operator in CREATE OPERATOR CLASS with RECHECK option.

--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/

---------------------------(end of broadcast)---------------------------
TIP 2: Don't 'kill -9' the postmaster

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #2 (permalink)  
Old 04-18-2008, 09:35 AM
Guillaume Smet
 
Posts: n/a
Default Re: First implementation of GIN for pg_trgm

On 2/22/07, Teodor Sigaev <teodor@sigaev.ru> wrote:
> > From a previous discussion with Teodor, it would be better to store an
> > int in the index instead of a text (it takes less space and is
> > faster). I couldn't find any example so if anyone has an advice to fix
> > that, it's welcome (mostly how to pack the trigram into an int instead
> > of a text).

>
> Something like that: [snip]


I worked on that this morning before I got your email. You'll find
attached a second version of the patch using int4 to store the
trigrams (it's not exactly what you gave me but it should work I
think).

I didn't see any improvement in terms of size of the index (14 MB for
642 738 rows in the index in both cases) or speed.
Our dictionary table contains 78367 words and its size is 3 MB.

Did I miss something?

If there's no benefit to use an int4, I propose we keep the text
version because it will be better if we add UTF-8 support someday.

> our query. So, our consistent function should say FALSE when indexed value is
> not similar to query exactly and TRUE in opposite case. Let lquery is a length
> of query and ncommon is a number of common trigrams (look at cnt_sml()
> function), and consistent function should be:


Yes, that's what I did in my patch if I understand you correctly but
it's not very selective IMHO.

> Now consistent function doesn't guarantee exact result, so we should mark '%'
> operator in CREATE OPERATOR CLASS with RECHECK option.


The attached patch adds a RECHECK too. It seems to work correctly but
the RECHECK COND costs a lot of time :/.

At the moment, I have the following results:
GIST:
-------
test=# EXPLAIN ANALYZE SELECT word, similarity('aub', word) AS sml
FROM lieu_mots_gist WHERE word % 'aub' ORDER BY sml DESC;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
Sort (cost=204.40..204.59 rows=78 width=11) (actual
time=98.028..98.132 rows=50 loops=1)
Sort Key: similarity('aub'::text, word)
-> Bitmap Heap Scan on lieu_mots_gist (cost=4.88..201.95 rows=78
width=11) (actual time=97.575..97.861 rows=50 loops=1)
Recheck Cond: (word % 'aub'::text)
-> Bitmap Index Scan on idx_word_gist (cost=0.00..4.86
rows=78 width=0) (actual time=97.539..97.539 rows=50 loops=1)
Index Cond: (word % 'aub'::text)
Total runtime: 98.303 ms
(7 rows)

test=# EXPLAIN ANALYZE SELECT word, similarity('auberge', word) AS sml
FROM lieu_mots_gist WHERE word % 'auberge' ORDER BY sml DESC;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
Sort (cost=204.40..204.59 rows=78 width=11) (actual
time=135.128..135.196 rows=41 loops=1)
Sort Key: similarity('auberge'::text, word)
-> Bitmap Heap Scan on lieu_mots_gist (cost=4.88..201.95 rows=78
width=11) (actual time=134.829..135.016 rows=41 loops=1)
Recheck Cond: (word % 'auberge'::text)
-> Bitmap Index Scan on idx_word_gist (cost=0.00..4.86
rows=78 width=0) (actual time=134.797..134.797 rows=41 loops=1)
Index Cond: (word % 'auberge'::text)
Total runtime: 135.335 ms
(7 rows)

With GIN:
------------

test=# EXPLAIN ANALYZE SELECT word, similarity('aub', word) AS sml
FROM lieu_mots_gin WHERE word % 'aub' ORDER BY sml DESC;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
Sort (cost=208.45..208.64 rows=78 width=11) (actual
time=60.409..60.513 rows=50 loops=1)
Sort Key: similarity('aub'::text, word)
-> Bitmap Heap Scan on lieu_mots_gin (cost=8.93..205.99 rows=78
width=11) (actual time=10.279..60.228 rows=50 loops=1)
Filter: (word % 'aub'::text)
-> Bitmap Index Scan on idx_word_gin (cost=0.00..8.91
rows=78 width=0) (actual time=10.069..10.069 rows=6676 loops=1)
Index Cond: (word % 'aub'::text)
Total runtime: 60.688 ms
(7 rows)

test=# EXPLAIN ANALYZE SELECT word, similarity('auberge', word) AS sml
FROM lieu_mots_gin WHERE word % 'auberge' ORDER BY sml DESC;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
Sort (cost=208.45..208.64 rows=78 width=11) (actual
time=50.293..50.381 rows=41 loops=1)
Sort Key: similarity('auberge'::text, word)
-> Bitmap Heap Scan on lieu_mots_gin (cost=8.93..205.99 rows=78
width=11) (actual time=21.006..50.122 rows=41 loops=1)
Filter: (word % 'auberge'::text)
-> Bitmap Index Scan on idx_word_gin (cost=0.00..8.91
rows=78 width=0) (actual time=20.839..20.839 rows=928 loops=1)
Index Cond: (word % 'auberge'::text)
Total runtime: 50.534 ms
(7 rows)

--
Guillaume


---------------------------(end of broadcast)---------------------------
TIP 2: Don't 'kill -9' the postmaster

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #3 (permalink)  
Old 04-18-2008, 09:35 AM
Teodor Sigaev
 
Posts: n/a
Default Re: First implementation of GIN for pg_trgm

> I didn't see any improvement in terms of size of the index (14 MB for
> 642 738 rows in the index in both cases) or speed.
> Our dictionary table contains 78367 words and its size is 3 MB.
>
> Did I miss something?

Comparing integers is cheaper than strings. Although it hasn't significant
matter for index scan.

> The attached patch adds a RECHECK too. It seems to work correctly but
> the RECHECK COND costs a lot of time :/.




How long is average length of strings in table?
--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/

---------------------------(end of broadcast)---------------------------
TIP 2: Don't 'kill -9' the postmaster

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #4 (permalink)  
Old 04-18-2008, 09:35 AM
Guillaume Smet
 
Posts: n/a
Default Re: First implementation of GIN for pg_trgm

On 2/22/07, Teodor Sigaev <teodor@sigaev.ru> wrote:
> How long is average length of strings in table?


test=# SELECT MIN(length(word)), MAX(length(word)), AVG(length(word))
FROM lieu_mots_gin;
min | max | avg
-----+-----+--------------------
1 | 38 | 7.4615463141373282
(1 row)

I don't see how to have a more precise similarity without having the
number of entries registered by the indexed value somewhere.

I think it can be interesting for other flavours of GIN usage. Is
there a way to add the number of entries of the considered indexed
item to the consistent prototype without adding too much overhead and
complexity?

--
Guillaume

---------------------------(end of broadcast)---------------------------
TIP 5: don't forget to increase your free space map settings

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #5 (permalink)  
Old 04-18-2008, 09:35 AM
Oleg Bartunov
 
Posts: n/a
Default Re: First implementation of GIN for pg_trgm

On Thu, 22 Feb 2007, Guillaume Smet wrote:

> On 2/22/07, Teodor Sigaev <teodor@sigaev.ru> wrote:
>> How long is average length of strings in table?

>
> test=# SELECT MIN(length(word)), MAX(length(word)), AVG(length(word))
> FROM lieu_mots_gin;
> min | max | avg
> -----+-----+--------------------
> 1 | 38 | 7.4615463141373282
> (1 row)
>
> I don't see how to have a more precise similarity without having the
> number of entries registered by the indexed value somewhere.
>
> I think it can be interesting for other flavours of GIN usage. Is
> there a way to add the number of entries of the considered indexed
> item to the consistent prototype without adding too much overhead and
> complexity?


You're right, it would be nice.
This is what we need for faster ranking in tsearch2, since currently we should
consult heap to get positional information, which slowdowns search.
We didn't investigate the possibility to keep additional information with
index, but keep in mind, that everything should works without index.

Regards,
Oleg
__________________________________________________ ___________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

---------------------------(end of broadcast)---------------------------
TIP 3: Have you checked our extensive FAQ?

http://www.postgresql.org/docs/faq

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #6 (permalink)  
Old 04-18-2008, 09:35 AM
Teodor Sigaev
 
Posts: n/a
Default Re: First implementation of GIN for pg_trgm


> I think it can be interesting for other flavours of GIN usage. Is
> there a way to add the number of entries of the considered indexed
> item to the consistent prototype without adding too much overhead and
> complexity?

We are thinking about adding extra value, but it's still only thinking.

---------------------------(end of broadcast)---------------------------
TIP 5: don't forget to increase your free space map settings

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #7 (permalink)  
Old 04-18-2008, 09:35 AM
Guillaume Smet
 
Posts: n/a
Default Re: First implementation of GIN for pg_trgm

On 2/22/07, Oleg Bartunov <oleg@sai.msu.su> wrote:
> You're right, it would be nice.
> This is what we need for faster ranking in tsearch2, since currently we should
> consult heap to get positional information, which slowdowns search.
> We didn't investigate the possibility to keep additional information with
> index, but keep in mind, that everything should works without index.


OK, thanks for your answer. If you do it one day or another, please
take into account the case of pg_trgm too as it will be far faster if
we can access to the number of entries of the indexed value in the
consistent function. As this information is available if you don't use
the index, it won't be a problem, I think.

Is there anything else I should fix/improve in this patch?

It could be nice to test it on other distribution of words and see if
it performs better than gist in other cases too. I'll try to test it
here on another table we need to index with tsearch2.

--
Guillaume

---------------------------(end of broadcast)---------------------------
TIP 1: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
Reply


Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are On
Forum Jump


All times are GMT. The time now is 04:31 AM.


Powered by vBulletin® Version 3.6.5
Copyright ©2000 - 2008, Jelsoft Enterprises Ltd.
SEO by vBSEO 3.2.0
UnixAdminTalk.com

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771