Unix Technical Forum

SEO

vBulletin Search Engine Optimization


Go Back   Unix Technical Forum > Database Server Software > PostgreSQL > pgsql Hackers

Register FAQ Members List Calendar Search Today's Posts Mark Forums Read
  #1 (permalink)  
Old 04-11-2008, 07:25 AM
Simon Riggs
 
Posts: n/a
Default Unsplitting btree index leaf pages

When we discussed online REINDEX recently we focused on the REINDEX
command itself rather than look at alternative approaches.

One reason to REINDEX is because of index page splits getting things out
of sequence and generally bloating the index.

When we VACUUM, each index is scanned in logical order.

While we scan, if we found two adjacent pages, both of which have less
than (say) 40% rows, we could re-join or "unsplit" those pages together.
The index blocks are fully locked during the read anyway and there is no
MVCC problem with moving index rows between blocks. All we have to do is
to lock both blocks, having locked them in the correct order.

The rows would always be moved to the lowest physical block id, so that
data would naturally migrate towards the start of the index file. Blocks
would then be marked half-dead just as if they had just had their last
index row removed by the vacuum.

We could start the scan by locking block 1 AND block2, then scan forward
always holding 2 locks as we go. That way we would not need to unlock
and relock the blocks to lock two blocks. The concurrency loss would not
be that great, but we would gain the ability to unsplit the two blocks
into one.

If we do this, we would could possibly avoid the need to full REINDEX
entirely.

If this method checks out we could do one of:
- make VACUUM do this always
- add an option for REINDEX: CLEAN/COMPRESS/VACUUM etc to do this upon
command only

Best Regards, Simon Riggs


---------------------------(end of broadcast)---------------------------
TIP 5: don't forget to increase your free space map settings

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #2 (permalink)  
Old 04-11-2008, 07:25 AM
Tom Lane
 
Posts: n/a
Default Re: Unsplitting btree index leaf pages

Simon Riggs <simon@2ndquadrant.com> writes:
> While we scan, if we found two adjacent pages, both of which have less
> than (say) 40% rows, we could re-join or "unsplit" those pages together.


Curiously enough, this has been thought of before. It is not as easy
as you think, or it would have been done the first time around.
nbtree/README explains why:

: We consider deleting an entire page from the btree only when it's become
: completely empty of items. (Merging partly-full pages would allow better
: space reuse, but it seems impractical to move existing data items left or
: right to make this happen --- a scan moving in the opposite direction
: might miss the items if so. We could do it during VACUUM FULL, though.)

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 1: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #3 (permalink)  
Old 04-11-2008, 07:25 AM
Simon Riggs
 
Posts: n/a
Default Re: Unsplitting btree index leaf pages

On Wed, 2005-12-21 at 19:07 -0500, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > While we scan, if we found two adjacent pages, both of which have less
> > than (say) 40% rows, we could re-join or "unsplit" those pages together.

>
> Curiously enough, this has been thought of before. It is not as easy
> as you think, or it would have been done the first time around.
> nbtree/README explains why:
>
> : We consider deleting an entire page from the btree only when it's become
> : completely empty of items. (Merging partly-full pages would allow better
> : space reuse, but it seems impractical to move existing data items left or
> : right to make this happen --- a scan moving in the opposite direction
> : might miss the items if so. We could do it during VACUUM FULL, though.)


Sorry, I missed that.

"Seems impractical"? Complex, yes. But I think it sounds a lot simpler
than the online REINDEX scheme... which is also the reason enhancing
VACUUM FULL doesn't appeal.

During the first VACUUM index scan, we can identify pages that are
candidates for reorganisation. Then during the second physical scan that
follows we can reorg pages in the same manner we delete them.

We identify a page during VACUUM for reorg if:
- it is < 20% full and we want to write this page
- the following page is < 20% full and we want to write this page
- it has a higher physical block number than the following page
That way we aren't super aggressive about reorg, but the cases we do
pick have a tendency to keep the index clustered. (Perhaps that could be
settable via an index optional parameter). That definition also means we
have almost no additional I/O over what the VACUUM would have written
anyway.

Page deletion requires us to lock left, lock right, lock above. That is
exactly the same if we have identified the page for reorg, since once we
have locked left and right, we just move rows to one or both of the
those other blocks, then perform the marking half-dead.

I know you've considered these things deeply, but my viewpoint is that
when what you have is already very good the only way forward consists of
considering seemingly foolish thoughts: backtracking.

Best Regards, Simon Riggs


---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
choose an index scan if your joining column's datatypes do not
match

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #4 (permalink)  
Old 04-11-2008, 07:25 AM
Tom Lane
 
Posts: n/a
Default Re: Unsplitting btree index leaf pages

Simon Riggs <simon@2ndquadrant.com> writes:
> Sorry, I missed that.


And you evidently still didn't understand it. Locking both pages does
not fix the problem, because it doesn't guarantee that there's not a
concurrent indexscan in flight from one to the other. If you move items
from one page to the other in the opposite direction from the way the
scan is going, then it will miss those items. If we try to fix this by
making scans lock one page before releasing the previous, then we'll
create a bunch of deadlock cases.

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #5 (permalink)  
Old 04-11-2008, 07:25 AM
Martijn van Oosterhout
 
Posts: n/a
Default Re: Unsplitting btree index leaf pages

On Thu, Dec 22, 2005 at 10:40:24AM -0500, Tom Lane wrote:
> And you evidently still didn't understand it. Locking both pages does
> not fix the problem, because it doesn't guarantee that there's not a
> concurrent indexscan in flight from one to the other. If you move items
> from one page to the other in the opposite direction from the way the
> scan is going, then it will miss those items. If we try to fix this by
> making scans lock one page before releasing the previous, then we'll
> create a bunch of deadlock cases.


I've been thinking, maybe there's a lockless way of going about this?
Have some kind of index modification identifier that you set at the
beginning of the index scan. What you do is mark the tuples you want to
move with and IMI (I'm trying to avoid the word transaction here) and then
copy the tuples to the new page with IMI+1. Any currently running index
scan will notice the higher IMI and ignore them. When all old index
scans are done, you can remove the old ones.

It's sort of a mini-MVCC for indexes except you could probably just use
a few states: visible to all, visible to current scan, invisible to
current scan. Or use two bits to represent frozen, 1, 2 and 3. A plain
VACUUM could do the state transistions each time it moves through the
index.

The downsides are probably that it's a pain to make it work
concurrently and requires writing each index page at least twice. But
it's a thought...

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.


-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.0.6 (GNU/Linux)
Comment: For info see http://www.gnupg.org

iD8DBQFDqs1SIB7bNG8LQkwRAjw3AJ0eLt8AJCgSusBprZbvMg wbk6epeQCfcZRU
HrY31PeaWfUv2ZqSDnFlDac=
=vVj/
-----END PGP SIGNATURE-----

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #6 (permalink)  
Old 04-11-2008, 07:26 AM
Alvaro Herrera
 
Posts: n/a
Default Re: Unsplitting btree index leaf pages

Martijn van Oosterhout wrote:

> The downsides are probably that it's a pain to make it work
> concurrently and requires writing each index page at least twice. But
> it's a thought...


We already do something similar for page deletions. Empty pages are not
deleted right away, but they are marked with BTP_DEAD, and then deleted
on a subsequent vacuum. Or something like that, I don't remember the
exact details.

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #7 (permalink)  
Old 04-11-2008, 07:26 AM
Tom Lane
 
Posts: n/a
Default Re: Unsplitting btree index leaf pages

Alvaro Herrera <alvherre@commandprompt.com> writes:
> We already do something similar for page deletions. Empty pages are not
> deleted right away, but they are marked with BTP_DEAD, and then deleted
> on a subsequent vacuum. Or something like that, I don't remember the
> exact details.


Right, and the reason for that is exactly that there might be a
concurrent indexscan already "in flight" to the newly-dead page.
We must wait to recycle the page until we are certain no such scans
remain.

It doesn't matter whether a concurrent indexscan visits the dead
page or not, *because it's empty* and so there's nothing to miss.
So there's no race condition. But if you try to move valid data
across pages then there is a race condition.

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 5: don't forget to increase your free space map settings

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #8 (permalink)  
Old 04-11-2008, 07:26 AM
Simon Riggs
 
Posts: n/a
Default Re: Unsplitting btree index leaf pages

On Thu, 2005-12-22 at 10:40 -0500, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > Sorry, I missed that.

>
> And you evidently still didn't understand it. Locking both pages does
> not fix the problem, because it doesn't guarantee that there's not a
> concurrent indexscan in flight from one to the other. If you move items
> from one page to the other in the opposite direction from the way the
> scan is going, then it will miss those items. If we try to fix this by
> making scans lock one page before releasing the previous, then we'll
> create a bunch of deadlock cases.


Thank you for explaining things; I'm sorry you had to do it twice.

I'm ever optimistic, so I try to resist the temptation to stay quiet in
case I make a mistake.

Best Regards, Simon Riggs



---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #9 (permalink)  
Old 04-11-2008, 07:28 AM
Manfred Koizar
 
Posts: n/a
Default Re: Unsplitting btree index leaf pages

On Thu, 22 Dec 2005 10:40:24 -0500, Tom Lane <tgl@sss.pgh.pa.us>
wrote:
>If you move items
>from one page to the other in the opposite direction from the way the
>scan is going, then it will miss those items.


AFAIU the (PG implementaion of the) L&Y method is designed to make
scans immune against problems caused by items moving right within the
same page and against page splits, i.e. items moving to a *new* right
sibling. So making scans work with items moving to an *existing*
right sibling doesn't seem out of reach.

The code following this comment in _bt_restscan
/*
* The item we're looking for moved right at least one page, so
* move right. We are careful here to pin and read-lock the next
* non-dead page before releasing the current one. This ensures
* that a concurrent btbulkdelete scan cannot pass our position
* --- if it did, it might be able to reach and delete our target
* item before we can find it again.
*/
looks like it'd work for the case of page merging as well, as long as
we are careful to move items always from left to right.

BTW, if after having locked both pages we find that we have
super-exclusive locks, i.e. nobody else has pins on these pages, we
can reorganize much more agressively. It might even be safe to move
items to the left page. The parent page might need some special
handling, though.

Servus
Manfred

---------------------------(end of broadcast)---------------------------
TIP 3: Have you checked our extensive FAQ?

http://www.postgresql.org/docs/faq

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #10 (permalink)  
Old 04-11-2008, 07:28 AM
Kevin Brown
 
Posts: n/a
Default Re: Unsplitting btree index leaf pages

Tom Lane wrote:
> Alvaro Herrera <alvherre@commandprompt.com> writes:
> > We already do something similar for page deletions. Empty pages are not
> > deleted right away, but they are marked with BTP_DEAD, and then deleted
> > on a subsequent vacuum. Or something like that, I don't remember the
> > exact details.

>
> Right, and the reason for that is exactly that there might be a
> concurrent indexscan already "in flight" to the newly-dead page.
> We must wait to recycle the page until we are certain no such scans
> remain.
>
> It doesn't matter whether a concurrent indexscan visits the dead
> page or not, *because it's empty* and so there's nothing to miss.
> So there's no race condition. But if you try to move valid data
> across pages then there is a race condition.


Hmm...

Well, REINDEX is apparently a very expensive operation right now. But
how expensive would it be to go through the entire index and perform
the index page merge operation being discussed here, and nothing else?

If it's fast enough, might it be worthwhile to implement just this
alone as a separate maintenance command (e.g., VACUUM INDEX) that
acquires the appropriate lock (AccessExclusive, I'd expect) on the
index to prevent exactly the issues you're concerned about?

If it's fast enough even on large tables, it would be a nice
alternative to REINDEX, I'd think.


--
Kevin Brown kevin@sysexperts.com

---------------------------(end of broadcast)---------------------------
TIP 3: Have you checked our extensive FAQ?

http://www.postgresql.org/docs/faq

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
Reply


Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are On
Forum Jump


All times are GMT. The time now is 11:16 AM.


Powered by vBulletin® Version 3.6.5
Copyright ©2000 - 2008, Jelsoft Enterprises Ltd.
SEO by vBSEO 3.2.0
UnixAdminTalk.com

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587