Unix Technical Forum

Page at a time index scan

This is a discussion on Page at a time index scan within the Pgsql Patches forums, part of the PostgreSQL category; --> Heikki Linnakangas <hlinnaka@iki.fi> writes: > The point remains, however. A page won't get deleted while a scan > might ...


Go Back   Unix Technical Forum > Database Server Software > PostgreSQL > Pgsql Patches

FAQ Members List Calendar Search Today's Posts Mark Forums Read
  #21 (permalink)  
Old 04-18-2008, 01:35 AM
Tom Lane
 
Posts: n/a
Default Re: Page at a time index scan

Heikki Linnakangas <hlinnaka@iki.fi> writes:
> The point remains, however. A page won't get deleted while a scan
> might still be interested in it, because deleted pages are not
> immediately recycled (except on vacuum full), and the left and right
> sibling pointers stay intact until no transaction can be interested in it.


Right. AFAICS there isn't any assumption there that isn't already made
by indexscans, since we drop pin before moving to the adjacent page
anyway. You still have to be prepared to deal with the same situations.
(The new assumption is that index items won't be moved onto a
pre-existing right sibling page; the old scan logic didn't assume that.)

Heikki, were you planning to make a round of revisions in the patch,
or is this as far as you wanted to take it?

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 4: Have you searched our list archives?

http://archives.postgresql.org

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #22 (permalink)  
Old 04-18-2008, 01:35 AM
Heikki Linnakangas
 
Posts: n/a
Default Re: Page at a time index scan

On Wed, 3 May 2006, Tom Lane wrote:

> Heikki, were you planning to make a round of revisions in the patch,
> or is this as far as you wanted to take it?


Here's an updated patch. It's the same as the original, but merged with
the changes to the vacuum_cleanup API you committed, and some comment and
space fixes here and there.

To-do:
* release pin earlier after reading the page, as discussed
* allocate memory for markpos/restrpos only when needed
* calculate a more precise estimate for the maximum number of items on an
index page

Not big things, but I don't have the time to work on them at the moment.

- Heikki

---------------------------(end of broadcast)---------------------------
TIP 3: Have you checked our extensive FAQ?

http://www.postgresql.org/docs/faq

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #23 (permalink)  
Old 04-18-2008, 01:35 AM
Tom Lane
 
Posts: n/a
Default Re: Page at a time index scan

Heikki Linnakangas <hlinnaka@iki.fi> writes:
> Here's an updated patch.


Uh ... no patch actually attached?

> To-do:
> Not big things, but I don't have the time to work on them at the moment.


I can take it from here if you'll send me what you have.

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #24 (permalink)  
Old 04-18-2008, 01:36 AM
Tom Lane
 
Posts: n/a
Default Re: Page at a time index scan

I've just realized that the locking considerations in this patch are
considerably more complicated than we thought. The discussion so far
considered only half of what the deletion interlock needs to accomplish.
We have to ensure that indexscans don't lose their place, which is
solved in the patch by stopping between pages rather than on a specific
item. But we also have to ensure that indexscans don't return pointers
to heap tuples that have already been deleted by VACUUM --- at least in
the case where the caller is using a non-MVCC snapshot. Else, if the
tuple is replaced by a newer tuple before the caller is able to visit
it, the caller might mistakenly accept that tuple as the one it wanted.

The problematic scenario looks like this:

1. indexscan slurps a bunch of heap TIDs into local memory. It keeps
pin, but not lock, on the index page it got the TIDs from.

2. someone else splits the index page, which is allowed since the page
is only pinned not locked. Now, some of the index items are on a new
page that the indexscan does not have pinned.

3. vacuum comes along and removes some of the moved items. Since they
are on a page that no one has pinned, even the super-exclusive lock
that vacuum takes won't prevent this.

4. vacuum removes the corresponding heap tuples.

5. someone else installs new, unrelated, tuples into the freed heap
slots.

6. indexscan finally visits the heap. It finds tuples that are valid
per its snapshot, but aren't what it thought it was finding (they don't
necessarily match the index key).

While we could prevent this by rechecking whether fetched heap tuples
match the index search condition, that costs an awful lot of cycles
for such an obviously rare problem. We need a locking-based solution
instead.

I believe an appropriate fix looks like this:

1. Indexscan is required to keep pin on the page it read the items
from (even though we realize they may not still be on that page).
The pin can only be dropped when we are no longer going to return any
more items read from the page.

2. btbulkdelete is required to get super-exclusive lock on *every leaf
page in the index*. It must not try to optimize this down to just the
pages containing deletable items.

With these rules, even though vacuum might have physically deleted the
index item that the indexscan is reporting to its caller, the
btbulkdelete call will not have been able to complete. (If bulkdelete
arrived at the original page before the indexscan did, then of course
it'd already have deleted the item and there's no problem. If it
arrives after, then it'll be blocked by not being able to get
super-exclusive lock.) Hence vacuum will not have removed the heap
tuple yet, even though the index item might be gone.

We could improve concurrency if we extend the API so that btgettuple
knows whether it is fetching for an MVCC-safe snapshot or not. When the
scan is using an MVCC-safe snapshot it's OK to release pin immediately.
However I'm not sure if this is worth the trouble --- it probably makes
sense to hold onto the pin anyway, in case we're told to mark some of
the tuples killed.

The patch as given gets point 2 right, but I think it doesn't
consistently do point 1, and in any case it's certainly not documenting
the issue properly.

BTW, I just realized another bug in the patch: btbulkdelete fails to
guarantee that it visits every page in the index. It was OK for
btvacuumcleanup to ignore pages added to the index after it starts,
but btbulkdelete has to deal with such pages.

One thing I'm kind of wondering is whether amgetmulti should still exist
as a separate API. Seems like if amgettuple is doing the equivalent
thing under-the-hood, then amgetmulti isn't saving us anything except a
few trips through FunctionCallN. It's not at all clear that it's worth
the API complexity of having two different fetch functions.

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 4: Have you searched our list archives?

http://archives.postgresql.org

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #25 (permalink)  
Old 04-18-2008, 01:36 AM
Tom Lane
 
Posts: n/a
Default Re: Page at a time index scan

I wrote:
> BTW, I just realized another bug in the patch: btbulkdelete fails to
> guarantee that it visits every page in the index. It was OK for
> btvacuumcleanup to ignore pages added to the index after it starts,
> but btbulkdelete has to deal with such pages.


Actually, as written this patch does not work. At all. btbulkdelete
has to guarantee that it removes every index entry it's told to, and
it cannot guarantee that in the presence of concurrent page splits.
A split could move index items from a page that btbulkdelete hasn't
visited to one it's already passed over. This is not possible with an
index-order traversal (because splits only move items to the right)
but it's definitely possible with a physical-order traversal.

I was toying with the idea of remembering deletable pages (which
btvacuumcleanup does anyway), which are the only ones that page splits
could move items to, and then rescanning those after the completion
of the primary pass. This has a couple of pretty unpleasant
consequences though:
* We have to remember *every* deletable page for correctness, compared
to the current situation where btvacuumcleanup can bound the number of
pages it tracks. This creates a situation where VACUUM may fail
outright if it doesn't have gobs of memory. Since one of the main
reasons for developing lazy VACUUM was to ensure we could vacuum
arbitrarily large tables in bounded memory, I'm not happy with this.
* The rescan could be far from cheap if there are many such pages.

Also, I'm not sure when you can stop: it certainly seems possible that
items could be moved down during the primary scan and then moved down
again while btbulkdelete is reconsidering the deletable pages. Without
locking out splits entirely, it's far from obvious that we can make it
work.

Thoughts?

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 4: Have you searched our list archives?

http://archives.postgresql.org

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #26 (permalink)  
Old 04-18-2008, 01:36 AM
Heikki Linnakangas
 
Posts: n/a
Default Re: Page at a time index scan

On Fri, 5 May 2006, Tom Lane wrote:

> I wrote:
>> BTW, I just realized another bug in the patch: btbulkdelete fails to
>> guarantee that it visits every page in the index. It was OK for
>> btvacuumcleanup to ignore pages added to the index after it starts,
>> but btbulkdelete has to deal with such pages.

>
> Actually, as written this patch does not work. At all. btbulkdelete
> has to guarantee that it removes every index entry it's told to, and
> it cannot guarantee that in the presence of concurrent page splits.
> A split could move index items from a page that btbulkdelete hasn't
> visited to one it's already passed over. This is not possible with an
> index-order traversal (because splits only move items to the right)
> but it's definitely possible with a physical-order traversal.


True.

The first solution that occurs to me is to force page splits to choose the
target page so that it's blkno > the original page's blkno during vacuum.
It would cause the index to become more fragmented more quickly, which is
bad but perhaps tolerable.

> I was toying with the idea of remembering deletable pages (which
> btvacuumcleanup does anyway), which are the only ones that page splits
> could move items to, and then rescanning those after the completion
> of the primary pass. This has a couple of pretty unpleasant
> consequences though:
> * We have to remember *every* deletable page for correctness, compared
> to the current situation where btvacuumcleanup can bound the number of
> pages it tracks. This creates a situation where VACUUM may fail
> outright if it doesn't have gobs of memory. Since one of the main
> reasons for developing lazy VACUUM was to ensure we could vacuum
> arbitrarily large tables in bounded memory, I'm not happy with this.
> * The rescan could be far from cheap if there are many such pages.


Yep, that's not good.

- Heikki

---------------------------(end of broadcast)---------------------------
TIP 5: don't forget to increase your free space map settings

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #27 (permalink)  
Old 04-18-2008, 01:36 AM
Tom Lane
 
Posts: n/a
Default Re: Page at a time index scan

Heikki Linnakangas <hlinnaka@iki.fi> writes:
> The first solution that occurs to me is to force page splits to choose the
> target page so that it's blkno > the original page's blkno during vacuum.


I thought about that too, but don't like it for three reasons:

* it encourages index bloat, the more the longer the vacuum runs. Not
good, especially if you've got aggressive vacuum cost delay settings.

* there's a locking problem with respect to how you turn that behavior on.

* there's a failure mode where the behavior doesn't get turned off if
vacuum fails partway through.

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 5: don't forget to increase your free space map settings

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #28 (permalink)  
Old 04-18-2008, 01:36 AM
Tom Lane
 
Posts: n/a
Default Re: Page at a time index scan

Heikki Linnakangas <hlinnaka@iki.fi> writes:
> On Fri, 5 May 2006, Tom Lane wrote:
>>> BTW, I just realized another bug in the patch: btbulkdelete fails to
>>> guarantee that it visits every page in the index.


> The first solution that occurs to me is to force page splits to choose the
> target page so that it's blkno > the original page's blkno during vacuum.
> It would cause the index to become more fragmented more quickly, which is
> bad but perhaps tolerable.


I have a sketch of a solution that doesn't require any change in page
allocation behavior. Can anyone see any holes in this:

Assume that we have some way to recognize whether a page has been split
since the current btbulkdelete scan started. (A split would mark both the
original page and the new right-hand page as newly split.) When
btbulkdelete arrives at a page, it need take no special action unless the
page is newly split *and* its right-link points to a lower physical
address. If that's true, then after vacuuming the page, follow its
right-link and vacuum that page; repeat until arriving at a page that is
either not newly split or is above the current location of the outer loop.
Then return to the outer, sequential-scan loop.

We should also have btbulkdelete clear the newly-split marker whenever
it processes a page; this ensures that no page is processed more than
once, which is probably worth the possible extra I/O needed to clear the
marker. (The cycles to re-scan a page are maybe not that important, but
if we do reprocess pages we'll end up with a bogusly high tuple count.
OTOH I'm not sure we can guarantee an accurate tuple count anyway,
so maybe it doesn't matter.)

AFAICS, this works correctly even if the test for a newly-split page
sometimes yields false positives; thinking a page is newly split when
it is not might cost a little extra I/O but won't lead to wrong results.
This reduces the requirements for the newly-split marking mechanism.

So, how do we do that marking? Noting that we have 16 bits we could use
in BTPageOpaqueData without making that struct any bigger (because of
alignment considerations), I'm thinking about a 16-bit counter for each
index that is incremented at the start of each btbulkdelete operation and
copied into the opaque data whenever a page is split. If the value's
different from the current counter, the page definitely hasn't been split
during btbulkdelete. There's a 1-in-65536 chance of a false positive,
if the last split occurred some exact multiple of 65536 vacuums ago, but
that's probably low enough to be acceptable. (We could reduce the odds of
false positives in various ways, eg by saying that zero isn't a valid
counter value and having btbulkdelete reset a page's counter to zero
anytime it has to write the page anyway.)

This still has the same sort of locking issues I complained of in
regards to Heikki's idea, namely how do you make sure that everyone is
using the new counter value before you start scanning? That seems
soluble however. We'd just need to be willing to take one more lock
during every page split operation, which does not seem an intolerable
amount of overhead. Then we do something like take a sharelock while
fetching the counter value during split (and hold the lock until the
split's complete), and take a momentary exclusive lock while advancing
the counter during btbulkdelete startup.

Thoughts, better ideas?

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 5: don't forget to increase your free space map settings

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #29 (permalink)  
Old 04-18-2008, 01:36 AM
Heikki Linnakangas
 
Posts: n/a
Default Re: Page at a time index scan

On Fri, 5 May 2006, Tom Lane wrote:

> I have a sketch of a solution that doesn't require any change in page
> allocation behavior. Can anyone see any holes in this:


Looks good to me.

> Assume that we have some way to recognize whether a page has been split
> since the current btbulkdelete scan started. (A split would mark both the
> original page and the new right-hand page as newly split.) When
> btbulkdelete arrives at a page, it need take no special action unless the
> page is newly split *and* its right-link points to a lower physical
> address. If that's true, then after vacuuming the page, follow its
> right-link and vacuum that page; repeat until arriving at a page that is
> either not newly split or is above the current location of the outer loop.
> Then return to the outer, sequential-scan loop.


It'd be a bit more efficient to finish the sequential-scan first, and
memorize the newly-split pages' right-links as they're encountered. Then
scan those pages as a separate second pass, or earlier if we run out of
memory reserved for memorizing them.

> We should also have btbulkdelete clear the newly-split marker whenever
> it processes a page; this ensures that no page is processed more than
> once, which is probably worth the possible extra I/O needed to clear the
> marker. (The cycles to re-scan a page are maybe not that important, but
> if we do reprocess pages we'll end up with a bogusly high tuple count.
> OTOH I'm not sure we can guarantee an accurate tuple count anyway,
> so maybe it doesn't matter.)


Yeah, seems worth it to always clear the marker. Definitely when the page
is dirtied anyway.

> AFAICS, this works correctly even if the test for a newly-split page
> sometimes yields false positives; thinking a page is newly split when
> it is not might cost a little extra I/O but won't lead to wrong results.
> This reduces the requirements for the newly-split marking mechanism.
>
> So, how do we do that marking? Noting that we have 16 bits we could use
> in BTPageOpaqueData without making that struct any bigger (because of
> alignment considerations), I'm thinking about a 16-bit counter for each
> index that is incremented at the start of each btbulkdelete operation and
> copied into the opaque data whenever a page is split. If the value's
> different from the current counter, the page definitely hasn't been split
> during btbulkdelete. There's a 1-in-65536 chance of a false positive,
> if the last split occurred some exact multiple of 65536 vacuums ago, but
> that's probably low enough to be acceptable. (We could reduce the odds of
> false positives in various ways, eg by saying that zero isn't a valid
> counter value and having btbulkdelete reset a page's counter to zero
> anytime it has to write the page anyway.)


If btbulkdelete always clears the marker (assuming zero isn't a valid
value), 16 bits is plenty. Unless a vacuum is aborted, there should never
be a value older than current value - 1 in the index. We could live with a
2-bit counter.

> This still has the same sort of locking issues I complained of in
> regards to Heikki's idea, namely how do you make sure that everyone is
> using the new counter value before you start scanning? That seems
> soluble however. We'd just need to be willing to take one more lock
> during every page split operation, which does not seem an intolerable
> amount of overhead. Then we do something like take a sharelock while
> fetching the counter value during split (and hold the lock until the
> split's complete), and take a momentary exclusive lock while advancing
> the counter during btbulkdelete startup.


That's not too bad. Where exactly were you thinking of putting the
counter and the lock?

> Thoughts, better ideas?


Well, we could enhance my proposal a bit to make the fragmentation effect
less severe. Instead of a simple flag indicating that a vacuum is in
progress, vacuum could announce the current page it's processing in a
per-index shmem variable. A page split must read that counter, holding a
shared lock, and choose the target page so that that boundary is not
crossed so that the new page is below the boundary and old page above.
Otherwise, it's free to choose what it wants. To make good target page
choices, it needs some help from FSM.

I think I like your proposal more, though. It seems better
concurrency-wise.

- Heikki

---------------------------(end of broadcast)---------------------------
TIP 1: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #30 (permalink)  
Old 04-18-2008, 01:36 AM
Tom Lane
 
Posts: n/a
Default Re: Page at a time index scan

Heikki Linnakangas <hlinnaka@iki.fi> writes:
> That's not too bad. Where exactly were you thinking of putting the
> counter and the lock?


My original thought was to keep it in btree metapages, but that's kind
of annoying since we just went to some effort to cache metapage
contents; which means the metapage will likely not be hanging around in
buffer cache. Right now I'm thinking we could have a single system-wide
counter instead of one per index, and keep it in shared memory. There's
no strong reason why it needs to survive system crash/restart AFAICS.

Aside from the counter proper, shared memory would need to contain
enough info so we could tell whether btbulkdelete was active for any
particular index, and the marker value to use for that index if so.
But this requires only enough space for 1 entry per active vacuum,
which is bounded by MaxBackends and will usually be a lot smaller.
A benefit of doing it that way is that when btbulkdelete isn't active,
splits can reliably tell so and write zero into the page markers.
That reduces the odds of false positives quite a lot, assuming that
bulkdelete is only active a small fraction of the time.

BTW, I'm currently working on cleaning up the page-at-a-time-scan
part of your patch, since we could apply that without changing
VACUUM at all. We need to do performance testing on that and make
sure it won't kill us for unique indexes, else this whole project
may have to be shelved :-(. I've found a few more problems, eg the
code doesn't handle changes in indexscan direction properly, but
no showstoppers.

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 1: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
Reply


Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are On
Forum Jump


All times are GMT. The time now is 04:32 PM.


Powered by vBulletin® Version 3.6.5
Copyright ©2000 - 2008, Jelsoft Enterprises Ltd.
SEO by vBSEO 3.2.0
www.UnixAdminTalk.com