Unix Technical Forum

Page at a time index scan

This is a discussion on Page at a time index scan within the Pgsql Patches forums, part of the PostgreSQL category; --> On Tue, 2 May 2006, Tom Lane wrote: > Heikki Linnakangas <hlinnaka@iki.fi> writes: >> On Tue, 2 May 2006, ...


Go Back   Unix Technical Forum > Database Server Software > PostgreSQL > Pgsql Patches

Register FAQ Members List Calendar Search Today's Posts Mark Forums Read
  #11 (permalink)  
Old 04-18-2008, 12:35 AM
Heikki Linnakangas
 
Posts: n/a
Default Re: Page at a time index scan

On Tue, 2 May 2006, Tom Lane wrote:

> Heikki Linnakangas <hlinnaka@iki.fi> writes:
>> On Tue, 2 May 2006, Tom Lane wrote:
>>> Backwards scan may break this whole concept; are you sure you've thought
>>> it through?

>
>> I think so. The patch doesn't change the walk-left code. Do you have
>> something specific in mind?

>
> I'm worried about synchronization, particularly what happens if the page
> gets deleted from under you while you don't have it pinned.


AFAICS, shouldn't happen. The half-dead system makes sure that a page
won't get deleted while a scan might still be interested in it. It
doesn't depend on pins.

- Heikki

---------------------------(end of broadcast)---------------------------
TIP 1: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #12 (permalink)  
Old 04-18-2008, 12:35 AM
Simon Riggs
 
Posts: n/a
Default Re: Page at a time index scan

On Tue, 2006-05-02 at 15:35 -0400, Tom Lane wrote:
> Heikki Linnakangas <hlinnaka@iki.fi> writes:
> > On Tue, 2 May 2006, Tom Lane wrote:
> >> Backwards scan may break this whole concept; are you sure you've thought
> >> it through?

>
> > I think so. The patch doesn't change the walk-left code. Do you have
> > something specific in mind?

>
> I'm worried about synchronization, particularly what happens if the page
> gets deleted from under you while you don't have it pinned.


Perhaps I should update my comments on "we don't need a pin at all"...

On a Forward scan we need to pin while we are reading a whole page,
though can release the pin afterwards. We don't need to keep the pin
while servicing btgetnext() requests from our private page buffer
though. (Which is what I meant to say.)

AFAICS we will need to return to the page for a backward scan, so we
could just keep the pin the whole way. It's not possible to cache the
left page pointer because block splits to our immediate left can update
them even after we read the page contents. (A forward scan need never
fear page splits in the same way because existing items can't move past
the existing page boundary).

We need never return to a page that *could* be deleted. While scanning
in either direction, if the complete page contains nothing but dead
items we can simply move straight onto the next page, having updated the
page status to half-dead. (The great thing about this patch is we should
be able to report that somehow, so an asynchronous task handler can come
and clean that page (only) now that we don't have a restriction on
individual page vacuuming. We can think about somehow later)

If only some of the index tuples are deleted, we should only return to
the page to update the deleted index tuples *if*:
- the page is still in the buffer pool. If its been evicted its because
space is tight so we shouldn't call it back just to dirty the page.
- we have a minimum threshold of deleted tuples. Otherwise we might
re-dirty the page for just a single hint bit, so we end up writing the
page out hundreds of times. (Guess: that should be 2 or 3)

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com


---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #13 (permalink)  
Old 04-18-2008, 12:35 AM
Tom Lane
 
Posts: n/a
Default Re: Page at a time index scan

Simon Riggs <simon@2ndquadrant.com> writes:
> AFAICS we will need to return to the page for a backward scan, so we
> could just keep the pin the whole way. It's not possible to cache the
> left page pointer because block splits to our immediate left can update
> them even after we read the page contents.


Sure we can cache the left pointer: it's not any different from the
normal walk-left problem, because as soon as you drop pin on the page
you are leaving, all the same hazards apply. This algorithm just has a
slightly wider window between dropping one pin and acquiring the next.
Walk-left is painful and (rarely) expensive, but it's a solved problem.

> We need never return to a page that *could* be deleted. While scanning
> in either direction, if the complete page contains nothing but dead
> items we can simply move straight onto the next page, having updated the
> page status to half-dead.


This is unnecessary and probably wrong.

> - we have a minimum threshold of deleted tuples. Otherwise we might
> re-dirty the page for just a single hint bit, so we end up writing the
> page out hundreds of times. (Guess: that should be 2 or 3)


You are optimizing the wrong thing here. If we choose not to mark an
entry dead then we will pay for that omission on every future scan of
the same entry. I don't think that outweighs the (doubtless rare)
situation where we expend an extra page fetch to reload the page.

It's worth noting that all of this stuff is predicated on the assumption
that index items never move across pre-existing page boundaries, in
either direction. We are therefore going to be permanently giving up
any prospect of index space reclamation by merging partly-filled pages
(unless maybe in VACUUM FULL). We didn't know how to do that anyway,
so I don't feel too bad about it, but if indexscans don't make any
attempt to explicitly re-locate their positions then that certainly
goes out the window.

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 2: Don't 'kill -9' the postmaster

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #14 (permalink)  
Old 04-18-2008, 12:35 AM
Simon Riggs
 
Posts: n/a
Default Re: Page at a time index scan

On Tue, 2006-05-02 at 15:35 -0400, Tom Lane wrote:
> > > I'm worried about synchronization, particularly what happens if the page
> > > gets deleted from under you while you don't have it pinned.


On Wed, 2006-05-03 at 10:17 -0400, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:


> > We need never return to a page that *could* be deleted. While scanning
> > in either direction, if the complete page contains nothing but dead
> > items we can simply move straight onto the next page, having updated the
> > page status to half-dead.

>
> This is unnecessary and probably wrong.


You'll need to be more specific about what you mean. Heikki's concurrent
post says roughly the same thing as what I just said, AFAICS.

Do you see a problem with page deletion? If so, where?

> It's worth noting that all of this stuff is predicated on the assumption
> that index items never move across pre-existing page boundaries, in
> either direction. We are therefore going to be permanently giving up
> any prospect of index space reclamation by merging partly-filled pages
> (unless maybe in VACUUM FULL). We didn't know how to do that anyway,
> so I don't feel too bad about it, but if indexscans don't make any
> attempt to explicitly re-locate their positions then that certainly
> goes out the window.


Seems like a step forwards to me, even if there is still wish to go
further; we've all been trying to improve this behaviour for some time,
so hats off to Heikki...

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com


---------------------------(end of broadcast)---------------------------
TIP 4: Have you searched our list archives?

http://archives.postgresql.org

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #15 (permalink)  
Old 04-18-2008, 12:35 AM
Simon Riggs
 
Posts: n/a
Default Re: Page at a time index scan

On Wed, 2006-05-03 at 10:17 -0400, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:


> You are optimizing the wrong thing here. If we choose not to mark an
> entry dead then we will pay for that omission on every future scan of
> the same entry. I don't think that outweighs the (doubtless rare)
> situation where we expend an extra page fetch to reload the page.


Sounds a familiar conversation, which I shouldn't have raised here.

This depends upon whether the pages being accessed are in cache or not,
and whether we have sufficient I/O to pay the cost of a write. Reads
don't always go to disk, writes always do. I see that its difficult to
tell which is which, but that doesn't mean there aren't different cases.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com


---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
choose an index scan if your joining column's datatypes do not
match

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #16 (permalink)  
Old 04-18-2008, 12:35 AM
Tom Lane
 
Posts: n/a
Default Re: Page at a time index scan

Simon Riggs <simon@2ndquadrant.com> writes:
> On Tue, 2006-05-02 at 15:35 -0400, Tom Lane wrote:
>> This is unnecessary and probably wrong.


> You'll need to be more specific about what you mean.


There is no point in marking a page half-dead, as that doesn't save
anyone else from visiting it, and it's probably wrong to mark a leaf
page as half-dead at all. That state is associated with upper pages.

Even if it were a legal tree configuration, marking the page half-dead
would make it impossible to insert any more keys in the page, which
doesn't strike me as an appropriate behavior; it's likely to force
excess I/O later due to unnecessary page splits during future inserts.

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 5: don't forget to increase your free space map settings

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #17 (permalink)  
Old 04-18-2008, 12:35 AM
Simon Riggs
 
Posts: n/a
Default Re: Page at a time index scan

On Wed, 2006-05-03 at 10:56 -0400, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > On Tue, 2006-05-02 at 15:35 -0400, Tom Lane wrote:
> >> This is unnecessary and probably wrong.

>
> > You'll need to be more specific about what you mean.

>
> There is no point in marking a page half-dead, as that doesn't save
> anyone else from visiting it, and it's probably wrong to mark a leaf
> page as half-dead at all. That state is associated with upper pages.
>
> Even if it were a legal tree configuration, marking the page half-dead
> would make it impossible to insert any more keys in the page, which
> doesn't strike me as an appropriate behavior; it's likely to force
> excess I/O later due to unnecessary page splits during future inserts.


OK.

So do you see a problem scenario like this?

A, B and C separate backends:
A1 Reads page, some row versions are *not* marked LP_DELETE but will be
later when A2 happens
B1 VACUUM removes dead rows, just happens to be all of them
B2 Recycles page into FSM
C1 Inserts new data into old page
A2 Attempts to update old page to notify about dead rows (UGH!)

B2 can only stopped by pinning...

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com


---------------------------(end of broadcast)---------------------------
TIP 1: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #18 (permalink)  
Old 04-18-2008, 12:35 AM
Tom Lane
 
Posts: n/a
Default Re: Page at a time index scan

Simon Riggs <simon@2ndquadrant.com> writes:
> So do you see a problem scenario like this?


> A, B and C separate backends:
> A1 Reads page, some row versions are *not* marked LP_DELETE but will be
> later when A2 happens
> B1 VACUUM removes dead rows, just happens to be all of them
> B2 Recycles page into FSM
> C1 Inserts new data into old page
> A2 Attempts to update old page to notify about dead rows (UGH!)


Can't happen; a page cannot be recycled until all concurrent
transactions are gone. In any case, the LP_DELETE marking code will
certainly take care to check that the entries it's trying to mark
are still the same ones it meant to mark.

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
choose an index scan if your joining column's datatypes do not
match

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #19 (permalink)  
Old 04-18-2008, 12:35 AM
Simon Riggs
 
Posts: n/a
Default Re: Page at a time index scan

On Wed, 2006-05-03 at 13:39 -0400, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > So do you see a problem scenario like this?

>
> > A, B and C separate backends:
> > A1 Reads page, some row versions are *not* marked LP_DELETE but will be
> > later when A2 happens
> > B1 VACUUM removes dead rows, just happens to be all of them
> > B2 Recycles page into FSM
> > C1 Inserts new data into old page
> > A2 Attempts to update old page to notify about dead rows (UGH!)

>
> Can't happen; a page cannot be recycled until all concurrent
> transactions are gone. In any case, the LP_DELETE marking code will
> certainly take care to check that the entries it's trying to mark
> are still the same ones it meant to mark.


!

So do you see a problem with page deletion, or not? If so, what is it?

This patch looks good to me, based upon everything said.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com


---------------------------(end of broadcast)---------------------------
TIP 5: don't forget to increase your free space map settings

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #20 (permalink)  
Old 04-18-2008, 12:35 AM
Heikki Linnakangas
 
Posts: n/a
Default Re: Page at a time index scan

On Wed, 3 May 2006, Heikki Linnakangas wrote:

> On Tue, 2 May 2006, Tom Lane wrote:
>
>> Heikki Linnakangas <hlinnaka@iki.fi> writes:
>>> On Tue, 2 May 2006, Tom Lane wrote:
>>>> Backwards scan may break this whole concept; are you sure you've thought
>>>> it through?

>>
>>> I think so. The patch doesn't change the walk-left code. Do you have
>>> something specific in mind?

>>
>> I'm worried about synchronization, particularly what happens if the page
>> gets deleted from under you while you don't have it pinned.

>
> AFAICS, shouldn't happen. The half-dead system makes sure that a page won't
> get deleted while a scan might still be interested in it. It doesn't depend
> on pins.


As Tom pointed out elsewhere in this thread, the above explanation is
wrong because half-dead state only applies to upper-level pages. I thought
that half-dead means that the page is dead and removed from the tree, but
not yet recycled because some transaction might still be interested in it.
Now I see that that state is actually called "deleted".

The point remains, however. A page won't get deleted while a scan
might still be interested in it, because deleted pages are not
immediately recycled (except on vacuum full), and the left and right
sibling pointers stay intact until no transaction can be interested in it.

- Heikki

---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
Reply


Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are On
Forum Jump


All times are GMT. The time now is 07:18 AM.


Powered by vBulletin® Version 3.6.5
Copyright ©2000 - 2008, Jelsoft Enterprises Ltd.
SEO by vBSEO 3.2.0
www.UnixAdminTalk.com