This is a discussion on Page at a time index scan within the Pgsql Patches forums, part of the PostgreSQL category; --> On Tue, 2 May 2006, Tom Lane wrote: > Heikki Linnakangas <hlinnaka@iki.fi> writes: >> On Tue, 2 May 2006, ...
| |||||||
| Register | FAQ | Members List | Calendar | Search | Today's Posts | Mark Forums Read |
| ||||
| On Tue, 2 May 2006, Tom Lane wrote: > Heikki Linnakangas <hlinnaka@iki.fi> writes: >> On Tue, 2 May 2006, Tom Lane wrote: >>> Backwards scan may break this whole concept; are you sure you've thought >>> it through? > >> I think so. The patch doesn't change the walk-left code. Do you have >> something specific in mind? > > I'm worried about synchronization, particularly what happens if the page > gets deleted from under you while you don't have it pinned. AFAICS, shouldn't happen. The half-dead system makes sure that a page won't get deleted while a scan might still be interested in it. It doesn't depend on pins. - Heikki ---------------------------(end of broadcast)--------------------------- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to majordomo@postgresql.org so that your message can get through to the mailing list cleanly |
| |||
| On Tue, 2006-05-02 at 15:35 -0400, Tom Lane wrote: > Heikki Linnakangas <hlinnaka@iki.fi> writes: > > On Tue, 2 May 2006, Tom Lane wrote: > >> Backwards scan may break this whole concept; are you sure you've thought > >> it through? > > > I think so. The patch doesn't change the walk-left code. Do you have > > something specific in mind? > > I'm worried about synchronization, particularly what happens if the page > gets deleted from under you while you don't have it pinned. Perhaps I should update my comments on "we don't need a pin at all"... On a Forward scan we need to pin while we are reading a whole page, though can release the pin afterwards. We don't need to keep the pin while servicing btgetnext() requests from our private page buffer though. (Which is what I meant to say.) AFAICS we will need to return to the page for a backward scan, so we could just keep the pin the whole way. It's not possible to cache the left page pointer because block splits to our immediate left can update them even after we read the page contents. (A forward scan need never fear page splits in the same way because existing items can't move past the existing page boundary). We need never return to a page that *could* be deleted. While scanning in either direction, if the complete page contains nothing but dead items we can simply move straight onto the next page, having updated the page status to half-dead. (The great thing about this patch is we should be able to report that somehow, so an asynchronous task handler can come and clean that page (only) now that we don't have a restriction on individual page vacuuming. We can think about somehow later) If only some of the index tuples are deleted, we should only return to the page to update the deleted index tuples *if*: - the page is still in the buffer pool. If its been evicted its because space is tight so we shouldn't call it back just to dirty the page. - we have a minimum threshold of deleted tuples. Otherwise we might re-dirty the page for just a single hint bit, so we end up writing the page out hundreds of times. (Guess: that should be 2 or 3) -- Simon Riggs EnterpriseDB http://www.enterprisedb.com ---------------------------(end of broadcast)--------------------------- TIP 6: explain analyze is your friend |
| |||
| Simon Riggs <simon@2ndquadrant.com> writes: > AFAICS we will need to return to the page for a backward scan, so we > could just keep the pin the whole way. It's not possible to cache the > left page pointer because block splits to our immediate left can update > them even after we read the page contents. Sure we can cache the left pointer: it's not any different from the normal walk-left problem, because as soon as you drop pin on the page you are leaving, all the same hazards apply. This algorithm just has a slightly wider window between dropping one pin and acquiring the next. Walk-left is painful and (rarely) expensive, but it's a solved problem. > We need never return to a page that *could* be deleted. While scanning > in either direction, if the complete page contains nothing but dead > items we can simply move straight onto the next page, having updated the > page status to half-dead. This is unnecessary and probably wrong. > - we have a minimum threshold of deleted tuples. Otherwise we might > re-dirty the page for just a single hint bit, so we end up writing the > page out hundreds of times. (Guess: that should be 2 or 3) You are optimizing the wrong thing here. If we choose not to mark an entry dead then we will pay for that omission on every future scan of the same entry. I don't think that outweighs the (doubtless rare) situation where we expend an extra page fetch to reload the page. It's worth noting that all of this stuff is predicated on the assumption that index items never move across pre-existing page boundaries, in either direction. We are therefore going to be permanently giving up any prospect of index space reclamation by merging partly-filled pages (unless maybe in VACUUM FULL). We didn't know how to do that anyway, so I don't feel too bad about it, but if indexscans don't make any attempt to explicitly re-locate their positions then that certainly goes out the window. regards, tom lane ---------------------------(end of broadcast)--------------------------- TIP 2: Don't 'kill -9' the postmaster |
| |||
| On Tue, 2006-05-02 at 15:35 -0400, Tom Lane wrote: > > > I'm worried about synchronization, particularly what happens if the page > > > gets deleted from under you while you don't have it pinned. On Wed, 2006-05-03 at 10:17 -0400, Tom Lane wrote: > Simon Riggs <simon@2ndquadrant.com> writes: > > We need never return to a page that *could* be deleted. While scanning > > in either direction, if the complete page contains nothing but dead > > items we can simply move straight onto the next page, having updated the > > page status to half-dead. > > This is unnecessary and probably wrong. You'll need to be more specific about what you mean. Heikki's concurrent post says roughly the same thing as what I just said, AFAICS. Do you see a problem with page deletion? If so, where? > It's worth noting that all of this stuff is predicated on the assumption > that index items never move across pre-existing page boundaries, in > either direction. We are therefore going to be permanently giving up > any prospect of index space reclamation by merging partly-filled pages > (unless maybe in VACUUM FULL). We didn't know how to do that anyway, > so I don't feel too bad about it, but if indexscans don't make any > attempt to explicitly re-locate their positions then that certainly > goes out the window. Seems like a step forwards to me, even if there is still wish to go further; we've all been trying to improve this behaviour for some time, so hats off to Heikki... -- Simon Riggs EnterpriseDB http://www.enterprisedb.com ---------------------------(end of broadcast)--------------------------- TIP 4: Have you searched our list archives? http://archives.postgresql.org |
| |||
| On Wed, 2006-05-03 at 10:17 -0400, Tom Lane wrote: > Simon Riggs <simon@2ndquadrant.com> writes: > You are optimizing the wrong thing here. If we choose not to mark an > entry dead then we will pay for that omission on every future scan of > the same entry. I don't think that outweighs the (doubtless rare) > situation where we expend an extra page fetch to reload the page. Sounds a familiar conversation, which I shouldn't have raised here. This depends upon whether the pages being accessed are in cache or not, and whether we have sufficient I/O to pay the cost of a write. Reads don't always go to disk, writes always do. I see that its difficult to tell which is which, but that doesn't mean there aren't different cases. -- Simon Riggs EnterpriseDB http://www.enterprisedb.com ---------------------------(end of broadcast)--------------------------- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match |
| |||
| Simon Riggs <simon@2ndquadrant.com> writes: > On Tue, 2006-05-02 at 15:35 -0400, Tom Lane wrote: >> This is unnecessary and probably wrong. > You'll need to be more specific about what you mean. There is no point in marking a page half-dead, as that doesn't save anyone else from visiting it, and it's probably wrong to mark a leaf page as half-dead at all. That state is associated with upper pages. Even if it were a legal tree configuration, marking the page half-dead would make it impossible to insert any more keys in the page, which doesn't strike me as an appropriate behavior; it's likely to force excess I/O later due to unnecessary page splits during future inserts. regards, tom lane ---------------------------(end of broadcast)--------------------------- TIP 5: don't forget to increase your free space map settings |
| |||
| On Wed, 2006-05-03 at 10:56 -0400, Tom Lane wrote: > Simon Riggs <simon@2ndquadrant.com> writes: > > On Tue, 2006-05-02 at 15:35 -0400, Tom Lane wrote: > >> This is unnecessary and probably wrong. > > > You'll need to be more specific about what you mean. > > There is no point in marking a page half-dead, as that doesn't save > anyone else from visiting it, and it's probably wrong to mark a leaf > page as half-dead at all. That state is associated with upper pages. > > Even if it were a legal tree configuration, marking the page half-dead > would make it impossible to insert any more keys in the page, which > doesn't strike me as an appropriate behavior; it's likely to force > excess I/O later due to unnecessary page splits during future inserts. OK. So do you see a problem scenario like this? A, B and C separate backends: A1 Reads page, some row versions are *not* marked LP_DELETE but will be later when A2 happens B1 VACUUM removes dead rows, just happens to be all of them B2 Recycles page into FSM C1 Inserts new data into old page A2 Attempts to update old page to notify about dead rows (UGH!) B2 can only stopped by pinning... -- Simon Riggs EnterpriseDB http://www.enterprisedb.com ---------------------------(end of broadcast)--------------------------- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to majordomo@postgresql.org so that your message can get through to the mailing list cleanly |
| |||
| Simon Riggs <simon@2ndquadrant.com> writes: > So do you see a problem scenario like this? > A, B and C separate backends: > A1 Reads page, some row versions are *not* marked LP_DELETE but will be > later when A2 happens > B1 VACUUM removes dead rows, just happens to be all of them > B2 Recycles page into FSM > C1 Inserts new data into old page > A2 Attempts to update old page to notify about dead rows (UGH!) Can't happen; a page cannot be recycled until all concurrent transactions are gone. In any case, the LP_DELETE marking code will certainly take care to check that the entries it's trying to mark are still the same ones it meant to mark. regards, tom lane ---------------------------(end of broadcast)--------------------------- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match |
| |||
| On Wed, 2006-05-03 at 13:39 -0400, Tom Lane wrote: > Simon Riggs <simon@2ndquadrant.com> writes: > > So do you see a problem scenario like this? > > > A, B and C separate backends: > > A1 Reads page, some row versions are *not* marked LP_DELETE but will be > > later when A2 happens > > B1 VACUUM removes dead rows, just happens to be all of them > > B2 Recycles page into FSM > > C1 Inserts new data into old page > > A2 Attempts to update old page to notify about dead rows (UGH!) > > Can't happen; a page cannot be recycled until all concurrent > transactions are gone. In any case, the LP_DELETE marking code will > certainly take care to check that the entries it's trying to mark > are still the same ones it meant to mark. ! So do you see a problem with page deletion, or not? If so, what is it? This patch looks good to me, based upon everything said. -- Simon Riggs EnterpriseDB http://www.enterprisedb.com ---------------------------(end of broadcast)--------------------------- TIP 5: don't forget to increase your free space map settings |
| ||||
| On Wed, 3 May 2006, Heikki Linnakangas wrote: > On Tue, 2 May 2006, Tom Lane wrote: > >> Heikki Linnakangas <hlinnaka@iki.fi> writes: >>> On Tue, 2 May 2006, Tom Lane wrote: >>>> Backwards scan may break this whole concept; are you sure you've thought >>>> it through? >> >>> I think so. The patch doesn't change the walk-left code. Do you have >>> something specific in mind? >> >> I'm worried about synchronization, particularly what happens if the page >> gets deleted from under you while you don't have it pinned. > > AFAICS, shouldn't happen. The half-dead system makes sure that a page won't > get deleted while a scan might still be interested in it. It doesn't depend > on pins. As Tom pointed out elsewhere in this thread, the above explanation is wrong because half-dead state only applies to upper-level pages. I thought that half-dead means that the page is dead and removed from the tree, but not yet recycled because some transaction might still be interested in it. Now I see that that state is actually called "deleted". The point remains, however. A page won't get deleted while a scan might still be interested in it, because deleted pages are not immediately recycled (except on vacuum full), and the left and right sibling pointers stay intact until no transaction can be interested in it. - Heikki ---------------------------(end of broadcast)--------------------------- TIP 6: explain analyze is your friend |