This is a discussion on Page at a time index scan within the Pgsql Patches forums, part of the PostgreSQL category; --> I've committed a rewritten version of this patch. Heikki Linnakangas <hlinnaka@iki.fi> writes: > On Fri, 5 May 2006, Tom ...
| |||||||
| FAQ | Members List | Calendar | Search | Today's Posts | Mark Forums Read |
| ||||
| I've committed a rewritten version of this patch. Heikki Linnakangas <hlinnaka@iki.fi> writes: > On Fri, 5 May 2006, Tom Lane wrote: >> btbulkdelete arrives at a page, it need take no special action unless the >> page is newly split *and* its right-link points to a lower physical >> address. If that's true, then after vacuuming the page, follow its >> right-link and vacuum that page; repeat until arriving at a page that is >> either not newly split or is above the current location of the outer loop. >> Then return to the outer, sequential-scan loop. > It'd be a bit more efficient to finish the sequential-scan first, and > memorize the newly-split pages' right-links as they're encountered. Then > scan those pages as a separate second pass, or earlier if we run out of > memory reserved for memorizing them. I didn't do this. Aside from the extra memory requirement, it's not apparent to me that it'd make things faster. The disadvantage is that it would require more page reads than the other way: if you visit the split page immediately, and note that its right-link is above the current outer loop location, then you can skip following the right-link because you know you'll visit the page later. If you postpone then you have to chase every chain until actually reading a page with an old cycle ID. I think this extra I/O would likely outweigh any savings from not interrupting the main scan. > If btbulkdelete always clears the marker (assuming zero isn't a valid > value), 16 bits is plenty. Unless a vacuum is aborted, there should never > be a value older than current value - 1 in the index. We could live with a > 2-bit counter. For the moment, the code is only clearing the marker if it's equal to the current cycle ID. This is sufficient to recognize definitely-already-processed pages, but it doesn't prevent false positives in general. If we ever need the space we could narrow the counter, at the cost of having to expend more I/O to keep the values cleared. regards, tom lane ---------------------------(end of broadcast)--------------------------- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match |
| |||
| On Fri, 2006-05-05 at 18:04 -0400, Tom Lane wrote: > Heikki Linnakangas <hlinnaka@iki.fi> writes: > > On Fri, 5 May 2006, Tom Lane wrote: > >>> BTW, I just realized another bug in the patch: btbulkdelete fails to > >>> guarantee that it visits every page in the index. > > > The first solution that occurs to me is to force page splits to choose the > > target page so that it's blkno > the original page's blkno during vacuum. > > It would cause the index to become more fragmented more quickly, which is > > bad but perhaps tolerable. > > I have a sketch of a solution that doesn't require any change in page > allocation behavior. Can anyone see any holes in this: > > Assume that we have some way to recognize whether a page has been split > since the current btbulkdelete scan started. (A split would mark both the > original page and the new right-hand page as newly split.) When > btbulkdelete arrives at a page, it need take no special action unless the > page is newly split *and* its right-link points to a lower physical > address. If that's true, then after vacuuming the page, follow its > right-link and vacuum that page; repeat until arriving at a page that is > either not newly split or is above the current location of the outer loop. > Then return to the outer, sequential-scan loop. > > We should also have btbulkdelete clear the newly-split marker whenever > it processes a page; this ensures that no page is processed more than > once, which is probably worth the possible extra I/O needed to clear the > marker. (The cycles to re-scan a page are maybe not that important, but > if we do reprocess pages we'll end up with a bogusly high tuple count. > OTOH I'm not sure we can guarantee an accurate tuple count anyway, > so maybe it doesn't matter.) > > AFAICS, this works correctly even if the test for a newly-split page > sometimes yields false positives; thinking a page is newly split when > it is not might cost a little extra I/O but won't lead to wrong results. > This reduces the requirements for the newly-split marking mechanism. > > So, how do we do that marking? Noting that we have 16 bits we could use > in BTPageOpaqueData without making that struct any bigger (because of > alignment considerations), I'm thinking about a 16-bit counter for each > index that is incremented at the start of each btbulkdelete operation and > copied into the opaque data whenever a page is split. If the value's > different from the current counter, the page definitely hasn't been split > during btbulkdelete. There's a 1-in-65536 chance of a false positive, > if the last split occurred some exact multiple of 65536 vacuums ago, but > that's probably low enough to be acceptable. (We could reduce the odds of > false positives in various ways, eg by saying that zero isn't a valid > counter value and having btbulkdelete reset a page's counter to zero > anytime it has to write the page anyway.) I read your earlier post about needing to lock everything and spent some time thinking about this. The issue of needing to lock everything means that we would never be able to do a partial vacuum of an index i.e. remove one page without a scan. I'm more concerned about making partial vacuum work than I am about speeding up an all-block vacuum. I'd been mulling over the locking problems and came up with something that looks identical to your proposal *except* for the value that gets written into the BTPageOpaqueData on the right-hand page. My thinking was to write the blockid of the original left hand page, so as to record the original ancestor since split. Thus if multiple splits occurred, then the original ancestor blockid would remain on record until VACUUM. In more detail: When we split a page if the ancestor blockid is not set, then we set it to be the blockid of the old page (new left hand page). If the ancestor blockid is already set then we copy that to the new right hand page. Every split will write a value to BTPageOpaqueData, though the values to use are already available without extra work. AFAICS this doesn't change the ability to scan the index in physical order, so we still get the benefits for that action. A *partial* vacuum of an index block can be achieved by visiting the block to be vacuumed, then following the link directly to the pre-split ancestor block (if there is one), then moving right again back to the target block. btbulkdelete always clears the marker when it cleans. This opens the door for the approach of notifying when a page is deletable and then having a background agent remove just that page, as patch-proposed recently by Junji TERAMOTO. I'm *not* presuming we have an agreed solution for partial vacuuming, but I do want to keep that door open by implementing a locking protocol that could support it. > This still has the same sort of locking issues I complained of in > regards to Heikki's idea, namely how do you make sure that everyone is > using the new counter value before you start scanning? That seems > soluble however. We'd just need to be willing to take one more lock > during every page split operation, which does not seem an intolerable > amount of overhead. Then we do something like take a sharelock while > fetching the counter value during split (and hold the lock until the > split's complete), and take a momentary exclusive lock while advancing > the counter during btbulkdelete startup. I'm not very happy about an extra lock during page splitting, which adds a performance hit even for tables that never will need regular vacuuming (apart from occaisional wrap-around avoidance). AFAICS if we just store the ancestor blockid we don't need to do the business with shared memory and extra locks etc.. The size of the field we hold for BTPageOpaqueData seems fairly unimportant, since we leave lots of space in index pages anyway - a few extra bytes would only make a noise level change in performance. -- Simon Riggs EnterpriseDB http://www.enterprisedb.com ---------------------------(end of broadcast)--------------------------- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match |
| |||
| Simon Riggs <simon@2ndquadrant.com> writes: > I read your earlier post about needing to lock everything and spent some > time thinking about this. The issue of needing to lock everything means > that we would never be able to do a partial vacuum of an index i.e. > remove one page without a scan. I'm more concerned about making partial > vacuum work than I am about speeding up an all-block vacuum. [ shrug... ] That's an illusory goal anyway. Retail tuple removal is just too inefficient. (No, I don't believe in that proposed patch.) > My thinking was to write the blockid of the original left hand page, so > as to record the original ancestor since split. Thus if multiple splits > occurred, then the original ancestor blockid would remain on record > until VACUUM. In more detail: When we split a page if the ancestor > blockid is not set, then we set it to be the blockid of the old page > (new left hand page). If the ancestor blockid is already set then we > copy that to the new right hand page. Every split will write a value to > BTPageOpaqueData, though the values to use are already available without > extra work. Doesn't work, at least not for making it possible to vacuum part of the index. The conflicting indexscan could have stopped on a page, and then that page could have split, before your "partial vacuum" ever started. So tracing back only as far as the data has split since vacuum started is not enough to prevent conflict. (The other little problem is that we'd have to enlarge the BTOpaque overhead, because a block id doesn't fit in the available 16 bits.) > I'm not very happy about an extra lock during page splitting, which adds > a performance hit even for tables that never will need regular vacuuming > (apart from occaisional wrap-around avoidance). While I'd rather not have done that, I don't believe that it makes for any material performance degradation. Normal splits all take the lock in shared mode and hence suffer no contention. Your proposal wouldn't make for less locking anyway, since it still assumes that there's a way to tell whether vacuum is active for a given index, which is just about the same amount of overhead as the code-as-committed. regards, tom lane ---------------------------(end of broadcast)--------------------------- TIP 2: Don't 'kill -9' the postmaster |
| |||
| On Mon, 2006-05-08 at 10:18 -0400, Tom Lane wrote: > Simon Riggs <simon@2ndquadrant.com> writes: > > I read your earlier post about needing to lock everything and spent some > > time thinking about this. The issue of needing to lock everything means > > that we would never be able to do a partial vacuum of an index i.e. > > remove one page without a scan. I'm more concerned about making partial > > vacuum work than I am about speeding up an all-block vacuum. > > [ shrug... ] That's an illusory goal anyway. Retail tuple removal is > just too inefficient. (No, I don't believe in that proposed patch.) Current VACUUM optimizes for the case where random updates/deletes occur. If there are hotspots then scanning the whole relation is O(N) or even O(N^2) if we have to scan the indexes multiple times. We think we have a way to improve heap VACUUMs (bitmaps...) but we are still looking for an equivalent for indexes. > > My thinking was to write the blockid of the original left hand page, so > > as to record the original ancestor since split. Thus if multiple splits > > occurred, then the original ancestor blockid would remain on record > > until VACUUM. In more detail: When we split a page if the ancestor > > blockid is not set, then we set it to be the blockid of the old page > > (new left hand page). If the ancestor blockid is already set then we > > copy that to the new right hand page. Every split will write a value to > > BTPageOpaqueData, though the values to use are already available without > > extra work. > > Doesn't work, at least not for making it possible to vacuum part of the > index. The conflicting indexscan could have stopped on a page, and then > that page could have split, before your "partial vacuum" ever started. > So tracing back only as far as the data has split since vacuum started > is not enough to prevent conflict. That wasn't the proposal. Every split would be marked and stay marked until those blocks were VACUUMed. The data used to mark is readily available and doesn't rely on whether or not VACUUM is running. IMHO this does work. > (The other little problem is that we'd have to enlarge the BTOpaque > overhead, because a block id doesn't fit in the available 16 bits.) ISTM it is indeed a little problem. After CREATE INDEX, most index pages don't stay completely full, so worrying about alignment wastage might very occasionally save an I/O, but not enough for us to care. > > I'm not very happy about an extra lock during page splitting, which adds > > a performance hit even for tables that never will need regular vacuuming > > (apart from occaisional wrap-around avoidance). > > While I'd rather not have done that, I don't believe that it makes for > any material performance degradation. Normal splits all take the lock > in shared mode and hence suffer no contention. Your proposal wouldn't > make for less locking anyway, since it still assumes that there's a way > to tell whether vacuum is active for a given index, which is just about > the same amount of overhead as the code-as-committed. The proposed scheme doesn't rely on knowing whether a VACUUM is active or not. -- Simon Riggs EnterpriseDB http://www.enterprisedb.com ---------------------------(end of broadcast)--------------------------- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to majordomo@postgresql.org so that your message can get through to the mailing list cleanly |
| |||
| Simon Riggs <simon@2ndquadrant.com> writes: > That wasn't the proposal. Every split would be marked and stay marked > until those blocks were VACUUMed. The data used to mark is readily > available and doesn't rely on whether or not VACUUM is running. > IMHO this does work. OK, I misunderstood what you had in mind, but now that I do understand it doesn't seem terribly efficient. What you're suggesting is that we take as a "vacuum group" all the pages that have been split off from a single original page since that page was last vacuumed, and that this group must be vacuumed as a whole. That works, but it seems that the groups would get awfully large. In particular, this substantially penalizes btbulkdelete in hopes of someday improving matters for what remains an entirely fictional partial vacuum. As it stands today, btbulkdelete only has to worry about page groups formed since it began to run, not since the last vacuum. Changing the data representation like this would force it to retrace much more often and over much larger page groups. >> (The other little problem is that we'd have to enlarge the BTOpaque >> overhead, because a block id doesn't fit in the available 16 bits.) > ISTM it is indeed a little problem. After CREATE INDEX, most index pages > don't stay completely full, so worrying about alignment wastage might > very occasionally save an I/O, but not enough for us to care. I don't think an extra 4 or 8 bytes need be a show-stopper, but it does force an initdb and thus make it harder to compare the two techniques. regards, tom lane ---------------------------(end of broadcast)--------------------------- TIP 5: don't forget to increase your free space map settings |
| |||
| On Mon, 2006-05-08 at 11:26 -0400, Tom Lane wrote: > Simon Riggs <simon@2ndquadrant.com> writes: > > That wasn't the proposal. Every split would be marked and stay marked > > until those blocks were VACUUMed. The data used to mark is readily > > available and doesn't rely on whether or not VACUUM is running. > > IMHO this does work. > > OK, I misunderstood what you had in mind, but now that I do understand > it doesn't seem terribly efficient. What you're suggesting is that we > take as a "vacuum group" all the pages that have been split off from a > single original page since that page was last vacuumed, and that this > group must be vacuumed as a whole. That works, but it seems that the > groups would get awfully large. In particular, this substantially > penalizes btbulkdelete in hopes of someday improving matters for what > remains an entirely fictional partial vacuum. OK, so we have the germ of a new mechanism - and I very much agree that the idea of a partial vacuum is at present entirely fictional...but we at least have a starting place. > As it stands today, > btbulkdelete only has to worry about page groups formed since it began > to run, not since the last vacuum. Changing the data representation > like this would force it to retrace much more often and over much larger > page groups. Yes, I saw the potential issue you mention - but for many cases the index grows forwards and so we wouldn't care in either case. Page splits that go to lower blockids are limited by available space, so would be less of a problem. I'm balancing the additional cost of page splits against the additional cost on the vacuum. I would prefer to keep in-line ops faster and pay a little extra on the out-of-line operations, if thats what it takes. I note your point that there is little contention, but there is still a cost and in many cases this cost is being paid on tables that never will be VACUUMed. For insert-intensive apps, this adds cost, for little benefit. For update-intensive apps, we're VACUUMing continually anyway so there's no benefit from doing this only-during-VACUUM. So we just optimised for slowly-but-continually churning tables (i.e. DELETEs match INSERTs, or just UPDATEs). i.e. we just improved VACUUM performance for those that don't need it that often. That might be the common case, but it isn't the one thats hurting most. -- Simon Riggs EnterpriseDB http://www.enterprisedb.com ---------------------------(end of broadcast)--------------------------- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to majordomo@postgresql.org so that your message can get through to the mailing list cleanly |
| |||
| Simon Riggs <simon@2ndquadrant.com> writes: > So we just optimised for slowly-but-continually churning tables (i.e. > DELETEs match INSERTs, or just UPDATEs). i.e. we just improved VACUUM > performance for those that don't need it that often. That might be the > common case, but it isn't the one thats hurting most. No, we just improved VACUUM performance for those that need it most. I was just doing some desultory experiments with today's code versus yesterday's (it's easy to make a direct comparison because they're initdb-compatible ... just stop one postmaster executable and start another on the same database). I made a table of 16M rows with an index over a random-data integer column. With a thoroughly disordered index (built on-the-fly as the random data was inserted), the time to VACUUM after deleting a small number of rows was 615 seconds with yesterday's code, 31 seconds today. With a perfectly-ordered index (identical table, but CREATE INDEX after all the data is in place), the times were about 28 and 26 seconds respectively. (I wouldn't put a *whole* lot of faith in these numbers, since they're from a Dell machine with a toy ATA drive, but anyway they do show that sequential access to the index makes a huge difference.) But perfectly-ordered indexes are impossible to maintain unless your data is either static or insert-at- the-end-only. Anyway, while the extra LWLock and shared-memory access during a split is annoying, I think it's really negligible (and so does pgbench). A page split is going to involve many times that much work --- you've got to acquire lock on at least four different buffers, visit the FSM, possibly ask the kernel for another disk page, do XLogInsert twice, etc. Any one of those things involves significantly more CPU effort and contention than _bt_vacuum_cycleid(). So while I'd be happy to get rid of it, not at the price of slowing down VACUUM again. regards, tom lane ---------------------------(end of broadcast)--------------------------- TIP 4: Have you searched our list archives? http://archives.postgresql.org |
| |||
| Tom, On 5/8/06 11:46 AM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote: > I made a table of 16M rows with an > index over a random-data integer column. With a thoroughly disordered > index (built on-the-fly as the random data was inserted), the time to > VACUUM after deleting a small number of rows was 615 seconds with > yesterday's code, 31 seconds today. With a perfectly-ordered index > (identical table, but CREATE INDEX after all the data is in place), the > times were about 28 and 26 seconds respectively. Very impressive! This corroborates findings we've had with index maintenance in the field - thanks for finding/fixing this. - Luke ---------------------------(end of broadcast)--------------------------- TIP 5: don't forget to increase your free space map settings |
| ||||
| On Mon, 2006-05-08 at 14:46 -0400, Tom Lane wrote: > Simon Riggs <simon@2ndquadrant.com> writes: > > So we just optimised for slowly-but-continually churning tables (i.e. > > DELETEs match INSERTs, or just UPDATEs). i.e. we just improved VACUUM > > performance for those that don't need it that often. That might be the > > common case, but it isn't the one thats hurting most. > > No, we just improved VACUUM performance for those that need it most. > I was just doing some desultory experiments with today's code versus > yesterday's (it's easy to make a direct comparison because they're > initdb-compatible ... just stop one postmaster executable and start > another on the same database). I made a table of 16M rows with an > index over a random-data integer column. With a thoroughly disordered > index (built on-the-fly as the random data was inserted), the time to > VACUUM after deleting a small number of rows was 615 seconds with > yesterday's code, 31 seconds today. With a perfectly-ordered index > (identical table, but CREATE INDEX after all the data is in place), the > times were about 28 and 26 seconds respectively. (I wouldn't put a > *whole* lot of faith in these numbers, since they're from a Dell machine > with a toy ATA drive, but anyway they do show that sequential access to > the index makes a huge difference.) But perfectly-ordered indexes are > impossible to maintain unless your data is either static or insert-at- > the-end-only. You and Heikki have achieved a marvelous thing, well done. > Anyway, while the extra LWLock and shared-memory access during a split > is annoying, I think it's really negligible (and so does pgbench). > A page split is going to involve many times that much work --- you've > got to acquire lock on at least four different buffers, visit the FSM, > possibly ask the kernel for another disk page, do XLogInsert twice, etc. > Any one of those things involves significantly more CPU effort and > contention than _bt_vacuum_cycleid(). So while I'd be happy to get rid > of it, not at the price of slowing down VACUUM again. I'll raise the partial vacuum topic on -hackers. -- Simon Riggs EnterpriseDB http://www.enterprisedb.com ---------------------------(end of broadcast)--------------------------- TIP 4: Have you searched our list archives? http://archives.postgresql.org |