This is a discussion on Re: Bug: Buffer cache is not scan resistant within the pgsql Hackers forums, part of the PostgreSQL category.
Incidentally, we tried triggering NTA (L2 cache bypass) unconditionally and in various patterns and did not see the substantial gain as with reducing the working set size.

My conclusion: Fixing the OS is not sufficient to alleviate the issue. We see a 2x penalty (1700MB/s versus 3500MB/s) at the higher data rates due to this effect.

- Luke

Msg is shrt cuz m on ma treo

-----Original Message-----
From: Sherry Moore [mailto:sherry.moore@sun.com]
Sent: Tuesday, March 06, 2007 10:05 PM Eastern Standard Time
To: Simon Riggs
Cc: Sherry Moore; Tom Lane; Luke Lonergan; Mark Kirkwood; Pavan Deolasee; Gavin Sherry; PGSQL Hackers; Doug Rady
Subject: Re: [HACKERS] Bug: Buffer cache is not scan resistant

Hi Simon,

> and what you haven't said
>
> - all of this is orthogonal to the issue of buffer cache spoiling in
> PostgreSQL itself. That issue does still exist as a non-OS issue, but
> we've been discussing in detail the specific case of L2 cache effects
> with specific kernel calls. All of the test results have been
> stand-alone, so we've not done any measurements in that area. I say this
> because you make the point that reducing the working set size of write
> workloads has no effect on the L2 cache issue, but ISTM it's still
> potentially a cache spoiling issue.

What I wanted to point out was that (reiterating to avoid requoting),

- My test was simply to demonstrate that the observed performance difference with VACUUM was caused by whether the size of the user buffer caused L2 thrashing.

- In general, applications should reduce the size of the working set to reduce the penalty of TLB misses and cache misses.

- If the application access pattern meets the NTA trigger condition, the benefit of reducing the working set size will be much smaller.

Whatever I said is probably orthogonal to the buffer cache issue you guys have been discussing, but I haven't read all the email exchange on the subject.

Thanks,
Sherry
--
Sherry Moore, Solaris Kernel Development http://blogs.sun.com/sherrym
On Tue, 2007-03-06 at 22:32 -0500, Luke Lonergan wrote:
> Incidentally, we tried triggering NTA (L2 cache bypass)
> unconditionally and in various patterns and did not see the
> substantial gain as with reducing the working set size.
>
> My conclusion: Fixing the OS is not sufficient to alleviate the issue.
> We see a 2x penalty (1700MB/s versus 3500MB/s) at the higher data
> rates due to this effect.
>

I've implemented buffer recycling, as previously described, patch being posted now to -patches as "scan_recycle_buffers".

This version includes buffer recycling

- for SeqScans larger than shared buffers, with the objective of improving L2 cache efficiency *and* reducing the effects of shared buffer cache spoiling (both as previously discussed on this thread)

- for VACUUMs of any size, with the objective of reducing WAL thrashing whilst keeping VACUUM's behaviour of not spoiling the buffer cache (as originally suggested by Itagaki-san, just with a different implementation).

Behaviour is not activated by default in this patch. To request buffer recycling, set the USERSET GUC

SET scan_recycle_buffers = N

tested with 1,4,8,16, but only > 8 seems sensible, IMHO.

Patch affects StrategyGetBuffer, so it only affects the disk->cache path. The idea is that if it's already in shared buffer cache then we get substantial benefit already and nothing else is needed. So for the general case, the patch adds a single if test into the I/O path.

The parameter is picked up at the start of SeqScan and VACUUM (currently). Any change mid-scan will be ignored.

IMHO it's possible to do this and to allow Synch Scans at the same time, with some thought. There is no need for us to rely on cache spoiling behaviour of scans to implement that feature as well.

Independent performance tests requested, so that we can discuss this objectively.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com

---------------------------(end of broadcast)---------------------------
TIP 5: don't forget to increase your free space map settings
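To make the cache-spoiling argument above concrete, here is a minimal sketch of the general idea of recycling a small ring of buffers during a large scan. It is a toy model only: it uses a plain LRU pool rather than PostgreSQL's real buffer manager, it is not the code from the scan_recycle_buffers patch, and the names ToyBufferPool, read and hot_pages_surviving are invented for illustration.

```python
from collections import OrderedDict, deque


class ToyBufferPool:
    """Toy LRU buffer pool (an assumption; NOT PostgreSQL's buffer manager)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = OrderedDict()  # page id -> None, least recently used first

    def read(self, page, ring=None, ring_size=0):
        """Read one page and return True on a cache hit.

        ring is an optional deque of pages this scan loaded itself; it is a
        hypothetical stand-in for a setting like scan_recycle_buffers = N.
        """
        if page in self.pages:
            self.pages.move_to_end(page)
            return True
        if ring is not None and len(ring) >= ring_size:
            # Ring is full: reuse the oldest buffer this scan filled.
            self.pages.pop(ring.popleft(), None)
        if len(self.pages) >= self.capacity:
            # No free buffer: evict the least recently used page.
            self.pages.popitem(last=False)
        self.pages[page] = None
        if ring is not None:
            ring.append(page)
        return False


def hot_pages_surviving(ring_size):
    """Count hot pages still cached after scanning a table 10x the pool size."""
    pool = ToyBufferPool(capacity=100)
    hot = [("hot", i) for i in range(50)]
    for page in hot:
        pool.read(page)
    ring = deque() if ring_size > 0 else None
    for i in range(1000):
        pool.read(("big", i), ring=ring, ring_size=ring_size)
    return sum(1 for page in hot if page in pool.pages)
```

In this toy model, hot_pages_surviving(0) returns 0 because the scan pushes every hot page out of the pool, while hot_pages_surviving(16) returns 50 because the scan only ever reuses its own 16 buffers. Real behaviour depends on the actual replacement policy and on concurrent activity, which the sketch ignores.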
"Simon Riggs" <simon@2ndquadrant.com> wrote:

> I've implemented buffer recycling, as previously described, patch being
> posted now to -patches as "scan_recycle_buffers".
>
> - for VACUUMs of any size, with the objective of reducing WAL thrashing
> whilst keeping VACUUM's behaviour of not spoiling the buffer cache (as
> originally suggested by Itagaki-san, just with a different
> implementation).

I tested your patch with VACUUM FREEZE. The performance was improved when I set scan_recycle_buffers > 32. I used VACUUM FREEZE to increase WAL traffic, but this patch should be useful for normal VACUUMs with background jobs!

  N  | time  | WAL flush(*)
-----+-------+-----------
   0 | 58.7s |  0.01%
   1 | 80.3s | 81.76%
   8 | 73.4s | 16.73%
  16 | 64.2s |  9.24%
  32 | 59.0s |  4.88%
  64 | 56.7s |  2.63%
 128 | 55.1s |  1.41%

(*) WAL flush is the ratio of the need of fsync to buffer recycle.

# SET scan_recycle_buffers = 0;
# UPDATE accounts SET aid=aid WHERE random() < 0.005;
# CHECKPOINT;
# SET scan_recycle_buffers = <N>;
# VACUUM FREEZE accounts;

BTW, does the patch change the default usage of buffer in vacuum? From what I've seen, scan_recycle_buffers = 1 is the same as before. With the default value of scan_recycle_buffers(=0), VACUUM seems to use all of buffers in pool, just like existing sequential scans. Is this intended?

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center
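As a reading aid for the VACUUM FREEZE timings above, the following short sketch expresses each run time as a percentage change against the N=0 run. The timings are copied from the post; the function name relative_change is made up for this example, and the N=0 baseline is the patched build with recycling off, not an unpatched server.

```python
# VACUUM FREEZE timings in seconds, keyed by scan_recycle_buffers = N,
# copied from the table in the post above.
freeze_times = {0: 58.7, 1: 80.3, 8: 73.4, 16: 64.2, 32: 59.0, 64: 56.7, 128: 55.1}


def relative_change(times, baseline_n=0):
    """Return the percent change in run time versus the baseline setting."""
    base = times[baseline_n]
    return {n: round((t - base) / base * 100, 1) for n, t in times.items()}


changes = relative_change(freeze_times)
# Settings that ran faster than the N=0 baseline.
faster = [n for n, pct in changes.items() if pct < 0]
```

With these figures, only N=64 and N=128 come out faster than N=0 (by roughly 3% and 6%), which matches the observation that performance improved above 32.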
On Mon, 2007-03-12 at 16:21 +0900, ITAGAKI Takahiro wrote:
> "Simon Riggs" <simon@2ndquadrant.com> wrote:
>
> > I've implemented buffer recycling, as previously described, patch being
> > posted now to -patches as "scan_recycle_buffers".
> >
> > - for VACUUMs of any size, with the objective of reducing WAL thrashing
> > whilst keeping VACUUM's behaviour of not spoiling the buffer cache (as
> > originally suggested by Itagaki-san, just with a different
> > implementation).
>
> I tested your patch with VACUUM FREEZE. The performance was improved when
> I set scan_recycle_buffers > 32. I used VACUUM FREEZE to increase WAL traffic,
> but this patch should be useful for normal VACUUMs with background jobs!

Thanks.

>   N  | time  | WAL flush(*)
> -----+-------+-----------
>    0 | 58.7s |  0.01%
>    1 | 80.3s | 81.76%
>    8 | 73.4s | 16.73%
>   16 | 64.2s |  9.24%
>   32 | 59.0s |  4.88%
>   64 | 56.7s |  2.63%
>  128 | 55.1s |  1.41%
>
> (*) WAL flush is the ratio of the need of fsync to buffer recycle.

Do you have the same measurement without patch applied? I'd be interested in the current state also (the N=0 path is modified as well for VACUUM, in this patch).

> # SET scan_recycle_buffers = 0;
> # UPDATE accounts SET aid=aid WHERE random() < 0.005;
> # CHECKPOINT;
> # SET scan_recycle_buffers = <N>;
> # VACUUM FREEZE accounts;

Very good results, thanks. I'd be interested in the same thing for just VACUUM and for varying ratios of dirty/clean blocks during vacuum.

> BTW, does the patch change the default usage of buffer in vacuum? From what
> I've seen, scan_recycle_buffers = 1 is the same as before.

That was the intention.

> With the default
> value of scan_recycle_buffers(=0), VACUUM seems to use all of buffers in pool,
> just like existing sequential scans. Is this intended?

Yes, but it's not very useful for testing to have done that. I'll do another version within the hour that sets N=0 (only) back to current behaviour for VACUUM.

One of the objectives of the patch was to prevent VACUUM from tripping up other backends. I'm confident that it will improve that situation for OLTP workloads running regular concurrent VACUUMs, but we will need to wait a couple of days to get those results in also.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com
On Mon, 2007-03-12 at 09:14 +0000, Simon Riggs wrote:
> On Mon, 2007-03-12 at 16:21 +0900, ITAGAKI Takahiro wrote:
> > With the default
> > value of scan_recycle_buffers(=0), VACUUM seems to use all of buffers in pool,
> > just like existing sequential scans. Is this intended?
>
> Yes, but it's not very useful for testing to have done that. I'll do
> another version within the hour that sets N=0 (only) back to current
> behaviour for VACUUM.

New test version enclosed, where scan_recycle_buffers = 0 doesn't change existing VACUUM behaviour.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com
ITAGAKI Takahiro <itagaki.takahiro@oss.ntt.co.jp> writes:
> I tested your patch with VACUUM FREEZE. The performance was improved when
> I set scan_recycle_buffers > 32. I used VACUUM FREEZE to increase WAL traffic,
> but this patch should be useful for normal VACUUMs with background jobs!

Proving that you can see a difference in a worst-case scenario is not the same as proving that the patch is useful in normal cases.

regards, tom lane
On Mon, 2007-03-12 at 10:30 -0400, Tom Lane wrote:
> ITAGAKI Takahiro <itagaki.takahiro@oss.ntt.co.jp> writes:
> > I tested your patch with VACUUM FREEZE. The performance was improved when
> > I set scan_recycle_buffers > 32. I used VACUUM FREEZE to increase WAL traffic,
> > but this patch should be useful for normal VACUUMs with background jobs!
>
> Proving that you can see a difference in a worst-case scenario is not the
> same as proving that the patch is useful in normal cases.

I agree, but I think that this VACUUM FREEZE test does actually represent the normal case. Here's why:

The poor buffer manager behaviour occurs if the block is dirty as a result of WAL-logged changes. It only takes the removal of a single row for us to have WAL logged this and dirtied the block.

If we invoke VACUUM from autovacuum, we do this by default when 20% of the rows have been updated, which means with many distributions of UPDATEs we'll have touched a very large proportion of blocks before we VACUUM. That isn't true for *all* distributions of UPDATEs, but it will soon be. Dead Space Map will deliver only dirty blocks for us.

So running a VACUUM FREEZE is a reasonable simulation of running a large VACUUM on a real production system with default autovacuum enabled, as will be the case for 8.3.

It is possible that we run VACUUM when fewer dirty blocks are generated, but this won't be the common situation and not something we should optimise for.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com
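The claim that updating 20% of rows touches a very large proportion of blocks can be checked with a quick back-of-the-envelope calculation. The sketch below assumes updates are spread uniformly and independently over rows and that every block holds the same number of rows; both are simplifying assumptions not stated in the thread, and the function name is invented for illustration.

```python
def fraction_of_blocks_touched(updated_fraction, rows_per_block):
    """Probability that a block holds at least one updated row.

    Assumes each row is updated independently with probability
    updated_fraction (a simplification; real UPDATE patterns can be skewed).
    """
    untouched = (1.0 - updated_fraction) ** rows_per_block
    return 1.0 - untouched


# 20% of rows updated, for a few hypothetical rows-per-block values.
for rows in (1, 10, 50, 100):
    share = fraction_of_blocks_touched(0.2, rows)
    print(f"{rows:>3} rows/block -> {share:.4%} of blocks touched")
```

Under these assumptions, about 89% of blocks are touched at 10 rows per block, and the share approaches 100% as blocks hold more rows. Strongly clustered updates would touch far fewer blocks, which is the caveat about "not all distributions" made above.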
"Simon Riggs" <simon@2ndquadrant.com> wrote:

> > > With the default
> > > value of scan_recycle_buffers(=0), VACUUM seems to use all of buffers in pool,
> > > just like existing sequential scans. Is this intended?
> >
> New test version enclosed, where scan_recycle_buffers = 0 doesn't change
> existing VACUUM behaviour.

This is a result with scan_recycle_buffers.v3.patch. In this instance I used normal VACUUM with background load from a slowed-down pgbench. I believe the patch is useful in normal cases, not only for VACUUM FREEZE.

  N  |  time  | WAL flush(*)
-----+--------+-----------
   0 | 112.8s | 44.3%
   1 | 148.9s | 52.1%
   8 | 105.1s | 17.6%
  16 |  96.9s |  8.7%
  32 | 103.9s |  6.3%
  64 |  89.4s |  6.6%
 128 |  80.0s |  3.8%

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center
Simon,

You may know we've built something similar and have seen similar gains. We're planning a modification that I think you should consider: when there is a sequential scan of a table larger than the size of shared_buffers, we are allowing the scan to write through the shared_buffers cache.

The hypothesis is that if a relation is of a size equal to or less than the size of shared_buffers, it is "cacheable" and should use the standard LRU approach to provide for reuse.

- Luke

On 3/12/07 3:08 AM, "Simon Riggs" <simon@2ndquadrant.com> wrote:

> On Mon, 2007-03-12 at 09:14 +0000, Simon Riggs wrote:
>> On Mon, 2007-03-12 at 16:21 +0900, ITAGAKI Takahiro wrote:
>
>>> With the default
>>> value of scan_recycle_buffers(=0), VACUUM seems to use all of buffers in
>>> pool,
>>> just like existing sequential scans. Is this intended?
>>
>> Yes, but it's not very useful for testing to have done that. I'll do
>> another version within the hour that sets N=0 (only) back to current
>> behaviour for VACUUM.
>
> New test version enclosed, where scan_recycle_buffers = 0 doesn't change
> existing VACUUM behaviour.
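The size-based hypothesis above amounts to a single comparison, sketched below. This is only an illustration of the rule as stated in the post: the function name and the "lru" / "write_through" labels are hypothetical, and the post does not spell out what "write through" does in practice.

```python
def choose_scan_strategy(relation_pages, shared_buffers_pages):
    """Pick a caching strategy for a sequential scan (hypothetical labels).

    A relation no larger than shared_buffers is treated as "cacheable" and
    uses the standard LRU path; anything larger takes the other path.
    """
    if relation_pages <= shared_buffers_pages:
        return "lru"
    return "write_through"
```

Both arguments are assumed to be in the same unit (pages here); a real implementation would also need to decide when the relation size is measured, which the thread does not cover.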
On Mon, 2007-03-12 at 22:16 -0700, Luke Lonergan wrote:

> You may know we've built something similar and have seen similar gains.

Cool

> We're planning a modification that I think you should consider: when there
> is a sequential scan of a table larger than the size of shared_buffers, we
> are allowing the scan to write through the shared_buffers cache.

Write? For which operations?

I was thinking to do this for bulk writes also, but it would require changes to bgwriter's cleaning sequence. Are you saying to write say ~32 buffers then fsync them, rather than letting bgwriter do that? Then allow those buffers to be reused?

> The hypothesis is that if a relation is of a size equal to or less than the
> size of shared_buffers, it is "cacheable" and should use the standard LRU
> approach to provide for reuse.

Sounds reasonable. Please say more.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com