Unix Technical Forum

Re: Bug: Buffer cache is not scan resistant

This is a discussion on Re: Bug: Buffer cache is not scan resistant within the pgsql Hackers forums, part of the PostgreSQL category; --> Incidentally, we tried triggering NTA (L2 cache bypass) unconditionally and in various patterns and did not see the substantial ...


Go Back   Unix Technical Forum > Database Server Software > PostgreSQL > pgsql Hackers

FAQ Members List Calendar Search Today's Posts Mark Forums Read
  #1 (permalink)  
Old 04-12-2008, 07:37 AM
Luke Lonergan
 
Posts: n/a
Default Re: Bug: Buffer cache is not scan resistant

Incidentally, we tried triggering NTA (L2 cache bypass) unconditionally and in various patterns and did not see the substantial gain as with reducing the working set size.

My conclusion: Fixing the OS is not sufficient to alleviate the issue. We see a 2x penalty (1700MB/s versus 3500MB/s) at the higher data rates due to this effect.

- Luke

Msg is shrt cuz m on ma treo

-----Original Message-----
From: Sherry Moore [mailto:sherry.moore@sun.com]
Sent: Tuesday, March 06, 2007 10:05 PM Eastern Standard Time
To: Simon Riggs
Cc: Sherry Moore; Tom Lane; Luke Lonergan; Mark Kirkwood; Pavan Deolasee; Gavin Sherry; PGSQL Hackers; Doug Rady
Subject: Re: [HACKERS] Bug: Buffer cache is not scan resistant

Hi Simon,

> and what you haven't said
>
> - all of this is orthogonal to the issue of buffer cache spoiling in
> PostgreSQL itself. That issue does still exist as a non-OS issue, but
> we've been discussing in detail the specific case of L2 cache effects
> with specific kernel calls. All of the test results have been
> stand-alone, so we've not done any measurements in that area. I say this
> because you make the point that reducing the working set size of write
> workloads has no effect on the L2 cache issue, but ISTM its still
> potentially a cache spoiling issue.


What I wanted to point out was that (reiterating to avoid requoting),

- My test was simply to demonstrate that the observed performance
difference with VACUUM was caused by whether the size of the
user buffer caused L2 thrashing.

- In general, application should reduce the size of the working set
to reduce the penalty of TLB misses and cache misses.

- If the application access pattern meets the NTA trigger condition,
the benefit of reducing the working set size will be much smaller.

Whatever I said is probably orthogonal to the buffer cache issue you
guys have been discussing, but I haven't read all the email exchange
on the subject.

Thanks,
Sherry
--
Sherry Moore, Solaris Kernel Development http://blogs.sun.com/sherrym


Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #2 (permalink)  
Old 04-12-2008, 07:41 AM
Simon Riggs
 
Posts: n/a
Default Re: Bug: Buffer cache is not scan resistant

On Tue, 2007-03-06 at 22:32 -0500, Luke Lonergan wrote:
> Incidentally, we tried triggering NTA (L2 cache bypass)
> unconditionally and in various patterns and did not see the
> substantial gain as with reducing the working set size.
>
> My conclusion: Fixing the OS is not sufficient to alleviate the issue.
> We see a 2x penalty (1700MB/s versus 3500MB/s) at the higher data
> rates due to this effect.
>

I've implemented buffer recycling, as previously described, patch being
posted now to -patches as "scan_recycle_buffers".

This version includes buffer recycling

- for SeqScans larger than shared buffers, with the objective of
improving L2 cache efficiency *and* reducing the effects of shared
buffer cache spoiling (both as previously discussed on this thread)

- for VACUUMs of any size, with the objective of reducing WAL thrashing
whilst keeping VACUUM's behaviour of not spoiling the buffer cache (as
originally suggested by Itagaki-san, just with a different
implementation).

Behaviour is not activated by default in this patch. To request buffer
recycling, set the USERSET GUC
SET scan_recycle_buffers = N
tested with 1,4,8,16, but only > 8 seems sensible, IMHO.

Patch effects StrategyGetBuffer, so only effects the disk->cache path.
The idea is that if its already in shared buffer cache then we get
substantial benefit already and nothing else is needed. So for the
general case, the patch adds a single if test into the I/O path.

The parameter is picked up at the start of SeqScan and VACUUM
(currently). Any change mid-scan will be ignored.

IMHO its possible to do this and to allow Synch Scans at the same time,
with some thought. There is no need for us to rely on cache spoiling
behaviour of scans to implement that feature as well.

Independent performance tests requested, so that we can discuss this
objectively.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com



---------------------------(end of broadcast)---------------------------
TIP 5: don't forget to increase your free space map settings

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #3 (permalink)  
Old 04-12-2008, 07:43 AM
ITAGAKI Takahiro
 
Posts: n/a
Default Re: Bug: Buffer cache is not scan resistant


"Simon Riggs" <simon@2ndquadrant.com> wrote:

> I've implemented buffer recycling, as previously described, patch being
> posted now to -patches as "scan_recycle_buffers".
>
> - for VACUUMs of any size, with the objective of reducing WAL thrashing
> whilst keeping VACUUM's behaviour of not spoiling the buffer cache (as
> originally suggested by Itagaki-san, just with a different
> implementation).


I tested your patch with VACUUM FREEZE. The performance was improved when
I set scan_recycle_buffers > 32. I used VACUUM FREEZE to increase WAL traffic,
but this patch should be useful for normal VACUUMs with backgrond jobs!

N | time | WAL flush(*)
-----+-------+-----------
0 | 58.7s | 0.01%
1 | 80.3s | 81.76%
8 | 73.4s | 16.73%
16 | 64.2s | 9.24%
32 | 59.0s | 4.88%
64 | 56.7s | 2.63%
128 | 55.1s | 1.41%

(*) WAL flush is the ratio of the need of fsync to buffer recycle.

# SET scan_recycle_buffers = 0;
# UPDATE accounts SET aid=aid WHERE random() < 0.005;
# CHECKPOINT;
# SET scan_recycle_buffers = <N>;
# VACUUM FREEZE accounts;


BTW, does the patch change the default usage of buffer in vacuum? From what
I've seen, scan_recycle_buffers = 1 is the same as before. With the default
value of scan_recycle_buffers(=0), VACUUM seems to use all of buffers in pool,
just like existing sequential scans. Is this intended?

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center



---------------------------(end of broadcast)---------------------------
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #4 (permalink)  
Old 04-12-2008, 07:43 AM
Simon Riggs
 
Posts: n/a
Default Re: Bug: Buffer cache is not scan resistant

On Mon, 2007-03-12 at 16:21 +0900, ITAGAKI Takahiro wrote:
> "Simon Riggs" <simon@2ndquadrant.com> wrote:
>
> > I've implemented buffer recycling, as previously described, patch being
> > posted now to -patches as "scan_recycle_buffers".
> >
> > - for VACUUMs of any size, with the objective of reducing WAL thrashing
> > whilst keeping VACUUM's behaviour of not spoiling the buffer cache (as
> > originally suggested by Itagaki-san, just with a different
> > implementation).

>
> I tested your patch with VACUUM FREEZE. The performance was improved when
> I set scan_recycle_buffers > 32. I used VACUUM FREEZE to increase WAL traffic,
> but this patch should be useful for normal VACUUMs with backgrond jobs!


Thanks.

> N | time | WAL flush(*)
> -----+-------+-----------
> 0 | 58.7s | 0.01%
> 1 | 80.3s | 81.76%
> 8 | 73.4s | 16.73%
> 16 | 64.2s | 9.24%
> 32 | 59.0s | 4.88%
> 64 | 56.7s | 2.63%
> 128 | 55.1s | 1.41%
>
> (*) WAL flush is the ratio of the need of fsync to buffer recycle.


Do you have the same measurement without patch applied? I'd be
interested in the current state also (the N=0 path is modified as well
for VACUUM, in this patch).

> # SET scan_recycle_buffers = 0;
> # UPDATE accounts SET aid=aid WHERE random() < 0.005;
> # CHECKPOINT;
> # SET scan_recycle_buffers = <N>;
> # VACUUM FREEZE accounts;


Very good results, thanks. I'd be interested in the same thing for just
VACUUM and for varying ratios of dirty/clean blocks during vacuum.

> BTW, does the patch change the default usage of buffer in vacuum? From what
> I've seen, scan_recycle_buffers = 1 is the same as before.


That was the intention.

> With the default
> value of scan_recycle_buffers(=0), VACUUM seems to use all of buffers in pool,
> just like existing sequential scans. Is this intended?


Yes, but its not very useful for testing to have done that. I'll do
another version within the hour that sets N=0 (only) back to current
behaviour for VACUUM.

One of the objectives of the patch was to prevent VACUUM from tripping
up other backends. I'm confident that it will improve that situation for
OLTP workloads running regular concurrent VACUUMs, but we will need to
wait a couple of days to get those results in also.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com



---------------------------(end of broadcast)---------------------------
TIP 4: Have you searched our list archives?

http://archives.postgresql.org

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #5 (permalink)  
Old 04-12-2008, 07:43 AM
Simon Riggs
 
Posts: n/a
Default Re: Bug: Buffer cache is not scan resistant

On Mon, 2007-03-12 at 09:14 +0000, Simon Riggs wrote:
> On Mon, 2007-03-12 at 16:21 +0900, ITAGAKI Takahiro wrote:


> > With the default
> > value of scan_recycle_buffers(=0), VACUUM seems to use all of buffers in pool,
> > just like existing sequential scans. Is this intended?

>
> Yes, but its not very useful for testing to have done that. I'll do
> another version within the hour that sets N=0 (only) back to current
> behaviour for VACUUM.


New test version enclosed, where scan_recycle_buffers = 0 doesn't change
existing VACUUM behaviour.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com



---------------------------(end of broadcast)---------------------------
TIP 3: Have you checked our extensive FAQ?

http://www.postgresql.org/docs/faq

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #6 (permalink)  
Old 04-12-2008, 07:43 AM
Tom Lane
 
Posts: n/a
Default Re: Bug: Buffer cache is not scan resistant

ITAGAKI Takahiro <itagaki.takahiro@oss.ntt.co.jp> writes:
> I tested your patch with VACUUM FREEZE. The performance was improved when
> I set scan_recycle_buffers > 32. I used VACUUM FREEZE to increase WAL traffic,
> but this patch should be useful for normal VACUUMs with backgrond jobs!


Proving that you can see a different in a worst-case scenario is not the
same as proving that the patch is useful in normal cases.

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
choose an index scan if your joining column's datatypes do not
match

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #7 (permalink)  
Old 04-12-2008, 07:43 AM
Simon Riggs
 
Posts: n/a
Default Re: Bug: Buffer cache is not scan resistant

On Mon, 2007-03-12 at 10:30 -0400, Tom Lane wrote:
> ITAGAKI Takahiro <itagaki.takahiro@oss.ntt.co.jp> writes:
> > I tested your patch with VACUUM FREEZE. The performance was improved when
> > I set scan_recycle_buffers > 32. I used VACUUM FREEZE to increase WAL traffic,
> > but this patch should be useful for normal VACUUMs with backgrond jobs!

>
> Proving that you can see a different in a worst-case scenario is not the
> same as proving that the patch is useful in normal cases.


I agree, but I think that this VACUUM FREEZE test does actually
represent the normal case, here's why:

The poor buffer manager behaviour occurs if the block is dirty as a
result of WAL-logged changes. It only takes the removal of a single row
for us to have WAL logged this and dirtied the block.

If we invoke VACUUM from autovacuum, we do this by default when 20% of
the rows have been updated, which means with many distributions of
UPDATEs we'll have touched a very large proportion of blocks before we
VACUUM. That isn't true for *all* distributions of UPDATEs, but it will
soon be. Dead Space Map will deliver only dirty blocks for us.

So running a VACUUM FREEZE is a reasonable simulation of running a large
VACUUM on a real production system with default autovacuum enabled, as
will be the case for 8.3.

It is possible that we run VACUUM when fewer dirty blocks are generated,
but this won't be the common situation and not something we should
optimise for.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com



---------------------------(end of broadcast)---------------------------
TIP 1: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #8 (permalink)  
Old 04-12-2008, 07:44 AM
ITAGAKI Takahiro
 
Posts: n/a
Default Re: Bug: Buffer cache is not scan resistant


"Simon Riggs" <simon@2ndquadrant.com> wrote:

> > > With the default
> > > value of scan_recycle_buffers(=0), VACUUM seems to use all of buffers in pool,
> > > just like existing sequential scans. Is this intended?

> >

> New test version enclosed, where scan_recycle_buffers = 0 doesn't change
> existing VACUUM behaviour.


This is a result with scan_recycle_buffers.v3.patch. I used normal VACUUM
with background load using slowdown-ed pgbench in this instance. I believe
the patch is useful in normal cases, not only for VACUUM FREEZE.

N | time | WAL flush(*)
-----+--------+-----------
0 | 112.8s | 44.3%
1 | 148.9s | 52.1%
8 | 105.1s | 17.6%
16 | 96.9s | 8.7%
32 | 103.9s | 6.3%
64 | 89.4s | 6.6%
128 | 80.0s | 3.8%

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center



---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
choose an index scan if your joining column's datatypes do not
match

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #9 (permalink)  
Old 04-12-2008, 07:44 AM
Luke Lonergan
 
Posts: n/a
Default Re: Bug: Buffer cache is not scan resistant

Simon,

You may know we've built something similar and have seen similar gains.
We're planning a modification that I think you should consider: when there
is a sequential scan of a table larger than the size of shared_buffers, we
are allowing the scan to write through the shared_buffers cache.

The hypothesis is that if a relation is of a size equal to or less than the
size of shared_buffers, it is "cacheable" and should use the standard LRU
approach to provide for reuse.

- Luke

On 3/12/07 3:08 AM, "Simon Riggs" <simon@2ndquadrant.com> wrote:

> On Mon, 2007-03-12 at 09:14 +0000, Simon Riggs wrote:
>> On Mon, 2007-03-12 at 16:21 +0900, ITAGAKI Takahiro wrote:

>
>>> With the default
>>> value of scan_recycle_buffers(=0), VACUUM seems to use all of buffers in
>>> pool,
>>> just like existing sequential scans. Is this intended?

>>
>> Yes, but its not very useful for testing to have done that. I'll do
>> another version within the hour that sets N=0 (only) back to current
>> behaviour for VACUUM.

>
> New test version enclosed, where scan_recycle_buffers = 0 doesn't change
> existing VACUUM behaviour.




---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #10 (permalink)  
Old 04-12-2008, 07:44 AM
Simon Riggs
 
Posts: n/a
Default Re: Bug: Buffer cache is not scan resistant

On Mon, 2007-03-12 at 22:16 -0700, Luke Lonergan wrote:

> You may know we've built something similar and have seen similar gains.


Cool

> We're planning a modification that I think you should consider: when there
> is a sequential scan of a table larger than the size of shared_buffers, we
> are allowing the scan to write through the shared_buffers cache.


Write? For which operations?

I was thinking to do this for bulk writes also, but it would require
changes to bgwriter's cleaning sequence. Are you saying to write say ~32
buffers then fsync them, rather than letting bgwriter do that? Then
allow those buffers to be reused?

> The hypothesis is that if a relation is of a size equal to or less than the
> size of shared_buffers, it is "cacheable" and should use the standard LRU
> approach to provide for reuse.


Sounds reasonable. Please say more.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com



---------------------------(end of broadcast)---------------------------
TIP 5: don't forget to increase your free space map settings

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
Reply


Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are On
Forum Jump


All times are GMT. The time now is 08:14 PM.


Powered by vBulletin® Version 3.6.5
Copyright ©2000 - 2008, Jelsoft Enterprises Ltd.
SEO by vBSEO 3.2.0
www.UnixAdminTalk.com