Unix Technical Forum

Re: Read/Write block sizes (Was: Caching by Postgres)

This is a discussion on Re: Read/Write block sizes (Was: Caching by Postgres) within the Pgsql Performance forums, part of the PostgreSQL category; --> > Does that include increasing the size of read/write blocks? I've > noticedthat with a large enough table it ...


Go Back   Unix Technical Forum > Database Server Software > PostgreSQL > Pgsql Performance

FAQ Members List Calendar Search Today's Posts Mark Forums Read
  #1 (permalink)  
Old 04-18-2008, 12:16 PM
Jignesh Shah
 
Posts: n/a
Default Re: Read/Write block sizes (Was: Caching by Postgres)

> Does that include increasing the size of read/write blocks? I've
> noticedthat with a large enough table it takes a while to do a
> sequential scan,
> even if it's cached; I wonder if the fact that it takes a million
> read(2) calls to get through an 8G table is part of that.
>



Actually some of that readaheads,etc the OS does already if it does some sort of throttling/clubbing of reads/writes. But its not enough for such types of workloads.

Here is what I think will help:

* Support for different Blocksize TABLESPACE without recompiling the code.. (Atlease support for a different Blocksize for the whole database without recompiling the code)

* Support for bigger sizes of WAL files instead of 16MB files WITHOUT recompiling the code.. Should be a tuneable if you ask me (with checkpoint_segments at 256.. you have too many 16MB files in the log directory) (This will help OLTP benchmarks more since now they don't spend time rotating log files)

* Introduce a multiblock or extent tunable variable where you can define a multiple of 8K (or BlockSize tuneable) to read a bigger chunk and store it in the bufferpool.. (Maybe writes too) (Most devices now support upto 1MB chunks for reads and writes)

*There should be a way to preallocate files for TABLES in TABLESPACES otherwise with multiple table writes in the same filesystem ends with fragmented files which causes poor "READS" from the files.

* With 64bit 1GB file chunks is also moot.. Maybe it should be tuneable too like 100GB without recompiling the code.


Why recompiling is bad? Most companies that will support Postgres will support their own binaries and they won't prefer different versions of binaries for different blocksizes, different WAL file sizes, etc... and hence more function using the same set of binaries is more desirable in enterprise environments


Regards,
Jignesh



---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
choose an index scan if your joining column's datatypes do not
match

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #2 (permalink)  
Old 04-18-2008, 12:16 PM
Chris Browne
 
Posts: n/a
Default Re: Read/Write block sizes

J.K.Shah@Sun.COM (Jignesh Shah) writes:
>> Does that include increasing the size of read/write blocks? I've
>> noticedthat with a large enough table it takes a while to do a
>> sequential scan, even if it's cached; I wonder if the fact that it
>> takes a million read(2) calls to get through an 8G table is part of
>> that.

>
> Actually some of that readaheads,etc the OS does already if it does
> some sort of throttling/clubbing of reads/writes. But its not enough
> for such types of workloads.
>
> Here is what I think will help:
>
> * Support for different Blocksize TABLESPACE without recompiling the
> code.. (Atlease support for a different Blocksize for the whole
> database without recompiling the code)
>
> * Support for bigger sizes of WAL files instead of 16MB files
> WITHOUT recompiling the code.. Should be a tuneable if you ask me
> (with checkpoint_segments at 256.. you have too many 16MB files in
> the log directory) (This will help OLTP benchmarks more since now
> they don't spend time rotating log files)
>
> * Introduce a multiblock or extent tunable variable where you can
> define a multiple of 8K (or BlockSize tuneable) to read a bigger
> chunk and store it in the bufferpool.. (Maybe writes too) (Most
> devices now support upto 1MB chunks for reads and writes)
>
> *There should be a way to preallocate files for TABLES in
> TABLESPACES otherwise with multiple table writes in the same
> filesystem ends with fragmented files which causes poor "READS" from
> the files.
>
> * With 64bit 1GB file chunks is also moot.. Maybe it should be
> tuneable too like 100GB without recompiling the code.
>
> Why recompiling is bad? Most companies that will support Postgres
> will support their own binaries and they won't prefer different
> versions of binaries for different blocksizes, different WAL file
> sizes, etc... and hence more function using the same set of binaries
> is more desirable in enterprise environments


Every single one of these still begs the question of whether the
changes will have a *material* impact on performance.

What we have been finding, as RAID controllers get smarter, is that it
is getting increasingly futile to try to attach knobs to 'disk stuff;'
it is *way* more effective to add a few more spindles to an array than
it is to fiddle with which disks are to be allocated to what database
'objects.'

The above suggested 'knobs' are all going to add to complexity and it
is NOT evident that any of them will forcibly help.

I could be wrong; code contributions combined with Actual Benchmarking
would be the actual proof of the merits of the ideas.

But it also suggests another question, namely...

Will these represent more worthwhile improvements to speed than
working on other optimizations that are on the TODO list?

If someone spends 100h working on one of these items, and gets a 2%
performance improvement, that's almost certain to be less desirable
than spending 50h on something else that gets a 4% improvement.

And we might discover that memory management improvements in Linux
2.6.16 or FreeBSD 5.5 allow some OS kernels to provide some such
improvements "for free" behind our backs without *any* need to write
database code. :-)
--
let name="cbbrowne" and tld="ntlug.org" in name ^ "@" ^ tld;;
http://www3.sympatico.ca/cbbrowne/postgresql.html
Wiener's Law of Libraries:
There are no answers, only cross references.
Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #3 (permalink)  
Old 04-18-2008, 12:16 PM
Michael Stone
 
Posts: n/a
Default Re: Read/Write block sizes (Was: Caching by Postgres)

On Tue, Aug 23, 2005 at 05:29:01PM -0400, Jignesh Shah wrote:
>Actually some of that readaheads,etc the OS does already if it does
>some sort of throttling/clubbing of reads/writes.


Note that I specified the fully cached case--even with the workload in
RAM the system still has to process a heck of a lot of read calls.

>* Introduce a multiblock or extent tunable variable where you can
>define a multiple of 8K (or BlockSize tuneable) to read a bigger chunk
>and store it in the bufferpool.. (Maybe writes too) (Most devices now
>support upto 1MB chunks for reads and writes)


Yeah. The problem with relying on OS readahead is that the OS doesn't
know whether you're doing a sequential scan or an index scan; if you
have the OS agressively readahead you'll kill your seek performance.
OTOH, if you don't do readaheads you'll kill your sequential scan
performance. At the app level you know which makes sense for each
operation.

Mike Stone

---------------------------(end of broadcast)---------------------------
TIP 4: Have you searched our list archives?

http://archives.postgresql.org

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #4 (permalink)  
Old 04-18-2008, 12:16 PM
Michael Stone
 
Posts: n/a
Default Re: Read/Write block sizes

On Tue, Aug 23, 2005 at 06:09:09PM -0400, Chris Browne wrote:
>What we have been finding, as RAID controllers get smarter, is that it
>is getting increasingly futile to try to attach knobs to 'disk stuff;'
>it is *way* more effective to add a few more spindles to an array than
>it is to fiddle with which disks are to be allocated to what database
>'objects.'


That statement doesn't say anything about trying to maximize performance
to or from a disk array. Yes, controllers are getting smarter--but they
aren't omnicient. IME an I/O bound sequential table scan doesn't get
data moving off the disk nearly as fast as say, a dd with a big ibs.
Why? There's obviously a lot of factors at work, but one of those
factors is that the raid controller can optimize "grab this meg" a lot
more than it can optimize "grab this 8k".

Mike Stone

---------------------------(end of broadcast)---------------------------
TIP 3: Have you checked our extensive FAQ?

http://www.postgresql.org/docs/faq

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #5 (permalink)  
Old 04-18-2008, 12:16 PM
Jim C. Nasby
 
Posts: n/a
Default Re: Read/Write block sizes

On Tue, Aug 23, 2005 at 06:09:09PM -0400, Chris Browne wrote:
> J.K.Shah@Sun.COM (Jignesh Shah) writes:
> >> Does that include increasing the size of read/write blocks? I've
> >> noticedthat with a large enough table it takes a while to do a
> >> sequential scan, even if it's cached; I wonder if the fact that it
> >> takes a million read(2) calls to get through an 8G table is part of
> >> that.

> >
> > Actually some of that readaheads,etc the OS does already if it does
> > some sort of throttling/clubbing of reads/writes. But its not enough
> > for such types of workloads.
> >
> > Here is what I think will help:
> >
> > * Support for different Blocksize TABLESPACE without recompiling the
> > code.. (Atlease support for a different Blocksize for the whole
> > database without recompiling the code)
> >
> > * Support for bigger sizes of WAL files instead of 16MB files
> > WITHOUT recompiling the code.. Should be a tuneable if you ask me
> > (with checkpoint_segments at 256.. you have too many 16MB files in
> > the log directory) (This will help OLTP benchmarks more since now
> > they don't spend time rotating log files)
> >
> > * Introduce a multiblock or extent tunable variable where you can
> > define a multiple of 8K (or BlockSize tuneable) to read a bigger
> > chunk and store it in the bufferpool.. (Maybe writes too) (Most
> > devices now support upto 1MB chunks for reads and writes)
> >
> > *There should be a way to preallocate files for TABLES in
> > TABLESPACES otherwise with multiple table writes in the same
> > filesystem ends with fragmented files which causes poor "READS" from
> > the files.
> >
> > * With 64bit 1GB file chunks is also moot.. Maybe it should be
> > tuneable too like 100GB without recompiling the code.
> >
> > Why recompiling is bad? Most companies that will support Postgres
> > will support their own binaries and they won't prefer different
> > versions of binaries for different blocksizes, different WAL file
> > sizes, etc... and hence more function using the same set of binaries
> > is more desirable in enterprise environments

>
> Every single one of these still begs the question of whether the
> changes will have a *material* impact on performance.


How many of these things are currently easy to change with a recompile?
I should be able to start testing some of these ideas in the near
future, if they only require minor code or configure changes.
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com 512-569-9461

---------------------------(end of broadcast)---------------------------
TIP 4: Have you searched our list archives?

http://archives.postgresql.org

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #6 (permalink)  
Old 04-18-2008, 12:16 PM
Steve Poe
 
Posts: n/a
Default Re: Read/Write block sizes

Chris,

Unless I am wrong, you're making the assumpting the amount of time spent
and ROI is known. Maybe those who've been down this path know how to get
that additional 2-4% in 30 minutes or less?

While each person and business' performance gains (or not) could vary,
someone spending the 50-100h to gain 2-4% over a course of a month for a
24x7 operation would seem worth the investment?

I would assume that dbt2 with STP helps minimize the amount of hours
someone has to invest to determine performance gains with configurable
options?

Steve Poe

> If someone spends 100h working on one of these items, and gets a 2%
> performance improvement, that's almost certain to be less desirable
> than spending 50h on something else that gets a 4% improvement.
>
> And we might discover that memory management improvements in Linux
> 2.6.16 or FreeBSD 5.5 allow some OS kernels to provide some such
> improvements "for free" behind our backs without *any* need to write
> database code. :-)



---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #7 (permalink)  
Old 04-18-2008, 12:16 PM
Jeffrey W. Baker
 
Posts: n/a
Default Re: Read/Write block sizes (Was: Caching by Postgres)

On Tue, 2005-08-23 at 19:12 -0400, Michael Stone wrote:
> On Tue, Aug 23, 2005 at 05:29:01PM -0400, Jignesh Shah wrote:
> >Actually some of that readaheads,etc the OS does already if it does
> >some sort of throttling/clubbing of reads/writes.

>
> Note that I specified the fully cached case--even with the workload in
> RAM the system still has to process a heck of a lot of read calls.
>
> >* Introduce a multiblock or extent tunable variable where you can
> >define a multiple of 8K (or BlockSize tuneable) to read a bigger chunk
> >and store it in the bufferpool.. (Maybe writes too) (Most devices now
> >support upto 1MB chunks for reads and writes)

>
> Yeah. The problem with relying on OS readahead is that the OS doesn't
> know whether you're doing a sequential scan or an index scan; if you
> have the OS agressively readahead you'll kill your seek performance.
> OTOH, if you don't do readaheads you'll kill your sequential scan
> performance. At the app level you know which makes sense for each
> operation.


This is why we have MADVISE_RANDOM and MADVISE_SEQUENTIAL.

-jwb

---------------------------(end of broadcast)---------------------------
TIP 5: don't forget to increase your free space map settings

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #8 (permalink)  
Old 04-18-2008, 12:16 PM
Jignesh K. Shah
 
Posts: n/a
Default Re: Read/Write block sizes

Hi Jim,

| How many of these things are currently easy to change with a recompile?
| I should be able to start testing some of these ideas in the near
| future, if they only require minor code or configure changes.


The following
* Data File Size 1GB
* WAL File Size of 16MB
* Block Size of 8K

Are very easy to change with a recompile.. A Tunable will be greatly
prefered as it will allow one binary for different tunings

* MultiBlock read/write

Is not available but will greatly help in reducing the number of system
calls which will only increase as the size of the database increases if
something is not done about i.

* Pregrown files... maybe not important at this point since TABLESPACE
can currently work around it a bit (Just need to create a different file
system for each tablespace

But if you really think hardware & OS is the answer for all small
things...... I think we should now start to look on how to make Postgres
Multi-threaded or multi-processed for each connection. With the influx
of "Dual-Core" or "Multi-Core" being the fad.... Postgres can have the
cutting edge if somehow exploiting cores is designed.

Somebody mentioned that adding CPU to Postgres workload halved the
average CPU usage...
YEAH... PostgreSQL uses only 1 CPU per connection (assuming 100%
usage) so if you add another CPU it is idle anyway and the system will
report only 50% :-) BUT the importing to measure is.. whether the query
time was cut down or not? ( No flames I am sure you were talking about
multi-connection multi-user environment :-) ) But my point is then this
approach is worth the ROI and the time and effort spent to solve this
problem.

I actually vote for a multi-threaded solution for each connection while
still maintaining seperate process for each connections... This way the
fundamental architecture of Postgres doesn't change, however a
multi-threaded connection can then start to exploit different cores..
(Maybe have tunables for number of threads to read data files who
knows.. If somebody is interested in actually working a design ..
contact me and I will be interested in assisting this work.

Regards,
Jignesh


Jim C. Nasby wrote:

>On Tue, Aug 23, 2005 at 06:09:09PM -0400, Chris Browne wrote:
>
>
>>J.K.Shah@Sun.COM (Jignesh Shah) writes:
>>
>>
>>>>Does that include increasing the size of read/write blocks? I've
>>>>noticedthat with a large enough table it takes a while to do a
>>>>sequential scan, even if it's cached; I wonder if the fact that it
>>>>takes a million read(2) calls to get through an 8G table is part of
>>>>that.
>>>>
>>>>
>>>Actually some of that readaheads,etc the OS does already if it does
>>>some sort of throttling/clubbing of reads/writes. But its not enough
>>>for such types of workloads.
>>>
>>>Here is what I think will help:
>>>
>>>* Support for different Blocksize TABLESPACE without recompiling the
>>>code.. (Atlease support for a different Blocksize for the whole
>>>database without recompiling the code)
>>>
>>>* Support for bigger sizes of WAL files instead of 16MB files
>>>WITHOUT recompiling the code.. Should be a tuneable if you ask me
>>>(with checkpoint_segments at 256.. you have too many 16MB files in
>>>the log directory) (This will help OLTP benchmarks more since now
>>>they don't spend time rotating log files)
>>>
>>>* Introduce a multiblock or extent tunable variable where you can
>>>define a multiple of 8K (or BlockSize tuneable) to read a bigger
>>>chunk and store it in the bufferpool.. (Maybe writes too) (Most
>>>devices now support upto 1MB chunks for reads and writes)
>>>
>>>*There should be a way to preallocate files for TABLES in
>>>TABLESPACES otherwise with multiple table writes in the same
>>>filesystem ends with fragmented files which causes poor "READS" from
>>>the files.
>>>
>>>* With 64bit 1GB file chunks is also moot.. Maybe it should be
>>>tuneable too like 100GB without recompiling the code.
>>>
>>>Why recompiling is bad? Most companies that will support Postgres
>>>will support their own binaries and they won't prefer different
>>>versions of binaries for different blocksizes, different WAL file
>>>sizes, etc... and hence more function using the same set of binaries
>>>is more desirable in enterprise environments
>>>
>>>

>>Every single one of these still begs the question of whether the
>>changes will have a *material* impact on performance.
>>
>>



---------------------------(end of broadcast)---------------------------
TIP 3: Have you checked our extensive FAQ?

http://www.postgresql.org/docs/faq

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #9 (permalink)  
Old 04-18-2008, 12:16 PM
Josh Berkus
 
Posts: n/a
Default Re: Read/Write block sizes

Steve,

> I would assume that dbt2 with STP helps minimize the amount of hours
> someone has to invest to determine performance gains with configurable
> options?


Actually, these I/O operation issues show up mainly with DW workloads, so the
STP isn't much use there. If I can ever get some of these machines back
from the build people, I'd like to start testing some stuff.

One issue with testing this is that currently PostgreSQL doesn't support block
sizes above 128K. We've already done testing on that (well, Mark has) and
the performance gains aren't even worth the hassle of remembering you're on a
different block size (like, +4%).

What the Sun people have done with other DB systems is show that substantial
performance gains are possible on large databases (>100G) using block sizes
of 1MB. I believe that's possible (and that it probably makes more of a
difference on Solaris than on BSD) but we can't test it without some hackery
first.

--
Josh Berkus
Aglio Database Solutions
San Francisco

---------------------------(end of broadcast)---------------------------
TIP 2: Don't 'kill -9' the postmaster

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #10 (permalink)  
Old 04-18-2008, 12:16 PM
Alan Stange
 
Posts: n/a
Default Re: Read/Write block sizes

Josh Berkus wrote:

>Steve,
>
>
>
>>I would assume that dbt2 with STP helps minimize the amount of hours
>>someone has to invest to determine performance gains with configurable
>>options?
>>
>>

>
>Actually, these I/O operation issues show up mainly with DW workloads, so the
>STP isn't much use there. If I can ever get some of these machines back
>from the build people, I'd like to start testing some stuff.
>
>One issue with testing this is that currently PostgreSQL doesn't support block
>sizes above 128K. We've already done testing on that (well, Mark has) and
>the performance gains aren't even worth the hassle of remembering you're on a
>different block size (like, +4%).
>
>

What size database was this on?

>What the Sun people have done with other DB systems is show that substantial
>performance gains are possible on large databases (>100G) using block sizes
>of 1MB. I believe that's possible (and that it probably makes more of a
>difference on Solaris than on BSD) but we can't test it without some hackery
>first.
>

We're running on a 100+GB database, with long streams of 8KB reads with
the occasional _llseek(). I've been thinking about running with a
larger blocksize with the expectation that we'd see fewer system calls
and a bit more throughput.

read() calls are a very expensive way to get 8KB of memory (that we know
is already resident) during scans. One has to trap into the kernel, do
the usual process state accounting, find the block, copy the memory to
userspace, return back from the kernel to user space reversing all the
process accounting, pick out the bytes one needs, and repeat all over
again. That's quite a few sacrificial cache lines for 8KB. Yeah,
sure, Linux syscalls are fast, but they aren't that fast, and other
operating systems (windows and solaris) have a bit more overhead on
syscalls.

Regarding large blocks sizes on Solaris: the Solaris folks can also use
large memory pages and avoid a lot of the TLB overhead from the VM
system. The various trapstat and cpustat commands can be quite
interesting to look at when running any large application on a Solaris
system.

It should be noted that having a large shared memory segment can be a
performance looser just from the standpoint of TLB thrashing. O(GB)
memory access patterns can take a huge performance hit in user space
with 4K pages compared to the kernel which would be mapping the "segmap"
(in Solaris parlance) with 4MB pages.

Anyway, I guess my point is that the balance between kernel managed vs.
postgresql managed buffer isn't obvious at all.

-- Alan

---------------------------(end of broadcast)---------------------------
TIP 2: Don't 'kill -9' the postmaster

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
Reply


Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are On
Forum Jump


All times are GMT. The time now is 09:34 PM.


Powered by vBulletin® Version 3.6.5
Copyright ©2000 - 2008, Jelsoft Enterprises Ltd.
SEO by vBSEO 3.2.0
www.UnixAdminTalk.com