This is a discussion on Re: Read/Write block sizes (Was: Caching by Postgres) within the Pgsql Performance forums, part of the PostgreSQL category; --> > Does that include increasing the size of read/write blocks? I've > noticedthat with a large enough table it ...
| |||||||
| FAQ | Members List | Calendar | Search | Today's Posts | Mark Forums Read |
| ||||
| > Does that include increasing the size of read/write blocks? I've > noticedthat with a large enough table it takes a while to do a > sequential scan, > even if it's cached; I wonder if the fact that it takes a million > read(2) calls to get through an 8G table is part of that. > Actually some of that readaheads,etc the OS does already if it does some sort of throttling/clubbing of reads/writes. But its not enough for such types of workloads. Here is what I think will help: * Support for different Blocksize TABLESPACE without recompiling the code.. (Atlease support for a different Blocksize for the whole database without recompiling the code) * Support for bigger sizes of WAL files instead of 16MB files WITHOUT recompiling the code.. Should be a tuneable if you ask me (with checkpoint_segments at 256.. you have too many 16MB files in the log directory) (This will help OLTP benchmarks more since now they don't spend time rotating log files) * Introduce a multiblock or extent tunable variable where you can define a multiple of 8K (or BlockSize tuneable) to read a bigger chunk and store it in the bufferpool.. (Maybe writes too) (Most devices now support upto 1MB chunks for reads and writes) *There should be a way to preallocate files for TABLES in TABLESPACES otherwise with multiple table writes in the same filesystem ends with fragmented files which causes poor "READS" from the files. * With 64bit 1GB file chunks is also moot.. Maybe it should be tuneable too like 100GB without recompiling the code. Why recompiling is bad? Most companies that will support Postgres will support their own binaries and they won't prefer different versions of binaries for different blocksizes, different WAL file sizes, etc... and hence more function using the same set of binaries is more desirable in enterprise environments Regards, Jignesh ---------------------------(end of broadcast)--------------------------- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match |
| |||
| J.K.Shah@Sun.COM (Jignesh Shah) writes: >> Does that include increasing the size of read/write blocks? I've >> noticedthat with a large enough table it takes a while to do a >> sequential scan, even if it's cached; I wonder if the fact that it >> takes a million read(2) calls to get through an 8G table is part of >> that. > > Actually some of that readaheads,etc the OS does already if it does > some sort of throttling/clubbing of reads/writes. But its not enough > for such types of workloads. > > Here is what I think will help: > > * Support for different Blocksize TABLESPACE without recompiling the > code.. (Atlease support for a different Blocksize for the whole > database without recompiling the code) > > * Support for bigger sizes of WAL files instead of 16MB files > WITHOUT recompiling the code.. Should be a tuneable if you ask me > (with checkpoint_segments at 256.. you have too many 16MB files in > the log directory) (This will help OLTP benchmarks more since now > they don't spend time rotating log files) > > * Introduce a multiblock or extent tunable variable where you can > define a multiple of 8K (or BlockSize tuneable) to read a bigger > chunk and store it in the bufferpool.. (Maybe writes too) (Most > devices now support upto 1MB chunks for reads and writes) > > *There should be a way to preallocate files for TABLES in > TABLESPACES otherwise with multiple table writes in the same > filesystem ends with fragmented files which causes poor "READS" from > the files. > > * With 64bit 1GB file chunks is also moot.. Maybe it should be > tuneable too like 100GB without recompiling the code. > > Why recompiling is bad? Most companies that will support Postgres > will support their own binaries and they won't prefer different > versions of binaries for different blocksizes, different WAL file > sizes, etc... and hence more function using the same set of binaries > is more desirable in enterprise environments Every single one of these still begs the question of whether the changes will have a *material* impact on performance. What we have been finding, as RAID controllers get smarter, is that it is getting increasingly futile to try to attach knobs to 'disk stuff;' it is *way* more effective to add a few more spindles to an array than it is to fiddle with which disks are to be allocated to what database 'objects.' The above suggested 'knobs' are all going to add to complexity and it is NOT evident that any of them will forcibly help. I could be wrong; code contributions combined with Actual Benchmarking would be the actual proof of the merits of the ideas. But it also suggests another question, namely... Will these represent more worthwhile improvements to speed than working on other optimizations that are on the TODO list? If someone spends 100h working on one of these items, and gets a 2% performance improvement, that's almost certain to be less desirable than spending 50h on something else that gets a 4% improvement. And we might discover that memory management improvements in Linux 2.6.16 or FreeBSD 5.5 allow some OS kernels to provide some such improvements "for free" behind our backs without *any* need to write database code. :-) -- let name="cbbrowne" and tld="ntlug.org" in name ^ "@" ^ tld;; http://www3.sympatico.ca/cbbrowne/postgresql.html Wiener's Law of Libraries: There are no answers, only cross references. |
| |||
| On Tue, Aug 23, 2005 at 05:29:01PM -0400, Jignesh Shah wrote: >Actually some of that readaheads,etc the OS does already if it does >some sort of throttling/clubbing of reads/writes. Note that I specified the fully cached case--even with the workload in RAM the system still has to process a heck of a lot of read calls. >* Introduce a multiblock or extent tunable variable where you can >define a multiple of 8K (or BlockSize tuneable) to read a bigger chunk >and store it in the bufferpool.. (Maybe writes too) (Most devices now >support upto 1MB chunks for reads and writes) Yeah. The problem with relying on OS readahead is that the OS doesn't know whether you're doing a sequential scan or an index scan; if you have the OS agressively readahead you'll kill your seek performance. OTOH, if you don't do readaheads you'll kill your sequential scan performance. At the app level you know which makes sense for each operation. Mike Stone ---------------------------(end of broadcast)--------------------------- TIP 4: Have you searched our list archives? http://archives.postgresql.org |
| |||
| On Tue, Aug 23, 2005 at 06:09:09PM -0400, Chris Browne wrote: >What we have been finding, as RAID controllers get smarter, is that it >is getting increasingly futile to try to attach knobs to 'disk stuff;' >it is *way* more effective to add a few more spindles to an array than >it is to fiddle with which disks are to be allocated to what database >'objects.' That statement doesn't say anything about trying to maximize performance to or from a disk array. Yes, controllers are getting smarter--but they aren't omnicient. IME an I/O bound sequential table scan doesn't get data moving off the disk nearly as fast as say, a dd with a big ibs. Why? There's obviously a lot of factors at work, but one of those factors is that the raid controller can optimize "grab this meg" a lot more than it can optimize "grab this 8k". Mike Stone ---------------------------(end of broadcast)--------------------------- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq |
| |||
| On Tue, Aug 23, 2005 at 06:09:09PM -0400, Chris Browne wrote: > J.K.Shah@Sun.COM (Jignesh Shah) writes: > >> Does that include increasing the size of read/write blocks? I've > >> noticedthat with a large enough table it takes a while to do a > >> sequential scan, even if it's cached; I wonder if the fact that it > >> takes a million read(2) calls to get through an 8G table is part of > >> that. > > > > Actually some of that readaheads,etc the OS does already if it does > > some sort of throttling/clubbing of reads/writes. But its not enough > > for such types of workloads. > > > > Here is what I think will help: > > > > * Support for different Blocksize TABLESPACE without recompiling the > > code.. (Atlease support for a different Blocksize for the whole > > database without recompiling the code) > > > > * Support for bigger sizes of WAL files instead of 16MB files > > WITHOUT recompiling the code.. Should be a tuneable if you ask me > > (with checkpoint_segments at 256.. you have too many 16MB files in > > the log directory) (This will help OLTP benchmarks more since now > > they don't spend time rotating log files) > > > > * Introduce a multiblock or extent tunable variable where you can > > define a multiple of 8K (or BlockSize tuneable) to read a bigger > > chunk and store it in the bufferpool.. (Maybe writes too) (Most > > devices now support upto 1MB chunks for reads and writes) > > > > *There should be a way to preallocate files for TABLES in > > TABLESPACES otherwise with multiple table writes in the same > > filesystem ends with fragmented files which causes poor "READS" from > > the files. > > > > * With 64bit 1GB file chunks is also moot.. Maybe it should be > > tuneable too like 100GB without recompiling the code. > > > > Why recompiling is bad? Most companies that will support Postgres > > will support their own binaries and they won't prefer different > > versions of binaries for different blocksizes, different WAL file > > sizes, etc... and hence more function using the same set of binaries > > is more desirable in enterprise environments > > Every single one of these still begs the question of whether the > changes will have a *material* impact on performance. How many of these things are currently easy to change with a recompile? I should be able to start testing some of these ideas in the near future, if they only require minor code or configure changes. -- Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com Pervasive Software http://pervasive.com 512-569-9461 ---------------------------(end of broadcast)--------------------------- TIP 4: Have you searched our list archives? http://archives.postgresql.org |
| |||
| Chris, Unless I am wrong, you're making the assumpting the amount of time spent and ROI is known. Maybe those who've been down this path know how to get that additional 2-4% in 30 minutes or less? While each person and business' performance gains (or not) could vary, someone spending the 50-100h to gain 2-4% over a course of a month for a 24x7 operation would seem worth the investment? I would assume that dbt2 with STP helps minimize the amount of hours someone has to invest to determine performance gains with configurable options? Steve Poe > If someone spends 100h working on one of these items, and gets a 2% > performance improvement, that's almost certain to be less desirable > than spending 50h on something else that gets a 4% improvement. > > And we might discover that memory management improvements in Linux > 2.6.16 or FreeBSD 5.5 allow some OS kernels to provide some such > improvements "for free" behind our backs without *any* need to write > database code. :-) ---------------------------(end of broadcast)--------------------------- TIP 6: explain analyze is your friend |
| |||
| On Tue, 2005-08-23 at 19:12 -0400, Michael Stone wrote: > On Tue, Aug 23, 2005 at 05:29:01PM -0400, Jignesh Shah wrote: > >Actually some of that readaheads,etc the OS does already if it does > >some sort of throttling/clubbing of reads/writes. > > Note that I specified the fully cached case--even with the workload in > RAM the system still has to process a heck of a lot of read calls. > > >* Introduce a multiblock or extent tunable variable where you can > >define a multiple of 8K (or BlockSize tuneable) to read a bigger chunk > >and store it in the bufferpool.. (Maybe writes too) (Most devices now > >support upto 1MB chunks for reads and writes) > > Yeah. The problem with relying on OS readahead is that the OS doesn't > know whether you're doing a sequential scan or an index scan; if you > have the OS agressively readahead you'll kill your seek performance. > OTOH, if you don't do readaheads you'll kill your sequential scan > performance. At the app level you know which makes sense for each > operation. This is why we have MADVISE_RANDOM and MADVISE_SEQUENTIAL. -jwb ---------------------------(end of broadcast)--------------------------- TIP 5: don't forget to increase your free space map settings |
| |||
| Hi Jim, | How many of these things are currently easy to change with a recompile? | I should be able to start testing some of these ideas in the near | future, if they only require minor code or configure changes. The following * Data File Size 1GB * WAL File Size of 16MB * Block Size of 8K Are very easy to change with a recompile.. A Tunable will be greatly prefered as it will allow one binary for different tunings * MultiBlock read/write Is not available but will greatly help in reducing the number of system calls which will only increase as the size of the database increases if something is not done about i. * Pregrown files... maybe not important at this point since TABLESPACE can currently work around it a bit (Just need to create a different file system for each tablespace But if you really think hardware & OS is the answer for all small things...... I think we should now start to look on how to make Postgres Multi-threaded or multi-processed for each connection. With the influx of "Dual-Core" or "Multi-Core" being the fad.... Postgres can have the cutting edge if somehow exploiting cores is designed. Somebody mentioned that adding CPU to Postgres workload halved the average CPU usage... YEAH... PostgreSQL uses only 1 CPU per connection (assuming 100% usage) so if you add another CPU it is idle anyway and the system will report only 50% :-) BUT the importing to measure is.. whether the query time was cut down or not? ( No flames I am sure you were talking about multi-connection multi-user environment :-) ) But my point is then this approach is worth the ROI and the time and effort spent to solve this problem. I actually vote for a multi-threaded solution for each connection while still maintaining seperate process for each connections... This way the fundamental architecture of Postgres doesn't change, however a multi-threaded connection can then start to exploit different cores.. (Maybe have tunables for number of threads to read data files who knows.. If somebody is interested in actually working a design .. contact me and I will be interested in assisting this work. Regards, Jignesh Jim C. Nasby wrote: >On Tue, Aug 23, 2005 at 06:09:09PM -0400, Chris Browne wrote: > > >>J.K.Shah@Sun.COM (Jignesh Shah) writes: >> >> >>>>Does that include increasing the size of read/write blocks? I've >>>>noticedthat with a large enough table it takes a while to do a >>>>sequential scan, even if it's cached; I wonder if the fact that it >>>>takes a million read(2) calls to get through an 8G table is part of >>>>that. >>>> >>>> >>>Actually some of that readaheads,etc the OS does already if it does >>>some sort of throttling/clubbing of reads/writes. But its not enough >>>for such types of workloads. >>> >>>Here is what I think will help: >>> >>>* Support for different Blocksize TABLESPACE without recompiling the >>>code.. (Atlease support for a different Blocksize for the whole >>>database without recompiling the code) >>> >>>* Support for bigger sizes of WAL files instead of 16MB files >>>WITHOUT recompiling the code.. Should be a tuneable if you ask me >>>(with checkpoint_segments at 256.. you have too many 16MB files in >>>the log directory) (This will help OLTP benchmarks more since now >>>they don't spend time rotating log files) >>> >>>* Introduce a multiblock or extent tunable variable where you can >>>define a multiple of 8K (or BlockSize tuneable) to read a bigger >>>chunk and store it in the bufferpool.. (Maybe writes too) (Most >>>devices now support upto 1MB chunks for reads and writes) >>> >>>*There should be a way to preallocate files for TABLES in >>>TABLESPACES otherwise with multiple table writes in the same >>>filesystem ends with fragmented files which causes poor "READS" from >>>the files. >>> >>>* With 64bit 1GB file chunks is also moot.. Maybe it should be >>>tuneable too like 100GB without recompiling the code. >>> >>>Why recompiling is bad? Most companies that will support Postgres >>>will support their own binaries and they won't prefer different >>>versions of binaries for different blocksizes, different WAL file >>>sizes, etc... and hence more function using the same set of binaries >>>is more desirable in enterprise environments >>> >>> >>Every single one of these still begs the question of whether the >>changes will have a *material* impact on performance. >> >> ---------------------------(end of broadcast)--------------------------- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq |
| |||
| Steve, > I would assume that dbt2 with STP helps minimize the amount of hours > someone has to invest to determine performance gains with configurable > options? Actually, these I/O operation issues show up mainly with DW workloads, so the STP isn't much use there. If I can ever get some of these machines back from the build people, I'd like to start testing some stuff. One issue with testing this is that currently PostgreSQL doesn't support block sizes above 128K. We've already done testing on that (well, Mark has) and the performance gains aren't even worth the hassle of remembering you're on a different block size (like, +4%). What the Sun people have done with other DB systems is show that substantial performance gains are possible on large databases (>100G) using block sizes of 1MB. I believe that's possible (and that it probably makes more of a difference on Solaris than on BSD) but we can't test it without some hackery first. -- Josh Berkus Aglio Database Solutions San Francisco ---------------------------(end of broadcast)--------------------------- TIP 2: Don't 'kill -9' the postmaster |
| ||||
| Josh Berkus wrote: >Steve, > > > >>I would assume that dbt2 with STP helps minimize the amount of hours >>someone has to invest to determine performance gains with configurable >>options? >> >> > >Actually, these I/O operation issues show up mainly with DW workloads, so the >STP isn't much use there. If I can ever get some of these machines back >from the build people, I'd like to start testing some stuff. > >One issue with testing this is that currently PostgreSQL doesn't support block >sizes above 128K. We've already done testing on that (well, Mark has) and >the performance gains aren't even worth the hassle of remembering you're on a >different block size (like, +4%). > > What size database was this on? >What the Sun people have done with other DB systems is show that substantial >performance gains are possible on large databases (>100G) using block sizes >of 1MB. I believe that's possible (and that it probably makes more of a >difference on Solaris than on BSD) but we can't test it without some hackery >first. > We're running on a 100+GB database, with long streams of 8KB reads with the occasional _llseek(). I've been thinking about running with a larger blocksize with the expectation that we'd see fewer system calls and a bit more throughput. read() calls are a very expensive way to get 8KB of memory (that we know is already resident) during scans. One has to trap into the kernel, do the usual process state accounting, find the block, copy the memory to userspace, return back from the kernel to user space reversing all the process accounting, pick out the bytes one needs, and repeat all over again. That's quite a few sacrificial cache lines for 8KB. Yeah, sure, Linux syscalls are fast, but they aren't that fast, and other operating systems (windows and solaris) have a bit more overhead on syscalls. Regarding large blocks sizes on Solaris: the Solaris folks can also use large memory pages and avoid a lot of the TLB overhead from the VM system. The various trapstat and cpustat commands can be quite interesting to look at when running any large application on a Solaris system. It should be noted that having a large shared memory segment can be a performance looser just from the standpoint of TLB thrashing. O(GB) memory access patterns can take a huge performance hit in user space with 4K pages compared to the kernel which would be mapping the "segmap" (in Solaris parlance) with 4MB pages. Anyway, I guess my point is that the balance between kernel managed vs. postgresql managed buffer isn't obvious at all. -- Alan ---------------------------(end of broadcast)--------------------------- TIP 2: Don't 'kill -9' the postmaster |