This is a discussion on Re: How to improve db performance with $7K? within the Pgsql Performance forums, part of the PostgreSQL category.
I've been doing some reading up on this, trying to keep up here, and have found out that (experts, just yawn and cover your ears)

1) some SATA drives (just type II, I think?) have a "Phase Zero" implementation of Tagged Command Queueing (the special sauce for SCSI).

2) This SATA "TCQ" is called NCQ and I believe it basically allows the disk software itself to do the reordering (this is called "simple" in TCQ terminology). It does not yet allow the TCQ "head of queue" command, allowing the current tagged request to go to head of queue, which is a simple way of manifesting a "high priority" request.

3) SATA drives are not yet multi-initiator?

Largely b/c of 2 and 3, multi-initiator SCSI RAID'ed drives are likely to whomp SATA II drives for a while yet (read: a year or two) in multiuser PostGres applications.

-----Original Message-----
From: pgsql-performance-owner@postgresql.org [mailto
Sent: Thursday, April 14, 2005 2:04 PM
To: Kevin Brown
Cc: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] How to improve db performance with $7K?

Kevin Brown <kevin@sysexperts.com> writes:

> Greg Stark wrote:
>
> > I think you're being misled by analyzing the write case.
> >
> > Consider the read case. When a user process requests a block and
> > that read makes its way down to the driver level, the driver can't
> > just put it aside and wait until it's convenient. It has to go ahead
> > and issue the read right away.
>
> Well, strictly speaking it doesn't *have* to. It could delay for a
> couple of milliseconds to see if other requests come in, and then
> issue the read if none do. If there are already other requests being
> fulfilled, then it'll schedule the request in question just like the
> rest.

But then the cure is worse than the disease. You're basically describing exactly what does happen anyways, only you're delaying more requests than necessary. That intervening time isn't really idle, it's filled with all the requests that were delayed during the previous large seek...

> Once the first request has been fulfilled, the driver can now schedule
> the rest of the queued-up requests in disk-layout order.
>
> I really don't see how this is any different between a system that has
> tagged queueing to the disks and one that doesn't. The only
> difference is where the queueing happens.

And *when* it happens. Instead of being able to issue requests while a large seek is happening and having some of them satisfied they have to wait until that seek is finished and get acted on during the next large seek.

If my theory is correct then I would expect bandwidth to be essentially equivalent but the latency on SATA drives to be increased by about 50% of the average seek time. Ie, while a busy SCSI drive can satisfy most requests in about 10ms a busy SATA drive would satisfy most requests in 15ms. (add to that that 10k RPM and 15kRPM SCSI drives have even lower seek times and no such IDE/SATA drives exist...)

In reality higher latency feeds into a system feedback loop causing your application to run slower causing bandwidth demands to be lower as well. It's often hard to distinguish root causes from symptoms when optimizing complex systems.

--
greg

---------------------------(end of broadcast)---------------------------
TIP 7: don't forget to increase your free space map settings
| |||
If SATA drives don't have the ability to replace SCSI for multi-user Postgres apps, but you needed to save on cost (ALWAYS an issue), could/would you implement SATA for your logs (pg_xlog) and keep the rest on SCSI?

Steve Poe

Mohan, Ross wrote:

>Largely b/c of 2 and 3, multi-initiator SCSI RAID'ed drives
>are likely to whomp SATA II drives for a while yet (read: a
>year or two) in multiuser PostGres applications.
| |||
Steve Poe wrote:

> If SATA drives don't have the ability to replace SCSI for a multi-user

I don't think it is a matter of not having the ability. SATA all in all is fine as long as it is battery backed. It isn't as high performing as SCSI but who says it has to be?

There are plenty of companies running databases on SATA without issue. Would I put it on a database that is expecting to have 500 connections at all times? No. Then again, if you have an application with that requirement, you have the money to buy a big fat SCSI array.

Sincerely,

Joshua D. Drake

> Postgres apps, but you needed to save on cost (ALWAYS an issue),
> could/would you implement SATA for your logs (pg_xlog) and keep the
> rest on SCSI?
>
> Steve Poe
| |||
Problem with this strategy. You want battery-backed write caching for best performance & safety. (I've tried IDE for WAL before w/ write caching off -- the DB got crippled whenever I had to copy files from/to the drive on the WAL partition -- ended up just moving WAL back onto the same SCSI drive as the main DB.) That means in addition to a $$$ SCSI caching controller, you also need a $$$ SATA caching controller. From my glance at prices, advanced SATA controllers seem to cost nearly as much as their SCSI counterparts.

This also looks to be the case for the drives themselves. Sure you can get super cheap 7200RPM SATA drives but they absolutely suck for database work. Believe me, I gave it a try once -- ugh. The highend WD 10K Raptors look pretty good though -- the benchmarks @ storagereview seem to put these drives at about 90% of SCSI 10Ks for both single-user and multi-user. However, they're also priced like SCSIs -- here's what I found @ Mwave (going through pricewatch to find WD740GDs):

Seagate 7200 SATA -- 80GB  $59
WD 10K SATA       -- 72GB  $182
Seagate 10K U320  -- 72GB  $289

Using the above prices for a fixed budget for RAID-10, you could get:

SATA 7200 -- 680GB per $1000
SATA 10K  -- 200GB per $1000
SCSI 10K  -- 125GB per $1000

For a 99% read-only DB that required lots of disk space (say something like Wikipedia or a blog host), using consumer level SATA probably is ok. For anything else, I'd consider SATA 10K if (1) I do not need 15K RPM and (2) I don't have SCSI infrastructure already.

Steve Poe wrote:

> If SATA drives don't have the ability to replace SCSI for a multi-user
> Postgres apps, but you needed to save on cost (ALWAYS an issue),
> could/would you implement SATA for your logs (pg_xlog) and keep the rest
> on SCSI?
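The per-$1000 figures above can be re-derived from the quoted drive prices. The following is a minimal sketch, assuming fractional drives are allowed, RAID-10 halves raw capacity, and controller and chassis costs are ignored; the function name is hypothetical. Note that the arithmetic comes out in gigabytes per $1000.

```python
# Capacities (GB) and prices (USD) as quoted in the post above.
DRIVES = {
    "SATA 7200": (80, 59),
    "SATA 10K": (72, 182),
    "SCSI 10K": (72, 289),
}


def raid10_gb_per_budget(capacity_gb, price_usd, budget_usd=1000):
    """Usable RAID-10 capacity in GB for a budget (hypothetical helper).

    Assumes fractional drives can be bought and that mirroring
    halves the raw capacity; controller cost is not included.
    """
    drives = budget_usd / price_usd
    return drives * capacity_gb / 2


for name, (capacity, price) in DRIVES.items():
    usable = raid10_gb_per_budget(capacity, price)
    spindles = 1000 / price
    print(f"{name}: {usable:.0f} GB per $1000 ({spindles:.1f} spindles)")
```

The spindle count printed alongside each figure is the number that matters for the random-I/O argument later in the thread: the same budget buys roughly 17 of the 7200 RPM drives but only about 3.5 of the SCSI drives.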
| |||
William Yu <wyu@talisys.com> writes:

> Using the above prices for a fixed budget for RAID-10, you could get:
>
> SATA 7200 -- 680GB per $1000
> SATA 10K  -- 200GB per $1000
> SCSI 10K  -- 125GB per $1000

What a lot of these analyses miss is that cheaper == faster because cheaper means you can buy more spindles for the same price. I'm assuming you picked equal sized drives to compare so that 200GB/$1000 for SATA is almost twice as many spindles as the 125GB/$1000. That means it would have almost double the bandwidth. And the 7200 RPM case would have more than 5x the bandwidth.

While 10k RPM drives have lower seek times, and SCSI drives have a natural seek time advantage, under load a RAID array with fewer spindles will start hitting contention sooner, which results in higher latency. If the controller works well, the larger SATA arrays above should be able to maintain their mediocre latency much better under load than the SCSI array with fewer drives would maintain its low latency response time, despite its drives' lower average seek time.

--
greg
| |||
This is fundamentally untrue.

A mirror is still a mirror. At most in a RAID 10 you can have two simultaneous seeks. You are always going to be limited by the seek time of your drives. It's a stripe, so you have to read from all members of the stripe to get data, requiring all drives to seek. There is no advantage to seek time in adding more drives.

By adding more drives you can increase throughput, but the max throughput of the PCI-X bus isn't that high (I think around 400MB/sec). You can easily get this with a six or seven drive RAID 5, or a ten drive RAID 10. At that point you start having to factor in the cost of a bigger chassis to hold more drives, which can be big bucks.

Alex Turner
netEconomist

On 18 Apr 2005 10:59:05 -0400, Greg Stark <gsstark@mit.edu> wrote:

> What a lot of these analyses miss is that cheaper == faster because cheaper
> means you can buy more spindles for the same price.
| |||
Alex Turner <armtuk@gmail.com> writes:

> This is fundamentaly untrue.
>
> A mirror is still a mirror. At most in a RAID 10 you can have two
> simultaneous seeks. You are always going to be limited by the seek
> time of your drives. It's a stripe, so you have to read from all
> members of the stripe to get data, requiring all drives to seek.
> There is no advantage to seek time in adding more drives.

Adding drives will not let you get lower response times than the average seek time on your drives*. But it will let you reach that response time more often.

The actual response time for a random access to a drive is the seek time plus the time waiting for your request to actually be handled. Under heavy load that could be many milliseconds. The more drives you have the fewer requests each drive has to handle.

Look at the await and svctm columns of iostat -x.

Under heavy random access load those columns can show up performance problems more accurately than the bandwidth columns. You could be doing less bandwidth but be having latency issues. While reorganizing data to allow for more sequential reads is the normal way to address that, simply adding more spindles can be surprisingly effective.

> By adding more drives you can increase throughput, but the max throughput of
> the PCI-X bus isn't that high (I think around 400MB/sec) You can easily get
> this with a six or seven drive RAID 5, or a ten drive RAID 10. At that point
> you start having to factor in the cost of a bigger chassis to hold more
> drives, which can be big bucks.

You could use software raid to spread the drives over multiple PCI-X cards. But if 400MB/s isn't enough bandwidth then you're probably in the realm of "enterprise-class" hardware anyways.

* (Actually even that's possible: you could limit yourself to a portion of the drive surface to reduce seek time)

--
greg
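The "seek time plus waiting time" argument above can be illustrated with a toy queueing model. This is only a sketch under stated assumptions: each drive is treated as an independent M/M/1 queue (a modelling choice made here, not something claimed in the thread), requests are spread evenly across drives, the 10 ms service time echoes the figure used earlier in the thread, and the 300 requests/s load is made up for illustration.

```python
def mean_response_ms(service_ms, total_req_per_s, n_drives):
    """Mean response time per request under an assumed M/M/1 model.

    service_ms      -- time a drive needs for one request (seek + read)
    total_req_per_s -- random requests per second hitting the whole array
    n_drives        -- drives sharing the load evenly (an assumption)
    """
    per_drive_rate = total_req_per_s / n_drives
    utilization = per_drive_rate * service_ms / 1000.0
    if utilization >= 1.0:
        # The drive cannot keep up, so the queue grows without bound.
        return float("inf")
    # M/M/1 result: response time = service time / (1 - utilization).
    return service_ms / (1.0 - utilization)


for n in (4, 8, 16):
    print(n, "drives:", round(mean_response_ms(10.0, 300, n), 1), "ms")
```

In this model the response time never drops below the 10 ms service time, but it gets much closer to it as drives are added, which is the distinction being drawn between lowering the floor and reaching it more often.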
| |||
[snip]
> Adding drives will not let you get lower response times than the average seek
> time on your drives*. But it will let you reach that response time more often.
[snip]

I believe your assertion is fundamentally flawed. Adding more drives will not let you reach that response time more often. All drives are required to fill every request in all RAID levels (except possibly 0+1, but that isn't used for enterprise applications). Most requests in OLTP require most of the request time to seek, not to read. Only in single large block data transfers will you get any benefit from adding more drives, which is atypical in most database applications. For most database applications, the only way to increase transactions/sec is to decrease request service time, which is generally achieved with better seek times or a better controller card, or possibly spreading your database across multiple tablespaces on separate partitions.

My assertion therefore is that simply adding more drives to an already competent* configuration is about as likely to increase your database effectiveness as swiss cheese is to make your car run faster.

Alex Turner
netEconomist

*Assertion here is that the DBA didn't simply configure all tables and xlog on a single 7200 RPM disk, but has separate physical drives for xlog and tablespace, at least on 10k drives.
| |||
Hi,

At 18:56 18/04/2005, Alex Turner wrote:
>All drives are required to fill every request in all RAID levels

No, this is definitely wrong. In many cases, most drives don't actually have the data requested, how could they handle the request?

When reading one random sector, only *one* drive out of N is ever used to service any given request, be it RAID 0, 1, 0+1, 1+0 or 5.

When writing:
- in RAID 0, 1 drive
- in RAID 1, RAID 0+1 or 1+0, 2 drives
- in RAID 5, you need to read on all drives and write on 2.

Otherwise, what would be the point of RAID 0, 0+1 or 1+0?

Jacques.
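The point about a single random read touching only one drive can be sketched for the RAID 0 case. This is a simplified illustration, not any particular controller's layout: the function names and the stripe unit of 16 blocks are hypothetical, and blocks are assumed to be laid out round-robin across drives one stripe unit at a time.

```python
def raid0_drive_for_block(block, n_drives, stripe_blocks=16):
    """Index of the one drive holding a logical block in a simple RAID 0.

    Assumes stripe units of `stripe_blocks` blocks placed round-robin
    across the drives (hypothetical layout for illustration).
    """
    stripe_unit = block // stripe_blocks
    return stripe_unit % n_drives


def drives_touched(start_block, n_blocks, n_drives, stripe_blocks=16):
    """Set of drives needed to read a run of consecutive logical blocks."""
    return {
        raid0_drive_for_block(b, n_drives, stripe_blocks)
        for b in range(start_block, start_block + n_blocks)
    }


# A single random block lives on exactly one drive.
print(drives_touched(1000, 1, n_drives=4))
# A large sequential read spans every drive in the stripe.
print(drives_touched(0, 64, n_drives=4))
```

Under this layout, small random reads issued concurrently tend to land on different drives, so they can be serviced in parallel, while only large sequential transfers involve every member of the stripe.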
| ||||
Alex Turner wrote:

>I believe your assertion is fundamentally flawed. Adding more drives
>will not let you reach that response time more often. All drives are
>required to fill every request in all RAID levels (except possibly
>0+1, but that isn't used for enterprise applications).
>[snip]
>My assertion therefore is that simply adding more drives to an already
>competent* configuration is about as likely to increase your database
>effectiveness as swiss cheese is to make your car run faster.

Consider the case of a mirrored file system with a mostly read() workload. Typical behavior is to use a round-robin method for issuing the read operations to each mirror in turn, but one can use other methods like a geometric algorithm that will issue the reads to the drive with the head located closest to the desired track. Some systems have many mirrors of the data for exactly this behavior. In fact, one can carry this logic to the extreme and have one drive for every cylinder in the mirror, thus removing seek latencies completely. In fact this extreme case would also remove the rotational latency, as the cylinder will be in the disk's read cache. :-) Of course, writing data would be a bit slow!

I'm not sure I understand your assertion that "all drives are required to fill every request in all RAID levels". After all, in mirrored reads only one mirror needs to read any given block of data, so I don't know what goal is achieved in making other mirrors read the same data.

My assertion (based on ample personal experience) is that one can *always* get improved performance by adding more drives. Just limit the drives to use the first few cylinders so that the average seek time is greatly reduced, and concatenate the drives together. One can then build the usual RAID device out of these concatenated metadevices. Yes, one is wasting lots of disk space, but that's life. If your goal is performance, then you need to put your money on the table. The system will be somewhat unreliable because of the device count, additional SCSI buses, etc., but that too is life in the high performance world.

--
Alan
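The "use only the first few cylinders" idea can be put into rough numbers. The sketch below computes the mean head travel, in cylinders, between two independent, uniformly random positions within the region actually used. The cylinder counts are hypothetical, and real seek time is not proportional to distance (settle time and rotational latency remain), so this only shows the direction of the effect.

```python
def mean_seek_distance(cylinders_used):
    """Mean |a - b| for two independent uniform positions in 0..n-1.

    Closed form (n^2 - 1) / (3n), which approaches n / 3 for large n.
    """
    n = cylinders_used
    return (n * n - 1) / (3 * n)


FULL_STROKE = 50_000             # hypothetical cylinder count for one drive
SHORT_STROKE = FULL_STROKE // 10  # use only the first 10% of the cylinders

full = mean_seek_distance(FULL_STROKE)
short = mean_seek_distance(SHORT_STROKE)
print(f"full drive: {full:.0f} cylinders on average")
print(f"first 10%:  {short:.0f} cylinders on average")
```

Restricting the working set to a tenth of the cylinders cuts the average head travel by about a factor of ten, at the cost of leaving most of each drive's capacity unused, which is the trade-off described above.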