vBulletin Search Engine Optimization
| |||||||
| Register | FAQ | Members List | Calendar | Search | Today's Posts | Mark Forums Read |
| ||||
| On Tue, 15 Apr 2008, ITAGAKI Takahiro wrote: > 2x Quad core Xeon, 16GB RAM, 4x HDD (RAID-0) What is the disk controller in this system? I'm specifically curious about what write cache was involved, so I can get a better feel for the hardware your results came from. I'm busy rebuilding my performance testing systems right now, once that's done I can review this on a few platforms. One thing that jumped out at me just reading the code is this happening inside BufferSync: buf_to_write = (BufAndTag *) palloc(NBuffers * sizeof(BufAndTag)); If shared_buffers(=NBuffers) is set to something big, this could give some memory churn. And I think it's a bad idea to allocate something this large at checkpoint time, because what happens if that fails? Really not the time you want to discover there's no RAM left. Since you're always going to need this much memory for the system to operate, and the current model has the system running a checkpoint >50% of the time, the only thing that makes sense to me is to allocate it at server start time once and be done with it. That should improve performance over the original patch as well. BufAndTag is a relatively small structure (5 ints). Let's call it 40 bytes; even that's only a 0.5% overhead relative to the shared buffer allocation. If we can speed checkpoints significantly with that much overhead it sounds like a good tradeoff to me. -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD -- Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-patches |
| |||
| Greg Smith <gsmith@gregsmith.com> wrote: > On Tue, 15 Apr 2008, ITAGAKI Takahiro wrote: > > > 2x Quad core Xeon, 16GB RAM, 4x HDD (RAID-0) > > What is the disk controller in this system? I'm specifically curious > about what write cache was involved, so I can get a better feel for the > hardware your results came from. I used HP ProLiant DL380 G5 with Smart Array P400 with 256MB cache (http://h10010.www1.hp.com/wwpc/us/en...5-1121516.html) and ext3fs on LVM of CentOS 5.1 (Linux version 2.6.18-53.el5). Dirty region of database was probably larger than disk controller's cache. > buf_to_write = (BufAndTag *) palloc(NBuffers * sizeof(BufAndTag)); > > If shared_buffers(=NBuffers) is set to something big, this could give some > memory churn. And I think it's a bad idea to allocate something this > large at checkpoint time, because what happens if that fails? Really not > the time you want to discover there's no RAM left. Hmm, but I think we need to copy buffer tags into bgwriter's local memory in order to avoid locking taga many times in the sorting. Is it better to allocate sorting buffers at the first time and keep and reuse it from then on? > BufAndTag is a relatively small structure (5 ints). Let's call it 40 > bytes; even that's only a 0.5% overhead relative to the shared buffer > allocation. If we can speed checkpoints significantly with that much > overhead it sounds like a good tradeoff to me. I thinks sizeof(BufAndTag) is 20 bytes because sizeof(int) is 4 on typical platforms (and if not, I should rewrite the patch to be always so). It is 0.25% of shared buffers; when shared_buffers is set to 10GB, it takes 25MB of process local memory. If we want to consume less memory for it, RelFileNode in BufferTag could be hashed and packed into an integer; The blockNum order is important for this purpose, but RelFileNode is not. It makes the overhead to 12 bytes per page (0.15%). Is it worth doing? Regards, --- ITAGAKI Takahiro NTT Open Source Software Center -- Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-patches |
| |||
| On Wed, 16 Apr 2008, ITAGAKI Takahiro wrote: > Dirty region of database was probably larger than disk controller's cache. Might be worthwhile to run with log_checkpoints on and collect some statistics there next time you're running these tests. It's a good habit to get other testers into regardless; it's nice to be able to say something like "during the 15 checkpoints encountered during this test, the largest dirty area was 516MB while the median was 175MB". > Hmm, but I think we need to copy buffer tags into bgwriter's local memory > in order to avoid locking taga many times in the sorting. Is it better to > allocate sorting buffers at the first time and keep and reuse it from then on? That what I was thinking: allocate the memory when the background writer starts and just always have it there, the allocation you're doing is always the same size. If it's in use 50% of the time anyway (which it is if you have checkpoint_completion_target at its default), why introduce the risk that an allocation will fail at checkpoint time? Just allocate it once and keep it around. > It is 0.25% of shared buffers; when shared_buffers is set to 10GB, > it takes 25MB of process local memory. Your numbers are probably closer to correct. I was being pessimistic about the size of all the integers just to demonstrate that it's not really a significant amount of memory even if they're large. > If we want to consume less memory for it, RelFileNode in BufferTag could > be hashed and packed into an integer I personally don't feel it's worth making the code any more complicated than it needs to be just to save a fraction of a percent of the total memory used by the buffer pool. -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD -- Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-patches |
| |||
| ITAGAKI Takahiro <itagaki.takahiro@oss.ntt.co.jp> writes: > Greg Smith <gsmith@gregsmith.com> wrote: >> If shared_buffers(=NBuffers) is set to something big, this could give some >> memory churn. And I think it's a bad idea to allocate something this >> large at checkpoint time, because what happens if that fails? Really not >> the time you want to discover there's no RAM left. > Hmm, but I think we need to copy buffer tags into bgwriter's local memory > in order to avoid locking taga many times in the sorting. I updated this patch to permanently allocate the working array as Greg suggests, and to fix a bunch of commenting issues (attached). However, I am completely unable to measure any performance improvement from it. Given the possible risk of out-of-memory failures, I think the patch should not be applied without some direct proof of performance benefits, and I don't see any. regards, tom lane Index: src/backend/storage/buffer/bufmgr.c ================================================== ================= RCS file: /cvsroot/pgsql/src/backend/storage/buffer/bufmgr.c,v retrieving revision 1.228 diff -c -r1.228 bufmgr.c *** src/backend/storage/buffer/bufmgr.c 1 Jan 2008 19:45:51 -0000 1.228 --- src/backend/storage/buffer/bufmgr.c 4 May 2008 01:11:08 -0000 *************** *** 56,61 **** --- 56,68 ---- #define BUF_WRITTEN 0x01 #define BUF_REUSABLE 0x02 + /* Struct for BufferSync's internal to-do list */ + typedef struct BufAndTag + { + int buf_id; + BufferTag tag; + } BufAndTag; + /* GUC variables */ bool zero_damaged_pages = false; *************** *** 986,991 **** --- 993,1025 ---- } /* + * qsort comparator for BufferSync + */ + static int + bufandtagcmp(const void *a, const void *b) + { + const BufAndTag *lhs = (const BufAndTag *) a; + const BufAndTag *rhs = (const BufAndTag *) b; + int r; + + /* + * We don't much care about the order in which different relations get + * written, so memcmp is enough for comparing the relfilenodes, + * even though its behavior will be platform-dependent. + */ + r = memcmp(&lhs->tag.rnode, &rhs->tag.rnode, sizeof(lhs->tag.rnode)); + if (r != 0) + return r; + + /* We do want blocks within a relation to be ordered accurately */ + if (lhs->tag.blockNum < rhs->tag.blockNum) + return -1; + if (lhs->tag.blockNum > rhs->tag.blockNum) + return 1; + return 0; + } + + /* * BufferSync -- Write out all dirty buffers in the pool. * * This is called at checkpoint time to write out all dirty shared buffers. *************** *** 995,1004 **** static void BufferSync(int flags) { int buf_id; - int num_to_scan; int num_to_write; int num_written; /* Make sure we can handle the pin inside SyncOneBuffer */ ResourceOwnerEnlargeBuffers(CurrentResourceOwner); --- 1029,1056 ---- static void BufferSync(int flags) { + static BufAndTag *bufs_to_write = NULL; int buf_id; int num_to_write; int num_written; + int i; + + /* + * We allocate the bufs_to_write[] array on first call and keep it + * around for the life of the process. This is okay because in normal + * operation this function is only called within the bgwriter, so + * we won't have lots of large arrays floating around. We prefer this + * way because we don't want checkpoints to suddenly start failing + * when the system gets under memory pressure. + */ + if (bufs_to_write == NULL) + { + bufs_to_write = (BufAndTag *) malloc(NBuffers * sizeof(BufAndTag)); + if (bufs_to_write == NULL) + ereport(FATAL, + (errcode(ERRCODE_OUT_OF_MEMORY), + errmsg("out of memory"))); + } /* Make sure we can handle the pin inside SyncOneBuffer */ ResourceOwnerEnlargeBuffers(CurrentResourceOwner); *************** *** 1033,1038 **** --- 1085,1092 ---- if (bufHdr->flags & BM_DIRTY) { bufHdr->flags |= BM_CHECKPOINT_NEEDED; + bufs_to_write[num_to_write].buf_id = buf_id; + bufs_to_write[num_to_write].tag = bufHdr->tag; num_to_write++; } *************** *** 1043,1061 **** return; /* nothing to do */ /* ! * Loop over all buffers again, and write the ones (still) marked with ! * BM_CHECKPOINT_NEEDED. In this loop, we start at the clock sweep point ! * since we might as well dump soon-to-be-recycled buffers first. ! * ! * Note that we don't read the buffer alloc count here --- that should be ! * left untouched till the next BgBufferSync() call. */ - buf_id = StrategySyncStart(NULL, NULL); - num_to_scan = NBuffers; num_written = 0; ! while (num_to_scan-- > 0) { ! volatile BufferDesc *bufHdr = &BufferDescriptors[buf_id]; /* * We don't need to acquire the lock here, because we're only looking --- 1097,1120 ---- return; /* nothing to do */ /* ! * Sort the buffers-to-be-written into order by file and block number. ! * This improves sequentiality of access for the upcoming I/O. ! */ ! qsort(bufs_to_write, num_to_write, sizeof(BufAndTag), bufandtagcmp); ! ! /* ! * Loop over all buffers to be written, and write the ones (still) marked ! * with BM_CHECKPOINT_NEEDED. Note that we don't need to recheck the ! * buffer tag, because if the buffer has been reassigned it cannot have ! * BM_CHECKPOINT_NEEDED still set. */ num_written = 0; ! for (i = 0; i < num_to_write; i++) { ! volatile BufferDesc *bufHdr; ! ! buf_id = bufs_to_write[i].buf_id; ! bufHdr = &BufferDescriptors[buf_id]; /* * We don't need to acquire the lock here, because we're only looking *************** *** 1077,1096 **** num_written++; /* - * We know there are at most num_to_write buffers with - * BM_CHECKPOINT_NEEDED set; so we can stop scanning if - * num_written reaches num_to_write. - * - * Note that num_written doesn't include buffers written by - * other backends, or by the bgwriter cleaning scan. That - * means that the estimate of how much progress we've made is - * conservative, and also that this test will often fail to - * trigger. But it seems worth making anyway. - */ - if (num_written >= num_to_write) - break; - - /* * Perform normal bgwriter duties and sleep to throttle our * I/O rate. */ --- 1136,1141 ---- *************** *** 1098,1110 **** (double) num_written / num_to_write); } } - - if (++buf_id >= NBuffers) - buf_id = 0; } /* ! * Update checkpoint statistics. As noted above, this doesn't include * buffers written by other backends or bgwriter scan. */ CheckpointStats.ckpt_bufs_written += num_written; --- 1143,1152 ---- (double) num_written / num_to_write); } } } /* ! * Update checkpoint statistics. The num_written count doesn't include * buffers written by other backends or bgwriter scan. */ CheckpointStats.ckpt_bufs_written += num_written; Index: src/backend/storage/buffer/freelist.c ================================================== ================= RCS file: /cvsroot/pgsql/src/backend/storage/buffer/freelist.c,v retrieving revision 1.64 diff -c -r1.64 freelist.c *** src/backend/storage/buffer/freelist.c 1 Jan 2008 19:45:51 -0000 1.64 --- src/backend/storage/buffer/freelist.c 4 May 2008 01:11:08 -0000 *************** *** 241,250 **** } /* ! * StrategySyncStart -- tell BufferSync where to start syncing * ! * The result is the buffer index of the best buffer to sync first. ! * BufferSync() will proceed circularly around the buffer array from there. * * In addition, we return the completed-pass count (which is effectively * the higher-order bits of nextVictimBuffer) and the count of recent buffer --- 241,251 ---- } /* ! * StrategySyncStart -- tell BgBufferSync where we are reclaiming buffers * ! * The result is the buffer index of the next possible victim buffer. ! * BgBufferSync() tries to keep the buffers immediately in front of this ! * point clean. * * In addition, we return the completed-pass count (which is effectively * the higher-order bits of nextVictimBuffer) and the count of recent buffer -- Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-patches |
| |||
| On Sun, 4 May 2008, Tom Lane wrote: > However, I am completely unable to measure any performance improvement > from it. Given the possible risk of out-of-memory failures, I think the > patch should not be applied without some direct proof of performance > benefits, and I don't see any. Fair enough. There were some pgbench results attached to the original patch submission that gave me a good idea how to replicate the situation where there's some improvement. I expect I can take a shot at quantifying that independantly near the end of this month if nobody else gets to it before then (I'm stuck sorting out a number of OS level issue right now before my testing system is online again). Was planning to take a longer look at Greg Stark's prefetching work at that point as well. -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD -- Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-patches |
| |||
| Greg Smith <gsmith@gregsmith.com> writes: > On Sun, 4 May 2008, Tom Lane wrote: >> However, I am completely unable to measure any performance improvement >> from it. Given the possible risk of out-of-memory failures, I think the >> patch should not be applied without some direct proof of performance >> benefits, and I don't see any. > Fair enough. There were some pgbench results attached to the original > patch submission that gave me a good idea how to replicate the situation > where there's some improvement. Well, I tried a pgbench test similar to that one --- on smaller hardware than was reported, so it was a bit smaller test case, but it should have given similar results. I didn't see any improvement; if anything it was a bit worse. So that's what got me concerned. Of course it's notoriously hard to get consistent numbers out of pgbench anyway, so I'd rather see some other test case ... > I expect I can take a shot at quantifying > that independantly near the end of this month if nobody else gets to it > before then (I'm stuck sorting out a number of OS level issue right now > before my testing system is online again). Was planning to take a longer > look at Greg Stark's prefetching work at that point as well. Fair enough. Unless someone can volunteer to test sooner, I think we should drop this item from the current commitfest queue. regards, tom lane -- Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-patches |
| |||
| On Sun, 4 May 2008, Tom Lane wrote: > Well, I tried a pgbench test similar to that one --- on smaller hardware > than was reported, so it was a bit smaller test case, but it should have > given similar results. My pet theory on cases where sorting will help suggests you may need a write-caching controller for this patch to be useful. I expect we'll see the biggest improvement in situations where the total amount of dirty buffers is larger than the write cache and the cache becomes blocked. If you're not offloading to another device like that, the OS-level elevator sorting will handle sorting for you close enough to optimally that I doubt this will help much (and in fact may just get in the way). > Of course it's notoriously hard to get consistent numbers out of pgbench > anyway, so I'd rather see some other test case ... I have some tools to run pgbench results many times and look for patterns that work fairly well for the consistency part. pgbench will dirty a very high percentage of the buffer cache by checkpoint time relative to how much work it does, which makes it close to a best case for confirming there is a potential improvement here. I think a reasonable approach is to continue trying to quantify some improvement using pgbench with an eye toward also doing DBT2 tests, which provoke similar behavior at checkpoint time. I suspect someone who already has a known good DBT2 lab setup with caching controller hardware (EDB?) might be able to do a useful test of this patch without too much trouble on their part. > Unless someone can volunteer to test sooner, I think we should drop this > item from the current commitfest queue. This patch took a good step forward toward being commited this round with your review, which is the important part from my perspective (as someone who would like this to be committed if it truly works). I expect that performance related patches will often take more than one commitfest to pass through. From the perspective of keeping the committer's plates clean, a reasonable system for this situation might be for you to bounce this into the rejected pile as "Returned for testing" immediately, to clearly remove it from the main queue. A reasonable expectation there is that you might consider it again during May if someone gets back with said testing results before the 'fest ends. -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD -- Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-patches |
| |||
| Greg Smith <gsmith@gregsmith.com> writes: > On Sun, 4 May 2008, Tom Lane wrote: >> Well, I tried a pgbench test similar to that one --- on smaller hardware >> than was reported, so it was a bit smaller test case, but it should have >> given similar results. > ... If > you're not offloading to another device like that, the OS-level elevator > sorting will handle sorting for you close enough to optimally that I doubt > this will help much (and in fact may just get in the way). Yeah. It bothers me a bit that the patch forces writes to be done "all of file A in order, then all of file B in order, etc". We don't know enough about the disk layout of the files to be sure that that's good. (This might also mean that whether there is a win is going to be platform and filesystem dependent ...) >> Unless someone can volunteer to test sooner, I think we should drop this >> item from the current commitfest queue. > From the perspective of keeping the committer's plates clean, a reasonable > system for this situation might be for you to bounce this into the > rejected pile as "Returned for testing" immediately, to clearly remove it > from the main queue. A reasonable expectation there is that you might > consider it again during May if someone gets back with said testing > results before the 'fest ends. Right, that's in the ground rules for commitfests: if the submitter can respond to complaints before the fest is over, we'll reconsider the patch. regards, tom lane -- Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-patches |
| ||||
| On Mon, 5 May 2008, Tom Lane wrote: > It bothers me a bit that the patch forces writes to be done "all of file > A in order, then all of file B in order, etc". We don't know enough > about the disk layout of the files to be sure that that's good. (This > might also mean that whether there is a win is going to be platform and > filesystem dependent ...) I think most platform and filesystem implementations have disk location correlated enough with block order that this particular issue isn't a large one. If the writes are mainly going to one logical area (a single partition or disk array), it should be a win as long as the sorting step itself isn't introducing a delay. I am concered that in a more complicated case than pgbench, where the writes are spread across multiple arrays say, that forcing writes in order may slow things down. Example: let's say there's two tablespaces mapped to two arrays, A and B, that the data is being written to at checkpoint time. In the current case, that I/O might be AABAABABBBAB, which is going to keep both arrays busy writing. The sorted case would instead make that AAAAAABBBBBB so only one array will be active at a time. It may very well be the case that the improvement from lowering seeks on the writes to A and B is less than the loss coming from not keeping both continuously busy. I think I can simulate this by using a modified pgbench script that works against an accounts1 and accounts2 with equal frequency, where 1&2 are actually on different tablespaces on two disks. > Right, that's in the ground rules for commitfests: if the submitter can > respond to complaints before the fest is over, we'll reconsider the > patch. The small optimization I was trying to suggest was that you just bounce this type of patch automatically to the "rejected for <x>" section of the commitfest wiki page in cases like these. The standard practice on this sort of queue is to automatically reclassify when someone has made a pass over the patch, leaving the original source to re-open with more information. That keeps the unprocessed part of the queue always shrinking, and as long as people know that they can get it reconsidered by submitting new results it's not unfair to them. -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD -- Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-patches |