This is a discussion on Group Commit within the pgsql Hackers forums, part of the PostgreSQL category.
I've been working on the patch to enhance our group commit behavior. The
patch is a dirty hack at the moment, but I'm settled on the algorithm
I'm going to use and I know the issues involved.

Here's the patch as it is if you want to try it out:

http://community.enterprisedb.com/gr...pghead-2.patch

but it needs a rewrite before being accepted. It'll only work on systems
that use sysv semaphores: I needed to add a function to acquire a
semaphore with a timeout, and I only did it for sysv_sema.c for now.

What are the chances of getting this in 8.3, assuming that I rewrite and
submit a patch within the next week or two?

Algorithm
---------

Instead of starting a WAL flush immediately after a commit record is
inserted, we wait a while to give other backends a chance to finish
their transactions and have them flushed by the same fsync call. There
are two things we can control: how many commits to wait for (commit
group size), and for how long (timeout).

We try to estimate the optimal commit group size. The estimate is

    commit group size = (# of commit records flushed +
                         # of commit records arrived while fsyncing)

This is a relatively simple estimate that works reasonably well with
very short transactions, and the timeout limits the damage when the
estimate is not working.

There are a lot more factors we could take into account in the estimate,
for example:

- # of backends and their states (affects how many are likely to commit
  soon)
- amount of WAL written since last XLogFlush (affects the duration of
  fsync)
- when exactly the commit records arrive (we don't want to wait 10 ms to
  get one more commit record in, when an fsync takes 11 ms)

but I wanted to keep this simple for now.

The timeout is currently hard-coded at 1 ms. I wanted to keep it short
compared to the time it takes to fsync (somewhere in the 5-15 ms range,
depending on hardware), to limit the damage when the algorithm isn't
getting the estimate right. We could also vary the timeout, but I'm not
sure how to calculate the optimal value, and the real granularity will
depend on the system anyhow.

Implementation
--------------

To count the # of commits since last XLogFlush, I added a new
XLogCtlCommit struct in shared memory:

    typedef struct XLogCtlCommit
    {
        slock_t     commit_lock;    /* protects the struct */
        int         commitCount;    /* # of commit records inserted since XLogFlush */
        int         groupSize;      /* current commit group size */
        XLogRecPtr  lastCommitPtr;  /* location of the latest commit record */
        PGPROC     *waiter;         /* process to signal when groupSize is reached */
    } XLogCtlCommit;

Whenever a commit record is inserted in XLogInsert, commitCount is
incremented and lastCommitPtr is updated. When it reaches groupSize, the
waiter-process is woken up.

In XLogFlush, after acquiring WALWriteLock, we wait until groupSize is
reached (or timeout expires) before doing the flush.

Instead of the current logic to flush as much WAL as possible, we flush
up to the last commit record. Flushing any more wouldn't save us an
fsync later on, but might make the current fsync take longer. By doing
that, we avoid the conditional acquire of the WALInsertLock that's in
there currently. We make note of commitCount before starting the fsync;
that's the # of commit records that arrived in time so that the fsync
will flush them. Let's call that value "intime".

After the fsync is finished, we update the groupSize for the next round.
The new groupSize is the current commitCount after the fsync, IOW the
number of commit records that arrived after the previous XLogFlush,
including the time it took to do the fsync. We update the commitCount by
decrementing it by "intime".

Now we're ready for the next round, and we can release WALWriteLock.

WALWriteLock
------------

The above would work nicely, except that a normal lwlock doesn't play
nicely. You can release and reacquire a lightweight lock in the same
time slice even when there are other backends queuing for the lock,
effectively cutting the queue.

Here's what sometimes happens, with 2 clients:

    Client 1                        Client 2
    do work                         do work
    insert commit record            insert commit record
    acquire WALWriteLock            try to acquire WALWriteLock, blocks
    fsync
    release WALWriteLock
    begin new transaction
    do work
    insert commit record
    reacquire WALWriteLock
    wait for 2nd commit to arrive

Client 1 will eventually time out and commit just its own commit record.
Client 2 should be released immediately after client 1 releases the
WALWriteLock. It only needs to observe that its commit record has
already been flushed and doesn't need to do anything.

To fix the above, and other race conditions like that, we need a
specialized WALWriteLock that orders the waiters by the commit record
XLogRecPtrs. WALWriteLockRelease wakes up all waiters that have their
commit record already flushed. They will just fall through without
acquiring the lock.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
choose an index scan if your joining column's datatypes do not match
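The bookkeeping described in the post above can be sketched in a few lines of C. This is only an illustration of the counter arithmetic and of the "already flushed" check for lock waiters, not the code from the patch: `CommitCounters`, `commit_inserted`, `flush_round` and `waiters_already_flushed` are hypothetical names, there is no locking or real WAL, and WAL positions are simplified to a plain integer rather than the real XLogRecPtr.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the counters in XLogCtlCommit.
 * No spinlock, no PGPROC, no real WAL: just the arithmetic. */
typedef struct CommitCounters
{
    int commitCount;    /* commit records inserted since the last flush */
    int groupSize;      /* commits the next flush waits for */
} CommitCounters;

/* A commit record was inserted.  Returns 1 when the group is full and
 * the waiting flusher should be woken, 0 otherwise. */
static int
commit_inserted(CommitCounters *c)
{
    c->commitCount++;
    return c->commitCount >= c->groupSize;
}

/* One flush round.  arrived_during_fsync models commit records that are
 * inserted while the fsync is running.  Returns "intime": the number of
 * commits this fsync made durable. */
static int
flush_round(CommitCounters *c, int arrived_during_fsync)
{
    int intime = c->commitCount;

    assert(arrived_during_fsync >= 0);
    c->commitCount += arrived_during_fsync;
    c->groupSize = c->commitCount;  /* flushed + arrived while fsyncing */
    c->commitCount -= intime;       /* late arrivals wait for the next round */
    return intime;
}

/* WAL positions simplified to one integer (an assumption, not the real
 * XLogRecPtr).  waiters[] holds each waiter's commit record position,
 * sorted ascending.  Returns how many waiters at the head of the queue
 * are already covered by flushed_upto and could return without taking
 * the lock. */
static size_t
waiters_already_flushed(const unsigned long *waiters, size_t n,
                        unsigned long flushed_upto)
{
    size_t i = 0;

    while (i < n && waiters[i] <= flushed_upto)
        i++;
    return i;
}
```

Starting from groupSize = 1, a round in which one commit is flushed and three more arrive during the fsync leaves groupSize at 4 and commitCount at 3, so the next flush waits for one more commit (or the timeout) before starting.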
| |||
I wrote:
> What are the chances of getting this in 8.3, assuming that I rewrite and
> submit a patch within the next week or two?

I also intend to do performance testing with different workloads to
ensure the patch doesn't introduce a performance regression under some
conditions.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
| |||
Heikki Linnakangas <heikki@enterprisedb.com> writes:
> I've been working on the patch to enhance our group commit behavior. The
> patch is a dirty hack at the moment, but I'm settled on the algorithm
> I'm going to use and I know the issues involved.
> ...
> The timeout is currently hard-coded at 1 ms.

This is where my bogometer triggered. There's way too many platforms
where 1 msec timeout is a sheer fantasy. If you cannot make it perform
well with a 10-msec timeout then I don't think it's going to be at all
portable.

Now I know that newer Linux kernels tend to ship with 1KHz scheduler
tick rate, so there's a useful set of platforms where you could make it
work even so, but I'm not really satisfied with saying "this facility is
only usable if you have a fast kernel tick rate" ...

regards, tom lane
| |||
Tom Lane wrote:
> Heikki Linnakangas <heikki@enterprisedb.com> writes:
>> I've been working on the patch to enhance our group commit behavior. The
>> patch is a dirty hack at the moment, but I'm settled on the algorithm
>> I'm going to use and I know the issues involved.
>> ...
>> The timeout is currently hard-coded at 1 ms.
>
> This is where my bogometer triggered. There's way too many platforms
> where 1 msec timeout is a sheer fantasy. If you cannot make it perform
> well with a 10-msec timeout then I don't think it's going to be at all
> portable.
>
> Now I know that newer Linux kernels tend to ship with 1KHz scheduler
> tick rate, so there's a useful set of platforms where you could make it
> work even so, but I'm not really satisfied with saying "this facility is
> only usable if you have a fast kernel tick rate" ...

The 1 ms timeout isn't essential for the algorithm. In fact, I chose it
arbitrarily; in the quick tests I did, the length of the timeout didn't
seem to matter much. I'm running with a CONFIG_HZ=250 kernel myself,
which means that the timeout is really 4 ms on my laptop.

I suspect the tick rate largely explains why the current commit_delay
isn't very good: even though you specify it in microseconds, it really
waits a lot longer. With the proposed algorithm, the fsync is started
immediately when enough commit records have been inserted, so the
timeout only comes into play when the estimate for the group size is too
high.

With a higher-precision timer, we could vary not only the commit group
size but also the timeout.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
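The tick-rate arithmetic in this exchange can be made concrete with a small sketch. It assumes a deliberately simple model in which a sleep request is rounded up to a whole number of scheduler ticks; real kernels may add a further tick or use high-resolution timers, so treat the numbers as illustrative. `effective_sleep_us` is a hypothetical helper, not a PostgreSQL or kernel function.

```c
#include <assert.h>

/* Simplified model (an assumption): a sleep of requested_us microseconds
 * on a kernel with a tick rate of hz ticks per second lasts a whole
 * number of ticks, rounded up. */
static long
effective_sleep_us(long requested_us, int hz)
{
    long tick_us;
    long ticks;

    assert(hz > 0 && hz <= 1000000);
    assert(requested_us >= 0);

    tick_us = 1000000L / hz;                        /* length of one tick */
    ticks = (requested_us + tick_us - 1) / tick_us; /* round up */
    return ticks * tick_us;
}
```

Under this model a 1000 us request at hz = 250 lasts 4000 us, matching the 4 ms figure quoted for CONFIG_HZ=250, and at hz = 100 it lasts 10000 us. A commit_delay of 10 us would also stretch to a full tick, which is consistent with the remark that it "really waits a lot longer".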
| |||
Heikki Linnakangas <heikki@enterprisedb.com> writes:
> Tom Lane wrote:
>> This is where my bogometer triggered. There's way too many platforms
>> where 1 msec timeout is a sheer fantasy. If you cannot make it perform
>> well with a 10-msec timeout then I don't think it's going to be at all
>> portable.

> The 1 ms timeout isn't essential for the algorithm.

OK, but when you get to performance testing, please see how well it
works at CONFIG_HZ=100.

regards, tom lane
| |||
Heikki Linnakangas <heikki@enterprisedb.com> writes:
> I've been working on the patch to enhance our group commit behavior. The
> patch is a dirty hack at the moment, but I'm settled on the algorithm
> I'm going to use and I know the issues involved.

One question that just came to mind is whether Simon's no-commit-wait
patch doesn't fundamentally alter the context of discussion for this.
Aside from the prospect that people won't really care about group commit
if they can just use the periodic-WAL-sync approach, ISTM that one way
to get group commit is to just make everybody wait for the dedicated
WAL writer to write their commit record. With a sufficiently short
delay between write/fsync attempts in the background process, won't
that net out at about the same place as a complicated group-commit
patch?

regards, tom lane
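A rough way to see how waiting for a dedicated WAL writer would produce group commit is to model the writer as waking at fixed intervals and flushing everything inserted so far. The sketch below assumes exactly that and nothing more: fsync duration is ignored, commit times are given in ascending order, and `walwriter_rounds` is a hypothetical name rather than anything from Simon's patch.

```c
#include <assert.h>

/* Simplified model (an assumption): the WAL writer wakes every
 * interval_ms milliseconds, starting at time 0, and one fsync covers
 * every commit record inserted up to that moment.  commit_ms[] must be
 * sorted ascending and non-negative.  Fills durable_at_ms[i] with the
 * time commit i becomes durable and returns the number of fsyncs that
 * had at least one commit to flush. */
static int
walwriter_rounds(const long *commit_ms, int n, long interval_ms,
                 long *durable_at_ms)
{
    int  fsyncs = 0;
    long last_wake = -1;
    int  i;

    assert(interval_ms > 0);

    for (i = 0; i < n; i++)
    {
        long wake;

        assert(commit_ms[i] >= 0);

        /* first wake-up at or after the commit record was inserted */
        wake = ((commit_ms[i] + interval_ms - 1) / interval_ms) * interval_ms;
        durable_at_ms[i] = wake;

        if (wake != last_wake)
        {
            fsyncs++;           /* a new round was needed */
            last_wake = wake;
        }
    }
    return fsyncs;
}
```

With commits at 1, 2, 3, 12 and 19 ms and a 10 ms interval, the model needs two fsyncs instead of five, at the price of each committer waiting until the next wake-up. A shorter interval lowers that wait and shrinks the groups, which is the trade-off the question above is pointing at.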
| |||
Tom Lane wrote:
> Heikki Linnakangas <heikki@enterprisedb.com> writes:
> > I've been working on the patch to enhance our group commit behavior. The
> > patch is a dirty hack at the moment, but I'm settled on the algorithm
> > I'm going to use and I know the issues involved.
>
> One question that just came to mind is whether Simon's no-commit-wait
> patch doesn't fundamentally alter the context of discussion for this.
> Aside from the prospect that people won't really care about group commit
> if they can just use the periodic-WAL-sync approach, ISTM that one way
> to get group commit is to just make everybody wait for the dedicated
> WAL writer to write their commit record. With a sufficiently short
> delay between write/fsync attempts in the background process, won't
> that net out at about the same place as a complicated group-commit
> patch?

This is a good point. commit_delay was designed to allow multiple
transactions to be flushed with a single fsync. no-commit-wait is going
to do this much more effectively (the client doesn't have to wait for
the other transactions). The one thing commit_delay gives us that
no-commit-wait does not is the guarantee that a commit returned to the
client is on disk, without any milliseconds delay.

The big question is who is going to care about the milliseconds delay
and is using a configuration that is going to benefit from commit_delay.
Basically, commit_delay always had a very limited use-case, but now with
no-commit-wait, commit_delay has an even smaller use-case.

I think the big question is whether commit_delay is ever going to be
generally useful. I tried to find out in which release commit_delay was
added, and remembered that the feature was so questionable we did not
mention its addition in the 7.1 release notes. After six years, we are
still unsure about the feature.

Another big question is whether commit_delay is _ever_ going to be
useful, and with no-commit-wait being added, commit_delay looks even
more questionable and perhaps it should just be removed in 8.3.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +
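For readers unfamiliar with the existing feature, the rule behind commit_delay can be paraphrased as below. This is a sketch based on how the PostgreSQL documentation describes commit_delay together with its companion setting commit_siblings (which the thread does not mention): the sleep before the flush is only taken when enough other transactions are active. The function name and parameters are hypothetical, and this is not the server's actual code.

```c
#include <assert.h>

/* Paraphrase of the commit_delay rule (an assumption, see the text
 * above): returns the number of microseconds a committing backend
 * sleeps before flushing WAL, or 0 if it flushes right away. */
static int
commit_delay_to_apply(int commit_delay_us, int commit_siblings,
                      int active_backends, int fsync_enabled)
{
    assert(commit_delay_us >= 0);
    assert(commit_siblings >= 0);
    assert(active_backends >= 0);

    /* Only worth sleeping if someone else is likely to commit soon. */
    if (commit_delay_us > 0 && fsync_enabled &&
        active_backends >= commit_siblings)
        return commit_delay_us;

    return 0;
}
```

The delay is a fixed guess rather than an estimate that adapts to the observed commit rate, which is part of why its usefulness is being questioned here.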
| |||
On Mon, 9 Apr 2007, Bruce Momjian wrote:

> The big question is who is going to care about the milliseconds delay
> and is using a configuration that is going to benefit from commit_delay.

I care. WAL writes are a major bottleneck when many clients are
committing near the same time. Both times I've played with the
commit_delay settings I found it improved the peak throughput under load
at an acceptable low cost in latency. I'll try to present some numbers on
that when I get time, before you make me cry by taking it away.

An alternate mechanism that tells the client the commit is done when it
hasn't hit disk is of no use for the applications I work with, so I
haven't even been paying attention to no-commit-wait.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
| |||
Greg Smith <gsmith@gregsmith.com> writes:
> An alternate mechanism that tells the client the commit is done when it
> hasn't hit disk is of no use for the applications I work with, so I
> haven't even been paying attention to no-commit-wait.

Agreed, if you need "committed" to mean "committed" then no-wait isn't
going to float your boat. But the point I was making is that the
infrastructure Simon proposes (ie, a separate wal-writer process)
might be useful for this case too, with a lot less extra code than
Heikki is thinking about. Now maybe that won't work, but we should
certainly not consider these as entirely-independent patches.

regards, tom lane
| ||||
> On Mon, 9 Apr 2007, Bruce Momjian wrote:
>
> > The big question is who is going to care about the milliseconds delay
> > and is using a configuration that is going to benefit from commit_delay.
>
> I care. WAL writes are a major bottleneck when many clients are
> committing near the same time. Both times I've played with the
> commit_delay settings I found it improved the peak throughput under load
> at an acceptable low cost in latency. I'll try to present some numbers on
> that when I get time, before you make me cry by taking it away.

Totally agreed here. I experienced throughput improvement by using
commit_delay too.

> An alternate mechanism that tells the client the commit is done when it
> hasn't hit disk is of no use for the applications I work with, so I
> haven't even been paying attention to no-commit-wait.

Agreed too.

--
Tatsuo Ishii
SRA OSS, Inc. Japan