This is a discussion on Load Distributed Checkpoints, take 3 within the Pgsql Patches forums, part of the PostgreSQL category; --> Here's an updated WIP patch for load distributed checkpoints. I added a spinlock to protect the signaling fields between ...
| |||||||
| FAQ | Members List | Calendar | Search | Today's Posts | Mark Forums Read |
| ||||
| Here's an updated WIP patch for load distributed checkpoints. I added a spinlock to protect the signaling fields between bgwriter and backends. The current non-locking approach gets really difficult as the patch adds two new flags, and both are more important than the existing ckpt_time_warn flag. In fact, I think there's a small race condition in CVS HEAD: 1. pg_start_backup() is called, which calls RequestCheckpoint 2. RequestCheckpoint takes note of the old value of ckpt_started 3. bgwriter wakes up from pg_usleep, and sees that we've exceeded checkpoint_timeout. 4. bgwriter increases ckpt_started to note that a new checkpoint has started 5. RequestCheckpoint signals bgwriter to start a new checkpoint 6. bgwriter calls CreateCheckpoint, with the force-flag set to false because this checkpoint was triggered by timeout 7. RequestCheckpoint sees that ckpt_started has increased, and starts to wait for ckpt_done to reach the new value. 8. CreateCheckpoint finishes immediately, because there was no XLOG activity since last checkpoint. 9. RequestCheckpoint sees that ckpt_done matches ckpt_started, and returns. 10. pg_start_backup() continues, with potentially the same redo location and thus history filename as previous backup. Now I admit that the chances for that to happen are extremely small, people don't usually do two pg_start_backup calls without *any* WAL logged activity in between them, for example. But as we add the new flags, avoiding scenarios like that becomes harder. Since last patch, I did some clean up and refactoring, and added a bunch of comments, and user documentation. I haven't yet changed GetInsertRecPtr to use the almost up-to-date value protected by the info_lck per Simon's suggestion, and I need to do some correctness testing. After that, I'm done with the patch. Ps. In case you wonder what took me so long since last revision, I've spent a lot of time reviewing HOT. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com ---------------------------(end of broadcast)--------------------------- TIP 5: don't forget to increase your free space map settings |
| |||
| Heikki Linnakangas <heikki@enterprisedb.com> writes: > I added a spinlock to protect the signaling fields between bgwriter and > backends. The current non-locking approach gets really difficult as the > patch adds two new flags, and both are more important than the existing > ckpt_time_warn flag. That may be, but you could minimize the cost and notational ugliness by not using the spinlock where you don't have to. Put the sig_atomic_t fields back the way they were, and many of the uses of the spinlock go away. All you really need it for is the places where the additional flags are set or read. > In fact, I think there's a small race condition in CVS HEAD: Yeah, probably --- the original no-locking design didn't have any side flags. The reason you need the lock is for a backend to be sure that a newly-started checkpoint is using its requested flags. But the detection of checkpoint termination is still the same. Some other comments: I tend to agree with whoever said upthread that the combination of GUC variables proposed here is confusing and ugly. It'd make more sense to have min and max checkpoint rates in KB/s, with the max checkpoint rate only honored when we are predicting we'll finish before the next checkpoint time. The flow of control and responsibility between xlog.c, bgwriter.c and bufmgr.c seems to have reached the breaking point of unintelligibility. Can you think of a refactoring that would simplify it? We might have to resign ourselves to this much messiness, but I'd rather not... regards, tom lane ---------------------------(end of broadcast)--------------------------- TIP 5: don't forget to increase your free space map settings |
| |||
| Heikki Linnakangas <heikki@enterprisedb.com> wrote: > Here's an updated WIP patch for load distributed checkpoints. > Since last patch, I did some clean up and refactoring, and added a bunch > of comments, and user documentation. The only thing I don't understand is the naming of 'checkpoint_smoothing'. Can users imagine the unit of 'smoothing' is a fraction? You explain the paremeter with the word 'fraction'. Why don't you simply name it 'checkpoint_fraction' ? | Specifies the target length of checkpoints, as a fraction of | the checkpoint interval. The default is 0.3. Sorry if I'm missing discussions abount the naming. Regards, --- ITAGAKI Takahiro NTT Open Source Software Center ---------------------------(end of broadcast)--------------------------- TIP 6: explain analyze is your friend |
| |||
| ITAGAKI Takahiro wrote: > The only thing I don't understand is the naming of 'checkpoint_smoothing'. > Can users imagine the unit of 'smoothing' is a fraction? > > You explain the paremeter with the word 'fraction'. > Why don't you simply name it 'checkpoint_fraction' ? > | Specifies the target length of checkpoints, as a fraction of > | the checkpoint interval. The default is 0.3. I chose checkpoint_smoothing because it tells you what the parameter is for. If you want more smoothing, tune it up, and if you want less, tune it down. checkpoint_fraction makes you wonder what you can do with it and why you would change it. > Sorry if I'm missing discussions abount the naming. No, I chose _smoothing on my own. People didn't like checkpoint_write_percent either (including). -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com ---------------------------(end of broadcast)--------------------------- TIP 2: Don't 'kill -9' the postmaster |
| |||
| Tom Lane wrote: > Heikki Linnakangas <heikki@enterprisedb.com> writes: >> I added a spinlock to protect the signaling fields between bgwriter and >> backends. The current non-locking approach gets really difficult as the >> patch adds two new flags, and both are more important than the existing >> ckpt_time_warn flag. > > That may be, but you could minimize the cost and notational ugliness by > not using the spinlock where you don't have to. Put the sig_atomic_t > fields back the way they were, and many of the uses of the spinlock go > away. All you really need it for is the places where the additional > flags are set or read. I find it easier to understand if it's used whenever any of the fields are accessed. You don't need to read and write ckpt_failed and ckpt_started/ckpt_done in specific order anymore, for example. > Some other comments: > > I tend to agree with whoever said upthread that the combination of GUC > variables proposed here is confusing and ugly. It'd make more sense to > have min and max checkpoint rates in KB/s, with the max checkpoint rate > only honored when we are predicting we'll finish before the next > checkpoint time. Really? I thought everyone is happy with the current combination, and that it was just the old trio of parameters controlling the write, nap and sync phases that was ugly. One particularly nice thing about defining the duration as a fraction of checkpoint interval is that we can come up with a reasonable default value that doesn't depend on your hardware. How would a min and max rate work? Anyone else have an opinion on the parameters? > The flow of control and responsibility between xlog.c, bgwriter.c and > bufmgr.c seems to have reached the breaking point of unintelligibility. > Can you think of a refactoring that would simplify it? We might have > to resign ourselves to this much messiness, but I'd rather not... The problem we're trying to solve is doing a checkpoint while running the normal bgwriter activity at the same time. The normal way to do two things simultaneously is to have two different processes (or threads). I thought about having a separate checkpoint process, but I gave up on that thought because the communication needed between backends, bgwriter and the checkpointer seems like a mess. The checkpointer would need the pendingOpsTable so that it can do the fsyncs, and it would also need to receive the forget-messages to that table. We could move that table entirely to the checkpointer, but since bgwriter is presumably doing most of the writes, there would be a lot of chatter between bgwriter and the checkpointer. The current approach is like co-operative multitasking. BufferSyncs yields control to bgwriter every now and then. The division of labor between xlog.c and other modules is not that bad, IMO. CreateCheckPoint is the main entry point to create a checkpoint, and it calls other modules to do their stuff, including bufmgr.c. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com ---------------------------(end of broadcast)--------------------------- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match |
| |||
| Tom Lane wrote: > Heikki Linnakangas <heikki@enterprisedb.com> writes: >> In fact, I think there's a small race condition in CVS HEAD: > > Yeah, probably --- the original no-locking design didn't have any side > flags. The reason you need the lock is for a backend to be sure that > a newly-started checkpoint is using its requested flags. But the > detection of checkpoint termination is still the same. Actually, the race condition I outlined isn't related to the flags. It's possible because RequestCheckpoint doesn't guarantee that a checkpoint is performed when there's been no WAL activity since last one. I did use a new force-flag to fix it, but I'm sure there is other ways. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com ---------------------------(end of broadcast)--------------------------- TIP 4: Have you searched our list archives? http://archives.postgresql.org |
| |||
| Heikki Linnakangas <heikki@enterprisedb.com> writes: > Tom Lane wrote: >> I tend to agree with whoever said upthread that the combination of GUC >> variables proposed here is confusing and ugly. It'd make more sense to >> have min and max checkpoint rates in KB/s, with the max checkpoint rate >> only honored when we are predicting we'll finish before the next >> checkpoint time. > Really? I thought everyone is happy with the current combination, and > that it was just the old trio of parameters controlling the write, nap > and sync phases that was ugly. One particularly nice thing about > defining the duration as a fraction of checkpoint interval is that we > can come up with a reasonable default value that doesn't depend on your > hardware. That argument would hold some water if you weren't introducing a hardware-dependent min rate in the same patch. Do we need the min rate at all? If so, why can't it be in the same units as the max (ie, a fraction of checkpoint)? > How would a min and max rate work? Pretty much the same as the code does now, no? You either delay, or not. regards, tom lane ---------------------------(end of broadcast)--------------------------- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match |
| |||
| Tom Lane wrote: > Heikki Linnakangas <heikki@enterprisedb.com> writes: >> Tom Lane wrote: >>> I tend to agree with whoever said upthread that the combination of GUC >>> variables proposed here is confusing and ugly. It'd make more sense to >>> have min and max checkpoint rates in KB/s, with the max checkpoint rate >>> only honored when we are predicting we'll finish before the next >>> checkpoint time. > >> Really? I thought everyone is happy with the current combination, and >> that it was just the old trio of parameters controlling the write, nap >> and sync phases that was ugly. One particularly nice thing about >> defining the duration as a fraction of checkpoint interval is that we >> can come up with a reasonable default value that doesn't depend on your >> hardware. > > That argument would hold some water if you weren't introducing a > hardware-dependent min rate in the same patch. Do we need the min rate > at all? If so, why can't it be in the same units as the max (ie, a > fraction of checkpoint)? > >> How would a min and max rate work? > > Pretty much the same as the code does now, no? You either delay, or not. I don't think you understand how the settings work. Did you read the documentation? If you did, it's apparently not adequate. The main tuning knob is checkpoint_smoothing, which is defined as a fraction of the checkpoint interval (both checkpoint_timeout and checkpoint_segments are taken into account). Normally, the write phase of a checkpoint takes exactly that much time. So the length of a checkpoint stays the same regardless of how many dirty buffers there is (assuming you don't exceed the bandwidth of your hardware), but the I/O rate varies. There's another possible strategy: keep the I/O rate constant, but vary the length of the checkpoint. checkpoint_rate allows you to do that. I'm envisioning we set the defaults so that checkpoint_smoothing is the effective control in a relatively busy system, and checkpoint_rate ensures that we don't unnecessarily prolong checkpoints on an idle system. Now how would you replace checkpoint_smoothing with a max I/O rate? The only real alternative way of defining it is directly as a time and/or segments, similar to checkpoint_timeout and checkpoint_segments, but then we'd really need two knobs instead of one. Though maybe we could just hard-code it to 0.8, for example, and tell people to tune checkpoint_rate if they want more aggressive checkpoints. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com ---------------------------(end of broadcast)--------------------------- TIP 4: Have you searched our list archives? http://archives.postgresql.org |
| |||
| Heikki Linnakangas <heikki@enterprisedb.com> writes: > I don't think you understand how the settings work. Did you read the > documentation? If you did, it's apparently not adequate. I did read the documentation, and I'm not complaining that I don't understand it. I'm complaining that I don't like the presented API because it's self-inconsistent. You've got two parameters that are in effect upper and lower bounds for the checkpoint write rate, but they are named inconsistently and not even measured in the same kind of unit. Nor do I agree that the inconsistency buys any ease of use. > The main tuning knob is checkpoint_smoothing, which is defined as a > fraction of the checkpoint interval (both checkpoint_timeout and > checkpoint_segments are taken into account). Normally, the write phase > of a checkpoint takes exactly that much time. So the question is, why in the heck would anyone want the behavior that "checkpoints take exactly X time"?? The useful part of this whole patch is to cap the write rate at something that doesn't interfere too much with foreground queries. I don't see why people wouldn't prefer "checkpoints can take any amount of time up to the checkpoint interval, but we do our best not to exceed Y writes/second". Basically I don't see what useful values checkpoint_smoothing would have other than 0 and 1. You might as well make it a bool. > There's another possible strategy: keep the I/O rate constant, but vary > the length of the checkpoint. checkpoint_rate allows you to do that. But only from the lower side. > Now how would you replace checkpoint_smoothing with a max I/O rate? I don't see why you think that's hard. It looks to me like the components of the decision are the same numbers in any case: you have to estimate your progress towards checkpoint completion, your available time till next checkpoint, and your write rate. Then you either delay or not. regards, tom lane ---------------------------(end of broadcast)--------------------------- TIP 4: Have you searched our list archives? http://archives.postgresql.org |
| ||||
| Tom Lane wrote: > Heikki Linnakangas <heikki@enterprisedb.com> writes: >> The main tuning knob is checkpoint_smoothing, which is defined as a >> fraction of the checkpoint interval (both checkpoint_timeout and >> checkpoint_segments are taken into account). Normally, the write phase >> of a checkpoint takes exactly that much time > > So the question is, why in the heck would anyone want the behavior that > "checkpoints take exactly X time"?? The useful part of this whole patch > is to cap the write rate at something that doesn't interfere too much > with foreground queries. I don't see why people wouldn't prefer > "checkpoints can take any amount of time up to the checkpoint interval, > but we do our best not to exceed Y writes/second". Because it's easier to tune. You don't need to know how much checkpoint I/O you can tolerate. The system will use just enough I/O bandwidth to meet the deadline, but not more than that. > Basically I don't see what useful values checkpoint_smoothing would have > other than 0 and 1. You might as well make it a bool. Well that's one option. It feels like a good thing to be able to control how much headroom you have until the next checkpoint, but maybe we can just hardcode it close to 1. It's also good to avoid spreading the checkpoints unnecessarily, to keep recovery times lower, but you can control that with the min rate setting as well. >> There's another possible strategy: keep the I/O rate constant, but vary >> the length of the checkpoint. checkpoint_rate allows you to do that. > > But only from the lower side. > >> Now how would you replace checkpoint_smoothing with a max I/O rate? > > I don't see why you think that's hard. It looks to me like the > components of the decision are the same numbers in any case: you have to > estimate your progress towards checkpoint completion, your available > time till next checkpoint, and your write rate. Then you either delay > or not. Let me put it this way: If you define a min and a max I/O rate, when would the max I/O rate limit take effect? If there's few dirty buffers in the pool, so that you'll finish the checkpoint in time before the next one is due writing at the min rate, that's what you'll use. If there's more, you'll need to write fast enough that you'll finish the checkpoint in time, regardless of the max rate. Or would you let the next checkpoint slip and keep writing at the max rate? That seems like a footgun if someone sets it to a too low value. Or are you thinking that we have just one setting: checkpoint_rate? You describe it as a maximum, but I've been thinking of it as a minimum because you *will* exceed it if the next checkpoint is due soon. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com ---------------------------(end of broadcast)--------------------------- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq |