Unix Technical Forum

Load Distributed Checkpoints, take 3

This is a discussion on Load Distributed Checkpoints, take 3 within the Pgsql Patches forums, part of the PostgreSQL category; --> Here's an updated WIP patch for load distributed checkpoints. I added a spinlock to protect the signaling fields between ...


Go Back   Unix Technical Forum > Database Server Software > PostgreSQL > Pgsql Patches

FAQ Members List Calendar Search Today's Posts Mark Forums Read
  #1 (permalink)  
Old 04-18-2008, 11:15 AM
Heikki Linnakangas
 
Posts: n/a
Default Load Distributed Checkpoints, take 3

Here's an updated WIP patch for load distributed checkpoints.

I added a spinlock to protect the signaling fields between bgwriter and
backends. The current non-locking approach gets really difficult as the
patch adds two new flags, and both are more important than the existing
ckpt_time_warn flag.

In fact, I think there's a small race condition in CVS HEAD:

1. pg_start_backup() is called, which calls RequestCheckpoint
2. RequestCheckpoint takes note of the old value of ckpt_started
3. bgwriter wakes up from pg_usleep, and sees that we've exceeded
checkpoint_timeout.
4. bgwriter increases ckpt_started to note that a new checkpoint has started
5. RequestCheckpoint signals bgwriter to start a new checkpoint
6. bgwriter calls CreateCheckpoint, with the force-flag set to false
because this checkpoint was triggered by timeout
7. RequestCheckpoint sees that ckpt_started has increased, and starts to
wait for ckpt_done to reach the new value.
8. CreateCheckpoint finishes immediately, because there was no XLOG
activity since last checkpoint.
9. RequestCheckpoint sees that ckpt_done matches ckpt_started, and returns.
10. pg_start_backup() continues, with potentially the same redo location
and thus history filename as previous backup.

Now I admit that the chances for that to happen are extremely small,
people don't usually do two pg_start_backup calls without *any* WAL
logged activity in between them, for example. But as we add the new
flags, avoiding scenarios like that becomes harder.

Since last patch, I did some clean up and refactoring, and added a bunch
of comments, and user documentation.

I haven't yet changed GetInsertRecPtr to use the almost up-to-date value
protected by the info_lck per Simon's suggestion, and I need to do some
correctness testing. After that, I'm done with the patch.

Ps. In case you wonder what took me so long since last revision, I've
spent a lot of time reviewing HOT.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


---------------------------(end of broadcast)---------------------------
TIP 5: don't forget to increase your free space map settings

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #2 (permalink)  
Old 04-18-2008, 11:15 AM
Tom Lane
 
Posts: n/a
Default Re: Load Distributed Checkpoints, take 3

Heikki Linnakangas <heikki@enterprisedb.com> writes:
> I added a spinlock to protect the signaling fields between bgwriter and
> backends. The current non-locking approach gets really difficult as the
> patch adds two new flags, and both are more important than the existing
> ckpt_time_warn flag.


That may be, but you could minimize the cost and notational ugliness by
not using the spinlock where you don't have to. Put the sig_atomic_t
fields back the way they were, and many of the uses of the spinlock go
away. All you really need it for is the places where the additional
flags are set or read.

> In fact, I think there's a small race condition in CVS HEAD:


Yeah, probably --- the original no-locking design didn't have any side
flags. The reason you need the lock is for a backend to be sure that
a newly-started checkpoint is using its requested flags. But the
detection of checkpoint termination is still the same.

Some other comments:

I tend to agree with whoever said upthread that the combination of GUC
variables proposed here is confusing and ugly. It'd make more sense to
have min and max checkpoint rates in KB/s, with the max checkpoint rate
only honored when we are predicting we'll finish before the next
checkpoint time.

The flow of control and responsibility between xlog.c, bgwriter.c and
bufmgr.c seems to have reached the breaking point of unintelligibility.
Can you think of a refactoring that would simplify it? We might have
to resign ourselves to this much messiness, but I'd rather not...

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 5: don't forget to increase your free space map settings

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #3 (permalink)  
Old 04-18-2008, 11:15 AM
ITAGAKI Takahiro
 
Posts: n/a
Default Re: Load Distributed Checkpoints, take 3

Heikki Linnakangas <heikki@enterprisedb.com> wrote:

> Here's an updated WIP patch for load distributed checkpoints.
> Since last patch, I did some clean up and refactoring, and added a bunch
> of comments, and user documentation.


The only thing I don't understand is the naming of 'checkpoint_smoothing'.
Can users imagine the unit of 'smoothing' is a fraction?

You explain the paremeter with the word 'fraction'.
Why don't you simply name it 'checkpoint_fraction' ?
| Specifies the target length of checkpoints, as a fraction of
| the checkpoint interval. The default is 0.3.

Sorry if I'm missing discussions abount the naming.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center



---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #4 (permalink)  
Old 04-18-2008, 11:15 AM
Heikki Linnakangas
 
Posts: n/a
Default Re: Load Distributed Checkpoints, take 3

ITAGAKI Takahiro wrote:
> The only thing I don't understand is the naming of 'checkpoint_smoothing'.
> Can users imagine the unit of 'smoothing' is a fraction?
>
> You explain the paremeter with the word 'fraction'.
> Why don't you simply name it 'checkpoint_fraction' ?
> | Specifies the target length of checkpoints, as a fraction of
> | the checkpoint interval. The default is 0.3.


I chose checkpoint_smoothing because it tells you what the parameter is
for. If you want more smoothing, tune it up, and if you want less, tune
it down. checkpoint_fraction makes you wonder what you can do with it
and why you would change it.

> Sorry if I'm missing discussions abount the naming.


No, I chose _smoothing on my own. People didn't like
checkpoint_write_percent either (including).

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

---------------------------(end of broadcast)---------------------------
TIP 2: Don't 'kill -9' the postmaster

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #5 (permalink)  
Old 04-18-2008, 11:15 AM
Heikki Linnakangas
 
Posts: n/a
Default Re: Load Distributed Checkpoints, take 3

Tom Lane wrote:
> Heikki Linnakangas <heikki@enterprisedb.com> writes:
>> I added a spinlock to protect the signaling fields between bgwriter and
>> backends. The current non-locking approach gets really difficult as the
>> patch adds two new flags, and both are more important than the existing
>> ckpt_time_warn flag.

>
> That may be, but you could minimize the cost and notational ugliness by
> not using the spinlock where you don't have to. Put the sig_atomic_t
> fields back the way they were, and many of the uses of the spinlock go
> away. All you really need it for is the places where the additional
> flags are set or read.


I find it easier to understand if it's used whenever any of the fields
are accessed. You don't need to read and write ckpt_failed and
ckpt_started/ckpt_done in specific order anymore, for example.

> Some other comments:
>
> I tend to agree with whoever said upthread that the combination of GUC
> variables proposed here is confusing and ugly. It'd make more sense to
> have min and max checkpoint rates in KB/s, with the max checkpoint rate
> only honored when we are predicting we'll finish before the next
> checkpoint time.


Really? I thought everyone is happy with the current combination, and
that it was just the old trio of parameters controlling the write, nap
and sync phases that was ugly. One particularly nice thing about
defining the duration as a fraction of checkpoint interval is that we
can come up with a reasonable default value that doesn't depend on your
hardware.

How would a min and max rate work?

Anyone else have an opinion on the parameters?

> The flow of control and responsibility between xlog.c, bgwriter.c and
> bufmgr.c seems to have reached the breaking point of unintelligibility.
> Can you think of a refactoring that would simplify it? We might have
> to resign ourselves to this much messiness, but I'd rather not...


The problem we're trying to solve is doing a checkpoint while running
the normal bgwriter activity at the same time. The normal way to do two
things simultaneously is to have two different processes (or threads). I
thought about having a separate checkpoint process, but I gave up on
that thought because the communication needed between backends, bgwriter
and the checkpointer seems like a mess. The checkpointer would need the
pendingOpsTable so that it can do the fsyncs, and it would also need to
receive the forget-messages to that table. We could move that table
entirely to the checkpointer, but since bgwriter is presumably doing
most of the writes, there would be a lot of chatter between bgwriter and
the checkpointer.

The current approach is like co-operative multitasking. BufferSyncs
yields control to bgwriter every now and then.

The division of labor between xlog.c and other modules is not that bad,
IMO. CreateCheckPoint is the main entry point to create a checkpoint,
and it calls other modules to do their stuff, including bufmgr.c.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
choose an index scan if your joining column's datatypes do not
match

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #6 (permalink)  
Old 04-18-2008, 11:15 AM
Heikki Linnakangas
 
Posts: n/a
Default Re: Load Distributed Checkpoints, take 3

Tom Lane wrote:
> Heikki Linnakangas <heikki@enterprisedb.com> writes:
>> In fact, I think there's a small race condition in CVS HEAD:

>
> Yeah, probably --- the original no-locking design didn't have any side
> flags. The reason you need the lock is for a backend to be sure that
> a newly-started checkpoint is using its requested flags. But the
> detection of checkpoint termination is still the same.


Actually, the race condition I outlined isn't related to the flags. It's
possible because RequestCheckpoint doesn't guarantee that a checkpoint
is performed when there's been no WAL activity since last one.

I did use a new force-flag to fix it, but I'm sure there is other ways.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

---------------------------(end of broadcast)---------------------------
TIP 4: Have you searched our list archives?

http://archives.postgresql.org

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #7 (permalink)  
Old 04-18-2008, 11:15 AM
Tom Lane
 
Posts: n/a
Default Re: Load Distributed Checkpoints, take 3

Heikki Linnakangas <heikki@enterprisedb.com> writes:
> Tom Lane wrote:
>> I tend to agree with whoever said upthread that the combination of GUC
>> variables proposed here is confusing and ugly. It'd make more sense to
>> have min and max checkpoint rates in KB/s, with the max checkpoint rate
>> only honored when we are predicting we'll finish before the next
>> checkpoint time.


> Really? I thought everyone is happy with the current combination, and
> that it was just the old trio of parameters controlling the write, nap
> and sync phases that was ugly. One particularly nice thing about
> defining the duration as a fraction of checkpoint interval is that we
> can come up with a reasonable default value that doesn't depend on your
> hardware.


That argument would hold some water if you weren't introducing a
hardware-dependent min rate in the same patch. Do we need the min rate
at all? If so, why can't it be in the same units as the max (ie, a
fraction of checkpoint)?

> How would a min and max rate work?


Pretty much the same as the code does now, no? You either delay, or not.

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
choose an index scan if your joining column's datatypes do not
match

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #8 (permalink)  
Old 04-18-2008, 11:15 AM
Heikki Linnakangas
 
Posts: n/a
Default Re: Load Distributed Checkpoints, take 3

Tom Lane wrote:
> Heikki Linnakangas <heikki@enterprisedb.com> writes:
>> Tom Lane wrote:
>>> I tend to agree with whoever said upthread that the combination of GUC
>>> variables proposed here is confusing and ugly. It'd make more sense to
>>> have min and max checkpoint rates in KB/s, with the max checkpoint rate
>>> only honored when we are predicting we'll finish before the next
>>> checkpoint time.

>
>> Really? I thought everyone is happy with the current combination, and
>> that it was just the old trio of parameters controlling the write, nap
>> and sync phases that was ugly. One particularly nice thing about
>> defining the duration as a fraction of checkpoint interval is that we
>> can come up with a reasonable default value that doesn't depend on your
>> hardware.

>
> That argument would hold some water if you weren't introducing a
> hardware-dependent min rate in the same patch. Do we need the min rate
> at all? If so, why can't it be in the same units as the max (ie, a
> fraction of checkpoint)?
>
>> How would a min and max rate work?

>
> Pretty much the same as the code does now, no? You either delay, or not.


I don't think you understand how the settings work. Did you read the
documentation? If you did, it's apparently not adequate.

The main tuning knob is checkpoint_smoothing, which is defined as a
fraction of the checkpoint interval (both checkpoint_timeout and
checkpoint_segments are taken into account). Normally, the write phase
of a checkpoint takes exactly that much time.

So the length of a checkpoint stays the same regardless of how many
dirty buffers there is (assuming you don't exceed the bandwidth of your
hardware), but the I/O rate varies.

There's another possible strategy: keep the I/O rate constant, but vary
the length of the checkpoint. checkpoint_rate allows you to do that.

I'm envisioning we set the defaults so that checkpoint_smoothing is the
effective control in a relatively busy system, and checkpoint_rate
ensures that we don't unnecessarily prolong checkpoints on an idle system.

Now how would you replace checkpoint_smoothing with a max I/O rate? The
only real alternative way of defining it is directly as a time and/or
segments, similar to checkpoint_timeout and checkpoint_segments, but
then we'd really need two knobs instead of one.

Though maybe we could just hard-code it to 0.8, for example, and tell
people to tune checkpoint_rate if they want more aggressive checkpoints.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

---------------------------(end of broadcast)---------------------------
TIP 4: Have you searched our list archives?

http://archives.postgresql.org

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #9 (permalink)  
Old 04-18-2008, 11:15 AM
Tom Lane
 
Posts: n/a
Default Re: Load Distributed Checkpoints, take 3

Heikki Linnakangas <heikki@enterprisedb.com> writes:
> I don't think you understand how the settings work. Did you read the
> documentation? If you did, it's apparently not adequate.


I did read the documentation, and I'm not complaining that I don't
understand it. I'm complaining that I don't like the presented API
because it's self-inconsistent. You've got two parameters that are in
effect upper and lower bounds for the checkpoint write rate, but they
are named inconsistently and not even measured in the same kind of unit.
Nor do I agree that the inconsistency buys any ease of use.

> The main tuning knob is checkpoint_smoothing, which is defined as a
> fraction of the checkpoint interval (both checkpoint_timeout and
> checkpoint_segments are taken into account). Normally, the write phase
> of a checkpoint takes exactly that much time.


So the question is, why in the heck would anyone want the behavior that
"checkpoints take exactly X time"?? The useful part of this whole patch
is to cap the write rate at something that doesn't interfere too much
with foreground queries. I don't see why people wouldn't prefer
"checkpoints can take any amount of time up to the checkpoint interval,
but we do our best not to exceed Y writes/second".

Basically I don't see what useful values checkpoint_smoothing would have
other than 0 and 1. You might as well make it a bool.

> There's another possible strategy: keep the I/O rate constant, but vary
> the length of the checkpoint. checkpoint_rate allows you to do that.


But only from the lower side.

> Now how would you replace checkpoint_smoothing with a max I/O rate?


I don't see why you think that's hard. It looks to me like the
components of the decision are the same numbers in any case: you have to
estimate your progress towards checkpoint completion, your available
time till next checkpoint, and your write rate. Then you either delay
or not.

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 4: Have you searched our list archives?

http://archives.postgresql.org

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #10 (permalink)  
Old 04-18-2008, 11:15 AM
Heikki Linnakangas
 
Posts: n/a
Default Re: Load Distributed Checkpoints, take 3

Tom Lane wrote:
> Heikki Linnakangas <heikki@enterprisedb.com> writes:
>> The main tuning knob is checkpoint_smoothing, which is defined as a
>> fraction of the checkpoint interval (both checkpoint_timeout and
>> checkpoint_segments are taken into account). Normally, the write phase
>> of a checkpoint takes exactly that much time

>
> So the question is, why in the heck would anyone want the behavior that
> "checkpoints take exactly X time"?? The useful part of this whole patch
> is to cap the write rate at something that doesn't interfere too much
> with foreground queries. I don't see why people wouldn't prefer
> "checkpoints can take any amount of time up to the checkpoint interval,
> but we do our best not to exceed Y writes/second".


Because it's easier to tune. You don't need to know how much checkpoint
I/O you can tolerate. The system will use just enough I/O bandwidth to
meet the deadline, but not more than that.

> Basically I don't see what useful values checkpoint_smoothing would have
> other than 0 and 1. You might as well make it a bool.


Well that's one option. It feels like a good thing to be able to control
how much headroom you have until the next checkpoint, but maybe we
can just hardcode it close to 1. It's also good to avoid spreading the
checkpoints unnecessarily, to keep recovery times lower, but you can
control that with the min rate setting as well.

>> There's another possible strategy: keep the I/O rate constant, but vary
>> the length of the checkpoint. checkpoint_rate allows you to do that.

>
> But only from the lower side.
>
>> Now how would you replace checkpoint_smoothing with a max I/O rate?

>
> I don't see why you think that's hard. It looks to me like the
> components of the decision are the same numbers in any case: you have to
> estimate your progress towards checkpoint completion, your available
> time till next checkpoint, and your write rate. Then you either delay
> or not.


Let me put it this way: If you define a min and a max I/O rate, when
would the max I/O rate limit take effect? If there's few dirty buffers
in the pool, so that you'll finish the checkpoint in time before the
next one is due writing at the min rate, that's what you'll use. If
there's more, you'll need to write fast enough that you'll finish the
checkpoint in time, regardless of the max rate. Or would you let the
next checkpoint slip and keep writing at the max rate? That seems like a
footgun if someone sets it to a too low value.

Or are you thinking that we have just one setting: checkpoint_rate? You
describe it as a maximum, but I've been thinking of it as a minimum
because you *will* exceed it if the next checkpoint is due soon.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

---------------------------(end of broadcast)---------------------------
TIP 3: Have you checked our extensive FAQ?

http://www.postgresql.org/docs/faq

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
Reply


Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are On
Forum Jump


All times are GMT. The time now is 04:51 PM.


Powered by vBulletin® Version 3.6.5
Copyright ©2000 - 2008, Jelsoft Enterprises Ltd.
SEO by vBSEO 3.2.0
www.UnixAdminTalk.com