Unix Technical Forum

SEO

vBulletin Search Engine Optimization


Go Back   Unix Technical Forum > Database Server Software > PostgreSQL > Pgsql Patches

Register FAQ Members List Calendar Search Today's Posts Mark Forums Read
  #1 (permalink)  
Old 05-07-2008, 10:18 AM
Simon Riggs
 
Posts: n/a
Default Verified fix for Bug 4137


The following patch has been independently verified as fixing bug 4137.

The problem was that at the very start of archive recovery the %r
parameter in restore_command could be set to a filename later than the
currently requested filename (%f). This could lead to early truncation
of the archived WAL files and would cause warm standby replication to
fail soon afterwards, in certain specific circumstances.

Fix applied to both core server in generating correct %r filenames and
also to pg_standby to prevent acceptance of out-of-sequence filenames.

Request review and commit. No port specific details.

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com


--
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-patches

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #2 (permalink)  
Old 05-07-2008, 10:18 AM
Heikki Linnakangas
 
Posts: n/a
Default Re: Verified fix for Bug 4137

Simon Riggs wrote:
> The problem was that at the very start of archive recovery the %r
> parameter in restore_command could be set to a filename later than the
> currently requested filename (%f). This could lead to early truncation
> of the archived WAL files and would cause warm standby replication to
> fail soon afterwards, in certain specific circumstances.
>
> Fix applied to both core server in generating correct %r filenames and
> also to pg_standby to prevent acceptance of out-of-sequence filenames.


So the core problem is that we use ControlFile->checkPointCopy.redo in
RestoreArchivedFile to determine the safe truncation point, but when
there's a backup label file, that's still coming from pg_control file,
which is wrong.

The patch fixes that by determining the safe truncation point as
Min(checkPointCopy.redo, xlogfname), where xlogfname is the xlog file
being restored. That depends on the assumption that everything before
the first file we (try to) restore is safe to truncate. IOW, we never
try to restore file B first, and then A, where A < B.

I'm not totally convinced that's a safe assumption. As an example,
consider doing an archive recovery, but without a backup label, and the
latest checkpoint record is broken. We would try to read the latest
(broken) checkpoint record first, and call RestoreArchivedFile to get
the xlog file containing that. But because that record is broken, we
fall back to using the previous checkpoint, and will need the xlog file
where the previous checkpoint record is in.

That's a pretty contrived example, but the point is that assumption
feels fragile. At the very least it should be noted explicitly in the
comments. A less fragile approach would be to use something dummy, like
"000000000000000000000000" as the truncation point until we've
successfully read the checkpoint/restartpoint record and started the replay.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

--
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-patches

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #3 (permalink)  
Old 05-07-2008, 10:18 AM
Simon Riggs
 
Posts: n/a
Default Re: Verified fix for Bug 4137

On Tue, 2008-05-06 at 12:02 +0100, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > The problem was that at the very start of archive recovery the %r
> > parameter in restore_command could be set to a filename later than the
> > currently requested filename (%f). This could lead to early truncation
> > of the archived WAL files and would cause warm standby replication to
> > fail soon afterwards, in certain specific circumstances.
> >
> > Fix applied to both core server in generating correct %r filenames and
> > also to pg_standby to prevent acceptance of out-of-sequence filenames.

>
> So the core problem is that we use ControlFile->checkPointCopy.redo in
> RestoreArchivedFile to determine the safe truncation point, but when
> there's a backup label file, that's still coming from pg_control file,
> which is wrong.
>
> The patch fixes that by determining the safe truncation point as
> Min(checkPointCopy.redo, xlogfname), where xlogfname is the xlog file
> being restored. That depends on the assumption that everything before
> the first file we (try to) restore is safe to truncate. IOW, we never
> try to restore file B first, and then A, where A < B.
>
> I'm not totally convinced that's a safe assumption. As an example,
> consider doing an archive recovery, but without a backup label, and the
> latest checkpoint record is broken. We would try to read the latest
> (broken) checkpoint record first, and call RestoreArchivedFile to get
> the xlog file containing that. But because that record is broken, we
> fall back to using the previous checkpoint, and will need the xlog file
> where the previous checkpoint record is in.
>
> That's a pretty contrived example, but the point is that assumption
> feels fragile. At the very least it should be noted explicitly in the
> comments. A less fragile approach would be to use something dummy, like
> "000000000000000000000000" as the truncation point until we've
> successfully read the checkpoint/restartpoint record and started the replay.


Your comments are interesting and I'll think through that some more to
see if I can see if that leads to bugs. Will talk more later.

The only valid starting place, in all cases, is the checkpoint written
by pg_start_backup(). The user doesn't need to have been archiving files
prior to that so could legitimately have had a null archive_command.

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com


--
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-patches

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #4 (permalink)  
Old 05-07-2008, 10:18 AM
Simon Riggs
 
Posts: n/a
Default Re: Verified fix for Bug 4137

On Tue, 2008-05-06 at 12:02 +0100, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > The problem was that at the very start of archive recovery the %r
> > parameter in restore_command could be set to a filename later than the
> > currently requested filename (%f). This could lead to early truncation
> > of the archived WAL files and would cause warm standby replication to
> > fail soon afterwards, in certain specific circumstances.
> >
> > Fix applied to both core server in generating correct %r filenames and
> > also to pg_standby to prevent acceptance of out-of-sequence filenames.

>
> So the core problem is that we use ControlFile->checkPointCopy.redo in
> RestoreArchivedFile to determine the safe truncation point, but when
> there's a backup label file, that's still coming from pg_control file,
> which is wrong.
>
> The patch fixes that by determining the safe truncation point as
> Min(checkPointCopy.redo, xlogfname), where xlogfname is the xlog file
> being restored. That depends on the assumption that everything before
> the first file we (try to) restore is safe to truncate. IOW, we never
> try to restore file B first, and then A, where A < B.
>
> I'm not totally convinced that's a safe assumption. As an example,
> consider doing an archive recovery, but without a backup label, and the
> latest checkpoint record is broken. We would try to read the latest
> (broken) checkpoint record first, and call RestoreArchivedFile to get
> the xlog file containing that. But because that record is broken, we
> fall back to using the previous checkpoint, and will need the xlog file
> where the previous checkpoint record is in.


> That's a pretty contrived example, but the point is that assumption
> feels fragile. At the very least it should be noted explicitly in the
> comments. A less fragile approach would be to use something dummy, like
> "000000000000000000000000" as the truncation point until we've
> successfully read the checkpoint/restartpoint record and started the replay.


Thanks for thinking about this.

Falling back to the secondary checkpoint implies we have a corrupted or
absent WAL file, so making recovery startup work correctly won't avoid
the need to re-run the base backup. We'll end with an unrecoverable
error in either case, so it doesn't seem worth attempting to improve
this in the way you suggest.

I think we should completely prevent access to secondary checkpoints
during archive recovery, because if the primary checkpoint isn't present
or is corrupt we aren't ever going to get passed it to get to the
pg_stop_backup() point. Hence an archive recovery can never be valid in
that case. I'll do a separate patch for that because they are unrelated
issues.

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com


--
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-patches

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #5 (permalink)  
Old 05-07-2008, 10:18 AM
Heikki Linnakangas
 
Posts: n/a
Default Re: Verified fix for Bug 4137

Simon Riggs wrote:
> Falling back to the secondary checkpoint implies we have a corrupted or
> absent WAL file, so making recovery startup work correctly won't avoid
> the need to re-run the base backup. We'll end with an unrecoverable
> error in either case, so it doesn't seem worth attempting to improve
> this in the way you suggest.


That's true whenever you have to fall back to a secondary checkpoint,
but we still try to get the database up. One could argue that we
shouldn't, of course.

Anyway, the point is that the patch relies on a non-obvious assumption.
Even if the secondary checkpoint issue is a non-issue, it's not obvious
(to me at least) that there isn't other similar scenarios. And someone
might inadvertently break the assumption in a future patch, because it's
not an obvious one; calling ReadRecord looks very innocent. We shouldn't
introduce an assumption like that when we don't have to.

> I think we should completely prevent access to secondary checkpoints
> during archive recovery, because if the primary checkpoint isn't present
> or is corrupt we aren't ever going to get passed it to get to the
> pg_stop_backup() point. Hence an archive recovery can never be valid in
> that case. I'll do a separate patch for that because they are unrelated
> issues.


Well, we already don't use the secondary checkpoint if a backup label
file is present. And you can take a base backup without
pg_start_backup()/pg_stop_backup() if you shut down the system first (a
cold backup).

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

--
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-patches

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #6 (permalink)  
Old 05-07-2008, 10:18 AM
Simon Riggs
 
Posts: n/a
Default Re: Verified fix for Bug 4137

On Tue, 2008-05-06 at 15:00 +0100, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > Falling back to the secondary checkpoint implies we have a corrupted or
> > absent WAL file, so making recovery startup work correctly won't avoid
> > the need to re-run the base backup. We'll end with an unrecoverable
> > error in either case, so it doesn't seem worth attempting to improve
> > this in the way you suggest.

>
> That's true whenever you have to fall back to a secondary checkpoint,
> but we still try to get the database up. One could argue that we
> shouldn't, of course.


I wouldn't. If we're doing a crash recovery we need that.

> Anyway, the point is that the patch relies on a non-obvious assumption.
> Even if the secondary checkpoint issue is a non-issue, it's not obvious
> (to me at least) that there isn't other similar scenarios. And someone
> might inadvertently break the assumption in a future patch, because it's
> not an obvious one; calling ReadRecord looks very innocent. We shouldn't
> introduce an assumption like that when we don't have to.


We were already assuming archive files were "OK to delete, if before".
The whole of recovery already relies heavily on the alphabetic sorting
property of WAL and associated filenames. Those filenames have been
specifically documented as maintaining that sorted order for that
reason. If somebody wanted to recover files in non-sorted order, then
yes I would expect a few things to break - this aspect wouldn't be the
most critical thing though.

The patch adds only what we already knew: you should never expect to
delete a file *after* one that is being requested. So I see no reason to
remove or change anything from the patch as it stands.

I'd be happy to add comments to say

* We use the alphanumeric sorting property of the filenames to decide
* which ones are earlier

as we have done elsewhere in xlog.c

I do want everything to work in a clear way, but I honestly can't see
any reason to add more code, especially in the absence of any problem.

> > I think we should completely prevent access to secondary checkpoints
> > during archive recovery, because if the primary checkpoint isn't present
> > or is corrupt we aren't ever going to get passed it to get to the
> > pg_stop_backup() point. Hence an archive recovery can never be valid in
> > that case. I'll do a separate patch for that because they are unrelated
> > issues.

>
> Well, we already don't use the secondary checkpoint if a backup label
> file is present.


Yep, looking at the wrong section of code. I wondered why I hadn't
plugged that gap previously.

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com


--
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-patches

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #7 (permalink)  
Old 05-07-2008, 10:18 AM
Heikki Linnakangas
 
Posts: n/a
Default Re: Verified fix for Bug 4137

Simon Riggs wrote:
> We were already assuming archive files were "OK to delete, if before".
> The whole of recovery already relies heavily on the alphabetic sorting
> property of WAL and associated filenames. Those filenames have been
> specifically documented as maintaining that sorted order for that
> reason. If somebody wanted to recover files in non-sorted order, then
> yes I would expect a few things to break - this aspect wouldn't be the
> most critical thing though.


I didn't suggest that alphabetical sorting property is a new assumption;
it sure isn't. The new assumption is that you never call ReadRecord()
for record 0002 before you call it for record 0001 (before initializing
the checkPointCopy field from the checkpoint record, anyway).

I can imagine a future patch to do xlog file prefetching, for example,
that breaks that assumption. Or falling back to the previous checkpoint
as discussed. Or maybe you screwed up your recovery.conf, and try to use
WAL files that belong to a different installation. The database won't
start up, of course, because the checkpoint record isn't in the right
place, but the damage has already been done because the recovery command
deleted some files it shouldn't have.

Granted, none of those are particularly likely, but we should be extra
careful when deleting files.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

--
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-patches

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #8 (permalink)  
Old 05-07-2008, 10:18 AM
Simon Riggs
 
Posts: n/a
Default Re: Verified fix for Bug 4137

On Tue, 2008-05-06 at 17:52 +0100, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > We were already assuming archive files were "OK to delete, if before".
> > The whole of recovery already relies heavily on the alphabetic sorting
> > property of WAL and associated filenames. Those filenames have been
> > specifically documented as maintaining that sorted order for that
> > reason. If somebody wanted to recover files in non-sorted order, then
> > yes I would expect a few things to break - this aspect wouldn't be the
> > most critical thing though.

>
> I didn't suggest that alphabetical sorting property is a new assumption;
> it sure isn't. The new assumption is that you never call ReadRecord()
> for record 0002 before you call it for record 0001 (before initializing
> the checkPointCopy field from the checkpoint record, anyway).


I've not done anything with the ordering of calls to ReadRecord(). There
is no such assumption introduced here.

The patch prevents an erroneous call to delete files. The earlier code
just assumed such a call would never occur, which was wrong, hence the
patch. There are no new assumptions introduced into the code by this.

I very much respect your opinion in all things, most especially on
matters of code, more so now you are a committer. I would be willing to
make changes to any patch on your say-so, though I can't see how any of
your comments improve this patch in this specific case only. I don't
doubt that you will find flaws in many of my future patches.

> I can imagine a future patch to do xlog file prefetching, for example,
> that breaks that assumption.


Maybe, but I think its for such a patch to consider in full the
consequences of doing that, not for me to do so in advance. The current
assumption is filename ordered recovery, there is definitely more than
one part of the code that relies significantly on that point.

I think any prefetcher would be advised to stay within one file at a
time. Anything else will effect other behaviours in ways that would need
careful planning to avoid unintended consequences.

> Or falling back to the previous checkpoint
> as discussed. Or maybe you screwed up your recovery.conf, and try to use
> WAL files that belong to a different installation. The database won't
> start up, of course, because the checkpoint record isn't in the right
> place, but the damage has already been done because the recovery command
> deleted some files it shouldn't have.
>
> Granted, none of those are particularly likely, but we should be extra
> careful when deleting files.


With respect, that is exactly what the patch does, though it is
certainly better for being tested by your comments.

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com


--
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-patches

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #9 (permalink)  
Old 05-07-2008, 10:18 AM
Heikki Linnakangas
 
Posts: n/a
Default Re: Verified fix for Bug 4137

Simon Riggs wrote:
> On Tue, 2008-05-06 at 17:52 +0100, Heikki Linnakangas wrote:
>> I didn't suggest that alphabetical sorting property is a new assumption;
>> it sure isn't. The new assumption is that you never call ReadRecord()
>> for record 0002 before you call it for record 0001 (before initializing
>> the checkPointCopy field from the checkpoint record, anyway).

>
> I've not done anything with the ordering of calls to ReadRecord(). There
> is no such assumption introduced here.


Hmm, I see that I was inaccurate when I said the patch introduces that
assumption, sorry about the confusion. It's more like the code is
currently completely oblivious of the issue, and the patch makes it
better but doesn't fully fix it.

The code in CVS is broken, as we now know, because it assumes that we
can delete all files < checkPointCopy.redo, which is not true when
checkPointCopy.redo has not yet been initialized from the backup label
file, and points to a location that's ahead of the real safe truncation
point. The patch makes it better, and the assumption is now that it's
safe to truncate everything < min(checkPointCopy.redo, xlog file we're
reading). That still seems too fragile to me, because we assume that
after you've asked for xlog record X, we never need anything earlier
then that.

In fact, what will happen if the checkpoint record's redo pointer points
to an earlier xlog file:

1. The location of the checkpoint record is read by read_backup_label().
Let's say that it's 0005.
2. ReadCheckpointRecord() is called for 0005. The restore command is
called because that xlog file is not present. The safe truncation point
is determined to be 0005, as that's what we're reading. Everything
before that is truncated
3. The redo pointer in the checkpoint record points to 0003. That's
where we should start the recovery. Oops :-(

I haven't tested this, so, am I missing something that makes that not
happen?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

--
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-patches

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #10 (permalink)  
Old 05-07-2008, 10:18 AM
Simon Riggs
 
Posts: n/a
Default Re: Verified fix for Bug 4137

On Tue, 2008-05-06 at 21:51 +0100, Heikki Linnakangas wrote:

> In fact, what will happen if the checkpoint record's redo pointer points
> to an earlier xlog file:
>
> 1. The location of the checkpoint record is read by read_backup_label().
> Let's say that it's 0005.
> 2. ReadCheckpointRecord() is called for 0005. The restore command is
> called because that xlog file is not present. The safe truncation point
> is determined to be 0005, as that's what we're reading. Everything
> before that is truncated
> 3. The redo pointer in the checkpoint record points to 0003. That's
> where we should start the recovery. Oops :-(


Yes, this case could be a problem, if the records are in different
files. It's the files that matter, not the records themselves though.

I've extended the patch without introducing another new status variable,
which was my original concern with what you suggested previously.

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com


--
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-patches

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
Reply


Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are On



All times are GMT. The time now is 02:02 PM.


Powered by vBulletin® Version 3.6.5
Copyright ©2000 - 2008, Jelsoft Enterprises Ltd.
Search Engine Optimization by vBSEO 3.1.0

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 64