Unix Technical Forum

Re: Checkpoint cost, looks like it is WAL/CRC

This is a discussion on Re: Checkpoint cost, looks like it is WAL/CRC within the pgsql Hackers forums, part of the PostgreSQL category; --> > Here's an idea: > > We read the page that we would have backed up, calc the CRC ...


Go Back   Unix Technical Forum > Database Server Software > PostgreSQL > pgsql Hackers

FAQ Members List Calendar Search Today's Posts Mark Forums Read
  #1 (permalink)  
Old 04-11-2008, 05:46 AM
Zeugswetter Andreas DAZ SD
 
Posts: n/a
Default Re: Checkpoint cost, looks like it is WAL/CRC


> Here's an idea:
>
> We read the page that we would have backed up, calc the CRC and
> write a short WAL record with just the CRC, not the block. When
> we recover we re-read the database page, calc its CRC and
> compare it with the CRC from the transaction log. If they
> differ, we know that the page was torn and we know the database
> needs recovery. (So we calc the CRC when we log AND when we recover).


Won't work, since the page on disk may have x different contents
between 2 checkpoints (bgwriter from lru).

Only workable solution would imho be to write the LSN to each 512
byte block (not that I am propagating that idea).

Andreas

---------------------------(end of broadcast)---------------------------
TIP 7: don't forget to increase your free space map settings

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #2 (permalink)  
Old 04-11-2008, 05:46 AM
Tom Lane
 
Posts: n/a
Default Re: Checkpoint cost, looks like it is WAL/CRC

"Zeugswetter Andreas DAZ SD" <ZeugswetterA@spardat.at> writes:
> Only workable solution would imho be to write the LSN to each 512
> byte block (not that I am propagating that idea).


We're not doing anything like that, as it would create an impossible
space-management problem (or are you happy with limiting tuples to
500 bytes?). What we *could* do is calculate a page-level CRC and
store it in the page header just before writing out. Torn pages
would then manifest as a wrong CRC on read. No correction ability,
but at least a reliable detection ability.

However, this will do nothing to solve the performance problem if
the core of that problem is the cost of computing page-sized CRCs :-(

We still don't know enough about the situation to know what a solution
might look like. Is the slowdown Josh is seeing due to the extra CPU
cost of the CRCs, or the extra I/O cost, or excessive locking of the
WAL-related data structures while we do this stuff, or ???. Need more
data.

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 4: Don't 'kill -9' the postmaster

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #3 (permalink)  
Old 04-11-2008, 05:46 AM
Tom Lane
 
Posts: n/a
Default Re: Checkpoint cost, looks like it is WAL/CRC

I wrote:
> We still don't know enough about the situation to know what a solution
> might look like. Is the slowdown Josh is seeing due to the extra CPU
> cost of the CRCs, or the extra I/O cost, or excessive locking of the
> WAL-related data structures while we do this stuff, or ???. Need more
> data.


Josh, is OSDL up enough that you can try another comparison run?
If so, please undo the previous patch (which disabled page dumping
entirely) and instead try removing this block of code, starting
at about xlog.c line 620 in CVS tip:

/*
* Now add the backup block headers and data into the CRC
*/
for (i = 0; i < XLR_MAX_BKP_BLOCKS; i++)
{
if (dtbuf_bkp[i])
{
BkpBlock *bkpb = &(dtbuf_xlg[i]);
char *page;

COMP_CRC32(rdata_crc,
(char *) bkpb,
sizeof(BkpBlock));
page = (char *) BufferGetBlock(dtbuf[i]);
if (bkpb->hole_length == 0)
{
COMP_CRC32(rdata_crc,
page,
BLCKSZ);
}
else
{
/* must skip the hole */
COMP_CRC32(rdata_crc,
page,
bkpb->hole_offset);
COMP_CRC32(rdata_crc,
page + (bkpb->hole_offset + bkpb->hole_length),
BLCKSZ - (bkpb->hole_offset + bkpb->hole_length));
}
}
}

This will remove just the CRC calculation work associated with backed-up
pages. Note that any attempt to recover from the WAL will fail, but I
assume you don't need that for the purposes of the test run.

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 8: explain analyze is your friend

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #4 (permalink)  
Old 04-11-2008, 05:46 AM
Greg Stark
 
Posts: n/a
Default Re: Checkpoint cost, looks like it is WAL/CRC

Tom Lane <tgl@sss.pgh.pa.us> writes:

> "Zeugswetter Andreas DAZ SD" <ZeugswetterA@spardat.at> writes:
> > Only workable solution would imho be to write the LSN to each 512
> > byte block (not that I am propagating that idea).

>
> We're not doing anything like that, as it would create an impossible
> space-management problem (or are you happy with limiting tuples to
> 500 bytes?). What we *could* do is calculate a page-level CRC and
> store it in the page header just before writing out. Torn pages
> would then manifest as a wrong CRC on read. No correction ability,
> but at least a reliable detection ability.


At the same time as you do the CRC you can copy the bytes to a fresh page
skipping the LSNs. Likewise, when writing out the page you have to calculate
the CRC; at the same time as you calculate the CRC you write out the bytes to
a temporary buffer adding LSNs and write that to disk.

This would be "zero-copy" if you're already scanning the bytes to calculate
the CRC since you can add and remove LSNs at the same time. It does require an
extra buffer to store the page in before writing and that entails some amount
of cache thrashing. But maybe you could reuse the same buffer over and over
again for every read/write.

--
greg


---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
choose an index scan if your joining column's datatypes do not
match

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #5 (permalink)  
Old 04-11-2008, 05:46 AM
Tom Lane
 
Posts: n/a
Default Re: Checkpoint cost, looks like it is WAL/CRC

Greg Stark <gsstark@mit.edu> writes:
> Tom Lane <tgl@sss.pgh.pa.us> writes:
>> What we *could* do is calculate a page-level CRC and
>> store it in the page header just before writing out. Torn pages
>> would then manifest as a wrong CRC on read. No correction ability,
>> but at least a reliable detection ability.


> At the same time as you do the CRC you can copy the bytes to a fresh page
> skipping the LSNs. Likewise, when writing out the page you have to calculate
> the CRC; at the same time as you calculate the CRC you write out the bytes to
> a temporary buffer adding LSNs and write that to disk.


Huh? You seem to be proposing doing *both* things, which sounds entirely
pointless.

BTW, I was envisioning the page CRCs as something we'd only check during
crash recovery, not normal-operation reads.

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 7: don't forget to increase your free space map settings

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #6 (permalink)  
Old 04-11-2008, 05:46 AM
Josh Berkus
 
Posts: n/a
Default Re: Checkpoint cost, looks like it is WAL/CRC

Tom,

> Josh, is OSDL up enough that you can try another comparison run?


Thankfully, yes.

> If so, please undo the previous patch (which disabled page dumping
> entirely) and instead try removing this block of code, starting
> at about xlog.c line 620 in CVS tip:


Will do. Results in a few days.

--
Josh Berkus
Aglio Database Solutions
San Francisco

---------------------------(end of broadcast)---------------------------
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #7 (permalink)  
Old 04-11-2008, 05:46 AM
Kenneth Marshall
 
Posts: n/a
Default Re: Checkpoint cost, looks like it is WAL/CRC

On Thu, Jul 07, 2005 at 11:36:40AM -0400, Tom Lane wrote:
> Greg Stark <gsstark@mit.edu> writes:
> > Tom Lane <tgl@sss.pgh.pa.us> writes:
> >> What we *could* do is calculate a page-level CRC and
> >> store it in the page header just before writing out. Torn pages
> >> would then manifest as a wrong CRC on read. No correction ability,
> >> but at least a reliable detection ability.

>
> > At the same time as you do the CRC you can copy the bytes to a fresh page
> > skipping the LSNs. Likewise, when writing out the page you have to calculate
> > the CRC; at the same time as you calculate the CRC you write out the bytes to
> > a temporary buffer adding LSNs and write that to disk.

>
> Huh? You seem to be proposing doing *both* things, which sounds entirely
> pointless.
>
> BTW, I was envisioning the page CRCs as something we'd only check during
> crash recovery, not normal-operation reads.
>
> regards, tom lane
>

Does the DB page on disk have a version number? If so, maybe we could
update the WAL with the CRC+version anytime the page is update. You may
need to check the log for multiple CRC+version entries to determine the
torn-page status.

Ken

---------------------------(end of broadcast)---------------------------
TIP 7: don't forget to increase your free space map settings

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #8 (permalink)  
Old 04-11-2008, 05:47 AM
Tom Lane
 
Posts: n/a
Default Re: Checkpoint cost, looks like it is WAL/CRC

Josh Berkus <josh@agliodbs.com> writes:
>> If so, please undo the previous patch (which disabled page dumping
>> entirely) and instead try removing this block of code, starting
>> at about xlog.c line 620 in CVS tip:


> Will do. Results in a few days.


Great. BTW, don't bother testing snapshots between 2005/07/05 2300 EDT
and just now --- Bruce's full_page_writes patch introduced a large
random negative component into the timing ...

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 1: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #9 (permalink)  
Old 04-11-2008, 05:47 AM
Heikki Linnakangas
 
Posts: n/a
Default Re: Checkpoint cost, looks like it is WAL/CRC

On Thu, 7 Jul 2005, Tom Lane wrote:

> We still don't know enough about the situation to know what a solution
> might look like. Is the slowdown Josh is seeing due to the extra CPU
> cost of the CRCs, or the extra I/O cost, or excessive locking of the
> WAL-related data structures while we do this stuff, or ???. Need more
> data.


I wonder if a different BLCKSZ would make a difference either way. Say,
1024 bytes instead of the default 8192.

- Heikki

---------------------------(end of broadcast)---------------------------
TIP 5: don't forget to increase your free space map settings

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #10 (permalink)  
Old 04-11-2008, 05:48 AM
Josh Berkus
 
Posts: n/a
Default Re: Checkpoint cost, looks like it is WAL/CRC

Tom,

> Great. BTW, don't bother testing snapshots between 2005/07/05 2300 EDT
> and just now --- Bruce's full_page_writes patch introduced a large
> random negative component into the timing ...


Ach. Starting over, then.

--Josh

--
Josh Berkus
Aglio Database Solutions
San Francisco

---------------------------(end of broadcast)---------------------------
TIP 2: Don't 'kill -9' the postmaster

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
Reply


Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are On
Forum Jump


All times are GMT. The time now is 02:15 AM.


Powered by vBulletin® Version 3.6.5
Copyright ©2000 - 2008, Jelsoft Enterprises Ltd.
SEO by vBSEO 3.2.0
www.UnixAdminTalk.com