Re: [HACKERS] Full page writes improvement, code update (Pgsql Patches forum, PostgreSQL category)
On Tue, 2007-03-27 at 11:52 +0900, Koichi Suzuki wrote:
> Here's an update of a code to improve full page writes as proposed in
>
> http://archives.postgresql.org/pgsql...1/msg01491.php
> and
> http://archives.postgresql.org/pgsql...1/msg00607.php
>
> Update includes some modification for error handling in archiver and
> restoration command.
>
> In the previous threads, I posted several evaluation and shown that we
> can keep all the full page writes needed for full XLOG crash recovery,
> while compressing archive log size considerably better than gzip, with
> less CPU consumption. I've found no further objection for this proposal
> but still would like to hear comments/opinions/advices.

Koichi-san,

Looks interesting. I like the small amount of code to do this.

A few thoughts:

- Not sure why we need "full_page_compress", why not just mark them
always? That harms no one. (Did someone else ask for that? If so, keep
it)

- OTOH I'd like to see an explicit parameter set during recovery since
you're asking the main recovery path to act differently in case a single
bit is set/unset. If you are using that form of recovery, we should say
so explicitly, to keep everybody else safe.

- I'd rather mark just the nonremovable blocks. But no real difference

- We definitely don't want a normal elog in an XLogInsert critical
section, especially at DEBUG1 level

- diff -c format is easier and the standard

- pg_logarchive and pg_logrestore should be called by a name that
reflects what they actually do. Possibly pg_compresslog and
pg_decompresslog etc.. I've not reviewed those programs, but:

- Not sure why we have to compress away page headers. Touch as little as
you can has always been my thinking with recovery code.

- I'm very uncomfortable with touching the LSN. Maybe I misunderstand?

- Have you thought about how pg_standby would integrate with this
option? Can you please?

- I'll do some docs for this after Freeze, if you'd like. I have some
other changes to make there, so I can do this at the same time.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com

---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
choose an index scan if your joining column's datatypes do not match
| |||
Simon;

Thanks a lot for your comments and advice. I'd like to give some feedback.

Simon Riggs wrote:
> On Tue, 2007-03-27 at 11:52 +0900, Koichi Suzuki wrote:
>
>> Here's an update of a code to improve full page writes as proposed in
>>
>> http://archives.postgresql.org/pgsql...1/msg01491.php
>> and
>> http://archives.postgresql.org/pgsql...1/msg00607.php
>>
>> Update includes some modification for error handling in archiver and
>> restoration command.
>>
>> In the previous threads, I posted several evaluation and shown that we
>> can keep all the full page writes needed for full XLOG crash recovery,
>> while compressing archive log size considerably better than gzip, with
>> less CPU consumption. I've found no further objection for this proposal
>> but still would like to hear comments/opinions/advices.
>
> Koichi-san,
>
> Looks interesting. I like the small amount of code to do this.
>
> A few thoughts:
>
> - Not sure why we need "full_page_compress", why not just mark them
> always? That harms no one. (Did someone else ask for that? If so, keep
> it)

No, no one asked to have a separate option. There'll be no bad
influence to do so.

As written below, full page writes can be categorized as follows:

1) Needed for crash recovery: first page update after each checkpoint.
This has to be kept in WAL.

2) Needed for archive recovery: page update between pg_start_backup and
pg_stop_backup. This has to be kept in archive log.

3) For log-shipping slave such as pg_standby: no full page writes will
be needed for this purpose.

My proposal deals with 2). So, if we mark each "full_page_write", I'd
rather mark when this is needed. We still need only one bit because
case 3) does not need any mark.

> - OTOH I'd like to see an explicit parameter set during recovery since
> you're asking the main recovery path to act differently in case a single
> bit is set/unset. If you are using that form of recovery, we should say
> so explicitly, to keep everybody else safe.

The only thing I had to do is to create a "dummy" full page write to
maintain LSNs. Full page writes are omitted in the archive log. We have
to keep LSNs the same as those in the original WAL. In this case,
recovery has to read the logical log, not the "dummy" full page writes.
On the other hand, if both the logical log and "real" full page writes
are found in a log record, the recovery has to use the "real" full page
writes.

> - I'd rather mark just the nonremovable blocks. But no real difference

It sounds nicer. According to the full page write categories above, we
can mark full page writes as needed in "crash recovery" or "archive
recovery". Please give some feedback on the above full page write
categories.

> - We definitely don't want an normal elog in a XLogInsert critical
> section, especially at DEBUG1 level

I agree. I'll remove elog calls from critical sections.

> - diff -c format is easier and the standard

I'll change the diff option.

> - pg_logarchive and pg_logrestore should be called by a name that
> reflects what they actually do. Possibly pg_compresslog and
> pg_decompresslog etc.. I've not reviewed those programs, but:

I wasn't careful to choose command names that reflect what they do.
I'll change the command names.

> - Not sure why we have to compress away page headers. Touch as little as
> you can has always been my thinking with recovery code.

Eliminating page headers improves the compression rate slightly, by a
couple of percent. To make the code simpler, I'll drop this compression.

> - I'm very uncomfortable with touching the LSN. Maybe I misunderstand?

The reason why we need pg_logrestore (or pg_decompresslog) is to keep
LSNs the same as those in the original WAL. For this purpose, we need
the "dummy" full page write. I don't like to touch LSNs either, and this
is the reason why pg_decompresslog needs some extra work, which avoids
many other changes in the recovery.

> - Have you thought about how pg_standby would integrate with this
> option? Can you please?

Yes, I believe so. As pg_standby does not include any chance to meet
partial writes of pages, I believe you can omit all the full page
writes. Of course, as Tom Lane suggested in
http://archives.postgresql.org/pgsql...2/msg00034.php
removing full page writes can lose a chance to recover from
partial/inconsistent writes in the crash of pg_standby. In this case,
we have to import a backup and archive logs (with full page writes
during the backup) to recover. (We have to import them when the file
system crashes anyway). If it's okay, I believe
pg_compresslog/pg_decompresslog can be integrated with pg_standby.

Maybe we can work together to include pg_compresslog/pg_decompresslog in
pg_standby.

> - I'll do some docs for this after Freeze, if you'd like. I have some
> other changes to make there, so I can do this at the same time.

I'll be looking forward to your writings.

Best regards;

--
Koichi Suzuki
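A minimal Python sketch of the three categories listed in this message, as an illustration only. The names `FullPageWrite`, `during_base_backup` and `keep_in_archive` are invented here and are not part of the patch; the proposal itself only says that one mark bit per full page write is enough.

```python
from dataclasses import dataclass


@dataclass
class FullPageWrite:
    lsn: int
    # Hypothetical mark bit: set when the page image was written between
    # pg_start_backup and pg_stop_backup (category 2).
    during_base_backup: bool


def keep_in_archive(fpw: FullPageWrite, for_standby: bool = False) -> bool:
    """Decide whether a full page write must survive in the archive log."""
    if for_standby:
        # Category 3: a log-shipping standby needs no full page writes.
        return False
    # Category 2 images are kept; category 1 images (first update after a
    # checkpoint) stay in the online WAL but may be dropped from the archive.
    return fpw.during_base_backup


records = [
    FullPageWrite(lsn=100, during_base_backup=False),
    FullPageWrite(lsn=200, during_base_backup=True),
    FullPageWrite(lsn=300, during_base_backup=False),
]
kept = [r.lsn for r in records if keep_in_archive(r)]
```

Only the image written during the base backup stays in the archive; the other two are still present in the online WAL for crash recovery.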
| |||
On Wed, 2007-03-28 at 10:54 +0900, Koichi Suzuki wrote:

> As written below, full page write can be
> categorized as follows:
>
> 1) Needed for crash recovery: first page update after each checkpoint.
> This has to be kept in WAL.
>
> 2) Needed for archive recovery: page update between pg_start_backup and
> pg_stop_backup. This has to be kept in archive log.
>
> 3) For log-shipping slave such as pg_standby: no full page writes will
> be needed for this purpose.
>
> My proposal deals with 2). So, if we mark each "full_page_write", I'd
> rather mark when this is needed. Still need only one bit because the
> case 3) does not need any mark.

I'm very happy with this proposal, though I do still have some points in
detailed areas.

If you accept that 1 & 2 are valid goals, then 1 & 3 or 1, 2 & 3 are
also valid goals, ISTM. i.e. you might choose to use full_page_writes on
the primary and yet would like to see optimised data transfer to the
standby server. In that case, you would need the mark.

> > - Not sure why we need "full_page_compress", why not just mark them
> > always? That harms no one. (Did someone else ask for that? If so, keep
> > it)
>
> No, no one asked to have a separate option. There'll be no bad
> influence to do so. So, if we mark each "full_page_write", I'd
> rather mark when this is needed. Still need only one bit because the
> case 3) does not need any mark.

OK, different question:
Why would anyone ever set full_page_compress = off?

Why have a parameter that does so little? ISTM this is:

i) one more thing to get wrong

ii) cheaper to mark the block when appropriate than to perform the if()
test each time. That can be done only in the path where backup blocks
are present.

iii) If we mark the blocks every time, it allows us to do an offline WAL
compression. If the blocks aren't marked that option is lost. The bit is
useful information, so we should have it in all cases.

> > - OTOH I'd like to see an explicit parameter set during recovery since
> > you're asking the main recovery path to act differently in case a single
> > bit is set/unset. If you are using that form of recovery, we should say
> > so explicitly, to keep everybody else safe.
>
> Only one thing I had to do is to create "dummy" full page write to
> maintain LSNs. Full page writes are omitted in archive log. We have to
> keep LSNs same as those in the original WAL. In this case, recovery has to
> read logical log, not "dummy" full page writes. On the other hand, if
> both logical log and "real" full page writes are found in a log record,
> the recovery has to use "real" full page writes.

I apologise for not understanding your reply, perhaps my original
request was unclear.

In recovery.conf, I'd like to see a parameter such as

dummy_backup_blocks = off (default) | on

to explicitly indicate to the recovery process that backup blocks are
present, yet they are garbage and should be ignored. Having garbage data
within the system is potentially dangerous and I want to be told by the
user that they were expecting that and it's OK to ignore that data.
Otherwise I want to throw informative errors. Maybe it seems OK now, but
the next change to the system may have unintended consequences and it
may not be us making the change. "It's OK the Alien will never escape
from the lab" is the starting premise for many good sci-fi horrors and I
want to watch them, not be in one myself. :-)

We can call it other things, of course. e.g.
ignore_dummy_blocks
decompressed_blocks
apply_backup_blocks

> Yes I believe so. As pg_standby does not include any chance to meet
> partial writes of pages, I believe you can omit all the full page
> writes. Of course, as Tom Lane suggested in
> http://archives.postgresql.org/pgsql...2/msg00034.php
> removing full page writes can lose a chance to recover from
> partial/inconsistent writes in the crash of pg_standby. In this case,
> we have to import a backup and archive logs (with full page writes
> during the backup) to recover. (We have to import them when the file
> system crashes anyway). If it's okay, I believe
> pg_compresslog/pg_decompresslog can be integrated with pg_standby.
>
> Maybe we can work together to include pg_compresslog/pg_decompresslog in
> pg_standby.

ISTM there are two options.

I think this option is already possible:

1. Allow pg_decompresslog to operate on a file, replacing it with the
expanded form, like gunzip, so we would do this:
restore_command = 'pg_standby %f decomp.tmp && pg_decompresslog decomp.tmp %p'

though the decomp.tmp file would not get properly initialised or cleaned
up when we finish.

whereas this will take additional work

2. Allow pg_standby to write to stdout, so that we can do this:
restore_command = 'pg_standby %f | pg_decompresslog - %p'

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com
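The explicit recovery setting requested in this message could behave roughly as in the Python sketch below. This is a hedged illustration, not the patch: only the parameter name `dummy_backup_blocks` comes from the message, while `RecoveryError` and `choose_redo_source` are hypothetical names, and the real recovery code is C.

```python
class RecoveryError(Exception):
    """Raised when recovery meets data the user did not say to expect."""


def choose_redo_source(has_backup_block: bool,
                       block_is_dummy: bool,
                       dummy_backup_blocks: bool = False) -> str:
    """Return which part of a WAL record redo should apply."""
    if not has_backup_block:
        return "logical"
    if block_is_dummy:
        if not dummy_backup_blocks:
            # The user never announced decompressed WAL, so refuse to
            # silently skip what looks like garbage.
            raise RecoveryError(
                "dummy backup block found but dummy_backup_blocks is off")
        return "logical"
    # A real full page image takes precedence over the logical record.
    return "full_page"
```

With the setting off (the default suggested in the message), a dummy block produces an informative error instead of being ignored.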
| |||
Hi,

Here is some feedback on the comments:

Simon Riggs wrote:
> On Wed, 2007-03-28 at 10:54 +0900, Koichi Suzuki wrote:
>
>> As written below, full page write can be
>> categorized as follows:
>>
>> 1) Needed for crash recovery: first page update after each checkpoint.
>> This has to be kept in WAL.
>>
>> 2) Needed for archive recovery: page update between pg_start_backup and
>> pg_stop_backup. This has to be kept in archive log.
>>
>> 3) For log-shipping slave such as pg_standby: no full page writes will
>> be needed for this purpose.
>>
>> My proposal deals with 2). So, if we mark each "full_page_write", I'd
>> rather mark when this is needed. Still need only one bit because the
>> case 3) does not need any mark.
>
> I'm very happy with this proposal, though I do still have some points in
> detailed areas.
>
> If you accept that 1 & 2 are valid goals, then 1 & 3 or 1, 2 & 3 are
> also valid goals, ISTM. i.e. you might choose to use full_page_writes on
> the primary and yet would like to see optimised data transfer to the
> standby server. In that case, you would need the mark.

Yes, I need the mark. In my proposal, only unmarked full-page-writes,
which were written as the first update after a checkpoint, are to be
removed offline (pg_compresslog).

>>> - Not sure why we need "full_page_compress", why not just mark them
>>> always? That harms no one. (Did someone else ask for that? If so, keep
>>> it)
>> No, no one asked to have a separate option. There'll be no bad
>> influence to do so. So, if we mark each "full_page_write", I'd
>> rather mark when this is needed. Still need only one bit because the
>> case 3) does not need any mark.
>
> OK, different question:
> Why would anyone ever set full_page_compress = off?
>
> Why have a parameter that does so little? ISTM this is:
>
> i) one more thing to get wrong
>
> ii) cheaper to mark the block when appropriate than to perform the if()
> test each time. That can be done only in the path where backup blocks
> are present.
>
> iii) If we mark the blocks every time, it allows us to do an offline WAL
> compression. If the blocks aren't marked that option is lost. The bit is
> useful information, so we should have it in all cases.

Full-page-writes are not the only thing written in a WAL record. In my
proposal, both the full-page-write and the logical log are written in a
WAL record, which will make WAL size slightly bigger (five percent or
so). If full_page_compress = off, only a full-page-write will be written
in a WAL record. I thought someone would not be happy with this size
growth.

I agree to make this mandatory if everybody is happy with the extra
logical log in WAL records with full page writes.

I'd like to have your opinion.

>>> - OTOH I'd like to see an explicit parameter set during recovery since
>>> you're asking the main recovery path to act differently in case a single
>>> bit is set/unset. If you are using that form of recovery, we should say
>>> so explicitly, to keep everybody else safe.
>> Only one thing I had to do is to create "dummy" full page write to
>> maintain LSNs. Full page writes are omitted in archive log. We have to
>> keep LSNs same as those in the original WAL. In this case, recovery has to
>> read logical log, not "dummy" full page writes. On the other hand, if
>> both logical log and "real" full page writes are found in a log record,
>> the recovery has to use "real" full page writes.
>
> I apologise for not understanding your reply, perhaps my original
> request was unclear.
>
> In recovery.conf, I'd like to see a parameter such as
>
> dummy_backup_blocks = off (default) | on
>
> to explicitly indicate to the recovery process that backup blocks are
> present, yet they are garbage and should be ignored. Having garbage data
> within the system is potentially dangerous and I want to be told by the
> user that they were expecting that and it's OK to ignore that data.
> Otherwise I want to throw informative errors. Maybe it seems OK now, but
> the next change to the system may have unintended consequences and it
> may not be us making the change. "It's OK the Alien will never escape
> from the lab" is the starting premise for many good sci-fi horrors and I
> want to watch them, not be in one myself. :-)
>
> We can call it other things, of course. e.g.
> ignore_dummy_blocks
> decompressed_blocks
> apply_backup_blocks

So far, we don't need any modification to the recovery and redo
functions. They ignore the dummy and apply logical logs. Also, if
there are both full page writes and logical log, current recovery
selects full page writes to apply.

I agree to introduce this option if 8.3 code introduces any conflict to
the current. Or, we could introduce this option for future safety. Do
you think we should introduce this option?

If this should be introduced now, what we should do is to check this
option when a dummy full-page-write appears.

>> Yes I believe so. As pg_standby does not include any chance to meet
>> partial writes of pages, I believe you can omit all the full page
>> writes. Of course, as Tom Lane suggested in
>> http://archives.postgresql.org/pgsql...2/msg00034.php
>> removing full page writes can lose a chance to recover from
>> partial/inconsistent writes in the crash of pg_standby. In this case,
>> we have to import a backup and archive logs (with full page writes
>> during the backup) to recover. (We have to import them when the file
>> system crashes anyway). If it's okay, I believe
>> pg_compresslog/pg_decompresslog can be integrated with pg_standby.
>>
>> Maybe we can work together to include pg_compresslog/pg_decompresslog in
>> pg_standby.
>
> ISTM there are two options.
>
> I think this option is already possible:
>
> 1. Allow pg_decompresslog to operate on a file, replacing it with the
> expanded form, like gunzip, so we would do this:
> restore_command = 'pg_standby %f decomp.tmp && pg_decompresslog
> decomp.tmp %p'
>
> though the decomp.tmp file would not get properly initialised or cleaned
> up when we finish.
>
> whereas this will take additional work
>
> 2. Allow pg_standby to write to stdout, so that we can do this:
> restore_command = 'pg_standby %f | pg_decompresslog - %p'

Both seem to work fine. pg_decompresslog will read the entire file at
the beginning, so the two will be equivalent. To make the second option
run quicker, pg_decompresslog needs some modification to read WAL
records one after another.

Anyway, could you try to run pg_standby with pg_compresslog and
pg_decompresslog?

----
Additional comment on page header removal:

I found that it is not simple to keep the page header in the compressed
archive log. Because we eliminate unmarked full page writes and shift
the rest of the WAL file data, it is not simple to keep the page header
as the page header in the compressed archive log. It is much simpler to
remove the page headers as well and rebuild them. I'd like to keep the
current implementation on this point.

Any suggestions are welcome.

-------------------

I'll modify the names of the commands and post the patch, hopefully
within 20 hours.

With Best Regards;

--
Koichi Suzuki
| |||
On Thu, 2007-03-29 at 17:50 +0900, Koichi Suzuki wrote:

> Not only full-page-writes are written as WAL record. In my proposal,
> both full-page-writes and logical log are written in a WAL record, which
> will make WAL size slightly bigger (five percent or so). If
> full_page_compress = off, only a full-page-write will be written in a
> WAL record. I thought someone will not be happy with this size growth.

OK, I see what you're doing now and agree with you that we do need a
parameter. Not sure about the name you've chosen though - it certainly
confused me until you explained.

A parameter called ..._compress indicates to me that it would reduce
something in size whereas what it actually does is increase the size of
WAL slightly. We should have a parameter name that indicates what it
actually does, otherwise some people will choose to use this parameter
even when they are not using archive_command with pg_compresslog.

Some possible names...

additional_wal_info = 'COMPRESS'
add_wal_info
wal_additional_info
wal_auxiliary_info
wal_extra_data
attach_wal_info
...
others?

I've got some ideas for the future for adding additional WAL info for
various purposes, so it might be useful to have a parameter that can
cater for multiple types of additional WAL data. Or maybe we go for
something more specific like

wal_add_compress_info = on
wal_add_XXXX_info ...

> > In recovery.conf, I'd like to see a parameter such as
> >
> > dummy_backup_blocks = off (default) | on
> >
> > to explicitly indicate to the recovery process that backup blocks are
> > present, yet they are garbage and should be ignored. Having garbage data
> > within the system is potentially dangerous and I want to be told by the
> > user that they were expecting that and it's OK to ignore that data.
> > Otherwise I want to throw informative errors. Maybe it seems OK now, but
> > the next change to the system may have unintended consequences and it
> > may not be us making the change. "It's OK the Alien will never escape
> > from the lab" is the starting premise for many good sci-fi horrors and I
> > want to watch them, not be in one myself. :-)
> >
> > We can call it other things, of course. e.g.
> > ignore_dummy_blocks
> > decompressed_blocks
> > apply_backup_blocks
>
> So far, we don't need any modification to the recovery and redo
> functions. They ignore the dummy and apply logical logs. Also, if
> there are both full page writes and logical log, current recovery
> selects full page writes to apply.
>
> I agree to introduce this option if 8.3 code introduces any conflict to
> the current. Or, we could introduce this option for future safety. Do
> you think we should introduce this option?

Yes. You are skipping a correctness test and that should be by explicit
command only. It's no problem to include that as well, since you are
already having to specify pg_... decompress... but the recovery process
doesn't know whether or not you've done that.

> Anyway, could you try to run pg_standby with pg_compresslog and
> pg_decompresslog?

After freeze, yes.

> ----
> Additional recomment on page header removal:
>
> I found that it is not simple to keep page header in the compressed
> archive log. Because we eliminate unmarked full page writes and shift
> the rest of the WAL file data, it is not simple to keep page header as
> the page header in the compressed archive log. It is much simpler to
> remove page header as well and rebuild them. I'd like to keep current
> implementation in this point.

OK.

This is a good feature. Thanks for your patience with my comments.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com
| |||
Simon,

> OK, different question:
> Why would anyone ever set full_page_compress = off?

The only reason I can see is if compression costs us CPU but gains RAM &
I/O. I can think of a lot of applications ... benchmarks included ...
which are CPU-bound but not RAM or I/O bound. For those applications,
compression is a bad tradeoff.

If, however, CPU used for compression is made up elsewhere through smaller
file processing, then I'd agree that we don't need a switch.

--
--Josh

Josh Berkus
PostgreSQL @ Sun
San Francisco
| |||
On Thu, 2007-03-29 at 11:45 -0700, Josh Berkus wrote:

> > OK, different question:
> > Why would anyone ever set full_page_compress = off?
>
> The only reason I can see is if compression costs us CPU but gains RAM &
> I/O. I can think of a lot of applications ... benchmarks included ...
> which are CPU-bound but not RAM or I/O bound. For those applications,
> compression is a bad tradeoff.
>
> If, however, CPU used for compression is made up elsewhere through smaller
> file processing, then I'd agree that we don't need a switch.

Koichi-san has explained things for me now.

I misunderstood what the parameter did and reading your post, ISTM you
have as well. I do hope Koichi-san will alter the name to allow
everybody to understand what it does.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com
| |||
> Without a switch, because both full page writes and corresponding
> logical log is included in WAL, this will increase WAL size slightly
> (maybe about five percent or so). If everybody is happy with this, we
> don't need a switch.

Sorry, I still don't understand that. What is the "corresponding logical
log"?
It seems to me that a full page WAL record has enough info to produce a
dummy LSN WAL entry. So instead of just cutting the full page WAL record
you could replace it with an LSN WAL entry when archiving the log.

Then all that is needed is the one flag, no extra space?

Andreas
| |||
On Fri, 2007-03-30 at 10:22 +0200, Zeugswetter Andreas ADI SD wrote:

> > Without a switch, because both full page writes and corresponding
> > logical log is included in WAL, this will increase WAL size slightly
> > (maybe about five percent or so). If everybody is happy with this, we
> > don't need a switch.
>
> Sorry, I still don't understand that. What is the "corresponding logical
> log" ?
> It seems to me, that a full page WAL record has enough info to produce a
> dummy LSN WAL entry. So instead of just cutting the full page wal record
> you could replace it with a LSN WAL entry when archiving the log.
>
> Then all that is needed is the one flag, no extra space ?

The full page write is required for crash recovery, but that isn't
required during archive recovery because the base backup provides the
safe base. Archive recovery needs the normal xlog record, which in some
cases has been optimised away because the backup block is present, since
the full block already contains the changes.

If you want to remove the backup blocks, you need to put back the
information that was optimised away, otherwise you won't be able to do
the archive recovery correctly. Hence a slight increase in WAL volume to
allow it to be compressed does make sense.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com
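The explanation in this message can be illustrated with a toy Python model. It is only a sketch under simplifying assumptions: byte offsets stand in for LSNs, and `Record`, `compress` and `decompress` are invented names that do not reflect the real WAL format or the actual pg_compresslog/pg_decompresslog code. The point it shows is that the logical record travels with every page image, so an unmarked image can be dropped in the archive and later replaced by filler of the same length without shifting any later record.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class Record:
    logical: bytes                 # normal xlog record, always kept
    image: Optional[bytes] = None  # full page image (backup block), if any
    marked: bool = False           # True: image is needed for archive recovery
    stripped_len: int = 0          # archived form only: size of removed image
    dummy: bool = False            # restored form only: image is filler


def length(rec: Record) -> int:
    return len(rec.logical) + (len(rec.image) if rec.image is not None else 0)


def lsns(wal: List[Record]) -> List[int]:
    """Byte offset of each record, standing in for its LSN."""
    offsets, pos = [], 0
    for rec in wal:
        offsets.append(pos)
        pos += length(rec)
    return offsets


def compress(wal: List[Record]) -> List[Record]:
    """Drop unmarked page images, remembering only how long they were."""
    return [
        replace(rec, image=None, stripped_len=len(rec.image))
        if rec.image is not None and not rec.marked
        else rec
        for rec in wal
    ]


def decompress(wal: List[Record]) -> List[Record]:
    """Re-insert filler of the original length so later offsets line up."""
    return [
        replace(rec, image=bytes(rec.stripped_len), stripped_len=0, dummy=True)
        if rec.stripped_len
        else rec
        for rec in wal
    ]


wal = [
    Record(b"a" * 40, image=b"P" * 8192),               # first update after checkpoint
    Record(b"b" * 60),                                  # ordinary record
    Record(b"c" * 50, image=b"Q" * 8192, marked=True),  # written during base backup
]
archived = compress(wal)
restored = decompress(archived)
```

In this model the archived form is roughly half the size of the original, while `lsns(restored)` matches `lsns(wal)`; recovery would then apply the logical record wherever `dummy` is set.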
| ||||
Simon Riggs wrote:
> On Fri, 2007-03-30 at 10:22 +0200, Zeugswetter Andreas ADI SD wrote:
>>> Without a switch, because both full page writes and corresponding
>>> logical log is included in WAL, this will increase WAL size slightly
>>> (maybe about five percent or so). If everybody is happy with this, we
>>> don't need a switch.
>> Sorry, I still don't understand that. What is the "corresponding logical
>> log" ?
>> It seems to me, that a full page WAL record has enough info to produce a
>> dummy LSN WAL entry. So instead of just cutting the full page wal record
>> you could replace it with a LSN WAL entry when archiving the log.
>>
>> Then all that is needed is the one flag, no extra space ?
>
> The full page write is required for crash recovery, but that isn't
> required during archive recovery because the base backup provides the
> safe base.

Is that always true? Could the backup not pick up a partially-written
page? Assuming it's being written to as the backup is in progress. (We
are talking about when disk blocks are smaller than PG blocks here, so
can't guarantee an atomic write for a PG block?)

--
Richard Huxton
Archonet Ltd