This is a discussion on archive_command within the pgsql Admins forums, part of the PostgreSQL category; --> On Mon, 2005-10-03 at 20:00 -0700, Jeff Frost wrote: > On Sun, 2 Oct 2005, Simon Riggs wrote: > ...
| |||||||
| FAQ | Members List | Calendar | Search | Today's Posts | Mark Forums Read |
| ||||
| On Mon, 2005-10-03 at 20:00 -0700, Jeff Frost wrote: > On Sun, 2 Oct 2005, Simon Riggs wrote: > > > Probably the best idea is to backup the last WAL file for each timeline > > seen. Keep track of that, so when the current file changes you'll know > > which timeline you're on and stick to that from then on. Or more simply, > > put some notes with your program saying "if you ever use a complex > > recovery situation, make sure to clear all archive_status files for > > higher timeline ids before using this program". > > Tell me if you think this is the most reasonable way to determine the in use > WAL file: > > ls -rt $PGDATA/pg_xlog/ | grep -v "backup\|archive\|RECOVERY" | tail -1 If you trust the times your filesystem hands you. PostgreSQL doesn't use the file times it uses the sequential number naming, so for paranoid accuracy, this should too so perhaps take the t off. > > This would not be necessary had I completed my logswitch patch in time > > for 8.1 freeze, but I regret that I was unable to do that. > > What's the logswitch patch going to accomplish? Nothing until I finish it. :-) > Also, I have an interesting and unrelated question...this past weekend, we had > a failure in which we had to restore from backup. I have created a script > which makes a base backup every two weeks and we keep the last two. Also, we > use PITR for replication, making a base backup between the primary and > secondary server every 8 hours, and running my rsynclastlog script once/minute > to keep as up to date as possible. Sounds like a good setup. > Now here's the problem...the servers > switched roles on Sep 21. We switched them back a little while after that. > When I went to replay the WAL files using the Sep 15 base backup, it happily > played back the WAL files to Sep 21 and stopped. I presume that this means > whenever the servers switch roles I need to create a fresh base backup? Yes, but more generally if I lost one node I would always be inclined to fully backup the remaining one just in case. > Is there any possible way to replay the other WAL files after the Sep 21 > switchover/switchback? I'm going to guess this is similar to having another > timeline. That may be your exact case, it depends upon how you did recovery. Both systems think they are the same one, so you need to be careful. Look for the last log file of the went-down node. That file should exist twice, once as finally written by the went-down node and once as continued to be written to by the stayed-up node after switchover. It sounds like the wrong file was used to recover with and as a result stopped recovery on Sep 21. But if you're running with the system now, you'll need to do an intermediate rebuild and recover the data manually. (I recommend testing recovery procedures before you go live and then at least every 6 months, to ensure that they actually still work when you need 'em. Trying to do a recovery with any confidence is not easy at 4am on a Sunday morning under maximum stress, from experience.) > We restored from a nightly pg_dump which we have kept doing, but that means we > lost about 20 hrs of data. :-( But sounds recoverable. Best Regards, Simon Riggs ---------------------------(end of broadcast)--------------------------- TIP 4: Have you searched our list archives? http://archives.postgresql.org |
| |||
| Jeff Frost wrote: > On Sun, 2 Oct 2005, Simon Riggs wrote: > >> Probably the best idea is to backup the last WAL file for each timeline >> seen. Keep track of that, so when the current file changes you'll know >> which timeline you're on and stick to that from then on. Or more simply, >> put some notes with your program saying "if you ever use a complex >> recovery situation, make sure to clear all archive_status files for >> higher timeline ids before using this program". > > Tell me if you think this is the most reasonable way to determine the in > use WAL file: > > ls -rt $PGDATA/pg_xlog/ | grep -v "backup\|archive\|RECOVERY" | tail -1 > Look at this post I did last year: http://archives.postgresql.org/pgsql...6/msg00013.php in that messages there are two script that deliver remotelly the archive wall, and they store too last WAL in order to not loose the current WAL in case of crash. This was the function I used to find the WAL in use: function copy_last_wal { FILE=$( ls -t1p $PGXLOG | grep -v / | head -1 ) echo "Last Wal> " $FILE cp ${PGXLOG}/${FILE} ${PARTIAL}/${FILE}.tmp mv ${PARTIAL}/${FILE}.tmp ${PARTIAL}/${FILE}.partial find ${PARTIAL} -name *.partial | grep -v ${FILE} | xargs -i rm -fr {} } At that time Tom Lane agreed to provide some functions to ask the engine the name of WAL currently in use, dunno if in the new 8.1 something was done in order to help this process. Regards Gaetano Mendola |
| |||
| On Tue, 4 Oct 2005, Simon Riggs wrote: >> Now here's the problem...the servers >> switched roles on Sep 21. We switched them back a little while after that. >> When I went to replay the WAL files using the Sep 15 base backup, it happily >> played back the WAL files to Sep 21 and stopped. I presume that this means >> whenever the servers switch roles I need to create a fresh base backup? > > Yes, but more generally if I lost one node I would always be inclined to > fully backup the remaining one just in case. I believe they were switched not because one went down, but to add more ram to the primary. They switched back just peachy at that time. >> Is there any possible way to replay the other WAL files after the Sep 21 >> switchover/switchback? I'm going to guess this is similar to having another >> timeline. > > That may be your exact case, it depends upon how you did recovery. Both > systems think they are the same one, so you need to be careful. > > Look for the last log file of the went-down node. That file should exist > twice, once as finally written by the went-down node and once as > continued to be written to by the stayed-up node after switchover. It > sounds like the wrong file was used to recover with and as a result > stopped recovery on Sep 21. But if you're running with the system now, > you'll need to do an intermediate rebuild and recover the data manually. What would constitute an intermediate rebuild? Of course the system is up and live and having data added to it. How would one restore from multiple timelines? > (I recommend testing recovery procedures before you go live and then at > least every 6 months, to ensure that they actually still work when you > need 'em. Trying to do a recovery with any confidence is not easy at 4am > on a Sunday morning under maximum stress, from experience.) > I absolutely agree..unfortunately, it's only been 3 months since the last test. :-( -- Jeff Frost, Owner <jeff@frostconsultingllc.com> Frost Consulting, LLC http://www.frostconsultingllc.com/ Phone: 650-780-7908 FAX: 650-649-1954 ---------------------------(end of broadcast)--------------------------- TIP 6: explain analyze is your friend |
| |||
| On Tue, 2005-10-04 at 08:03 -0700, Jeff Frost wrote: > >> Is there any possible way to replay the other WAL files after the Sep 21 > >> switchover/switchback? I'm going to guess this is similar to having another > >> timeline. > > > > That may be your exact case, it depends upon how you did recovery. Both > > systems think they are the same one, so you need to be careful. > > > > Look for the last log file of the went-down node. That file should exist > > twice, once as finally written by the went-down node and once as > > continued to be written to by the stayed-up node after switchover. It > > sounds like the wrong file was used to recover with and as a result > > stopped recovery on Sep 21. But if you're running with the system now, > > you'll need to do an intermediate rebuild and recover the data manually. > > What would constitute an intermediate rebuild? Of course the system is up and > live and having data added to it. How would one restore from multiple > timelines? This is only if you are back up and working on the went-down box: intermediate rebuild: I mean that you will have to restore the data from the period of switchover, manually extract the relevant data with SQL and then re-insert those changes yourself and resolve conflicts. There isn't a process to merge the two log streams. The same txnid will have been used on both servers to refer to separate transactions, so there can be no automated way of resolving the data. It has to be done using business domain knowledge rather than log data. All of that's no different from any other RDBMS, as I'm sure you know. Best Regards, Simon Riggs ---------------------------(end of broadcast)--------------------------- TIP 4: Have you searched our list archives? http://archives.postgresql.org |
| |||
| On Tue, 4 Oct 2005, Simon Riggs wrote: >> What would constitute an intermediate rebuild? Of course the system is up and >> live and having data added to it. How would one restore from multiple >> timelines? > > This is only if you are back up and working on the went-down box: > > intermediate rebuild: I mean that you will have to restore the data from > the period of switchover, manually extract the relevant data with SQL > and then re-insert those changes yourself and resolve conflicts. I think we're likely out of luck here, but let me provide a few more details so you might give me a better idea. We are up on the went-down box with a pg_dump restore from the 3:30a.m. of the previous day. We have WAL files available from the box that was up during the failure of the primary. What I do not have is a base backup of the box which was up for part of the day while we brought up the went-down box. So, unless there is some method of replaying WAL files by hand one at a time, I think we are out of luck, no? > > There isn't a process to merge the two log streams. The same txnid will > have been used on both servers to refer to separate transactions, so > there can be no automated way of resolving the data. It has to be done > using business domain knowledge rather than log data. > > All of that's no different from any other RDBMS, as I'm sure you know. I do know, and in fact, I think PITR is great, I just wish we had thought about the corner case of the servers switching roles between the biweekly base backups. -- Jeff Frost, Owner <jeff@frostconsultingllc.com> Frost Consulting, LLC http://www.frostconsultingllc.com/ Phone: 650-780-7908 FAX: 650-649-1954 ---------------------------(end of broadcast)--------------------------- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to majordomo@postgresql.org so that your message can get through to the mailing list cleanly |
| |||
| On Tue, 4 Oct 2005, Gaetano Mendola wrote: Looks like we're doing just about the same thing, but you're using head and I'm using tail. However, it seems that your method does not require grepping out the special files, so that's awesome! BTW you do not have to use the -1 flag when you put ls to a pipe. -Jeff > FILE=$( ls -t1p $PGXLOG | grep -v / | head -1 ) -- Jeff Frost, Owner <jeff@frostconsultingllc.com> Frost Consulting, LLC http://www.frostconsultingllc.com/ Phone: 650-780-7908 FAX: 650-649-1954 ---------------------------(end of broadcast)--------------------------- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to majordomo@postgresql.org so that your message can get through to the mailing list cleanly |
| ||||
| Looks like using that FILE= expression, I ended up with a bad file choice last night immediately after the base backup: Copying 0000000100000015000000F8.0088A490.backup to /mnt/pgbackup/pitr/0000000100000015000000F8.0088A490.backup Also, I noticed that rsync had a file disappear out from under it during the base backup: file has vanished: "/usr/local/pgsql/data/pg_subtrans/0743" So, I have modified my FILE= expression to the following: FILE=`ls -tp /pg_xlog/ | grep -v "backup\|/" | head -1` Which gives me this file: 00000001000000160000002A Given these in the pg_xlog dir. 00000001000000160000002A 00000001000000160000002B 00000001000000160000002C 00000001000000160000002D 00000001000000160000002E 00000001000000160000002F 000000010000001600000030 000000010000001600000031 Removing the -t takes yields the same result as long as I grep out the .backup files. On Tue, 4 Oct 2005, Gaetano Mendola wrote: > This was the function I used to find the WAL in use: > > function copy_last_wal > { > FILE=$( ls -t1p $PGXLOG | grep -v / | head -1 ) > > echo "Last Wal> " $FILE > > cp ${PGXLOG}/${FILE} ${PARTIAL}/${FILE}.tmp > mv ${PARTIAL}/${FILE}.tmp ${PARTIAL}/${FILE}.partial > find ${PARTIAL} -name *.partial | grep -v ${FILE} | xargs -i rm -fr {} > } > -- Jeff Frost, Owner <jeff@frostconsultingllc.com> Frost Consulting, LLC http://www.frostconsultingllc.com/ Phone: 650-780-7908 FAX: 650-649-1954 ---------------------------(end of broadcast)--------------------------- TIP 4: Have you searched our list archives? http://archives.postgresql.org |