vBulletin Search Engine Optimization
| |||||||
| Register | FAQ | Members List | Calendar | Search | Today's Posts | Mark Forums Read |
| ||||
| Tom Lane wrote: > Bricklen Anderson <BAnderson@PresiNET.com> writes: > >>Tom Lane wrote: >> >>>I would have suggested that maybe this represented on-disk data >>>corruption, but the appearance of two different but not-too-far-apart >>>WAL offsets in two different pages suggests that indeed the end of WAL >>>was up around segment 972 or 973 at one time. > > >>Nope, never touched pg_resetxlog. >>My pg_xlog list ranges from 000000010000007300000041 to 0000000100000073000000FE, with no breaks. >>There are also these: 000000010000007400000000 to 00000001000000740000000B > > > That seems like rather a lot of files; do you have checkpoint_segments > set to a large value, like 100? The pg_controldata dump shows that the > latest checkpoint record is in the 73/41 file, so presumably the active > end of WAL isn't exceedingly far past that. You've got 200 segments > prepared for future activity, which is a bit over the top IMHO. > > But anyway, the evidence seems pretty clear that in fact end of WAL is > in the 73 range, and so those page LSNs with 972 and 973 have to be > bogus. I'm back to thinking about dropped bits in RAM or on disk. > IIRC these numbers are all hex, so the extra "9" could come from just > two bits getting turned on that should not be. Might be time to run > memtest86 and/or badblocks. > > regards, tom lane Yes, checkpoint_segments is set to 100, although I can set that lower if you feel that that is more appropriate. Currently, the system receives around 5-8 million inserts per day (across 3 primary tables), so I was leaning towards the "more is better" philosophy. We ran e2fsck with badblocks option last week and didn't turn anything up, along with a couple of passes with memtest. I will run a full-scale memtest and post any interesting results. I've also read that kill -9 postmaster is "not a good thing". I honestly can't vouch for whether or not this may or may not have occurred around the time of the initial creation of this database. It's possible, since this db started it's life as a development db at 8r3 then was bumped to 8r5, then on to 8 final where it has become a dev-final db. Assuming that the memtest passes cleanly, as does another run of badblocks, do you have any more suggestions on how I should proceed? Should I run for a while with zero_damaged_pages set to true and accpet the data loss, or just recreate the whole db from scratch? Thanks again for your help. Cheers, Bricklen ---------------------------(end of broadcast)--------------------------- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq |
| |||
| Bricklen Anderson wrote: > Any ideas on what I should try next? Considering that this db is not in > production yet, I _do_ have the liberty to rebuild the database if > necessary. Do you have any further recommendations? I recall reading something in this ML about problems with the way that Ext3 FS recovers a dirty file system, could it be related? -- Alban Hertroys MAG Productions T: +31(0)53 4346874 F: +31(0)53 4346876 E: alban@magproductions.nl W: http://www.magproductions.nl ---------------------------(end of broadcast)--------------------------- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to majordomo@postgresql.org so that your message can get through to the mailing list cleanly |
| |||
| Alban Hertroys wrote: > Bricklen Anderson wrote: > >> Any ideas on what I should try next? Considering that this db is not >> in production yet, I _do_ have the liberty to rebuild the database if >> necessary. Do you have any further recommendations? > > > I recall reading something in this ML about problems with the way that > Ext3 FS recovers a dirty file system, could it be related? > I really have no idea, but we _are_ running an ext3 fs on this particular server. -- _______________________________ This e-mail may be privileged and/or confidential, and the sender does not waive any related rights and obligations. Any distribution, use or copying of this e-mail or the information it contains by other than an intended recipient is unauthorized. If you received this e-mail in error, please advise me (by return e-mail or otherwise) immediately. _______________________________ ---------------------------(end of broadcast)--------------------------- TIP 6: Have you searched our list archives? http://archives.postgresql.org |
| ||||
| Tom Lane wrote: > Bricklen Anderson <BAnderson@PresiNET.com> writes: > >>>Tom Lane wrote: >>> >>>>But anyway, the evidence seems pretty clear that in fact end of WAL is >>>>in the 73 range, and so those page LSNs with 972 and 973 have to be >>>>bogus. I'm back to thinking about dropped bits in RAM or on disk. > > >>memtest86+ ran for over 15 hours with no errors reported. >>e2fsck -c completed with no errors reported. > > > Hmm ... that's not proof your hardware is ok, but it at least puts the > ball back in play. > > >>Any ideas on what I should try next? Considering that this db is not >>in production yet, I _do_ have the liberty to rebuild the database if >>necessary. Do you have any further recommendations? > > > If the database isn't too large, I'd suggest saving aside a physical > copy (eg, cp or tar dump taken with postmaster stopped) for forensic > purposes, and then rebuilding so you can get on with your own work. > > One bit of investigation that might be worth doing is to look at every > single 8K page in the database files and collect information about the > LSN fields, which are the first 8 bytes of each page. Do you mean this line from pg_filedump's results: LSN: logid 56 recoff 0x3f4be440 Special 8176 (0x1ff0) If so, I've set up a shell script that looped all of the files and emitted that line. It's not particularly elegant, but it worked. Again, that's assuming that it was the correct line. I'll write a perl script to parse out the LSN values to see if any are greater than 116 (which I believe is the hex of 74?). In case anyone wants the script that I ran to get the LSN: #!/bin/sh for FILE in /var/postgres/data/base/17235/*; do i=0 echo $FILE >> test_file; while [ 1==1 ]; do str=`pg_filedump -R $i $FILE | grep LSN`; if [ "$?" -eq "1" ]; then break fi echo "$FILE: $str" >> LSN_out; i=$((i+1)); done done > In a non-broken database all of these should be less than or equal to the current ending > WAL offset (which you can get with pg_controldata if the postmaster is > stopped). We know there are at least two bad pages, but are there more? > Is there any pattern to the bad LSN values? Also it would be useful to > look at each bad page in some detail to see if there's any evidence of > corruption extending beyond the LSN value. > > regards, tom lane NB. I've recreated the database, and saved off the old directory (all 350 gigs of it) so I can dig into it further. Thanks again for you help, Tom. Cheers, Bricklen ---------------------------(end of broadcast)--------------------------- TIP 6: Have you searched our list archives? http://archives.postgresql.org |
| Thread Tools | |
| Display Modes | |
|
|