This is a discussion on Re: Cleaning up unreferenced table files within the Pgsql Patches forums, part of the PostgreSQL category; --> Applied. --------------------------------------------------------------------------- pgman wrote: > Tom Lane wrote: > > Heikki Linnakangas <hlinnaka@iki.fi> writes: > > > On Tue, ...
| |||||||
| FAQ | Members List | Calendar | Search | Today's Posts | Mark Forums Read |
| ||||
| Applied. --------------------------------------------------------------------------- pgman wrote: > Tom Lane wrote: > > Heikki Linnakangas <hlinnaka@iki.fi> writes: > > > On Tue, 26 Apr 2005, Tom Lane wrote: > > >> Heikki Linnakangas <hlinnaka@iki.fi> writes: > > >>> I feel that crashes that leaves behind stale files are rare. > > >> > > >> Indeed, and getting more so all the time ... > > > > > How so? Have changes been made in those parts of the code? > > > > No, just that the overall reliability of Postgres keeps improving. > > > > At the time that TODO entry was created, I don't think we even had the > > ability to roll back table creates/drops properly, so there were > > scenarios in which unreferenced files could be left behind without even > > assuming any software error. And the prevalence of backend crashes was > > way higher than it is now, too. So I think a good argument could be > > made that the TODO item isn't nearly as important as it was at the time. > > Agreed, but I don't think we are ever going to have the file system obey > the same semantics as the database, so there will always be corner cases > where we have the possibility for unreferenced files. > > > > If nobody ever runs into this issue in production, and this whole exercise > > > turns out to be completely unnecessary, at least we'll know. That alone > > > makes me feel better. > > > > We will know no such thing, unless the patch is made to announce the > > problem so intrusively that people are certain to see it *and report it > > to us*. Which I don't think will be acceptable. > > You are right that checking only after a server crash isn't going to > help us very much, because the odds someone is going to look at the logs > just at that moment are slim, and if they stop and start the server > again, the message will not appear so they will think it is fixed. > > Though the problem is caused by a server crash, it remains even after a > clean startup. My new version of the patch checks not just after a > server crash, but every time the server starts up. Here are some > timings: > > 3dbs without patch 0m0.230s 300 files > 3dbs with patch 0m0.230s 300 files > 100dbs with patch 0m0.270s 10000 files > > (The timing test started the server and waited for the message "database > system is ready".) > > You can see that for three empty databases (only system tables/indexes), > the patch has no impact. For 100 databases, or 10,000 files, there is a > 17% increase in startup time, which seems acceptable. > > The message printed in the logs is: > > LOG: The table or index file "/u/pgsql/data/base/1/100001" is stale and can be safely removed > > I made a few changes to the patch. I changed 'clean' to 'check' because > I don't think we are ever going to remove the files on startup (If we > did we could do the cleanup just on crash recovery startup.) I changed > the message from NOTICE to LOG, I made it run on all startups and not > just crash recovery, and it was missing FreeDir() calls. > > -- > Bruce Momjian | http://candle.pha.pa.us > pgman@candle.pha.pa.us | (610) 359-1001 > + If your life is a hard drive, | 13 Roberts Road > + Christ can be your backup. | Newtown Square, Pennsylvania 19073 > Index: doc/src/sgml/maintenance.sgml > ================================================== ================= > RCS file: /cvsroot/pgsql/doc/src/sgml/maintenance.sgml,v > retrieving revision 1.41 > diff -c -c -r1.41 maintenance.sgml > *** doc/src/sgml/maintenance.sgml 20 Feb 2005 02:21:26 -0000 1.41 > --- doc/src/sgml/maintenance.sgml 1 May 2005 04:30:48 -0000 > *************** > *** 474,479 **** > --- 474,496 ---- > </para> > </sect1> > > + <sect1 id="check-files-after-crash"> > + <title>Check files after crash</title> > + > + <indexterm zone="check-files-after-crash"> > + <primary>stale file</primary> > + </indexterm> > + > + <para> > + <productname>PostgreSQL</productname> recovers automatically after crash > + using the write-ahead log (see <xref linkend="wal">) and no manual > + operations are normally needed. However, if there was a transaction running > + when the crash occured that created or dropped a relation, the > + transaction might have left a stale file in the data directory. If this > + happens, you will get a notice in the log file stating which files can be > + deleted. > + </para> > + </sect1> > > <sect1 id="logfile-maintenance"> > <title>Log File Maintenance</title> > Index: src/backend/access/transam/xlog.c > ================================================== ================= > RCS file: /cvsroot/pgsql/src/backend/access/transam/xlog.c,v > retrieving revision 1.189 > diff -c -c -r1.189 xlog.c > *** src/backend/access/transam/xlog.c 28 Apr 2005 21:47:10 -0000 1.189 > --- src/backend/access/transam/xlog.c 1 May 2005 04:30:52 -0000 > *************** > *** 43,48 **** > --- 43,49 ---- > #include "utils/builtins.h" > #include "utils/guc.h" > #include "utils/relcache.h" > + #include "utils/flatfiles.h" > > > /* > *************** > *** 4525,4530 **** > --- 4526,4533 ---- > > CreateCheckPoint(true, true); > > + CheckStaleRelFiles(); > + > /* > * Close down recovery environment > */ > *************** > *** 4536,4541 **** > --- 4539,4550 ---- > */ > remove_backup_label(); > } > + else > + { > + XLogInitRelationCache(); > + CheckStaleRelFiles(); > + XLogCloseRelationCache(); > + } > > /* > * Preallocate additional log files, if wanted. > Index: src/backend/catalog/catalog.c > ================================================== ================= > RCS file: /cvsroot/pgsql/src/backend/catalog/catalog.c,v > retrieving revision 1.59 > diff -c -c -r1.59 catalog.c > *** src/backend/catalog/catalog.c 14 Apr 2005 20:03:23 -0000 1.59 > --- src/backend/catalog/catalog.c 1 May 2005 04:30:52 -0000 > *************** > *** 106,111 **** > --- 106,144 ---- > return path; > } > > + /* > + * GetTablespacePath - construct path to a tablespace symbolic link > + * > + * Result is a palloc'd string. > + * > + * XXX this must agree with relpath and GetDatabasePath! > + */ > + char * > + GetTablespacePath(Oid spcNode) > + { > + int pathlen; > + char *path; > + > + Assert(spcNode != GLOBALTABLESPACE_OID); > + > + if (spcNode == DEFAULTTABLESPACE_OID) > + { > + /* The default tablespace is {datadir}/base */ > + pathlen = strlen(DataDir) + 5 + 1; > + path = (char *) palloc(pathlen); > + snprintf(path, pathlen, "%s/base", > + DataDir); > + } > + else > + { > + /* All other tablespaces have symlinks in pg_tblspc */ > + pathlen = strlen(DataDir) + 11 + OIDCHARS + 1; > + path = (char *) palloc(pathlen); > + snprintf(path, pathlen, "%s/pg_tblspc/%u", > + DataDir, spcNode); > + } > + return path; > + } > > /* > * IsSystemRelation > Index: src/backend/commands/tablespace.c > ================================================== ================= > RCS file: /cvsroot/pgsql/src/backend/commands/tablespace.c,v > retrieving revision 1.17 > diff -c -c -r1.17 tablespace.c > *** src/backend/commands/tablespace.c 14 Apr 2005 20:03:24 -0000 1.17 > --- src/backend/commands/tablespace.c 1 May 2005 04:30:53 -0000 > *************** > *** 341,348 **** > /* > * All seems well, create the symlink > */ > ! linkloc = (char *) palloc(strlen(DataDir) + 11 + 10 + 1); > ! sprintf(linkloc, "%s/pg_tblspc/%u", DataDir, tablespaceoid); > > if (symlink(location, linkloc) < 0) > ereport(ERROR, > --- 341,347 ---- > /* > * All seems well, create the symlink > */ > ! linkloc = GetTablespacePath(tablespaceoid); > > if (symlink(location, linkloc) < 0) > ereport(ERROR, > *************** > *** 495,502 **** > char *subfile; > struct stat st; > > ! location = (char *) palloc(strlen(DataDir) + 11 + 10 + 1); > ! sprintf(location, "%s/pg_tblspc/%u", DataDir, tablespaceoid); > > /* > * Check if the tablespace still contains any files. We try to rmdir > --- 494,500 ---- > char *subfile; > struct stat st; > > ! location = GetTablespacePath(tablespaceoid); > > /* > * Check if the tablespace still contains any files. We try to rmdir > *************** > *** 1036,1043 **** > set_short_version(location); > > /* Create the symlink if not already present */ > ! linkloc = (char *) palloc(strlen(DataDir) + 11 + 10 + 1); > ! sprintf(linkloc, "%s/pg_tblspc/%u", DataDir, xlrec->ts_id); > > if (symlink(location, linkloc) < 0) > { > --- 1034,1040 ---- > set_short_version(location); > > /* Create the symlink if not already present */ > ! linkloc = GetTablespacePath(xlrec->ts_id); > > if (symlink(location, linkloc) < 0) > { > Index: src/backend/utils/adt/misc.c > ================================================== ================= > RCS file: /cvsroot/pgsql/src/backend/utils/adt/misc.c,v > retrieving revision 1.40 > diff -c -c -r1.40 misc.c > *** src/backend/utils/adt/misc.c 31 Dec 2004 22:01:22 -0000 1.40 > --- src/backend/utils/adt/misc.c 1 May 2005 04:30:53 -0000 > *************** > *** 26,31 **** > --- 26,32 ---- > #include "funcapi.h" > #include "catalog/pg_type.h" > #include "catalog/pg_tablespace.h" > + #include "catalog/catalog.h" > > #define atooid(x) ((Oid) strtoul((x), NULL, 10)) > > *************** > *** 144,154 **** > > fctx = palloc(sizeof(ts_db_fctx)); > > - /* > - * size = path length + tablespace dirname length + 2 dir sep > - * chars + oid + terminator > - */ > - fctx->location = (char *) palloc(strlen(DataDir) + 11 + 10 + 1); > if (tablespaceOid == GLOBALTABLESPACE_OID) > { > fctx->dirdesc = NULL; > --- 145,150 ---- > *************** > *** 157,168 **** > } > else > { > ! if (tablespaceOid == DEFAULTTABLESPACE_OID) > ! sprintf(fctx->location, "%s/base", DataDir); > ! else > ! sprintf(fctx->location, "%s/pg_tblspc/%u", DataDir, > ! tablespaceOid); > ! > fctx->dirdesc = AllocateDir(fctx->location); > > if (!fctx->dirdesc) > --- 153,159 ---- > } > else > { > ! fctx->location = GetTablespacePath(tablespaceOid); > fctx->dirdesc = AllocateDir(fctx->location); > > if (!fctx->dirdesc) > Index: src/backend/utils/init/Makefile > ================================================== ================= > RCS file: /cvsroot/pgsql/src/backend/utils/init/Makefile,v > retrieving revision 1.18 > diff -c -c -r1.18 Makefile > *** src/backend/utils/init/Makefile 20 Feb 2005 02:22:00 -0000 1.18 > --- src/backend/utils/init/Makefile 1 May 2005 04:30:53 -0000 > *************** > *** 12,18 **** > top_builddir = ../../../.. > include $(top_builddir)/src/Makefile.global > > ! OBJS = flatfiles.o globals.o miscinit.o postinit.o > > all: SUBSYS.o > > --- 12,18 ---- > top_builddir = ../../../.. > include $(top_builddir)/src/Makefile.global > > ! OBJS = flatfiles.o globals.o miscinit.o postinit.o checkfiles.o > > all: SUBSYS.o > > Index: src/include/catalog/catalog.h > ================================================== ================= > RCS file: /cvsroot/pgsql/src/include/catalog/catalog.h,v > retrieving revision 1.30 > diff -c -c -r1.30 catalog.h > *** src/include/catalog/catalog.h 31 Dec 2004 22:03:24 -0000 1.30 > --- src/include/catalog/catalog.h 1 May 2005 04:30:54 -0000 > *************** > *** 19,24 **** > --- 19,25 ---- > > extern char *relpath(RelFileNode rnode); > extern char *GetDatabasePath(Oid dbNode, Oid spcNode); > + extern char *GetTablespacePath(Oid spcNode); > > extern bool IsSystemRelation(Relation relation); > extern bool IsToastRelation(Relation relation); > Index: src/include/utils/flatfiles.h > ================================================== ================= > RCS file: /cvsroot/pgsql/src/include/utils/flatfiles.h,v > retrieving revision 1.1 > diff -c -c -r1.1 flatfiles.h > *** src/include/utils/flatfiles.h 20 Feb 2005 02:22:07 -0000 1.1 > --- src/include/utils/flatfiles.h 1 May 2005 04:30:54 -0000 > *************** > *** 30,33 **** > --- 30,36 ---- > > extern Datum flatfile_update_trigger(PG_FUNCTION_ARGS); > > + /* from checkfiles.c */ > + extern void CheckStaleRelFiles(void); > + > #endif /* FLATFILES_H */ > /*------------------------------------------------------------------------- > * > * checkfiles.c > * support to clean up stale relation files on crash recovery > * > * If a backend crashes while in a transaction that has created or > * deleted a relfilenode, a stale file can be left over in the data > * directory. This file contains routines to clean up those stale > * files on recovery. > * > * $PostgreSQL$ > * > *------------------------------------------------------------------------- > */ > #include "postgres.h" > > #include "storage/fd.h" > > #include "utils/flatfiles.h" > #include "miscadmin.h" > #include "catalog/pg_tablespace.h" > #include "catalog/catalog.h" > #include "access/skey.h" > #include "utils/fmgroids.h" > #include "access/relscan.h" > #include "access/heapam.h" > #include "utils/resowner.h" > > static void CheckStaleRelFilesFrom(Oid tablespaceoid, Oid dboid); > static void CheckStaleRelFilesFromTablespace(Oid tablespaceoid); > > /* Like AllocateDir, but ereports on failure */ > static DIR * > AllocateDirChecked(char *path) > { > DIR *dirdesc = AllocateDir(path); > if (dirdesc == NULL) > ereport(ERROR, > (errcode_for_file_access(), > errmsg("could not open directory \"%s\": %m", > path))); > return dirdesc; > } > > /* > * Scan through all tablespaces for relations left over > * by aborted transactions. > * > * For example, if a transaction issues > * BEGIN; CREATE TABLE foobar (); > * and then the backend crashes, the file is left in the > * tablespace until CheckStaleRelFiles deletes it. > */ > void > CheckStaleRelFiles(void) > { > DIR *dirdesc; > struct dirent *de; > char *path; > int pathlen; > > pathlen = strlen(DataDir) + 11 + 1; > path = (char *) palloc(pathlen); > snprintf(path, pathlen, "%s/pg_tblspc/", DataDir); > dirdesc = AllocateDirChecked(path); > while ((de = readdir(dirdesc)) != NULL) > { > char *invalid; > Oid tablespaceoid; > > /* Check that the directory name looks like valid tablespace link. */ > tablespaceoid = (Oid) strtol(de->d_name, &invalid, 10); > if(invalid[0] == '\0') > CheckStaleRelFilesFromTablespace(tablespaceoid); > } > FreeDir(dirdesc); > pfree(path); > > CheckStaleRelFilesFromTablespace(DEFAULTTABLESPACE _OID); > } > > /* Scan a specific tablespace for stale relations */ > static void > CheckStaleRelFilesFromTablespace(Oid tablespaceoid) > { > DIR *dirdesc; > struct dirent *de; > char *path; > > path = GetTablespacePath(tablespaceoid); > > dirdesc = AllocateDirChecked(path); > while ((de = readdir(dirdesc)) != NULL) > { > char *invalid; > Oid dboid; > > dboid = (Oid) strtol(de->d_name, &invalid, 10); > if(invalid[0] == '\0') > CheckStaleRelFilesFrom(tablespaceoid, dboid); > } > FreeDir(dirdesc); > pfree(path); > } > > /* Scan a specific database in a specific tablespace for stale relations. > * > * First, pg_class for the database is opened, and the relfilenodes of all > * relations mentioned there are stored in a hash table. > * > * Then the directory is scanned. Every file in the directory that's not > * found in pg_class (the hash table) is logged. > */ > static void > CheckStaleRelFilesFrom(Oid tablespaceoid, Oid dboid) > { > DIR *dirdesc; > struct dirent *de; > HASHCTL hashctl; > HTAB *relfilenodeHash; > MemoryContext mcxt; > RelFileNode rnode; > char *path; > > /* We create a private memory context so that we can easily deallocate > * the hash table and its contents > */ > mcxt = AllocSetContextCreate(TopMemoryContext, "CheckStaleRelFiles", > ALLOCSET_DEFAULT_MINSIZE, > ALLOCSET_DEFAULT_INITSIZE, > ALLOCSET_DEFAULT_MAXSIZE); > > hashctl.hash = tag_hash; > /* The entry contents is not used for anything, we just check > * if an oid is in the hash table or not. */ > hashctl.keysize = sizeof(Oid); > hashctl.entrysize = 1; > hashctl.hcxt = mcxt; > relfilenodeHash = hash_create("relfilenodeHash", 100, &hashctl, > HASH_FUNCTION > | HASH_ELEM | HASH_CONTEXT); > > /* Read all relfilenodes from pg_class into the hash table */ > { > ResourceOwner owner, oldowner; > Relation rel; > HeapScanDesc scan; > HeapTuple tuple; > > /* Need a resowner to keep the heapam and buffer code happy */ > owner = ResourceOwnerCreate(NULL, "CheckStaleRelFiles"); > oldowner = CurrentResourceOwner; > CurrentResourceOwner = owner; > > rnode.spcNode = tablespaceoid; > rnode.dbNode = dboid; > rnode.relNode = RelationRelationId; > rel = XLogOpenRelation(true, 0 , rnode); > > scan = heap_beginscan(rel, SnapshotNow, 0, NULL); > while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL) > { > Form_pg_class classform = (Form_pg_class) GETSTRUCT(tuple); > > hash_search(relfilenodeHash, &classform->relfilenode, > HASH_ENTER, NULL); > } > heap_endscan(scan); > > XLogCloseRelation(rnode); > CurrentResourceOwner = oldowner; > ResourceOwnerDelete(owner); > } > > /* Scan the directory */ > path = GetDatabasePath(dboid, tablespaceoid); > > dirdesc = AllocateDirChecked(path); > while ((de = readdir(dirdesc)) != NULL) > { > char *invalid; > Oid relfilenode; > > relfilenode = strtol(de->d_name, &invalid, 10); > if(invalid[0] == '\0') > { > /* Filename was a valid number, check if pg_class knows > about it */ > if(hash_search(relfilenodeHash, &relfilenode, > HASH_FIND, NULL) == NULL) > { > char *filepath; > rnode.spcNode = tablespaceoid; > rnode.dbNode = dboid; > rnode.relNode = relfilenode; > > filepath = relpath(rnode); > > ereport(LOG, > (errcode_for_file_access(), > errmsg("The table or index file \"%s\" is stale and can be safely removed", > filepath))); > pfree(filepath); > } > } > } > FreeDir(dirdesc); > pfree(path); > hash_destroy(relfilenodeHash); > MemoryContextDelete(mcxt); > } -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073 ---------------------------(end of broadcast)--------------------------- TIP 6: Have you searched our list archives? http://archives.postgresql.org |
| |||
| Bruce Momjian <pgman@candle.pha.pa.us> writes: > Applied. Now that I've had a chance to look at it, this patch is thoroughly broken. Problems observed in a quick review: 1. It doesn't work at all for non-default tablespaces: it will claim that every file in such a tablespace is stale. The fact that it does that rather than failing entirely is accidental. It tries to read the database's pg_class in the target tablespace whether it's there or not. Because the system is still in recovery mode, the low-level routines allow the access to the nonexistent pg_class table to pass --- in fact they think they should create the file, so after it runs there's a bogus empty "1259" file in each such tablespace (which of course it complains about, too). The code then proceeds to think that pg_class is empty so of course everything draws a warning. 2. It's not robust against stale subdirectories of a tablespace (ie, subdirs corresponding to a nonexistent database) --- again, it'll try to read a nonexistent pg_class. Then it'll produce a bunch of off-target complaint messages. 3. It's assuming that relfilenode is unique database-wide, when no such assumption is safe. We only have a guarantee that it's unique tablespace-wide. 4. It fails to examine table segment files (such as "nnn.1"). These should be complained of when the "nnn" doesn't match any hash entry. 5. It will load every relfilenode value in pg_class into the hashtable whether it's meaningful or not. There should be a check on relkind. 6. I don't think relying on strtol to decide if a filename is entirely numeric is very safe. Note all the extra defenses in pg_atoi against various platform-specific misbehaviors of strtol. Personally I'd use a strspn test instead. 7. There are no checks for readdir failure (compare any other readdir loop in the backend). See also Simon Riggs' complaints that the circumstances under which it's done are pretty randomly selected. (One particular thing that I think is a bad idea is to do this in a standalone backend. Any sort of corruption in any db's pg_class would render it impossible to start up.) To fix the first three problems, and also avoid the performance problem of multiply rescanning a database's pg_class for each of its tablespaces, I would suggest that the hashtable entries be widened to RelFileNode structs (ie, db oid, tablespace oid, relfilenode oid). Then there should be one iteration over pg_database to learn the OIDs and default tablespaces of each database; with that you can read pg_class from its correct location for each database and load all the entries into the hashtable. Then you iterate through the tablespaces looking for stuff not present in the hashtable. You might also want to build a list or hashtable of known database OIDs, so that you can recognize a stale subdirectory immediately and issue a direct complaint about it without even recursing into it. regards, tom lane ---------------------------(end of broadcast)--------------------------- TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org |
| |||
| On Thu, 5 May 2005, Tom Lane wrote: > Bruce Momjian <pgman@candle.pha.pa.us> writes: >> Applied. > > Now that I've had a chance to look at it, this patch is thoroughly > broken. Problems observed in a quick review: > > 1. It doesn't work at all for non-default tablespaces: it will > claim that every file in such a tablespace is stale. The fact > that it does that rather than failing entirely is accidental. > It tries to read the database's pg_class in the target tablespace > whether it's there or not. Because the system is still in recovery > mode, the low-level routines allow the access to the nonexistent > pg_class table to pass --- in fact they think they should create > the file, so after it runs there's a bogus empty "1259" file in each > such tablespace (which of course it complains about, too). The code > then proceeds to think that pg_class is empty so of course everything > draws a warning. > > 2. It's not robust against stale subdirectories of a tablespace > (ie, subdirs corresponding to a nonexistent database) --- again, > it'll try to read a nonexistent pg_class. Then it'll produce a > bunch of off-target complaint messages. > > 3. It's assuming that relfilenode is unique database-wide, when no > such assumption is safe. We only have a guarantee that it's unique > tablespace-wide. > > 4. It fails to examine table segment files (such as "nnn.1"). These > should be complained of when the "nnn" doesn't match any hash entry. > > 5. It will load every relfilenode value in pg_class into the hashtable > whether it's meaningful or not. There should be a check on relkind. > > 6. I don't think relying on strtol to decide if a filename is entirely > numeric is very safe. Note all the extra defenses in pg_atoi against > various platform-specific misbehaviors of strtol. Personally I'd use a > strspn test instead. > I'll fix 1-6 according to your suggestions, and send another patch. It shows how little experience I have with multiple database and tablespace management. > 7. There are no checks for readdir failure (compare any other readdir > loop in the backend). I couldn't figure out what you meant. The readdir code is the same as anywhere else. Also, man page (Linux) says that readdir returns NULL on error, and that is checked. > See also Simon Riggs' complaints that the circumstances under which it's > done are pretty randomly selected. (One particular thing that I think > is a bad idea is to do this in a standalone backend. Any sort of > corruption in any db's pg_class would render it impossible to start up.) I'd agree with Simons complaints if we actually deleted the files. But since we only report them, it's a good idea to report them on every startup, otherwise the DBA might think that the stale files are not there anymore since the system isn't complaining about them anymore. The original patch only ran the check on crash recovery, but Bruce changed it to run on startup as well, for the above reason. I agree, though, that it's a bad idea to do it in standalone mode. I'll add a check for that. Also it probably shouldn't stop the startup even if some pg_class is corrupt. Other databases could be fine. > To fix the first three problems, and also avoid the performance problem > of multiply rescanning a database's pg_class for each of its > tablespaces, I would suggest that the hashtable entries be widened to > RelFileNode structs (ie, db oid, tablespace oid, relfilenode oid). Then > there should be one iteration over pg_database to learn the OIDs and > default tablespaces of each database; with that you can read pg_class > from its correct location for each database and load all the entries > into the hashtable. Then you iterate through the tablespaces looking > for stuff not present in the hashtable. You might also want to build a > list or hashtable of known database OIDs, so that you can recognize a > stale subdirectory immediately and issue a direct complaint about it > without even recursing into it. > > regards, tom lane > - Heikki ---------------------------(end of broadcast)--------------------------- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to majordomo@postgresql.org so that your message can get through to the mailing list cleanly |
| |||
| Maybe we should take a different approach to the problem: 1. Create new file with an extension to mark that it's not yet committed (eg. 1234.notcommitted) 2. ... 3. Take CheckpointStartLock 4. Write commit record to WAL, with list of created files. 5. rename created file (1234.notcommitted -> 1234). 6. Release CheckpointStartLock This would guarantee that after successful WAL replay, all files in the data directory with .notcommitted extension can be safely deleted. No need to read pg_database or pg_class. We would take a performance hit because of the additional rename and fsync step. Also, we must somehow make sure that the new file or the directory it's in is fsynced on checkpoint to make sure that the rename is flushed to disk. A variant of the scheme would be to create two files on step 1. One would be the actual relfile (1234) and the other would an empty marker file (1234.notcommitted). That way the smgr code wouldn't have to care it the file is new or not when opening it. - Heikki On Thu, 5 May 2005, Tom Lane wrote: > Bruce Momjian <pgman@candle.pha.pa.us> writes: >> Applied. > > Now that I've had a chance to look at it, this patch is thoroughly > broken. Problems observed in a quick review: > > 1. It doesn't work at all for non-default tablespaces: it will > claim that every file in such a tablespace is stale. The fact > that it does that rather than failing entirely is accidental. > It tries to read the database's pg_class in the target tablespace > whether it's there or not. Because the system is still in recovery > mode, the low-level routines allow the access to the nonexistent > pg_class table to pass --- in fact they think they should create > the file, so after it runs there's a bogus empty "1259" file in each > such tablespace (which of course it complains about, too). The code > then proceeds to think that pg_class is empty so of course everything > draws a warning. > > 2. It's not robust against stale subdirectories of a tablespace > (ie, subdirs corresponding to a nonexistent database) --- again, > it'll try to read a nonexistent pg_class. Then it'll produce a > bunch of off-target complaint messages. > > 3. It's assuming that relfilenode is unique database-wide, when no > such assumption is safe. We only have a guarantee that it's unique > tablespace-wide. > > 4. It fails to examine table segment files (such as "nnn.1"). These > should be complained of when the "nnn" doesn't match any hash entry. > > 5. It will load every relfilenode value in pg_class into the hashtable > whether it's meaningful or not. There should be a check on relkind. > > 6. I don't think relying on strtol to decide if a filename is entirely > numeric is very safe. Note all the extra defenses in pg_atoi against > various platform-specific misbehaviors of strtol. Personally I'd use a > strspn test instead. > > 7. There are no checks for readdir failure (compare any other readdir > loop in the backend). > > See also Simon Riggs' complaints that the circumstances under which it's > done are pretty randomly selected. (One particular thing that I think > is a bad idea is to do this in a standalone backend. Any sort of > corruption in any db's pg_class would render it impossible to start up.) > > To fix the first three problems, and also avoid the performance problem > of multiply rescanning a database's pg_class for each of its > tablespaces, I would suggest that the hashtable entries be widened to > RelFileNode structs (ie, db oid, tablespace oid, relfilenode oid). Then > there should be one iteration over pg_database to learn the OIDs and > default tablespaces of each database; with that you can read pg_class > from its correct location for each database and load all the entries > into the hashtable. Then you iterate through the tablespaces looking > for stuff not present in the hashtable. You might also want to build a > list or hashtable of known database OIDs, so that you can recognize a > stale subdirectory immediately and issue a direct complaint about it > without even recursing into it. > > regards, tom lane > - Heikki ---------------------------(end of broadcast)--------------------------- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq |
| ||||
| Heikki Linnakangas <hlinnaka@iki.fi> writes: > Maybe we should take a different approach to the problem: > 1. Create new file with an extension to mark that it's not > yet committed (eg. 1234.notcommitted) This is pushing the problem into the wrong place, viz the lowest-level file access routines, which will now all have to know about ..notcommitted status. It also creates race conditions --- think about backend A trying to commit file 1234 at about the same time that backend B is trying to flush some dirty buffers belonging to that file. But most importantly, it doesn't handle the file-deletion case. regards, tom lane ---------------------------(end of broadcast)--------------------------- TIP 4: Don't 'kill -9' the postmaster |