This is a discussion on [PATCH] Lazy xid assignment V2 within the pgsql Hackers forums, part of the PostgreSQL category.
Florian G. Pflug wrote:
> August Zajonc wrote:
>> I'm confused about this.
>>
>> As long as we assert the rule that the file name can't change on the
>> move, then after commit the file can be in only one of two places.
>> The name of the file is known (ie, pg_class). The directories are
>> known. What needs to be carried forward past a checkpoint? We don't
>> even look at WAL, so checkpoints are irrelevant it seems
>> If there is a crash just after commit and before the move, no harm.
>> You just move on startup. If the move fails, no harm, you can emit
>> warning and open in /pending (or simply error, even easier).
> If you're going to open the file from /pending, what's the point of moving
> it in the first place?

Allow time for someone to sort out the disk situation without impacting things. Preserves the concept that after COMMIT a file exists on disk that is accessible, which is what I think people expect.

> The idea would have to be that you move on commit (Or on COMMIT-record
> replay, in case of a crash), and then, after recovering the whole wal,
> you could remove leftover files in /pending.

I think so. What I was thinking was:

Limit to moves from spclocation/file.new to spclocation/file. So given a pg_class filename the datafile can only have two possible names: relfilenode or relfilenode.new.

You commit, then move. If a crash occurs before commit, you leak a .new table. If the move fails after commit you emit a warning. The commit is still valid, because fopen can fall back to .new. The data is still there.

On crash recovery, move .new files that show up in pg_class to their proper name if the disk has been fixed and the move can succeed. For .new files that don't exist in pg_class after log replay, delete them.

Fallback open:

    fopen relfilenode
    if ENOENT
        fopen relfilenode.new
        if ENOENT
            error
        else
            emit warning

>
> So, what are you going to do if the move fails? You cannot roll back, and
> you cannot update the CLOG (because then others would see your new table,
> but no datafile). The only option is to PANIC. This will lead to a server
> restart, WAL recovery, and probably another PANIC once the COMMIT-record
> is replayed (Since the move probably still won't be possible).

That was the idea of the fallback generally: avoid this issue. You never roll back. If the datafile is not where expected, it can only be in one other place.

> It might be even worse - I'm not sure that a rename is an atomic operation
> on most filesystems. If it's not, then you might end up with two files if
> power fails *just* as you rename, or, worse with no file at all. Even a slight
> possibility of the second case seems unacceptable - it means losing
> a committed transaction.

Yes, atomic renames are an assumption.

>
> I agree that we should eventually find a way to guarantee either no file
> leakage, or at least an upper bound on the amount of wasted space. But
> doing so at the cost of PANICing if the move fails seems like a bad
> tradeoff...
>

Agreed...

> greetings, Florian Pflug

---------------------------(end of broadcast)---------------------------
TIP 2: Don't 'kill -9' the postmaster

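The fallback open and the crash-recovery pass described in this message could look roughly like the following. This is only a sketch in Python, not PostgreSQL code: the function names `open_with_fallback` and `recover_new_files` are made up for illustration, `known_relfilenodes` stands in for whatever pg_class says after log replay, and it assumes (as the message does) that a rename within one directory is atomic.

```python
import os
import warnings


def open_with_fallback(path, mode="rb"):
    """Open `path`; if it is missing, fall back to `path + '.new'` and warn."""
    try:
        return open(path, mode)
    except FileNotFoundError:
        pass
    # The only other place the data can be. If this is missing too,
    # FileNotFoundError propagates to the caller (the "error" branch).
    f = open(path + ".new", mode)
    warnings.warn(f"{path} is missing, opened {path}.new instead")
    return f


def recover_new_files(directory, known_relfilenodes):
    """After log replay: promote referenced .new files, delete unreferenced ones.

    `known_relfilenodes` is a hypothetical set of file names taken from pg_class.
    """
    promoted, deleted = [], []
    for name in sorted(os.listdir(directory)):
        if not name.endswith(".new"):
            continue
        base = name[: -len(".new")]
        src = os.path.join(directory, name)
        if base in known_relfilenodes:
            try:
                os.replace(src, os.path.join(directory, base))
                promoted.append(base)
            except OSError:
                # Disk still broken: keep the .new file, the fallback open covers it.
                warnings.warn(f"could not move {src} into place")
        else:
            os.remove(src)  # never committed, so it is a leaked file
            deleted.append(base)
    return promoted, deleted
```

Note that the sketch only covers the file-system side; it says nothing about bufmgr or bgwriter holding the old name.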
Tom Lane wrote:
> What I was thinking about was a "flag file" separate from the data file
> itself, a bit like what we use for archiver signaling. If nnnn is the
> new data file, then "touch nnnn.new" to mark the file as needing to be
> deleted on restart. Remove these files just *before* commit. This
> leaves you with a narrow window between removing the flag file and
> actually committing, but there's no risk of having to PANIC --- if the
> remove fails, you just abort the transaction.
>
> However, this has nonzero overhead compared to the current behavior.
> I'm still dubious that we have a problem that needs solving ...
>
> regards, tom lane

Maybe just get back to the beginning:

We are concerned about leaking files on *crashes*.

So during recovery, the only time we are going to care, iterate through the list of files that should exist (pg_class) after replay of the log.

Any files that don't belong (but fit the standard naming conventions) get moved to spclocation/trash or similar.

The admin who had critical data they needed to recover, and who is going to spend the time going to the disk to pull it out, can still access it. This also saves admins who create table-like filenames in tablespaces.

Most other folks can just dump /trash.

Does this avoid the data-loss worries?

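A rough sketch of that recovery-time pass, in Python rather than backend C. Everything named here is an assumption for illustration: `move_orphans_to_trash` is a made-up function, `referenced` stands in for the relfilenode values found in pg_class after replay, and the "standard naming convention" is taken to be an all-digit name with an optional numeric segment suffix.

```python
import os
import re
import shutil

# Assumed naming convention: digits, optionally followed by ".<segment number>".
RELFILE_RE = re.compile(r"^\d+(\.\d+)?$")


def move_orphans_to_trash(directory, referenced, trash_name="trash"):
    """Move data-file-looking files not listed in `referenced` into a trash dir."""
    trash = os.path.join(directory, trash_name)
    moved = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        # Leave alone anything that does not look like a data file,
        # e.g. files an admin put into the tablespace by hand.
        if not os.path.isfile(path) or not RELFILE_RE.match(name):
            continue
        relfilenode = name.split(".")[0]
        if relfilenode in referenced:
            continue
        os.makedirs(trash, exist_ok=True)
        shutil.move(path, os.path.join(trash, name))
        moved.append(name)
    return moved
```

Nothing is deleted here; reclaiming the space is left to whoever empties the trash directory.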
Tom Lane wrote:
> "Florian G. Pflug" <fgp@phlo.org> writes:
>> It might be even worse - I'm not sure that a rename is an atomic operation
>> on most filesystems.
>
> rename(2) is specified to be atomic by POSIX, but relinking a file into
> a different directory can hardly be --- it's not even provided as a
> single kernel call, is it?

I'd have thought that they only guarantee that if the new name already exists it's atomically replaced. But I might be wrong....

> And there's still the problem that changing the filename on-the-fly is
> going to break tons of low-level stuff, most of which is not supposed to
> know about transactions at all, notably bgwriter.

Good point - I thought that we wouldn't have to care about this because we could close the relation before renaming in the committing backend and be done with it, because other backends won't see the new file before we update the clog. But you're right, bgwriter is a problem and one not easily solved...

So that rename-on-commit idea seems to be quite dead...

> What I was thinking about was a "flag file" separate from the data file
> itself, a bit like what we use for archiver signaling. If nnnn is the
> new data file, then "touch nnnn.new" to mark the file as needing to be
> deleted on restart. Remove these files just *before* commit. This
> leaves you with a narrow window between removing the flag file and
> actually committing, but there's no risk of having to PANIC --- if the
> remove fails, you just abort the transaction.

Hm.. we could call the file "nnn.xid.new", and delete it after the commit, silently ignoring any failures. During both database-wide VACUUM and after recovery we'd remove any leftover *.xid.new files, but only if the xid is marked committed in the clog. After that cleanup step, we'd delete any files which still have an associated flag file.

Processing those nnn.xid.new files during VACUUM is just needed to avoid any problems because of xid wraparound - it could maybe be replaced by naming the file nnn.epoch.xid.new

> However, this has nonzero overhead compared to the current behavior.
> I'm still dubious that we have a problem that needs solving ...

I agree that file leakage is not a critical problem - if it were, there'd be many more complaints... But it's still something that a postgres DBA has to be aware of, because it might bite you quite badly.

Since IMHO admin friendliness is one of the strengths of postgresql, removing the possibility of leakage would be nice in the long term. Nothing that needs any rushing, though - and nothing that we'd want to pay for in terms of performance.

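To make the "nnn.xid.new" variant concrete, here is a small Python sketch of the after-recovery case, when no transaction is still in progress. It is illustrative only: the function names are invented, `is_committed` is a stand-in for a clog lookup, and two details are assumptions not stated above, namely that the flag file is created before the data file and that the flag is removed once its data file has been deleted.

```python
import os


def flag_path(datafile, xid):
    """Flag file name for a data file created by transaction `xid`."""
    return f"{datafile}.{xid}.new"


def create_relation_file(directory, relfilenode, xid):
    """Create the flag file, then the data file (ordering is an assumption)."""
    data = os.path.join(directory, str(relfilenode))
    open(flag_path(data, xid), "w").close()
    open(data, "w").close()
    return data


def remove_flag_after_commit(data, xid):
    """Called after commit; failures are silently ignored."""
    try:
        os.remove(flag_path(data, xid))
    except OSError:
        pass


def cleanup(directory, is_committed):
    """Post-recovery cleanup. `is_committed(xid) -> bool` models the clog."""
    removed_flags, removed_data = [], []
    for name in sorted(os.listdir(directory)):
        parts = name.split(".")
        if len(parts) != 3 or parts[2] != "new" or not parts[1].isdigit():
            continue
        relfilenode, xid = parts[0], int(parts[1])
        flag = os.path.join(directory, name)
        if is_committed(xid):
            # Commit happened, only the flag removal was lost: keep the data.
            os.remove(flag)
            removed_flags.append(name)
        else:
            # Creating transaction never committed: the data file is a leak.
            data = os.path.join(directory, relfilenode)
            if os.path.exists(data):
                os.remove(data)
            os.remove(flag)
            removed_data.append(relfilenode)
    return removed_flags, removed_data
```

Running the same pass during VACUUM would additionally have to skip flag files whose xid is still in progress, which this sketch does not model.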
"Florian G. Pflug" <fgp@phlo.org> writes:
> Tom Lane wrote:
>> rename(2) is specified to be atomic by POSIX, but relinking a file into
>> a different directory can hardly be --- it's not even provided as a
>> single kernel call, is it?
> I'd have thought that they only guarantee that if the new name already
> exists it's atomically replaced. But I might be wrong....

I reread the spec and realized that rename() does include moving a link into a different directory --- but it only promises that replacement of the target filename is atomic, not that (say) the link couldn't appear in both directories concurrently. Also it's not clear that the spec intends to make any hard guarantees about the filesystem state after crash-and-recovery.

In any case I don't think we can make renaming of active data files work --- bufmgr and bgwriter need those file names to be stable. The flag-file approach seems more promising.

There's also the plan B of scanning pg_class to decide which relfilenode values are legit. IIRC Bruce did up a patch for this about a year ago, which I vetoed because I was afraid of the consequences if it removed data that someone really needed. Someone just mentioned doing the same thing but pushing the unreferenced files into a "trash" directory instead of actually deleting them. While that answers the risk-of-data-loss objection, I'm not sure it does much for the goal of avoiding useless space consumption: how many DBAs will faithfully examine and clean out that trash directory?

regards, tom lane

Tom Lane wrote:
> There's also the plan B of scanning pg_class to decide which relfilenode
> values are legit. IIRC Bruce did up a patch for this about a year ago,
> which I vetoed because I was afraid of the consequences if it removed
> data that someone really needed.

I posted a patch like that, 2-3 years ago I think. IIRC, the consensus back then was to just write a log message of the stale files, so an admin can go and delete them manually. That's safer than just deleting them, and we'll get an idea of how much of a problem this is in practice; at the moment a DBA has no way to know if there's some leaked space, except doing a manual compare of pg_class and filesystem. If it turns out to be reliable enough, and the problem big enough, we might start deleting the files automatically in future releases.

I never got around to fixing the issues with the patch, but it's been tickling me a bit for all these years.

> Someone just mentioned doing the same
> thing but pushing the unreferenced files into a "trash" directory
> instead of actually deleting them. While that answers the
> risk-of-data-loss objection, I'm not sure it does much for the goal of
> avoiding useless space consumption: how many DBAs will faithfully
> examine and clean out that trash directory?

That sounds like a good idea to me. If your DBA finds himself running out of disk space unexpectedly, he'll start looking around. Doing a "rm trash/*" surely seems easier and safer than deleting individual files from base.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

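The log-only variant, i.e. the "manual compare of pg_class and filesystem" done by a program, might be sketched like this in Python. It is not the patch mentioned above: `report_stale_files` and the logger name are invented, `referenced` is a hypothetical set of relfilenode values read from pg_class, and only files whose name starts with an all-digit component are considered.

```python
import logging
import os

log = logging.getLogger("stale_relfiles")  # logger name is arbitrary


def report_stale_files(directory, referenced):
    """Log and return (name, size) for data-like files not in `referenced`."""
    stale = []
    for name in sorted(os.listdir(directory)):
        relfilenode = name.split(".")[0]
        if not relfilenode.isdigit():
            continue  # not a relation data file under the assumed convention
        if relfilenode in referenced:
            continue
        size = os.path.getsize(os.path.join(directory, name))
        log.warning("file %s (%d bytes) is not referenced in pg_class", name, size)
        stale.append((name, size))
    return stale
```

Summing the returned sizes gives the amount of leaked space, which is the number a DBA currently has no easy way to obtain.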
"Heikki Linnakangas" <heikki@enterprisedb.com> writes:
> Tom Lane wrote:
>> Someone just mentioned doing the same
>> thing but pushing the unreferenced files into a "trash" directory
>> instead of actually deleting them.

> That sounds like a good idea to me.

It suddenly strikes me that there's lots of precedent for it: fsck moves unreferenced files into lost+found, instead of just deleting them.

regards, tom lane

Tom Lane wrote:
> There's also the plan B of scanning pg_class to decide which relfilenode
> values are legit. IIRC Bruce did up a patch for this about a year ago,
> which I vetoed because I was afraid of the consequences if it removed
> data that someone really needed. Someone just mentioned doing the same
> thing but pushing the unreferenced files into a "trash" directory
> instead of actually deleting them. While that answers the
> risk-of-data-loss objection, I'm not sure it does much for the goal of
> avoiding useless space consumption: how many DBAs will faithfully
> examine and clean out that trash directory?

For the admin who for some reason deletes critical input data before seeing a COMMIT return from postgresql, they can probably keep the files.

The thing is, the leak occurs in a situation where a COMMIT hasn't returned to the user, so we are trying to guarantee no data loss even when the user doesn't see a successful commit? That's a tall order obviously, and hopefully people design their apps to attend to transaction success / failure.

Plan B certainly won't take more space, and is probably the easiest to clean up.

August Zajonc wrote:
> The thing is, the leak occurs in situation where a COMMIT hasn't
> returned to the user, so we are trying to guarantee no data-loss even
> when the user doesn't see a successful commit? That's a tall order
> obviously and hopefully people design their apps to attend to
> transaction success / failure.

No, by the time we'd delete/move to trash the files, we know that they're no longer needed. But because deleting a relation is such a drastic thing to do, with potential for massive data loss, we're being extra paranoid. What if there's a bug in PostgreSQL that makes us delete the wrong file? Or transaction wraparound happens, despite the protections that we now have in place? Or you have bad RAM in the server, and a bit flips at an unfortunate place? Those are the kinds of situations we're worried about.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
|
|