[PATCH] Lazy xid assingment V2

  #21 (permalink)  
Old 04-15-2008, 09:47 PM
August Zajonc
 
Default Re: [PATCH] Lazy xid assingment V2

Florian G. Pflug wrote:
> August Zajonc wrote:
>> I'm confused about this.
>>
>> As long as we assert the rule that the file name can't change on the
>> move, then after commit the file can be in only one of two places.
>> The name of the file is known (ie, pg_class). The directories are
>> known. What needs to be carried forward past a checkpoint? We don't
>> even look at WAL, so checkpoints are irrelevant, it seems.
>> If there is a crash just after commit and before the move, no harm.
>> You just move on startup. If the move fails, no harm, you can emit
>> warning and open in /pending (or simply error, even easier).

> If you're going to open the file from /pending, what's the point of moving
> it in the first place?

It allows time for someone to sort out the disk situation without
impacting things. It also preserves the concept that after COMMIT an
accessible file exists on disk, which I think is what people expect.
> The idea would have to be that you move on commit (Or on COMMIT-record
> replay, in case of a crash), and then, after recovering the whole wal,
> you could remove leftover files in /pending.

I think so. What I was thinking was:

Limit moves to spclocation/file.new -> spclocation/file. So given a
pg_class filename, the datafile can only have two possible names:
relfilenode or relfilenode.new.

You commit, then move.

If crash occurs before commit, you leak a .new table.

If move fails after commit you emit warning. The commit is still valid,
because fopen can fall back to .new. The data is still there.

On crash recovery move .new files that show up in pg_class to their
proper name if the disk has been fixed and move can succeed. For .new
files that don't exist in pg_class after log replay, delete them.
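The recovery rule above could be sketched roughly as follows. This is only an illustration in Python, not PostgreSQL code: `recover_new_files` is a hypothetical helper, and `live_relfilenodes` stands in for the set of relfilenode values found in pg_class once log replay has finished.

```python
import os
from pathlib import Path


def recover_new_files(spclocation: Path, live_relfilenodes: set[str]) -> dict[str, list[str]]:
    """Resolve leftover '<relfilenode>.new' files after log replay.

    live_relfilenodes is an assumption of this sketch: it models the
    relfilenode values visible in pg_class after recovery.
    """
    result = {"moved": [], "kept_as_new": [], "deleted": []}
    for new_path in sorted(spclocation.glob("*.new")):
        relfilenode = new_path.name[: -len(".new")]
        if relfilenode not in live_relfilenodes:
            # The creating transaction never committed: the file is a leak.
            new_path.unlink()
            result["deleted"].append(relfilenode)
            continue
        try:
            # Committed but not yet moved: finish the rename now.
            os.rename(new_path, spclocation / relfilenode)
            result["moved"].append(relfilenode)
        except OSError:
            # Disk still not fixed: leave the file for the fallback open.
            result["kept_as_new"].append(relfilenode)
    return result
```

A failed rename is not treated as an error here, which matches the idea that the commit stays valid as long as the data can still be found under the .new name.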

Fallback open:

fopen relfilenode
if ENOENT, fopen relfilenode.new
    if ENOENT again, error
    else emit warning
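As a runnable illustration of that fallback, here is a minimal Python sketch. The function name `open_relation` and the warning text are made up for the example; the real backend would of course do this in C inside the storage manager.

```python
import warnings


def open_relation(path: str, mode: str = "rb"):
    """Open a relation file, falling back to '<path>.new' if it is missing."""
    try:
        return open(path, mode)
    except FileNotFoundError:
        pass  # the post-commit move may not have happened yet
    try:
        handle = open(path + ".new", mode)
    except FileNotFoundError:
        # Neither name exists: this is the hard-error case.
        raise FileNotFoundError(f"neither {path} nor {path}.new exists") from None
    warnings.warn(f"{path} not found; using {path}.new (post-commit move pending)")
    return handle
```

The warning is only emitted when the .new file is actually used, so a healthy system stays quiet.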

>
> So, what are you going to do if the move fails? You cannot roll back, and
> you cannot update the CLOG (because then others would see your new table,
> but no datafile). The only option is to PANIC. This will lead to a server
> restart, WAL recovery, and probably another PANIC once the COMMIT-record
> is replayed (Since the move probably still won't be possible).

That was the idea of the fallback generally: avoid this issue. You never
roll back. If the datafile is not where expected, it can only be in one
other place.
> It might be even worse - I'm not sure that a rename is an atomic operation
> on most filesystems. If it's not, then you might end up with two files if
> power fails *just* as you rename, or, worse, with no file at all. Even a
> slight possibility of the second case seems unacceptable - it means losing
> a committed transaction.

Yes, atomic renames are an assumption.
>
> I agree that we should eventually find a way to guarantee either no file
> leakage, or at least an upper bound on the amount of wasted space. But
> doing so at the cost of PANICing if the move fails seems like a bad
> tradeoff...
>

Agreed...
> greetings, Florian Pflug



---------------------------(end of broadcast)---------------------------
TIP 2: Don't 'kill -9' the postmaster

  #22 (permalink)  
Old 04-15-2008, 09:47 PM
August Zajonc
 
Default Re: [PATCH] Lazy xid assingment V2

Tom Lane wrote:
> What I was thinking about was a "flag file" separate from the data file
> itself, a bit like what we use for archiver signaling. If nnnn is the
> new data file, then "touch nnnn.new" to mark the file as needing to be
> deleted on restart. Remove these files just *before* commit. This
> leaves you with a narrow window between removing the flag file and
> actually committing, but there's no risk of having to PANIC --- if the
> remove fails, you just abort the transaction.
>
> However, this has nonzero overhead compared to the current behavior.
> I'm still dubious that we have a problem that needs solving ...
>
> regards, tom lane
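
The flag-file protocol quoted above might look something like the following Python sketch. All three function names are hypothetical, and creating the flag before the data file is an assumption of the sketch (the quoted text only says to touch nnnn.new for a new data file).

```python
import os


def create_relation_file(path: str) -> None:
    """Create the flag first, then the data file (ordering is an assumption)."""
    open(path + ".new", "x").close()
    open(path, "x").close()


def prepare_commit(path: str) -> bool:
    """Remove the flag just before commit; False means abort the transaction."""
    try:
        os.remove(path + ".new")
        return True
    except OSError:
        return False


def cleanup_on_restart(directory: str) -> list[str]:
    """Delete every data file that still has a flag file, plus the flag itself."""
    removed = []
    for name in sorted(os.listdir(directory)):
        if not name.endswith(".new"):
            continue
        data_name = name[: -len(".new")]
        data_path = os.path.join(directory, data_name)
        if os.path.exists(data_path):
            os.remove(data_path)
        os.remove(os.path.join(directory, name))
        removed.append(data_name)
    return removed
```

The narrow window mentioned in the quote sits between `prepare_commit` returning True and the commit record being written: a crash there leaves a data file with no flag, which this cleanup will not find.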

Maybe just get back to the beginning:

We are concerned about leaking files on *crashes*.

So during recovery, which is the only time we are going to care, iterate
through the list of files that should exist (per pg_class) after replay
of the log.

Any files that don't belong (but fit the standard naming conventions)
move to spclocation/trash or similar.

For the admin who had critical data they needed to recover and is going
to spend the time going to the disk to pull it out, they can still
access it. This also saves admins who create table-like filenames in
tablespaces.

For most other folks they can dump /trash.

Does this avoid the dataloss worries?
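
A rough Python sketch of that sweep, to make the proposal concrete. `sweep_to_trash`, the `trash` directory name, and the filename pattern are assumptions for illustration; `live_relfilenodes` again stands in for what pg_class says should exist after replay.

```python
import re
import shutil
from pathlib import Path

# Assumption: relation files are named "<digits>" with an optional ".<segment>".
RELFILE_PATTERN = re.compile(r"^(\d+)(\.\d+)?$")


def sweep_to_trash(spclocation: Path, live_relfilenodes: set[str]) -> list[str]:
    """Move relation-like files that pg_class does not reference into trash/."""
    trash = spclocation / "trash"
    moved = []
    for entry in sorted(spclocation.iterdir()):
        match = RELFILE_PATTERN.match(entry.name)
        if match is None or not entry.is_file():
            continue  # does not fit the naming convention: leave it alone
        if match.group(1) in live_relfilenodes:
            continue  # referenced by pg_class: keep
        trash.mkdir(exist_ok=True)
        shutil.move(str(entry), str(trash / entry.name))
        moved.append(entry.name)
    return moved
```

Files that do not match the naming convention are never touched, so unrelated files an admin dropped into the tablespace stay where they are.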


  #23 (permalink)  
Old 04-15-2008, 09:47 PM
Florian G. Pflug
 
Default Re: [PATCH] Lazy xid assingment V2

Tom Lane wrote:
> "Florian G. Pflug" <fgp@phlo.org> writes:
>> It might be even worse - I'm not sure that a rename is an atomic operation
>> on most filesystems.

>
> rename(2) is specified to be atomic by POSIX, but relinking a file into
> a different directory can hardly be --- it's not even provided as a
> single kernel call, is it?

I'd have thought that they only guarantee that if the new name already
exists it's atomically replaced. But I might be wrong....

> And there's still the problem that changing the filename on-the-fly is
> going to break tons of low-level stuff, most of which is not supposed to
> know about transactions at all, notably bgwriter.

Good point - I thought that we wouldn't have to care about this because
we could close the relation before renaming in the committing backend
and be done with it, because other backends won't see the new file
before we update the clog. But you're right, bgwriter is a problem
and one not easily solved...

So that rename-on-commit idea seems to be quite dead...

> What I was thinking about was a "flag file" separate from the data file
> itself, a bit like what we use for archiver signaling. If nnnn is the
> new data file, then "touch nnnn.new" to mark the file as needing to be
> deleted on restart. Remove these files just *before* commit. This
> leaves you with a narrow window between removing the flag file and
> actually committing, but there's no risk of having to PANIC --- if the
> remove fails, you just abort the transaction.

Hm.. we could call the file "nnn.xid.new", and delete it after the commit,
silently ignoring any failures. During both database-wide VACUUM and
after recovery we'd remove any leftover *.xid.new files, but only
if the xid is marked committed in the clog. After that cleanup step,
we'd delete any files which still have an associated flag file.

Processing those nnn.xid.new files during VACUUM is only needed to
avoid problems caused by xid wraparound - it could perhaps be replaced
by naming the file nnn.epoch.xid.new
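
To illustrate the nnn.xid.new variant, here is a small Python sketch of the post-recovery cleanup. It is only a model: `cleanup_flag_files` is a hypothetical name, and `is_committed` is a stand-in callable for a clog lookup. The two steps described above are folded into one pass, and the sketch assumes no transaction is in progress, as is the case right after recovery.

```python
import os
import re

# Flag files are assumed to be named "<relfilenode>.<xid>.new".
FLAG_PATTERN = re.compile(r"^(\d+)\.(\d+)\.new$")


def cleanup_flag_files(directory: str, is_committed) -> dict[str, list[str]]:
    """Drop stale flags of committed xids; delete relations still flagged."""
    result = {"flags_removed": [], "relations_removed": []}
    for name in sorted(os.listdir(directory)):
        match = FLAG_PATTERN.match(name)
        if match is None:
            continue
        relfilenode, xid = match.group(1), int(match.group(2))
        flag_path = os.path.join(directory, name)
        if is_committed(xid):
            # Creator committed: only the leftover flag needs to go.
            os.remove(flag_path)
            result["flags_removed"].append(name)
        else:
            # Flag survives the first step: the data file is a leak.
            data_path = os.path.join(directory, relfilenode)
            if os.path.exists(data_path):
                os.remove(data_path)
            os.remove(flag_path)
            result["relations_removed"].append(relfilenode)
    return result
```

Running the second branch during a normal VACUUM would need an extra check for in-progress transactions, which this sketch deliberately leaves out.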

> However, this has nonzero overhead compared to the current behavior.
> I'm still dubious that we have a problem that needs solving ...

I agree that file leakage is not a critical problem - if it were, there'd
be many more complaints...

But it's still something that a postgres DBA has to be aware of, because
it might bite you quite badly. Since IMHO admin friendliness is one of
the strengths of postgresql, removing the possibility of leakage would be
nice in the long term.

Nothing that needs any rushing, though - and nothing that we'd want to pay
for in terms of performance.



  #24 (permalink)  
Old 04-15-2008, 09:47 PM
Tom Lane
 
Default Re: [PATCH] Lazy xid assingment V2

"Florian G. Pflug" <fgp@phlo.org> writes:
> Tom Lane wrote:
>> rename(2) is specified to be atomic by POSIX, but relinking a file into
>> a different directory can hardly be --- it's not even provided as a
>> single kernel call, is it?


> I'd have thought that they only guarantee that if the new name already
> exists it's atomically replaced. But I might be wrong....


I reread the spec and realized that rename() does include moving a link
into a different directory --- but it only promises that replacement of
the target filename is atomic, not that (say) the link couldn't appear
in both directories concurrently. Also it's not clear that the spec
intends to make any hard guarantees about the filesystem state after
crash-and-recovery.

In any case I don't think we can make renaming of active data files work
--- bufmgr and bgwriter need those file names to be stable. The
flag-file approach seems more promising.

There's also the plan B of scanning pg_class to decide which relfilenode
values are legit. IIRC Bruce did up a patch for this about a year ago,
which I vetoed because I was afraid of the consequences if it removed
data that someone really needed. Someone just mentioned doing the same
thing but pushing the unreferenced files into a "trash" directory
instead of actually deleting them. While that answers the
risk-of-data-loss objection, I'm not sure it does much for the goal of
avoiding useless space consumption: how many DBAs will faithfully
examine and clean out that trash directory?

regards, tom lane

  #25 (permalink)  
Old 04-15-2008, 09:47 PM
Heikki Linnakangas
 
Default Re: [PATCH] Lazy xid assingment V2

Tom Lane wrote:
> There's also the plan B of scanning pg_class to decide which relfilenode
> values are legit. IIRC Bruce did up a patch for this about a year ago,
> which I vetoed because I was afraid of the consequences if it removed
> data that someone really needed.


I posted a patch like that, 2-3 years ago I think. IIRC, the consensus
back then was to just write a log message of the stale files, so an
admin can go and delete them manually. That's safer than just deleting
them, and we'll get an idea of how much of a problem this is in
practice; at the moment a DBA has no way to know if there's some leaked
space, except doing a manual compare of pg_class and filesystem. If it
turns out to be reliable enough, and the problem big enough, we might
start deleting the files automatically in future releases.
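
The log-only approach could be modelled like this in Python. It is a sketch under assumptions, not the patch itself: `report_stale_files`, the logger name, and the filename pattern are invented for the example, and `live_relfilenodes` stands in for the relfilenode values read from pg_class.

```python
import logging
import os
import re

logger = logging.getLogger("stale_relfiles")

# Assumption: relation files are named "<digits>" with an optional ".<segment>".
RELFILE_PATTERN = re.compile(r"^(\d+)(\.\d+)?$")


def report_stale_files(directory: str, live_relfilenodes: set[str]) -> list[str]:
    """Log relation-like files with no pg_class entry; never delete anything."""
    stale = []
    for name in sorted(os.listdir(directory)):
        match = RELFILE_PATTERN.match(name)
        if match is None or match.group(1) in live_relfilenodes:
            continue
        path = os.path.join(directory, name)
        if os.path.isfile(path):
            logger.warning(
                "stale file %s (%d bytes) has no pg_class entry",
                path,
                os.path.getsize(path),
            )
            stale.append(name)
    return stale
```

Because nothing is removed, a false positive costs only a misleading log line, which is the safety argument made above.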

I never got around to fixing the issues with the patch, but it's been
tickling me a bit for all these years.

> Someone just mentioned doing the same
> thing but pushing the unreferenced files into a "trash" directory
> instead of actually deleting them. While that answers the
> risk-of-data-loss objection, I'm not sure it does much for the goal of
> avoiding useless space consumption: how many DBAs will faithfully
> examine and clean out that trash directory?


That sounds like a good idea to me. If your DBA finds himself running out
of disk space unexpectedly, he'll start looking around. Doing a "rm
trash/*" surely seems easier and safer than deleting individual files
from base.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

  #26 (permalink)  
Old 04-15-2008, 09:47 PM
Tom Lane
 
Default Re: [PATCH] Lazy xid assingment V2

"Heikki Linnakangas" <heikki@enterprisedb.com> writes:
> Tom Lane wrote:
>> Someone just mentioned doing the same
>> thing but pushing the unreferenced files into a "trash" directory
>> instead of actually deleting them.


> That sounds like a good idea to me.


It suddenly strikes me that there's lots of precedent for it: fsck moves
unreferenced files into lost+found, instead of just deleting them.

regards, tom lane

  #27 (permalink)  
Old 04-15-2008, 09:47 PM
August Zajonc
 
Default Re: [PATCH] Lazy xid assingment V2

Tom Lane wrote:
> There's also the plan B of scanning pg_class to decide which relfilenode
> values are legit. IIRC Bruce did up a patch for this about a year ago,
> which I vetoed because I was afraid of the consequences if it removed
> data that someone really needed. Someone just mentioned doing the same
> thing but pushing the unreferenced files into a "trash" directory
> instead of actually deleting them. While that answers the
> risk-of-data-loss objection, I'm not sure it does much for the goal of
> avoiding useless space consumption: how many DBAs will faithfully
> examine and clean out that trash directory?
>
>

For the admin who for some reason deletes critical input data before
seeing a COMMIT return from PostgreSQL, they can probably keep the files.

The thing is, the leak occurs in a situation where a COMMIT hasn't
returned to the user, so are we trying to guarantee no data loss even
when the user doesn't see a successful commit? That's a tall order,
obviously, and hopefully people design their apps to attend to
transaction success / failure.

Plan B certainly won't take more space, and is probably the easiest to
clean up.








  #28 (permalink)  
Old 04-15-2008, 09:48 PM
Heikki Linnakangas
 
Default Re: [PATCH] Lazy xid assingment V2

August Zajonc wrote:
> The thing is, the leak occurs in a situation where a COMMIT hasn't
> returned to the user, so we are trying to guarantee no data-loss even
> when the user doesn't see a successful commit? That's a tall order
> obviously and hopefully people design their apps to attend to
> transaction success / failure.


No, by the time we'd delete/move to trash the files, we know that
they're no longer needed. But because deleting a relation is such a
drastic thing to do, with potential for massive data loss, we're being
extra paranoid. What if there's a bug in PostgreSQL that makes us delete
the wrong file? Or transaction wraparound happens, despite the
protections that we now have in place? Or you have bad RAM in the
server, and a bit flips at an unfortunate place? Those are the kinds of
situations we're worried about.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
