Unix Technical Forum

Proposal: Commit timestamp

This is a discussion on Proposal: Commit timestamp within the pgsql Hackers forums, part of the PostgreSQL category; --> On 1/25/2007 11:41 PM, Bruce Momjian wrote: > Jan Wieck wrote: >> On 1/25/2007 6:49 PM, Tom Lane wrote: ...


Go Back   Unix Technical Forum > Database Server Software > PostgreSQL > pgsql Hackers

FAQ Members List Calendar Search Today's Posts Mark Forums Read
  #11 (permalink)  
Old 04-12-2008, 06:41 AM
Jan Wieck
 
Posts: n/a
Default Re: Proposal: Commit timestamp

On 1/25/2007 11:41 PM, Bruce Momjian wrote:
> Jan Wieck wrote:
>> On 1/25/2007 6:49 PM, Tom Lane wrote:
>> > Jan Wieck <JanWieck@Yahoo.com> writes:
>> >> To provide this data, I would like to add another "log" directory,
>> >> pg_tslog. The files in this directory will be similar to the clog, but
>> >> contain arrays of timestamptz values.
>> >
>> > Why should everybody be made to pay this overhead?

>>
>> It could be made an initdb time option. If you intend to use a product
>> that requires this feature, you will be willing to pay that price.

>
> That is going to cut your usage by like 80%. There must be a better
> way.


I'd love to.

But it is a datum that needs to be collected at the moment where
basically the clog entry is made ... I don't think any external module
can do that ever.

You know how long I've been in and out and back into replication again.
The one thing that pops up again and again in all the scenarios is "what
the heck was the commit order?". Now the pure commit order for a single
node could certainly be recorded from a sequence, but that doesn't cover
the multi-node environment I am after. That's why I want it to be a
timestamp with a few fudged bits at the end. If you look at what I've
described, you will notice that as long as all node priorities are
unique, this timestamp will be a globally unique ID in a somewhat
ascending order along a timeline. That is what replication people are
looking for.

Tom fears that the overhead is significant, which I do understand and
frankly, wonder myself about (actually I don't even have a vague
estimate). I really think we should make this thing an initdb option and
decide later if it's on or off by default. Probably we can implement it
even in a way that one can turn it on/off and a postmaster restart plus
waiting the desired freeze-delay would do.

What I know for certain is that no async replication system can ever do
without the commit timestamp information. Using the transaction start
time or even the single statements timeofday will only lead to
inconsistencies all over the place (I haven't been absent from the
mailing lists for the past couple of month hiding in my closet ... I've
been experimenting and trying to get around all these issues - in my
closet). Slony-I can survive without that information because everything
happens on one node and we record snapshot information for later abusal.
But look at what cost we are dealing with this rather trivial issue. All
we need to know is the serializable commit order. And we have to issue
queries that eventually might exceed address space limits?


Jan

--
#================================================= =====================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================= = JanWieck@Yahoo.com #

---------------------------(end of broadcast)---------------------------
TIP 4: Have you searched our list archives?

http://archives.postgresql.org

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #12 (permalink)  
Old 04-12-2008, 06:41 AM
Naz Gassiep
 
Posts: n/a
Default Re: Proposal: Commit timestamp

I would be *very* concerned that system time is not a guaranteed
monotonic entity. Surely a counter or other internally managed mechanism
would be a better solution.

Furthermore, what would be the ramifications of master and slave system
times being out of sync?

Finally what if system time is rolled forward a few minutes as part of a
correction and there were transactions completed in that time? There is
a change, albeit small, that two transactions will have the same
timestamp. More importantly, this will throw all kinds of issues in when
the slave sees transactions in the future. Even with regular NTP syncs,
drift can cause a clock to be rolled forward a few milliseconds,
possibly resulting in duplicate transaction IDs.

In summary, I don't think the use of system time has any place in
PostgreSQL's internal consistency mechanisms, it is too unreliable an
environment property. Why can't a counter be used for this instead?

- Naz.

Jan Wieck wrote:
> For a future multimaster replication system, I will need a couple of
> features in the PostgreSQL server itself. I will submit separate
> proposals per feature so that discussions can be kept focused on one
> feature per thread.
>
> For conflict resolution purposes in an asynchronous multimaster
> system, the "last update" definition often comes into play. For this
> to work, the system must provide a monotonically increasing timestamp
> taken at the commit of a transaction. During replication, the
> replication process must be able to provide the remote nodes timestamp
> so that the replicated data will be "as of the time it was written on
> the remote node", and not the current local time of the replica, which
> is by definition of "asynchronous" later.
>
> To provide this data, I would like to add another "log" directory,
> pg_tslog. The files in this directory will be similar to the clog, but
> contain arrays of timestamptz values. On commit, the current system
> time will be taken. As long as this time is lower or equal to the last
> taken time in this PostgreSQL instance, the value will be increased by
> one microsecond. The resulting time will be added to the commit WAL
> record and written into the pg_tslog file.
>
> If a per database configurable tslog_priority is given, the timestamp
> will be truncated to milliseconds and the increment logic is done on
> milliseconds. The priority is added to the timestamp. This guarantees
> that no two timestamps for commits will ever be exactly identical,
> even across different servers.
>
> The COMMIT syntax will get extended to
>
> COMMIT [TRANSACTION] [WITH TIMESTAMP <timestamptz>];
>
> The extension is limited to superusers and will override the normally
> generated commit timestamp. This will be used to give the replicating
> transaction on the replica the exact same timestamp it got on the
> originating master node.
>
> The pg_tslog segments will be purged like the clog segments, after all
> transactions belonging to them have been stamped frozen. A frozen xid
> by definition has a timestamp of epoch. To ensure a system using this
> timestamp feature has enough time to perform its work, a new GUC
> variable defining an interval will prevent vacuum from freezing xid's
> that are younger than that.
>
> A function get_commit_timestamp(xid) returning timpstamptz will return
> the commit time of a transaction as recorded by this feature.
>
>
> Comments, changes, additions?
>
> Jan
>


---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
choose an index scan if your joining column's datatypes do not
match

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #13 (permalink)  
Old 04-12-2008, 06:41 AM
Markus Schiltknecht
 
Posts: n/a
Default Re: Proposal: Commit timestamp

Hi,

Jan Wieck wrote:
> The replication system I have in mind will have another field type of
> the balance nature, where it will never communicate the current value
> but only deltas that get applied regardless of the two timestamps.


I'd favor a more generally usable conflict resolution function
interface, on top of which you can implement both, the "last update
wins" as well as the "balance" conflict resolution type.

Passing the last common ancestor and the two conflicting heads to the
conflict resolution function (CRF) should be enough. That would easily
allow to implement the "balance" type (as you can calculate both
deltas). And if you want to rely on something as arbitrary as a
timestamp, you'd simply have to add a timestamp column to your table and
let the CRF decide uppon that.

This would allow pretty much any type of conflict resolution, for
example: higher priority cleanup transactions, which change lots of
tuples and should better not be aborted later on. Those could be
implemented by adding a priority column and having the CRF respect that
one, too.

To find the last common ancestor tuple, transaction ids and MVCC are
enough. You wouldn't need to add timestamps. You'd only have to make
sure VACUUM doesn't delete tuples you still need.

Regards

Markus

---------------------------(end of broadcast)---------------------------
TIP 3: Have you checked our extensive FAQ?

http://www.postgresql.org/docs/faq

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #14 (permalink)  
Old 04-12-2008, 06:41 AM
Heikki Linnakangas
 
Posts: n/a
Default Re: Proposal: Commit timestamp

Jan Wieck wrote:
> But it is a datum that needs to be collected at the moment where
> basically the clog entry is made ... I don't think any external module
> can do that ever.


How atomic does it need to be? External modules can register callbacks
that get called right after the clog update and removing the xid from
MyProc entry. That's about as close to making the clog entry you can get.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #15 (permalink)  
Old 04-12-2008, 06:42 AM
Theo Schlossnagle
 
Posts: n/a
Default Re: Proposal: Commit timestamp

Jan, et. al.,

On Jan 26, 2007, at 2:37 AM, Naz Gassiep wrote:
> I would be *very* concerned that system time is not a guaranteed
> monotonic entity. Surely a counter or other internally managed
> mechanism would be a better solution.


As you should be concerned. Looking on my desk through the last few
issues in IEEE Transactions on Parallel and Distributed Systems, I
see no time synch stuff for clusters of machines that is actually
based on time. Almost all rely on something like a Lamport timestamp
or some relaxation thereof. A few are based off a tree based pulse.
Using actual times is fraught with problems and is typically
inappropriate for cluster synchronization needs.

> Furthermore, what would be the ramifications of master and slave
> system times being out of sync?


I'm much more concerned with the overall approach. The algorithm for
replication should be published in theoretic style with a thorough
analysis of its assumptions and a proof of correctness based on those
assumptions. Databases and replication therein are definitely
technologies that aren't "off-the-cuff," and rigorous academic
discussion and acceptance before they will get adopted. People
generally will not adopt technologies to store mission critical data
until they are confident that it will both work as designed and work
as implemented -- the second is far less important as the weakness
there are simply bugs.

I'm not implying that this rigorous dissection of replication design
hasn't happened, but I didn't see it referenced anywhere in this
thread. Can you point me to it? I've reviewed many of these papers
and would like to better understand what you are aiming at.

Best regards,

Theo Schlossnagle



---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
choose an index scan if your joining column's datatypes do not
match

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #16 (permalink)  
Old 04-12-2008, 06:42 AM
Jan Wieck
 
Posts: n/a
Default Re: Proposal: Commit timestamp

On 1/26/2007 2:37 AM, Naz Gassiep wrote:
> I would be *very* concerned that system time is not a guaranteed
> monotonic entity. Surely a counter or other internally managed mechanism
> would be a better solution.


Such a counter has only "local" relevance. How do you plan to compare
the two separate counters on different machines to tell which
transaction happened last?

Even if the system clock isn't monotonically increasing, the described
increment system guarantees the timestamp used to appear so. Granted,
this system will not work too well on a platform that doesn't allow to
slew the system clock.

>
> Furthermore, what would be the ramifications of master and slave system
> times being out of sync?


The origin of a transaction must scan all tuples it updates and make
sure that the timestamp it uses for commit appears in the future with
respect to them.

>
> Finally what if system time is rolled forward a few minutes as part of a
> correction and there were transactions completed in that time? There is
> a change, albeit small, that two transactions will have the same
> timestamp. More importantly, this will throw all kinds of issues in when
> the slave sees transactions in the future. Even with regular NTP syncs,
> drift can cause a clock to be rolled forward a few milliseconds,
> possibly resulting in duplicate transaction IDs.
>
> In summary, I don't think the use of system time has any place in
> PostgreSQL's internal consistency mechanisms, it is too unreliable an
> environment property. Why can't a counter be used for this instead?


This is nothing used for PostgreSQL's consistency. It is a vehicle
intended to be used to synchronize the "last update wins" decision
process of an asynchronous multimaster system. If not with a timestamp,
how would you make sure that the replication processes of two different
nodes will come to the same conclusion as to which update was last?
Especially considering that the replication might take place hours after
the original transaction happened.


Jan

--
#================================================= =====================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================= = JanWieck@Yahoo.com #

---------------------------(end of broadcast)---------------------------
TIP 5: don't forget to increase your free space map settings

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #17 (permalink)  
Old 04-12-2008, 06:42 AM
Simon Riggs
 
Posts: n/a
Default Re: Proposal: Commit timestamp

On Thu, 2007-01-25 at 18:16 -0500, Jan Wieck wrote:

> To provide this data, I would like to add another "log" directory,
> pg_tslog. The files in this directory will be similar to the clog, but
> contain arrays of timestamptz values. On commit, the current system time
> will be taken. As long as this time is lower or equal to the last taken
> time in this PostgreSQL instance, the value will be increased by one
> microsecond. The resulting time will be added to the commit WAL record
> and written into the pg_tslog file.


A transaction time table/log has other uses as well, so its fairly
interesting to have this.

> COMMIT [TRANSACTION] [WITH TIMESTAMP <timestamptz>];
>
> The extension is limited to superusers and will override the normally
> generated commit timestamp.


I don't think its acceptable to override the normal timestamp. That
could lead to non monotonic time values which could screw up PITR. My
view is that you still need PITR even when you are using replication,
because the former provides recoverability and the latter provides
availability.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com



---------------------------(end of broadcast)---------------------------
TIP 5: don't forget to increase your free space map settings

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #18 (permalink)  
Old 04-12-2008, 06:42 AM
Jan Wieck
 
Posts: n/a
Default Re: Proposal: Commit timestamp

On 1/26/2007 8:26 AM, Simon Riggs wrote:
> On Thu, 2007-01-25 at 18:16 -0500, Jan Wieck wrote:
>
>> To provide this data, I would like to add another "log" directory,
>> pg_tslog. The files in this directory will be similar to the clog, but
>> contain arrays of timestamptz values. On commit, the current system time
>> will be taken. As long as this time is lower or equal to the last taken
>> time in this PostgreSQL instance, the value will be increased by one
>> microsecond. The resulting time will be added to the commit WAL record
>> and written into the pg_tslog file.

>
> A transaction time table/log has other uses as well, so its fairly
> interesting to have this.
>
>> COMMIT [TRANSACTION] [WITH TIMESTAMP <timestamptz>];
>>
>> The extension is limited to superusers and will override the normally
>> generated commit timestamp.

>
> I don't think its acceptable to override the normal timestamp. That
> could lead to non monotonic time values which could screw up PITR. My
> view is that you still need PITR even when you are using replication,
> because the former provides recoverability and the latter provides
> availability.


Without that it is rendered useless for conflict resolution purposes.

The timestamp used does not necessarily have much to do with the real
time at commit. Although I'd like it to be as close as possible. This
timestamp marks the age of the new datum in an update. Since the
replication is asynchronous, the update on the remote systems will
happen later, but the timestamp recorded with that datum must be the
timestamp of the original transaction, not the current time when it is
replicated remotely. All we have to determine that is the xmin in the
rows tuple header, so that xmin must resolve to the original
transactions timestamp.


Jan

--
#================================================= =====================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================= = JanWieck@Yahoo.com #

---------------------------(end of broadcast)---------------------------
TIP 1: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #19 (permalink)  
Old 04-12-2008, 06:42 AM
Stephen Frost
 
Posts: n/a
Default Re: Proposal: Commit timestamp

* Jan Wieck (JanWieck@Yahoo.com) wrote:
> On 1/26/2007 2:37 AM, Naz Gassiep wrote:
> >I would be *very* concerned that system time is not a guaranteed
> >monotonic entity. Surely a counter or other internally managed mechanism
> >would be a better solution.

>
> Such a counter has only "local" relevance. How do you plan to compare
> the two separate counters on different machines to tell which
> transaction happened last?


I'd also suggest you look into Lamport timestamps... Trusting the
system clock just isn't practical, even with NTP. I've developed
(albeit relatively small) systems using Lamport timestamps and would be
happy to talk about it offlist. I've probably got some code I could
share as well.

Thanks,

Stephen

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)

iD8DBQFFuhJ5rzgMPqB3kigRAgMXAJ4kncyi+AJJKK/lzijcy7Jl4lyBlwCfVQd/
2WIhnNNqsQsg/2WMCfIYCa8=
=V3NO
-----END PGP SIGNATURE-----

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #20 (permalink)  
Old 04-12-2008, 06:42 AM
Andrew Dunstan
 
Posts: n/a
Default Re: Proposal: Commit timestamp

Stephen Frost wrote:
> I'd also suggest you look into Lamport timestamps... Trusting the
> system clock just isn't practical, even with NTP. I've developed
> (albeit relatively small) systems using Lamport timestamps and would be
> happy to talk about it offlist. I've probably got some code I could
> share as well.
>


that looks like what Oracle RAC uses:
http://www.lc.leidenuniv.nl/awcourse...6597/coord.htm

cheers

andrew



---------------------------(end of broadcast)---------------------------
TIP 5: don't forget to increase your free space map settings

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
Reply


Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are On
Forum Jump


All times are GMT. The time now is 03:10 PM.


Powered by vBulletin® Version 3.6.5
Copyright ©2000 - 2008, Jelsoft Enterprises Ltd.
SEO by vBSEO 3.2.0
www.UnixAdminTalk.com