vBulletin Search Engine Optimization
| |||||||
| Register | FAQ | Members List | Calendar | Search | Today's Posts | Mark Forums Read |
| ||||
| For a future multimaster replication system, I will need a couple of features in the PostgreSQL server itself. I will submit separate proposals per feature so that discussions can be kept focused on one feature per thread. For conflict resolution purposes in an asynchronous multimaster system, the "last update" definition often comes into play. For this to work, the system must provide a monotonically increasing timestamp taken at the commit of a transaction. During replication, the replication process must be able to provide the remote nodes timestamp so that the replicated data will be "as of the time it was written on the remote node", and not the current local time of the replica, which is by definition of "asynchronous" later. To provide this data, I would like to add another "log" directory, pg_tslog. The files in this directory will be similar to the clog, but contain arrays of timestamptz values. On commit, the current system time will be taken. As long as this time is lower or equal to the last taken time in this PostgreSQL instance, the value will be increased by one microsecond. The resulting time will be added to the commit WAL record and written into the pg_tslog file. If a per database configurable tslog_priority is given, the timestamp will be truncated to milliseconds and the increment logic is done on milliseconds. The priority is added to the timestamp. This guarantees that no two timestamps for commits will ever be exactly identical, even across different servers. The COMMIT syntax will get extended to COMMIT [TRANSACTION] [WITH TIMESTAMP <timestamptz>]; The extension is limited to superusers and will override the normally generated commit timestamp. This will be used to give the replicating transaction on the replica the exact same timestamp it got on the originating master node. The pg_tslog segments will be purged like the clog segments, after all transactions belonging to them have been stamped frozen. A frozen xid by definition has a timestamp of epoch. To ensure a system using this timestamp feature has enough time to perform its work, a new GUC variable defining an interval will prevent vacuum from freezing xid's that are younger than that. A function get_commit_timestamp(xid) returning timpstamptz will return the commit time of a transaction as recorded by this feature. Comments, changes, additions? Jan -- #================================================= =====================# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #================================================= = JanWieck@Yahoo.com # ---------------------------(end of broadcast)--------------------------- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq |
| |||
| On Thu, 2007-01-25 at 18:16 -0500, Jan Wieck wrote: > For conflict resolution purposes in an asynchronous multimaster system, > the "last update" definition often comes into play. For this to work, > the system must provide a monotonically increasing timestamp taken at > the commit of a transaction. Do you really need an actual timestamptz derived from the system clock, or would a monotonically increasing 64-bit counter be sufficient? (The assumption that the system clock is monotonically increasing seems pretty fragile, in the presence of manual system clock changes, ntpd, etc.) > Comments, changes, additions? Would this feature have any use beyond the specific project/algorithm you have in mind? -Neil ---------------------------(end of broadcast)--------------------------- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to majordomo@postgresql.org so that your message can get through to the mailing list cleanly |
| |||
| Jan Wieck <JanWieck@Yahoo.com> writes: > To provide this data, I would like to add another "log" directory, > pg_tslog. The files in this directory will be similar to the clog, but > contain arrays of timestamptz values. Why should everybody be made to pay this overhead? > The COMMIT syntax will get extended to > COMMIT [TRANSACTION] [WITH TIMESTAMP <timestamptz>]; > The extension is limited to superusers and will override the normally > generated commit timestamp. This will be used to give the replicating > transaction on the replica the exact same timestamp it got on the > originating master node. I'm not convinced you've even thought this through. If you do that then you have no guarantee of commit timestamp monotonicity on the slave (if it has either multi masters or any locally generated transactions). Since this is supposedly for a multi-master system, that seems a rather fatal objection --- no node in the system will actually have commit timestamp monotonicity. What are you hoping to accomplish with this? regards, tom lane ---------------------------(end of broadcast)--------------------------- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq |
| |||
| On 1/25/2007 6:47 PM, Neil Conway wrote: > On Thu, 2007-01-25 at 18:16 -0500, Jan Wieck wrote: >> For conflict resolution purposes in an asynchronous multimaster system, >> the "last update" definition often comes into play. For this to work, >> the system must provide a monotonically increasing timestamp taken at >> the commit of a transaction. > > Do you really need an actual timestamptz derived from the system clock, > or would a monotonically increasing 64-bit counter be sufficient? (The > assumption that the system clock is monotonically increasing seems > pretty fragile, in the presence of manual system clock changes, ntpd, > etc.) Yes, I do need it to be a timestamp, and one assumption is that all servers in the multimaster cluster are ntp synchronized. The reason is that this is for asynchronous multimaster (in my case). Two sequences running on separate systems don't tell which was the "last update" on a timeline. This conflict resolution method alone is of course completely inadequate. > >> Comments, changes, additions? > > Would this feature have any use beyond the specific project/algorithm > you have in mind? The tablelog project on pgfoundry currently uses the transactions start time but would be very delighted to have the commit time available instead. Jan -- #================================================= =====================# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #================================================= = JanWieck@Yahoo.com # ---------------------------(end of broadcast)--------------------------- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate |
| |||
| Jan Wieck <JanWieck@Yahoo.com> writes: > On 1/25/2007 6:47 PM, Neil Conway wrote: >> Would this feature have any use beyond the specific project/algorithm >> you have in mind? > The tablelog project on pgfoundry currently uses the transactions start > time but would be very delighted to have the commit time available instead. BTW, it's not clear to me why you need a new log area for this. (We don't log transaction start time anywhere, so certainly tablelog's needs would not include it.) Commit timestamps are available from WAL commit records in a crash-and-restart scenario, so wouldn't that be enough? regards, tom lane ---------------------------(end of broadcast)--------------------------- TIP 4: Have you searched our list archives? http://archives.postgresql.org |
| |||
| On 1/25/2007 6:49 PM, Tom Lane wrote: > Jan Wieck <JanWieck@Yahoo.com> writes: >> To provide this data, I would like to add another "log" directory, >> pg_tslog. The files in this directory will be similar to the clog, but >> contain arrays of timestamptz values. > > Why should everybody be made to pay this overhead? It could be made an initdb time option. If you intend to use a product that requires this feature, you will be willing to pay that price. > >> The COMMIT syntax will get extended to >> COMMIT [TRANSACTION] [WITH TIMESTAMP <timestamptz>]; >> The extension is limited to superusers and will override the normally >> generated commit timestamp. This will be used to give the replicating >> transaction on the replica the exact same timestamp it got on the >> originating master node. > > I'm not convinced you've even thought this through. If you do that then > you have no guarantee of commit timestamp monotonicity on the slave > (if it has either multi masters or any locally generated transactions). > Since this is supposedly for a multi-master system, that seems a rather > fatal objection --- no node in the system will actually have commit > timestamp monotonicity. What are you hoping to accomplish with this? Maybe I wasn't clear enough about this. If the commit timestamps on the local machine are guaranteed to increase at least by one millisecond (okay that limits the system to a sustained 1000 commits per second before it really seems to run ahead of time), then no two commits on the same instance will ever have the same timestamp. If furthermore each instance in a cluster has a distinct priority (the microsecond part added to the millisecond-truncated timestamp), each commit timestamp could even act as a globally unique ID. It does require that all the nodes in the cluster are configured with a distinct priority. What I hope to accomplish with this is a very easy, commit time based "last update wins" conflict resolution for data fields of the overwrite nature. The replication system I have in mind will have another field type of the balance nature, where it will never communicate the current value but only deltas that get applied regardless of the two timestamps. Jan -- #================================================= =====================# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #================================================= = JanWieck@Yahoo.com # ---------------------------(end of broadcast)--------------------------- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq |
| |||
| On 1/25/2007 7:41 PM, Tom Lane wrote: > Jan Wieck <JanWieck@Yahoo.com> writes: >> On 1/25/2007 6:47 PM, Neil Conway wrote: >>> Would this feature have any use beyond the specific project/algorithm >>> you have in mind? > >> The tablelog project on pgfoundry currently uses the transactions start >> time but would be very delighted to have the commit time available instead. > > BTW, it's not clear to me why you need a new log area for this. (We > don't log transaction start time anywhere, so certainly tablelog's needs > would not include it.) Commit timestamps are available from WAL commit > records in a crash-and-restart scenario, so wouldn't that be enough? First, I need the timestamp of the original transaction that caused the data to change, which can be a remote or a local transaction. So the timestamp currently recorded in the WAL commit record is useless and the commit record has to be extended by one more timestamp. Second, I don't think that an API scanning for WAL commit records by xid would be efficient enough to satisfy the needs of a timestamp based conflict resolution system, which would have to retrieve the timestamp for every rows xmin that it is about to update in order to determine if the old or the new values should be used. Third, keeping the timestamp information in the WAL only would require to keep the WAL segments around until they are older than the admin chosen minimum freeze age. I hope you don't want to force that penalty on everyone who intends to use multimaster replication. Jan -- #================================================= =====================# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #================================================= = JanWieck@Yahoo.com # ---------------------------(end of broadcast)--------------------------- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match |
| |||
| On Thu, 25 Jan 2007, Jan Wieck wrote: > > For a future multimaster replication system, I will need a couple of > features in the PostgreSQL server itself. I will submit separate > proposals per feature so that discussions can be kept focused on one > feature per thread. Hmm... "will need" ... Have you prototyped this system yet? ISTM you can prototype your proposal using "external" components so you can work out the kinks first. Richard -- Richard Troy, Chief Scientist Science Tools Corporation 510-924-1363 or 202-747-1263 rtroy@ScienceTools.com, http://ScienceTools.com/ ---------------------------(end of broadcast)--------------------------- TIP 5: don't forget to increase your free space map settings |
| |||
| On 1/25/2007 8:42 PM, Richard Troy wrote: > On Thu, 25 Jan 2007, Jan Wieck wrote: >> >> For a future multimaster replication system, I will need a couple of >> features in the PostgreSQL server itself. I will submit separate >> proposals per feature so that discussions can be kept focused on one >> feature per thread. > > Hmm... "will need" ... Have you prototyped this system yet? ISTM you can > prototype your proposal using "external" components so you can work out > the kinks first. These details are pretty drilled down and are needed with the described functionality. And I will not make the same mistake as with Slony-I again and develop things, that require backend support, as totally external (look at the catalog corruption mess I created there and you know what I'm talking about). Jan -- #================================================= =====================# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #================================================= = JanWieck@Yahoo.com # ---------------------------(end of broadcast)--------------------------- TIP 5: don't forget to increase your free space map settings |
| ||||
| Jan Wieck wrote: > On 1/25/2007 6:49 PM, Tom Lane wrote: > > Jan Wieck <JanWieck@Yahoo.com> writes: > >> To provide this data, I would like to add another "log" directory, > >> pg_tslog. The files in this directory will be similar to the clog, but > >> contain arrays of timestamptz values. > > > > Why should everybody be made to pay this overhead? > > It could be made an initdb time option. If you intend to use a product > that requires this feature, you will be willing to pay that price. That is going to cut your usage by like 80%. There must be a better way. -- Bruce Momjian bruce@momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. + ---------------------------(end of broadcast)--------------------------- TIP 6: explain analyze is your friend |