Unix Technical Forum

Frequent Update Project: Design Overview of HOT Updates

This is a discussion on Frequent Update Project: Design Overview of HOT Updates within the pgsql Hackers forums, part of the PostgreSQL category; --> Design Overview of HOT Updates ------------------------------ The objective is to increase the speed of the UPDATE case, while minimizing ...


Go Back   Unix Technical Forum > Database Server Software > PostgreSQL > pgsql Hackers

Register FAQ Members List Calendar Search Today's Posts Mark Forums Read
  #1 (permalink)  
Old 04-12-2008, 05:34 AM
Simon Riggs
 
Posts: n/a
Default Frequent Update Project: Design Overview of HOT Updates

Design Overview of HOT Updates
------------------------------

The objective is to increase the speed of the UPDATE case, while
minimizing the overall negative effects of the UPDATE. We refer to the
general requirement as *Frequent Update Optimization*, though this
design proposal is for Heap Overflow Tuple (HOT) Updates. It is similar
in some ways to the design for SITC already proposed, though has a
number of additional features drawn from other designs to make it a
practical and effective implementation.

EnterpriseDB have a working, performant prototype of this design. There
are still a number of issues to resolve and the intention is to follow
an open community process to find the best way forward. All required
detail will be provided for the work conducted so far.

Current PGSQL behaviour is for UPDATEs to create a new tuple version
within the heap, so acts from many perspectives as if it were an INSERT.
All of the tuple versions are chained together, so that whichever of the
tuples is visible to your Snapshot, you can walk the chain to find the
most recent tuple version to update.

The HOT proposal splits the heap into two areas: the main heap and an
overflow relation, both of which are regular, transactional relations.
INSERT and DELETE effect only the main heap exactly as they do now.
UPDATE places new tuple versions into the overflow relation in most
cases, maintaining all of the current capabilities of the MVCC model:
all tuple versions remain accessible, plus the chain of updated tuple
versions is maintained.

UPDATEs do not insert into indexes at all - no index pointers exist at
all into the overflow relation. So during a stream of UPDATEs the main
heap and indexes stay the same size, while the indexes are used
read-only, avoiding additional database writes and the need for eventual
index rebuilding.

The current tuple version within the main heap is referred to as the
"root tuple" from now on. When reading the main heap, if the root tuple
is not visible we "walk the chain" into the overflow relation until we
find a tuple version that is visible - we follow the t_ctid pointer from
tuple to tuple, testing for MVCC visibility as we go.

As more UPDATEs take place these tuple chains would grow, making
locating the latest tuple take progressively longer. Intuitively this
sounds like a bad design, though an additional feature turns that from
bad to good performance:

- when anybody walks the chain into the overflow relation, if we find
that the root tuple is vacuumable we copy back the first visible tuple
version over the now-dead root tuple.

This allows the length of a typical tuple chain to be extremely short in
practice. For a single connection issuing a stream of UPDATEs the chain
length will no more than 1 at any time. Even under heavy concurrent
UPDATEs the modal chain length remains 1, with the chain length varying
in approximately Poisson distribution.

The overflow relation is similar in concept to a TOAST table, so we
might describe this approach as TOAST-for-updated-versions. The code
implementation is similar also, with reasonably similar modularity. HOT
does sound radical, but no more so than TOAST was when first discussed.

We can locate any tuple version and thereby allow MVCC to work
correctly, while at the same time preserving the crucial property of the
Postgres non-overwriting storage manager. This works very well for
indexed tuple access since the index and main heap never grow, thus
maintaining access speeds. HOT supports both Read Committed and
Serializable transaction isolation levels, with transactions of any
length, just as with current MVCC.

Performance results against the previously described test cases are
approximately:

1. pgbench (TPC-B) - Around 200-300% better

2. DBT-2 (TPC-C) - Signficicant improvement - possibly 10% improvement,
somewhat difficult to measure exactly because of the way the test itself
works.

3. truckin - Around 500% improvement

The performance gain remains better than REL8_2 base even in the
presence of longer running transactions.

[There is also considerable synergy with changes to b-tree indexes that
are both useful in their own right, and even better when HOT is added,
more on that on a separate thread.]

We expect that people will want to measure this for themselves, so we
don't publish all of the detailed internal tests here since YMMV.

Using HOT
---------

HOT can only work in cases where a tuple does not modify one of the
columns defined in an index on the table, and when we do not alter the
row length of the tuple. [We'll be able to do that more efficiently when
we have plan invalidation]

If we perform an update that meets the HOT criteria then we put the
new version into the overflow relation; we describe this as a HOT
UPDATE. If we perform an update that does not meet the criteria, then we
carry on with the existing/old MVCC behaviour; we describe this as a
non-HOT UPDATE.

So with HOT we must support both HOT and non-HOT UPDATEs. Each tuple
whether it is in the main heap or the overflow relation can point to
another tuple with the t_ctid, which is extended by using a previously
unused bit to flag whether the destination is a block from the main heap
or the overflow relation.

Just to reiterate, since a HOT update doesn't touch index values that
means that all members of a tuple chain have identical primary keys and
index values. So we only ever need to grovel through the overflow
relation *if* we have already met the index search criteria - so we
don't have to re-check the index criteria for each tuple version we
touch. All normal updates of a table will be read-only on the index,
plus writes to heap/overflow.

Simple Example of Tuple Chaining
--------------------------------

Lets look at a simple case:

1. INSERT INTO table (1,....)
Tuple is inserted into main heap (Version1 or V1)

Heap Page P1 Overflow Relation
---------------- -----------------
| | | |
| ______ | | |
| | V1 | | | |
| |______| | | |
| | | |
|--------------| |---------------|

2. UPDATE table SET nonindexcol = 6 WHERE pkcol = 1;
Same length tuple, so V2 written using overflow

Heap Page P1 Overflow Relation
---------------- -----------------
| | | ______ |
| ______ | |------>| V2 | |
| | V1 |------| | |______| |
| |______| | | |
| | | |
|--------------| |---------------|

3. UPDATE table SET nonindexcol = 7 WHERE pkcol = 1;
Same length tuple, so V3 written using overflow
We update v2.t_ctid to maintain the forward tuple chain

Heap Page P1 Overflow Relation
---------------- -----------------
| | | ______ |
| ______ | |------>| V2 | |
| | V1 |------| | |______|--| |
| |______| | | ______ | |
| | | | V3 |<-| |
| | | |______| |
|--------------| |---------------|


Simple Example of Copying-back
------------------------------

4. UPDATE table SET nonindexcol = 8 WHERE pkcol = 1;
Same length tuple, so V4 will be written using overflow.
While selecting for UPDATE, we look at V1, we see it is now vacuumable,
so we first copy back the first visible tuple over V1 before proceeding.
[Assume V1 is dead, V2 is still visible]

Heap Page P1 Overflow Relation
---------------- -----------------
| | | ______ |
| ______ | | | V2* | |
| | V2 |------| | |______| |
| |______| | | | ______ |
| | |------>| V3 | |
| | | |______| |
|--------------| |---------------|

The copy back operation leaves V2* as the now-unwritable old copy of V2.
We update it also to break the chain to V3. Now orphaned, it can be
removed later.

5. Now we continue with the UPDATE as planned
Same length tuple, so V4 will be written using overflow.
We update v3.t_ctid to maintain the forward tuple chain

Heap Page P1 Overflow Relation
---------------- -----------------
| | | ______ |
| ______ | | | V2* | |
| | V2 |------| | |______| |
| |______| | | | ______ |
| | |------>| V3 | |
| | | |______|--| |
| | | ______ | |
| | | | V4 |<-| |
| | | |______| |
|--------------| |---------------|


Copying Back
------------

The copy back operation is required to keep the tuple chains short and
efficient. When we copy back we leave a tuple version and possibly a
long part of the tuple chain orphaned. These are then marked for removal
by the next VACUUM.

To support copying back, we add an additional header fields onto
every tuple in the overflow relation (only). We generate WAL
appropriately for this action.

VACUUMing becomes significantly easier with HOT than with non-HOT
UPDATEs. Specifically, since there are no indexes on the overflow
relation we should be able to VACUUM it more easily and possibly
piece-by-piece (i.e. "retail vacuum").

Behavioural Characteristics
---------------------------

With HOT, it is easily possible that the chain of prior versions spans
many blocks. The chain always starts with the block of the root tuple
but possibly includes many overflow relation blocks.

A SeqScan of a HOT table will turn into a large number of
random accesses to the overflow relation, which could be considerably
more expensive than sequential I/O. Clearly very large tables would not
be able to be HOT enabled without additional cost, so we make HOT a
table-level option: WITH (freqUPDATE = on) [discuss...]

This means that a SeqScan of a heap with heavy concurrent update
activity *could* be fairly costly. Most times it won't be, because the
appropriate overflow relation blocks will most likely be in-memory. That
is the price we pay for having this optimization - a suitable warning
would apply, though this would not be a surprise to anybody since Oracle
people know that reading heavily-updated data takes more CPU and
DB2/SQLServer know that it takes ages to get around each of the row
level locks. So this isn't bad *at all* in comparison.

We hope/expect/recommend HOT to be used in conjunction with UPDATE
RETURNING, so that fewer SELECTs and non-index operations will be used.

[There's an extra prize if you can think of a *correct* way of
SeqScanning with HOT more efficiently; we've not focused on that yet.]

In the presence of a long running transaction, the overflow relation
could become very large since VACUUM becomes less effective. (This same
situation occurs with current MVCC). The overflow relation will then
start to page out to disk and could grow large. Oracle has this problem
also and this is why Oracle 10g has moved away from fixed-size Rollback
Segments to the use of an overflow Tablespace. Some size controls are
available to limit the growth there. That would be a useful optimization
for HOT also and one much easier than any such enhancement of existing
MVCC.

Next Steps
----------

Deeper technical details will be published shortly; the design goes much
deeper than the discussion here, but we wanted to present the overall
summary first.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com



---------------------------(end of broadcast)---------------------------
TIP 1: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #2 (permalink)  
Old 04-12-2008, 05:34 AM
Martijn van Oosterhout
 
Posts: n/a
Default Re: Frequent Update Project: Design Overview of HOT Updates

Nice idea, just one question:

On Thu, Nov 09, 2006 at 05:13:16PM +0000, Simon Riggs wrote:
> Behavioural Characteristics
> ---------------------------
>
> With HOT, it is easily possible that the chain of prior versions spans
> many blocks. The chain always starts with the block of the root tuple
> but possibly includes many overflow relation blocks.
>
> A SeqScan of a HOT table will turn into a large number of
> random accesses to the overflow relation, which could be considerably
> more expensive than sequential I/O. Clearly very large tables would not
> be able to be HOT enabled without additional cost, so we make HOT a
> table-level option: WITH (freqUPDATE = on) [discuss...]


It seems to me that bitmap index scans will get these same
characteristics also, right? The bitmap scan will have to follow the
chain of any possibly matching tuple in any of the blocks that are in
the bitmap, right?

Have a ncie day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.


-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.1 (GNU/Linux)

iD8DBQFFU2o5IB7bNG8LQkwRAi7tAJ9djnrnpkgOw/tYJFVLssNArm5GgQCfYqgL
8KEhu7Baq8bDLd6kpeLFED0=
=bKCn
-----END PGP SIGNATURE-----

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #3 (permalink)  
Old 04-12-2008, 05:34 AM
Simon Riggs
 
Posts: n/a
Default Re: Frequent Update Project: Design Overview of HOTUpdates

On Thu, 2006-11-09 at 18:49 +0100, Martijn van Oosterhout wrote:
> Nice idea, just one question:


> It seems to me that bitmap index scans will get these same
> characteristics also, right? The bitmap scan will have to follow the
> chain of any possibly matching tuple in any of the blocks that are in
> the bitmap, right?


Yes, they would identify the root tuples. The whole chain has matching
index values, by definition, so the re-evaluation for lossy bitmaps will
work just the same before the actual tuple is retrieved by walking the
chain (if required).

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com



---------------------------(end of broadcast)---------------------------
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #4 (permalink)  
Old 04-12-2008, 05:34 AM
Josh Berkus
 
Posts: n/a
Default Re: Frequent Update Project: Design Overview of HOT Updates

Simon,

> If we perform an update that meets the HOT criteria then we put the
> new version into the overflow relation; we describe this as a HOT
> UPDATE. If we perform an update that does not meet the criteria, then we
> carry on with the existing/old MVCC behaviour; we describe this as a
> non-HOT UPDATE.


Making the essential performance analysis question, "Am I HOT or Not?"

Sorry, but someone had to say it. ;-)

--
--Josh

Josh Berkus
PostgreSQL @ Sun
San Francisco

---------------------------(end of broadcast)---------------------------
TIP 1: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #5 (permalink)  
Old 04-12-2008, 05:34 AM
Simon Riggs
 
Posts: n/a
Default Re: Frequent Update Project: Design Overview of HOTUpdates

On Thu, 2006-11-09 at 13:21 -0800, Josh Berkus wrote:
> Simon,
>
> > If we perform an update that meets the HOT criteria then we put the
> > new version into the overflow relation; we describe this as a HOT
> > UPDATE. If we perform an update that does not meet the criteria, then we
> > carry on with the existing/old MVCC behaviour; we describe this as a
> > non-HOT UPDATE.

>
> Making the essential performance analysis question, "Am I HOT or Not?"


Very good. ;-)

Well, we had Overflow Update CHaining as an alternative name... :-)

The naming sounds silly, but we had a few alternate designs, so we
needed to be able to tell them apart sensibly. We've had TVR, SITC, UIP
and now HOT. Software research...

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com



---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #6 (permalink)  
Old 04-12-2008, 05:34 AM
Tom Lane
 
Posts: n/a
Default Re: Frequent Update Project: Design Overview of HOT Updates

"Simon Riggs" <simon@2ndquadrant.com> writes:
> As more UPDATEs take place these tuple chains would grow, making
> locating the latest tuple take progressively longer.


This is the part that bothers me --- particularly the random-access
nature of the search. I wonder whether you couldn't do something
involving an initial table fill-factor of less than 50%, and having
the first updated version living on the same heap page as its parent.
Only when the active chain length was more than one (which you
hypothesize is rare) would it actually be necessary to do a random
access into the overflow table.

More generally, do we need an overflow table at all, rather than having
these overflow tuples living in the same file as the root tuples? As
long as there's a bit available to mark a tuple as being this special
not-separately-indexed type, you don't need a special location to know
what it is. This might break down in the presence of seqscans though.

Actually, you omitted to mention the locking aspects of moving tuples
around --- exactly how are you going to make that work without breaking
concurrent scans?

> This allows the length of a typical tuple chain to be extremely short in
> practice. For a single connection issuing a stream of UPDATEs the chain
> length will no more than 1 at any time.


Only if there are no other transactions being held open, which makes
this claim a lot weaker.

> HOT can only work in cases where a tuple does not modify one of the
> columns defined in an index on the table, and when we do not alter the
> row length of the tuple.


Seems like "altering the row length" isn't the issue, it's just "is
there room on the page for the new version". Again, a generous
fillfactor would give you more flexibility.

> [We'll be able to do that more efficiently when
> we have plan invalidation]


Uh, what's that got to do with it?

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #7 (permalink)  
Old 04-12-2008, 05:34 AM
Josh Berkus
 
Posts: n/a
Default Re: Frequent Update Project: Design Overview of HOT Updates

Tom,

> Actually, you omitted to mention the locking aspects of moving tuples
> around --- exactly how are you going to make that work without breaking
> concurrent scans?


I believe that's the "unsolved technical issue" in the prototype, unless
Pavan has solved it in the last two weeks. Pavan?

--
--Josh

Josh Berkus
PostgreSQL @ Sun
San Francisco

---------------------------(end of broadcast)---------------------------
TIP 4: Have you searched our list archives?

http://archives.postgresql.org

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #8 (permalink)  
Old 04-12-2008, 05:34 AM
Pavan Deolasee
 
Posts: n/a
Default Re: Frequent Update Project: Design Overview of HOT Updates

On 11/10/06, Josh Berkus <josh@agliodbs.com> wrote:
>
> Tom,
>
> > Actually, you omitted to mention the locking aspects of moving tuples
> > around --- exactly how are you going to make that work without breaking
> > concurrent scans?

>
> I believe that's the "unsolved technical issue" in the prototype, unless
> Pavan has solved it in the last two weeks. Pavan?
>
>

When an overflow tuple is copied back to the main heap, the overflow tuple
is
marked with a special flag (HEAP_OVERFLOW_MOVEDBACK). Subsequently,
when a backend tries to lock the overflow version of the tuple, it checks
the flag
and jumps to the main heap if the flag is set.

We use the same technique to update the correct version of a tuple when a
tuple is moved back to the main heap.

Regards,
Pavan

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #9 (permalink)  
Old 04-12-2008, 05:34 AM
Tom Lane
 
Posts: n/a
Default Re: Frequent Update Project: Design Overview of HOT Updates

"Pavan Deolasee" <pavan.deolasee@gmail.com> writes:
> On 11/10/06, Josh Berkus <josh@agliodbs.com> wrote:
>> I believe that's the "unsolved technical issue" in the prototype, unless
>> Pavan has solved it in the last two weeks. Pavan?
>>

> When an overflow tuple is copied back to the main heap, the overflow tuple
> is
> marked with a special flag (HEAP_OVERFLOW_MOVEDBACK). Subsequently,
> when a backend tries to lock the overflow version of the tuple, it checks
> the flag
> and jumps to the main heap if the flag is set.


(1) How does it "jump to the main heap"? The links go the other
direction.

(2) Isn't this full of race conditions?

(3) I thought you already used up the one remaining t_infomask bit.

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 3: Have you checked our extensive FAQ?

http://www.postgresql.org/docs/faq

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #10 (permalink)  
Old 04-12-2008, 05:34 AM
Pavan Deolasee
 
Posts: n/a
Default Re: Frequent Update Project: Design Overview of HOT Updates

On 11/10/06, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> "Pavan Deolasee" <pavan.deolasee@gmail.com> writes:
> > On 11/10/06, Josh Berkus < josh@agliodbs.com> wrote:
> >> I believe that's the "unsolved technical issue" in the prototype,

> unless
> >> Pavan has solved it in the last two weeks. Pavan?
> >>

> > When an overflow tuple is copied back to the main heap, the overflow

> tuple
> > is
> > marked with a special flag (HEAP_OVERFLOW_MOVEDBACK). Subsequently,
> > when a backend tries to lock the overflow version of the tuple, it

> checks
> > the flag
> > and jumps to the main heap if the flag is set.

>
> (1) How does it "jump to the main heap"? The links go the other
> direction.



The overflow tuple has a special header to store the back pointer to the
main heap.
This increases the tuple header size by 6 bytes, but the overhead is
restricted only to the overflow
tuples.

(2) Isn't this full of race conditions?


I agree, there could be race conditions. But IMO we can handle those. When
we
follow the tuple chain, we hold a SHARE lock on the main heap buffer. Also,
when
the root tuple is vacuumable and needs to be overwritten, we acquire and
keep EXCLUSIVE
lock on the main heap buffer.

This reduces the race conditions to a great extent.

(3) I thought you already used up the one remaining t_infomask bit.


Yes. The last bit in the t_infomask is used up to mark presence of overflow
tuple header. But I believe there are few more bits that can be reused.
There are three bits available in the t_ctid field as well (since ip_posid
needs maximum 13 bits). One bit is used to identify whether a given tid
points to the main heap or the overflow heap. This helps when tids are
passed around in the code.

Since the back pointer from the overflow tuple always points to the main
heap, the same bit can be used to mark copied-back tuples (we are doing it
in a slight different way in the current prototype though).

Regards,
Pavan

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
Reply


Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are On
Forum Jump


All times are GMT. The time now is 05:20 PM.


Powered by vBulletin® Version 3.6.5
Copyright ©2000 - 2008, Jelsoft Enterprises Ltd.
SEO by vBSEO 3.2.0
www.UnixAdminTalk.com