Unix Technical Forum

Practical error logging for very large COPY statements

This is a discussion on Practical error logging for very large COPY statements within the pgsql Hackers forums, part of the PostgreSQL category; --> If you've ever loaded 100 million rows, you'll know just how annoying it is to find that you have ...


Go Back   Unix Technical Forum > Database Server Software > PostgreSQL > pgsql Hackers

Register FAQ Members List Calendar Search Today's Posts Mark Forums Read
  #1 (permalink)  
Old 04-11-2008, 06:55 AM
Simon Riggs
 
Posts: n/a
Default Practical error logging for very large COPY statements

If you've ever loaded 100 million rows, you'll know just how annoying it
is to find that you have a duplicate row somewhere in there. Experience
shows that there is always one, whatever oath the analyst swears
beforehand.

It's hard to find out which row is the duplicate, plus you've just
screwed up a big table. It needs a VACUUM, then a reload.

I'd like to find a way to handle this manual task programmatically.

What I'd like to do is add an ERRORTABLE clause to COPY. The main
problem is how we detect a duplicate row violation, yet prevent it from
aborting the transaction.

What I propose is to log uniqueness violations only when there is only a
single unique index on a table.

Flow of control would be to:

locate page of index where value should go
lock index block
_bt_check_unique, but don't error
if violation then insert row into ERRORTABLE
else
insert row into data block
insert row into unique index
unlock index block
do other indexes

Which is very similar code to the recently proposed MERGE logic.

With that logic, a COPY will run to completion, yet be able to report
the odd couple of unique index violations in found along the way. More
importantly we can then handle rows those with another program to locate
where those errors came from and resolve them.



In most cases with a single unique index, the index inserts are
rightmost index entries anyway, so there is scope here for an additional
optimisation: keep both index and data blocks locked across multiple row
inserts until either the unique index or the data block fills. Thats
better than taking a full table lock, since it allows concurrent access
to the rest of the table, but its also more efficient than continually
re-requesting the same blocks (which looks like about 10-15% saving on
performance from hash lookups, lock/unlock, etc).

Best Regards, Simon Riggs


---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
choose an index scan if your joining column's datatypes do not
match

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #2 (permalink)  
Old 04-11-2008, 06:55 AM
Tom Lane
 
Posts: n/a
Default Re: Practical error logging for very large COPY statements

Simon Riggs <simon@2ndquadrant.com> writes:
> What I'd like to do is add an ERRORTABLE clause to COPY. The main
> problem is how we detect a duplicate row violation, yet prevent it from
> aborting the transaction.


If this only solves the problem of duplicate keys, and not any other
kind of COPY error, it's not going to be much of an advance.

> Flow of control would be to:


> locate page of index where value should go
> lock index block
> _bt_check_unique, but don't error
> if violation then insert row into ERRORTABLE
> else
> insert row into data block
> insert row into unique index
> unlock index block
> do other indexes


Ugh. Do you realize how many levels of modularity violation are implied
by that sketch? Have you even thought about the fact that we have more
than one kind of index?

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
choose an index scan if your joining column's datatypes do not
match

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #3 (permalink)  
Old 04-11-2008, 06:55 AM
Andrew Dunstan
 
Posts: n/a
Default Re: Practical error logging for very large COPY statements



Tom Lane wrote:

>Simon Riggs <simon@2ndquadrant.com> writes:
>
>
>>What I'd like to do is add an ERRORTABLE clause to COPY. The main
>>problem is how we detect a duplicate row violation, yet prevent it from
>>aborting the transaction.
>>
>>

>
>If this only solves the problem of duplicate keys, and not any other
>kind of COPY error, it's not going to be much of an advance.
>
>


Yeah, and I see errors from bad data as often as from violating
constraints. Maybe the best way if we do something like this would be to
have the error table contain a single text, or maybe bytea, field which
contained the raw offending input line.

cheers

andrew


---------------------------(end of broadcast)---------------------------
TIP 4: Have you searched our list archives?

http://archives.postgresql.org

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #4 (permalink)  
Old 04-11-2008, 06:55 AM
Christopher Kings-Lynne
 
Posts: n/a
Default Re: Practical error logging for very large COPY statements

Seems similar to the pgloader project on pgfoundry.org.

Chris

Simon Riggs wrote:
> If you've ever loaded 100 million rows, you'll know just how annoying it
> is to find that you have a duplicate row somewhere in there. Experience
> shows that there is always one, whatever oath the analyst swears
> beforehand.
>
> It's hard to find out which row is the duplicate, plus you've just
> screwed up a big table. It needs a VACUUM, then a reload.
>
> I'd like to find a way to handle this manual task programmatically.
>
> What I'd like to do is add an ERRORTABLE clause to COPY. The main
> problem is how we detect a duplicate row violation, yet prevent it from
> aborting the transaction.
>
> What I propose is to log uniqueness violations only when there is only a
> single unique index on a table.
>
> Flow of control would be to:
>
> locate page of index where value should go
> lock index block
> _bt_check_unique, but don't error
> if violation then insert row into ERRORTABLE
> else
> insert row into data block
> insert row into unique index
> unlock index block
> do other indexes
>
> Which is very similar code to the recently proposed MERGE logic.
>
> With that logic, a COPY will run to completion, yet be able to report
> the odd couple of unique index violations in found along the way. More
> importantly we can then handle rows those with another program to locate
> where those errors came from and resolve them.
>
>
>
> In most cases with a single unique index, the index inserts are
> rightmost index entries anyway, so there is scope here for an additional
> optimisation: keep both index and data blocks locked across multiple row
> inserts until either the unique index or the data block fills. Thats
> better than taking a full table lock, since it allows concurrent access
> to the rest of the table, but its also more efficient than continually
> re-requesting the same blocks (which looks like about 10-15% saving on
> performance from hash lookups, lock/unlock, etc).
>
> Best Regards, Simon Riggs
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 9: In versions below 8.0, the planner will ignore your desire to
> choose an index scan if your joining column's datatypes do not
> match



---------------------------(end of broadcast)---------------------------
TIP 3: Have you checked our extensive FAQ?

http://www.postgresql.org/docs/faq

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #5 (permalink)  
Old 04-11-2008, 06:56 AM
Simon Riggs
 
Posts: n/a
Default Re: Practical error logging for very large COPY

On Mon, 2005-11-21 at 19:38 -0500, Andrew Dunstan wrote:
>
> Tom Lane wrote:
>
> >Simon Riggs <simon@2ndquadrant.com> writes:
> >
> >
> >>What I'd like to do is add an ERRORTABLE clause to COPY. The main
> >>problem is how we detect a duplicate row violation, yet prevent it from
> >>aborting the transaction.
> >>

> >If this only solves the problem of duplicate keys, and not any other
> >kind of COPY error, it's not going to be much of an advance.
> >


> Yeah, and I see errors from bad data as often as from violating
> constraints. Maybe the best way if we do something like this would be to
> have the error table contain a single text, or maybe bytea, field which
> contained the raw offending input line.


I have committed the sin of omission again.

Duplicate row violation is the big challenge, but not the only function
planned. Formatting errors occur much more frequently, so yes we'd want
to log all of that too. And yes, it would be done in the way you
suggest.

Here's a fuller, but still brief sketch:

COPY ... FROM ....
[ERRORTABLES format1 [uniqueness1]
[ERRORLIMIT percent]]

where Format1, Uniqueness1 would be created from new by this command (or
error if they already exist)

Format1 would hold formatting errors so would be in a blob table with
cols (line number, col number, error number, fullrowstring)

Uniqueness1 would be same definition as table, but with no indexes
This table would be optional, indicating no uniqueness violation checks
would be needed to be carried out. If present and yet no unique indexes
exist, then Uniqueness1 would be ignored (and not created).

ERRORLIMIT percent would abort the COPY if more than percent errors were
found, after the first 1000 records (that limit could also be stated if
required).

Without the ERRORTABLES clause, COPY would work exactly as it does now.

How does that sound?

Best Regards, Simon Riggs


---------------------------(end of broadcast)---------------------------
TIP 3: Have you checked our extensive FAQ?

http://www.postgresql.org/docs/faq

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #6 (permalink)  
Old 04-11-2008, 06:56 AM
Simon Riggs
 
Posts: n/a
Default Re: Practical error logging for very large COPY

On Tue, 2005-11-22 at 10:00 +0800, Christopher Kings-Lynne wrote:
> Seems similar to the pgloader project on pgfoundry.org.


It is similar and good, but I regard that as a workaround rather than
the way forward.

Best Regards, Simon Riggs



---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
choose an index scan if your joining column's datatypes do not
match

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #7 (permalink)  
Old 04-11-2008, 06:56 AM
Simon Riggs
 
Posts: n/a
Default Re: Practical error logging for very large COPY

On Mon, 2005-11-21 at 19:05 -0500, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > Flow of control would be to:

>
> > locate page of index where value should go
> > lock index block
> > _bt_check_unique, but don't error
> > if violation then insert row into ERRORTABLE
> > else
> > insert row into data block
> > insert row into unique index
> > unlock index block
> > do other indexes

>
> Ugh. Do you realize how many levels of modularity violation are implied
> by that sketch?


IMHO the above is fairly ugly, but I suggest it now because:
1. I want to avoid uniqueness violations in COPY
2. The logic used is very similar to that recently proposed for MERGE.

If anybody has a better idea for (1), shout it out now.

If the logic is OK for MERGE, then it should be OK for COPY with
uniqeness violation trapping also. Both use uniqueness checking first,
so you'd need to argue against both or neither.

> Have you even thought about the fact that we have more
> than one kind of index?


Yes, but they don't support unique indexes do they?

Best Regards, Simon Riggs


---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #8 (permalink)  
Old 04-11-2008, 06:56 AM
Tom Lane
 
Posts: n/a
Default Re: Practical error logging for very large COPY

Simon Riggs <simon@2ndquadrant.com> writes:
> I have committed the sin of omission again.


> Duplicate row violation is the big challenge, but not the only function
> planned. Formatting errors occur much more frequently, so yes we'd want
> to log all of that too. And yes, it would be done in the way you
> suggest.


> Here's a fuller, but still brief sketch:


> COPY ... FROM ....
> [ERRORTABLES format1 [uniqueness1]
> [ERRORLIMIT percent]]



This is getting worse, not better :-(

The general problem that needs to be solved is "trap any error that
occurs during attempted insertion of a COPY row, and instead of aborting
the copy, record the data and the error message someplace else". Seen
in that light, implementing a special path for uniqueness violations is
pretty pointless.

You could almost do this today in about five minutes with a PG_TRY
construct. The hard part is to distinguish errors that COPY can safely
trap from errors that must be allowed to abort the transaction anyway
(usually because the backend won't be in a consistent state if it's not
allowed to do post-abort cleanup). I think the latter class would
mostly be "internal" errors, and so not trapping them shouldn't be a big
problem for usefulness; but we can't simply ignore the possibility that
they would occur during COPY.

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 5: don't forget to increase your free space map settings

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #9 (permalink)  
Old 04-11-2008, 06:56 AM
Martijn van Oosterhout
 
Posts: n/a
Default Re: Practical error logging for very large COPY

On Tue, Nov 22, 2005 at 09:58:44AM -0500, Tom Lane wrote:
> The general problem that needs to be solved is "trap any error that
> occurs during attempted insertion of a COPY row, and instead of aborting
> the copy, record the data and the error message someplace else". Seen
> in that light, implementing a special path for uniqueness violations is
> pretty pointless.


Actually, there are really only a few errors people want to trap I
imagine:

- CHECK constraints (all handled in ExecConstraints)
- Duplicate keys
- Foreign key violations (all handled by triggers)

Rather than worry about all the events we can't safely trap, how about
we simply deal with the handful that are trappable. For example, we let
people create an ON ERROR trigger and use the existing trigger
interface. We have three possibilities:

- They return the same tuple, throw the error
- They return NULL, ignore error, goto next tuple
- They return a different tuple, retest the conditions

The trigger can then do anything a normal trigger can do, including
copying to another table if people like that.

This doesn't seem like awfully hard work, does it? Initially at least,
no TRY blocks needed...

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.


-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.0.6 (GNU/Linux)
Comment: For info see http://www.gnupg.org

iD8DBQFDgzpbIB7bNG8LQkwRAmR2AJ9tlaJofdUrXU/QB+498U2lBdYt4wCdEwCn
n+JDHqKw2wI8oYWNRW+hCcs=
=6LXP
-----END PGP SIGNATURE-----

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #10 (permalink)  
Old 04-11-2008, 06:56 AM
Tom Lane
 
Posts: n/a
Default Re: Practical error logging for very large COPY

Martijn van Oosterhout <kleptog@svana.org> writes:
> On Tue, Nov 22, 2005 at 09:58:44AM -0500, Tom Lane wrote:
>> The general problem that needs to be solved is "trap any error that
>> occurs during attempted insertion of a COPY row, and instead of aborting
>> the copy, record the data and the error message someplace else".


> Actually, there are really only a few errors people want to trap I
> imagine:


You've forgotten bad data, eg "foo" in an integer field, or an
untranslatable multibyte character. The bad-data problem is what lets
out trigger-based solutions, or indeed anything that presumes that the
bad data can be forced into a particular representation.

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
choose an index scan if your joining column's datatypes do not
match

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
Reply


Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are On
Forum Jump


All times are GMT. The time now is 02:17 PM.


Powered by vBulletin® Version 3.6.5
Copyright ©2000 - 2008, Jelsoft Enterprises Ltd.
SEO by vBSEO 3.2.0
www.UnixAdminTalk.com