This is a discussion on Practical error logging for very large COPY statements within the pgsql Hackers forums, part of the PostgreSQL category; --> If you've ever loaded 100 million rows, you'll know just how annoying it is to find that you have ...
| |||||||
| Register | FAQ | Members List | Calendar | Search | Today's Posts | Mark Forums Read |
| ||||
| If you've ever loaded 100 million rows, you'll know just how annoying it is to find that you have a duplicate row somewhere in there. Experience shows that there is always one, whatever oath the analyst swears beforehand. It's hard to find out which row is the duplicate, plus you've just screwed up a big table. It needs a VACUUM, then a reload. I'd like to find a way to handle this manual task programmatically. What I'd like to do is add an ERRORTABLE clause to COPY. The main problem is how we detect a duplicate row violation, yet prevent it from aborting the transaction. What I propose is to log uniqueness violations only when there is only a single unique index on a table. Flow of control would be to: locate page of index where value should go lock index block _bt_check_unique, but don't error if violation then insert row into ERRORTABLE else insert row into data block insert row into unique index unlock index block do other indexes Which is very similar code to the recently proposed MERGE logic. With that logic, a COPY will run to completion, yet be able to report the odd couple of unique index violations in found along the way. More importantly we can then handle rows those with another program to locate where those errors came from and resolve them. In most cases with a single unique index, the index inserts are rightmost index entries anyway, so there is scope here for an additional optimisation: keep both index and data blocks locked across multiple row inserts until either the unique index or the data block fills. Thats better than taking a full table lock, since it allows concurrent access to the rest of the table, but its also more efficient than continually re-requesting the same blocks (which looks like about 10-15% saving on performance from hash lookups, lock/unlock, etc). Best Regards, Simon Riggs ---------------------------(end of broadcast)--------------------------- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match |
| |||
| Simon Riggs <simon@2ndquadrant.com> writes: > What I'd like to do is add an ERRORTABLE clause to COPY. The main > problem is how we detect a duplicate row violation, yet prevent it from > aborting the transaction. If this only solves the problem of duplicate keys, and not any other kind of COPY error, it's not going to be much of an advance. > Flow of control would be to: > locate page of index where value should go > lock index block > _bt_check_unique, but don't error > if violation then insert row into ERRORTABLE > else > insert row into data block > insert row into unique index > unlock index block > do other indexes Ugh. Do you realize how many levels of modularity violation are implied by that sketch? Have you even thought about the fact that we have more than one kind of index? regards, tom lane ---------------------------(end of broadcast)--------------------------- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match |
| |||
| Tom Lane wrote: >Simon Riggs <simon@2ndquadrant.com> writes: > > >>What I'd like to do is add an ERRORTABLE clause to COPY. The main >>problem is how we detect a duplicate row violation, yet prevent it from >>aborting the transaction. >> >> > >If this only solves the problem of duplicate keys, and not any other >kind of COPY error, it's not going to be much of an advance. > > Yeah, and I see errors from bad data as often as from violating constraints. Maybe the best way if we do something like this would be to have the error table contain a single text, or maybe bytea, field which contained the raw offending input line. cheers andrew ---------------------------(end of broadcast)--------------------------- TIP 4: Have you searched our list archives? http://archives.postgresql.org |
| |||
| Seems similar to the pgloader project on pgfoundry.org. Chris Simon Riggs wrote: > If you've ever loaded 100 million rows, you'll know just how annoying it > is to find that you have a duplicate row somewhere in there. Experience > shows that there is always one, whatever oath the analyst swears > beforehand. > > It's hard to find out which row is the duplicate, plus you've just > screwed up a big table. It needs a VACUUM, then a reload. > > I'd like to find a way to handle this manual task programmatically. > > What I'd like to do is add an ERRORTABLE clause to COPY. The main > problem is how we detect a duplicate row violation, yet prevent it from > aborting the transaction. > > What I propose is to log uniqueness violations only when there is only a > single unique index on a table. > > Flow of control would be to: > > locate page of index where value should go > lock index block > _bt_check_unique, but don't error > if violation then insert row into ERRORTABLE > else > insert row into data block > insert row into unique index > unlock index block > do other indexes > > Which is very similar code to the recently proposed MERGE logic. > > With that logic, a COPY will run to completion, yet be able to report > the odd couple of unique index violations in found along the way. More > importantly we can then handle rows those with another program to locate > where those errors came from and resolve them. > > > > In most cases with a single unique index, the index inserts are > rightmost index entries anyway, so there is scope here for an additional > optimisation: keep both index and data blocks locked across multiple row > inserts until either the unique index or the data block fills. Thats > better than taking a full table lock, since it allows concurrent access > to the rest of the table, but its also more efficient than continually > re-requesting the same blocks (which looks like about 10-15% saving on > performance from hash lookups, lock/unlock, etc). > > Best Regards, Simon Riggs > > > ---------------------------(end of broadcast)--------------------------- > TIP 9: In versions below 8.0, the planner will ignore your desire to > choose an index scan if your joining column's datatypes do not > match ---------------------------(end of broadcast)--------------------------- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq |
| |||
| On Mon, 2005-11-21 at 19:38 -0500, Andrew Dunstan wrote: > > Tom Lane wrote: > > >Simon Riggs <simon@2ndquadrant.com> writes: > > > > > >>What I'd like to do is add an ERRORTABLE clause to COPY. The main > >>problem is how we detect a duplicate row violation, yet prevent it from > >>aborting the transaction. > >> > >If this only solves the problem of duplicate keys, and not any other > >kind of COPY error, it's not going to be much of an advance. > > > Yeah, and I see errors from bad data as often as from violating > constraints. Maybe the best way if we do something like this would be to > have the error table contain a single text, or maybe bytea, field which > contained the raw offending input line. I have committed the sin of omission again. Duplicate row violation is the big challenge, but not the only function planned. Formatting errors occur much more frequently, so yes we'd want to log all of that too. And yes, it would be done in the way you suggest. Here's a fuller, but still brief sketch: COPY ... FROM .... [ERRORTABLES format1 [uniqueness1] [ERRORLIMIT percent]] where Format1, Uniqueness1 would be created from new by this command (or error if they already exist) Format1 would hold formatting errors so would be in a blob table with cols (line number, col number, error number, fullrowstring) Uniqueness1 would be same definition as table, but with no indexes This table would be optional, indicating no uniqueness violation checks would be needed to be carried out. If present and yet no unique indexes exist, then Uniqueness1 would be ignored (and not created). ERRORLIMIT percent would abort the COPY if more than percent errors were found, after the first 1000 records (that limit could also be stated if required). Without the ERRORTABLES clause, COPY would work exactly as it does now. How does that sound? Best Regards, Simon Riggs ---------------------------(end of broadcast)--------------------------- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq |
| |||
| On Tue, 2005-11-22 at 10:00 +0800, Christopher Kings-Lynne wrote: > Seems similar to the pgloader project on pgfoundry.org. It is similar and good, but I regard that as a workaround rather than the way forward. Best Regards, Simon Riggs ---------------------------(end of broadcast)--------------------------- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match |
| |||
| On Mon, 2005-11-21 at 19:05 -0500, Tom Lane wrote: > Simon Riggs <simon@2ndquadrant.com> writes: > > Flow of control would be to: > > > locate page of index where value should go > > lock index block > > _bt_check_unique, but don't error > > if violation then insert row into ERRORTABLE > > else > > insert row into data block > > insert row into unique index > > unlock index block > > do other indexes > > Ugh. Do you realize how many levels of modularity violation are implied > by that sketch? IMHO the above is fairly ugly, but I suggest it now because: 1. I want to avoid uniqueness violations in COPY 2. The logic used is very similar to that recently proposed for MERGE. If anybody has a better idea for (1), shout it out now. If the logic is OK for MERGE, then it should be OK for COPY with uniqeness violation trapping also. Both use uniqueness checking first, so you'd need to argue against both or neither. > Have you even thought about the fact that we have more > than one kind of index? Yes, but they don't support unique indexes do they? Best Regards, Simon Riggs ---------------------------(end of broadcast)--------------------------- TIP 6: explain analyze is your friend |
| |||
| Simon Riggs <simon@2ndquadrant.com> writes: > I have committed the sin of omission again. > Duplicate row violation is the big challenge, but not the only function > planned. Formatting errors occur much more frequently, so yes we'd want > to log all of that too. And yes, it would be done in the way you > suggest. > Here's a fuller, but still brief sketch: > COPY ... FROM .... > [ERRORTABLES format1 [uniqueness1] > [ERRORLIMIT percent]] This is getting worse, not better :-( The general problem that needs to be solved is "trap any error that occurs during attempted insertion of a COPY row, and instead of aborting the copy, record the data and the error message someplace else". Seen in that light, implementing a special path for uniqueness violations is pretty pointless. You could almost do this today in about five minutes with a PG_TRY construct. The hard part is to distinguish errors that COPY can safely trap from errors that must be allowed to abort the transaction anyway (usually because the backend won't be in a consistent state if it's not allowed to do post-abort cleanup). I think the latter class would mostly be "internal" errors, and so not trapping them shouldn't be a big problem for usefulness; but we can't simply ignore the possibility that they would occur during COPY. regards, tom lane ---------------------------(end of broadcast)--------------------------- TIP 5: don't forget to increase your free space map settings |
| |||
| On Tue, Nov 22, 2005 at 09:58:44AM -0500, Tom Lane wrote: > The general problem that needs to be solved is "trap any error that > occurs during attempted insertion of a COPY row, and instead of aborting > the copy, record the data and the error message someplace else". Seen > in that light, implementing a special path for uniqueness violations is > pretty pointless. Actually, there are really only a few errors people want to trap I imagine: - CHECK constraints (all handled in ExecConstraints) - Duplicate keys - Foreign key violations (all handled by triggers) Rather than worry about all the events we can't safely trap, how about we simply deal with the handful that are trappable. For example, we let people create an ON ERROR trigger and use the existing trigger interface. We have three possibilities: - They return the same tuple, throw the error - They return NULL, ignore error, goto next tuple - They return a different tuple, retest the conditions The trigger can then do anything a normal trigger can do, including copying to another table if people like that. This doesn't seem like awfully hard work, does it? Initially at least, no TRY blocks needed... Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a > tool for doing 5% of the work and then sitting around waiting for someone > else to do the other 95% so you can sue them. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.0.6 (GNU/Linux) Comment: For info see http://www.gnupg.org iD8DBQFDgzpbIB7bNG8LQkwRAmR2AJ9tlaJofdUrXU/QB+498U2lBdYt4wCdEwCn n+JDHqKw2wI8oYWNRW+hCcs= =6LXP -----END PGP SIGNATURE----- |
| ||||
| Martijn van Oosterhout <kleptog@svana.org> writes: > On Tue, Nov 22, 2005 at 09:58:44AM -0500, Tom Lane wrote: >> The general problem that needs to be solved is "trap any error that >> occurs during attempted insertion of a COPY row, and instead of aborting >> the copy, record the data and the error message someplace else". > Actually, there are really only a few errors people want to trap I > imagine: You've forgotten bad data, eg "foo" in an integer field, or an untranslatable multibyte character. The bad-data problem is what lets out trigger-based solutions, or indeed anything that presumes that the bad data can be forced into a particular representation. regards, tom lane ---------------------------(end of broadcast)--------------------------- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match |