vBulletin Search Engine Optimization
| |||||||
| Register | FAQ | Members List | Calendar | Search | Today's Posts | Mark Forums Read |
| ||||
| I have a DELETE statement that deletes duplicate data from a table. It takes a long time to execute, so I thought I'd seek advice here. The structure of the table is little funny. The following is NOT the table, but the representation of the data in the table: +-----------+ | a | b | +-----+-----+ | 123 | 234 | | 345 | 456 | | 123 | 123 | +-----+-----+ As you can see, the data is tabular. This is how it is stored in the table: +-----+-----------+------------+ | Row | FieldName | FieldValue | +-----+-----------+------------+ | 1 | a | 123 | | 1 | b | 234 | | 2 | a | 345 | | 2 | b | 456 | | 3 | a | 123 | | 3 | b | 234 | +-----+-----------+------------+ What I need is to delete all records having the same "Row" when there exists the same set of records with a different (smaller, to be precise) "Row". Using the example above, what I need to get is: +-----+-----------+------------+ | Row | FieldName | FieldValue | +-----+-----------+------------+ | 1 | a | 123 | | 1 | b | 234 | | 2 | a | 345 | | 2 | b | 456 | +-----+-----------+------------+ A slow way of doing this seem to be: DELETE FROM X WHERE Row IN (SELECT DISTINCT Row FROM X x1 WHERE EXISTS (SELECT * FROM X x2 WHERE x2.Row < x1.Row AND NOT EXISTS (SELECT * FROM X x3 WHERE x3.Row = x2.Row AND x3.FieldName = x2.FieldName AND x3.FieldValue <> x1.FieldValue))) Can this be done faster, better, and cheaper? |
| |||
| my knee-jerk reaction is: Why is it important to optimize it? I think you should delete the duplicates, then create a constraint that prevents them from recurring. If, for some reason, you are unable to fix the application that creates these duplicates, and creating a constraint causes errors in the application that you can't tolerate, then I suppose an alternative would be to create a trigger that deletes them upon entry. Having a composite index on the columns that are being duplicated would enable such a trigger to run quickly. But looking at your query, I find it strangely complex. Why not just: DELETE FROM X WHERE EXISTS (SELECT * FROM X x2 WHERE x2.Row < x.Row AND X.FieldName = x2.FieldName AND X.FieldValue = x2.FieldValue) Am I missing something? Your NOT EXISTS has me a bit confused... I think it might delete data in situations other than described. Also, NOT EXISTS is generally slow. |
| |||
| On 2004-07-15, Aaron W. West <tallpeak@hotmail.NO.SPAM> wrote: > Why is it important to optimize it? I think you should delete the > duplicates, then create a constraint that prevents them from recurring. Such constraint may not be created. This table is a temporary table, where data from an input file is loaded. Duplicate sets of records must be deleted because the data then goes into permanent tables. Those table have constraints against duplicates. > But looking at your query, I find it strangely complex. Me too. I'm trying to improve it. Its complexity seems to hinder its performance. > Why not just: > > DELETE FROM X > WHERE EXISTS (SELECT * FROM X x2 > WHERE x2.Row < x.Row > AND X.FieldName = x2.FieldName > AND X.FieldValue = x2.FieldValue) This would delete records that should not be deleted. Here's an example: +-----+-----------+------------+ | Row | FieldName | FieldValue | +-----+-----------+------------+ | 1 | a | 123 | | 1 | b | 234 | | 2 | a | 345 | | 2 | b | 456 | | 3 | a | 123 | | 3 | b | 666 | +-----+-----------+------------+ Here the combination of values for "a" and "b" on every "Row" is different. There are no duplicates here. The query that you proposed would delete the second to last row +-----+-----------+------------+ | 3 | a | 123 | +-----+-----------+------------+ because it has the same FieldName and FieldValue as the first row. Think of it the data this way: +-----+-----+ | a | b | +-----+-----+ | 123 | 234 | | 345 | 456 | | 123 | 666 | +-----+-----+ No duplicate rows here. |
| ||||
| Hi You could try only selecting the correct data when you move it into the permanent tables. But the following may work better: DELETE FROM X1 FROM X X1 JOIN X X2 ON x2.Row < x1.Row AND x1.Fieldvalue = x2.Fieldvalue AND x1.FieldName = x2.FieldName John "Alexander Anderson" <no@spam.com> wrote in message news:slrncfe0ft.mk1.alex@Toronto-HSE-ppp3682122.sympatico.ca... > I have a DELETE statement that deletes duplicate data from a table. It > takes a long time to execute, so I thought I'd seek advice here. The > structure of the table is little funny. The following is NOT the table, > but the representation of the data in the table: > > +-----------+ > | a | b | > +-----+-----+ > | 123 | 234 | > | 345 | 456 | > | 123 | 123 | > +-----+-----+ > > As you can see, the data is tabular. This is how it is stored in the table: > > +-----+-----------+------------+ > | Row | FieldName | FieldValue | > +-----+-----------+------------+ > | 1 | a | 123 | > | 1 | b | 234 | > | 2 | a | 345 | > | 2 | b | 456 | > | 3 | a | 123 | > | 3 | b | 234 | > +-----+-----------+------------+ > > What I need is to delete all records having the same "Row" when there exists > the same set of records with a different (smaller, to be precise) "Row". > Using the example above, what I need to get is: > > +-----+-----------+------------+ > | Row | FieldName | FieldValue | > +-----+-----------+------------+ > | 1 | a | 123 | > | 1 | b | 234 | > | 2 | a | 345 | > | 2 | b | 456 | > +-----+-----------+------------+ > > A slow way of doing this seem to be: > > DELETE FROM X > WHERE Row IN > (SELECT DISTINCT Row FROM X x1 > WHERE EXISTS > (SELECT * FROM X x2 > WHERE x2.Row < x1.Row > AND NOT EXISTS > (SELECT * FROM X x3 > WHERE x3.Row = x2.Row > AND x3.FieldName = x2.FieldName > AND x3.FieldValue <> x1.FieldValue))) > > Can this be done faster, better, and cheaper? |