This is a discussion on Finding duplicates without primary key within the MySQL forums, part of the Database Server Software category; --> Hi, is there a way to find all duplicates in a table without a primary key? Regards, André...
| |||||||
| FAQ | Members List | Calendar | Search | Today's Posts | Mark Forums Read |
| |||
| "André Hänsel" <andre@webkr.de> wrote in message news:droqad$s8c$1@sagnix.uni-muenster.de... > is there a way to find all duplicates in a table without a primary key? SELECT COUNT(col), col FROM myTable GROUP BY col HAVING COUNT(col) > 1 Regards, Bill K. |
| |||
| Bill Karwin wrote: > "André Hänsel" <andre@webkr.de> wrote in message > news:droqad$s8c$1@sagnix.uni-muenster.de... >> is there a way to find all duplicates in a table without a primary >> key? > > SELECT COUNT(col), col > FROM myTable > GROUP BY col > HAVING COUNT(col) > 1 No, this will give me distinct values of col, not the actual rows that are double. |
| |||
| "André Hänsel" <andre@webkr.de> wrote in message news:drrgpp$eqs$1@sagnix.uni-muenster.de... > Bill Karwin wrote: >> "André Hänsel" <andre@webkr.de> wrote in message >> news:droqad$s8c$1@sagnix.uni-muenster.de... >>> is there a way to find all duplicates in a table without a primary >>> key? >> >> SELECT COUNT(col), col >> FROM myTable >> GROUP BY col >> HAVING COUNT(col) > 1 > > No, this will give me distinct values of col, not the actual rows that are > double. I see what you mean. How does one distinguish between two identical rows, if the table has no primary key? Here's a different method, using self-joins. It works nicely for testing for duplicates across multiple columns: CREATE TABLE foo (col INTEGER, col2 INTEGER, col3 INTEGER); INSERT INTO foo VALUES (1,1,1), (2,2,2), (2,2,2), (3,3,3), (3,3,3), (3,3,3), (4,4,4), (4,4,4), (4,4,4), (4,4,4); SELECT COUNT(*), a.* FROM foo AS a INNER JOIN foo AS b ON (a.col = b.col AND a.col2 = b.col2 AND a.col3 = b.col3) GROUP BY a.col, a.col2, a.col3 HAVING COUNT(*) > 1; The result shows count of 4 for (2,2,2), 9 for (3,3,3), and 16 for (4,4,4). This is the square of the actual number of copies, because there's no way to prevent rows from being joined to themselves when there's no primary key. Now say you want to delete the duplicate rows and leave only one copy of each? One solution is to use DELETE with a LIMIT clause. For each row returned by the above SELECT, run this statement once, and provide the values as parameters. DELETE FROM myTable WHERE col = ? AND col2 = ? AND col3 = ? LIMIT 1 So you would execute this once with parameter values 2,2,2, once with values 3,3,3, and once with values 4,4,4. Then run the SELECT again and see if the duplicates have been eliminated. In this case, the 2,2,2 result should vanish. The 3,3,3 should find a count of 4, and the 4,4,4 should find a count of 9. You still have some duplicates, but they are fewer in number. If all the duplicates occured in sets of two, then the job should be complete after one pass. If the duplicates rows have three or more copies, then you may have to loop through the SELECT & DELETE test several times. Keep repeating this until the SELECT returns no rows. Does this help? Regards, Bill K. |
| |||
| Bill Karwin wrote: > "André Hänsel" <andre@webkr.de> wrote in message > news:drrgpp$eqs$1@sagnix.uni-muenster.de... >> Bill Karwin wrote: >>> "André Hänsel" <andre@webkr.de> wrote in message >>> news:droqad$s8c$1@sagnix.uni-muenster.de... >>>> is there a way to find all duplicates in a table without a primary >>>> key? >>> >>> SELECT COUNT(col), col >>> FROM myTable >>> GROUP BY col >>> HAVING COUNT(col) > 1 >> >> No, this will give me distinct values of col, not the actual rows >> that are double. > > I see what you mean. How does one distinguish between two identical > rows, if the table has no primary key? > > Here's a different method, using self-joins. It works nicely for > testing for duplicates across multiple columns: > > CREATE TABLE foo (col INTEGER, col2 INTEGER, col3 INTEGER); > > INSERT INTO foo VALUES (1,1,1), (2,2,2), (2,2,2), > (3,3,3), (3,3,3), (3,3,3), (4,4,4), (4,4,4), (4,4,4), (4,4,4); > > SELECT COUNT(*), a.* > FROM foo AS a INNER JOIN foo AS b > ON (a.col = b.col AND a.col2 = b.col2 AND a.col3 = b.col3) > GROUP BY a.col, a.col2, a.col3 > HAVING COUNT(*) > 1; > > The result shows count of 4 for (2,2,2), 9 for (3,3,3), and 16 for > (4,4,4). This is the square of the actual number of copies, because > there's no way to prevent rows from being joined to themselves when > there's no primary key. > > Now say you want to delete the duplicate rows and leave only one copy > of each? > One solution is to use DELETE with a LIMIT clause. For each row > returned by the above SELECT, run this statement once, and provide > the values as parameters. > > DELETE FROM myTable WHERE col = ? AND col2 = ? AND col3 = ? LIMIT 1 > > So you would execute this once with parameter values 2,2,2, once with > values 3,3,3, and once with values 4,4,4. > > Then run the SELECT again and see if the duplicates have been > eliminated. In this case, the 2,2,2 result should vanish. The 3,3,3 > should find a count of 4, and the 4,4,4 should find a count of 9. > You still have some duplicates, but they are fewer in number. > > If all the duplicates occured in sets of two, then the job should be > complete after one pass. If the duplicates rows have three or more > copies, then you may have to loop through the SELECT & DELETE test > several times. Keep repeating this until the SELECT returns no rows. > > Does this help? > Yes, this helps, thanks. Althought I probably have to put the grouped result in a temporary table since MySQL (or no DBE at all?) doesn't support DELETEing and sub-QUERYing from the same table. Regards, André |
| ||||
| "André Hänsel" <andre@webkr.de> wrote in message news:dsjir5$pur$1@sagnix.uni-muenster.de... > Althought I probably have to put the grouped result in a temporary table > since MySQL (or no DBE at all?) doesn't support DELETEing and sub-QUERYing > from the same table. I would just fetch the rows from the SELECT in a loop in some application code, and issue the DELETE operations based on these values. SQL was designed to be used in cooperation with an application language. Regards, Bill K. |