Unix Technical Forum

Finding duplicates without primary key

This is a discussion on Finding duplicates without primary key within the MySQL forums, part of the Database Server Software category; --> Hi, is there a way to find all duplicates in a table without a primary key? Regards, André...


Go Back   Unix Technical Forum > Database Server Software > MySQL

FAQ Members List Calendar Search Today's Posts Mark Forums Read
  #1 (permalink)  
Old 02-28-2008, 07:19 AM
André Hänsel
 
Posts: n/a
Default Finding duplicates without primary key

Hi,

is there a way to find all duplicates in a table without a primary key?

Regards,
André


Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #2 (permalink)  
Old 02-28-2008, 07:19 AM
Bill Karwin
 
Posts: n/a
Default Re: Finding duplicates without primary key

"André Hänsel" <andre@webkr.de> wrote in message
news:droqad$s8c$1@sagnix.uni-muenster.de...
> is there a way to find all duplicates in a table without a primary key?


SELECT COUNT(col), col
FROM myTable
GROUP BY col
HAVING COUNT(col) > 1

Regards,
Bill K.


Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #3 (permalink)  
Old 02-28-2008, 07:19 AM
André Hänsel
 
Posts: n/a
Default Re: Finding duplicates without primary key

Bill Karwin wrote:
> "André Hänsel" <andre@webkr.de> wrote in message
> news:droqad$s8c$1@sagnix.uni-muenster.de...
>> is there a way to find all duplicates in a table without a primary
>> key?

>
> SELECT COUNT(col), col
> FROM myTable
> GROUP BY col
> HAVING COUNT(col) > 1


No, this will give me distinct values of col, not the actual rows that are
double.


Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #4 (permalink)  
Old 02-28-2008, 07:19 AM
Bill Karwin
 
Posts: n/a
Default Re: Finding duplicates without primary key

"André Hänsel" <andre@webkr.de> wrote in message
news:drrgpp$eqs$1@sagnix.uni-muenster.de...
> Bill Karwin wrote:
>> "André Hänsel" <andre@webkr.de> wrote in message
>> news:droqad$s8c$1@sagnix.uni-muenster.de...
>>> is there a way to find all duplicates in a table without a primary
>>> key?

>>
>> SELECT COUNT(col), col
>> FROM myTable
>> GROUP BY col
>> HAVING COUNT(col) > 1

>
> No, this will give me distinct values of col, not the actual rows that are
> double.


I see what you mean. How does one distinguish between two identical rows,
if the table has no primary key?

Here's a different method, using self-joins. It works nicely for testing
for duplicates across multiple columns:

CREATE TABLE foo (col INTEGER, col2 INTEGER, col3 INTEGER);

INSERT INTO foo VALUES (1,1,1), (2,2,2), (2,2,2),
(3,3,3), (3,3,3), (3,3,3), (4,4,4), (4,4,4), (4,4,4), (4,4,4);

SELECT COUNT(*), a.*
FROM foo AS a INNER JOIN foo AS b
ON (a.col = b.col AND a.col2 = b.col2 AND a.col3 = b.col3)
GROUP BY a.col, a.col2, a.col3
HAVING COUNT(*) > 1;

The result shows count of 4 for (2,2,2), 9 for (3,3,3), and 16 for (4,4,4).
This is the square of the actual number of copies, because there's no way to
prevent rows from being joined to themselves when there's no primary key.

Now say you want to delete the duplicate rows and leave only one copy of
each?
One solution is to use DELETE with a LIMIT clause. For each row returned by
the above SELECT, run this statement once, and provide the values as
parameters.

DELETE FROM myTable WHERE col = ? AND col2 = ? AND col3 = ? LIMIT 1

So you would execute this once with parameter values 2,2,2, once with values
3,3,3, and once with values 4,4,4.

Then run the SELECT again and see if the duplicates have been eliminated.
In this case, the 2,2,2 result should vanish. The 3,3,3 should find a count
of 4, and the 4,4,4 should find a count of 9. You still have some
duplicates, but they are fewer in number.

If all the duplicates occured in sets of two, then the job should be
complete after one pass. If the duplicates rows have three or more copies,
then you may have to loop through the SELECT & DELETE test several times.
Keep repeating this until the SELECT returns no rows.

Does this help?

Regards,
Bill K.


Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #5 (permalink)  
Old 02-28-2008, 07:21 AM
André Hänsel
 
Posts: n/a
Default Re: Finding duplicates without primary key

Bill Karwin wrote:
> "André Hänsel" <andre@webkr.de> wrote in message
> news:drrgpp$eqs$1@sagnix.uni-muenster.de...
>> Bill Karwin wrote:
>>> "André Hänsel" <andre@webkr.de> wrote in message
>>> news:droqad$s8c$1@sagnix.uni-muenster.de...
>>>> is there a way to find all duplicates in a table without a primary
>>>> key?
>>>
>>> SELECT COUNT(col), col
>>> FROM myTable
>>> GROUP BY col
>>> HAVING COUNT(col) > 1

>>
>> No, this will give me distinct values of col, not the actual rows
>> that are double.

>
> I see what you mean. How does one distinguish between two identical
> rows, if the table has no primary key?
>
> Here's a different method, using self-joins. It works nicely for
> testing for duplicates across multiple columns:
>
> CREATE TABLE foo (col INTEGER, col2 INTEGER, col3 INTEGER);
>
> INSERT INTO foo VALUES (1,1,1), (2,2,2), (2,2,2),
> (3,3,3), (3,3,3), (3,3,3), (4,4,4), (4,4,4), (4,4,4), (4,4,4);
>
> SELECT COUNT(*), a.*
> FROM foo AS a INNER JOIN foo AS b
> ON (a.col = b.col AND a.col2 = b.col2 AND a.col3 = b.col3)
> GROUP BY a.col, a.col2, a.col3
> HAVING COUNT(*) > 1;
>
> The result shows count of 4 for (2,2,2), 9 for (3,3,3), and 16 for
> (4,4,4). This is the square of the actual number of copies, because
> there's no way to prevent rows from being joined to themselves when
> there's no primary key.
>
> Now say you want to delete the duplicate rows and leave only one copy
> of each?
> One solution is to use DELETE with a LIMIT clause. For each row
> returned by the above SELECT, run this statement once, and provide
> the values as parameters.
>
> DELETE FROM myTable WHERE col = ? AND col2 = ? AND col3 = ? LIMIT 1
>
> So you would execute this once with parameter values 2,2,2, once with
> values 3,3,3, and once with values 4,4,4.
>
> Then run the SELECT again and see if the duplicates have been
> eliminated. In this case, the 2,2,2 result should vanish. The 3,3,3
> should find a count of 4, and the 4,4,4 should find a count of 9.
> You still have some duplicates, but they are fewer in number.
>
> If all the duplicates occured in sets of two, then the job should be
> complete after one pass. If the duplicates rows have three or more
> copies, then you may have to loop through the SELECT & DELETE test
> several times. Keep repeating this until the SELECT returns no rows.
>
> Does this help?
>


Yes, this helps, thanks.

Althought I probably have to put the grouped result in a temporary table
since MySQL (or no DBE at all?) doesn't support DELETEing and sub-QUERYing
from the same table.

Regards,
André


Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #6 (permalink)  
Old 02-28-2008, 07:21 AM
Bill Karwin
 
Posts: n/a
Default Re: Finding duplicates without primary key

"André Hänsel" <andre@webkr.de> wrote in message
news:dsjir5$pur$1@sagnix.uni-muenster.de...
> Althought I probably have to put the grouped result in a temporary table
> since MySQL (or no DBE at all?) doesn't support DELETEing and sub-QUERYing
> from the same table.


I would just fetch the rows from the SELECT in a loop in some application
code, and issue the DELETE operations based on these values.

SQL was designed to be used in cooperation with an application language.

Regards,
Bill K.


Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
Reply


Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are On
Forum Jump


All times are GMT. The time now is 09:06 AM.


Powered by vBulletin® Version 3.6.5
Copyright ©2000 - 2008, Jelsoft Enterprises Ltd.
SEO by vBSEO 3.2.0
www.UnixAdminTalk.com