Unix Technical Forum

Finding orphan records

This is a discussion on Finding orphan records within the Pgsql General forums, part of the PostgreSQL category; --> I'm trying to find/delete all records in table A that are no longer referenced by tables B or C. ...


Go Back   Unix Technical Forum > Database Server Software > PostgreSQL > Pgsql General

Register FAQ Members List Calendar Search Today's Posts Mark Forums Read
  #1 (permalink)  
Old 04-09-2008, 07:49 AM
Wes
 
Posts: n/a
Default Finding orphan records

I'm trying to find/delete all records in table A that are no longer
referenced by tables B or C. There are about 4 million records in table A,
and several hundred million in tables B and C.

Is there something more efficient than:

select address_key, address from addresses where ( not exists(select 1 from
B where BField=addresses.address_key limit 1) ) and ( not exists(select 1
from C where CField=addresses.address_key limit 1) )

Of course, all fields above are indexed.

There are foreign key references in B and C to A. Is there some way to
safely leverage that?

Wes



---------------------------(end of broadcast)---------------------------
TIP 4: Have you searched our list archives?

http://archives.postgresql.org

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #2 (permalink)  
Old 04-09-2008, 07:49 AM
Jonel Rienton
 
Posts: n/a
Default Re: Finding orphan records

I would use an outer join and check only those null-value records in the
right table with id's referencing table A

Sample query:

select a.*,b.* from a
left outer join b on a.id = b.a_id -- assuming a_id is my referencing
column to a
where b.id is null;

This will yield all columns in table a which has a null value on table b

This is just from the top of my head, just a concept, I might have some
syntax error.

-----Original Message-----
From: pgsql-general-owner@postgresql.org
[mailtogsql-general-owner@postgresql.org] On Behalf Of Wes
Sent: Wednesday, January 11, 2006 11:51 PM
To: Postgresql-General
Subject: [GENERAL] Finding orphan records

I'm trying to find/delete all records in table A that are no longer
referenced by tables B or C. There are about 4 million records in table A,
and several hundred million in tables B and C.

Is there something more efficient than:

select address_key, address from addresses where ( not exists(select 1 from
B where BField=addresses.address_key limit 1) ) and ( not exists(select 1
from C where CField=addresses.address_key limit 1) )

Of course, all fields above are indexed.

There are foreign key references in B and C to A. Is there some way to
safely leverage that?

Wes



---------------------------(end of broadcast)---------------------------
TIP 4: Have you searched our list archives?

http://archives.postgresql.org

--
No virus found in this incoming message.
Checked by AVG Free Edition.
Version: 7.1.371 / Virus Database: 267.14.16/225 - Release Date: 1/9/2006



---------------------------(end of broadcast)---------------------------
TIP 2: Don't 'kill -9' the postmaster

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #3 (permalink)  
Old 04-09-2008, 07:49 AM
Jonel Rienton
 
Posts: n/a
Default Re: Finding orphan records

Resending sample query, darn where clause didn't wrap

select a.*,b.* from a
left outer join b on a.id = b.a_id
where b.id is null;

-----Original Message-----
From: pgsql-general-owner@postgresql.org
[mailtogsql-general-owner@postgresql.org] On Behalf Of Wes
Sent: Wednesday, January 11, 2006 11:51 PM
To: Postgresql-General
Subject: [GENERAL] Finding orphan records

I'm trying to find/delete all records in table A that are no longer
referenced by tables B or C. There are about 4 million records in table A,
and several hundred million in tables B and C.

Is there something more efficient than:

select address_key, address from addresses where ( not exists(select 1 from
B where BField=addresses.address_key limit 1) ) and ( not exists(select 1
from C where CField=addresses.address_key limit 1) )

Of course, all fields above are indexed.

There are foreign key references in B and C to A. Is there some way to
safely leverage that?

Wes



---------------------------(end of broadcast)---------------------------
TIP 4: Have you searched our list archives?

http://archives.postgresql.org

--
No virus found in this incoming message.
Checked by AVG Free Edition.
Version: 7.1.371 / Virus Database: 267.14.16/225 - Release Date: 1/9/2006



---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
choose an index scan if your joining column's datatypes do not
match

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #4 (permalink)  
Old 04-09-2008, 07:49 AM
Wes
 
Posts: n/a
Default Re: Finding orphan records

On 1/12/06 12:23 AM, "Jonel Rienton" <jonel@rientongroup.com> wrote:

> Resending sample query, darn where clause didn't wrap
>
> select a.*,b.* from a
> left outer join b on a.id = b.a_id
> where b.id is null;


I tried something along those lines a while back, and it was orders of
magnitude slower. The above produces:

explain select address_key, address from addresses left outer join messages
on addresses.address_key=originator where originator is null;
QUERY PLAN
----------------------------------------------------------------------------
---------------------
Merge Left Join (cost=35684870.14..38457295.97 rows=4090203 width=40)
Merge Cond: ("outer".address_key = "inner".originator)
Filter: ("inner".originator IS NULL)
-> Index Scan using addresses_pkey on addresses (cost=0.00..97213.17
rows=4090203 width=40)
-> Sort (cost=35684870.14..36129462.74 rows=177837040 width=11)
Sort Key: messages.originator
-> Seq Scan on messages (cost=0.00..7215718.40 rows=177837040
width=11)


This appears to be very inefficient. B is almost two orders of magnitude
larger than A. C is about 3-4 times as big as B (record count). My
statement (with the same single 'B' table as above) produces:

narc=> explain select address_key, address from addresses where ( not
exists(select 1 from messages where originator=addresses.address_key limit
1) );
QUERY PLAN
----------------------------------------------------------------------------
--------------------------------
Seq Scan on addresses (cost=0.00..3398462.98 rows=2045102 width=40)
Filter: (NOT (subplan))
SubPlan
-> Limit (cost=0.00..0.81 rows=1 width=0)
-> Index Scan using messages_i_orig_mdate on messages
(cost=0.00..35148.46 rows=43301 width=0)
Index Cond: (originator = $0)


Which seems like it should be much more efficient.

Wes



---------------------------(end of broadcast)---------------------------
TIP 5: don't forget to increase your free space map settings

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #5 (permalink)  
Old 04-09-2008, 07:50 AM
Greg Stark
 
Posts: n/a
Default Re: Finding orphan records


Wes <wespvp@syntegra.com> writes:

> This appears to be very inefficient. B is almost two orders of magnitude
> larger than A. C is about 3-4 times as big as B (record count). My
> statement (with the same single 'B' table as above) produces:


If it's only a factor of 3-4 then the merge join should be faster. If it's
really two orders of magnitude (100x?) then the nested loop below would be
faster. I think in 8.1 (and I think in 8.0 too) the planner is capable of
coming up with both plans for the NOT IN query though.

--
greg


---------------------------(end of broadcast)---------------------------
TIP 4: Have you searched our list archives?

http://archives.postgresql.org

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #6 (permalink)  
Old 04-09-2008, 07:50 AM
Wes
 
Posts: n/a
Default Re: Finding orphan records

On 1/12/06 10:37 AM, "Greg Stark" <gsstark@mit.edu> wrote:

> If it's only a factor of 3-4 then the merge join should be faster. If it's
> really two orders of magnitude (100x?) then the nested loop below would be
> faster. I think in 8.1 (and I think in 8.0 too) the planner is capable of
> coming up with both plans for the NOT IN query though.


Yep, it's really two orders of magnitude. A is about 4 million records. B
is around 200 million, and C is about 800 million records. That's why I
figured a sequential pass on A with index lookups on B and C would be
fastest. It took about 3.25 hours to complete the SELECT.

I'm working on the 8.1 upgrade now. My biggest concern is making sure no
bugs have crept in to ecpg that would affect us (bitten once before, so
being careful). So far everything else is looking good with 8.1.

I'm pondering dumping the keys for A to a file, and B and C to another file,
sorting with uniqueness on the B/C file, then programmatically determining
which keys have been freed. I'm pretty sure this will be much faster, but I
had hoped to avoid an external process like this.

Wes



---------------------------(end of broadcast)---------------------------
TIP 3: Have you checked our extensive FAQ?

http://www.postgresql.org/docs/faq

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
Reply


Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are On
Forum Jump


All times are GMT. The time now is 05:27 AM.


Powered by vBulletin® Version 3.6.5
Copyright ©2000 - 2008, Jelsoft Enterprises Ltd.
SEO by vBSEO 3.2.0
www.UnixAdminTalk.com