Unix Technical Forum

Join question

This is a discussion on Join question within the pgsql Sql forums, part of the PostgreSQL category; --> Hey All, I have a query I'm trying to speed up, and I was hoping someone could help me. ...


Go Back   Unix Technical Forum > Database Server Software > PostgreSQL > pgsql Sql

FAQ Members List Calendar Search Today's Posts Mark Forums Read
  #1 (permalink)  
Old 04-19-2008, 05:49 PM
tyrrill_ed@emc.com
 
Posts: n/a
Default Join question

Hey All,

I have a query I'm trying to speed up, and I was hoping someone could
help me. I have a three tables a and b hold data, and c just references
between a and b:

create table a (
a_id int,
x int
);

create table b (
b_id int,
x int
);

create table c (
a_id int,
b_id int
);

I am doing a query like this:

SELECT a.x, max(b.x) FROM a, b, c WHERE a.a_id = c.a_id AND b.b_id =
c.b_id GROUP by a.x;

I only need to get one row from b for each row in a, and it really
doesn't matter which one. I use max() to get a single value from table
b. There are generally be dozens to hundreds of rows in b for each row
in a. The problem is when I have a query with tens of thousands of rows
in a that the join with b will have millions of rows, and is really
slow. The group by effectively reduces the results down to what I want,
but it still has to process the millions of rows. Does anyone know a
way I could restructure this query to get only one b for each a in a
faster way?

Thanks,
Ed Tyrrill

---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #2 (permalink)  
Old 04-19-2008, 05:49 PM
Richard Broersma Jr
 
Posts: n/a
Default Re: Join question

--- tyrrill_ed@emc.com wrote:

> create table c (
> a_id int,
> b_id int
> );
>
> I am doing a query like this:
>
> SELECT a.x, max(b.x) FROM a, b, c WHERE a.a_id = c.a_id AND b.b_id =
> c.b_id GROUP by a.x;
>
> I only need to get one row from b for each row in a, and it really
> doesn't matter which one. I use max() to get a single value from table
> b. There are generally be dozens to hundreds of rows in b for each row
> in a. The problem is when I have a query with tens of thousands of rows
> in a that the join with b will have millions of rows, and is really
> slow. The group by effectively reduces the results down to what I want,
> but it still has to process the millions of rows. Does anyone know a
> way I could restructure this query to get only one b for each a in a
> faster way?


CREATE INDEX table_c_foreign_key ON c ( a, b );

SELECT a.x, b.x
FROM ( SELECT DISTINCT( a_id ) a_id, b_id
FROM c ) AS c( a_id, b_id )
INNER JOIN a
ON c.a_id = a.id
INNER JOIN b
ON c.b_id = b.id;

Regards,
Richard Broersma Jr.




---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #3 (permalink)  
Old 04-19-2008, 05:49 PM
Michael Glaesemann
 
Posts: n/a
Default Re: Join question


On Aug 21, 2007, at 12:48 , tyrrill_ed@emc.com wrote:

> SELECT a.x, max(b.x) FROM a, b, c WHERE a.a_id = c.a_id AND b.b_id =
> c.b_id GROUP by a.x;


Shouldn't affect performance, but another way to write this which you
may find more readable is to list your join conditions with the joins
rather than grouping them all in the WHERE clause:

SELECT a.x, max(b.x)
FROM a
JOIN c USING (a_id)
JOIN b USING (b_id)
GROUP BY a.x;

You can also write this using NATURAL JOIN which joins on like-named
columns.

SELECT a.x, max(b.x)
FROM a
NATURAL JOIN c
JOIN b USING (b_id)
GROUP BY a.x;

You can't use NATURAL JOIN to join b because you've got columns named
x in both a and b that you're *not* joining on.

Another nice thing about the USING and NATURAL JOIN syntax is that
the result includes only one column for the joined columns, rather
than two. In this case, there would be only one b_id and one a_id
column in the result set. Using ON or putting the join condition in
the WHERE clause puts two b_id and two a_id columns in the result set.

> Does anyone know a
> way I could restructure this query to get only one b for each a in a
> faster way?


You might want to ask on the pgsql-performance list as well, as
people there are generally interested in improving query performance.
One thing they'll ask you for is the output of EXPLAIN ANALYZE for
your query.

Here's an alternative, but I don't know how it'd compare in terms of
performance:

EXPLAIN ANALYZE
SELECT DISTINCT ON (a.x)
a.x, b.x
FROM a
NATURAL JOIN c
JOIN b USING (b_id);

Give that a shot. (DISTINCT ON is a non-standard PostgreSQL
extension, if non-portable syntax is something you're looking to avoid.)

EXPLAIN ANALYZE shows you how the planner decided to proceed with the
query, which can be useful when comparing alternatives, so you can
compare using DISTINCT ON with your own query using max.

Hope this helps.

Michael Glaesemann
grzm seespotcode net



---------------------------(end of broadcast)---------------------------
TIP 4: Have you searched our list archives?

http://archives.postgresql.org

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
Reply


Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are On
Forum Jump


All times are GMT. The time now is 05:11 AM.


Powered by vBulletin® Version 3.6.5
Copyright ©2000 - 2008, Jelsoft Enterprises Ltd.
SEO by vBSEO 3.2.0
www.UnixAdminTalk.com