Unix Technical Forum

A counter argument about DISTINCT and GROUP BY in PostgreSQL

This is a discussion on A counter argument about DISTINCT and GROUP BY in PostgreSQL within the Pgsql General forums, part of the PostgreSQL category; --> I've just spent a few hours searching and reading about the postgres way of selecting distinct records. I understand ...


Go Back   Unix Technical Forum > Database Server Software > PostgreSQL > Pgsql General

FAQ Members List Calendar Search Today's Posts Mark Forums Read
  #1 (permalink)  
Old 04-09-2008, 11:16 PM
dterrors@hotmail.com
 
Posts: n/a
Default A counter argument about DISTINCT and GROUP BY in PostgreSQL


I've just spent a few hours searching and reading about the postgres
way of selecting distinct records. I understand the points made about
the ORDER BY limitation of DISTINCT ON, and the abiguity of GROUP BY,
but I think there's a (simple, common) case that have been missed in
the discussion. Here is my sitation:

table "projects":
id title more stuff (pretend there's 20 more columns.)
-----------------------------------------------------------
1 buildrome moredata inothercolumns
2 housework evenmoredata letssay20columns

table "todos":
id projectid name duedate
-----------------------------------------
1 1 conquer 1pm
2 1 laybricks 10pm
3 2 dolaundry 5pm


In english, I want to "select projects and order them by the ones that
have todos due the soonest." Does that sound like a reasonable
request?

This won't work in postgresql 8.2.4:

select distinct on(a.id) a.* from projects a inner join todos b on
b.projectid = a.id order by b.duedate offset 10 limit 20;
ERROR: SELECT DISTINCT ON expressions must match initial ORDER BY
expressions

What to do? I could:

Option A. select distinct on(b.duedate, a.id) a.* from projects a
inner join todos b on b.projectid = a.id order by b.duedate, a.id
offset 10 limit 20;

But that's not equivalent because the duedates will break uniqueness.
The results will have two records where the a.id is 1. So I guess
that's a non-option.

Option B. I could try group by:
select a.* from projects a inner join todos b on b.projectid = a.id
group by a.id order by b.duedate offset 10 limit 20;
Query failed: ERROR: column "a.title" must appear in the GROUP BY
clause or be used in an aggregate function

(And presumably every other column of projects would too).

Option C. I could list out each and every one of the 20+ columns of
the project table in MAX() functions so that they'd be "in an
aggregate function":
select MAX(a.title) as title, MAX(a.column2) as column2,
MAX(a.column3) as column3, MAX(a.column4) as column4, MAX(a.column5)
as column5, MAX(a.column6) as column6, MAX(a.column7) as column7,
MAX(a.column8) as column8, MAX(a.column9) as column9, MAX(a.column8)
as ihatepostgresql, MAX(a.column9) as imgoingbacktomysql, from
projects a inner join todos b on b.projectid = a.id group by a.id
order by b.duedate offset 10 limit 20;

http://groups.google.com/group/comp....3aed5b8ea5faa1

Option D: select distinct a.* from projects a inner join todos b on
b.projectid = a.id order by b.duedate offset 10 limit 20;
Query failed: ERROR: for SELECT DISTINCT, ORDER BY expressions must
appear in select list

But I don't want b.duedate in my results. And if I add it, it will,
again, break the desired uniqueness (see option A). (And, actually
even if that one worked, it would suck. Instead of making postgres
compare only one column to determine uniqueness, it now has to compare
all of them. (In other words, the whole advantage of DISTINCT ON is
out the window).)

Option E: I could use a subselect. But notice my offset, limit. If I
use a subselect, then postgresql would have to build ALL of the
results in memory (to create the subselect virtual table), before I
apply the offset and limit on the subselect.

Any suggestion would be appreciated.

BTW for those of you who are curious, in mysql (that other db), this
would be:

select a.* from projects a inner join todos b on b.projectid = a.id
group by a.id order by b.duedate limit 10,20;

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #2 (permalink)  
Old 04-09-2008, 11:17 PM
Gregory Stark
 
Posts: n/a
Default Re: A counter argument about DISTINCT and GROUP BY in PostgreSQL

<dterrors@hotmail.com> writes:

> I've just spent a few hours searching and reading about the postgres
> way of selecting distinct records. I understand the points made about
> the ORDER BY limitation of DISTINCT ON, and the abiguity of GROUP BY,
> but I think there's a (simple, common) case that have been missed in
> the discussion. Here is my sitation:
>
> table "projects":
> id title more stuff (pretend there's 20 more columns.)
> -----------------------------------------------------------
> 1 buildrome moredata inothercolumns
> 2 housework evenmoredata letssay20columns
>
> table "todos":
> id projectid name duedate
> -----------------------------------------
> 1 1 conquer 1pm
> 2 1 laybricks 10pm
> 3 2 dolaundry 5pm
>
>
> In english, I want to "select projects and order them by the ones that
> have todos due the soonest." Does that sound like a reasonable
> request?


SELECT *
FROM (
SELECT DISTINCT ON (projects.id) projects.*
FROM projects
JOIN todos ON (todos.projectid = projects.id)
ORDER BY projects.id, projects.duedate ASC
)
ORDER BY duedate ASC
OFFSET 10
LIMIT 20

> Option E: I could use a subselect. But notice my offset, limit. If I
> use a subselect, then postgresql would have to build ALL of the
> results in memory (to create the subselect virtual table), before I
> apply the offset and limit on the subselect.


Don't assume Postgres has to do things a particular way just because there's a
subselect involved. In this case however I expect Postgres would have to build
the results in memory, but not because of the subselect, just because that's
the only way to do what you're asking.

You're asking for it to pick out distinct values according to one sort key
then return the results sorted according to another key. Even if you had an
index for the first key or Postgres used a hash to perform the distinct, the
ORDER BY will require a sort.

Another way to do it would be:

SELECT *,
(select min(duedate) from todos where projectid = projects.id) as duedate
FROM projects
ORDER BY duedate
OFFSET 10
LIMIT 20


> Any suggestion would be appreciated.
>
> BTW for those of you who are curious, in mysql (that other db), this
> would be:
>
> select a.* from projects a inner join todos b on b.projectid = a.id
> group by a.id order by b.duedate limit 10,20;


And what does the plan look like?


--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's PostGIS support!

---------------------------(end of broadcast)---------------------------
TIP 3: Have you checked our extensive FAQ?

http://www.postgresql.org/docs/faq

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #3 (permalink)  
Old 04-09-2008, 11:17 PM
Alvaro Herrera
 
Posts: n/a
Default Re: A counter argument about DISTINCT and GROUP BY inPostgreSQL

dterrors@hotmail.com wrote:
>
> I've just spent a few hours searching and reading about the postgres
> way of selecting distinct records. I understand the points made about
> the ORDER BY limitation of DISTINCT ON, and the abiguity of GROUP BY,
> but I think there's a (simple, common) case that have been missed in
> the discussion. Here is my sitation:


Did you try putting the DISTINCT ON in a subselect, and the ORDER BY in
an outer select?

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
choose an index scan if your joining column's datatypes do not
match

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #4 (permalink)  
Old 04-09-2008, 11:18 PM
dterrors@hotmail.com
 
Posts: n/a
Default Re: A counter argument about DISTINCT and GROUP BY in PostgreSQL

On Jan 4, 11:48 am, st...@enterprisedb.com (Gregory Stark) wrote:
> <dterr...@hotmail.com> writes:
> > I've just spent a few hours searching and reading about the postgres
> > way of selecting distinct records. I understand the points made about
> > the ORDER BY limitation of DISTINCT ON, and the abiguity of GROUP BY,
> > but I think there's a (simple, common) case that have been missed in
> > the discussion. Here is my sitation:

>
> > table "projects":
> > id title more stuff (pretend there's 20 more columns.)
> > -----------------------------------------------------------
> > 1 buildrome moredata inothercolumns
> > 2 housework evenmoredata letssay20columns

>
> > table "todos":
> > id projectid name duedate
> > -----------------------------------------
> > 1 1 conquer 1pm
> > 2 1 laybricks 10pm
> > 3 2 dolaundry 5pm

>
> > In english, I want to "select projects and order them by the ones that
> > have todos due the soonest." Does that sound like a reasonable
> > request?

>
> SELECT *
> FROM (
> SELECT DISTINCT ON (projects.id) projects.*
> FROM projects
> JOIN todos ON (todos.projectid = projects.id)
> ORDER BY projects.id, projects.duedate ASC
> )
> ORDER BY duedate ASC
> OFFSET 10
> LIMIT 20
>
> > Option E: I could use a subselect. But notice my offset, limit. If I
> > use a subselect, then postgresql would have to build ALL of the
> > results in memory (to create the subselect virtual table), before I
> > apply the offset and limit on the subselect.

>
> Don't assume Postgres has to do things a particular way just because there's a
> subselect involved. In this case however I expect Postgres would have to build
> the results in memory, but not because of the subselect, just because that's
> the only way to do what you're asking.


When you say it would build the results in memory, do you mean the
entire subselected table? The subselect in your example doesn't do
any offset, limit. (And, do you think what I'm asking for is odd or
unusual? I can think of a hundred examples besides a todo list.)

> You're asking for it to pick out distinct values according to one sort key
> then return the results sorted according to another key. Even if you had an
> index for the first key or Postgres used a hash to perform the distinct, the
> ORDER BY will require a sort.


I'm not trying to avoid doing a sort, actually.

> > Any suggestion would be appreciated.

>
> > BTW for those of you who are curious, in mysql (that other db), this
> > would be:

>
> > select a.* from projects a inner join todos b on b.projectid = a.id
> > group by a.id order by b.duedate limit 10,20;

>
> And what does the plan look like?


It looks great in mysql! Works perfectly- that was from my old mysql
code before I switched, or well tried to switch to postgres. I get:

id title more stuff....
-----------------------------------------------------------
1 buildrome moredata inothercolumns
2 housework evenmoredata letssay20columns

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #5 (permalink)  
Old 04-09-2008, 11:18 PM
dterrors@hotmail.com
 
Posts: n/a
Default Re: A counter argument about DISTINCT and GROUP BY in PostgreSQL

On Jan 4, 11:48 am, st...@enterprisedb.com (Gregory Stark) wrote:
> <dterr...@hotmail.com> writes:
> > I've just spent a few hours searching and reading about the postgres
> > way of selecting distinct records. I understand the points made about
> > the ORDER BY limitation of DISTINCT ON, and the abiguity of GROUP BY,
> > but I think there's a (simple, common) case that have been missed in
> > the discussion. Here is my sitation:

>
> > table "projects":
> > id title more stuff (pretend there's 20 more columns.)
> > -----------------------------------------------------------
> > 1 buildrome moredata inothercolumns
> > 2 housework evenmoredata letssay20columns

>
> > table "todos":
> > id projectid name duedate
> > -----------------------------------------
> > 1 1 conquer 1pm
> > 2 1 laybricks 10pm
> > 3 2 dolaundry 5pm

>
> > In english, I want to "select projects and order them by the ones that
> > have todos due the soonest." Does that sound like a reasonable
> > request?

>
> SELECT *
> FROM (
> SELECT DISTINCT ON (projects.id) projects.*
> FROM projects
> JOIN todos ON (todos.projectid = projects.id)
> ORDER BY projects.id, projects.duedate ASC
> )
> ORDER BY duedate ASC
> OFFSET 10
> LIMIT 20


Actually, this won't work. You meant to type "todos.duedate", right?
Projects don't have duedates, todos due. So if you change it to
"todos.duedate", you'll have the same error that my queries did.

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #6 (permalink)  
Old 04-09-2008, 11:18 PM
Gregory Stark
 
Posts: n/a
Default Re: A counter argument about DISTINCT and GROUP BY in PostgreSQL

<dterrors@hotmail.com> writes:

> On Jan 4, 11:48 am, st...@enterprisedb.com (Gregory Stark) wrote:
>
>> And what does the plan look like?

>
> It looks great in mysql!


Like what?

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's Slony Replication support!

---------------------------(end of broadcast)---------------------------
TIP 2: Don't 'kill -9' the postmaster

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #7 (permalink)  
Old 04-09-2008, 11:18 PM
dterrors@hotmail.com
 
Posts: n/a
Default Re: A counter argument about DISTINCT and GROUP BY in PostgreSQL

On Jan 16, 7:36 pm, st...@enterprisedb.com (Gregory Stark) wrote:
> <dterr...@hotmail.com> writes:
> > On Jan 4, 11:48 am, st...@enterprisedb.com (Gregory Stark) wrote:

>
> >> And what does the plan look like?

>
> > It looks great in mysql!

>
> Like what?
>


Er I just showed you in my response:

> It looks great in mysql! Works perfectly- that was from my old mysql
> code before I switched, or well tried to switch to postgres. I get:
>
> id title more stuff....
> -----------------------------------------------------------
> 1 buildrome moredata inothercolumns
> 2 housework evenmoredata letssay20columns


Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
Reply


Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are On
Forum Jump


All times are GMT. The time now is 05:36 PM.


Powered by vBulletin® Version 3.6.5
Copyright ©2000 - 2008, Jelsoft Enterprises Ltd.
SEO by vBSEO 3.2.0
www.UnixAdminTalk.com