Unix Technical Forum

quickly getting the top N rows

This is a discussion on quickly getting the top N rows within the Pgsql Performance forums, part of the PostgreSQL category; --> If I have this: create table foo (bar int primary key); ....then in my ideal world, Postgres would be ...


Go Back   Unix Technical Forum > Database Server Software > PostgreSQL > Pgsql Performance

FAQ Members List Calendar Search Today's Posts Mark Forums Read
  #1 (permalink)  
Old 04-19-2008, 10:38 AM
Ben
 
Posts: n/a
Default quickly getting the top N rows

If I have this:

create table foo (bar int primary key);

....then in my ideal world, Postgres would be able to use that index on bar
to help me with this:

select bar from foo order by bar desc limit 20;

But in my experience, PG8.2 is doing a full table scan on foo, then
sorting it, then doing the limit. I have a more complex primary key, but I
was hoping the same concept would still apply. Am I doing something wrong,
or just expecting something that doesn't exist?

---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
choose an index scan if your joining column's datatypes do not
match

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #2 (permalink)  
Old 04-19-2008, 10:38 AM
Mark Lewis
 
Posts: n/a
Default Re: quickly getting the top N rows

On Thu, 2007-10-04 at 11:00 -0700, Ben wrote:
> If I have this:
>
> create table foo (bar int primary key);
>
> ...then in my ideal world, Postgres would be able to use that index on bar
> to help me with this:
>
> select bar from foo order by bar desc limit 20;
>
> But in my experience, PG8.2 is doing a full table scan on foo, then
> sorting it, then doing the limit. I have a more complex primary key, but I
> was hoping the same concept would still apply. Am I doing something wrong,
> or just expecting something that doesn't exist?


It has to do with the way that NULL values are stored in the index.
This page has details and instructions for how to get it to work:

http://developer.postgresql.org/pgdo...-ordering.html

-- Mark Lewis

---------------------------(end of broadcast)---------------------------
TIP 1: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #3 (permalink)  
Old 04-19-2008, 10:38 AM
Bill Moran
 
Posts: n/a
Default Re: quickly getting the top N rows

In response to Ben <bench@silentmedia.com>:

> If I have this:
>
> create table foo (bar int primary key);
>
> ...then in my ideal world, Postgres would be able to use that index on bar
> to help me with this:
>
> select bar from foo order by bar desc limit 20;
>
> But in my experience, PG8.2 is doing a full table scan on foo, then
> sorting it, then doing the limit. I have a more complex primary key, but I
> was hoping the same concept would still apply. Am I doing something wrong,
> or just expecting something that doesn't exist?


Show us the explain.

However, 2 guesses:
1) You never analyzed the table, thus PG has awful statistics and
doesn't know how to pick a good plan.
2) You have so few rows in the table that a seq scan is actually
faster than an index scan, which is why PG uses it instead.

--
Bill Moran
Collaborative Fusion Inc.
http://people.collaborativefusion.com/~wmoran/

wmoran@collaborativefusion.com
Phone: 412-422-3463x4023

---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
choose an index scan if your joining column's datatypes do not
match

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #4 (permalink)  
Old 04-19-2008, 10:38 AM
Andreas Kretschmer
 
Posts: n/a
Default Re: quickly getting the top N rows

Ben <bench@silentmedia.com> schrieb:

> If I have this:
>
> create table foo (bar int primary key);
>
> ...then in my ideal world, Postgres would be able to use that index on bar
> to help me with this:
>
> select bar from foo order by bar desc limit 20;
>
> But in my experience, PG8.2 is doing a full table scan on foo, then sorting
> it, then doing the limit. I have a more complex primary key, but I was


Please show us the output from

EXPLAIN ANALYSE select bar from foo order by bar desc limit 20;

I try it, with 8.1, and i can see an index scan. You have, maybe, wrong
statistics ot not enough (to few) rows in your table.


Andreas
--
Really, I'm not out to destroy Microsoft. That will just be a completely
unintentional side effect. (Linus Torvalds)
"If I was god, I would recompile penguin with --enable-fly." (unknow)
Kaufbach, Saxony, Germany, Europe. N 51.05082°, E 13.56889°

---------------------------(end of broadcast)---------------------------
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #5 (permalink)  
Old 04-19-2008, 10:38 AM
Ben
 
Posts: n/a
Default Re: quickly getting the top N rows

On Thu, 4 Oct 2007, Bill Moran wrote:

> However, 2 guesses:
> 1) You never analyzed the table, thus PG has awful statistics and
> doesn't know how to pick a good plan.
> 2) You have so few rows in the table that a seq scan is actually
> faster than an index scan, which is why PG uses it instead.


No, the tables are recently analyzed and there are a couple hundred
thousand rows in there. But I think I just figured it out.... it's a
3-column index, and two columns of that index are the same for every row.
When I drop those two columns from the ordering restriction, the index
gets used and things speed up 5 orders of magnitude.

Maybe the planner is smart enough to think that if a column in the order
by clause is identical for most rows, then using an index won't help....
but not smart enough to realize that if said column is at the *end* of the
order by arguments, after columns which do sort quite well, then it should
use an index after all.

---------------------------(end of broadcast)---------------------------
TIP 5: don't forget to increase your free space map settings

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #6 (permalink)  
Old 04-19-2008, 10:38 AM
Tom Lane
 
Posts: n/a
Default Re: quickly getting the top N rows

Ben <bench@silentmedia.com> writes:
> No, the tables are recently analyzed and there are a couple hundred
> thousand rows in there. But I think I just figured it out.... it's a
> 3-column index, and two columns of that index are the same for every row.
> When I drop those two columns from the ordering restriction, the index
> gets used and things speed up 5 orders of magnitude.


> Maybe the planner is smart enough to think that if a column in the order
> by clause is identical for most rows, then using an index won't help....
> but not smart enough to realize that if said column is at the *end* of the
> order by arguments, after columns which do sort quite well, then it should
> use an index after all.


You're being about as clear as mud here, except that you obviously lied
about what you were doing in your first message. If you have a planner
problem, show us the *exact* query, the *exact* table definition, and
unfaked EXPLAIN ANALYZE output.

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 2: Don't 'kill -9' the postmaster

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #7 (permalink)  
Old 04-19-2008, 10:39 AM
Ben
 
Posts: n/a
Default Re: quickly getting the top N rows

On Thu, 4 Oct 2007, Tom Lane wrote:

> You're being about as clear as mud here, except that you obviously lied
> about what you were doing in your first message. If you have a planner
> problem, show us the *exact* query, the *exact* table definition, and
> unfaked EXPLAIN ANALYZE output.


I didn't realize that simplification was viewed as so sinister, but
thanks, I'll remember that in the future.

The table:
Table "public.log"
Column | Type | Modifiers
----------------+-----------------------------+---------------------
clientkey | character(30) | not null
premiseskey | character(30) | not null
logkey | character(30) | not null
logicaldel | character(1) | default 'N'::bpchar
lockey | character(30) |
devlockey | character(30) |
eventkey | character(30) |
logshorttext | character varying(255) |
logdesc | character varying(255) |
loguserkey | character(30) |
logassetkey | character(30) |
logresourcekey | character(30) |
logtime | timestamp without time zone |
logip | character varying(50) |
logarchived | character(1) |
logarchivedate | timestamp without time zone |
loghasvideo | character(1) |
loghasaudio | character(1) |
resvehiclekey | character(30) |
synccreated | character(1) |
logtypekey | character(30) |
Indexes:
"log_pkey" PRIMARY KEY, btree (clientkey, premiseskey, logkey)
"eventkey_idx" btree (eventkey),
"log_ak1" btree (clientkey, premiseskey, logtime, logkey)


The original, slow query:

explain analyze SELECT * FROM log WHERE clientkey in
('000000004500000000010000000001') AND premiseskey in
('000000004500000000010000000001') and logicaldel = 'N'
ORDER BY logtime desc, logkey desc, clientkey desc, premiseskey desc LIMIT 20 offset 0;

QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=356402.58..356402.63 rows=20 width=563) (actual time=215858.481..215858.527 rows=20 loops=1)
-> Sort (cost=356402.58..357598.25 rows=478267 width=563) (actual time=215858.478..215858.498 rows=20 loops=1)
Sort Key: logtime, logkey, clientkey, premiseskey
-> Seq Scan on log (cost=0.00..52061.67 rows=478267 width=563) (actual time=29.340..100043.313 rows=475669 loops=1)
Filter: ((clientkey = '000000004500000000010000000001'::bpchar) AND (premiseskey = '000000004500000000010000000001'::bpchar) AND (logicaldel = 'N'::bpchar))
Total runtime: 262462.582 ms
(6 rows)


Every row in log has identical clientkey and premiseskey values, so if I
just remove those columns from the order by clause, I get this far
superior plan:

explain analyze SELECT * FROM log WHERE clientkey in
('000000004500000000010000000001') AND premiseskey in
('000000004500000000010000000001') and logicaldel = 'N'
ORDER BY logtime desc, logkey desc LIMIT 20 offset 0;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.00..12.33 rows=20 width=563) (actual time=0.047..0.105 rows=20 loops=1)
-> Index Scan Backward using log_ak1 on log (cost=0.00..294735.70 rows=478267 width=563) (actual time=0.044..0.076 rows=20 loops=1)
Index Cond: ((clientkey = '000000004500000000010000000001'::bpchar) AND (premiseskey = '000000004500000000010000000001'::bpchar))
Filter: (logicaldel = 'N'::bpchar)
Total runtime: 0.165 ms
(5 rows)


....which made me to think that maybe postgres is not using log_ak1 in the
former case because two of the columns in the order by match every row.

Unfortunately, in this case it's not an option to alter the query. I'm
just trying to figure out an explaination.

---------------------------(end of broadcast)---------------------------
TIP 1: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #8 (permalink)  
Old 04-19-2008, 10:39 AM
Simon Riggs
 
Posts: n/a
Default Re: quickly getting the top N rows

On Thu, 2007-10-04 at 12:52 -0700, Ben wrote:

> The original, slow query:
>
> explain analyze SELECT * FROM log WHERE clientkey in
> ('000000004500000000010000000001') AND premiseskey in
> ('000000004500000000010000000001') and logicaldel = 'N'
> ORDER BY logtime desc, logkey desc, clientkey desc, premiseskey desc LIMIT 20 offset 0;
>
> QUERY PLAN
> ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> Limit (cost=356402.58..356402.63 rows=20 width=563) (actual time=215858.481..215858.527 rows=20 loops=1)
> -> Sort (cost=356402.58..357598.25 rows=478267 width=563) (actual time=215858.478..215858.498 rows=20 loops=1)
> Sort Key: logtime, logkey, clientkey, premiseskey
> -> Seq Scan on log (cost=0.00..52061.67 rows=478267 width=563) (actual time=29.340..100043.313 rows=475669 loops=1)
> Filter: ((clientkey = '000000004500000000010000000001'::bpchar) AND (premiseskey = '000000004500000000010000000001'::bpchar) AND (logicaldel = 'N'::bpchar))
> Total runtime: 262462.582 ms
> (6 rows)
>
>
> Every row in log has identical clientkey and premiseskey values, so if I
> just remove those columns from the order by clause, I get this far
> superior plan:
>
> explain analyze SELECT * FROM log WHERE clientkey in
> ('000000004500000000010000000001') AND premiseskey in
> ('000000004500000000010000000001') and logicaldel = 'N'
> ORDER BY logtime desc, logkey desc LIMIT 20 offset 0;
> QUERY PLAN
> -------------------------------------------------------------------------------------------------------------------------------------------
> Limit (cost=0.00..12.33 rows=20 width=563) (actual time=0.047..0.105 rows=20 loops=1)
> -> Index Scan Backward using log_ak1 on log (cost=0.00..294735.70 rows=478267 width=563) (actual time=0.044..0.076 rows=20 loops=1)
> Index Cond: ((clientkey = '000000004500000000010000000001'::bpchar) AND (premiseskey = '000000004500000000010000000001'::bpchar))
> Filter: (logicaldel = 'N'::bpchar)
> Total runtime: 0.165 ms
> (5 rows)
>
>
> ...which made me to think that maybe postgres is not using log_ak1 in the
> former case because two of the columns in the order by match every row.
>
> Unfortunately, in this case it's not an option to alter the query. I'm
> just trying to figure out an explaination.


In the first query, Postgres cannot use the index because the sort order
of the index does not match the sort order of the query. When you change
the sort order of the query so that it matches that of the index, then
the index is used.

If you define your index on (logtime, logkey, clientkey, premiseskey)
rather than on (clientkey, premiseskey, logtime, logkey) you will have a
fast query. Yes, the column order matters.

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com


---------------------------(end of broadcast)---------------------------
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #9 (permalink)  
Old 04-19-2008, 10:39 AM
Scott Marlowe
 
Posts: n/a
Default Re: quickly getting the top N rows

On 10/4/07, Ben <bench@silentmedia.com> wrote:
> If I have this:
>
> create table foo (bar int primary key);
>
> ...then in my ideal world, Postgres would be able to use that index on bar
> to help me with this:
>
> select bar from foo order by bar desc limit 20;
>
> But in my experience, PG8.2 is doing a full table scan on foo, then
> sorting it, then doing the limit. I have a more complex primary key, but I
> was hoping the same concept would still apply. Am I doing something wrong,
> or just expecting something that doesn't exist?


pg uses an intelligent planner. It looks at the table, the number of
rows, the distribution of values, and makes a decision whether to use
seq scan or index. Do you have any evidence that in your case seq
scan is a bad choice?

try this experiment:

psql mydb
=# select * from foo; -- this will prime the db and put the table in
memory if it will fit
=# \timing
=# set enable_seqscan=off;
=# select bar from foo order by bar desc limit 20;
=# set enable_seqscan=on;
=# select bar from foo order by bar desc limit 20;

and compare the times each takes. run each way a few times to be sure
you're not getting random variance.

On my reporting db with somewhere around 75 million tables, a similar
query 0.894 mS and uses an index scan. Which is good, because a
sequential scan on that table takes about 15 to 30 minutes.

---------------------------(end of broadcast)---------------------------
TIP 1: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #10 (permalink)  
Old 04-19-2008, 10:39 AM
Ben
 
Posts: n/a
Default Re: quickly getting the top N rows

On Thu, 4 Oct 2007, Simon Riggs wrote:

> In the first query, Postgres cannot use the index because the sort order
> of the index does not match the sort order of the query. When you change
> the sort order of the query so that it matches that of the index, then
> the index is used.
>
> If you define your index on (logtime, logkey, clientkey, premiseskey)
> rather than on (clientkey, premiseskey, logtime, logkey) you will have a
> fast query. Yes, the column order matters.


I thought that might explain it, but then I'm surprised that it can still
use an index when the first two columns of the index aren't in the query.
Wouldn't that mean that it might have to walk the entire index to find
matching rows?

.....unless it's smart enough to realize that the first two columns will
match everything. Which would be cool.

---------------------------(end of broadcast)---------------------------
TIP 1: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
Reply


Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are On
Forum Jump


All times are GMT. The time now is 11:37 AM.


Powered by vBulletin® Version 3.6.5
Copyright ©2000 - 2008, Jelsoft Enterprises Ltd.
SEO by vBSEO 3.2.0
www.UnixAdminTalk.com