Unix Technical Forum

Optimising a query requiring seqscans=0

This is a discussion on Optimising a query requiring seqscans=0 within the Pgsql General forums, part of the PostgreSQL category; --> Hi, We recently upgraded our trac backend from sqlite to postgres, and I decided to have a little fun ...


Go Back   Unix Technical Forum > Database Server Software > PostgreSQL > Pgsql General

FAQ Members List Calendar Search Today's Posts Mark Forums Read
  #1 (permalink)  
Old 04-09-2008, 12:24 PM
Russ Brown
 
Posts: n/a
Default Optimising a query requiring seqscans=0

Hi,

We recently upgraded our trac backend from sqlite to postgres, and I
decided to have a little fun and write some reports that delve into
trac's subversion cache, and got stuck with a query optimisation
problem.

Table revision contains 2800+ rows
Table node_change contains 370000+.

rev is a 'TEXT' field on both containing numerical data (not my choice,
trac does it like this to support VCS backends with non-numerical
revision identifiers).

I've got stuck with this query:

SELECT author, COUNT(DISTINCT r.rev)
FROM revision AS r
LEFT JOIN node_change AS nc
ON r.rev=nc.rev
WHERE r.time >= EXTRACT(epoch FROM (NOW() - interval '30
days'))::integer
GROUP BY r.author;

Currently it shows the number of commits per author in the last 30 days.
The join to node_change is superfluous for this purpose but was added
because I intended to add new columns which draw on this table, such as
the number of files added, deleted etc.

I never got that far however due to the planner problem:

GroupAggregate (cost=56755.41..56852.28 rows=2 width=17) (actual
time=4836.433..4897.458 rows=25 loops=1)
-> Sort (cost=56755.41..56787.69 rows=12913 width=17) (actual
time=4836.233..4851.968 rows=22419 loops=1)
Sort Key: r.author
-> Merge Left Join (cost=53886.10..55873.68 rows=12913
width=17) (actual time=4600.733..4641.749 rows=22419 loops=1)
Merge Cond: ("outer".rev = "inner".rev)
-> Sort (cost=93.78..96.24 rows=982 width=17) (actual
time=7.050..7.383 rows=1088 loops=1)
Sort Key: r.rev
-> Index Scan using revision_time_idx on revision
r (cost=0.01..44.98 rows=982 width=17) (actual time=0.191..4.014
rows=1088 loops=1)
Index Cond: ("time" >=
(date_part('epoch'::text, (now() - '30 days'::interval)))::integer)
-> Sort (cost=53792.32..54719.09 rows=370707 width=8)
(actual time=4203.665..4443.748 rows=346238 loops=1)
Sort Key: nc.rev
-> Seq Scan on node_change nc
(cost=0.00..12852.07 rows=370707 width=8) (actual time=0.054..663.719
rows=370707 loops=1)
Total runtime: 4911.430 ms

If I disable sequential scans I get the following explain:

GroupAggregate (cost=221145.13..221242.01 rows=2 width=17) (actual
time=286.348..348.268 rows=25 loops=1)
-> Sort (cost=221145.13..221177.42 rows=12913 width=17) (actual
time=286.183..302.239 rows=22419 loops=1)
Sort Key: r.author
-> Nested Loop Left Join (cost=0.01..220263.40 rows=12913
width=17) (actual time=0.339..86.626 rows=22419 loops=1)
-> Index Scan using revision_time_idx on revision r
(cost=0.01..44.98 rows=982 width=17) (actual time=0.194..4.056 rows=1088
loops=1)
Index Cond: ("time" >= (date_part('epoch'::text,
(now() - '30 days'::interval)))::integer)
-> Index Scan using node_change_rev_idx on node_change
nc (cost=0.00..223.18 rows=86 width=8) (actual time=0.009..0.058
rows=21 loops=1088)
Index Cond: ("outer".rev = nc.rev)
Total runtime: 350.103 ms

Statistics are set to 20, and I have ANALYZEd both tables.

The report itself isn't important, but I'm using this as an exercise in
PostgreSQL query optimisation and planner tuning, so any help/hints
would be appreciated.

Thanks.

--

Russ


---------------------------(end of broadcast)---------------------------
TIP 4: Have you searched our list archives?

http://archives.postgresql.org

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #2 (permalink)  
Old 04-09-2008, 12:31 PM
Jim Nasby
 
Posts: n/a
Default Re: Optimising a query requiring seqscans=0

On Sep 14, 2006, at 11:15 AM, Russ Brown wrote:
> We recently upgraded our trac backend from sqlite to postgres, and I
> decided to have a little fun and write some reports that delve into
> trac's subversion cache, and got stuck with a query optimisation
> problem.
>
> Table revision contains 2800+ rows
> Table node_change contains 370000+.

<...>
> I've got stuck with this query:
>
> SELECT author, COUNT(DISTINCT r.rev)
> FROM revision AS r
> LEFT JOIN node_change AS nc
> ON r.rev=nc.rev
> WHERE r.time >= EXTRACT(epoch FROM (NOW() - interval '30
> days'))::integer


Man I really hate when people store time_t in a database...

> GROUP BY r.author;
>
> Statistics are set to 20, and I have ANALYZEd both tables.
>
> The report itself isn't important, but I'm using this as an
> exercise in
> PostgreSQL query optimisation and planner tuning, so any help/hints
> would be appreciated.


Setting statistics higher (100-200), at least for the large table
will likely help. Also make sure that you've set effective_cache_size
correctly (I generally set it to total memory - 1G, assuming the
server has at least 4G in it).
--
Jim Nasby jimn@enterprisedb.com
EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)



---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
choose an index scan if your joining column's datatypes do not
match

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #3 (permalink)  
Old 04-09-2008, 12:32 PM
Russ Brown
 
Posts: n/a
Default Re: Optimising a query requiring seqscans=0

On Thu, 2006-09-21 at 23:39 -0400, Jim Nasby wrote:
> On Sep 14, 2006, at 11:15 AM, Russ Brown wrote:
> > We recently upgraded our trac backend from sqlite to postgres, and I
> > decided to have a little fun and write some reports that delve into
> > trac's subversion cache, and got stuck with a query optimisation
> > problem.
> >
> > Table revision contains 2800+ rows
> > Table node_change contains 370000+.

> <...>
> > I've got stuck with this query:
> >
> > SELECT author, COUNT(DISTINCT r.rev)
> > FROM revision AS r
> > LEFT JOIN node_change AS nc
> > ON r.rev=nc.rev
> > WHERE r.time >= EXTRACT(epoch FROM (NOW() - interval '30
> > days'))::integer

>
> Man I really hate when people store time_t in a database...
>


I know. Probably something to do with database engine independence. I
don't know if sqlite even has a date type (probably does, but I haven't
checked).

> > GROUP BY r.author;
> >
> > Statistics are set to 20, and I have ANALYZEd both tables.
> >
> > The report itself isn't important, but I'm using this as an
> > exercise in
> > PostgreSQL query optimisation and planner tuning, so any help/hints
> > would be appreciated.

>
> Setting statistics higher (100-200), at least for the large table
> will likely help. Also make sure that you've set effective_cache_size
> correctly (I generally set it to total memory - 1G, assuming the
> server has at least 4G in it).


Thank you: the problem was the effective_cache_size (which I hadn't
changed from the default of 1000). This machine doesn't have loads of
RAM, but I knocked it up to 65536 and now the query uses the index,
without having to change the statistics.

Thanks a lot!

> --
> Jim Nasby jimn@enterprisedb.com
> EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
>
>



---------------------------(end of broadcast)---------------------------
TIP 2: Don't 'kill -9' the postmaster

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
Reply


Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are On
Forum Jump


All times are GMT. The time now is 09:11 AM.


Powered by vBulletin® Version 3.6.5
Copyright ©2000 - 2008, Jelsoft Enterprises Ltd.
SEO by vBSEO 3.2.0
www.UnixAdminTalk.com