Unix Technical Forum

Optimising SELECT on a table with one million rows

This is a discussion on Optimising SELECT on a table with one million rows within the Pgsql General forums, part of the PostgreSQL category; --> Hi, I'm fairly new with Postgresql, so I am not sure if the performance problems I'm having are due ...


Go Back   Unix Technical Forum > Database Server Software > PostgreSQL > Pgsql General

Register FAQ Members List Calendar Search Today's Posts Mark Forums Read
  #1 (permalink)  
Old 04-09-2008, 06:35 PM
Cultural Sublimation
 
Posts: n/a
Default Optimising SELECT on a table with one million rows

Hi,

I'm fairly new with Postgresql, so I am not sure if the performance
problems I'm having are due to poorly constructed queries/indices,
or if I bumped into more fundamental problems requiring a design of
my database structure. That's why I'm requesting your help.

Here's the situation: I have three tables: Users, Stories, and Comments.
Stories have an author (a user), and a comment is associated with a
story and with the user who posted it. The structure of the database
is therefore fairly simple: (there are also some sequences, which I've
omitted for clarity)


CREATE TABLE users
(
user_id int UNIQUE NOT NULL,
user_name text,
PRIMARY KEY (user_id)
);


CREATE TABLE stories
(
story_id int UNIQUE NOT NULL,
story_title text,
story_body text,
story_timestamp timestamptz,
story_author int REFERENCES users (user_id) NOT NULL,
PRIMARY KEY (story_id)
);


CREATE TABLE comments
(
comment_id int UNIQUE NOT NULL,
comment_title text,
comment_body text,
comment_timestamp timestamptz,
comment_story int REFERENCES stories (story_id) NOT NULL,
comment_author int REFERENCES users (user_id) NOT NULL,
PRIMARY KEY (comment_id)
);


I've also populated the database with some test data, comprising 1,000
users, 1,000 stories (one per user), and 1,000,000 comments (one comment
per user per story).

Now, the query I wish to optimise is also simple: get me all comments (the
comment_id suffices) and corresponding user *names* for a given story.
If for example the story_id is 100, the query looks like this:

SELECT comments.comment_id, users.user_name
FROM comments, users
WHERE comments.comment_story = 100 AND comments.comment_author = users.user_id;

The problem is that this query takes a *very* long time. With the said
1,000,000 comments, it needs at least 1100ms on my system. "Explain
analyze" tells me that a sequential scan is being performed on both
users and comments:

Hash Join (cost=28.50..21889.09 rows=988 width=14) (actual
time=3.674..1144.779 rows=1000 loops=1)
Hash Cond: ((comments.comment_author)::integer = (users.user_id)::integer)
-> Seq Scan on comments (cost=0.00..21847.00 rows=988 width=8) (actual
time=0.185..1136.067 rows=1000 loops=1)
Filter: ((comment_story)::integer = 100)
-> Hash (cost=16.00..16.00 rows=1000 width=14) (actual time=3.425..3.425
rows=1000 loops=1)
-> Seq Scan on users (cost=0.00..16.00 rows=1000 width=14) (actual
time=0.068..1.845 rows=1000 loops=1)
Total runtime: 1146.424 ms

On the long run, I guess one possible solution to this problem will be
to partition the comments table into a number of sub-tables, most likely
based on the timestamp attribute (by having current versus historic data).
Nevertheless, I am wondering if there are other more straightforward ways
to optimise this query. Some clever use of indices, perhaps? Or is
the way I am now constructing the select non-optimal? Or do I need
some pixie-magic on the Postgresql settings? Anyway, any suggestions
are welcome! (and thanks in advance)

Regards,
C.S.




__________________________________________________ __________________________________
Yahoo! oneSearch: Finally, mobile search
that gives answers, not web links.
http://mobile.yahoo.com/mobileweb/on...h?refer=1ONXIC

---------------------------(end of broadcast)---------------------------
TIP 4: Have you searched our list archives?

http://archives.postgresql.org/

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #2 (permalink)  
Old 04-09-2008, 06:35 PM
=?iso-8859-1?q?Rodrigo_De_Le=F3n?=
 
Posts: n/a
Default Re: Optimising SELECT on a table with one million rows

On Jul 30, 12:01 pm, cultural_sublimat...@yahoo.com (Cultural
Sublimation) wrote:
> Hash Join (cost=28.50..21889.09 rows=988 width=14) (actual
> time=3.674..1144.779 rows=1000 loops=1)
> Hash Cond: ((comments.comment_author)::integer = (users.user_id)::integer)
> -> Seq Scan on comments (cost=0.00..21847.00 rows=988 width=8) (actual
> time=0.185..1136.067 rows=1000 loops=1)
> Filter: ((comment_story)::integer = 100)
> -> Hash (cost=16.00..16.00 rows=1000 width=14) (actual time=3.425..3.425
> rows=1000 loops=1)
> -> Seq Scan on users (cost=0.00..16.00 rows=1000 width=14) (actual
> time=0.068..1.845 rows=1000 loops=1)
> Total runtime: 1146.424 ms


Create an index on comments.comment_story column.

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #3 (permalink)  
Old 04-09-2008, 06:35 PM
Richard Huxton
 
Posts: n/a
Default Re: Optimising SELECT on a table with one million rows

Cultural Sublimation wrote:
> CREATE TABLE users
> (
> user_id int UNIQUE NOT NULL,
> user_name text,
> PRIMARY KEY (user_id)
> );
>
>
> CREATE TABLE stories
> (
> story_id int UNIQUE NOT NULL,
> story_title text,
> story_body text,
> story_timestamp timestamptz,
> story_author int REFERENCES users (user_id) NOT NULL,
> PRIMARY KEY (story_id)
> );
>
>
> CREATE TABLE comments
> (
> comment_id int UNIQUE NOT NULL,
> comment_title text,
> comment_body text,
> comment_timestamp timestamptz,
> comment_story int REFERENCES stories (story_id) NOT NULL,
> comment_author int REFERENCES users (user_id) NOT NULL,
> PRIMARY KEY (comment_id)
> );
>
>
> I've also populated the database with some test data, comprising 1,000
> users, 1,000 stories (one per user), and 1,000,000 comments (one comment
> per user per story).
>
> Now, the query I wish to optimise is also simple: get me all comments (the
> comment_id suffices) and corresponding user *names* for a given story.
> If for example the story_id is 100, the query looks like this:
>
> SELECT comments.comment_id, users.user_name
> FROM comments, users
> WHERE comments.comment_story = 100 AND comments.comment_author = users.user_id;
>
> The problem is that this query takes a *very* long time. With the said
> 1,000,000 comments, it needs at least 1100ms on my system. "Explain
> analyze" tells me that a sequential scan is being performed on both
> users and comments:


What else is it supposed to do? You haven't created any indexes. I'm
also guessing that you haven't analysed the tables either.

--
Richard Huxton
Archonet Ltd

---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #4 (permalink)  
Old 04-09-2008, 06:35 PM
Bryan Murphy
 
Posts: n/a
Default Re: Optimising SELECT on a table with one million rows

First question... did you create the appropriate indexes on the appropriate
columns for these tables? Foreign keys do not implicitly create indexes in
postgres.

Bryan

On 7/30/07, Cultural Sublimation <cultural_sublimation@yahoo.com> wrote:
>
> Hi,
>
> I'm fairly new with Postgresql, so I am not sure if the performance
> problems I'm having are due to poorly constructed queries/indices,
> or if I bumped into more fundamental problems requiring a design of
> my database structure. That's why I'm requesting your help.
>
> Here's the situation: I have three tables: Users, Stories, and Comments.
> Stories have an author (a user), and a comment is associated with a
> story and with the user who posted it. The structure of the database
> is therefore fairly simple: (there are also some sequences, which I've
> omitted for clarity)
>
>
> CREATE TABLE users
> (
> user_id int UNIQUE NOT NULL,
> user_name text,
> PRIMARY KEY (user_id)
> );
>
>
> CREATE TABLE stories
> (
> story_id int UNIQUE NOT NULL,
> story_title text,
> story_body text,
> story_timestamp timestamptz,
> story_author int REFERENCES users (user_id) NOT NULL,
> PRIMARY KEY (story_id)
> );
>
>
> CREATE TABLE comments
> (
> comment_id int UNIQUE NOT NULL,
> comment_title text,
> comment_body text,
> comment_timestamp timestamptz,
> comment_story int REFERENCES stories (story_id) NOT
> NULL,
> comment_author int REFERENCES users (user_id) NOT NULL,
> PRIMARY KEY (comment_id)
> );
>
>
> I've also populated the database with some test data, comprising 1,000
> users, 1,000 stories (one per user), and 1,000,000 comments (one comment
> per user per story).
>
> Now, the query I wish to optimise is also simple: get me all comments (the
> comment_id suffices) and corresponding user *names* for a given story.
> If for example the story_id is 100, the query looks like this:
>
> SELECT comments.comment_id, users.user_name
> FROM comments, users
> WHERE comments.comment_story = 100 AND comments.comment_author =
> users.user_id;
>
> The problem is that this query takes a *very* long time. With the said
> 1,000,000 comments, it needs at least 1100ms on my system. "Explain
> analyze" tells me that a sequential scan is being performed on both
> users and comments:
>
> Hash Join (cost=28.50..21889.09 rows=988 width=14) (actual
> time=3.674..1144.779 rows=1000 loops=1)
> Hash Cond: ((comments.comment_author)::integer = (users.user_id
> )::integer)
> -> Seq Scan on comments (cost=0.00..21847.00 rows=988 width=8)
> (actual
> time=0.185..1136.067 rows=1000 loops=1)
> Filter: ((comment_story)::integer = 100)
> -> Hash (cost=16.00..16.00 rows=1000 width=14) (actual time=
> 3.425..3.425
> rows=1000 loops=1)
> -> Seq Scan on users (cost=0.00..16.00 rows=1000 width=14)
> (actual
> time=0.068..1.845 rows=1000 loops=1)
> Total runtime: 1146.424 ms
>
> On the long run, I guess one possible solution to this problem will be
> to partition the comments table into a number of sub-tables, most likely
> based on the timestamp attribute (by having current versus historic data).
> Nevertheless, I am wondering if there are other more straightforward ways
> to optimise this query. Some clever use of indices, perhaps? Or is
> the way I am now constructing the select non-optimal? Or do I need
> some pixie-magic on the Postgresql settings? Anyway, any suggestions
> are welcome! (and thanks in advance)
>
> Regards,
> C.S.
>
>
>
>
>
> __________________________________________________ __________________________________
> Yahoo! oneSearch: Finally, mobile search
> that gives answers, not web links.
> http://mobile.yahoo.com/mobileweb/on...h?refer=1ONXIC
>
> ---------------------------(end of broadcast)---------------------------
> TIP 4: Have you searched our list archives?
>
> http://archives.postgresql.org/
>


Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #5 (permalink)  
Old 04-09-2008, 06:35 PM
=?ISO-8859-1?Q?Nis_J=F8rgensen?=
 
Posts: n/a
Default Re: Optimising SELECT on a table with one million rows

Cultural Sublimation skrev:
> Hi,
>
> I'm fairly new with Postgresql, so I am not sure if the performance
> problems I'm having are due to poorly constructed queries/indices,
> or if I bumped into more fundamental problems requiring a design of
> my database structure. That's why I'm requesting your help.
>
> Here's the situation: I have three tables: Users, Stories, and Comments.
> Stories have an author (a user), and a comment is associated with a
> story and with the user who posted it. The structure of the database
> is therefore fairly simple: (there are also some sequences, which I've
> omitted for clarity)
>
>
> CREATE TABLE users
> (
> user_id int UNIQUE NOT NULL,
> user_name text,
> PRIMARY KEY (user_id)
> );
>
>
> CREATE TABLE stories
> (
> story_id int UNIQUE NOT NULL,
> story_title text,
> story_body text,
> story_timestamp timestamptz,
> story_author int REFERENCES users (user_id) NOT NULL,
> PRIMARY KEY (story_id)
> );
>
>
> CREATE TABLE comments
> (
> comment_id int UNIQUE NOT NULL,
> comment_title text,
> comment_body text,
> comment_timestamp timestamptz,
> comment_story int REFERENCES stories (story_id) NOT NULL,
> comment_author int REFERENCES users (user_id) NOT NULL,
> PRIMARY KEY (comment_id)
> );


You need indices on comment.comment_story (and probably later for
comment_author). You should ALWAYS add an index on a FOREIGN KEY column
unless you have a very good reason not to. So:

CREATE INDEX comments_story_idx ON comments(comment_story);
CREATE INDEX comments_author_idx ON comments(comment_author);
CREATE INDEX story_author_idx ON story(story_author);

Thge first of these should remove the need for a seqscan on comments for
your query. The seqscan on users is not a problem - you are returning
data from all the rows, so a seqscan is the smart thing to do.

Nis


---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #6 (permalink)  
Old 04-09-2008, 06:35 PM
Jon Sime
 
Posts: n/a
Default Re: Optimising SELECT on a table with one million rows

Cultural Sublimation wrote:
> SELECT comments.comment_id, users.user_name
> FROM comments, users
> WHERE comments.comment_story = 100 AND comments.comment_author = users.user_id;
>
> The problem is that this query takes a *very* long time. With the said
> 1,000,000 comments, it needs at least 1100ms on my system. "Explain
> analyze" tells me that a sequential scan is being performed on both
> users and comments:
>
> Hash Join (cost=28.50..21889.09 rows=988 width=14) (actual
> time=3.674..1144.779 rows=1000 loops=1)
> Hash Cond: ((comments.comment_author)::integer = (users.user_id)::integer)
> -> Seq Scan on comments (cost=0.00..21847.00 rows=988 width=8) (actual
> time=0.185..1136.067 rows=1000 loops=1)
> Filter: ((comment_story)::integer = 100)
> -> Hash (cost=16.00..16.00 rows=1000 width=14) (actual time=3.425..3.425
> rows=1000 loops=1)
> -> Seq Scan on users (cost=0.00..16.00 rows=1000 width=14) (actual
> time=0.068..1.845 rows=1000 loops=1)
> Total runtime: 1146.424 ms


If you have no index on comments.comment_author, then a seqscan will be
required for your join between comments and users. Similarly, if you
have no index on comments.comment_story, then any query against comments
that uses that column as part of a predicate will require a seqscan of
the comments table.

Note that an FK constraint does not automatically create an index on the
underlying column. You need to create the actual index yourself if it
will be necessary for your queries.

> On the long run, I guess one possible solution to this problem will be
> to partition the comments table into a number of sub-tables, most likely
> based on the timestamp attribute (by having current versus historic data).


Partitioning on comments.comment_timestamp won't help you at all for
this particular query, since you don't have a condition in your query
dependent upon that value. It might help you for other queries (such as
gathering up all the comments posted on a particular day, or during some
other time range), but it won't make any positive difference for this query.

-Jon

--
Senior Systems Developer
Media Matters for America
http://mediamatters.org/

---------------------------(end of broadcast)---------------------------
TIP 5: don't forget to increase your free space map settings

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #7 (permalink)  
Old 04-09-2008, 06:35 PM
Cultural Sublimation
 
Posts: n/a
Default Re: Optimising SELECT on a table with one million rows

Hi,

> If you have no index on comments.comment_author, then a seqscan will be
> required for your join between comments and users. Similarly, if you
> have no index on comments.comment_story, then any query against comments
> that uses that column as part of a predicate will require a seqscan of
> the comments table.


I see. As I said, I'm still fairly new to this...


> Note that an FK constraint does not automatically create an index on the
> underlying column. You need to create the actual index yourself if it
> will be necessary for your queries.


I see what you mean. The basic idea then is to take a look at the typical
queries and to create indices based on them. Is there a good guide on index
creation for optimisation purposes?


> Partitioning on comments.comment_timestamp won't help you at all for
> this particular query, since you don't have a condition in your query
> dependent upon that value. It might help you for other queries (such as
> gathering up all the comments posted on a particular day, or during some
> other time range), but it won't make any positive difference for this query.


You are right. Come to think of it, partitioning the comments table based
on comment_story might make more sense, since the overwhelming majority of
queries will be like the one I just mentioned: asking for all comments of
a given story.

Anyway, thanks a lot for your help! (And that goes for all the other people
who also given their 2 cents)

Best regards,
C.S.




__________________________________________________ __________________________________Ready for the edge of your seat?
Check out tonight's top picks on Yahoo! TV.
http://tv.yahoo.com/

---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
choose an index scan if your joining column's datatypes do not
match

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
Reply


Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are On
Forum Jump


All times are GMT. The time now is 05:30 AM.


Powered by vBulletin® Version 3.6.5
Copyright ©2000 - 2008, Jelsoft Enterprises Ltd.
SEO by vBSEO 3.2.0
www.UnixAdminTalk.com