Unix Technical Forum

SEO

vBulletin Search Engine Optimization


Go Back   Unix Technical Forum > Database Server Software > PostgreSQL > Pgsql Performance

Register FAQ Members List Calendar Search Today's Posts Mark Forums Read
  #1 (permalink)  
Old 04-19-2008, 10:46 AM
John Beaver
 
Posts: n/a
Default large tables and simple "= constant" queries using indexes

Hi, I've started my first project with Postgres (after several years of
using Mysql), and I'm having an odd performance problem that I was
hoping someone might be able to explain the cause of.

----My query----
- select count(*) from gene_prediction_view where gene_ref = 523
- takes 26 seconds to execute, and returns 2400 (out of a total of
15 million records in the table)

---My problem---
Using a single-column index to count 2400 records which are exactly
one constant value doesn't sound like something that would take 26
seconds. What's the slowdown? Any silver bullets that might fix this?

----Steps I've taken----
- I ran vacuum and analyze
- I upped the shared_buffers to 58384, and I upped some of the other
postgresql.conf values as well. Nothing seemed to help significantly,
but maybe I missed something that would help specifically for this query
type?
- I tried to create a hash index, but gave up after more than 4
hours of waiting for it to finish indexing

----Table stats----
- 15 million rows; I'm expecting to have four or five times this
number eventually.
- 1.5 gigs of hard drive usage

----My development environment---
- 2.6ghz dual-core MacBook Pro with 4 gigs of ram and a 7200 rpm
hard drive
- OS X 10.5.2
- Postgres 8.3 (installed via MacPorts)

----My table----

CREATE TABLE gene_prediction_view
(
id serial NOT NULL,
gene_ref integer NOT NULL,
go_id integer NOT NULL,
go_description character varying(200) NOT NULL,
go_category character varying(50) NOT NULL,
function_verified_exactly boolean NOT NULL,
function_verified_with_parent_go boolean NOT NULL,
function_verified_with_child_go boolean NOT NULL,
score numeric(10,2) NOT NULL,
precision_score numeric(10,2) NOT NULL,
CONSTRAINT gene_prediction_view_pkey PRIMARY KEY (id),
CONSTRAINT gene_prediction_view_gene_ref_fkey FOREIGN KEY (gene_ref)
REFERENCES sgd_annotations (id) MATCH SIMPLE
ON UPDATE NO ACTION ON DELETE NO ACTION,
CONSTRAINT gene_prediction_view_go_id_fkey FOREIGN KEY (go_id)
REFERENCES go_terms (term) MATCH SIMPLE
ON UPDATE NO ACTION ON DELETE NO ACTION,
CONSTRAINT gene_prediction_view_gene_ref_key UNIQUE (gene_ref, go_id)
)
WITH (OIDS=FALSE);
ALTER TABLE gene_prediction_view OWNER TO postgres;

CREATE INDEX ix_gene_prediction_view_gene_ref
ON gene_prediction_view
USING btree
(gene_ref);




--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #2 (permalink)  
Old 04-19-2008, 10:46 AM
Arjen van der Meijden
 
Posts: n/a
Default Re: large tables and simple "= constant" queries usingindexes

First of all, there is the 'explain analyze' output, which is pretty
helpful in postgresql.

My guess is, postgresql decides to do a table scan for some reason. It
might not have enough statistics for this particular table or column, to
make a sound decision. What you can try is to increase the statistics
target, which works pretty easy:
ALTER TABLE gene_prediction_view ALTER gene_ref SET STATISTICS 200;

Valid ranges are from 1(0?) - 1000, the default is 10, the default on my
systems is usually 100. For such a large table, I'd go with 200.

After that, you'll need to re-analyze your table and you can try again.

Perhaps analyze should try to establish its own best guess to how many
samples it should take? The default of 10 is rather limited for large
tables.

Best regards,

Arjen

On 9-4-2008 22:58 John Beaver wrote:
> Hi, I've started my first project with Postgres (after several years of
> using Mysql), and I'm having an odd performance problem that I was
> hoping someone might be able to explain the cause of.
>
> ----My query----
> - select count(*) from gene_prediction_view where gene_ref = 523
> - takes 26 seconds to execute, and returns 2400 (out of a total of 15
> million records in the table)
>
> ---My problem---
> Using a single-column index to count 2400 records which are exactly
> one constant value doesn't sound like something that would take 26
> seconds. What's the slowdown? Any silver bullets that might fix this?
>
> ----Steps I've taken----
> - I ran vacuum and analyze
> - I upped the shared_buffers to 58384, and I upped some of the other
> postgresql.conf values as well. Nothing seemed to help significantly,
> but maybe I missed something that would help specifically for this query
> type?
> - I tried to create a hash index, but gave up after more than 4 hours
> of waiting for it to finish indexing
>
> ----Table stats----
> - 15 million rows; I'm expecting to have four or five times this
> number eventually.
> - 1.5 gigs of hard drive usage
>
> ----My development environment---
> - 2.6ghz dual-core MacBook Pro with 4 gigs of ram and a 7200 rpm hard
> drive
> - OS X 10.5.2
> - Postgres 8.3 (installed via MacPorts)
>
> ----My table----
>
> CREATE TABLE gene_prediction_view
> (
> id serial NOT NULL,
> gene_ref integer NOT NULL,
> go_id integer NOT NULL,
> go_description character varying(200) NOT NULL,
> go_category character varying(50) NOT NULL,
> function_verified_exactly boolean NOT NULL,
> function_verified_with_parent_go boolean NOT NULL,
> function_verified_with_child_go boolean NOT NULL,
> score numeric(10,2) NOT NULL,
> precision_score numeric(10,2) NOT NULL,
> CONSTRAINT gene_prediction_view_pkey PRIMARY KEY (id),
> CONSTRAINT gene_prediction_view_gene_ref_fkey FOREIGN KEY (gene_ref)
> REFERENCES sgd_annotations (id) MATCH SIMPLE
> ON UPDATE NO ACTION ON DELETE NO ACTION,
> CONSTRAINT gene_prediction_view_go_id_fkey FOREIGN KEY (go_id)
> REFERENCES go_terms (term) MATCH SIMPLE
> ON UPDATE NO ACTION ON DELETE NO ACTION,
> CONSTRAINT gene_prediction_view_gene_ref_key UNIQUE (gene_ref, go_id)
> )
> WITH (OIDS=FALSE);
> ALTER TABLE gene_prediction_view OWNER TO postgres;
>
> CREATE INDEX ix_gene_prediction_view_gene_ref
> ON gene_prediction_view
> USING btree
> (gene_ref);
>
>
>
>


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #3 (permalink)  
Old 04-19-2008, 10:46 AM
Bill Moran
 
Posts: n/a
Default Re: large tables and simple "= constant" queries usingindexes


This is a FAQ, it comes up on an almost weekly basis. Please do a
little Googling on count(*) and PostgreSQL and you'll get all the
explanations and suggestions on how to fix the problem you could
ever want.

In response to Arjen van der Meijden <acmmailing@tweakers.net>:

> First of all, there is the 'explain analyze' output, which is pretty
> helpful in postgresql.
>
> My guess is, postgresql decides to do a table scan for some reason. It
> might not have enough statistics for this particular table or column, to
> make a sound decision. What you can try is to increase the statistics
> target, which works pretty easy:
> ALTER TABLE gene_prediction_view ALTER gene_ref SET STATISTICS 200;
>
> Valid ranges are from 1(0?) - 1000, the default is 10, the default on my
> systems is usually 100. For such a large table, I'd go with 200.
>
> After that, you'll need to re-analyze your table and you can try again.
>
> Perhaps analyze should try to establish its own best guess to how many
> samples it should take? The default of 10 is rather limited for large
> tables.
>
> Best regards,
>
> Arjen
>
> On 9-4-2008 22:58 John Beaver wrote:
> > Hi, I've started my first project with Postgres (after several years of
> > using Mysql), and I'm having an odd performance problem that I was
> > hoping someone might be able to explain the cause of.
> >
> > ----My query----
> > - select count(*) from gene_prediction_view where gene_ref = 523
> > - takes 26 seconds to execute, and returns 2400 (out of a total of 15
> > million records in the table)
> >
> > ---My problem---
> > Using a single-column index to count 2400 records which are exactly
> > one constant value doesn't sound like something that would take 26
> > seconds. What's the slowdown? Any silver bullets that might fix this?
> >
> > ----Steps I've taken----
> > - I ran vacuum and analyze
> > - I upped the shared_buffers to 58384, and I upped some of the other
> > postgresql.conf values as well. Nothing seemed to help significantly,
> > but maybe I missed something that would help specifically for this query
> > type?
> > - I tried to create a hash index, but gave up after more than 4 hours
> > of waiting for it to finish indexing
> >
> > ----Table stats----
> > - 15 million rows; I'm expecting to have four or five times this
> > number eventually.
> > - 1.5 gigs of hard drive usage
> >
> > ----My development environment---
> > - 2.6ghz dual-core MacBook Pro with 4 gigs of ram and a 7200 rpm hard
> > drive
> > - OS X 10.5.2
> > - Postgres 8.3 (installed via MacPorts)
> >
> > ----My table----
> >
> > CREATE TABLE gene_prediction_view
> > (
> > id serial NOT NULL,
> > gene_ref integer NOT NULL,
> > go_id integer NOT NULL,
> > go_description character varying(200) NOT NULL,
> > go_category character varying(50) NOT NULL,
> > function_verified_exactly boolean NOT NULL,
> > function_verified_with_parent_go boolean NOT NULL,
> > function_verified_with_child_go boolean NOT NULL,
> > score numeric(10,2) NOT NULL,
> > precision_score numeric(10,2) NOT NULL,
> > CONSTRAINT gene_prediction_view_pkey PRIMARY KEY (id),
> > CONSTRAINT gene_prediction_view_gene_ref_fkey FOREIGN KEY (gene_ref)
> > REFERENCES sgd_annotations (id) MATCH SIMPLE
> > ON UPDATE NO ACTION ON DELETE NO ACTION,
> > CONSTRAINT gene_prediction_view_go_id_fkey FOREIGN KEY (go_id)
> > REFERENCES go_terms (term) MATCH SIMPLE
> > ON UPDATE NO ACTION ON DELETE NO ACTION,
> > CONSTRAINT gene_prediction_view_gene_ref_key UNIQUE (gene_ref, go_id)
> > )
> > WITH (OIDS=FALSE);
> > ALTER TABLE gene_prediction_view OWNER TO postgres;
> >
> > CREATE INDEX ix_gene_prediction_view_gene_ref
> > ON gene_prediction_view
> > USING btree
> > (gene_ref);
> >
> >
> >
> >

>
> --
> Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-performance



--
Bill Moran
Collaborative Fusion Inc.
http://people.collaborativefusion.com/~wmoran/

wmoran@collaborativefusion.com
Phone: 412-422-3463x4023

************************************************** **************
IMPORTANT: This message contains confidential information and is
intended only for the individual named. If the reader of this
message is not an intended recipient (or the individual
responsible for the delivery of this message to an intended
recipient), please be advised that any re-use, dissemination,
distribution or copying of this message is prohibited. Please
notify the sender immediately by e-mail if you have received
this e-mail by mistake and delete this e-mail from your system.
E-mail transmission cannot be guaranteed to be secure or
error-free as information could be intercepted, corrupted, lost,
destroyed, arrive late or incomplete, or contain viruses. The
sender therefore does not accept liability for any errors or
omissions in the contents of this message, which arise as a
result of e-mail transmission.
************************************************** **************

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #4 (permalink)  
Old 04-19-2008, 10:46 AM
PFC
 
Posts: n/a
Default Re: large tables and simple "= constant" queries using indexes


> Hi, I've started my first project with Postgres (after several years of
> using Mysql), and I'm having an odd performance problem that I was
> hoping someone might be able to explain the cause of.
>
> ----My query----
> - select count(*) from gene_prediction_view where gene_ref = 523
> - takes 26 seconds to execute, and returns 2400 (out of a total of
> 15 million records in the table)
> ---My problem---
> Using a single-column index to count 2400 records which are exactly
> one constant value doesn't sound like something that would take 26
> seconds. What's the slowdown? Any silver bullets that might fix this?


* Please post an EXPLAIN ANALYZE of your query which will allow to choose
between these two options :
- If Postgres uses a bad plan (like a seq scan), you need to up the
statistics for this column
- If you get the correct plan (index scan or bitmap index scan) then it
is likely that postgres does one disk seek per row that has to be counted.
26 seconds for 2400 rows would be consistent with a 10ms seek time. The
unmistakable sign is that re-running the query will result in a very fast
runtime (I'd say a couple ms for counting 2400 rows if no disk IO is
involved).


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #5 (permalink)  
Old 04-19-2008, 10:46 AM
Jeremy Harris
 
Posts: n/a
Default Re: large tables and simple "= constant" queries usingindexes

Bill Moran wrote:
> This is a FAQ, it comes up on an almost weekly basis.


I don't think so. "where".

>>> - select count(*) from gene_prediction_view where gene_ref = 523


Cheers,
Jeremy


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #6 (permalink)  
Old 04-19-2008, 10:46 AM
Arjen van der Meijden
 
Posts: n/a
Default Re: large tables and simple "= constant" queries usingindexes

Hi John,

You don't use the same 'gene_ref ='-value, so its not a perfect
comparison. And obviously, there is the fact that the data can be in the
disk cache, the second time you run it, which would explain the almost
instantaneous result for the second query.

If repeating the query a few times with 200 still makes it do its work
in 15 seconds and with 800 in less than 100ms, than you might have found
a bug, or it is at least something I don't know how to fix.

I doubt upping the default for all tables to 1000 is a good idea. The
data collected is used in the query-planning-stage, where more data
means more processing time. Obviously there is a tradeoff somewhere
between having more statistics and thus being able to plan the query
better versus requiring more time to process those statistics.

Best regards,

Arjen

On 10-4-2008 0:24 John Beaver wrote:
> Perfect - thanks Arjen. Using your value of 200 decreased the time to 15
> seconds, and using a value of 800 makes it almost instantaneous. I'm
> really not concerned about space usage; if having more statistics
> increases performance this much, maybe I'll just default it to 1000?
>
> Strangely, the steps taken in the explain analyze are all the same. The
> only differences are the predicted costs (and execution times).
>
> explain analyze for a statistics of 200:
> "Aggregate (cost=8831.27..8831.28 rows=1 width=0) (actual
> time=15198.407..15198.408 rows=1 loops=1)"
> " -> Bitmap Heap Scan on gene_prediction_view (cost=44.16..8825.29
> rows=2392 width=0) (actual time=19.719..15191.875 rows=2455 loops=1)"
> " Recheck Cond: (gene_ref = 500)"
> " -> Bitmap Index Scan on ix_gene_prediction_view_gene_ref
> (cost=0.00..43.56 rows=2392 width=0) (actual time=18.871..18.871
> rows=2455 loops=1)"
> " Index Cond: (gene_ref = 500)"
> "Total runtime: 15198.651 ms"
>
> explain analyze for a statistics of 800:
> "Aggregate (cost=8873.75..8873.76 rows=1 width=0) (actual
> time=94.473..94.473 rows=1 loops=1)"
> " -> Bitmap Heap Scan on gene_prediction_view (cost=44.25..8867.74
> rows=2404 width=0) (actual time=39.358..93.733 rows=2455 loops=1)"
> " Recheck Cond: (gene_ref = 301)"
> " -> Bitmap Index Scan on ix_gene_prediction_view_gene_ref
> (cost=0.00..43.65 rows=2404 width=0) (actual time=38.472..38.472
> rows=2455 loops=1)"
> " Index Cond: (gene_ref = 301)"
> "Total runtime: 94.622 ms"
>
>
>
>
> Arjen van der Meijden wrote:
>> First of all, there is the 'explain analyze' output, which is pretty
>> helpful in postgresql.
>>
>> My guess is, postgresql decides to do a table scan for some reason. It
>> might not have enough statistics for this particular table or column,
>> to make a sound decision. What you can try is to increase the
>> statistics target, which works pretty easy:
>> ALTER TABLE gene_prediction_view ALTER gene_ref SET STATISTICS 200;
>>
>> Valid ranges are from 1(0?) - 1000, the default is 10, the default on
>> my systems is usually 100. For such a large table, I'd go with 200.
>>
>> After that, you'll need to re-analyze your table and you can try again.
>>
>> Perhaps analyze should try to establish its own best guess to how many
>> samples it should take? The default of 10 is rather limited for large
>> tables.
>>
>> Best regards,
>>
>> Arjen
>>
>> On 9-4-2008 22:58 John Beaver wrote:
>>> Hi, I've started my first project with Postgres (after several years
>>> of using Mysql), and I'm having an odd performance problem that I was
>>> hoping someone might be able to explain the cause of.
>>>
>>> ----My query----
>>> - select count(*) from gene_prediction_view where gene_ref = 523
>>> - takes 26 seconds to execute, and returns 2400 (out of a total of
>>> 15 million records in the table)
>>>
>>> ---My problem---
>>> Using a single-column index to count 2400 records which are
>>> exactly one constant value doesn't sound like something that would
>>> take 26 seconds. What's the slowdown? Any silver bullets that might
>>> fix this?
>>>
>>> ----Steps I've taken----
>>> - I ran vacuum and analyze
>>> - I upped the shared_buffers to 58384, and I upped some of the
>>> other postgresql.conf values as well. Nothing seemed to help
>>> significantly, but maybe I missed something that would help
>>> specifically for this query type?
>>> - I tried to create a hash index, but gave up after more than 4
>>> hours of waiting for it to finish indexing
>>>
>>> ----Table stats----
>>> - 15 million rows; I'm expecting to have four or five times this
>>> number eventually.
>>> - 1.5 gigs of hard drive usage
>>>
>>> ----My development environment---
>>> - 2.6ghz dual-core MacBook Pro with 4 gigs of ram and a 7200 rpm
>>> hard drive
>>> - OS X 10.5.2
>>> - Postgres 8.3 (installed via MacPorts)
>>>
>>> ----My table----
>>>
>>> CREATE TABLE gene_prediction_view
>>> (
>>> id serial NOT NULL,
>>> gene_ref integer NOT NULL,
>>> go_id integer NOT NULL,
>>> go_description character varying(200) NOT NULL,
>>> go_category character varying(50) NOT NULL,
>>> function_verified_exactly boolean NOT NULL,
>>> function_verified_with_parent_go boolean NOT NULL,
>>> function_verified_with_child_go boolean NOT NULL,
>>> score numeric(10,2) NOT NULL,
>>> precision_score numeric(10,2) NOT NULL,
>>> CONSTRAINT gene_prediction_view_pkey PRIMARY KEY (id),
>>> CONSTRAINT gene_prediction_view_gene_ref_fkey FOREIGN KEY (gene_ref)
>>> REFERENCES sgd_annotations (id) MATCH SIMPLE
>>> ON UPDATE NO ACTION ON DELETE NO ACTION,
>>> CONSTRAINT gene_prediction_view_go_id_fkey FOREIGN KEY (go_id)
>>> REFERENCES go_terms (term) MATCH SIMPLE
>>> ON UPDATE NO ACTION ON DELETE NO ACTION,
>>> CONSTRAINT gene_prediction_view_gene_ref_key UNIQUE (gene_ref, go_id)
>>> )
>>> WITH (OIDS=FALSE);
>>> ALTER TABLE gene_prediction_view OWNER TO postgres;
>>>
>>> CREATE INDEX ix_gene_prediction_view_gene_ref
>>> ON gene_prediction_view
>>> USING btree
>>> (gene_ref);
>>>
>>>
>>>
>>>

>>

>


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #7 (permalink)  
Old 04-19-2008, 10:46 AM
PFC
 
Posts: n/a
Default Re: large tables and simple "= constant" queries using indexes


>> Perfect - thanks Arjen. Using your value of 200 decreased the time to
>> 15 seconds, and using a value of 800 makes it almost instantaneous. I'm
>> really not concerned about space usage; if having more statistics
>> increases performance this much, maybe I'll just default it to 1000?
>> Strangely, the steps taken in the explain analyze are all the same.
>> The only differences are the predicted costs (and execution times).
>> explain analyze for a statistics of 200:



Actually, since you got the exact same plans and the second one is a lot
faster, this can mean that the data is in the disk cache, or that the
second query has all the rows it needs contiguous on disk whereas the
first one has its rows all over the place. Therefore you are IO-bound.
Statistics helped, perhaps (impossible to know since you don't provide the
plan wit statistics set to 10), but your main problem is IO.
Usually setting the statistics to 100 is enough...

Now, here are some solutions to your problem in random order :

- Install 64 bit Linux, 64 bit Postgres, and get lots of RAM, lol.
- Switch to a RAID10 (4 times the IOs per second, however zero gain if
you're single-threaded, but massive gain when concurrent)

- If you just need a count by gene_ref, a simple solution is to keep it
in a separate table and update it via triggers, this is a frequently used
solution, it works well unless gene_ref is updated all the time (which is
probably not your case). Since you will be vacuuming this count-cache
table often, don't put the count as a field in your sgd_annotations table,
just create a small table with 2 fields, gene_ref and count (unless you
want to use the count for other things and you don't like the join).

From your table definition gene_ref references another table. It would
seem that you have many rows in gene_prediction_view with the same
gene_ref value.

- If you often query rows with the same gene_ref, consider using CLUSTER
to physically group those rows on disk. This way you can get all rows with
the same gene_ref in 1 seek instead of 2000. Clustered tables also make
Bitmap scan happy.
This one is good since it can also speed up other queries (not just the
count).
You could also cluster on (gene_ref,go_id) perhaps, I don't know what
your columns mean. Only you can decide that because clustering order has
to be meaningful (to group rows according to something that makes sense
and not at random).

* Lose some weight :

CREATE INDEX ix_gene_prediction_view_gene_ref
ON gene_prediction_view
USING btree
(gene_ref);

- This index is useless since you have an UNIQUE on (gene_ref, go_id)
which is also an index.
Remove the index on (gene_ref), it will leave space in the disk cache for
other things.

- Since (gene_ref, go_id) is UNIQUE NOT NULL, you might be able to use
that as your primary key, but only if it is never updated of course. Saves
another index.

- If you often do queries that fetch many rows, but seldom fetch the
description, tell PG to always store the description in offline compressed
form (read the docs on ALTER TABLE ... SET STORAGE ..., I forgot the
syntax). Point being to make the main table smaller.

- Also I see a category as VARCHAR. If you have a million different
categories, that's OK, but if you have 100 categories for your 15M rows,
put them in a separate table and replace that by a category_id (normalize
!)


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #8 (permalink)  
Old 04-19-2008, 10:46 AM
Matthew
 
Posts: n/a
Default Re: large tables and simple "= constant" queries usingindexes

On Thu, 10 Apr 2008, PFC wrote:

.... Lots of useful advice ...

> - If you often query rows with the same gene_ref, consider using
> CLUSTER to physically group those rows on disk. This way you can get all rows
> with the same gene_ref in 1 seek instead of 2000. Clustered tables also make
> Bitmap scan happy.


In my opinion this is the one that will make the most difference. You will
need to run:

CLUSTER gene_prediction_view USING gene_prediction_view_gene_ref_key;

after you insert significant amounts of data into the table. This
re-orders the table according to the index, but new data is always written
out of order, so after adding lots more data the table will need to be
re-clustered again.

> - Switch to a RAID10 (4 times the IOs per second, however zero gain if
> you're single-threaded, but massive gain when concurrent)


Greg Stark has a patch in the pipeline that will change this, for bitmap
index scans, by using fadvise(), so a single thread can utilise multiple
discs in a RAID array.

Matthew

--
Prolog doesn't have enough parentheses. -- Computer Science Lecturer

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #9 (permalink)  
Old 04-19-2008, 10:46 AM
John Beaver
 
Posts: n/a
Default Re: large tables and simple "= constant" queries usingindexes

Thanks a lot, all of you - this is excellent advice. With the data
clustered and statistics at a more reasonable value of 100, it now
reproducibly takes even less time - 20-57 ms per query.

After reading the section on "Statistics Used By the Planner" in the
manual, I was a little concerned that, while the statistics sped up the
queries that I tried immeasurably, that the most_common_vals array was
where the speedup was happening, and that the values which wouldn't fit
in this array wouldn't be sped up. Though I couldn't offhand find an
example where this occurred, the clustering approach seems intuitively
like a much more complete and scalable solution, at least for a
read-only table like this.

As to whether the entire index/table was getting into ram between my
statistics calls, I don't think this was the case. Here's the behavior
that I found:
- With statistics at 10, the query took 25 (or so) seconds no matter how
many times I tried different values. The query plan was the same as for
the 200 and 800 statistics below.
- Trying the same constant a second time gave an instantaneous result,
I'm guessing because of query/result caching.
- Immediately on increasing the statistics to 200, the query took a
reproducibly less amount of time. I tried about 10 different values
- Immediately on increasing the statistics to 800, the query
reproducibly took less than a second every time. I tried about 30
different values.
- Decreasing the statistics to 100 and running the cluster command
brought it to 57 ms per query.
- The Activity Monitor (OSX) lists the relevant postgres process as
taking a little less than 500 megs.
- I didn't try decreasing the statistics back to 10 before I ran the
cluster command, so I can't show the search times going up because of
that. But I tried killing the 500 meg process. The new process uses less
than 5 megs of ram, and still reproducibly returns a result in less than
60 ms. Again, this is with a statistics value of 100 and the data
clustered by gene_prediction_view_gene_ref_key.

And I'll consider the idea of using triggers with an ancillary table for
other purposes; seems like it could be a useful solution for something.

Matthew wrote:
> On Thu, 10 Apr 2008, PFC wrote:
>
> ... Lots of useful advice ...
>
>> - If you often query rows with the same gene_ref, consider using
>> CLUSTER to physically group those rows on disk. This way you can get
>> all rows with the same gene_ref in 1 seek instead of 2000. Clustered
>> tables also make Bitmap scan happy.

>
> In my opinion this is the one that will make the most difference. You
> will need to run:
>
> CLUSTER gene_prediction_view USING gene_prediction_view_gene_ref_key;
>
> after you insert significant amounts of data into the table. This
> re-orders the table according to the index, but new data is always
> written out of order, so after adding lots more data the table will
> need to be re-clustered again.
>
>> - Switch to a RAID10 (4 times the IOs per second, however zero gain
>> if you're single-threaded, but massive gain when concurrent)

>
> Greg Stark has a patch in the pipeline that will change this, for
> bitmap index scans, by using fadvise(), so a single thread can utilise
> multiple discs in a RAID array.
>
> Matthew
>


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #10 (permalink)  
Old 04-19-2008, 10:46 AM
Greg Smith
 
Posts: n/a
Default Re: large tables and simple "= constant" queries usingindexes

On Thu, 10 Apr 2008, Bill Moran wrote:

> If you install the pg_buffercache addon, you can actually look into
> PostgreSQL's internals and see what tables are in the buffer in real
> time.


The "Inside the PostgreSQL Buffer Cache" talk I did at the recent East
conference is now on-line at
http://www.westnet.com/~gsmith/content/postgresql/

The slides explain how that information gets updated and used internally,
and the separate "sample queries" file there shows some more complicated
views I've written against pg_buffercache. Here's a sample one:

relname |buffered| buffers % | % of rel
accounts | 306 MB | 65.3 | 24.7
accounts_pkey | 160 MB | 34.1 | 93.2

This shows that 65.3% of the buffer cache is filled with the accounts
table, which is caching 24.7% of the full table. These are labeled
"relations" because there's a mix of table and index data there.
accounts_pkey is an index for example, which is why almost all of it is
staying inside the buffer cache.

The queries that use usage_count only work against 8.3, that one above
should work on older versions as well.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
Reply


Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are On



All times are GMT. The time now is 11:47 AM.


Powered by vBulletin® Version 3.6.5
Copyright ©2000 - 2008, Jelsoft Enterprises Ltd.
Search Engine Optimization by vBSEO 3.1.0

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413