Unix Technical Forum

Two efficiency questions - clustering and ints

This is a discussion on Two efficiency questions - clustering and ints within the Pgsql General forums, part of the PostgreSQL category; --> I have a good-size DB (some tables approaching 100M rows), with essentially static data. Should I always cluster the ...


Go Back   Unix Technical Forum > Database Server Software > PostgreSQL > Pgsql General

Register FAQ Members List Calendar Search Today's Posts Mark Forums Read
  #1 (permalink)  
Old 04-09-2008, 11:42 AM
John D. Burger
 
Posts: n/a
Default Two efficiency questions - clustering and ints

I have a good-size DB (some tables approaching 100M rows), with
essentially static data.

Should I always cluster the tables? That is, even if no column jumps
out as being involved in most queries, should I pick a likely one and
cluster on it? (Of course, this assumes that doing so won't cause
bad correlation with any other oft-used column.)

Another question, about integer types - if no cross-type coercion is
involved, is there any reason not to choose the smallest int type
that will fit my data? In particular, I have a column of small-
integer ratings with, say, values in [1, 10]. If I'm only comparing
within such ratings, and possibly computing floating point averages,
etc., what are the good and bad points of using, say, SMALLINT? What
about NUMERIC(1) or (2)?

Thanks in advance for the usual brilliant replies!

- John D. Burger
MITRE


---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
choose an index scan if your joining column's datatypes do not
match

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #2 (permalink)  
Old 04-09-2008, 11:42 AM
Richard Huxton
 
Posts: n/a
Default Re: Two efficiency questions - clustering and ints

John D. Burger wrote:
> I have a good-size DB (some tables approaching 100M rows), with
> essentially static data.
>
> Should I always cluster the tables? That is, even if no column jumps
> out as being involved in most queries, should I pick a likely one and
> cluster on it? (Of course, this assumes that doing so won't cause bad
> correlation with any other oft-used column.)


Well you cluster on an index, and if you don't think the index is
useful, I'd drop it. If you have an index, clustering isn't necessarily
going to help you unless you regularly read a series of rows in order.

> Another question, about integer types - if no cross-type coercion is
> involved, is there any reason not to choose the smallest int type that
> will fit my data? In particular, I have a column of small-integer
> ratings with, say, values in [1, 10]. If I'm only comparing within such
> ratings, and possibly computing floating point averages, etc., what are
> the good and bad points of using, say, SMALLINT? What about NUMERIC(1)
> or (2)?


(int2, int2) should pack into 4 bytes on-disk, but due to alignment
issues I think (int2, int4) still takes up 8 bytes. There has been
discussion about being able to have different physical column ordering
on-disk vs. in SQL but no decision as to whether the effort will be
worthwhile.

Numeric types tend to be slower than their int equivalent, and though
I've not checked their storage requirements, I'd assume they take more
space too.

HTH
--
Richard Huxton
Archonet Ltd

---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #3 (permalink)  
Old 04-09-2008, 11:43 AM
John D. Burger
 
Posts: n/a
Default Re: Two efficiency questions - clustering and ints

Richard Huxton wrote:

>> Should I always cluster the tables? That is, even if no column
>> jumps out as being involved in most queries, should I pick a
>> likely one and cluster on it?

>
> Well you cluster on an index, and if you don't think the index is
> useful, I'd drop it. If you have an index, clustering isn't
> necessarily going to help you unless you regularly read a series of
> rows in order.


Right, but clustering on that index means an index scan (with that
index) will be faster. This is uninteresting if the index doesn't
get used, but is there any =downside= to doing it?

Here's a simpler question - for static data, should I always cluster
on the index I think will be used the most?

Thanks.

- John D. Burger
MITRE


---------------------------(end of broadcast)---------------------------
TIP 5: don't forget to increase your free space map settings

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #4 (permalink)  
Old 04-09-2008, 11:43 AM
Casey Duncan
 
Posts: n/a
Default Re: Two efficiency questions - clustering and ints


On Oct 6, 2006, at 11:12 AM, John D. Burger wrote:

> Richard Huxton wrote:
>
>>> Should I always cluster the tables? That is, even if no column
>>> jumps out as being involved in most queries, should I pick a
>>> likely one and cluster on it?

>>
>> Well you cluster on an index, and if you don't think the index is
>> useful, I'd drop it. If you have an index, clustering isn't
>> necessarily going to help you unless you regularly read a series
>> of rows in order.

>
> Right, but clustering on that index means an index scan (with that
> index) will be faster. This is uninteresting if the index doesn't
> get used, but is there any =downside= to doing it?
>
> Here's a simpler question - for static data, should I always
> cluster on the index I think will be used the most?


Depends on the queries. If the index is on a foreign key value where
there may be many rows with the same key scattered about it will help
queries that lookup using that foreign key. Clustering on a column
with high cardinality isn't much of a win typically for single key
lookups (depends on the lookup pattern), but would be for ranges and
possibly for sorting on that column.

It also depends on the size of the table and indices. If they are
small enough to fit in memory then clustering to reduce random access
isn't really helpful.

I would suggest doing some timing tests on typical queries with the
data unclustered and clustered to know what you are gaining.

-Casey

---------------------------(end of broadcast)---------------------------
TIP 4: Have you searched our list archives?

http://archives.postgresql.org

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
Reply


Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are On
Forum Jump


All times are GMT. The time now is 05:38 AM.


Powered by vBulletin® Version 3.6.5
Copyright ©2000 - 2008, Jelsoft Enterprises Ltd.
SEO by vBSEO 3.2.0
www.UnixAdminTalk.com