Unix Technical Forum

rtree/gist index taking enormous amount of space in 8.2.3

This is a discussion on rtree/gist index taking enormous amount of space in 8.2.3 within the Pgsql Performance forums, part of the PostgreSQL category; --> Hi, In version 8.1.5, I have an rtree index on a 1.5 GB table. The size of this index ...


Go Back   Unix Technical Forum > Database Server Software > PostgreSQL > Pgsql Performance

FAQ Members List Calendar Search Today's Posts Mark Forums Read
  #1 (permalink)  
Old 04-19-2008, 10:08 AM
Dolafi, Tom
 
Posts: n/a
Default rtree/gist index taking enormous amount of space in 8.2.3

Hi,



In version 8.1.5, I have an rtree index on a 1.5 GB table. The size of
this index is 500 MB. After migrating to 8.2.3, the size of this index
has increased to 35GB. I've dropped are recreated the index and got the
same result. In 8.2.3 the index type is gist, does this have something
to do with it? At 35GB we have seen a decrease in performance. Any
help or hints are appreciated.



CREATE INDEX binloc_boxrange

ON featureloc

USING rtree

(boxrange(fmin, fmax));





CREATE TABLE featureloc

(

featureloc_id serial NOT NULL,

feature_id integer NOT NULL,

srcfeature_id integer,

fmin integer,

is_fmin_partial boolean NOT NULL DEFAULT false,

fmax integer,

is_fmax_partial boolean NOT NULL DEFAULT false,

strand smallint,

phase integer,

residue_info text,

locgroup integer NOT NULL DEFAULT 0,

rank integer NOT NULL DEFAULT 0,

.....





CREATE OR REPLACE FUNCTION boxrange(integer, integer)

RETURNS box AS

'SELECT box (create_point(0, $1), create_point($2,500000000))'

LANGUAGE 'sql' IMMUTABLE;

ALTER FUNCTION boxrange(integer, integer) OWNER TO cjm;





CREATE OR REPLACE FUNCTION create_point(integer, integer)

RETURNS point AS

'SELECT point ($1, $2)'

LANGUAGE 'sql' VOLATILE;

ALTER FUNCTION create_point(integer, integer) OWNER TO cjm;







Thanks,

Tom


Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #2 (permalink)  
Old 04-19-2008, 10:08 AM
Tom Lane
 
Posts: n/a
Default Re: rtree/gist index taking enormous amount of space in 8.2.3

"Dolafi, Tom" <dolafit@janelia.hhmi.org> writes:
> In version 8.1.5, I have an rtree index on a 1.5 GB table. The size of
> this index is 500 MB. After migrating to 8.2.3, the size of this index
> has increased to 35GB. I've dropped are recreated the index and got the
> same result. In 8.2.3 the index type is gist, does this have something
> to do with it?


We dropped rtree in 8.2 on the strength of experiments that seemed to
show gist was always better, but you seem to have an outlier case...

Can you tell us something about the distribution of the data values
(fmin and fmax)? Is there anything particularly magic about the two
constants you're using in the boxes? I suppose you've hit some bad
corner case in the gist box opclass, but it's not clear what.

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 2: Don't 'kill -9' the postmaster

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #3 (permalink)  
Old 04-19-2008, 10:08 AM
Dolafi, Tom
 
Posts: n/a
Default Re: rtree/gist index taking enormous amount of space in 8.2.3


min(fmin) | max(fmin) | avg(fmin)
1 | 55296469 | 11423945

min(fmax) | max(fmax) | avg(fmax)
18 | 55553288 | 11424491

There are 5,704,211 rows in the table.

This application has been inherited by us. As far as I can tell the
magic of the two constants seem to represent the notion of infinity.

Thanks,
Tom

-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
Sent: Thursday, June 28, 2007 12:09 PM
To: Dolafi, Tom
Cc: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] rtree/gist index taking enormous amount of space
in 8.2.3

"Dolafi, Tom" <dolafit@janelia.hhmi.org> writes:
> In version 8.1.5, I have an rtree index on a 1.5 GB table. The size

of
> this index is 500 MB. After migrating to 8.2.3, the size of this

index
> has increased to 35GB. I've dropped are recreated the index and got

the
> same result. In 8.2.3 the index type is gist, does this have

something
> to do with it?


We dropped rtree in 8.2 on the strength of experiments that seemed to
show gist was always better, but you seem to have an outlier case...

Can you tell us something about the distribution of the data values
(fmin and fmax)? Is there anything particularly magic about the two
constants you're using in the boxes? I suppose you've hit some bad
corner case in the gist box opclass, but it's not clear what.

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 4: Have you searched our list archives?

http://archives.postgresql.org

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #4 (permalink)  
Old 04-19-2008, 10:08 AM
Craig James
 
Posts: n/a
Default Re: rtree/gist index taking enormous amount of space in8.2.3

Dolafi, Tom wrote:
> min(fmin) | max(fmin) | avg(fmin)
> 1 | 55296469 | 11423945
>
> min(fmax) | max(fmax) | avg(fmax)
> 18 | 55553288 | 11424491
>
> There are 5,704,211 rows in the table.


When you're looking for weird index problems, it's more interesting to know if there are certain numbers that occur a LOT. From your statistics above, each number occurs about 10 times in the table. But do some particular numbers occur thousands, or even millions, of times?

Here is a query that will print a list of the highest-occuring values. You might expect a few occurances of 20, and maybe 30, but if you have thousands or millions of occurances of certain numbers, then that can screw up an index.

select fmax, c from
(select fmax, count(fmax) as c from your_table group by fmax) as foo
where c > 3 order by c desc;

Craig


---------------------------(end of broadcast)---------------------------
TIP 3: Have you checked our extensive FAQ?

http://www.postgresql.org/docs/faq

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #5 (permalink)  
Old 04-19-2008, 10:08 AM
Tom Lane
 
Posts: n/a
Default Re: rtree/gist index taking enormous amount of space in 8.2.3

"Dolafi, Tom" <dolafit@janelia.hhmi.org> writes:
> min(fmin) | max(fmin) | avg(fmin)
> 1 | 55296469 | 11423945
> min(fmax) | max(fmax) | avg(fmax)
> 18 | 55553288 | 11424491


OK, I was able to reproduce a problem after making the further guess
that fmax is usually a little bit greater than fmin. The attached test
script generates an rtree index of around 800 pages on 8.1.9, and the
index build time is about 6 seconds on my machine. On CVS HEAD, the
script generates a gist index of over 30000 pages and the build time is
over 60 seconds. Since I'm using random() the numbers move around a
bit, but they're consistently awful. I experimented with a few other
distributions, such as fmin and fmax chosen independently in the same
range, and saw gist build time usually better than rtree and index size
only somewhat larger, so this particular distribution apparently fools
gist_box_picksplit rather badly. The problem seems nonlinear too ---
I had originally tried it with 1 million test rows instead of 100000,
and gave up waiting for the index build after more than an hour.

Oleg, Teodor, can this be improved?

regards, tom lane

drop table featureloc;

CREATE TABLE featureloc
(
fmin integer,
fmax integer
);

insert into featureloc
select r1, r1 + 1 + random() * 1000 from
(select 1 + random() * 55000000 as r1, 1 + random() * 55000000 as r2
from generate_series(1,100000) offset 0) as ss;

CREATE OR REPLACE FUNCTION boxrange(integer, integer)
RETURNS box AS
'SELECT box (point(0, $1), point($2, 500000000))'
LANGUAGE 'sql' STRICT IMMUTABLE;

CREATE INDEX binloc_boxrange
ON featureloc
USING rtree
(boxrange(fmin, fmax));

vacuum verbose featureloc;

---------------------------(end of broadcast)---------------------------
TIP 2: Don't 'kill -9' the postmaster

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #6 (permalink)  
Old 04-19-2008, 10:08 AM
Dolafi, Tom
 
Posts: n/a
Default Re: rtree/gist index taking enormous amount of space in 8.2.3

The data is not distributed well...

Top 20 occurrences of fmin and fmax:
fmin | count
----------+--------
0 | 214476
19281576 | 2870
2490005 | 2290
1266332 | 2261
15539680 | 2086
11022233 | 2022
25559658 | 1923
3054411 | 1906
10237885 | 1890
13827272 | 1876
19187021 | 1847
18101335 | 1845
1518230 | 1843
21199488 | 1842
1922518 | 1826
1216144 | 1798
25802126 | 1762
8307335 | 1745
21271866 | 1736
8361667 | 1721


fmax | count
----------+--------
25 | 197551
21272002 | 547
21271988 | 335
21271969 | 321
6045781 | 247
1339301 | 243
21669151 | 235
7779506 | 232
2571422 | 229
7715946 | 228
27421323 | 222
7048089 | 221
87364 | 219
13656535 | 217
26034147 | 214
19184612 | 213
7048451 | 213
21668877 | 213
6587492 | 212
9484598 | 212

Also, out of 5.7 million rows there are 1.6 million unique fmin and 1.6
million unique fmax values.

Thanks,
Tom

-----Original Message-----
From: Craig James [mailto:craig_james@emolecules.com]
Sent: Friday, June 29, 2007 12:14 PM
To: Dolafi, Tom
Cc: Tom Lane; pgsql-performance@postgresql.org
Subject: Re: [PERFORM] rtree/gist index taking enormous amount of space
in 8.2.3

Dolafi, Tom wrote:
> min(fmin) | max(fmin) | avg(fmin)
> 1 | 55296469 | 11423945
>
> min(fmax) | max(fmax) | avg(fmax)
> 18 | 55553288 | 11424491
>
> There are 5,704,211 rows in the table.


When you're looking for weird index problems, it's more interesting to
know if there are certain numbers that occur a LOT. From your
statistics above, each number occurs about 10 times in the table. But
do some particular numbers occur thousands, or even millions, of times?

Here is a query that will print a list of the highest-occuring values.
You might expect a few occurances of 20, and maybe 30, but if you have
thousands or millions of occurances of certain numbers, then that can
screw up an index.

select fmax, c from
(select fmax, count(fmax) as c from your_table group by fmax) as foo
where c > 3 order by c desc;

Craig

---------------------------(end of broadcast)---------------------------
TIP 3: Have you checked our extensive FAQ?

http://www.postgresql.org/docs/faq

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #7 (permalink)  
Old 04-19-2008, 10:08 AM
Dolafi, Tom
 
Posts: n/a
Default Re: rtree/gist index taking enormous amount of space in 8.2.3

Thanks for looking into this and reproducing a similar result. The
index took 6 hours to complete on a 1.5GB table resulting in 35GB of
storage, and it took 36 hours to vacuum... I'm patient :-)

In the mean time I've dropped the index which has resulted in overall
performance gain on queries against the table, but we have not tested
the part of the application which would utilize this index.

- Tom

-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
Sent: Friday, June 29, 2007 1:58 PM
To: Dolafi, Tom
Cc: pgsql-performance@postgresql.org; Oleg Bartunov; Teodor Sigaev
Subject: Re: [PERFORM] rtree/gist index taking enormous amount of space
in 8.2.3

"Dolafi, Tom" <dolafit@janelia.hhmi.org> writes:
> min(fmin) | max(fmin) | avg(fmin)
> 1 | 55296469 | 11423945
> min(fmax) | max(fmax) | avg(fmax)
> 18 | 55553288 | 11424491


OK, I was able to reproduce a problem after making the further guess
that fmax is usually a little bit greater than fmin. The attached test
script generates an rtree index of around 800 pages on 8.1.9, and the
index build time is about 6 seconds on my machine. On CVS HEAD, the
script generates a gist index of over 30000 pages and the build time is
over 60 seconds. Since I'm using random() the numbers move around a
bit, but they're consistently awful. I experimented with a few other
distributions, such as fmin and fmax chosen independently in the same
range, and saw gist build time usually better than rtree and index size
only somewhat larger, so this particular distribution apparently fools
gist_box_picksplit rather badly. The problem seems nonlinear too ---
I had originally tried it with 1 million test rows instead of 100000,
and gave up waiting for the index build after more than an hour.

Oleg, Teodor, can this be improved?

regards, tom lane

drop table featureloc;

CREATE TABLE featureloc
(
fmin integer,
fmax integer
);

insert into featureloc
select r1, r1 + 1 + random() * 1000 from
(select 1 + random() * 55000000 as r1, 1 + random() * 55000000 as r2
from generate_series(1,100000) offset 0) as ss;

CREATE OR REPLACE FUNCTION boxrange(integer, integer)
RETURNS box AS
'SELECT box (point(0, $1), point($2, 500000000))'
LANGUAGE 'sql' STRICT IMMUTABLE;

CREATE INDEX binloc_boxrange
ON featureloc
USING rtree
(boxrange(fmin, fmax));

vacuum verbose featureloc;

---------------------------(end of broadcast)---------------------------
TIP 5: don't forget to increase your free space map settings

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #8 (permalink)  
Old 04-19-2008, 10:09 AM
Tom Lane
 
Posts: n/a
Default Re: rtree/gist index taking enormous amount of space in 8.2.3

"Dolafi, Tom" <dolafit@janelia.hhmi.org> writes:
> The data is not distributed well...


Can you show us min, max, and avg of fmax minus fmin? I'd like to
check my guess about that being a fairly narrow range.

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
choose an index scan if your joining column's datatypes do not
match

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #9 (permalink)  
Old 04-19-2008, 10:09 AM
Tom Lane
 
Posts: n/a
Default Re: rtree/gist index taking enormous amount of space in 8.2.3

"Dolafi, Tom" <dolafit@janelia.hhmi.org> writes:
> In the mean time I've dropped the index which has resulted in overall
> performance gain on queries against the table, but we have not tested
> the part of the application which would utilize this index.


I noted that with the same (guessed-at) distribution of fmin/fmax, the
index size remains reasonable if you change the derived boxes to

CREATE OR REPLACE FUNCTION boxrange(integer, integer)
RETURNS box AS
'SELECT box (point($1, $1), point($2, $2))'
LANGUAGE 'sql' STRICT IMMUTABLE;

which makes sense from the point of view of geometric intuition: instead
of a bunch of very tall, mostly very narrow, mostly overlapping boxes,
you have a bunch of small square boxes spread out along a line. So it
stands to reason that a geometrically-motivated index structure would
work a lot better on the latter. I don't know though whether your
queries can be adapted to work with this. What was the index being used
for, exactly?

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
choose an index scan if your joining column's datatypes do not
match

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #10 (permalink)  
Old 04-19-2008, 10:09 AM
Dolafi, Tom
 
Posts: n/a
Default Re: rtree/gist index taking enormous amount of space in 8.2.3

(fmax-fmin)...
min | max | avg
---------+---------+----------------------
1 | 2278225 | 546

I noticed 3000 occurrences where fmax is less than fmin. I excluded
these values to get the min difference between the two. Also, there are
20 "invalid"/"bogus" rows with negative values which were excluded from
the queries.

Thanks,
Tom

-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
Sent: Friday, June 29, 2007 2:30 PM
To: Dolafi, Tom
Cc: Craig James; pgsql-performance@postgresql.org
Subject: Re: [PERFORM] rtree/gist index taking enormous amount of space
in 8.2.3

"Dolafi, Tom" <dolafit@janelia.hhmi.org> writes:
> The data is not distributed well...


Can you show us min, max, and avg of fmax minus fmin? I'd like to
check my guess about that being a fairly narrow range.

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 4: Have you searched our list archives?

http://archives.postgresql.org

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
Reply


Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are On
Forum Jump


All times are GMT. The time now is 05:20 AM.


Powered by vBulletin® Version 3.6.5
Copyright ©2000 - 2008, Jelsoft Enterprises Ltd.
SEO by vBSEO 3.2.0
www.UnixAdminTalk.com