This is a discussion on Select Last n Rows Matching an Index Condition (and caches) within the Pgsql General forums, part of the PostgreSQL category; --> We have a ~10million row table but are expecting it to get larger, possibly by a factor of 10 ...
| |||||||
| FAQ | Members List | Calendar | Search | Today's Posts | Mark Forums Read |
| ||||
| We have a ~10million row table but are expecting it to get larger, possibly by a factor of 10 or more. The rows are quite long and fixed length (just over 500 bytes.) We have an index of (symbol, source, date) on this table and doing queries like this SELECT * FROM article WHERE symbol=12646 AND source = 19 ORDER BY time DESC LIMIT 1000; To get the latest 1000 rows for that symbol and source. However this takes quite a while at the best of times, (1-10 seconds.) The query without the order by and the limit tends to return about 70000 rows which adds up to about 30MB of data. Once the pages are in the cache they take around 100ms but this is to be expected. Unfortunately the initial query required to cache it is unnacceptably long for web application like ours. My (not yet implemented) solution to this problem is to add a SEQUENCE and index it so that by adding a WHERE id > [max_id]-1000 and ordering by time DESC will reduce the I/O quite a lot. Am I right here? It would be nice if there was a way to get PostgreSQL to try and precache the tables pages as well, is there anyway I could achieve something like that? I have toyed with creating a ramdisk to store a lookup table of sorts on (we only care about a few columns initially) to speed this up a bit but its a right pain in the arse to do by the looks of things. ---------------------------(end of broadcast)--------------------------- TIP 6: Have you searched our list archives? http://archives.postgresql.org |
| |||
| Alex Stapleton wrote: > SELECT * FROM article WHERE symbol=12646 AND source = 19 ORDER BY time DESC > LIMIT 1000; > > To get the latest 1000 rows for that symbol and source. > My (not yet implemented) solution to this problem is to add a SEQUENCE and > index it so that by adding a WHERE id > [max_id]-1000 and ordering by time > DESC will reduce the I/O quite a lot. Am I right here? It would be nice if > there was a way to get PostgreSQL to try and precache the tables pages as > well, is there anyway I could achieve something like that? I have toyed with > creating a ramdisk to store a lookup table of sorts on (we only care about a > few columns initially) to speed this up a bit but its a right pain in the > arse to do by the looks of things. First question that always gets asked here: What's the output of explain analyse? Without that, people here can't see where the slowdown is. I expect though, that the problem is the ordering by time. I imagine that you could create an index on time, maybe truncated to months or something similar (You can create indices based on functions). That index alone should speed up the ordering already. It could also be used to cluster the table, which should speed up things some more, I suppose. -- Alban Hertroys MAG Productions T: +31(0)53 4346874 F: +31(0)53 4346876 E: alban@magproductions.nl W: http://www.magproductions.nl ---------------------------(end of broadcast)--------------------------- TIP 4: Don't 'kill -9' the postmaster |
| |||
| Woops sorry we have indexes on (symbol, source, time) and there is no date column :/ SELECT * FROM article WHERE symbol=12646 AND source = 19 ORDER BY time DESC LIMIT 1000; QUERY PLAN ---------------------------------------------------------------------------- ---------------------------------------- Limit (cost=321163.29..321165.79 rows=1000 width=466) (actual time=33243.924..33246.021 rows=1000 loops=1) -> Sort (cost=321163.29..321402.31 rows=95609 width=466) (actual time=33243.917..33244.626 rows=1000 loops=1) Sort Key: "time" -> Seq Scan on article (cost=0.00..301724.00 rows=95609 width=466) (actual time=0.022..32979.685 rows=42959 loops=1) Filter: ((symbol = 12646) AND (source = 19)) Total runtime: 33258.706 ms (6 rows) explain analyze SELECT * FROM article WHERE symbol=12646 AND source = 19; QUERY PLAN ---------------------------------------------------------------------------- ---------------------------------------- Seq Scan on article (cost=0.00..301724.00 rows=95609 width=466) (actual time=0.021..33275.433 rows=42959 loops=1) Filter: ((symbol = 12646) AND (source = 19)) Total runtime: 33320.920 ms (3 rows) We can't use CLUSTER because we need the DB up all the time. The analyze suggests that it's the I/O taking most of the time to me. -----Original Message----- From: Alban Hertroys [mailto:alban@magproductions.nl] Sent: 18 March 2005 09:48 To: Alex Stapleton Cc: pgsql-general@postgresql.org Subject: Re: [GENERAL] Select Last n Rows Matching an Index Condition (and caches) Alex Stapleton wrote: > SELECT * FROM article WHERE symbol=12646 AND source = 19 ORDER BY time DESC > LIMIT 1000; > > To get the latest 1000 rows for that symbol and source. > My (not yet implemented) solution to this problem is to add a SEQUENCE and > index it so that by adding a WHERE id > [max_id]-1000 and ordering by time > DESC will reduce the I/O quite a lot. Am I right here? It would be nice if > there was a way to get PostgreSQL to try and precache the tables pages as > well, is there anyway I could achieve something like that? I have toyed with > creating a ramdisk to store a lookup table of sorts on (we only care about a > few columns initially) to speed this up a bit but its a right pain in the > arse to do by the looks of things. First question that always gets asked here: What's the output of explain analyse? Without that, people here can't see where the slowdown is. I expect though, that the problem is the ordering by time. I imagine that you could create an index on time, maybe truncated to months or something similar (You can create indices based on functions). That index alone should speed up the ordering already. It could also be used to cluster the table, which should speed up things some more, I suppose. -- Alban Hertroys MAG Productions T: +31(0)53 4346874 F: +31(0)53 4346876 E: alban@magproductions.nl W: http://www.magproductions.nl ---------------------------(end of broadcast)--------------------------- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to majordomo@postgresql.org so that your message can get through to the mailing list cleanly |
| |||
| On Fri, Mar 18, 2005 at 09:29:06AM -0000, Alex Stapleton wrote: > We have a ~10million row table but are expecting it to get larger, possibly > by a factor of 10 or more. The rows are quite long and fixed length (just > over 500 bytes.) > > We have an index of (symbol, source, date) on this table and doing queries > like this > > SELECT * FROM article WHERE symbol=12646 AND source = 19 ORDER BY time DESC > LIMIT 1000; > > To get the latest 1000 rows for that symbol and source. > > However this takes quite a while at the best of times, (1-10 seconds.) The > query without the order by and the limit tends to return about 70000 rows > which adds up to about 30MB of data. Once the pages are in the cache they > take around 100ms but this is to be expected. Unfortunately the initial > query required to cache it is unnacceptably long for web application like > ours. I think the normal approach for this is an index on (symbol,source,time). You may need to change the query to: SELECT * FROM article WHERE symbol=12646 AND source = 19 ORDER BY symbol desc, source DESC, time DESC LIMIT 1000; The EXPLAIN ANALYZE output would also be very helpful... Hope this helps, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a > tool for doing 5% of the work and then sitting around waiting for someone > else to do the other 95% so you can sue them. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.0.6 (GNU/Linux) Comment: For info see http://www.gnupg.org iD8DBQFCOqdtY5Twig3Ge+YRAn/YAJ93Fw6lyzmRQY4s4iRLhr29R/yXrgCeMCp/ 6nxCAcUQEZ9FbTgmTUMozI8= =BFQ9 -----END PGP SIGNATURE----- |
| |||
| I should point out that theres no garuntee our data is inserted in anything like the order we want (time desc) but there is a high correlation. Most of the time it is almost in order. -----Original Message----- From: pgsql-general-owner@postgresql.org [mailto Sent: 18 March 2005 09:29 To: pgsql-general@postgresql.org Subject: [GENERAL] Select Last n Rows Matching an Index Condition (and caches) We have a ~10million row table but are expecting it to get larger, possibly by a factor of 10 or more. The rows are quite long and fixed length (just over 500 bytes.) We have an index of (symbol, source, date) on this table and doing queries like this SELECT * FROM article WHERE symbol=12646 AND source = 19 ORDER BY time DESC LIMIT 1000; To get the latest 1000 rows for that symbol and source. However this takes quite a while at the best of times, (1-10 seconds.) The query without the order by and the limit tends to return about 70000 rows which adds up to about 30MB of data. Once the pages are in the cache they take around 100ms but this is to be expected. Unfortunately the initial query required to cache it is unnacceptably long for web application like ours. My (not yet implemented) solution to this problem is to add a SEQUENCE and index it so that by adding a WHERE id > [max_id]-1000 and ordering by time DESC will reduce the I/O quite a lot. Am I right here? It would be nice if there was a way to get PostgreSQL to try and precache the tables pages as well, is there anyway I could achieve something like that? I have toyed with creating a ramdisk to store a lookup table of sorts on (we only care about a few columns initially) to speed this up a bit but its a right pain in the arse to do by the looks of things. ---------------------------(end of broadcast)--------------------------- TIP 6: Have you searched our list archives? http://archives.postgresql.org ---------------------------(end of broadcast)--------------------------- TIP 4: Don't 'kill -9' the postmaster |
| ||||
| You might unknowingly be asking a FAQ. See the end of this section: http://www.postgresql.org/docs/faqs.FAQ.html#4.6 Depending on the version of postgres you're running and the data types of the symbol and source columns, you might need to quote (or cast) your constant data for symbol and source, e.g.: SELECT * FROM article WHERE symbol='12646' AND source = '19' ORDER BY time DESC LIMIT 1000; -tfo -- Thomas F. O'Connell Co-Founder, Information Architect Sitening, LLC http://www.sitening.com/ 110 30th Avenue North, Suite 6 Nashville, TN 37203-6320 615-260-0005 On Mar 18, 2005, at 3:56 AM, Alex Stapleton wrote: > Woops sorry we have indexes on (symbol, source, time) and there is no > date > column :/ > > SELECT * FROM article WHERE symbol=12646 AND source = 19 ORDER BY time > DESC > LIMIT 1000; > QUERY PLAN > ----------------------------------------------------------------------- > ----- > ---------------------------------------- > Limit (cost=321163.29..321165.79 rows=1000 width=466) (actual > time=33243.924..33246.021 rows=1000 loops=1) > -> Sort (cost=321163.29..321402.31 rows=95609 width=466) (actual > time=33243.917..33244.626 rows=1000 loops=1) > Sort Key: "time" > -> Seq Scan on article (cost=0.00..301724.00 rows=95609 > width=466) (actual time=0.022..32979.685 rows=42959 loops=1) > Filter: ((symbol = 12646) AND (source = 19)) > Total runtime: 33258.706 ms > (6 rows) > > > explain analyze SELECT * FROM article WHERE symbol=12646 AND source = > 19; > QUERY PLAN > ----------------------------------------------------------------------- > ----- > ---------------------------------------- > Seq Scan on article (cost=0.00..301724.00 rows=95609 width=466) > (actual > time=0.021..33275.433 rows=42959 loops=1) > Filter: ((symbol = 12646) AND (source = 19)) > Total runtime: 33320.920 ms > (3 rows) > > We can't use CLUSTER because we need the DB up all the time. > > The analyze suggests that it's the I/O taking most of the time to me. > > > -----Original Message----- > From: Alban Hertroys [mailto:alban@magproductions.nl] > Sent: 18 March 2005 09:48 > To: Alex Stapleton > Cc: pgsql-general@postgresql.org > Subject: Re: [GENERAL] Select Last n Rows Matching an Index Condition > (and caches) > > > Alex Stapleton wrote: >> SELECT * FROM article WHERE symbol=12646 AND source = 19 ORDER BY time > DESC >> LIMIT 1000; >> >> To get the latest 1000 rows for that symbol and source. > >> My (not yet implemented) solution to this problem is to add a >> SEQUENCE and >> index it so that by adding a WHERE id > [max_id]-1000 and ordering by >> time >> DESC will reduce the I/O quite a lot. Am I right here? It would be >> nice if >> there was a way to get PostgreSQL to try and precache the tables >> pages as >> well, is there anyway I could achieve something like that? I have >> toyed > with >> creating a ramdisk to store a lookup table of sorts on (we only care >> about > a >> few columns initially) to speed this up a bit but its a right pain in >> the >> arse to do by the looks of things. > > First question that always gets asked here: What's the output of > explain > analyse? Without that, people here can't see where the slowdown is. > > I expect though, that the problem is the ordering by time. I imagine > that you could create an index on time, maybe truncated to months or > something similar (You can create indices based on functions). That > index alone should speed up the ordering already. > It could also be used to cluster the table, which should speed up > things > some more, I suppose. > > -- > Alban Hertroys > MAG Productions > > T: +31(0)53 4346874 > F: +31(0)53 4346876 > E: alban@magproductions.nl > W: http://www.magproductions.nl > > > ---------------------------(end of > broadcast)--------------------------- > TIP 3: if posting/reading through Usenet, please send an appropriate > subscribe-nomail command to majordomo@postgresql.org so that your > message can get through to the mailing list cleanly ---------------------------(end of broadcast)--------------------------- TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org |