vBulletin Search Engine Optimization
| |||||||
| Register | FAQ | Members List | Calendar | Search | Today's Posts | Mark Forums Read |
| ||||
| On Apr 10, 2008, at 9:44 AM, John Beaver wrote: > Thanks a lot, all of you - this is excellent advice. With the data > clustered and statistics at a more reasonable value of 100, it now > reproducibly takes even less time - 20-57 ms per query. > > After reading the section on "Statistics Used By the Planner" in the > manual, I was a little concerned that, while the statistics sped up > the queries that I tried immeasurably, that the most_common_vals > array was where the speedup was happening, and that the values which > wouldn't fit in this array wouldn't be sped up. Though I couldn't > offhand find an example where this occurred, the clustering approach > seems intuitively like a much more complete and scalable solution, > at least for a read-only table like this. > > As to whether the entire index/table was getting into ram between my > statistics calls, I don't think this was the case. Here's the > behavior that I found: > - With statistics at 10, the query took 25 (or so) seconds no matter > how many times I tried different values. The query plan was the same > as for the 200 and 800 statistics below. > - Trying the same constant a second time gave an instantaneous > result, I'm guessing because of query/result caching. > - Immediately on increasing the statistics to 200, the query took a > reproducibly less amount of time. I tried about 10 different values > - Immediately on increasing the statistics to 800, the query > reproducibly took less than a second every time. I tried about 30 > different values. > - Decreasing the statistics to 100 and running the cluster command > brought it to 57 ms per query. > - The Activity Monitor (OSX) lists the relevant postgres process as > taking a little less than 500 megs. > - I didn't try decreasing the statistics back to 10 before I ran the > cluster command, so I can't show the search times going up because > of that. But I tried killing the 500 meg process. The new process > uses less than 5 megs of ram, and still reproducibly returns a > result in less than 60 ms. Again, this is with a statistics value of > 100 and the data clustered by gene_prediction_view_gene_ref_key. > > And I'll consider the idea of using triggers with an ancillary table > for other purposes; seems like it could be a useful solution for > something. FWIW, killing the backend process responsible for the query won't necessarily clear the table's data from memory as that will be in the shared_buffers. If you really want to flush the data from memory you need to read in data from other tables of a size total size greater than your shared_buffers setting. Erik Jones DBA | EmmaŽ erik@myemma.com 800.595.4401 or 615.292.5888 615.292.0777 (fax) Emma helps organizations everywhere communicate & market in style. Visit us online at http://www.myemma.com -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance |
| |||
| John Beaver wrote: > - Trying the same constant a second time gave an instantaneous result, > I'm guessing because of query/result caching. AFAIK no query/result caching is in place in postgres, what you are experiencing is OS disk/memory caching. Regards Gaetano Mendola |
| |||
| Thanks Eric and Gaestano - interesting, and both examples of my naivite. I tried running large select(*) queries on other tables followed by another try at the offending query, and it was still fast. Just to be absolutely sure this is a scalable solution, I'll try restarting my computer in a few hours to see if it affects anything cache-wise. Gaetano Mendola wrote: John Beaver wrote: - Trying the same constant a second time gave an instantaneous result, I'm guessing because of query/result caching. AFAIK no query/result caching is in place in postgres, what you are experiencing is OS disk/memory caching. Regards Gaetano Mendola |
| |||
| In response to John Beaver <john.e.beaver@gmail.com>: > Thanks Eric and Gaestano - interesting, and both examples of my naivite. > > I tried running large select(*) queries on other tables followed by another try at the offending query, and it was still fast. Just to be absolutely sure this is a scalable solution, I'll try restarting my computer in a few hours to see if it affects anything cache-wise. I say this over and over again ... because I think it's really cool and useful. If you install the pg_buffercache addon, you can actually look into PostgreSQL's internals and see what tables are in the buffer in real time. If you're having trouble, it can (potentially) be a helpful tool. -- Bill Moran Collaborative Fusion Inc. http://people.collaborativefusion.com/~wmoran/ wmoran@collaborativefusion.com Phone: 412-422-3463x4023 -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance |
| ||||
| > Thanks a lot, all of you - this is excellent advice. With the data > clustered and statistics at a more reasonable value of 100, it now > reproducibly takes even less time - 20-57 ms per query. 1000x speedup with proper tuning - always impressive, lol. IO seeks are always your worst enemy. > After reading the section on "Statistics Used By the Planner" in the > manual, I was a little concerned that, while the statistics sped up the > queries that I tried immeasurably, that the most_common_vals array was > where the speedup was happening, and that the values which wouldn't fit > in this array wouldn't be sped up. Though I couldn't offhand find an > example where this occurred, the clustering approach seems intuitively > like a much more complete and scalable solution, at least for a > read-only table like this. Actually, with statistics set to 100, then 100 values will be stored in most_common_vals. This would mean that the values not in most_common_vals will have less than 1% frequency, and probably much less than that. The choice of plan for these rare values is pretty simple. With two columns, "interesting" stuff can happen, like if you have col1 in [1...10] and col2 in [1...10] and use a condition on col1=const and col2=const, the selectivity of the result depends not only on the distribution of col1 and col2 but also their correlation. As for the tests you did, it's hard to say without seeing the explain analyze outputs. If you change the stats and the plan choice (EXPLAIN) stays the same, and you use the same values in your query, any difference in timing comes from caching, since postgres is executing the same plan and therefore doing the exact same thing. Caching (from PG and from the OS) can make the timings vary a lot. > - Trying the same constant a second time gave an instantaneous result, > I'm guessing because of query/result caching. PG does not cache queries or results. It caches data & index pages in its shared buffers, and then the OS adds another layer of the usual disk cache. A simple query like selecting one row based on PK takes about 60 microseconds of CPU time, but if it needs one seek for the index and one for the data it may take 20 ms waiting for the moving parts to move... Hence, CLUSTER is a very useful tool. Bitmap index scans love clustered tables because all the interesting rows end up being grouped together, so much less pages need to be visited. > - I didn't try decreasing the statistics back to 10 before I ran the > cluster command, so I can't show the search times going up because of > that. But I tried killing the 500 meg process. The new process uses less > than 5 megs of ram, and still reproducibly returns a result in less than > 60 ms. Again, this is with a statistics value of 100 and the data > clustered by gene_prediction_view_gene_ref_key. Killing it or just restarting postgres ? If you let postgres run (not idle) for a while, naturally it will fill the RAM up to the shared_buffers setting that you specified in the configuration file. This is good, since grabbing data from postgres' own cache is faster than having to make a syscall to the OS to get it from the OS disk cache (or disk). This isn't bloat. But what those 500 MB versus 6 MB show is that before, postgres had to read a lot of data for your query, so it stayed in the cache ; after tuning it needs to read much less data (thanks to CLUSTER) so the cache stays empty. -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance |