vBulletin Search Engine Optimization
| |||||||
| Register | FAQ | Members List | Calendar | Search | Today's Posts | Mark Forums Read |
| ||||
| In Sybase ASE (and I'm pretty sure the same is true in Microsoft SQL Server) the leaf level of the narrowest index on the table is scanned, following a linked list of leaf pages. Leaf pages can be pretty dense under Sybase, because they do use prefix compression. A count(*) on a table with 100 million rows is going to take a few minutes, but it is going to be at least an order of magnitude faster than a data page scan -- maybe two orders of magnitude faster. What I don't understand is why people need to do such things so frequently that it's a major issue, rather than an occassional annoyance. A solution which not only helped the count(*) issue but also allowed index scans to skip the trip to the data page to see if it's an active version seems like it would boost performance overall. As pointed out elsewhere, it could also allow new techniques for vacuum which could be beneficial. My view is that when tables get so big that a count(*) takes that much time, you don't typiclally need an EXACT count anyway -- you could normally check the statistics from your nightly analyze. -Kevin >>> Tom Lane <tgl@sss.pgh.pa.us> >>> Simon Riggs <simon@2ndquadrant.com> writes: > Bearing in mind other RDBMS' approach is to count the number of rows in > an index, their cost is probably about the same as scanning table > blocks/10 very roughly - so the cost is far from zero for them. Really? The impression I get is that people who ask for this expect the answer to be instantaneous, ie they think the system will maintain a running net total for each table. (In a non-MVCC system that isn't necessarily an unreasonable thing to do.) I really can't get excited about adding this level of complexity and overhead to the system just to support COUNT(*)-with-no-WHERE slightly better than we do now. The triggers-and-deltas approach previously proposed seems considerably more attractive to me, because (1) it's not invasive and (2) you only have to pay the overhead on tables where you want it. regards, tom lane ---------------------------(end of broadcast)--------------------------- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq |
| |||
| On Thu, 2005-11-17 at 16:30 -0600, Kevin Grittner wrote: > In Sybase ASE (and I'm pretty sure the same is true in Microsoft SQL > Server) the leaf level of the narrowest index on the table is scanned, > following a linked list of leaf pages. Leaf pages can be pretty dense > under Sybase, because they do use prefix compression. A count(*) > on a table with 100 million rows is going to take a few minutes, but it > is going to be at least an order of magnitude faster than a data page > scan -- maybe two orders of magnitude faster. > > What I don't understand is why people need to do such things so > frequently that it's a major issue, rather than an occassional > annoyance. Agreed, completely. (And it galls me to agree with multiple, potentially opposed opinions on my own thread). The trouble is, people moan and constantly. Perhaps we should stick to our guns and say, why do you care? From here, I think we should say, "show me an application package that needs this so badly we'll change PostgreSQL just for them". Prove it and we'll do it. Kinda polite in the TODO, but I think we should put something in there that says "things we haven't yet had any good reason to improve". > A solution which not only helped the count(*) issue > but also allowed index scans to skip the trip to the data page to > see if it's an active version seems like it would boost performance > overall. As pointed out elsewhere, it could also allow new > techniques for vacuum which could be beneficial. > > My view is that when tables get so big that a count(*) takes that > much time, you don't typiclally need an EXACT count anyway -- > you could normally check the statistics from your nightly analyze. Amen. >From here, another proposal. We have a GUC called count_uses_estimate that is set to off by default. If set to true, then a count(*) will use the planner logic to estimate number of rows in the table and return that as the answer, rather than actually count the row. Unless analyze statistics are not available, in which case it does the real count. Best Regards, Simon Riggs ---------------------------(end of broadcast)--------------------------- TIP 4: Have you searched our list archives? http://archives.postgresql.org |
| |||
| Simon Riggs <simon@2ndquadrant.com> writes: > From here, another proposal. We have a GUC called count_uses_estimate > that is set to off by default. If set to true, then a count(*) will use > the planner logic to estimate number of rows in the table and return > that as the answer, rather than actually count the row. Ugh. Why not just provide a function to retrieve the planner estimate, but *not* call it count(*)? It would fit nicely with the contrib/dbsize stuff (or I should say, the stuff formerly in dbsize...) regards, tom lane ---------------------------(end of broadcast)--------------------------- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match |
| |||
| On Fri, 18 Nov 2005, Simon Riggs wrote: > >From here, another proposal. We have a GUC called count_uses_estimate > that is set to off by default. If set to true, then a count(*) will use > the planner logic to estimate number of rows in the table and return > that as the answer, rather than actually count the row. Unless analyze > statistics are not available, in which case it does the real count. I'm finishing off a tablesample patch a grad student on #postgresql was working on. template1=# select count(*)*100 from a tablesample system(1) repeatable (2); ?column? ---------- 8371100 (1 row) Time: 6366.757 ms template1=# select count(*)*50 from a tablesample system(2) repeatable (11); ?column? ---------- 8453550 (1 row) Time: 10521.871 ms template1=# select count(*)*10 from a tablesample system(10) repeatable (3); ?column? ---------- 8314350 (1 row) Time: 28744.498 ms template1=# select count(*) from a; count --------- 8388608 (1 row) Time: 33897.857 ms Seems like a better solution. I can finish the patch pretty soon. I need to contact the original author, who has disappeared, but I'll send it over to you. Gavin ---------------------------(end of broadcast)--------------------------- TIP 5: don't forget to increase your free space map settings |
| |||
| On Fri, 2005-11-18 at 11:51 +1100, Gavin Sherry wrote: > Seems like a better solution. I can finish the patch pretty soon. I need > to contact the original author, who has disappeared, but I'll send it over > to you. Sounds good. I wondered where he'd gone to. Sampling would be useful for 8,2 Best Regards, Simon Riggs ---------------------------(end of broadcast)--------------------------- TIP 2: Don't 'kill -9' the postmaster |
| |||
| > > Seems like a better solution. I can finish the patch pretty soon. I need > > to contact the original author, who has disappeared, but I'll send it over > > to you. > Sounds good. I wondered where he'd gone to. Still here :-) Just got swamped with too much work that the tablesample patch got paged out. Gavin has a working version of the patch. I can give a hand if need be. Thanks ---------------------------(end of broadcast)--------------------------- TIP 2: Don't 'kill -9' the postmaster |
| |||
| Tom Lane wrote: > Simon Riggs <simon@2ndquadrant.com> writes: > >>From here, another proposal. We have a GUC called count_uses_estimate >>that is set to off by default. If set to true, then a count(*) will use >>the planner logic to estimate number of rows in the table and return >>that as the answer, rather than actually count the row. > > > Ugh. Why not just provide a function to retrieve the planner estimate, > but *not* call it count(*)? It would fit nicely with the contrib/dbsize > stuff (or I should say, the stuff formerly in dbsize...) That would completely remove my needs for a fast count() - all I want is a way to quickly estimate table sizes for an operator's display. Tom's suggestion would provide exactly that. -- Steve Wampler -- swampler@noao.edu The gods that smiled on your birth are now laughing out loud. ---------------------------(end of broadcast)--------------------------- TIP 2: Don't 'kill -9' the postmaster |
| ||||
| On Fri, Nov 18, 2005 at 12:08:03AM +0000, Simon Riggs wrote: > The trouble is, people moan and constantly. Perhaps we should stick to > our guns and say, why do you care? From here, I think we should say, > "show me an application package that needs this so badly we'll change > PostgreSQL just for them". Prove it and we'll do it. Kinda polite in the > TODO, but I think we should put something in there that says "things we > haven't yet had any good reason to improve". FWIW, this is one of Tom Kyte's (of http://asktom.oracle.com fame) big complaints: if you have a query where count(*) isn't nearly instant then you probably don't need an exact count in the first place and should be happy enough with an estimate. He constantly cites Google ('Result 1-10 of about 38,923') as an example of this. -- Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461 ---------------------------(end of broadcast)--------------------------- TIP 6: explain analyze is your friend |