Unix Technical Forum

Performance of count(*)

This is a discussion on Performance of count(*) within the Pgsql Performance forums, part of the PostgreSQL category; --> Tino Wildenhain wrote: >> You guys can correct me if I'm wrong, but the key feature that's >> missing ...


Go Back   Unix Technical Forum > Database Server Software > PostgreSQL > Pgsql Performance

FAQ Members List Calendar Search Today's Posts Mark Forums Read
  #21 (permalink)  
Old 04-19-2008, 09:25 AM
Craig A. James
 
Posts: n/a
Default Re: Performance of count(*)

Tino Wildenhain wrote:
>> You guys can correct me if I'm wrong, but the key feature that's
>> missing from Postgres's flexible indexing is the ability to maintain
>> state across queries. Something like this:
>>
>> select a, b, my_index_state() from foo where ...
>> offset 100 limit 10 using my_index(prev_my_index_state);
>>

>
> Yes, you are wrong :-) The technique is called "CURSOR"
> if you maintain persistent connection per session
> (e.g. stand allone application or clever pooling webapplication)


That's my whole point: If relational databases had a simple mechanism for storing their internal state in an external application, the need for cursors, connection pools, and all those other tricks would be eliminated.

As I said earlier, relational technology was invented in an earlier era, and hasn't caught up with the reality of modern web apps.

> If its a naive web application you just store your session
> in tables where you can easily maintain the scroll state
> as well.


One thing I've learned in 25 years of software development is that people who use my software have problems I never imagined. I've been the one who was naive when I said similar things about my customers, and was later embarrassed to learn that their problems were more complex than I ever imagined.

Craig

---------------------------(end of broadcast)---------------------------
TIP 5: don't forget to increase your free space map settings

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #22 (permalink)  
Old 04-19-2008, 09:25 AM
Craig A. James
 
Posts: n/a
Default Re: Performance of count(*)

Tino Wildenhain wrote:
> Craig A. James schrieb:
> ...
>> In our case (for a variety of reasons, but this one is critical), we
>> actually can't use Postgres indexing at all -- we wrote an entirely
>> separate indexing system for our data...

>
> ...There is no need to store or
> maintain this information along with postgres when you can store
> and maintain it directly in postgres as well.


Whether we store our data inside or outside Postgres misses the point (in fact, most of our data is stored IN Postgres). It's the code that actually performs the index operation that has to be external to Postgres.

> On top of that, postgres has a very flexible and extensible index
> system.


You guys can correct me if I'm wrong, but the key feature that's missing from Postgres's flexible indexing is the ability to maintain state across queries. Something like this:

select a, b, my_index_state() from foo where ...
offset 100 limit 10 using my_index(prev_my_index_state);

The my_index_state() function would issue something like a "cookie", an opaque text or binary object that would record information about how it got from row 1 through row 99. When you issue the query above, it could start looking for row 100 WITHOUT reexamining rows 1-99.

This could be tricky in a OLTP environment, where the "cookie" could be invalidated by changes to the database. But in warehouse read-mostly or read-only environments, it could yield vastly improved performance for database web applications.

If I'm not mistaken, Postgres (nor Oracle, MySQL or other RDBMS) can't do this. I would love to be corrected.

The problem is that relational databases were invented before the web and its stateless applications. In the "good old days", you could connect to a database and work for hours, and in that environment cursors and such work well -- the RDBMS maintains the internal state of the indexing system. But in a web environment, state information is very difficult to maintain. There are all sorts of systems that try (Enterprise Java Beans, for example), but they're very complex.

Craig


---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
choose an index scan if your joining column's datatypes do not
match

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #23 (permalink)  
Old 04-19-2008, 09:25 AM
Tino Wildenhain
 
Posts: n/a
Default Re: Performance of count(*)

Craig A. James schrieb:
> Tino Wildenhain wrote:
>> Craig A. James schrieb:
>> ...
>>> In our case (for a variety of reasons, but this one is critical), we
>>> actually can't use Postgres indexing at all -- we wrote an entirely
>>> separate indexing system for our data...

>>
>> ...There is no need to store or
>> maintain this information along with postgres when you can store
>> and maintain it directly in postgres as well.

>
> Whether we store our data inside or outside Postgres misses the point
> (in fact, most of our data is stored IN Postgres). It's the code that
> actually performs the index operation that has to be external to Postgres.
>
>> On top of that, postgres has a very flexible and extensible index
>> system.

>
> You guys can correct me if I'm wrong, but the key feature that's missing
> from Postgres's flexible indexing is the ability to maintain state
> across queries. Something like this:
>
> select a, b, my_index_state() from foo where ...
> offset 100 limit 10 using my_index(prev_my_index_state);
>


Yes, you are wrong :-) The technique is called "CURSOR"
if you maintain persistent connection per session
(e.g. stand allone application or clever pooling webapplication)

If its a naive web application you just store your session
in tables where you can easily maintain the scroll state
as well.

Regards
Tino

---------------------------(end of broadcast)---------------------------
TIP 3: Have you checked our extensive FAQ?

http://www.postgresql.org/docs/faq

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #24 (permalink)  
Old 04-19-2008, 09:25 AM
Steve Atkins
 
Posts: n/a
Default Re: Performance of count(*)


On Mar 22, 2007, at 10:21 AM, Craig A. James wrote:

> Tino Wildenhain wrote:
>> Craig A. James schrieb:
>> ...
>>> In our case (for a variety of reasons, but this one is critical),
>>> we actually can't use Postgres indexing at all -- we wrote an
>>> entirely separate indexing system for our data...

>>
>> ...There is no need to store or
>> maintain this information along with postgres when you can store
>> and maintain it directly in postgres as well.

>
> Whether we store our data inside or outside Postgres misses the
> point (in fact, most of our data is stored IN Postgres). It's the
> code that actually performs the index operation that has to be
> external to Postgres.
>
>> On top of that, postgres has a very flexible and extensible index
>> system.

>
> You guys can correct me if I'm wrong, but the key feature that's
> missing from Postgres's flexible indexing is the ability to
> maintain state across queries. Something like this:
>
> select a, b, my_index_state() from foo where ...
> offset 100 limit 10 using my_index(prev_my_index_state);
>
> The my_index_state() function would issue something like a
> "cookie", an opaque text or binary object that would record
> information about how it got from row 1 through row 99. When you
> issue the query above, it could start looking for row 100 WITHOUT
> reexamining rows 1-99.
>
> This could be tricky in a OLTP environment, where the "cookie"
> could be invalidated by changes to the database. But in warehouse
> read-mostly or read-only environments, it could yield vastly
> improved performance for database web applications.
>
> If I'm not mistaken, Postgres (nor Oracle, MySQL or other RDBMS)
> can't do this. I would love to be corrected.


As long as you're ordering by some row in the table then you can do
that in
straight SQL.

select a, b, ts from foo where (stuff) and foo > X order by foo limit 10

Then, record the last value of foo you read, and plug it in as X the
next
time around.

This has the advantage over a simple offset approach of actually
displaying all the data as a user pages through it too. (Consider
the case where the user is viewing offsets 91-100, and you delete
the record at offset 15. The user goes to the next page and will
miss the record that used to be at offset 101 and is now at offset
100).

> The problem is that relational databases were invented before the
> web and its stateless applications. In the "good old days", you
> could connect to a database and work for hours, and in that
> environment cursors and such work well -- the RDBMS maintains the
> internal state of the indexing system. But in a web environment,
> state information is very difficult to maintain. There are all
> sorts of systems that try (Enterprise Java Beans, for example), but
> they're very complex.


I think the problem is more that most web developers aren't very good
at using the database, and tend to fall back on simplistic, wrong,
approaches
to displaying the data. There's a lot of monkey-see, monkey-do in web
UI design too, which doesn't help.

Cheers,
Steve


---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #25 (permalink)  
Old 04-19-2008, 09:25 AM
Craig A. James
 
Posts: n/a
Default Re: Performance of count(*)

Steve Atkins wrote:
> As long as you're ordering by some row in the table then you can do that in
> straight SQL.
>
> select a, b, ts from foo where (stuff) and foo > X order by foo limit 10
>
> Then, record the last value of foo you read, and plug it in as X the next
> time around.


We've been over this before in this forum: It doesn't work as advertised. Look for postings by me regarding the fact that there is no way to tell the optimizer the cost of executing a function. There's one, for example, on Oct 18, 2006.

> I think the problem is more that most web developers aren't very good
> at using the database, and tend to fall back on simplistic, wrong,
> approaches
> to displaying the data. There's a lot of monkey-see, monkey-do in web
> UI design too, which doesn't help.


Thanks, I'm sure your thoughtful comments will help me solve my problem. Somehow. ;-)

Craig

---------------------------(end of broadcast)---------------------------
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #26 (permalink)  
Old 04-19-2008, 09:25 AM
Merlin Moncure
 
Posts: n/a
Default Re: Performance of count(*)

On 3/22/07, Michael Stone <mstone+postgres@mathom.us> wrote:
> On Thu, Mar 22, 2007 at 06:27:32PM +0100, Tino Wildenhain wrote:
> >Craig A. James schrieb:
> >>You guys can correct me if I'm wrong, but the key feature that's missing
> >>from Postgres's flexible indexing is the ability to maintain state
> >>across queries. Something like this:
> >>
> >> select a, b, my_index_state() from foo where ...
> >> offset 100 limit 10 using my_index(prev_my_index_state);
> >>

> >
> >Yes, you are wrong :-) The technique is called "CURSOR"
> >if you maintain persistent connection per session
> >(e.g. stand allone application or clever pooling webapplication)

>
> Did you read the email before correcting it? From the part you trimmed
> out:
>
> >The problem is that relational databases were invented before the web
> >and its stateless applications. In the "good old days", you could
> >connect to a database and work for hours, and in that environment
> >cursors and such work well -- the RDBMS maintains the internal state of
> >the indexing system. But in a web environment, state information is
> >very difficult to maintain. There are all sorts of systems that try
> >(Enterprise Java Beans, for example), but they're very complex.

>
> It sounds like they wrote their own middleware to handle the problem,
> which is basically what you suggested (a "clever pooling web
> application") after saying "wrong".


Tino was saying that rather that build a complete indexing storage
management solution that lives outside the database, it is better to
do intelligent session management so that you get the simplicity if a
two tier client server system but the scalability of a web app.

Web apps are not necessarily stateless, you just have to be a little
clever about how database connections are opened and closed. Then you
get all the database stuff that comes along with a persistent
connection (advisory locks, cursors, prepared statements, etc) without
building all kinds of data management into the middleware.

merlin

---------------------------(end of broadcast)---------------------------
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #27 (permalink)  
Old 04-19-2008, 09:25 AM
Steve Atkins
 
Posts: n/a
Default Re: Performance of count(*)


On Mar 22, 2007, at 11:26 AM, Guido Neitzer wrote:

> On 22.03.2007, at 11:53, Steve Atkins wrote:
>
>> As long as you're ordering by some row in the table then you can
>> do that in
>> straight SQL.
>>
>> select a, b, ts from foo where (stuff) and foo > X order by foo
>> limit 10
>>
>> Then, record the last value of foo you read, and plug it in as X
>> the next
>> time around.

>
> This does only work if you have unique values in foo. You might
> have "batch breaks" inside a list of rows with equal values for foo.


If I don't have unique values in foo, I certainly have unique values
in (foo, pk).

>
> But: a web application that needs state and doesn't maintain it by
> itself (or inside the dev toolkit) is imho broken by design. How
> should the database store a "state" for a web app? It's only
> possible on the web app part, because the app is either stateless
> and so are the queries to the database - they have to be re-
> evaluated for every request as the request might come from totally
> different sources (users, ...) or it is stateful and has to
> maintain the state because only the app developer knows, what
> information is needed for the "current state".
>
> This is why all web application toolkits have a "session" concept.


Yes. HTTP is not very stateful. Web applications are stateful. There
are some really obvious approaches to maintaining state cleanly that
work well with databases and let you do some quite complex stuff
(tying a persistent database connection to a single user, for
instance). But they don't scale at all well.

What Craig was suggesting is, basically, to assign a persistent
database connection to each user. But rather than maintain that
connection as a running process, to serialise all the state out of
the database connection and store that in the webapp, then when the
next action from that user comes in take a database connection and
stuff all that state into it again.

It's a lovely idea, but strikes me as completely infeasible in the
general case. There's just too much state there. Doing it in the
specific case is certainly possible, but rapidly devolves to the
standard approach of "On the first page of results, run the query and
record the first 5000 results. Store those in a scratch table,
indexed by session-id, or in external storage. On displaying later
pages of results to the same user, pull directly from the already
calculated results."

>
>> I think the problem is more that most web developers aren't very good
>> at using the database, and tend to fall back on simplistic, wrong,
>> approaches
>> to displaying the data. There's a lot of monkey-see, monkey-do in web
>> UI design too, which doesn't help.

>
> Sure. That is the other problem ... ;-) But, and I think this is
> much more important: most toolkits today free you from using the
> database directly and writing lots and lots of lines of sql code
> which instantly breaks when you switch the storage backend. It's
> just the thing from where you look at something.


The real problem is the user-interface is designed around what is
easy to implement in elderly cgi scripts, rather than what's
appropriate to the data being displayed or useful to the user.
Displaying tables of results, ten at a time, is just one of the more
egregious examples of that.

Cheers,
Steve


---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #28 (permalink)  
Old 04-19-2008, 09:25 AM
Tom Lane
 
Posts: n/a
Default Re: Performance of count(*)

"Craig A. James" <cjames@modgraph-usa.com> writes:
> Steve Atkins wrote:
>> As long as you're ordering by some row in the table then you can do that in
>> straight SQL.
>>
>> select a, b, ts from foo where (stuff) and foo > X order by foo limit 10
>>
>> Then, record the last value of foo you read, and plug it in as X the next
>> time around.


> We've been over this before in this forum: It doesn't work as advertised. Look for postings by me regarding the fact that there is no way to tell the optimizer the cost of executing a function. There's one, for example, on Oct 18, 2006.


You mean
http://archives.postgresql.org/pgsql...0/msg00283.php
? I don't see anything there that bears on Steve's suggestion.
(The complaint is obsolete as of CVS HEAD anyway.)

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 4: Have you searched our list archives?

http://archives.postgresql.org

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #29 (permalink)  
Old 04-19-2008, 09:25 AM
Michael Stone
 
Posts: n/a
Default Re: Performance of count(*)

On Thu, Mar 22, 2007 at 02:24:39PM -0400, Merlin Moncure wrote:
>Tino was saying that rather that build a complete indexing storage
>management solution that lives outside the database, it is better to
>do intelligent session management so that you get the simplicity if a
>two tier client server system but the scalability of a web app.


No, what he was saying was "there's this thing called a cursor". I
thought there was enough information in the original message to indicate
that the author knew about cursors. There are certainly pros and cons
and nuances to different approaches, but Tino's message didn't touch on
anything that specific.

And even if you do use some kind of "intelligent session management",
how many simultaneous cursors can postgres sanely keep track of?
Thousands? Millions? Tens of Millions? I suspect there's a scalability
limit in there somewhere. Luckily I don't spend much time in the web
application space, so I don't need to know.

Mike Stone

---------------------------(end of broadcast)---------------------------
TIP 5: don't forget to increase your free space map settings

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #30 (permalink)  
Old 04-19-2008, 09:25 AM
Craig A. James
 
Posts: n/a
Default Re: Performance of count(*)

Tom Lane wrote:
> "Craig A. James" <cjames@modgraph-usa.com> writes:
>> Steve Atkins wrote:
>>> As long as you're ordering by some row in the table then you can do that in
>>> straight SQL.
>>>
>>> select a, b, ts from foo where (stuff) and foo > X order by foo limit 10
>>>
>>> Then, record the last value of foo you read, and plug it in as X the next
>>> time around.

>
>> We've been over this before in this forum: It doesn't work as advertised.
>> Look for postings by me regarding the fact that there is no way to tell
>> the optimizer the cost of executing a function. There's one, for example,
>> on Oct 18, 2006.

>
> You mean
> http://archives.postgresql.org/pgsql...0/msg00283.php
> ? I don't see anything there that bears on Steve's suggestion.
> (The complaint is obsolete as of CVS HEAD anyway.)


Mea culpa, it's October 8, not October 18:

http://archives.postgresql.org/pgsql...0/msg00143.php

The relevant part is this:

"My example, discussed previously in this forum, is a classic. I have a VERY expensive function (it's in the class of NP-complete problems, so there is no faster way to do it). There is no circumstance when my function should be used as a filter, and no circumstance when it should be done before a join. But PG has no way of knowing the cost of a function, and so the optimizer assigns the same cost to every function. Big disaster.

"The result? I can't use my function in any WHERE clause that involves any other conditions or joins. Only by itself. PG will occasionally decide to use my function as a filter instead of doing the join or the other WHERE conditions first, and I'm dead.

"The interesting thing is that PG works pretty well for me on big tables -- it does the join first, then applies my expensive functions. But with a SMALL (like 50K rows) table, it applies my function first, then does the join. A search that completes in 1 second on a 5,000,000 row database can take a minute or more on a 50,000 row database."

Craig

---------------------------(end of broadcast)---------------------------
TIP 1: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
Reply


Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are On
Forum Jump


All times are GMT. The time now is 01:49 AM.


Powered by vBulletin® Version 3.6.5
Copyright ©2000 - 2008, Jelsoft Enterprises Ltd.
SEO by vBSEO 3.2.0
www.UnixAdminTalk.com