Unix Technical Forum

Re: count(*) performance improvement ideas

This is a discussion on Re: count(*) performance improvement ideas within the pgsql Hackers forums, part of the PostgreSQL category; --> >> The whole thing is a bit of an abuse of what the mechanism >> was intended >> for, ...


Go Back   Unix Technical Forum > Database Server Software > PostgreSQL > pgsql Hackers

FAQ Members List Calendar Search Today's Posts Mark Forums Read
  #11 (permalink)  
Old 04-17-2008, 04:11 PM
PFC
 
Posts: n/a
Default Re: count(*) performance improvement ideas


>> The whole thing is a bit of an abuse of what the mechanism
>> was intended
>> for, and so I'm not sure we should rejigger GUC's behavior to make it
>> more pleasant, but on the other hand if we're not ready to provide a
>> better substitute ...

>
> In my experiments with materialized views, I identified these problems
> as "minor" difficulties. Resolving them would allow further abuse


Let's try this quick & dirty implementation of a local count-delta cache
using a local in-memory hashtable (ie. {}).
Writing the results to stable storage in an ON COMMIT trigger is left as
an exercise to the reader
Performance isn't that bad, calling the trigger takes about 50 us.
Oldskool implementation with a table is at the end, it's about 10x slower.

Example :

INSERT INTO victim1 (key) VALUES ('one'),('two'),('two');
INSERT 0 3
Temps : 1,320 ms
test=# SELECT * FROM get_count();
key | cnt
-----+-----
two | 2
one | 1


CREATE OR REPLACE FUNCTION clear_count( )
RETURNS VOID
AS $$
GD.clear()
$$ LANGUAGE plpythonu;

CREATE OR REPLACE FUNCTION update_count( key TEXT, delta INTEGER )
RETURNS INTEGER
AS $$
if key in GD:
GD[key] += delta
else:
GD[key] = delta
return GD[key]
$$ LANGUAGE plpythonu;

CREATE TYPE count_data AS ( key TEXT, cnt INTEGER );

CREATE OR REPLACE FUNCTION get_count( )
RETURNS SETOF count_data
AS $$
return GD.iteritems()
$$ LANGUAGE plpythonu;


CREATE TABLE victim( id SERIAL PRIMARY KEY, key TEXT NOT NULL );
INSERT INTO victim (key) SELECT (random() * 300)::INTEGER::TEXT FROM
generate_series( 1,100000 );

CREATE TABLE victim1( id SERIAL PRIMARY KEY, key TEXT NOT NULL );

\timing
INSERT INTO victim1 SELECT * FROM victim;
TRUNCATE TABLE victim1;

SELECT clear_count();
INSERT INTO victim1 SELECT * FROM victim RETURNING update_count( key, 1 );
SELECT * FROM get_count();
TRUNCATE TABLE victim1;

CREATE OR REPLACE FUNCTION counter_trigger_f()
RETURNS TRIGGER
LANGUAGE plpgsql
AS
$$
DECLARE
BEGIN
IF TG_OP = 'INSERT' THEN
PERFORM update_count( NEW.key, 1 );
RETURN NEW;
ELSEIF TG_OP = 'UPDATE' THEN
-- update topic
IF NEW.key != OLD.key THEN
PERFORM update_count( OLD.key, -1 ), update_count( NEW.key, 1
);
END IF;
RETURN NEW;
ELSE -- DELETE
PERFORM update_count( OLD.key, -1 );
RETURN OLD;
END IF;
END;
$$;

CREATE TRIGGER count_trigger BEFORE INSERT OR UPDATE OR DELETE ON victim1
FOR EACH ROW EXECUTE PROCEDURE counter_trigger_f();

SELECT clear_count();
INSERT INTO victim1 SELECT * FROM victim;
SELECT * FROM get_count();

SELECT clear_count();
TRUNCATE TABLE victim1;
INSERT INTO victim1 (key) VALUES ('one'),('two'),('two');
SELECT * FROM get_count();
DELETE FROM victim1 WHERE key='two';
SELECT * FROM get_count();
UPDATE victim1 SET key='three' WHERE key='one';
SELECT * FROM get_count();
DELETE FROM victim1;
SELECT * FROM get_count();


CREATE TABLE counts( key TEXT PRIMARY KEY, total INTEGER NOT NULL DEFAULT
0 );

CREATE OR REPLACE FUNCTION table_counter_trigger_f()
RETURNS TRIGGER
LANGUAGE plpgsql
AS
$$
DECLARE
BEGIN
IF TG_OP = 'INSERT' THEN
UPDATE counts SET total=total+1 WHERE key=NEW.key;
IF NOT FOUND THEN INSERT INTO counts (key,total) VALUES
(NEW.key,1); END IF;
RETURN NEW;
ELSEIF TG_OP = 'UPDATE' THEN
-- update topic
IF NEW.key != OLD.key THEN
UPDATE counts SET total=total-1 WHERE key=OLD.key;
UPDATE counts SET total=total+1 WHERE key=NEW.key;
IF NOT FOUND THEN INSERT INTO counts (key,total) VALUES
(NEW.key,1); END IF;
END IF;
RETURN NEW;
ELSE -- DELETE
UPDATE counts SET total=total-1 WHERE key=OLD.key;
RETURN OLD;
END IF;
END;
$$;

CREATE TABLE victim2( id SERIAL PRIMARY KEY, key TEXT NOT NULL );

CREATE TRIGGER table_count_trigger BEFORE INSERT OR UPDATE OR DELETE ON
victim2 FOR EACH ROW EXECUTE PROCEDURE table_counter_trigger_f();
SELECT * FROM counts;
TRUNCATE TABLE victim2;
INSERT INTO victim2 SELECT * FROM victim;



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #12 (permalink)  
Old 04-17-2008, 04:11 PM
Stephen Denne
 
Posts: n/a
Default Re: count(*) performance improvement ideas

Tom Lane wrote
> Transaction commit is an exceedingly subtle and carefully structured
> thing. Throwing random user-defined code into it ain't gonna happen.


Deferred constraint triggers currently run random user-defined code. This'll do me.

Regards,
Stephen Denne.

Disclaimer:
At the Datamail Group we value team commitment, respect, achievement, customer focus, and courage. This email with any attachments is confidential andmay be subject to legal privilege. If it is not intended for you please advise by reply immediately, destroy it and do not copy, disclose or use it in any way.
__________________________________________________ ________________
This email has been scanned by the DMZGlobal Business Quality
Electronic Messaging Suite.
Please see http://www.dmzglobal.com/dmzmessaging.htm for details.
__________________________________________________ ________________



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #13 (permalink)  
Old 04-17-2008, 04:11 PM
PFC
 
Posts: n/a
Default Re: count(*) performance improvement ideas

On Thu, 17 Apr 2008 02:48:37 +0200, Stephen Denne
<Stephen.Denne@datamail.co.nz> wrote:

> PFC wrote:
>> Let's try this quick & dirty implementation of a local
>> count-delta cache
>> using a local in-memory hashtable (ie. {}).

>
>> CREATE OR REPLACE FUNCTION update_count( key TEXT, delta INTEGER )
>> RETURNS INTEGER
>> AS $$
>> if key in GD:
>> GD[key] += delta
>> else:
>> GD[key] = delta
>> return GD[key]
>> $$ LANGUAGE plpythonu;

>
> Thanks for the code, this seems to be very much what I was looking for.
>
> I don't know plpythonu (nor python), just read a few docs now:


Learn Python, it is a really useful language

> "The global dictionary SD is available to store data between function
> calls. This variable is private static data. The global dictionary GD is
> public data, available to all Python functions within a session. Use
> with care."
>
> Does session == transaction or connection?
> I don't understand the difference between SD and GD, private and public.
> Where are the context boundaries?


There is no sharing between processes, so
- both SD and GD are limited to the current session (connection, postgres
process), no shared memory is involved
- GD is global between all python functions (global)
- SD is specific to each python function (static)

The big gotcha is that these are all non-transactional : if you rollback,
GD and SD stay the same, and when you issue a query, you can assume the
state of SD and GD is random (due to previous queries) unless you
initialize them to a known value.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #14 (permalink)  
Old 04-17-2008, 04:11 PM
Tom Lane
 
Posts: n/a
Default Re: count(*) performance improvement ideas

Joe Conway <mail@joeconway.com> writes:
> Tom Lane wrote:
>> I thought about this in the shower just now, and ISTM that if we want to
>> turn this into an actual feature rather than a kluge, there needs to be
>> some sort of "define variable" command that sets up a custom variable
>> and specifies its type (and whatever other properties seem worth
>> setting). IOW expose the DefineCustomFooVariable functions to SQL users.
>>
>> I'd be a bit inclined to restrict the namespace that can be set up that
>> way, eg allow only "local." or "session." as the prefix. Maybe
>> that's just being too anal, but we could guarantee not to introduce
>> colliding built-in GUCs in future releases, whereas people trying to
>> define variables with any random name would definitely be at risk.


> Would it make sense to have built-in GUCs belong to "pg_catalog." and
> user defined GUCs default to "public."?


[ after a bit of reflection... ] I don't think that we want to tie
GUC names to schemas, especially not schemas that might not be there
(remember public is droppable). The existing scheme for qualified
GUC names considers that the prefix is the name of a loadable module,
which isn't typically tied to any particular schema.

I kinda like "session" as the prefix since it helps remind people that
these things will have session lifespan.

OTOH, there's a possibility for confusion with the SET SESSION syntax
("hm, did you mean SET SESSION foo = ... or SET session.foo = ...?").
"local" has got the same issue. Maybe "temp"?

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #15 (permalink)  
Old 04-18-2008, 08:31 AM
PFC
 
Posts: n/a
Default Re: count(*) performance improvement ideas


> My wife has a snake phobia, besides, I've just started learning Scala.


Just had a look at Scala, it looks nice. Slightly Lispish (like all good
languages)...

> txid_current()
> No... hold on, it is per session, and a session can't have two or more
> transactions active at once can it?


It could be used to detect rollback.

> So the problem is that other functions may be using GD themselves, and
> your own code is at the mercy of the other functions. Conversely you
> shouldn't clear GD, as some other function may be using it.


Exactly.

> So you're better off using a single function for everything, and using
> SD within it?


Since the purpose is to store counts for rows matching a certain criteria
in a set of tables, you could build a hashtable of hashtables, like :

GD[table name][criteria name][criteria value] = count

This would add complexity, about half a line of code. But you'd have to
create lots of plpgsql trigger functions to wrap it.

> There isn't any way of telling whether the function is being called for
> the first time in a transaction. You don't know when to clear it.


The first time in a session, GD will be empty.
Clearing it at the start of a transaction would not be useful (clearing
it at ROLLBACK would).
It is updating the "real" summary table with the contents of this hash
that is the problem, also.

So, basically, if you connect, do one insert, and disconnect, this would
be useless.
But, if you do a zillion inserts, caching the counts deltas in RAM would
be faster.
And if you use persistent connections, you could update the counts in the
real table only every N minutes, for instance, but this would need some
complicity from the backend.

> Regards,
> Stephen Denne.
>
> Disclaimer:
> At the Datamail Group we value team commitment, respect, achievement,
> customer focus, and courage. This email with any attachments is
> confidential and may be subject to legal privilege. If it is not
> intended for you please advise by reply immediately, destroy it and do
> not copy, disclose or use it in any way.
> __________________________________________________ ________________
> This email has been scanned by the DMZGlobal Business Quality
> Electronic Messaging Suite.
> Please see http://www.dmzglobal.com/dmzmessaging.htm for details.
> __________________________________________________ ________________
>
>




--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #16 (permalink)  
Old 06-24-2008, 09:32 PM
Bruce Momjian
 
Posts: n/a
Default Re: count(*) performance improvement ideas


Added to TODO:

* Allow custom variables to appear in pg_settings()


---------------------------------------------------------------------------

Tom Lane wrote:
> Magnus Hagander <magnus@hagander.net> writes:
> > Tom Lane wrote:
> >> Really? [ pokes around ... ] Hm, you're right, because
> >> add_placeholder_variable() sets the GUC_NO_SHOW_ALL flag, and in this
> >> usage it'll never be cleared. I wonder if we should change that.
> >>
> >> The whole thing is a bit of an abuse of what the mechanism was
> >> intended for, and so I'm not sure we should rejigger GUC's behavior
> >> to make it more pleasant, but on the other hand if we're not ready to
> >> provide a better substitute ...

>
> > While I agree with that part, is there any actual *reason* why we
> > shouldn't have the custom variables included in pg_settings?

>
> IIRC, the motivation for doing that was to not expose a completely bogus
> set of attributes for a variable whose defining C-module hadn't been
> loaded yet.
>
> I thought about this in the shower just now, and ISTM that if we want to
> turn this into an actual feature rather than a kluge, there needs to be
> some sort of "define variable" command that sets up a custom variable
> and specifies its type (and whatever other properties seem worth
> setting). IOW expose the DefineCustomFooVariable functions to SQL users.
>
> I'd be a bit inclined to restrict the namespace that can be set up that
> way, eg allow only "local." or "session." as the prefix. Maybe
> that's just being too anal, but we could guarantee not to introduce
> colliding built-in GUCs in future releases, whereas people trying to
> define variables with any random name would definitely be at risk.
>
> Comments?
>
> regards, tom lane
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers


--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
Reply


Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are On
Forum Jump


All times are GMT. The time now is 01:22 AM.


Powered by vBulletin® Version 3.6.5
Copyright ©2000 - 2008, Jelsoft Enterprises Ltd.
SEO by vBSEO 3.2.0
www.UnixAdminTalk.com