This is a discussion on Slow query with MEMORY engine - why? within the MySQL forums, part of the Database Server Software category; --> I'm doing this query repeatedly on a MEMORY database. SELECT domain, requestor_ip_hash AS requestor FROM ratingqueue WHERE rating_state = ...
| |||||||
| FAQ | Members List | Calendar | Search | Today's Posts | Mark Forums Read |
| ||||
| I'm doing this query repeatedly on a MEMORY database. SELECT domain, requestor_ip_hash AS requestor FROM ratingqueue WHERE rating_state = 'queued' AND NOT EXISTS (SELECT * FROM ratingqueue WHERE requestor_ip_hash = requestor AND rating_state != 'queued') ORDER BY request_timestamp LIMIT 1 Here's the table: describe ratingqueue; +-------------------+-------------------------------------+------+-----+---------------------+-------+ | Field | Type | Null | Key | Default | Extra | +-------------------+-------------------------------------+------+-----+---------------------+-------+ | domain | varchar(255) | NO | PRI | | | | requestor_ip_hash | int(11) | YES | MUL | NULL | | | rating_state | enum('queued','starting','running') | NO | MUL | queued | | | server | varchar(63) | YES | | NULL | | | priority | smallint(6) | YES | | 0 | | | update_timestamp | timestamp | NO | | CURRENT_TIMESTAMP | | | request_timestamp | timestamp | NO | MUL | 0000-00-00 00:00:00 | | +-------------------+-------------------------------------+------+-----+---------------------+-------+ It's actually a scheduler; a number of machines are looking for the next work queue item. There's no need for the table to persist over restarts, so it's in the MEMORY engine. The point of doing this in the database is that multiple machines can use the same work queue. This works fine when the table is small, but with about 1000 entries, it seems to choke. Each query takes about 2 seconds. And that's in the MEMORY database. This suggests that it's doing an O(N^2) operation. Here's an EXPLAIN: explain SELECT domain, requestor_ip_hash AS requestor FROM ratingqueue WHERE rating_state = 'queued' AND NOT EXISTS (SELECT * FROM ratingqueue WHERE requestor_ip_hash = requestor AND rating_state != 'queued') ORDER BY request_timestamp LIMIT 1; +----+--------------------+-------------+------+--------------------------------+--------------+---------+-------+------+-----------------------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +----+--------------------+-------------+------+--------------------------------+--------------+---------+-------+------+-----------------------------+ | 1 | PRIMARY | ratingqueue | ref | rating_state | rating_state | 1 | const | 1097 | Using where; Using filesort | | 2 | DEPENDENT SUBQUERY | ratingqueue | ALL | requestor_ip_hash,rating_state | NULL | NULL | NULL | 823 | Using where | +----+--------------------+-------------+------+--------------------------------+--------------+---------+-------+------+-----------------------------+ All the fields in the subquery are keys. What's the problem? There must be a more efficient way to express this query. Suggestions? John Nagle |
| |||
| On 4 Jun, 06:05, John Nagle <na...@animats.com> wrote: > I'm doing this query repeatedly on a MEMORY database. > > SELECT domain, requestor_ip_hash AS requestor FROM ratingqueue WHERE > rating_state = 'queued' AND NOT EXISTS > (SELECT * FROM ratingqueue WHERE requestor_ip_hash = requestor AND rating_state > != 'queued') ORDER BY request_timestamp LIMIT 1 > > Here's the table: > > describe ratingqueue; > +-------------------+-------------------------------------+------+-----+---------------------+-------+ > | Field | Type | Null | Key | Default > | Extra | > +-------------------+-------------------------------------+------+-----+---------------------+-------+ > | domain | varchar(255) | NO | PRI | > | | > | requestor_ip_hash | int(11) | YES | MUL | NULL > | | > | rating_state | enum('queued','starting','running') | NO | MUL | queued > | | > | server | varchar(63) | YES | | NULL > | | > | priority | smallint(6) | YES | | 0 > | | > | update_timestamp | timestamp | NO | | > CURRENT_TIMESTAMP | | > | request_timestamp | timestamp | NO | MUL | > 0000-00-00 00:00:00 | | > +-------------------+-------------------------------------+------+-----+---------------------+-------+ > > It's actually a scheduler; a number of machines are looking for the next > work queue item. There's no need for the table to persist over restarts, > so it's in the MEMORY engine. The point of doing this in the database > is that multiple machines can use the same work queue. > > This works fine when the table is small, but with > about 1000 entries, it seems to choke. Each query takes about 2 seconds. > And that's in the MEMORY database. This suggests that it's doing an O(N^2) > operation. > > Here's an EXPLAIN: > > explain SELECT domain, requestor_ip_hash AS requestor FROM ratingqueue WHERE > rating_state = 'queued' AND NOT EXISTS > (SELECT * FROM ratingqueue WHERE requestor_ip_hash = requestor AND rating_state > != 'queued') ORDER BY request_timestamp LIMIT 1; > +----+--------------------+-------------+------+--------------------------------+--------------+---------+-------+------+-----------------------------+ > | id | select_type | table | type | possible_keys > | key | key_len | ref | rows | Extra | > +----+--------------------+-------------+------+--------------------------------+--------------+---------+-------+------+-----------------------------+ > | 1 | PRIMARY | ratingqueue | ref | rating_state > | rating_state | 1 | const | 1097 | Using where; Using filesort | > | 2 | DEPENDENT SUBQUERY | ratingqueue | ALL | requestor_ip_hash,rating_state > | NULL | NULL | NULL | 823 | Using where | > +----+--------------------+-------------+------+--------------------------------+--------------+---------+-------+------+-----------------------------+ > > All the fields in the subquery are keys. What's the problem? > > There must be a more efficient way to express this query. > Suggestions? > > John Nagle Having all the fields as keys may not help that much when you have more than one criteria in a WHERE or JOIN clause. In this case an appropriate composite key is much more useful. Also, I would rewrite this as a JOIN instead of a dependent subquery. |
| |||
| Captain Paralytic wrote: > On 4 Jun, 06:05, John Nagle <na...@animats.com> wrote: >> I'm doing this query repeatedly on a MEMORY database. >> >> SELECT domain, requestor_ip_hash AS requestor FROM ratingqueue WHERE >> rating_state = 'queued' AND NOT EXISTS >> (SELECT * FROM ratingqueue WHERE requestor_ip_hash = requestor AND rating_state >> != 'queued') ORDER BY request_timestamp LIMIT 1 > > Having all the fields as keys may not help that much when you have > more than one criteria in a WHERE or JOIN clause. In this case an > appropriate composite key is much more useful. > Also, I would rewrite this as a JOIN instead of a dependent subquery. How do you express "not exists" in a JOIN? John Nagle |
| |||
| On 5 Jun, 02:25, John Nagle <na...@animats.com> wrote: > Captain Paralytic wrote: > > On 4 Jun, 06:05, John Nagle <na...@animats.com> wrote: > >> I'm doing this query repeatedly on a MEMORY database. > > >> SELECT domain, requestor_ip_hash AS requestor FROM ratingqueue WHERE > >> rating_state = 'queued' AND NOT EXISTS > >> (SELECT * FROM ratingqueue WHERE requestor_ip_hash = requestor AND rating_state > >> != 'queued') ORDER BY request_timestamp LIMIT 1 > > > Having all the fields as keys may not help that much when you have > > more than one criteria in a WHERE or JOIN clause. In this case an > > appropriate composite key is much more useful. > > Also, I would rewrite this as a JOIN instead of a dependent subquery. > > How do you express "not exists" in a JOIN? > > John Nagle I'm not totally sure about what the data rules are in your table, but something this should do it. SELECT r1.domain, r1.requestor_ip_hash FROM ratingqueue r1 LEFT JOIN ratingqueue r2 ON r1.requestor_ip_hash = r2.requestor_ip_hash AND r2.rating_state != 'queued' WHERE r1.rating_state = 'queued' AND r2.requestor_ip_hash IS NULL Note that setting your query out nicely as above before posting makes it much easier for us to offer help. |
| ||||
| Captain Paralytic wrote: > On 5 Jun, 02:25, John Nagle <na...@animats.com> wrote: >> Captain Paralytic wrote: >>> On 4 Jun, 06:05, John Nagle <na...@animats.com> wrote: >>>> I'm doing this query repeatedly on a MEMORY database. >>>> SELECT domain, requestor_ip_hash AS requestor FROM ratingqueue WHERE >>>> rating_state = 'queued' AND NOT EXISTS >>>> (SELECT * FROM ratingqueue WHERE requestor_ip_hash = requestor AND rating_state >>>> != 'queued') ORDER BY request_timestamp LIMIT 1 >>> Having all the fields as keys may not help that much when you have >>> more than one criteria in a WHERE or JOIN clause. In this case an >>> appropriate composite key is much more useful. >>> Also, I would rewrite this as a JOIN instead of a dependent subquery. >> How do you express "not exists" in a JOIN? >> >> John Nagle > > I'm not totally sure about what the data rules are in your table, but > something this should do it. > > SELECT > r1.domain, > r1.requestor_ip_hash > FROM ratingqueue r1 > LEFT JOIN ratingqueue r2 ON r1.requestor_ip_hash = > r2.requestor_ip_hash AND r2.rating_state != 'queued' > WHERE r1.rating_state = 'queued' AND r2.requestor_ip_hash IS NULL That's worse. Tried the query in this form: SELECT r1.domain, r1.requestor_ip_hash FROM ratingqueue r1 LEFT JOIN ratingqueue r2 ON r1.requestor_ip_hash = r2.requestor_ip_hash AND r2.rating_state != 'queued' WHERE r1.rating_state = 'queued' AND r2.requestor_ip_hash IS NULL ORDER BY request_timestamp LIMIT 1; Timing results: (MySQL query browser, idle Windows PC, idle database, MySQL 5 running on same machine as query browser.) Original query: 0 rows fetched in 0.0007s. (4.3616s) Left join query: 0 rows fetched in 0.0165s. (5.0268s) In either case, it's taking 4 to 5 seconds to process this query on a MEMORY table of modest size. Table def is: CREATE TABLE ratingqueue ( domain VARCHAR(255) NOT NULL UNIQUE PRIMARY KEY, requestor_ip_hash INT, rating_state ENUM ('queued','starting','running') NOT NULL DEFAULT 'queued', server VARCHAR(63) NULL, priority SMALLINT DEFAULT 0, update_timestamp TIMESTAMP NOT NULL, request_timestamp TIMESTAMP NOT NULL, INDEX(requestor_ip_hash, rating_state), INDEX(request_timestamp) ) ENGINE=MEMORY; The table contains 1222 entries, 2 of which are 'rating_state="running"' and all of which have the same 'requestor_ip_hash'. That's the problem. The query is to return a row for which the requestor_ip_hash does not match any row that is in "running" state. (It's a fair load balancer.) When no such row exists, the query is incredibly inefficient. There must be some better way to do this. John Nagle |
| Thread Tools | |
| Display Modes | |
|
|