View Single Post

   
  #1 (permalink)  
Old 04-29-2008, 08:28 PM
Hugo
 
Posts: n/a
Default search an improved "distinct, exists, in" clauses...

Hi folks,

I do have a model where I have some document occurrences of entities
that are hierarchicaly organized.

For instance I may have entities "IBM" and "Sun" which are children of
entity "Company". Then in a document I could have
3 occurrences of the entity "IBM", and 1 of entity "Sun".

The model I use could be simplified to something like:

CREATE TABLE `positionedelement` (
`idPositionedElement` int(11) NOT NULL,
`idDocument` int(11) NOT NULL,
`startPosition` int(11) NOT NULL,
`endPosition` int(11) NOT NULL,
`uidEntity` int(11) NOT NULL,
PRIMARY KEY (`idPositionedElement`)
) ENGINE=InnoDB

CREATE TABLE `entity` (
`uidEntity` int(11) NOT NULL,
`entityName` varchar(255) NOT NULL,
PRIMARY KEY (`uidEntity`),
) ENGINE=InnoDB

CREATE TABLE `entityhierarchy` (
`id` int(11) NOT NULL,
`idParentEntity` int(11) NOT NULL,
`uidEntity` int(11) NOT NULL,
`depth` tinyint(4) NOT NULL,
PRIMARY KEY (`id`),
KEY `entitiesIdx` (`uidEntity`),
KEY `fullHierarchyIdx` (`idParentEntity`,`depth`),
) ENGINE=InnoDB

One of the most used query executed is to find which children of a given
entity could be found into a list of documents.

For instance: which "Company" could be found in those 42 documents.

I do have 3 worst use-cases:

1) lots of children (around 1M) and lots of occurrences (1M)
The query is slow (around 2min), but it's by design, so let it be.

2) lots of children (around 1M) and few occurrences (10)
The best specific query I found is to select distinct children amoung
all occurrences (query A). Very quick indeed (less than a second)

3) few children (10) and lots of occurrences (1M)
The best specific query I found is to select every children for which
exist an occurrences (query B). On average case, it's also very quick
(negative search may be quite slow, though)

The problem is that if I apply query A on case 3, it's very slow (a
couple of minutes), and if I apply query B on case 2, it's also very slow.

Does someone could think of a unique query, in which the optimizer,
according to index cardinality, could handle case 2) are 3) quickly ? Or
does my model need a rethink ?

Suppose I do have only 8 children to "test", there is no way to make
query A stop once the 8 children are successfully return by the distinct
clause ?

For the moment the only solution I see, is to choose in my outer code
(Java), which query to run according to some partial heuristics. This
means bypassing MySQL optimizer, which appears to me as a really bad idea.

Feel free to ask for questions,
Thanks in advance for any help or tips, I'm really stuck on this one...

--
Hugo
Reply With Quote