vBulletin Search Engine Optimization
| |||||||
| Register | FAQ | Members List | Calendar | Search | Today's Posts | Mark Forums Read |
| ||||
| Hi folks, I do have a model where I have some document occurrences of entities that are hierarchicaly organized. For instance I may have entities "IBM" and "Sun" which are children of entity "Company". Then in a document I could have 3 occurrences of the entity "IBM", and 1 of entity "Sun". The model I use could be simplified to something like: CREATE TABLE `positionedelement` ( `idPositionedElement` int(11) NOT NULL, `idDocument` int(11) NOT NULL, `startPosition` int(11) NOT NULL, `endPosition` int(11) NOT NULL, `uidEntity` int(11) NOT NULL, PRIMARY KEY (`idPositionedElement`) ) ENGINE=InnoDB CREATE TABLE `entity` ( `uidEntity` int(11) NOT NULL, `entityName` varchar(255) NOT NULL, PRIMARY KEY (`uidEntity`), ) ENGINE=InnoDB CREATE TABLE `entityhierarchy` ( `id` int(11) NOT NULL, `idParentEntity` int(11) NOT NULL, `uidEntity` int(11) NOT NULL, `depth` tinyint(4) NOT NULL, PRIMARY KEY (`id`), KEY `entitiesIdx` (`uidEntity`), KEY `fullHierarchyIdx` (`idParentEntity`,`depth`), ) ENGINE=InnoDB One of the most used query executed is to find which children of a given entity could be found into a list of documents. For instance: which "Company" could be found in those 42 documents. I do have 3 worst use-cases: 1) lots of children (around 1M) and lots of occurrences (1M) The query is slow (around 2min), but it's by design, so let it be. 2) lots of children (around 1M) and few occurrences (10) The best specific query I found is to select distinct children amoung all occurrences (query A). Very quick indeed (less than a second) 3) few children (10) and lots of occurrences (1M) The best specific query I found is to select every children for which exist an occurrences (query B). On average case, it's also very quick (negative search may be quite slow, though) The problem is that if I apply query A on case 3, it's very slow (a couple of minutes), and if I apply query B on case 2, it's also very slow. Does someone could think of a unique query, in which the optimizer, according to index cardinality, could handle case 2) are 3) quickly ? Or does my model need a rethink ? Suppose I do have only 8 children to "test", there is no way to make query A stop once the 8 children are successfully return by the distinct clause ? For the moment the only solution I see, is to choose in my outer code (Java), which query to run according to some partial heuristics. This means bypassing MySQL optimizer, which appears to me as a really bad idea. Feel free to ask for questions, Thanks in advance for any help or tips, I'm really stuck on this one... -- Hugo |
| |||
| Hugo wrote: > Hi folks, > > I do have a model where I have some document occurrences of entities > that are hierarchicaly organized. > > For instance I may have entities "IBM" and "Sun" which are children of > entity "Company". Then in a document I could have > 3 occurrences of the entity "IBM", and 1 of entity "Sun". > > The model I use could be simplified to something like: > > CREATE TABLE `positionedelement` ( > `idPositionedElement` int(11) NOT NULL, > `idDocument` int(11) NOT NULL, > `startPosition` int(11) NOT NULL, > `endPosition` int(11) NOT NULL, > `uidEntity` int(11) NOT NULL, > PRIMARY KEY (`idPositionedElement`) > ) ENGINE=InnoDB > > CREATE TABLE `entity` ( > `uidEntity` int(11) NOT NULL, > `entityName` varchar(255) NOT NULL, > PRIMARY KEY (`uidEntity`), > ) ENGINE=InnoDB > > CREATE TABLE `entityhierarchy` ( > `id` int(11) NOT NULL, > `idParentEntity` int(11) NOT NULL, > `uidEntity` int(11) NOT NULL, > `depth` tinyint(4) NOT NULL, > PRIMARY KEY (`id`), > KEY `entitiesIdx` (`uidEntity`), > KEY `fullHierarchyIdx` (`idParentEntity`,`depth`), > ) ENGINE=InnoDB > > One of the most used query executed is to find which children of a given > entity could be found into a list of documents. > > For instance: which "Company" could be found in those 42 documents. > > I do have 3 worst use-cases: > > 1) lots of children (around 1M) and lots of occurrences (1M) > The query is slow (around 2min), but it's by design, so let it be. > > 2) lots of children (around 1M) and few occurrences (10) > The best specific query I found is to select distinct children amoung > all occurrences (query A). Very quick indeed (less than a second) > > 3) few children (10) and lots of occurrences (1M) > The best specific query I found is to select every children for which > exist an occurrences (query B). On average case, it's also very quick > (negative search may be quite slow, though) > > The problem is that if I apply query A on case 3, it's very slow (a > couple of minutes), and if I apply query B on case 2, it's also very slow. > > Does someone could think of a unique query, in which the optimizer, > according to index cardinality, could handle case 2) are 3) quickly ? Or > does my model need a rethink ? > > Suppose I do have only 8 children to "test", there is no way to make > query A stop once the 8 children are successfully return by the distinct > clause ? > > For the moment the only solution I see, is to choose in my outer code > (Java), which query to run according to some partial heuristics. This > means bypassing MySQL optimizer, which appears to me as a really bad idea. > > Feel free to ask for questions, > Thanks in advance for any help or tips, I'm really stuck on this one... > I'm a bit confused about exactly what you're trying to do. What are the SELECT statements you're using now, and what do you get if you EXPLAIN those statements? -- ================== Remove the "x" from my email address Jerry Stuckle JDS Computer Training Corp. jstucklex@attglobal.net ================== |
| |||
| Jerry Stuckle wrote : > I'm a bit confused about exactly what you're trying to do. I not an English native speaker, adding confusions in my explanations about a non-trivial problem. If you give me a chance, I'll try to summarize it differently : I need to run a query that have 3 worst use cases, depending on the input (X and Y). X and Y values heavily determined the number of lines to read: from few (1,2 lines) to a lot (1 million of lines). worst use case 1 (called 'WUC1') : X is high (1.000.000) Y is low (10) worst use case 2 (called 'WUC2') : X is low (10) Y is high (1.000.000) worst use case 3 (called 'WUC3') : X is high (1.000.000) Y is high (1.000.000) For WUC1, I found a query which is really fast (QueryA, using distinct). But it's very slow on WUC2 (which does not surprise me). For WUC2, I found a query which is really fast (QueryB, using exist). But it's very slow on WUC1 (which does not surprise me neither). For the moment, I have to figure out (in my outer Java code) if I'm in WUC1 and then trigger the QueryA, or in WUC2 and then trigger the QueryB. But I want to let the MySQL optimizer doing that job for me ! My question is: Does someone could think of a query construction that could be quick in both WUC1 and WUC2 ? (For WUC3, the query is inherently slow, due to cross join between 1M and 1M lines, and I'm not looking for an improvement) > What are the > SELECT statements you're using now, and what do you get if you EXPLAIN > those statements? Here we go Query A (get distinct fathers of occurrences in a given document list that are direct children of an entity): select distinct eh2.idParentEntity from entity e join entityhierarchy eh on eh.uidEntity=e.uidEntity join entityhierarchy eh2 on eh2.uidEntity=e.uidEntity join positionedElement pe on pe.uidEntity = e.uidEntity where eh.idParentEntity=X and eh2.depth=eh.depth-1 and pe.idKnowledgeSet < Y Query B (from every direct children of an entity, select the one that have occurrences in a given document list) select e.uidEntity from entity e join entityhierarchy eh on eh.uidEntity=e.uidEntity where eh.idParentEntity=X and depth=1 and exists (select 1 from entity e join entityhierarchy eh2 on eh2.uidEntity=e.uidEntity join entity father on eh2.idParentEntity = father.uidEntity join positionedElement pe on pe.uidEntity = e.uidEntity where eh.uidEntity=eh2.idParentEntity and pe.idKnowledgeSet < Y) Now the explain statement (80 cols...) *[CASE 1]* | Query A in WUC1 | X=lots of children (1.3M) | Y=few documents (10) | average execution time: 80 ms *************************** 1. row *************************** id: 1 select_type: SIMPLE table: pe type: range possible_keys: pEntityIdx,ksIdx key: ksIdx key_len: 4 ref: NULL rows: 155 Extra: Using where; Using temporary *************************** 2. row *************************** id: 1 select_type: SIMPLE table: eh type: ref possible_keys: entitiesIdx,fullHierarchyIdx,hIdx key: entitiesIdx key_len: 4 ref: tcell.pe.uidEntity rows: 1 Extra: Using where *************************** 3. row *************************** id: 1 select_type: SIMPLE table: e type: eq_ref possible_keys: PRIMARY key: PRIMARY key_len: 4 ref: tcell.eh.uidEntity rows: 1 Extra: Using where; Using index *************************** 4. row *************************** id: 1 select_type: SIMPLE table: eh2 type: ref possible_keys: entitiesIdx key: entitiesIdx key_len: 4 ref: tcell.pe.uidEntity rows: 1 Extra: Using where ************************************************** ************ *[CASE 2]* | Query A, WUC2 | X=few children (8) | Y=lots of documents (150.000) | average execution time: 2 min *************************** 1. row *************************** id: 1 select_type: SIMPLE table: pe type: range possible_keys: pEntityIdx,ksIdx key: ksIdx key_len: 4 ref: NULL rows: 2511168 Extra: Using where; Using temporary *************************** 2. row *************************** id: 1 select_type: SIMPLE table: eh type: ref possible_keys: entitiesIdx,fullHierarchyIdx,hIdx key: entitiesIdx key_len: 4 ref: tcell.pe.uidEntity rows: 1 Extra: Using where *************************** 3. row *************************** id: 1 select_type: SIMPLE table: e type: eq_ref possible_keys: PRIMARY key: PRIMARY key_len: 4 ref: tcell.eh.uidEntity rows: 1 Extra: Using where; Using index *************************** 4. row *************************** id: 1 select_type: SIMPLE table: eh2 type: ref possible_keys: entitiesIdx key: entitiesIdx key_len: 4 ref: tcell.pe.uidEntity rows: 1 Extra: Using where ************************************************** ************ *[CASE 3]* | Query B, WUC1 | X=lots children (1.3M) | Y=few documents (10) | average execution time: 2 min *************************** 1. row *************************** id: 1 select_type: PRIMARY table: eh type: ref possible_keys: entitiesIdx,fullHierarchyIdx,hIdx key: hIdx key_len: 5 ref: const,const rows: 340292 Extra: Using where; Using index *************************** 2. row *************************** id: 1 select_type: PRIMARY table: e type: eq_ref possible_keys: PRIMARY key: PRIMARY key_len: 4 ref: tcell.eh.uidEntity rows: 1 Extra: Using index *************************** 3. row *************************** id: 2 select_type: DEPENDENT SUBQUERY table: father type: eq_ref possible_keys: PRIMARY key: PRIMARY key_len: 4 ref: tcell.eh.uidEntity rows: 1 Extra: Using index *************************** 4. row *************************** id: 2 select_type: DEPENDENT SUBQUERY table: eh2 type: ref possible_keys: entitiesIdx,fullHierarchyIdx,hIdx key: fullHierarchyIdx key_len: 4 ref: tcell.eh.uidEntity rows: 1 Extra: Using where *************************** 5. row *************************** id: 2 select_type: DEPENDENT SUBQUERY table: pe type: ref possible_keys: pEntityIdx,ksIdx key: pEntityIdx key_len: 4 ref: tcell.eh2.uidEntity rows: 20 Extra: Using where; Using index *************************** 6. row *************************** id: 2 select_type: DEPENDENT SUBQUERY table: e type: eq_ref possible_keys: PRIMARY key: PRIMARY key_len: 4 ref: tcell.eh2.uidEntity rows: 1 Extra: Using index ************************************************** ************ *[CASE 4]* | Query B, WUC2 | X=few children (8) | Y=lots of documents (150.000) | average execution time: 80 ms *************************** 1. row *************************** id: 1 select_type: PRIMARY table: eh type: ref possible_keys: entitiesIdx,fullHierarchyIdx,hIdx key: hIdx key_len: 5 ref: const,const rows: 8 Extra: Using where; Using index *************************** 2. row *************************** id: 1 select_type: PRIMARY table: e type: eq_ref possible_keys: PRIMARY key: PRIMARY key_len: 4 ref: tcell.eh.uidEntity rows: 1 Extra: Using index *************************** 3. row *************************** id: 2 select_type: DEPENDENT SUBQUERY table: father type: eq_ref possible_keys: PRIMARY key: PRIMARY key_len: 4 ref: tcell.eh.uidEntity rows: 1 Extra: Using index *************************** 4. row *************************** id: 2 select_type: DEPENDENT SUBQUERY table: eh2 type: ref possible_keys: entitiesIdx,fullHierarchyIdx,hIdx key: fullHierarchyIdx key_len: 4 ref: tcell.eh.uidEntity rows: 1 Extra: Using where *************************** 5. row *************************** id: 2 select_type: DEPENDENT SUBQUERY table: pe type: ref possible_keys: pEntityIdx,ksIdx key: pEntityIdx key_len: 4 ref: tcell.eh2.uidEntity rows: 20 Extra: Using where; Using index *************************** 6. row *************************** id: 2 select_type: DEPENDENT SUBQUERY table: e type: eq_ref possible_keys: PRIMARY key: PRIMARY key_len: 4 ref: tcell.eh2.uidEntity rows: 1 Extra: Using index ************************************************** ************ Thanks for any in-depth look at this problem. It really is my last-ditch effort on that one. I reach the limit of my knowledge about MySQL and DB design :/ -- Hugo |
| |||
| Jerry, Just to let you know I can't email you directly (using gmail smtp): > Delivery to the following recipient failed permanently: > > jstucklex@attglobal.net > > Technical details of permanent failure: PERM_FAILURE: Gmail tried to > deliver your message, but it was rejected by the recipient domain. > The error that the other server returned was: 551 551 not our > customer. We recommend contacting the other email provider for > further information about the cause of this error. Thanks for your > continued support. (state 14) -- Hugo |
| |||
| On Wed, 30 Apr 2008 12:52:09 +0200, Hugo <hugo@nospam.invalid> wrote: > Jerry, > > Just to let you know I can't email you directly (using gmail smtp): > >> Delivery to the following recipient failed permanently: >> >> jstucklex@attglobal.net >> Sig: "Remove the "x" from my email address" -- Rik Wasmus |
| ||||
| Hugo wrote: > Jerry Stuckle wrote : > >> I'm a bit confused about exactly what you're trying to do. > > I not an English native speaker, adding confusions in my explanations > about a non-trivial problem. > > If you give me a chance, I'll try to summarize it differently : > > I need to run a query that have 3 worst use cases, depending on the > input (X and Y). X and Y values heavily determined the number of lines > to read: from few (1,2 lines) to a lot (1 million of lines). > > worst use case 1 (called 'WUC1') : > X is high (1.000.000) > Y is low (10) > > worst use case 2 (called 'WUC2') : > X is low (10) > Y is high (1.000.000) > > worst use case 3 (called 'WUC3') : > X is high (1.000.000) > Y is high (1.000.000) > > > For WUC1, I found a query which is really fast (QueryA, using distinct). > But it's very slow on WUC2 (which does not surprise me). > > For WUC2, I found a query which is really fast (QueryB, using exist). > But it's very slow on WUC1 (which does not surprise me neither). > > For the moment, I have to figure out (in my outer Java code) if I'm in > WUC1 and then trigger the QueryA, or in WUC2 and then trigger the > QueryB. But I want to let the MySQL optimizer doing that job for me ! > > My question is: Does someone could think of a query construction that > could be quick in both WUC1 and WUC2 ? > > (For WUC3, the query is inherently slow, due to cross join between 1M > and 1M lines, and I'm not looking for an improvement) > >> What are the >> SELECT statements you're using now, and what do you get if you EXPLAIN >> those statements? > > Here we go > > Query A (get distinct fathers of occurrences in a given document list > that are direct children of an entity): > > select distinct eh2.idParentEntity > from entity e > join entityhierarchy eh on eh.uidEntity=e.uidEntity > join entityhierarchy eh2 on eh2.uidEntity=e.uidEntity > join positionedElement pe on pe.uidEntity = e.uidEntity > where eh.idParentEntity=X > and eh2.depth=eh.depth-1 > and pe.idKnowledgeSet < Y > > Query B (from every direct children of an entity, select the one that > have occurrences in a given document list) > > select e.uidEntity > from entity e > join entityhierarchy eh on eh.uidEntity=e.uidEntity > where eh.idParentEntity=X and depth=1 > and exists > (select 1 from entity e > join entityhierarchy eh2 on eh2.uidEntity=e.uidEntity > join entity father on eh2.idParentEntity = father.uidEntity > join positionedElement pe on pe.uidEntity = e.uidEntity > where eh.uidEntity=eh2.idParentEntity > and pe.idKnowledgeSet < Y) > > Now the explain statement (80 cols...) > > *[CASE 1]* > | Query A in WUC1 > | X=lots of children (1.3M) > | Y=few documents (10) > | average execution time: 80 ms > > *************************** 1. row *************************** > id: 1 > select_type: SIMPLE > table: pe > type: range > possible_keys: pEntityIdx,ksIdx > key: ksIdx > key_len: 4 > ref: NULL > rows: 155 > Extra: Using where; Using temporary > *************************** 2. row *************************** > id: 1 > select_type: SIMPLE > table: eh > type: ref > possible_keys: entitiesIdx,fullHierarchyIdx,hIdx > key: entitiesIdx > key_len: 4 > ref: tcell.pe.uidEntity > rows: 1 > Extra: Using where > *************************** 3. row *************************** > id: 1 > select_type: SIMPLE > table: e > type: eq_ref > possible_keys: PRIMARY > key: PRIMARY > key_len: 4 > ref: tcell.eh.uidEntity > rows: 1 > Extra: Using where; Using index > *************************** 4. row *************************** > id: 1 > select_type: SIMPLE > table: eh2 > type: ref > possible_keys: entitiesIdx > key: entitiesIdx > key_len: 4 > ref: tcell.pe.uidEntity > rows: 1 > Extra: Using where > ************************************************** ************ > > *[CASE 2]* > | Query A, WUC2 > | X=few children (8) > | Y=lots of documents (150.000) > | average execution time: 2 min > > *************************** 1. row *************************** > id: 1 > select_type: SIMPLE > table: pe > type: range > possible_keys: pEntityIdx,ksIdx > key: ksIdx > key_len: 4 > ref: NULL > rows: 2511168 > Extra: Using where; Using temporary > *************************** 2. row *************************** > id: 1 > select_type: SIMPLE > table: eh > type: ref > possible_keys: entitiesIdx,fullHierarchyIdx,hIdx > key: entitiesIdx > key_len: 4 > ref: tcell.pe.uidEntity > rows: 1 > Extra: Using where > *************************** 3. row *************************** > id: 1 > select_type: SIMPLE > table: e > type: eq_ref > possible_keys: PRIMARY > key: PRIMARY > key_len: 4 > ref: tcell.eh.uidEntity > rows: 1 > Extra: Using where; Using index > *************************** 4. row *************************** > id: 1 > select_type: SIMPLE > table: eh2 > type: ref > possible_keys: entitiesIdx > key: entitiesIdx > key_len: 4 > ref: tcell.pe.uidEntity > rows: 1 > Extra: Using where > ************************************************** ************ > > *[CASE 3]* > | Query B, WUC1 > | X=lots children (1.3M) > | Y=few documents (10) > | average execution time: 2 min > > *************************** 1. row *************************** > id: 1 > select_type: PRIMARY > table: eh > type: ref > possible_keys: entitiesIdx,fullHierarchyIdx,hIdx > key: hIdx > key_len: 5 > ref: const,const > rows: 340292 > Extra: Using where; Using index > *************************** 2. row *************************** > id: 1 > select_type: PRIMARY > table: e > type: eq_ref > possible_keys: PRIMARY > key: PRIMARY > key_len: 4 > ref: tcell.eh.uidEntity > rows: 1 > Extra: Using index > *************************** 3. row *************************** > id: 2 > select_type: DEPENDENT SUBQUERY > table: father > type: eq_ref > possible_keys: PRIMARY > key: PRIMARY > key_len: 4 > ref: tcell.eh.uidEntity > rows: 1 > Extra: Using index > *************************** 4. row *************************** > id: 2 > select_type: DEPENDENT SUBQUERY > table: eh2 > type: ref > possible_keys: entitiesIdx,fullHierarchyIdx,hIdx > key: fullHierarchyIdx > key_len: 4 > ref: tcell.eh.uidEntity > rows: 1 > Extra: Using where > *************************** 5. row *************************** > id: 2 > select_type: DEPENDENT SUBQUERY > table: pe > type: ref > possible_keys: pEntityIdx,ksIdx > key: pEntityIdx > key_len: 4 > ref: tcell.eh2.uidEntity > rows: 20 > Extra: Using where; Using index > *************************** 6. row *************************** > id: 2 > select_type: DEPENDENT SUBQUERY > table: e > type: eq_ref > possible_keys: PRIMARY > key: PRIMARY > key_len: 4 > ref: tcell.eh2.uidEntity > rows: 1 > Extra: Using index > ************************************************** ************ > > > *[CASE 4]* > | Query B, WUC2 > | X=few children (8) > | Y=lots of documents (150.000) > | average execution time: 80 ms > > *************************** 1. row *************************** > id: 1 > select_type: PRIMARY > table: eh > type: ref > possible_keys: entitiesIdx,fullHierarchyIdx,hIdx > key: hIdx > key_len: 5 > ref: const,const > rows: 8 > Extra: Using where; Using index > *************************** 2. row *************************** > id: 1 > select_type: PRIMARY > table: e > type: eq_ref > possible_keys: PRIMARY > key: PRIMARY > key_len: 4 > ref: tcell.eh.uidEntity > rows: 1 > Extra: Using index > *************************** 3. row *************************** > id: 2 > select_type: DEPENDENT SUBQUERY > table: father > type: eq_ref > possible_keys: PRIMARY > key: PRIMARY > key_len: 4 > ref: tcell.eh.uidEntity > rows: 1 > Extra: Using index > *************************** 4. row *************************** > id: 2 > select_type: DEPENDENT SUBQUERY > table: eh2 > type: ref > possible_keys: entitiesIdx,fullHierarchyIdx,hIdx > key: fullHierarchyIdx > key_len: 4 > ref: tcell.eh.uidEntity > rows: 1 > Extra: Using where > *************************** 5. row *************************** > id: 2 > select_type: DEPENDENT SUBQUERY > table: pe > type: ref > possible_keys: pEntityIdx,ksIdx > key: pEntityIdx > key_len: 4 > ref: tcell.eh2.uidEntity > rows: 20 > Extra: Using where; Using index > *************************** 6. row *************************** > id: 2 > select_type: DEPENDENT SUBQUERY > table: e > type: eq_ref > possible_keys: PRIMARY > key: PRIMARY > key_len: 4 > ref: tcell.eh2.uidEntity > rows: 1 > Extra: Using index > ************************************************** ************ > > Thanks for any in-depth look at this problem. It really is my last-ditch > effort on that one. I reach the limit of my knowledge about MySQL and DB > design :/ > Ok, a couple of things. First of all, you're using temporary tables which need to hold lots of rows. What's your tmp_table_size? If it's too small, MySQL will have to use disk, which takes a lot longer. I'm surprised Query 2 is so fast in some cases - typically MySQL can't optimize subqueries as well as JOINs. What happens if you rewrite this to a JOIN? Also, what do you have for indexes on the tables? -- ================== Remove the "x" from my email address Jerry Stuckle JDS Computer Training Corp. jstucklex@attglobal.net ================== |