This is a discussion on excessive time spent in "statistics" status within the MySQL General forum forums, part of the MySQL category; --> I have some queries, involving a largish number of JOIN, which are apparently very slow or even take forever ...
| |||||||
| Register | FAQ | Members List | Calendar | Search | Today's Posts | Mark Forums Read |
| ||||
| I have some queries, involving a largish number of JOIN, which are apparently very slow or even take forever (a mysqladmin processlist shows them remain in the "statistics" status for a long time, in most cases I have to kill them after several minutes). When I first had the problem I googled around and found some reference (which I've lost) saying that the "statistics" status is actually what one does with an EXPLAIN SELECT, and that this is done preliminarily to the actual query. It also said it might occur with a large number of joins because this analysis, for n joins MIGHT try up to n! combinations, unless one somehow specified the priorities (but the author did not remember how). I thought to have overcome the problem using a feature of the CREATE VIEW command (see below), but apparently I simply moved it to an higher n. Now I tried to see how it scales with the number of joins, and the curious things is that e.g. for n=9 it works fast, for n=15 it works slowly, for n=18 works fast again and for n=20 takes an infinite time. I'll first explain my background : - I have a number of tables (let's call them t0, t1, t2 ...) - all of them have an auto_increment column called seq which is also an index - one table (t0) is "more important" (actually it is list of celestial X-ray sources while the other are celestial objects in other wavebands but this is irrelevant to you). - I have precomputed correlation tables among t0 and each of the other. These tables are called eg. t0ti, have two columns t0 and ti corresponding to the t0.seq and ti.seq of objects which are "associated". They are indexed on (t0,ti). Note that an object in t0 can be associated with 1 or more or zero (t0ti.ti null) objects in ti. - I originally (already under mysql 3) devised a way to identify counterparts in MORE tables (all these associations are based on spherical distance + other criteria). This involved creating a working table G This table has columns named t0 t1 ... tn (containing the pointers t0.seq t1.seq ... for counterparts associated and validated according to some criteria) plus other service columns The simultaneous access was achieved in our interface by a mechanism we called virtual tables, which essentially was SELECT some subset of columns in some of the t0...tn or some expression thereof FROM G left join t0 on G.t0=t0.seq left join t1 on G.t1=t1.seq ... left join tn on G.tn=tn.seq We refer to the t0...tn as "the member tables" of G. We have different versions of G corresponding to different sets of member tables and different association criteria. The largest of our cases has 26 different members. Our mechanism was such that we defined a subset of columns in each of the ti (or expressions thereof like distances etc.) as "interesting", with an associated alias. Our interface usually showed only such "virtual columns", but had a possibility to add (naming them manually as ti.colname) to the SELECT also all other "member columns" normally hidden. We also allow to correlate a virtual table with a single physical table tk (be it member or not) using the t0tk correlation table (t0 is the "First Member"). - the above worked and still works, but has some clumsiness. When we upgraded to mysql 5 and discovered the CREATE VIEW command we decided to replace our virtual tables with views. - for each G we define a view as create algoritm=temptable view V as SELECT some subset of columns in some of the t0...tn or in G or some expression thereof FROM G left join t0 on G.t0=t0.seq left join t1 on G.t1=t1.seq ... left join tn on G.tn=tn.seq - the "algorithm=temptable" was required because without it some of our queries (see below) entered in the "statistics" status forever already with 11 members - for the rest the VIEWs work nicely when used standalone and are easier for the user ... - ... but on the other hand they HIDE the member columns which are not explicitly named in CREATE VIEW (where one wants to keep a manageable number of columns). Hide means here that their names ti.colname cannot be used in SELECT ! - so we devised an option by which on ticking on "show members also" one can also include these ti.colname in the query de facto this doubles the joins, because the statement built is SELECT list of (V.colname and ti.colname with i chosen among 0 and n) FROM ( G left join t0 on G.t0=t0.seq left join t1 on G.t1=t1.seq ... left join tn on G.tn=tn.seq ) left join V on G.seq=V.seq This statement NOW works (it did not work with e.g. 11 member tables before we switched to ALGORITHM=TEMPTABLE in the CREATE VIEW). An "explain select" for a query on such views gives that a view with n members "with members also enabled" involves 29+n queries (our [working] maximum is n=26 with 55 queries) I have also noticed that such "explain select" are relatively fast (0.3 sec) and are somehow cached (if I repeat one, it takes zero time) - the last step would be to make a query on a VIEW with "show members also" ticked, correlated with another table Tk (be Tk either a member, one of t1...tn, or a non-member) This involves a query like this SELECT list of (V.colname and ti.colname and Tk.colname) FROM ( VTk left join ( (G left join t0 on G.t0=t0.seq left join t1 on G.t1=t1.seq ... left join tn on G.tn=tn.seq ) left join V on G.seq=V.seq ) on VTk.V = V.seq ) left join Tk on VTk.Tk=Tk/seq where the correlation table VTk is also a view CREATE ALGORITHM=TEMPTABLE VIEW VTk as select G.seq as V, G.Tk as Tk from G left join t0Tk on G.t0=t0Tk.t0 which exploits the precomputed correlation among the "important" table t0 (the "First Member") and Tk. Note that t0Tk is indexed on (t0,Tk) but the view apparently is not indexed. - now such a query on our largest table (26 members) hangs in statistics status forever. I experimented therefore trying to use a variable names of left joins (this to tell when it stops working ... in real life I *might* need all, or any of the member). The query for "view with members also correlated with another table" in an "explain select" requires 33+n queries for n members. Now the funny thing is that up to e.g. 9 members "explain select" is rather fast (0.3 sec). For 15 members (48 queries) slows down significantly (6.3 sec) and is not cached (even if repeated soon still takes 6.3 sec). For 20 members it takes forever. But for 18 members (51 queries) instead takes only 0.5 sec. When I say 9,15,18,20 members I mean the first 9,15,18,20 of the full member list. Such a list is in an arbitrary order (sort of historical the various tables entered the database). Also I note that the output of "explain select" presents the various tables in a varying order. For instance the slow 15-member case but also the fast 9-member case have "<derived2>" (which is the 13000-element un-indexed VTk correlation view) interspersed with the other table, the fast 18-member case has it first. Questions : - what does "explain select" actually do and why sometimes hangs ? - can this be overcome rearranging the order of the joins (note that the bulk of the members are all joined with G), or introducing parentheses or with other syntax changes ? - or has it to do with some configuration parameter, maybe related to what is cached, cache size or other ? Thanks in advance to whoever is able to give hints. -- ----------------------------------------------------------------------- Lucio Chiappetti - INAF/IASF - via Bassini 15 - I-20133 Milano (Italy) For more info : http://www.iasf-milano.inaf.it/~lucio/personal.html ----------------------------------------------------------------------- |
| |||
| Hello Lucio, (reply below) Lucio Chiappetti wrote: > I have some queries, involving a largish number of JOIN, which > are apparently very slow or even take forever (a mysqladmin processlist > shows them remain in the "statistics" status for a long time, in most > cases I have to kill them after several minutes). > > When I first had the problem I googled around and found some reference > (which I've lost) saying that the "statistics" status is actually what one > does with an EXPLAIN SELECT, and that this is done preliminarily to the > actual query. It also said it might occur with a large number of joins > because this analysis, for n joins MIGHT try up to n! combinations, unless > one somehow specified the priorities (but the author did not remember > how). You can find those explanations in our manual at http://dev.mysql.com/doc/refman/5.0/...formation.html > > I thought to have overcome the problem using a feature of the CREATE > VIEW command (see below), but apparently I simply moved it to an higher n. > > Now I tried to see how it scales with the number of joins, and the curious > things is that e.g. for n=9 it works fast, for n=15 it works slowly, for > n=18 works fast again and for n=20 takes an infinite time. > > I'll first explain my background : > > - I have a number of tables (let's call them t0, t1, t2 ...) > - all of them have an auto_increment column called seq which is also > an index > > - one table (t0) is "more important" (actually it is list of > celestial X-ray sources while the other are celestial objects > in other wavebands but this is irrelevant to you). > > - I have precomputed correlation tables among t0 and each of > the other. These tables are called eg. t0ti, have two columns > t0 and ti corresponding to the t0.seq and ti.seq of objects which > are "associated". They are indexed on (t0,ti). Note that an > object in t0 can be associated with 1 or more or zero (t0ti.ti null) > objects in ti. > > - I originally (already under mysql 3) devised a way to identify > counterparts in MORE tables (all these associations are based on > spherical distance + other criteria). This involved creating a > working table G > > This table has columns named t0 t1 ... tn (containing the pointers > t0.seq t1.seq ... for counterparts associated and validated according > to some criteria) plus other service columns > > The simultaneous access was achieved in our interface by a mechanism > we called virtual tables, which essentially was > > SELECT > some subset of columns in some of the t0...tn > or some expression thereof > FROM > G left join t0 on G.t0=t0.seq > left join t1 on G.t1=t1.seq > ... > left join tn on G.tn=tn.seq > > We refer to the t0...tn as "the member tables" of G. I normally do not get lost in symbolic descriptions such as yours. However your description of G and how you build it leaves me in the dark. Can you show me a few sample rows of G (symbolically, if you like) and describe what is in each column G.t0 to G.tn? What I am curious to know is what do each of these n object have in common that allows them to represented as a single tuple on the G table. You also mention other "service columns". What kinds of information are you keeping in those? > > We have different versions of G corresponding to different sets > of member tables and different association criteria. > > The largest of our cases has 26 different members. ><snip> > - for each G we define a view as > "Each G" ? Again, that makes the concept of what a G really is more confusing to me. I understand databases and I know more than a little about stellar cartography, cosmology, and physics. Please don't hold back. > <snipped (to be revisited later)> > > Questions : > > - what does "explain select" actually do and why sometimes hangs ? > > - can this be overcome rearranging the order of the joins (note that > the bulk of the members are all joined with G), or introducing > parentheses or with other syntax changes ? > > - or has it to do with some configuration parameter, maybe related > to what is cached, cache size or other ? > > Thanks in advance to whoever is able to give hints. > EXPLAIN SELECT simply stops a normal SELECT statement from actually performing the data retrieval steps and shows us (the users) a description of the techniques the query engine was about to use to get at the data. That means that the steps of parsing the query, tokenizing the symbols, and optimizing the execution plan still take place. It is during this optimization phase that most of your CPU time is being used as the engine will work many permutations of joining one table to another until it reaches a decision about which plan is "less expensive" that all of the others. http://dev.mysql.com/doc/refman/5.0/...mizations.html http://dev.mysql.com/doc/refman/5.0/...imization.html and the rest of the optimization chapter describe this process in considerable details http://dev.mysql.com/doc/refman/5.0/en/query-speed.html Please do explain the data problems you are trying to solve in some more detail as I may be able to help you to design a less cumbersome method of achieving the same goals. -- Shawn Green, Support Engineer MySQL Inc., USA, www.mysql.com Office: Blountville, TN __ ___ ___ ____ __ / |/ /_ __/ __/ __ \/ / / /|_/ / // /\ \/ /_/ / /__ /_/ /_/\_, /___/\___\_\___/ <___/ Join the Quality Contribution Program Today! http://dev.mysql.com/qualitycontribution.html |
| |||
| As I'm new on this list and can't find a "policy statement", I assume it is OK to reply to the list in "discussion list" fashion. If instead it is preferred to reply to each sender privately and later I'd post a summary, please let me know and I'll comply. On Sat, 1 Sep 2007, Shawn Green wrote: > Lucio Chiappetti wrote: > > I have some queries, involving a largish number of JOIN, which are > > [...] very slow or [...] remain in the "statistics" status [forever] > > This involved creating a working table G > > [...] > > The simultaneous access was achieved in our interface by a mechanism > > we called virtual tables, which essentially was > > > > SELECT > > some subset of columns in some of the t0...tn > > or some expression thereof > > FROM > > G left join t0 on G.t0=t0.seq > > left join t1 on G.t1=t1.seq > > ... > > left join tn on G.tn=tn.seq > > > > We refer to the t0...tn as "the member tables" of G. > I normally do not get lost in symbolic descriptions such as yours. > However your description of G and how you build it leaves me in the > dark. > > Can you show me a few sample rows of G (symbolically, if you like) and > describe what is in each column G.t0 to G.tn? What I am curious to know > is what do each of these n object have in common that allows them to > represented as a single tuple on the G table. The explanation was already (too tersely ?) contained in my sentence > > This table has columns named t0 t1 ... tn (containing the pointers > > t0.seq t1.seq ... for counterparts associated and validated according > > to some criteria) plus other service columns Let me make an example. t0 is a table of X-ray sources, it contains sky coordinates (ra,dec), count rates and fluxes in 5 energy bands, detection probabilities etc. In addition it contains a column named "seq" which is just an auto_increment sequential number, and is the unique way to identify a given source (there is a further complication due to the fact some records are redundant detections of the same source, but since they are not used in building the G's you'd better ignore this). t1 might be e.g. a table of optical sources, with sky coordinates, magnitudes, and, of course, its own "seq". There may be different tables of optical sources (t1, t2, t3). Or t4 can be a table of radio sources, with sky coordinates, fluxes, and its "seq". Or t5 can be a table of URLs into external astronomical sites like SIMBAD or NED (if you know them), again with sky coordinates and a "seq". Some of all these tables have their own identifiers, but sometimes these aren't unique, or aren't numeric. So in general these tables MAY have an UNIQUE PRIMARY index which can be either the original identifier, or some combination of original columns (for instance a source identifier and a field identifier) and an unique auxiliary key which is my auto_increment seq (numeric, built at data ingestion). This again is mostly irrelevant to you. The point is that each table has an unique auto_increment seq. A "G" table will simply contain the "seq's" in the "member tables". For ease of use the column names in G will be the table names of the member tables. I indicated them as t0 t1 t2 ... that their actual names are e.g. "nov06", "d1t3", "ukidss", "radio", "simbad" is irrelevant. So a record in G may contain for instance : - its own seq 253719 (do not be worried by the fact the number is large there are lots of gaps for records removed during construction) - the seq in the X-ray table (t0) : 1521 - the seq in an optical table (t1) : 1229 - the seq in another optical table (t2) : 42168 - the seq in the radio table (t3) : null - the seq in an IR table (t4) : 9 Another record with seq 260429 can have the same t0=1521, but e.g. the seq in the IR table t4=11, and all other t1 t2 t3 null. Etc. etc. Essentially G says that X-ray source 1521 can have up to 2 (or 1 or 7 or whatever) potential counterparts, one is optical t2=1229 which is the same as optical t3=42168 and the same as IR t4=9 ; the other is only IR t4=11, etc. etc. All associations are pre-computed via some sort of other (proximity) analysis. > You also mention other "service columns". What kinds of information are > you keeping in those? Information which is irrelevant to the present discussion, except for a marginal point (see below). Like for instance a numeric rank which says that the association 1521/1229/42168/null/9 is preferred, and the association 1521/null/null/null/11 is unlikely, or to be rejected. Or flags produced during the identification. Or the chance probabilities that the association of a source in t0 and t1 or t2 is real considered the distance and the density of objects having a given magnitude. > > We have different versions of G corresponding to different sets > > of member tables and different association criteria. > "Each G" ? Again, that makes the concept of what a G really is more > confusing to me. I understand databases and I know more than a little > about stellar cartography, cosmology, and physics. Please don't hold > back. Each G in the sense of different versions. I may have one G starting from a t0 with X-ray sources in an area of the sky, and another with a t0 in another area or coming from a different analysis. Or even with the same t0 I may have a G whose members are t0 t1 t2 t3 (identification done two years ago with tables available then) and a G whose members are t0 t1 t5 t6 t7 t8 t9 t10 (identification done now, with more tables, or with newer versions of some tables). Again this is irrelevant for MY PROBLEM, except for the following statement : - I have no problems with the SELECT statements I reported above for whatever number of t0...tn member tables. Neither if I issue the statement as such, nor if I encapsulate it in a CREATE VIEW - I start encounter problems if I want to join the VIEW created with the above statement with something else, when n is slighly large (e.g. a G with 11 members), but if I use ALGORITHM=TEMPTABLE the problem goes away, at least for n=26 (my largest G). - I have a problem again for a more complex join, which arises around n=20. Actually it's look like further experimenting shows a solution, but I'm still unsure whether that gives the same results of the old one : such a solution is : replace all LEFT JOIN by STRAIGHT_JOIN Maybe you'd just want to comment on that ? > Please do explain the data problems you are trying to solve in some more > detail as I may be able to help you to design a less cumbersome method > of achieving the same goals. I thank you for your offer. However we do not want to revolutionize all our system and interfaces (Shawn, if you are interested I can give you privately more specific pointers), which has been working well since some years and used inside our consortium, and has just "gone public" for some datasets. The "data problem" is likely to have no or little relation with the astronomical nature of the data, or the background explained above (which howver I hope was of some interest), but can be summarized as follows. Note the problems occur ONLY in the latest steps, which are those of lesser importance for the user (or important for less users) : - there are n tables (all with their "seq" column), which contain several columns, of which some are particularly interesting, and some other are less interesting - these tables are associated via a "G" table which links all the seqs of associated objects. This is precomputed once forever. - the majority of the users will be interested in a small number of interesting columns taken from some of the member tables (or expressions thereof). Howevever we wanted to screen them from entering in a SELECT statement columns explicitly by names like nov06.seq, nov06.fluxb, w1t3.magip or even dist(nov06.ra_corr,nov06.dec_corr,w1t3.ra,w1t3.dec l)*3600 (where dist is an UDF) ... and I have even worse expressions. So we originally devised our own java interface to somehow hide the expansion of such aliases, and later embedded the aliases in a CREATE VIEW. The "FROM" of such statements is the one listed above, involving G and a sequence of left joins on t0 to tn. This works nicely and fast in both the old (no view) and new (with view) way. - a minority of users could be interested in accessing also some of the columns in the member tables, which are NOT listed in our non-hidden list or are NOT members of the view. In the old arrangement, we simply had a check box which enabled "view member also" and un-hided the hidden columns in a menu, Also the user could simply type their name (say nov06.snrcd) in the SELECT if he knew the name. Such statement was the SAME involving G and a sequence of left joins on t0 to tn. It is just a matter of listing more columns in the "select" before the "from". However if the new arrangement uses a view V to hide the entire statement, if I WANT USERS TO BE ABLE ALSO TO SEE HIDDEN COLUMNS (if you want THIS is the data problem !!!) I have to explicitly repeat the list of joins. I cannot issue select nov06.snrcd from V because nov06.snrcd was not included in the definition of the view (and there are too many columns in the n members t0...tn to be all named in a viable view). So I generate a statement like that listed at about 2/3 of the post which started the thread SELECT list of (V.colname and ti.colname with i chosen among 0 and n) FROM ( G left join t0 on G.t0=t0.seq left join t1 on G.t1=t1.seq ... left join tn on G.tn=tn.seq ) left join V on G.seq=V.seq Again this works provided one uses ALGORITHM=TEMPTABLE in the CREATE VIEW. Despite the fact some redundant queries are issued. - a real minority of users (me only ? for debugging purposes ? it happened me once that I detected and fixed a typo that way ) could be interested in accessing not only the interesting columns in the view AND the hidden columns in the members, but also columns in a FURTHER TABLE tk (e.g. which is not yet a member, or is a new release of some member which I have to evaluate whether to replace). This involves a more complex join ... which is listed towards the end of the original post and I won't repeat here Such join involves one further view which in turn is a join of the G and a correlation table between tk and the X-ray table t0 (essentially listing all seqs in t0 and tk whose object are closer than a predefined sky distance). It is this latter statement (of rare use, hence my reluctance to revolutionize all the rest) which enters the "statistics state" forever for large number of members. However, as I said, replacing LEFT JOIN by STRAIGHT_JOIN seems to solve it. Now coming to more general issues, I have read (or re-read) most of the manual pages you quoted, but your explanation below is what condenses more effectively what happens. > EXPLAIN SELECT simply stops a normal SELECT statement from actually > performing the data retrieval steps and shows us [...] I used in the past EXPLAIN SELECT mainly to check whether introducing an index (or sometimes a parenthesized order) improved the performance, although why it did it was sort of black magic. > It is during this optimization phase that most of your CPU time is being > used as the engine will work many permutations of joining one table to > another until it reaches a decision about which plan is "less expensive" > that all of the others. So essentially this matches what a lost Google reference said, that in case of many joins an excessive times may be spent checking up to n! combinations unless one somehow "forces the order". What I gather from some hints in the documentation (and my very rough and quick test) is that STRAIGHT_JOIN might be the way to force the order (but will it have some unpleasant side effects or is it safe ?) Thanks again. -- ----------------------------------------------------------------------- Lucio Chiappetti - INAF/IASF - via Bassini 15 - I-20133 Milano (Italy) For more info : http://www.iasf-milano.inaf.it/~lucio/personal.html ----------------------------------------------------------------------- () ascii ribbon campaign - against html mail /\ http://arc.pasp.de/ |
| |||
| Hello Lucio, Thank you for the excellent description of you problem. I believe I completely understand both the data you are handling and the problems you are facing. I would not ask you to change your schema at this point. There is far too much work put into it at this phase to suggest a redesign. What I hope to do is to help you to make you queries work better by applying hints, modifiers, and limits to the optimizer. http://dev.mysql.com/doc/refman/5.0/en/join.html states: "STRAIGHT_JOIN is identical to JOIN, except that the left table is always read before the right table. This can be used for those (few) cases for which the join optimizer puts the tables in the wrong order." You asked what would happen if you replaced all of your view's LEFT JOIN clauses with STRAIGHT_JOIN clauses. When used in the FROM clause (as part of the definitions of where the data comes from) STRAIGHT_JOIN would be equivalent to a JOIN which is equivalent to an INNER JOIN. This is not what you are trying to achieve. However, there is another place you can put the STRAIGHT_JOIN modifier: in the SELECT clause (http://dev.mysql.com/doc/refman/5.0/en/select.html) . Quoting again: " STRAIGHT_JOIN forces the optimizer to join the tables in the order in which they are listed in the FROM clause. You can use this to speed up a query if the optimizer joins the tables in non-optimal order. See Section 6.2.1, “Optimizing Queries with EXPLAIN”. STRAIGHT_JOIN also can be used in the table_references list. See Section 12.2.7.1, “JOIN Syntax”. SELECT STRAIGHT_JOIN .... FROM ... LEFT JOIN ... WHERE ... ... This will allow you to keep your LEFT JOINs in the <table reference> portion of your query (everything between FROM and WHERE) but avoid all of the permutations the optimizer performse related to trying to analyze which table to join first to which other table. Why are some sets of tables optimized more quickly than the others? My best guess would be that some mix of WHERE conditions and table indexes make it much faster to eliminate combinatorial permutations of JOIN sequences than others. There is another variable you could use to minimize how many table permutations the optimizer will examine. This is the variable optimizer_search_depth: (http://dev.mysql.com/doc/refman/5.0/...search_dept h) (My apologies if that link wraps incorrectly). You can set this variable on a per-session basis so it would be possible to tune it for each call to your views. I know that would be quite the administrative hassle to implement setting this variable for each call to your views but I provide it here for completeness. Please try out "SELECT STRAIGHT_JOIN" and let me know if your situation improves. BTW - we encourage everyone to reply to the full list on all responses (unless they are confidential) so that all members can gain from the knowledge transfer. -- Shawn Green, Support Engineer MySQL Inc., USA, www.mysql.com Office: Blountville, TN __ ___ ___ ____ __ / |/ /_ __/ __/ __ \/ / / /|_/ / // /\ \/ /_/ / /__ /_/ /_/\_, /___/\___\_\___/ <___/ Join the Quality Contribution Program Today! http://dev.mysql.com/qualitycontribution.html |
| |||
| On Mon, 3 Sep 2007, Shawn Green wrote: > What I hope to do is to help you to make you queries work better by > applying hints, modifiers, and limits to the optimizer. Many thanks, Shawn. > "STRAIGHT_JOIN is identical to JOIN, except ... > > You asked what would happen if you replaced all of your view's LEFT JOIN > clauses with STRAIGHT_JOIN clauses. [...] This is not what you are > trying to achieve. I realized that. I overlooked "is identical to JOIN" since I generally use only LEFT JOINs to benefit of null returns (to us it is equally important to know "source A has these 1/2/n counterparts" and "source A has no counterparts" in the other catalogue/s). I originally tested that STRAIGHT_JOIN made the "explain select" arrive to an end without looping forever, but as soon as I tried a real select I realized the different behaviour. > However, there is another place you can put the STRAIGHT_JOIN modifier: > in the SELECT clause I discovered that too, and it looks promising. I plan to test whether that makes any difference on the query result and execution time in some representative cases. > Why are some sets of tables optimized more quickly than the others? My > best guess would be that some mix of WHERE conditions and table indexes > make it much faster to eliminate combinatorial permutations of JOIN > sequences than others. I also realized that the WHERE part enters the optimization too ... this is of course unpredictable a priori, as it depends on the particular user needs. > This is the variable optimizer_search_depth: This also looks interesting thanks. I did some quick experimenting. For some reasons a SHOW VARIABLE tells me its normal value is 62. Apparently this does not depend on the database I select (I thought my real "data" database with the views and G's could have an higher value than e.g. my own "administrative" DB or the mysql DB, but it is always 62. I tried lowering it to 10/15/20 and even to 0 (which according to 5.2.3 in the manual should be a sort of "automatic" value). In all cases the explain select on our maximal statement (the one which loops forever in statistics status under default conditions) reaches an end rather fast, and returns the same result (irrespective of 0/10/15/20). The ORDER returned by playign with optimizer_search_depth is however different from the one returned with SELECT STRAIGHT_JOIN. The difference is just in the order (the type, key and "extra" looks the same, in general "ref" or "eq_ref" and "using index") so I suppose it should not make much difference. > You can set this variable on a per-session basis so it would be possible > to tune it for each call to your views. While I'm doing the tests under the mysql linemode client, our production environment is a Java jsp interface under tomcat. I have to check with my colleague dealing with java programming what is a "session" for him (probably a JDBC connection in connection pool). Actually what looks appealing in optimizer_search_depth vs SELECT STRAIGHT_JOIN is the possibility of fixing (no tuning per session) optimizer_search_depth to a decent default (0 ?) once forever, because this requires no changes to the code (however if SELECT STRAIGHT_JOIN will require changes concentrated in a few places). I'll do some experimenting and report back. -- ----------------------------------------------------------------------- Lucio Chiappetti - INAF/IASF - via Bassini 15 - I-20133 Milano (Italy) For more info : http://www.iasf-milano.inaf.it/~lucio/personal.html ----------------------------------------------------------------------- () ascii ribbon campaign - against html mail /\ http://arc.pasp.de/ |
| |||
| On Tue, 4 Sep 2007, Lucio Chiappetti wrote: > I'll do some experimenting and report back. In lack of better ways of doing a tie-break, I've done the following tests (with the linemode client), checking both the results of a query and the total time spent. I tested 16 different combinations of arbitrary WHERE conditions and correlation with an arbitrary external tables, using my "maximal" G (the one with 26 members). For each I tested 3 cases (total 16*3=48) : a) the query on the "virtual" table correlated with the external (the virtual is my G left join t1 ... left join tn). This involves the lowest number of joins (29) and is of course the fastest (and also the easiest to get access to "hidden" columns by name, but is not what I intend our users use because the way to select columns is clumsy (they should use a VIEW on the 26 members instead ... normally with no correlation on external that will be equally fast), but represents my REFERENCE, i.e. I checked the results of the other test to be the same as this. b) the query on the VIEW V with "members also" (which implies a redundant join of V with G left join t1 ... left join tn) and correlated with an external table. Doubles the joins (59). I used here SELECT STRAIGHT_JOIN. c) the same query of (b) but with a normal select, and preceded by setting optimizer_search_depth=0 (auto). I did each test in a fresh mysql session to prevent cached valued to affect the timings. The good news are that all query go to end without getting stuck in the statistics state, and that (a),(b),(c) for the same query return the same results. Considering the timing instead : (a) are obviously the fastest (from 0.0 to 0.63 seconds at worst) (b) the queries with SELECT STRAIGHT_JOIN ("not optimized" ?) are *in general* the slowest but not too slow, i.e. complete within from 0.8 to 1.9 sec. There are however a couple of cases involving one external table in which they are slower (though not unfeasible), from 3 to 12 sec in one case, and from 24 to 45 sec in another. (c) the queries with optimizer_search_depth=0 have a speed comparable with (b). *In general* they are marginally faster than (b) (0.7 to 1.6 sec) when (b) is reasonably fast. When (b) is slow, however (c) is TWICE AS SLOWER (5 to 23 and 48 to 89 sec) The explain select does not give obvious clues while those particular table combinations are slower, and, considered that the queries involving "view + member also + external" concern an absolute minority of users[*] it is not worth spending more time investigating. [*] actually a larger minority (I hope it remains such) of our "public" users accesses the DB once to "take away" all data without performing any selection :-( My inclination therefore would be to prefer optimizer_search_depth=0 to SELECT STRAIGHT_JOIN *if* it can be easily arranged in our tomcat environment, because of the marginal increase in speed despite the occasional worsening. But the two look almost equivalent. -- ----------------------------------------------------------------------- Lucio Chiappetti - INAF/IASF - via Bassini 15 - I-20133 Milano (Italy) For more info : http://www.iasf-milano.inaf.it/~lucio/personal.html ----------------------------------------------------------------------- () ascii ribbon campaign - against html mail /\ http://arc.pasp.de/ |
| |||
| Lucio Chiappetti wrote: > On Tue, 4 Sep 2007, Lucio Chiappetti wrote: > >> I'll do some experimenting and report back. > > In lack of better ways of doing a tie-break, I've done the following > tests (with the linemode client), checking both the results of a query and > the total time spent. I tested 16 different combinations of arbitrary > WHERE conditions and correlation with an arbitrary external tables, using > my "maximal" G (the one with 26 members). > > For each I tested 3 cases (total 16*3=48) : > > a) the query on the "virtual" table correlated with the external > (the virtual is my G left join t1 ... left join tn). This involves > the lowest number of joins (29) and is of course the fastest (and > also the easiest to get access to "hidden" columns by name, but > is not what I intend our users use because the way to select columns > is clumsy (they should use a VIEW on the 26 members instead ... > normally with no correlation on external that will be equally fast), > but represents my REFERENCE, i.e. I checked the results of the other > test to be the same as this. > > b) the query on the VIEW V with "members also" (which implies a redundant > join of V with G left join t1 ... left join tn) and correlated with > an external table. Doubles the joins (59). I used here > SELECT STRAIGHT_JOIN. > > c) the same query of (b) but with a normal select, and preceded by > setting optimizer_search_depth=0 (auto). > <snip> In your b) test, did you use the SELECT STRAIGHT_JOIN as your outer SELECT statement or within the CREATE VIEW statement>? If you only tried it one way, you could try it the other, too. > >[*] actually a larger minority (I hope it remains such) of our "public" > users accesses the DB once to "take away" all data without performing any > selection :-( > It saddens me to see people abuse your hard work in this way. > My inclination therefore would be to prefer optimizer_search_depth=0 > to SELECT STRAIGHT_JOIN *if* it can be easily arranged in our tomcat > environment, because of the marginal increase in speed despite the > occasional worsening. But the two look almost equivalent. > Another option would be to build your view dynamically based on the criteria that a user selects through a web-based interface. One other way to approach this project would be to normalize your relationships and have one skinny but very tall table similar to CREATE TABLE map_table ( object1_id int , object1_type int , object2_id int , object2_type int , confidence tinyint ) where confidence would be a whole number from 0 to 100. You would need only one of these to replace each G table you are generating now. However, that would be a major difference in how you currently use your data and I would not suggest this for a near-term solution. -- Shawn Green, Support Engineer MySQL Inc., USA, www.mysql.com Office: Blountville, TN __ ___ ___ ____ __ / |/ /_ __/ __/ __ \/ / / /|_/ / // /\ \/ /_/ / /__ /_/ /_/\_, /___/\___\_\___/ <___/ Join the Quality Contribution Program Today! http://dev.mysql.com/qualitycontribution.html |
| |||
| On Tue, 4 Sep 2007, Shawn Green wrote: > > On Tue, 4 Sep 2007, Lucio Chiappetti wrote: > > For each I tested 3 cases (total 16*3=48) : > > > > a) the query on the "virtual" table correlated with the external > > (the virtual is my G left join t1 ... left join tn). [...] > > represents my REFERENCE, this case will also exploit full (or default) optimization > > b) the query on the VIEW V with "members also" (which implies a > > redundant join of V with G left join t1 ... left join tn) and > > correlated with an external table. [...] I used here SELECT > > STRAIGHT_JOIN. > > c) the same query of (b) but with a normal select, and preceded by > > setting optimizer_search_depth=0 (auto). > In your b) test, did you use the SELECT STRAIGHT_JOIN as your outer > SELECT statement or within the CREATE VIEW statement? No. I left the CREATE VIEW alone (standard LEFT JOINs) in all cases, since that will be fully optimized, and used by a majority of users. SELECT STRAIGHT_JOIN was used only in (b) ["no optimization"] and not in (c) ["residual optimization"]. Our inclination would now be to use SELECT STRAIGHT_JOIN because it's the one involving only "elegant" changes to the code. Consider again that the case "view + member also + other table" will be rare (very few users) compare to "view alone". Anyhow all this was rather instructive. Now I'll start experimenting with UNIONs too ... -- ----------------------------------------------------------------------- Lucio Chiappetti - INAF/IASF - via Bassini 15 - I-20133 Milano (Italy) For more info : http://www.iasf-milano.inaf.it/~lucio/personal.html ----------------------------------------------------------------------- () ascii ribbon campaign - against html mail /\ http://arc.pasp.de/ |
| |||
| Hi Shawn, Lucio, > SELECT STRAIGHT_JOIN .... > FROM ... > LEFT JOIN ... > WHERE ... > ... Just to correct a point here... if a query uses only LEFT JOIN or RIGHT JOIN, the join order is fixed by the query's order itself, so using STRAIGHT_JOIN should have no effect whatsoever. Equally important, since the join order is fixed when you use LEFT JOIN, you *must* list the joins in the correct order when writing the query, otherwise you will see very poor performance. MySQL's optimizer cannot reorder the joins because it has the potential to change the result of the query. Regards, Jeremy -- high performance mysql consulting www.provenscaling.com |
| ||||
| On Fri, 14 Sep 2007, Jeremy Cole wrote: > Just to correct a point here... if a query uses only LEFT JOIN or RIGHT JOIN, > the join order is fixed by the query's order itself, so using STRAIGHT_JOIN > should have no effect whatsoever. Equally important, since the join order is but experimentally using SELECT STRAIGHT_JOIN *does make* a difference. If I use it there is no time spent in the "statistics" phase, otherwise there is. Shawn can comment more conclusively since he knows the inner working of mysql. I just report a finding by experiment. > should have no effect whatsoever. Equally important, since the join order is > fixed when you use LEFT JOIN, you *must* list the joins in the correct order > when writing the query, otherwise you will see very poor performance. However the "correct order" is not always obvious to be determined a priori (like in my cases where I have a "glorified correlation" table G which shall go first, but all member tables which go next with equal rank (except the "first member" which is more equal than the other "=) ). The user can then build a query which uses only SOME of the members. And I've noticed that the result of EXPLAIN SELECT (the order) changed according to the content of the query (e.g. the particular WHERE condition). So for me experimentally use of SELECT STRAIGHT_JOIN is an effective solution. -- ----------------------------------------------------------------------- Lucio Chiappetti - INAF/IASF - via Bassini 15 - I-20133 Milano (Italy) For more info : http://www.iasf-milano.inaf.it/~lucio/personal.html ----------------------------------------------------------------------- () ascii ribbon campaign - against html mail /\ http://arc.pasp.de/ |