vBulletin Search Engine Optimization
| |||||||
| Register | FAQ | Members List | Calendar | Search | Today's Posts | Mark Forums Read |
| ||||
| Hello, I am currently working with a data mart. The business wants a view of the data that shows columns from the main table along with up to 6 codes from a related table (not related by the full PK of the main table though). They don't care about the order of the codes or how they are assigned to their columns. Here is an example, which will hopefully make this clearer: CREATE TABLE dbo.Main_Table ( my_id INT NOT NULL, line_number SMALLINT NOT NULL, some_other_columns VARCHAR(20) NULL ) GO ALTER TABLE dbo.Main_Table ADD CONSTRAINT PK_Main_Table PRIMARY KEY CLUSTERED (my_id, line_number) GO CREATE TABLE dbo.Related_Table ( my_id INT NOT NULL, my_code CHAR(6) NOT NULL ) GO ALTER TABLE dbo.Related_Table ADD CONSTRAINT PK_Related_Table PRIMARY KEY CLUSTERED (my_id, my_code) GO -- The two tables above are related by my_id INSERT INTO dbo.Main_Table (my_id, line_number) VALUES (1, 1) INSERT INTO dbo.Main_Table (my_id, line_number) VALUES (1, 2) INSERT INTO dbo.Main_Table (my_id, line_number) VALUES (1, 3) INSERT INTO dbo.Main_Table (my_id, line_number) VALUES (2, 1) INSERT INTO dbo.Main_Table (my_id, line_number) VALUES (2, 2) INSERT INTO dbo.Related_Table (my_id, my_code) VALUES (1, '22.63') INSERT INTO dbo.Related_Table (my_id, my_code) VALUES (1, '73.09') INSERT INTO dbo.Related_Table (my_id, my_code) VALUES (1, '51.23') INSERT INTO dbo.Related_Table (my_id, my_code) VALUES (2, '26.42') GO The results that they would want to see are: my_id line_number my_code_1 my_code_2 my_code_3 my_code_4 1 1 22.63 73.09 51.23 NULL 1 2 22.63 73.09 51.23 NULL 1 3 22.63 73.09 51.23 NULL 2 1 26.42 NULL NULL NULL 2 2 26.42 NULL NULL NULL I'm pretty sure that I will need to load a table with the data in this format (or something similar) since generating this output on the fly for the users will not work performance-wise (the main table is a partitioned table with 6 partitions, each containting 35M+ rows, while the related table is a single table that contains about 2M rows. There is additional logic that has to be done with a similar table of 90M rows. So, I will try to load a table with this denormalized view of the world. I could probably add some sort of sequence number to Related_Table if you think that might help (then it just becomes a simple series of left outer joins). Any thoughts? Thanks! -Tom. |
| |||
| Thomas R. Hummel (tom_hummel@hotmail.com) writes: > I am currently working with a data mart. The business wants a view of > the data that shows columns from the main table along with up to 6 > codes from a related table (not related by the full PK of the main > table though). They don't care about the order of the codes or how they > are assigned to their columns. Here is an example, which will hopefully > make this clearer: >... > I'm pretty sure that I will need to load a table with the data in this > format (or something similar) since generating this output on the fly > for the users will not work performance-wise (the main table is a > partitioned table with 6 partitions, each containting 35M+ rows, while > the related table is a single table that contains about 2M rows. There > is additional logic that has to be done with a similar table of 90M > rows. So, I will try to load a table with this denormalized view of the > world. I could probably add some sort of sequence number to > Related_Table if you think that might help (then it just becomes a > simple series of left outer joins). Without a sequence number in Related_Table it becomes a lot more difficult. So I changed Related_Table to: CREATE TABLE dbo.Related_Table ( my_id INT NOT NULL, seq_no tinyint NOT NULL CHECK (seq_no BETWEEN 1 AND 4), my_code CHAR(6) NOT NULL ) and produced this query (with a single LEFT JOIN): SELECT M.my_id, M.line_number, code1 = MIN(CASE R.seq_no WHEN 1 THEN R.my_code END), code2 = MIN(CASE R.seq_no WHEN 2 THEN R.my_code END), code3 = MIN(CASE R.seq_no WHEN 3 THEN R.my_code END), code4 = MIN(CASE R.seq_no WHEN 4 THEN R.my_code END) FROM Main_Table M LEFT JOIN Related_Table R ON M.my_id = R.my_id GROUP BY M.my_id, M.line_number ORDER BY M.my_id, M.line_number -- Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se Books Online for SQL Server 2005 at http://www.microsoft.com/technet/pro...ads/books.mspx Books Online for SQL Server 2000 at http://www.microsoft.com/sql/prodinf...ons/books.mspx |
| |||
| Erland Sommarskog wrote: > Without a sequence number in Related_Table it becomes a lot more difficult. > So I changed Related_Table to: > > CREATE TABLE dbo.Related_Table ( > my_id INT NOT NULL, > seq_no tinyint NOT NULL CHECK (seq_no BETWEEN 1 AND 4), > my_code CHAR(6) NOT NULL ) > > and produced this query (with a single LEFT JOIN): > > SELECT M.my_id, M.line_number, > code1 = MIN(CASE R.seq_no WHEN 1 THEN R.my_code END), > code2 = MIN(CASE R.seq_no WHEN 2 THEN R.my_code END), > code3 = MIN(CASE R.seq_no WHEN 3 THEN R.my_code END), > code4 = MIN(CASE R.seq_no WHEN 4 THEN R.my_code END) > FROM Main_Table M > LEFT JOIN Related_Table R ON M.my_id = R.my_id > GROUP BY M.my_id, M.line_number > ORDER BY M.my_id, M.line_number Thanks for the interesting solution Erland. In the real life case of course, I will have many more columns than just my_id and line_number. I realize that everything is dependent on the specific data, etc., but in general, would you expect this to perform better than: SELECT M.my_id, M.line_number, R1.my_code AS code1, R2.my_code AS code2, R3.my_code AS code3, R4.my_code AS code4 FROM dbo.Main_Table M LEFT OUTER JOIN dbo.Related_Table R1 ON R1.my_id = M.my_id AND R1.seq_no = 1 LEFT OUTER JOIN dbo.Related_Table R2 ON R1.my_id = M.my_id AND R1.seq_no = 2 LEFT OUTER JOIN dbo.Related_Table R3 ON R1.my_id = M.my_id AND R1.seq_no = 3 LEFT OUTER JOIN dbo.Related_Table R4 ON R1.my_id = M.my_id AND R1.seq_no = 4 I'll experiment with both solutions if I can add a sequence number, but I was just a little surprised to see your query once the seq_no was added. Thanks, -Tom. |
| |||
| Thomas R. Hummel (tom_hummel@hotmail.com) writes: > Thanks for the interesting solution Erland. In the real life case of > course, I will have many more columns than just my_id and line_number. > I realize that everything is dependent on the specific data, etc., but > in general, would you expect this to perform better than: I will have to admit not having done any benchmark on the two solutions. But my gut feeling is that the solution with MIN and GROUP BY generally is better. The other seems to give more chances for the optimizer to go astray. -- Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se Books Online for SQL Server 2005 at http://www.microsoft.com/technet/pro...ads/books.mspx Books Online for SQL Server 2000 at http://www.microsoft.com/sql/prodinf...ons/books.mspx |
| |||
| Erland Sommarskog wrote: > Thomas R. Hummel (tom_hummel@hotmail.com) writes: > > Thanks for the interesting solution Erland. In the real life case of > > course, I will have many more columns than just my_id and line_number. > > I realize that everything is dependent on the specific data, etc., but > > in general, would you expect this to perform better than: > > I will have to admit not having done any benchmark on the two solutions. > But my gut feeling is that the solution with MIN and GROUP BY generally > is better. The other seems to give more chances for the optimizer to go > astray. I'll be sure to post some results here once I have both solutions in place for testing. Might be a few days though. -Tom. |
| |||
| Erland Sommarskog wrote: > Thomas R. Hummel (tom_hummel@hotmail.com) writes: > > Thanks for the interesting solution Erland. In the real life case of > > course, I will have many more columns than just my_id and line_number. > > I realize that everything is dependent on the specific data, etc., but > > in general, would you expect this to perform better than: > > I will have to admit not having done any benchmark on the two solutions. > But my gut feeling is that the solution with MIN and GROUP BY generally > is better. The other seems to give more chances for the optimizer to go > astray. As promised... I had a chance today to run some tests with each method. When I used a smaller table for the Main_Table (~17M rows) both had the same estimated cost for their query plans, although the multiple joins method ran a bit faster (3m49s vs. 5m11s). With a larger Main_Table (~39M rows) the estimated cost for the query plan with the group by was a bit higher than the multiple join method. In that case the group by method ran 20m23s vs. 12m45s for the multiple join method. Of course, this may vary depending on server memory, IO speed, data frequencies, etc., etc., but for my case it looks like doing multiple joins works out best. Thanks, -Tom. |
| |||
| Tom, you might want to sort and aggregate a narrow result set first, then join: SELECT M.my_id, M.line_number, code1, code2, code3, code4 FROM Main_Table M LEFT JOIN ( select my_id, MIN(CASE R.seq_no WHEN 1 THEN R.my_code END) code1, MIN(CASE R.seq_no WHEN 2 THEN R.my_code END) code2, MIN(CASE R.seq_no WHEN 3 THEN R.my_code END) code3, MIN(CASE R.seq_no WHEN 4 THEN R.my_code END) code4 from Related_Table r group by my_id ) R ON M.my_id = R.my_id In many cases it is dramatically faster. We were describing it in article named "The Less SQL Server Sorts, the Faster It Responds". |
| |||
| Alexander Kuznetsov wrote: > Tom, > > you might want to sort and aggregate a narrow result set first, then > join: > > SELECT M.my_id, M.line_number, > code1, > code2, > code3, > code4 > FROM Main_Table M > LEFT JOIN ( > select my_id, > MIN(CASE R.seq_no WHEN 1 THEN R.my_code END) code1, > MIN(CASE R.seq_no WHEN 2 THEN R.my_code END) code2, > MIN(CASE R.seq_no WHEN 3 THEN R.my_code END) code3, > MIN(CASE R.seq_no WHEN 4 THEN R.my_code END) code4 > from Related_Table r > group by my_id > ) R > ON M.my_id = R.my_id > > > In many cases it is dramatically faster. We were describing it in > article named "The Less SQL Server Sorts, the Faster It Responds". Thanks for the suggestion. When I checked a solution based on this the estimated cost in the query plan was indeed lower and it ran in just over 10m vs. the 12m45s of the multiple join solution. I may end up sticking with the multiple join solution because I think that it's more obvious to future coders what the code is trying to do and the performance difference is not a big factor, but I'll definitely put this solution in my toolbox for future use. Thanks! -Tom. |
| ||||
| Tom, In such cases I just love to figure out what components add up to those 10 or 12 minutes, for instance: How much time does it take to just scan the main table? select count(*) from main_table where col1+col2=17 (the table is scanned and a very minor overhead to calculate "col1+col2") How much time does it take to run the inner subquery? and so on. IMO it is important. For instance, if it takes 8 minutes just to scan the main table, then the difference between these 2 approaches is not 10' vs.12'45'', it is in fact closer to 2' vs. 4'45''. Makes sense? |