vBulletin Search Engine Optimization
| |||||||
| Register | FAQ | Members List | Calendar | Search | Today's Posts | Mark Forums Read |
| ||||
| Hello, I want to set up a query in SQL Server that is "grouped by" a variable PRICE. Since PRICE takes on continuous decimal values, I want to create deciles based on this variable and then display the average price in each decile. As background, PRICE is a calculated quantity: I divide a table field SLS_DLRS by a field SLS_UNTS to get it. I also want to include an average SLS_UNTS for each decile. So essentially, I want the result to be something like: DECILE AVG_PRICE AVG_SLS_UNTS 1 0.50 5.2 2 1.50 4.7 .... ... ... 10 9.50 1.1 Is there a way to do this in an SQL statement? Thanks in advance, Indraneel |
| |||
| "Indraneel Sheorey" <indraneel.sheorey@dartmouth.edu> wrote in message news:3556ef5.0311141040.1e38d2a8@posting.google.co m... > Hello, > > I want to set up a query in SQL Server that is "grouped by" a variable > PRICE. Since PRICE takes on continuous decimal values, I want to > create deciles based on this variable and then display the average > price in each decile. As background, PRICE is a calculated quantity: > I divide a table field SLS_DLRS by a field SLS_UNTS to get it. I also > want to include an average SLS_UNTS for each decile. > > So essentially, I want the result to be something like: > DECILE AVG_PRICE AVG_SLS_UNTS > 1 0.50 5.2 > 2 1.50 4.7 > ... ... ... > 10 9.50 1.1 > > Is there a way to do this in an SQL statement? > > Thanks in advance, > Indraneel There are different methods, with varying degrees of accuracy, to calculate percentiles. Here's one reasonable method described through an example. Consider the 50th percentile (5th decile or median) for the following numbers, arranged in increasing order: 3, 5, 7, 8, 9, 11, 13, 15 1. Compute the rank R of the 50th percentile. R = P / 100 * (N + 1) P = the desired percentile N = number of numbers in the collection R = 50 / 100 * (8 + 1) = 9 / 2 = 4.5 2. When R is an integer, the Pth percentile would be the number with rank R. 3. When R is not an integer, as in this case, the Pth percentile is computed by interpolation as follows: a. Define IR as the integer portion of R. For this example, IR is 4. b. Define FR as the fractional portion of R. For this example, FR is .5. c. Find the values in the sequence with rank IR and IR + 1. For example, this means the value with rank 4 and the value with rank 5, which are 8 and 9, respectively. d. Interpolate by multiplying the difference between the scores by FR and adding the result to the lower score. For this example, this is .5 * (9 - 8) + 8 = 8.5 Therefore, the 50th percentile is 8.5. CREATE TABLE T ( sls_dlrs DECIMAL (8, 2) NOT NULL CHECK (sls_dlrs > 0), sls_unts INT NOT NULL CHECK (sls_unts > 0) ) CREATE VIEW TPrices (sls_dlrs, sls_unts, price) AS SELECT sls_dlrs, sls_unts, sls_dlrs / sls_unts FROM T -- Rank prices in increasing order. For a given price P in an ordered collection, -- give both the rank of the first occurrence of P and the rank of the last -- occurrence of P. The first price P in an ordered collection has rank 1. CREATE VIEW IncreasingOrderRanks (price, rank_first, rank_last) AS SELECT P1.price, COUNT(*) - P1.occurrences + 1, COUNT(*) FROM (SELECT price, COUNT(*) AS occurrences FROM TPrices GROUP BY price) AS P1 INNER JOIN TPrices AS P2 ON P2.price <= P1.price GROUP BY P1.price, P1.occurrences CREATE TABLE Digits ( d INT NOT NULL PRIMARY KEY CHECK (d BETWEEN 0 AND 9) ) INSERT INTO Digits (d) VALUES (0) INSERT INTO Digits (d) VALUES (1) INSERT INTO Digits (d) VALUES (2) INSERT INTO Digits (d) VALUES (3) INSERT INTO Digits (d) VALUES (4) INSERT INTO Digits (d) VALUES (5) INSERT INTO Digits (d) VALUES (6) INSERT INTO Digits (d) VALUES (7) INSERT INTO Digits (d) VALUES (8) INSERT INTO Digits (d) VALUES (9) -- Need to generate natural numbers up to 100 to represent whole number -- percentile ranks CREATE VIEW PercentileRanks (rank) AS SELECT Ones.d + 10 * Tens.d + 1 FROM Digits AS Ones CROSS JOIN Digits AS Tens CREATE VIEW Percentiles (percentile, rank) AS SELECT CASE WHEN PR.rank * (RN.rank + 1) / 100.0 < R1.rank THEN R1.price WHEN PR.rank * (RN.rank + 1) / 100.0 > RN.rank THEN RN.price WHEN PR.rank * (RN.rank + 1) / 100.0 > CAST(PR.rank * (RN.rank + 1) / 100.0 AS INT) THEN (SELECT CASE WHEN R1.price = R2.price THEN R1.price ELSE (PR.rank * (RN.rank + 1) / 100.0 - CAST(PR.rank * (RN.rank + 1) / 100.0 AS INT)) * (R2.price - R1.price) + R1.price END FROM IncreasingOrderRanks AS R1 INNER JOIN IncreasingOrderRanks AS R2 ON CAST(PR.rank * (RN.rank + 1) / 100.0 AS INT) BETWEEN R1.rank_first AND R1.rank_last AND CAST(PR.rank * (RN.rank + 1) / 100.0 AS INT) + 1 BETWEEN R2.rank_first AND R2.rank_last) ELSE (SELECT price FROM IncreasingOrderRanks WHERE CAST(PR.rank * (RN.rank + 1) / 100.0 AS INT) BETWEEN rank_first AND rank_last) END, PR.rank FROM PercentileRanks AS PR CROSS JOIN (SELECT 1, price FROM IncreasingOrderRanks WHERE rank_first = 1) AS R1(rank, price) CROSS JOIN (SELECT MAX(rank_last), MAX(price) FROM IncreasingOrderRanks) AS RN(rank, price) CREATE VIEW Deciles (decile, rank) AS SELECT percentile, rank / 10 FROM Percentiles WHERE rank % 10 = 0 -- The desired query SELECT D1.rank, AVG(P.price) AS avg_price, AVG(P.sls_unts) AS avg_sls_unts FROM Deciles AS D1 LEFT OUTER JOIN Deciles AS D2 ON D2.rank = D1.rank - 1 INNER JOIN TPrices AS P ON P.price <= D1.decile AND (D2.decile IS NULL OR P.price > D2.decile) GROUP BY D1.rank ORDER BY D1.rank Regards, jag |
| |||
| "John Gilson" <jag@acm.org> wrote in message news:<REktb.175923$pT1.22236@twister.nyc.rr.com>.. . > "Indraneel Sheorey" <indraneel.sheorey@dartmouth.edu> wrote in message > news:3556ef5.0311141040.1e38d2a8@posting.google.co m... > > Hello, > > > > I want to set up a query in SQL Server that is "grouped by" a variable > > PRICE. Since PRICE takes on continuous decimal values, I want to > > create deciles based on this variable and then display the average > > price in each decile. As background, PRICE is a calculated quantity: > > I divide a table field SLS_DLRS by a field SLS_UNTS to get it. I also > > want to include an average SLS_UNTS for each decile. > > > > So essentially, I want the result to be something like: > > DECILE AVG_PRICE AVG_SLS_UNTS > > 1 0.50 5.2 > > 2 1.50 4.7 > > ... ... ... > > 10 9.50 1.1 > > > > Is there a way to do this in an SQL statement? > > > > Thanks in advance, > > Indraneel > > There are different methods, with varying degrees of accuracy, to > calculate percentiles. Here's one reasonable method described through > an example. Consider the 50th percentile (5th decile or median) for the > following numbers, arranged in increasing order: > > 3, 5, 7, 8, 9, 11, 13, 15 > > 1. Compute the rank R of the 50th percentile. > R = P / 100 * (N + 1) > P = the desired percentile > N = number of numbers in the collection > > R = 50 / 100 * (8 + 1) = 9 / 2 = 4.5 > 2. When R is an integer, the Pth percentile would be the number with > rank R. > 3. When R is not an integer, as in this case, the Pth percentile is > computed by interpolation as follows: > a. Define IR as the integer portion of R. For this example, IR is 4. > b. Define FR as the fractional portion of R. For this example, FR is .5. > c. Find the values in the sequence with rank IR and IR + 1. For example, > this means the value with rank 4 and the value with rank 5, which > are 8 and 9, respectively. > d. Interpolate by multiplying the difference between the scores by FR > and adding the result to the lower score. For this example, this is > .5 * (9 - 8) + 8 = 8.5 > > Therefore, the 50th percentile is 8.5. > > CREATE TABLE T > ( > sls_dlrs DECIMAL (8, 2) NOT NULL CHECK (sls_dlrs > 0), > sls_unts INT NOT NULL CHECK (sls_unts > 0) > ) > > CREATE VIEW TPrices (sls_dlrs, sls_unts, price) > AS > SELECT sls_dlrs, sls_unts, sls_dlrs / sls_unts > FROM T > > -- Rank prices in increasing order. For a given price P in an ordered collection, > -- give both the rank of the first occurrence of P and the rank of the last > -- occurrence of P. The first price P in an ordered collection has rank 1. > CREATE VIEW IncreasingOrderRanks (price, rank_first, rank_last) > AS > SELECT P1.price, > COUNT(*) - P1.occurrences + 1, > COUNT(*) > FROM (SELECT price, COUNT(*) AS occurrences > FROM TPrices > GROUP BY price) AS P1 > INNER JOIN > TPrices AS P2 > ON P2.price <= P1.price > GROUP BY P1.price, P1.occurrences > > CREATE TABLE Digits > ( > d INT NOT NULL PRIMARY KEY CHECK (d BETWEEN 0 AND 9) > ) > > INSERT INTO Digits (d) > VALUES (0) > INSERT INTO Digits (d) > VALUES (1) > INSERT INTO Digits (d) > VALUES (2) > INSERT INTO Digits (d) > VALUES (3) > INSERT INTO Digits (d) > VALUES (4) > INSERT INTO Digits (d) > VALUES (5) > INSERT INTO Digits (d) > VALUES (6) > INSERT INTO Digits (d) > VALUES (7) > INSERT INTO Digits (d) > VALUES (8) > INSERT INTO Digits (d) > VALUES (9) > > -- Need to generate natural numbers up to 100 to represent whole number > -- percentile ranks > CREATE VIEW PercentileRanks (rank) > AS > SELECT Ones.d + 10 * Tens.d + 1 > FROM Digits AS Ones > CROSS JOIN > Digits AS Tens > > CREATE VIEW Percentiles (percentile, rank) > AS > SELECT CASE WHEN PR.rank * (RN.rank + 1) / 100.0 < R1.rank > THEN R1.price > WHEN PR.rank * (RN.rank + 1) / 100.0 > RN.rank > THEN RN.price > WHEN PR.rank * (RN.rank + 1) / 100.0 > > CAST(PR.rank * (RN.rank + 1) / 100.0 AS INT) > THEN > (SELECT CASE WHEN R1.price = R2.price > THEN R1.price > ELSE (PR.rank * (RN.rank + 1) / 100.0 - > CAST(PR.rank * (RN.rank + 1) / 100.0 AS INT)) * > (R2.price - R1.price) + R1.price > END > FROM IncreasingOrderRanks AS R1 > INNER JOIN > IncreasingOrderRanks AS R2 > ON CAST(PR.rank * (RN.rank + 1) / 100.0 AS INT) > BETWEEN R1.rank_first AND R1.rank_last AND > CAST(PR.rank * (RN.rank + 1) / 100.0 AS INT) + 1 > BETWEEN R2.rank_first AND R2.rank_last) > ELSE (SELECT price > FROM IncreasingOrderRanks > WHERE CAST(PR.rank * (RN.rank + 1) / 100.0 AS INT) > BETWEEN rank_first AND rank_last) > END, > PR.rank > FROM PercentileRanks AS PR > CROSS JOIN > (SELECT 1, price > FROM IncreasingOrderRanks > WHERE rank_first = 1) AS R1(rank, price) > CROSS JOIN > (SELECT MAX(rank_last), MAX(price) > FROM IncreasingOrderRanks) AS RN(rank, price) > > CREATE VIEW Deciles (decile, rank) > AS > SELECT percentile, rank / 10 > FROM Percentiles > WHERE rank % 10 = 0 > > -- The desired query > SELECT D1.rank, AVG(P.price) AS avg_price, AVG(P.sls_unts) AS avg_sls_unts > FROM Deciles AS D1 > LEFT OUTER JOIN > Deciles AS D2 > ON D2.rank = D1.rank - 1 > INNER JOIN > TPrices AS P > ON P.price <= D1.decile AND > (D2.decile IS NULL OR P.price > D2.decile) > GROUP BY D1.rank > ORDER BY D1.rank > > Regards, > jag Thanks for your detailed and comprehensive answer, jag. I am having trouble creating the Percentiles view, however. I am running SQL Server locally on a desktop computer, so I think this command may be too much for it -- I get timeout messages. Is there a way I can split up this command to create the Percentiles view? Thanks again, Indraneel |
| ||||
| "Indraneel Sheorey" <indraneel.sheorey@dartmouth.edu> wrote in message news:3556ef5.0311151130.73f293e4@posting.google.co m... > "John Gilson" <jag@acm.org> wrote in message news:<REktb.175923$pT1.22236@twister.nyc.rr.com>.. . > > "Indraneel Sheorey" <indraneel.sheorey@dartmouth.edu> wrote in message > > news:3556ef5.0311141040.1e38d2a8@posting.google.co m... > > > Hello, > > > > > > I want to set up a query in SQL Server that is "grouped by" a variable > > > PRICE. Since PRICE takes on continuous decimal values, I want to > > > create deciles based on this variable and then display the average > > > price in each decile. As background, PRICE is a calculated quantity: > > > I divide a table field SLS_DLRS by a field SLS_UNTS to get it. I also > > > want to include an average SLS_UNTS for each decile. > > > > > > So essentially, I want the result to be something like: > > > DECILE AVG_PRICE AVG_SLS_UNTS > > > 1 0.50 5.2 > > > 2 1.50 4.7 > > > ... ... ... > > > 10 9.50 1.1 > > > > > > Is there a way to do this in an SQL statement? > > > > > > Thanks in advance, > > > Indraneel > > > > There are different methods, with varying degrees of accuracy, to > > calculate percentiles. Here's one reasonable method described through > > an example. Consider the 50th percentile (5th decile or median) for the > > following numbers, arranged in increasing order: > > > > 3, 5, 7, 8, 9, 11, 13, 15 > > > > 1. Compute the rank R of the 50th percentile. > > R = P / 100 * (N + 1) > > P = the desired percentile > > N = number of numbers in the collection > > > > R = 50 / 100 * (8 + 1) = 9 / 2 = 4.5 > > 2. When R is an integer, the Pth percentile would be the number with > > rank R. > > 3. When R is not an integer, as in this case, the Pth percentile is > > computed by interpolation as follows: > > a. Define IR as the integer portion of R. For this example, IR is 4. > > b. Define FR as the fractional portion of R. For this example, FR is .5. > > c. Find the values in the sequence with rank IR and IR + 1. For example, > > this means the value with rank 4 and the value with rank 5, which > > are 8 and 9, respectively. > > d. Interpolate by multiplying the difference between the scores by FR > > and adding the result to the lower score. For this example, this is > > .5 * (9 - 8) + 8 = 8.5 > > > > Therefore, the 50th percentile is 8.5. > > > > CREATE TABLE T > > ( > > sls_dlrs DECIMAL (8, 2) NOT NULL CHECK (sls_dlrs > 0), > > sls_unts INT NOT NULL CHECK (sls_unts > 0) > > ) > > > > CREATE VIEW TPrices (sls_dlrs, sls_unts, price) > > AS > > SELECT sls_dlrs, sls_unts, sls_dlrs / sls_unts > > FROM T > > > > -- Rank prices in increasing order. For a given price P in an ordered collection, > > -- give both the rank of the first occurrence of P and the rank of the last > > -- occurrence of P. The first price P in an ordered collection has rank 1. > > CREATE VIEW IncreasingOrderRanks (price, rank_first, rank_last) > > AS > > SELECT P1.price, > > COUNT(*) - P1.occurrences + 1, > > COUNT(*) > > FROM (SELECT price, COUNT(*) AS occurrences > > FROM TPrices > > GROUP BY price) AS P1 > > INNER JOIN > > TPrices AS P2 > > ON P2.price <= P1.price > > GROUP BY P1.price, P1.occurrences > > > > CREATE TABLE Digits > > ( > > d INT NOT NULL PRIMARY KEY CHECK (d BETWEEN 0 AND 9) > > ) > > > > INSERT INTO Digits (d) > > VALUES (0) > > INSERT INTO Digits (d) > > VALUES (1) > > INSERT INTO Digits (d) > > VALUES (2) > > INSERT INTO Digits (d) > > VALUES (3) > > INSERT INTO Digits (d) > > VALUES (4) > > INSERT INTO Digits (d) > > VALUES (5) > > INSERT INTO Digits (d) > > VALUES (6) > > INSERT INTO Digits (d) > > VALUES (7) > > INSERT INTO Digits (d) > > VALUES (8) > > INSERT INTO Digits (d) > > VALUES (9) > > > > -- Need to generate natural numbers up to 100 to represent whole number > > -- percentile ranks > > CREATE VIEW PercentileRanks (rank) > > AS > > SELECT Ones.d + 10 * Tens.d + 1 > > FROM Digits AS Ones > > CROSS JOIN > > Digits AS Tens > > > > CREATE VIEW Percentiles (percentile, rank) > > AS > > SELECT CASE WHEN PR.rank * (RN.rank + 1) / 100.0 < R1.rank > > THEN R1.price > > WHEN PR.rank * (RN.rank + 1) / 100.0 > RN.rank > > THEN RN.price > > WHEN PR.rank * (RN.rank + 1) / 100.0 > > > CAST(PR.rank * (RN.rank + 1) / 100.0 AS INT) > > THEN > > (SELECT CASE WHEN R1.price = R2.price > > THEN R1.price > > ELSE (PR.rank * (RN.rank + 1) / 100.0 - > > CAST(PR.rank * (RN.rank + 1) / 100.0 AS INT)) * > > (R2.price - R1.price) + R1.price > > END > > FROM IncreasingOrderRanks AS R1 > > INNER JOIN > > IncreasingOrderRanks AS R2 > > ON CAST(PR.rank * (RN.rank + 1) / 100.0 AS INT) > > BETWEEN R1.rank_first AND R1.rank_last AND > > CAST(PR.rank * (RN.rank + 1) / 100.0 AS INT) + 1 > > BETWEEN R2.rank_first AND R2.rank_last) > > ELSE (SELECT price > > FROM IncreasingOrderRanks > > WHERE CAST(PR.rank * (RN.rank + 1) / 100.0 AS INT) > > BETWEEN rank_first AND rank_last) > > END, > > PR.rank > > FROM PercentileRanks AS PR > > CROSS JOIN > > (SELECT 1, price > > FROM IncreasingOrderRanks > > WHERE rank_first = 1) AS R1(rank, price) > > CROSS JOIN > > (SELECT MAX(rank_last), MAX(price) > > FROM IncreasingOrderRanks) AS RN(rank, price) > > > > CREATE VIEW Deciles (decile, rank) > > AS > > SELECT percentile, rank / 10 > > FROM Percentiles > > WHERE rank % 10 = 0 > > > > -- The desired query > > SELECT D1.rank, AVG(P.price) AS avg_price, AVG(P.sls_unts) AS avg_sls_unts > > FROM Deciles AS D1 > > LEFT OUTER JOIN > > Deciles AS D2 > > ON D2.rank = D1.rank - 1 > > INNER JOIN > > TPrices AS P > > ON P.price <= D1.decile AND > > (D2.decile IS NULL OR P.price > D2.decile) > > GROUP BY D1.rank > > ORDER BY D1.rank > > > > Regards, > > jag > > Thanks for your detailed and comprehensive answer, jag. I am having > trouble creating the Percentiles view, however. I am running SQL > Server locally on a desktop computer, so I think this command may be > too much for it -- I get timeout messages. Is there a way I can split > up this command to create the Percentiles view? > > Thanks again, > Indraneel Some appropriate storing and indexing of intermediate results should help. Apply the following revisions to the above code. -- *** ADD *** CREATE TABLE TPricesBag ( sls_dlrs DECIMAL (8, 2) NOT NULL, sls_unts INT NOT NULL, price DECIMAL (8, 2) NOT NULL ) -- *** ADD *** INSERT INTO TPricesBag (sls_dlrs, sls_unts, price) SELECT sls_dlrs, sls_unts, price FROM TPrices -- *** ADD *** CREATE TABLE TPricesSet ( price DECIMAL (8, 2) NOT NULL, occurrences INT NOT NULL, PRIMARY KEY (price) ) -- *** ADD *** INSERT INTO TPricesSet (price, occurrences) SELECT price, COUNT(*) FROM TPricesBag GROUP BY price -- *** REPLACE *** -- Rank prices in increasing order. For a given price P in an ordered collection, -- give both the rank of the first occurrence of P and the rank of the last -- occurrence of P. The first price P in an ordered collection has rank 1. CREATE VIEW IncreasingOrderRanks (price, rank_first, rank_last) AS SELECT P1.price, COUNT(*) - P1.occurrences + 1, COUNT(*) FROM TPricesSet AS P1 INNER JOIN TPricesBag AS P2 ON P2.price <= P1.price GROUP BY P1.price, P1.occurrences -- *** ADD *** CREATE TABLE Ranks ( price DECIMAL (8, 2) NOT NULL, rank_first INT NOT NULL CHECK (rank_first >= 1), rank_last INT NOT NULL CHECK (rank_last >= 1), CHECK (rank_last >= rank_first), PRIMARY KEY (rank_first) ) -- *** ADD *** INSERT INTO Ranks (price, rank_first, rank_last) SELECT price, rank_first, rank_last FROM IncreasingOrderRanks -- *** REPLACE *** CREATE VIEW Percentiles (percentile, rank) AS SELECT CASE WHEN PR.rank * (RN.rank + 1) / 100.0 < R1.rank THEN R1.price WHEN PR.rank * (RN.rank + 1) / 100.0 > RN.rank THEN RN.price WHEN PR.rank * (RN.rank + 1) / 100.0 > CAST(PR.rank * (RN.rank + 1) / 100.0 AS INT) THEN (SELECT CASE WHEN R1.price = R2.price THEN R1.price ELSE (PR.rank * (RN.rank + 1) / 100.0 - CAST(PR.rank * (RN.rank + 1) / 100.0 AS INT)) * (R2.price - R1.price) + R1.price END FROM Ranks AS R1 INNER JOIN Ranks AS R2 ON CAST(PR.rank * (RN.rank + 1) / 100.0 AS INT) BETWEEN R1.rank_first AND R1.rank_last AND CAST(PR.rank * (RN.rank + 1) / 100.0 AS INT) + 1 BETWEEN R2.rank_first AND R2.rank_last) ELSE (SELECT price FROM Ranks WHERE CAST(PR.rank * (RN.rank + 1) / 100.0 AS INT) BETWEEN rank_first AND rank_last) END, PR.rank FROM PercentileRanks AS PR CROSS JOIN (SELECT 1, price FROM Ranks WHERE rank_first = 1) AS R1(rank, price) CROSS JOIN (SELECT MAX(rank_last), MAX(price) FROM Ranks) AS RN(rank, price) -- *** REPLACE *** -- The desired query SELECT D1.rank, AVG(P.price) AS avg_price, AVG(P.sls_unts) AS avg_sls_unts FROM Deciles AS D1 LEFT OUTER JOIN Deciles AS D2 ON D2.rank = D1.rank - 1 INNER JOIN TPricesBag AS P ON P.price <= D1.decile AND (D2.decile IS NULL OR P.price > D2.decile) GROUP BY D1.rank ORDER BY D1.rank Given a table T with 10,000 rows (randomly generated data) on a 1.7 GHz P4 with 500 MB RAM, the above query took a bit over 1 minute. Not blazing, hopefully acceptable. Regards, jag |