View Single Post

   
  #4 (permalink)  
Old 02-28-2008, 07:49 PM
John Gilson
 
Posts: n/a
Default Re: Creating Deciles

"Indraneel Sheorey" <indraneel.sheorey@dartmouth.edu> wrote in message
news:3556ef5.0311151130.73f293e4@posting.google.co m...
> "John Gilson" <jag@acm.org> wrote in message news:<REktb.175923$pT1.22236@twister.nyc.rr.com>.. .
> > "Indraneel Sheorey" <indraneel.sheorey@dartmouth.edu> wrote in message
> > news:3556ef5.0311141040.1e38d2a8@posting.google.co m...
> > > Hello,
> > >
> > > I want to set up a query in SQL Server that is "grouped by" a variable
> > > PRICE. Since PRICE takes on continuous decimal values, I want to
> > > create deciles based on this variable and then display the average
> > > price in each decile. As background, PRICE is a calculated quantity:
> > > I divide a table field SLS_DLRS by a field SLS_UNTS to get it. I also
> > > want to include an average SLS_UNTS for each decile.
> > >
> > > So essentially, I want the result to be something like:
> > > DECILE AVG_PRICE AVG_SLS_UNTS
> > > 1 0.50 5.2
> > > 2 1.50 4.7
> > > ... ... ...
> > > 10 9.50 1.1
> > >
> > > Is there a way to do this in an SQL statement?
> > >
> > > Thanks in advance,
> > > Indraneel

> >
> > There are different methods, with varying degrees of accuracy, to
> > calculate percentiles. Here's one reasonable method described through
> > an example. Consider the 50th percentile (5th decile or median) for the
> > following numbers, arranged in increasing order:
> >
> > 3, 5, 7, 8, 9, 11, 13, 15
> >
> > 1. Compute the rank R of the 50th percentile.
> > R = P / 100 * (N + 1)
> > P = the desired percentile
> > N = number of numbers in the collection
> >
> > R = 50 / 100 * (8 + 1) = 9 / 2 = 4.5
> > 2. When R is an integer, the Pth percentile would be the number with
> > rank R.
> > 3. When R is not an integer, as in this case, the Pth percentile is
> > computed by interpolation as follows:
> > a. Define IR as the integer portion of R. For this example, IR is 4.
> > b. Define FR as the fractional portion of R. For this example, FR is .5.
> > c. Find the values in the sequence with rank IR and IR + 1. For example,
> > this means the value with rank 4 and the value with rank 5, which
> > are 8 and 9, respectively.
> > d. Interpolate by multiplying the difference between the scores by FR
> > and adding the result to the lower score. For this example, this is
> > .5 * (9 - 8) + 8 = 8.5
> >
> > Therefore, the 50th percentile is 8.5.
> >
> > CREATE TABLE T
> > (
> > sls_dlrs DECIMAL (8, 2) NOT NULL CHECK (sls_dlrs > 0),
> > sls_unts INT NOT NULL CHECK (sls_unts > 0)
> > )
> >
> > CREATE VIEW TPrices (sls_dlrs, sls_unts, price)
> > AS
> > SELECT sls_dlrs, sls_unts, sls_dlrs / sls_unts
> > FROM T
> >
> > -- Rank prices in increasing order. For a given price P in an ordered collection,
> > -- give both the rank of the first occurrence of P and the rank of the last
> > -- occurrence of P. The first price P in an ordered collection has rank 1.
> > CREATE VIEW IncreasingOrderRanks (price, rank_first, rank_last)
> > AS
> > SELECT P1.price,
> > COUNT(*) - P1.occurrences + 1,
> > COUNT(*)
> > FROM (SELECT price, COUNT(*) AS occurrences
> > FROM TPrices
> > GROUP BY price) AS P1
> > INNER JOIN
> > TPrices AS P2
> > ON P2.price <= P1.price
> > GROUP BY P1.price, P1.occurrences
> >
> > CREATE TABLE Digits
> > (
> > d INT NOT NULL PRIMARY KEY CHECK (d BETWEEN 0 AND 9)
> > )
> >
> > INSERT INTO Digits (d)
> > VALUES (0)
> > INSERT INTO Digits (d)
> > VALUES (1)
> > INSERT INTO Digits (d)
> > VALUES (2)
> > INSERT INTO Digits (d)
> > VALUES (3)
> > INSERT INTO Digits (d)
> > VALUES (4)
> > INSERT INTO Digits (d)
> > VALUES (5)
> > INSERT INTO Digits (d)
> > VALUES (6)
> > INSERT INTO Digits (d)
> > VALUES (7)
> > INSERT INTO Digits (d)
> > VALUES (8)
> > INSERT INTO Digits (d)
> > VALUES (9)
> >
> > -- Need to generate natural numbers up to 100 to represent whole number
> > -- percentile ranks
> > CREATE VIEW PercentileRanks (rank)
> > AS
> > SELECT Ones.d + 10 * Tens.d + 1
> > FROM Digits AS Ones
> > CROSS JOIN
> > Digits AS Tens
> >
> > CREATE VIEW Percentiles (percentile, rank)
> > AS
> > SELECT CASE WHEN PR.rank * (RN.rank + 1) / 100.0 < R1.rank
> > THEN R1.price
> > WHEN PR.rank * (RN.rank + 1) / 100.0 > RN.rank
> > THEN RN.price
> > WHEN PR.rank * (RN.rank + 1) / 100.0 >
> > CAST(PR.rank * (RN.rank + 1) / 100.0 AS INT)
> > THEN
> > (SELECT CASE WHEN R1.price = R2.price
> > THEN R1.price
> > ELSE (PR.rank * (RN.rank + 1) / 100.0 -
> > CAST(PR.rank * (RN.rank + 1) / 100.0 AS INT))

*
> > (R2.price - R1.price) + R1.price
> > END
> > FROM IncreasingOrderRanks AS R1
> > INNER JOIN
> > IncreasingOrderRanks AS R2
> > ON CAST(PR.rank * (RN.rank + 1) / 100.0 AS INT)
> > BETWEEN R1.rank_first AND R1.rank_last AND
> > CAST(PR.rank * (RN.rank + 1) / 100.0 AS INT) + 1
> > BETWEEN R2.rank_first AND R2.rank_last)
> > ELSE (SELECT price
> > FROM IncreasingOrderRanks
> > WHERE CAST(PR.rank * (RN.rank + 1) / 100.0 AS INT)
> > BETWEEN rank_first AND rank_last)
> > END,
> > PR.rank
> > FROM PercentileRanks AS PR
> > CROSS JOIN
> > (SELECT 1, price
> > FROM IncreasingOrderRanks
> > WHERE rank_first = 1) AS R1(rank, price)
> > CROSS JOIN
> > (SELECT MAX(rank_last), MAX(price)
> > FROM IncreasingOrderRanks) AS RN(rank, price)
> >
> > CREATE VIEW Deciles (decile, rank)
> > AS
> > SELECT percentile, rank / 10
> > FROM Percentiles
> > WHERE rank % 10 = 0
> >
> > -- The desired query
> > SELECT D1.rank, AVG(P.price) AS avg_price, AVG(P.sls_unts) AS avg_sls_unts
> > FROM Deciles AS D1
> > LEFT OUTER JOIN
> > Deciles AS D2
> > ON D2.rank = D1.rank - 1
> > INNER JOIN
> > TPrices AS P
> > ON P.price <= D1.decile AND
> > (D2.decile IS NULL OR P.price > D2.decile)
> > GROUP BY D1.rank
> > ORDER BY D1.rank
> >
> > Regards,
> > jag

>
> Thanks for your detailed and comprehensive answer, jag. I am having
> trouble creating the Percentiles view, however. I am running SQL
> Server locally on a desktop computer, so I think this command may be
> too much for it -- I get timeout messages. Is there a way I can split
> up this command to create the Percentiles view?
>
> Thanks again,
> Indraneel


Some appropriate storing and indexing of intermediate results should help.
Apply the following revisions to the above code.

-- *** ADD ***
CREATE TABLE TPricesBag
(
sls_dlrs DECIMAL (8, 2) NOT NULL,
sls_unts INT NOT NULL,
price DECIMAL (8, 2) NOT NULL
)

-- *** ADD ***
INSERT INTO TPricesBag (sls_dlrs, sls_unts, price)
SELECT sls_dlrs, sls_unts, price
FROM TPrices

-- *** ADD ***
CREATE TABLE TPricesSet
(
price DECIMAL (8, 2) NOT NULL,
occurrences INT NOT NULL,
PRIMARY KEY (price)
)

-- *** ADD ***
INSERT INTO TPricesSet (price, occurrences)
SELECT price, COUNT(*)
FROM TPricesBag
GROUP BY price

-- *** REPLACE ***
-- Rank prices in increasing order. For a given price P in an ordered collection,
-- give both the rank of the first occurrence of P and the rank of the last
-- occurrence of P. The first price P in an ordered collection has rank 1.
CREATE VIEW IncreasingOrderRanks (price, rank_first, rank_last)
AS
SELECT P1.price,
COUNT(*) - P1.occurrences + 1,
COUNT(*)
FROM TPricesSet AS P1
INNER JOIN
TPricesBag AS P2
ON P2.price <= P1.price
GROUP BY P1.price, P1.occurrences

-- *** ADD ***
CREATE TABLE Ranks
(
price DECIMAL (8, 2) NOT NULL,
rank_first INT NOT NULL CHECK (rank_first >= 1),
rank_last INT NOT NULL CHECK (rank_last >= 1),
CHECK (rank_last >= rank_first),
PRIMARY KEY (rank_first)
)

-- *** ADD ***
INSERT INTO Ranks (price, rank_first, rank_last)
SELECT price, rank_first, rank_last
FROM IncreasingOrderRanks

-- *** REPLACE ***
CREATE VIEW Percentiles (percentile, rank)
AS
SELECT CASE
WHEN PR.rank * (RN.rank + 1) / 100.0 < R1.rank
THEN R1.price
WHEN PR.rank * (RN.rank + 1) / 100.0 > RN.rank
THEN RN.price
WHEN PR.rank * (RN.rank + 1) / 100.0 >
CAST(PR.rank * (RN.rank + 1) / 100.0 AS INT)
THEN
(SELECT CASE WHEN R1.price = R2.price
THEN R1.price
ELSE (PR.rank * (RN.rank + 1) / 100.0 -
CAST(PR.rank * (RN.rank + 1) / 100.0 AS INT)) *
(R2.price - R1.price) + R1.price
END
FROM Ranks AS R1
INNER JOIN
Ranks AS R2
ON CAST(PR.rank * (RN.rank + 1) / 100.0 AS INT)
BETWEEN R1.rank_first AND R1.rank_last AND
CAST(PR.rank * (RN.rank + 1) / 100.0 AS INT) + 1
BETWEEN R2.rank_first AND R2.rank_last)
ELSE (SELECT price
FROM Ranks
WHERE CAST(PR.rank * (RN.rank + 1) / 100.0 AS INT)
BETWEEN rank_first AND rank_last)
END,
PR.rank
FROM PercentileRanks AS PR
CROSS JOIN
(SELECT 1, price
FROM Ranks
WHERE rank_first = 1) AS R1(rank, price)
CROSS JOIN
(SELECT MAX(rank_last), MAX(price)
FROM Ranks) AS RN(rank, price)

-- *** REPLACE ***
-- The desired query
SELECT D1.rank, AVG(P.price) AS avg_price,
AVG(P.sls_unts) AS avg_sls_unts
FROM Deciles AS D1
LEFT OUTER JOIN
Deciles AS D2
ON D2.rank = D1.rank - 1
INNER JOIN
TPricesBag AS P
ON P.price <= D1.decile AND
(D2.decile IS NULL OR P.price > D2.decile)
GROUP BY D1.rank
ORDER BY D1.rank

Given a table T with 10,000 rows (randomly generated data) on a 1.7 GHz
P4 with 500 MB RAM, the above query took a bit over 1 minute. Not
blazing, hopefully acceptable.

Regards,
jag


Reply With Quote