Re: Count unique values from overlapping values On Mar 11, 3:17 pm, "Rik Wasmus" <luiheidsgoe...@hotmail.com> wrote:
> On Tue, 11 Mar 2008 14:55:10 +0100, Kim <kims...@gmail.com> wrote:
> > On Mar 11, 2:11 pm, "Rik Wasmus" <luiheidsgoe...@hotmail.com> wrote:
> >> On Tue, 11 Mar 2008 13:57:37 +0100, Kim <kims...@gmail.com> wrote:
> >> > On Mar 11, 11:25 am, "Rik Wasmus" <luiheidsgoe...@hotmail.com> wrote:
> >> >> On Tue, 11 Mar 2008 11:20:06 +0100, Kim <kims...@gmail.com> wrote:
> >> >> > Sample data:
> >> >> > colA,colB,sum
> >> >> > 33996344,33996351,8
> >> >> > 50331648,67276831,16945184
> >> >> > 50331648,68257567,17925920
> >> >> > 67276832,67276847,16
> >> >> > 67276848,67277023,176
> >> >> > 67277024,67277031,8
>
> >> >> > Wanted output:
> >> >> > 33996344,33996351,8
> >> >> > 50331648,68257567,17925920
>
> >> >> > Sum is done as follows (colB-colA+1).
> >> >> > Any idea how to get this outcome ?
>
> >> >> > This will give incorrect unique count, because it simply sums all
> >> up.
> >> >> > SELECT sum(colB-colA+1) FROM tbl
>
> >> >> If this is not what you want:
> >> >> SELECT sum(colB-colA+1) FROM tbl
> >> >> GROUP BY colA,colB
>
> >> >> ... you'll have to explain the logic further, as I can't really see
> >> it.
> >> >> What overlaps what, how, an how should that be handled?
>
> >> > To Rik Wasmus:
> >> > Its not what I want.
> >> > If you look at the sample when you should see 6 rows. Of these only
> >> > the 2 listed in "wanted output" is unique.
>
> >> > Here is why:
> >> > Row 1 is unique already, so no filtering needed.
> >> > Row 2 is not unique because row 3 contains the same + more and
> >> > therefor needs to be filtered out.
> >> > Row 3 is unique because it has the largest range with no overlapping.
> >> > Row 4-6 are all in the range which row 3 has and therefor needs to be
> >> > filtered out.
>
> >> So, if I understand correctly, you want the 'largest' range, and skip
> >> all
> >> records that fall in a bigger range?
>
> >> If that's the case, what should happen with:
> >> 1 4
> >> 2 6
> >> 1 3
> >> 8 12
> >> 9 10
>
> >> The problem here is row 2 falls partly in row 1, but I assume:
> >> 1 4
> >> 2 6
> >> 8 12
>
> >> If I assume these should count as different rows, I'd say this one:
> >> SELECT
> >> a.colA, a.colB, (a.colB-a.colA+1)
> >> FROM tablename a
> >> LEFT JOIN tablename b
> >> ON b.colA <= a.colA
> >> AND b.colA >= a.colA
> >> # having a pk would help
> >> # now we have to make sure it
> >> # doesn't match on it's own row:
> >> AND NOT (a.colA = b.colA AND a.colB = b.colB)
> >> WHERE b.colA IS NULL
>
> > Your understanding is correct, as is your assumption.
>
> > I had not yet considered the way you outline of a range overlapping
> > another range. I now see dark future of doing this without looping...
>
> > Regarding your query on sample data, this is what I get:
> > 33996344,33996351,8
> > 67276832,67276847,16
> > 67276848,67277023,176
> > 67277024,67277031,8
>
> Hmmz, just an error in typing it up, the second JOIN clause should
> offcourse be on colB:
> mysql> SELECT
> -> a.colA, a.colB, (a.colB-a.colA+1)
> -> FROM tester a
> -> LEFT JOIN tester b
> -> ON b.colA <= a.colA
> -> AND b.colB >= a.colB
> -> # having a pk would help
> -> # now we have to make sure it
> -> # doesn't match on it's own row:
> -> AND NOT (a.colA = b.colA AND a.colB = b.colB)
> -> WHERE b.colA IS NULL;
> +----------+----------+-------------------+
> | colA | colB | (a.colB-a.colA+1) |
> +----------+----------+-------------------+
> | 33996344 | 33996351 | 8 |
> | 50331648 | 68257567 | 17925920 |
> +----------+----------+-------------------+
> 2 rows in set (0.00 sec)
> --
> Rik Wasmus
Yours actually work faster than Luuk's.
However the time it takes to sum up unique values on 10.000 rows takes
94sec, which isnt fast in any way. I dare not think of how much time
it will take for millions of rows.
SELECT
sum(a.colB-a.colA+1)
FROM tbl a
LEFT JOIN tbl b
ON b.colA <= a.colA
AND b.colB >= a.colB
AND NOT (a.colA = b.colA AND a.colB = b.colB)
WHERE b.colA IS NULL
Why is the where clause there ? If its avoid NULL values, then its not
a problem since there are none in tbl.
Therefor am I trying the SP way I described in my last post. |