This is a discussion on Count unique values from overlapping values within the MySQL forums, part of the Database Server Software category; --> Kim schreef: > On Mar 11, 3:17 pm, "Rik Wasmus" <luiheidsgoe...@hotmail.com> wrote: >> On Tue, 11 Mar 2008 14:55:10 ...
| |||||||
| FAQ | Members List | Calendar | Search | Today's Posts | Mark Forums Read |
| ||||
| Kim schreef: > On Mar 11, 3:17 pm, "Rik Wasmus" <luiheidsgoe...@hotmail.com> wrote: >> On Tue, 11 Mar 2008 14:55:10 +0100, Kim <kims...@gmail.com> wrote: >>> On Mar 11, 2:11 pm, "Rik Wasmus" <luiheidsgoe...@hotmail.com> wrote: >>>> On Tue, 11 Mar 2008 13:57:37 +0100, Kim <kims...@gmail.com> wrote: >>>>> On Mar 11, 11:25 am, "Rik Wasmus" <luiheidsgoe...@hotmail.com> wrote: >>>>>> On Tue, 11 Mar 2008 11:20:06 +0100, Kim <kims...@gmail.com> wrote: >>>>>>> Sample data: >>>>>>> colA,colB,sum >>>>>>> 33996344,33996351,8 >>>>>>> 50331648,67276831,16945184 >>>>>>> 50331648,68257567,17925920 >>>>>>> 67276832,67276847,16 >>>>>>> 67276848,67277023,176 >>>>>>> 67277024,67277031,8 >>>>>>> Wanted output: >>>>>>> 33996344,33996351,8 >>>>>>> 50331648,68257567,17925920 >>>>>>> Sum is done as follows (colB-colA+1). >>>>>>> Any idea how to get this outcome ? >>>>>>> This will give incorrect unique count, because it simply sums all >>>> up. >>>>>>> SELECT sum(colB-colA+1) FROM tbl >>>>>> If this is not what you want: >>>>>> SELECT sum(colB-colA+1) FROM tbl >>>>>> GROUP BY colA,colB >>>>>> ... you'll have to explain the logic further, as I can't really see >>>> it. >>>>>> What overlaps what, how, an how should that be handled? >>>>> To Rik Wasmus: >>>>> Its not what I want. >>>>> If you look at the sample when you should see 6 rows. Of these only >>>>> the 2 listed in "wanted output" is unique. >>>>> Here is why: >>>>> Row 1 is unique already, so no filtering needed. >>>>> Row 2 is not unique because row 3 contains the same + more and >>>>> therefor needs to be filtered out. >>>>> Row 3 is unique because it has the largest range with no overlapping. >>>>> Row 4-6 are all in the range which row 3 has and therefor needs to be >>>>> filtered out. >>>> So, if I understand correctly, you want the 'largest' range, and skip >>>> all >>>> records that fall in a bigger range? >>>> If that's the case, what should happen with: >>>> 1 4 >>>> 2 6 >>>> 1 3 >>>> 8 12 >>>> 9 10 >>>> The problem here is row 2 falls partly in row 1, but I assume: >>>> 1 4 >>>> 2 6 >>>> 8 12 >>>> If I assume these should count as different rows, I'd say this one: >>>> SELECT >>>> a.colA, a.colB, (a.colB-a.colA+1) >>>> FROM tablename a >>>> LEFT JOIN tablename b >>>> ON b.colA <= a.colA >>>> AND b.colA >= a.colA >>>> # having a pk would help >>>> # now we have to make sure it >>>> # doesn't match on it's own row: >>>> AND NOT (a.colA = b.colA AND a.colB = b.colB) >>>> WHERE b.colA IS NULL >>> Your understanding is correct, as is your assumption. >>> I had not yet considered the way you outline of a range overlapping >>> another range. I now see dark future of doing this without looping... >>> Regarding your query on sample data, this is what I get: >>> 33996344,33996351,8 >>> 67276832,67276847,16 >>> 67276848,67277023,176 >>> 67277024,67277031,8 >> Hmmz, just an error in typing it up, the second JOIN clause should >> offcourse be on colB: >> mysql> SELECT >> -> a.colA, a.colB, (a.colB-a.colA+1) >> -> FROM tester a >> -> LEFT JOIN tester b >> -> ON b.colA <= a.colA >> -> AND b.colB >= a.colB >> -> # having a pk would help >> -> # now we have to make sure it >> -> # doesn't match on it's own row: >> -> AND NOT (a.colA = b.colA AND a.colB = b.colB) >> -> WHERE b.colA IS NULL; >> +----------+----------+-------------------+ >> | colA | colB | (a.colB-a.colA+1) | >> +----------+----------+-------------------+ >> | 33996344 | 33996351 | 8 | >> | 50331648 | 68257567 | 17925920 | >> +----------+----------+-------------------+ >> 2 rows in set (0.00 sec) >> -- >> Rik Wasmus > > Yours actually work faster than Luuk's. its because the inner join works better than the subquery.. ;-) > However the time it takes to sum up unique values on 10.000 rows takes > 94sec, which isnt fast in any way. I dare not think of how much time > it will take for millions of rows. > SELECT > sum(a.colB-a.colA+1) > FROM tbl a > LEFT JOIN tbl b > ON b.colA <= a.colA > AND b.colB >= a.colB > AND NOT (a.colA = b.colA AND a.colB = b.colB) > WHERE b.colA IS NULL > > Why is the where clause there ? If its avoid NULL values, then its not > a problem since there are none in tbl. > > Therefor am I trying the SP way I described in my last post. for more speed try to read this part of Riks' post: >> -> # having a pk would help >> -> # now we have to make sure it >> -> # doesn't match on it's own row: >> -> AND NOT (a.colA = b.colA AND a.colB = b.colB) if you set up a pk (primary key), you can leave out the "AND NOT (a.colA = b.colA AND a.colB = b.colB)" part, which should speed up things... -- Luuk |
| |||
| On Wed, 12 Mar 2008 12:29:46 +0100, Luuk <Luuk@invalid.lan> wrote: > Kim schreef: >> On Mar 11, 3:17 pm, "Rik Wasmus" <luiheidsgoe...@hotmail.com> wrote: >>> On Tue, 11 Mar 2008 14:55:10 +0100, Kim <kims...@gmail.com> wrote: >>>> On Mar 11, 2:11 pm, "Rik Wasmus" <luiheidsgoe...@hotmail.com> wrote: >>>>> On Tue, 11 Mar 2008 13:57:37 +0100, Kim <kims...@gmail.com> wrote: >>>>>> On Mar 11, 11:25 am, "Rik Wasmus" <luiheidsgoe...@hotmail.com> >>>>>> wrote: >>>>>>> On Tue, 11 Mar 2008 11:20:06 +0100, Kim <kims...@gmail.com> wrote: >>>>>>>> Sample data: >>>>>>>> colA,colB,sum >>>>>>>> 33996344,33996351,8 >>>>>>>> 50331648,67276831,16945184 >>>>>>>> 50331648,68257567,17925920 >>>>>>>> 67276832,67276847,16 >>>>>>>> 67276848,67277023,176 >>>>>>>> 67277024,67277031,8 >>>>>>>> Wanted output: >>>>>>>> 33996344,33996351,8 >>>>>>>> 50331648,68257567,17925920 >>>>>>>> Sum is done as follows (colB-colA+1). >>>>>>>> Any idea how to get this outcome ? >>>>>>>> This will give incorrect unique count, because it simply sums all >>>>> up. >>>>>>>> SELECT sum(colB-colA+1) FROM tbl >>>>>>> If this is not what you want: >>>>>>> SELECT sum(colB-colA+1) FROM tbl >>>>>>> GROUP BY colA,colB >>>>>>> ... you'll have to explain the logic further, as I can't really see >>>>> it. >>>>>>> What overlaps what, how, an how should that be handled? >>>>>> To Rik Wasmus: >>>>>> Its not what I want. >>>>>> If you look at the sample when you should see 6 rows. Of these only >>>>>> the 2 listed in "wanted output" is unique. >>>>>> Here is why: >>>>>> Row 1 is unique already, so no filtering needed. >>>>>> Row 2 is not unique because row 3 contains the same + more and >>>>>> therefor needs to be filtered out. >>>>>> Row 3 is unique because it has the largest range with no >>>>>> overlapping. >>>>>> Row 4-6 are all in the range which row 3 has and therefor needs to >>>>>> be >>>>>> filtered out. >>>>> So, if I understand correctly, you want the 'largest' range, and skip >>>>> all >>>>> records that fall in a bigger range? >>>>> If that's the case, what should happen with: >>>>> 1 4 >>>>> 2 6 >>>>> 1 3 >>>>> 8 12 >>>>> 9 10 >>>>> The problem here is row 2 falls partly in row 1, but I assume: >>>>> 1 4 >>>>> 2 6 >>>>> 8 12 >>>>> If I assume these should count as different rows, I'd say this one: >>>>> SELECT >>>>> a.colA, a.colB, (a.colB-a.colA+1) >>>>> FROM tablename a >>>>> LEFT JOIN tablename b >>>>> ON b.colA <= a.colA >>>>> AND b.colA >= a.colA >>>>> # having a pk would help >>>>> # now we have to make sure it >>>>> # doesn't match on it's own row: >>>>> AND NOT (a.colA = b.colA AND a.colB = b.colB) >>>>> WHERE b.colA IS NULL >>>> Your understanding is correct, as is your assumption. >>>> I had not yet considered the way you outline of a range overlapping >>>> another range. I now see dark future of doing this without looping.... >>>> Regarding your query on sample data, this is what I get: >>>> 33996344,33996351,8 >>>> 67276832,67276847,16 >>>> 67276848,67277023,176 >>>> 67277024,67277031,8 >>> Hmmz, just an error in typing it up, the second JOIN clause should >>> offcourse be on colB: >>> mysql> SELECT >>> -> a.colA, a.colB, (a.colB-a.colA+1) >>> -> FROM tester a >>> -> LEFT JOIN tester b >>> -> ON b.colA <= a.colA >>> -> AND b.colB >= a.colB >>> -> # having a pk would help >>> -> # now we have to make sure it >>> -> # doesn't match on it's own row: >>> -> AND NOT (a.colA = b.colA AND a.colB = b.colB) >>> -> WHERE b.colA IS NULL; >>> +----------+----------+-------------------+ >>> | colA | colB | (a.colB-a.colA+1) | >>> +----------+----------+-------------------+ >>> | 33996344 | 33996351 | 8 | >>> | 50331648 | 68257567 | 17925920 | >>> +----------+----------+-------------------+ >>> 2 rows in set (0.00 sec) >>> -- >>> Rik Wasmus >> Yours actually work faster than Luuk's. > > its because the inner join works better than the subquery.. ;-) > >> However the time it takes to sum up unique values on 10.000 rows takes >> 94sec, which isnt fast in any way. I dare not think of how much time >> it will take for millions of rows. >> SELECT >> sum(a.colB-a.colA+1) >> FROM tbl a >> LEFT JOIN tbl b >> ON b.colA <= a.colA >> AND b.colB >= a.colB >> AND NOT (a.colA = b.colA AND a.colB = b.colB) >> WHERE b.colA IS NULL >> Why is the where clause there ? If its avoid NULL values, then its not >> a problem since there are none in tbl. >> Therefor am I trying the SP way I described in my last post. > > > for more speed try to read this part of Riks' post: > >> -> # having a pk would help > >> -> # now we have to make sure it > >> -> # doesn't match on it's own row: > >> -> AND NOT (a.colA = b.colA AND a.colB = b.colB) > > if you set up a pk (primary key), you can leave out the "AND NOT (a.colA > = b.colA AND a.colB = b.colB)" part, which should speed up things.... ....which should be substituted with: AND a.primary_key != b.primary_key Also, keywise, offcourse PK in possible primary key, and an index on colA and colB. -- Rik Wasmus |
| |||
| On Mar 12, 12:18 pm, "Rik Wasmus" <luiheidsgoe...@hotmail.com> wrote: > On Wed, 12 Mar 2008 10:57:54 +0100, Kim <kims...@gmail.com> wrote: > > On Mar 11, 3:17 pm, "Rik Wasmus" <luiheidsgoe...@hotmail.com> wrote: > >> On Tue, 11 Mar 2008 14:55:10 +0100, Kim <kims...@gmail.com> wrote: > >> > On Mar 11, 2:11 pm, "Rik Wasmus" <luiheidsgoe...@hotmail.com> wrote: > >> >> On Tue, 11 Mar 2008 13:57:37 +0100, Kim <kims...@gmail.com> wrote: > >> >> > On Mar 11, 11:25 am, "Rik Wasmus" <luiheidsgoe...@hotmail.com> > >> wrote: > >> >> >> On Tue, 11 Mar 2008 11:20:06 +0100, Kim <kims...@gmail.com> wrote: > >> >> >> > Sample data: > >> >> >> > colA,colB,sum > >> >> >> > 33996344,33996351,8 > >> >> >> > 50331648,67276831,16945184 > >> >> >> > 50331648,68257567,17925920 > >> >> >> > 67276832,67276847,16 > >> >> >> > 67276848,67277023,176 > >> >> >> > 67277024,67277031,8 > > >> >> >> > Wanted output: > >> >> >> > 33996344,33996351,8 > >> >> >> > 50331648,68257567,17925920 > > >> >> >> > Sum is done as follows (colB-colA+1). > >> >> >> > Any idea how to get this outcome ? > > >> >> >> > This will give incorrect unique count, because it simply sums > >> all > >> >> up. > >> >> >> > SELECT sum(colB-colA+1) FROM tbl > > >> >> >> If this is not what you want: > >> >> >> SELECT sum(colB-colA+1) FROM tbl > >> >> >> GROUP BY colA,colB > > >> >> >> ... you'll have to explain the logic further, as I can't really > >> see > >> >> it. > >> >> >> What overlaps what, how, an how should that be handled? > > >> >> > To Rik Wasmus: > >> >> > Its not what I want. > >> >> > If you look at the sample when you should see 6 rows. Of these only > >> >> > the 2 listed in "wanted output" is unique. > > >> >> > Here is why: > >> >> > Row 1 is unique already, so no filtering needed. > >> >> > Row 2 is not unique because row 3 contains the same + more and > >> >> > therefor needs to be filtered out. > >> >> > Row 3 is unique because it has the largest range with no > >> overlapping. > >> >> > Row 4-6 are all in the range which row 3 has and therefor needs to > >> be > >> >> > filtered out. > > >> >> So, if I understand correctly, you want the 'largest' range, and skip > >> >> all > >> >> records that fall in a bigger range? > > >> >> If that's the case, what should happen with: > >> >> 1 4 > >> >> 2 6 > >> >> 1 3 > >> >> 8 12 > >> >> 9 10 > > >> >> The problem here is row 2 falls partly in row 1, but I assume: > >> >> 1 4 > >> >> 2 6 > >> >> 8 12 > > >> >> If I assume these should count as different rows, I'd say this one: > >> >> SELECT > >> >> a.colA, a.colB, (a.colB-a.colA+1) > >> >> FROM tablename a > >> >> LEFT JOIN tablename b > >> >> ON b.colA <= a.colA > >> >> AND b.colA >= a.colA > >> >> # having a pk would help > >> >> # now we have to make sure it > >> >> # doesn't match on it's own row: > >> >> AND NOT (a.colA = b.colA AND a.colB = b.colB) > >> >> WHERE b.colA IS NULL > > >> > Your understanding is correct, as is your assumption. > > >> > I had not yet considered the way you outline of a range overlapping > >> > another range. I now see dark future of doing this without looping... > > >> > Regarding your query on sample data, this is what I get: > >> > 33996344,33996351,8 > >> > 67276832,67276847,16 > >> > 67276848,67277023,176 > >> > 67277024,67277031,8 > > >> Hmmz, just an error in typing it up, the second JOIN clause should > >> offcourse be on colB: > >> mysql> SELECT > >> -> a.colA, a.colB, (a.colB-a.colA+1) > >> -> FROM tester a > >> -> LEFT JOIN tester b > >> -> ON b.colA <= a.colA > >> -> AND b.colB >= a.colB > >> -> # having a pk would help > >> -> # now we have to make sure it > >> -> # doesn't match on it's own row: > >> -> AND NOT (a.colA = b.colA AND a.colB = b.colB) > >> -> WHERE b.colA IS NULL; > >> +----------+----------+-------------------+ > >> | colA | colB | (a.colB-a.colA+1) | > >> +----------+----------+-------------------+ > >> | 33996344 | 33996351 | 8 | > >> | 50331648 | 68257567 | 17925920 | > >> +----------+----------+-------------------+ > >> 2 rows in set (0.00 sec) > > > Yours actually work faster than Luuk's. > > However the time it takes to sum up unique values on 10.000 rows takes > > 94sec, which isnt fast in any way. I dare not think of how much time > > it will take for millions of rows. > > Could you run it with an EXPLAIN before the SELECT and give us the output? > > > SELECT > > sum(a.colB-a.colA+1) > > FROM tbl a > > LEFT JOIN tbl b > > ON b.colA <= a.colA > > AND b.colB >= a.colB > > AND NOT (a.colA = b.colA AND a.colB = b.colB) > > WHERE b.colA IS NULL > > > Why is the where clause there ? If its avoid NULL values, then its not > > a problem since there are none in tbl. > > The WHERE clause is exactly there and works because colA is never NULL > Translated into human readable language this query is something like: > > Give the result of sum(a.colB-a.colA+1) from tbl, where there are no rows > (colA IS NULL) in tbl where colA-colB falls into the range of this row > (the join clause). > > > Therefor am I trying the SP way I described in my last post. > > One query with the right indexes (I sure hope you have some...) is often > way faster then an SP. > -- > Rik Wasmus > Could you run it with an EXPLAIN before the SELECT and give us the output? id select_type table type possible_keys key key_len ref rows Extra 1 SIMPLE a ALL NULL NULL NULL NULL 10000 1 SIMPLE b ALL NULL NULL NULL NULL 10000 Using where; Not exists > One query with the right indexes (I sure hope you have some...) is often > way faster then an SP. There is only 1 normal index in the tbl, and its on colC (containing an ID). Upon creation of tbl it is sorted by colA, colB anyway. Adding any sort of index on colA and/or colB will in fact make queries slower as much as 500% ! So I have tried more indexes. Still working on the SP, but still with errors. Thanks for the help |
| |||
| Kim wrote: > On Mar 12, 12:18 pm, "Rik Wasmus" <luiheidsgoe...@hotmail.com> wrote: >> On Wed, 12 Mar 2008 10:57:54 +0100, Kim <kims...@gmail.com> wrote: >>> On Mar 11, 3:17 pm, "Rik Wasmus" <luiheidsgoe...@hotmail.com> wrote: >>>> On Tue, 11 Mar 2008 14:55:10 +0100, Kim <kims...@gmail.com> wrote: >>>>> On Mar 11, 2:11 pm, "Rik Wasmus" <luiheidsgoe...@hotmail.com> wrote: >>>>>> On Tue, 11 Mar 2008 13:57:37 +0100, Kim <kims...@gmail.com> wrote: >>>>>>> On Mar 11, 11:25 am, "Rik Wasmus" <luiheidsgoe...@hotmail.com> >>>> wrote: >>>>>>>> On Tue, 11 Mar 2008 11:20:06 +0100, Kim <kims...@gmail.com> wrote: >>>>>>>>> Sample data: >>>>>>>>> colA,colB,sum >>>>>>>>> 33996344,33996351,8 >>>>>>>>> 50331648,67276831,16945184 >>>>>>>>> 50331648,68257567,17925920 >>>>>>>>> 67276832,67276847,16 >>>>>>>>> 67276848,67277023,176 >>>>>>>>> 67277024,67277031,8 >>>>>>>>> Wanted output: >>>>>>>>> 33996344,33996351,8 >>>>>>>>> 50331648,68257567,17925920 >>>>>>>>> Sum is done as follows (colB-colA+1). >>>>>>>>> Any idea how to get this outcome ? >>>>>>>>> This will give incorrect unique count, because it simply sums >>>> all >>>>>> up. >>>>>>>>> SELECT sum(colB-colA+1) FROM tbl >>>>>>>> If this is not what you want: >>>>>>>> SELECT sum(colB-colA+1) FROM tbl >>>>>>>> GROUP BY colA,colB >>>>>>>> ... you'll have to explain the logic further, as I can't really >>>> see >>>>>> it. >>>>>>>> What overlaps what, how, an how should that be handled? >>>>>>> To Rik Wasmus: >>>>>>> Its not what I want. >>>>>>> If you look at the sample when you should see 6 rows. Of these only >>>>>>> the 2 listed in "wanted output" is unique. >>>>>>> Here is why: >>>>>>> Row 1 is unique already, so no filtering needed. >>>>>>> Row 2 is not unique because row 3 contains the same + more and >>>>>>> therefor needs to be filtered out. >>>>>>> Row 3 is unique because it has the largest range with no >>>> overlapping. >>>>>>> Row 4-6 are all in the range which row 3 has and therefor needs to >>>> be >>>>>>> filtered out. >>>>>> So, if I understand correctly, you want the 'largest' range, and skip >>>>>> all >>>>>> records that fall in a bigger range? >>>>>> If that's the case, what should happen with: >>>>>> 1 4 >>>>>> 2 6 >>>>>> 1 3 >>>>>> 8 12 >>>>>> 9 10 >>>>>> The problem here is row 2 falls partly in row 1, but I assume: >>>>>> 1 4 >>>>>> 2 6 >>>>>> 8 12 >>>>>> If I assume these should count as different rows, I'd say this one: >>>>>> SELECT >>>>>> a.colA, a.colB, (a.colB-a.colA+1) >>>>>> FROM tablename a >>>>>> LEFT JOIN tablename b >>>>>> ON b.colA <= a.colA >>>>>> AND b.colA >= a.colA >>>>>> # having a pk would help >>>>>> # now we have to make sure it >>>>>> # doesn't match on it's own row: >>>>>> AND NOT (a.colA = b.colA AND a.colB = b.colB) >>>>>> WHERE b.colA IS NULL >>>>> Your understanding is correct, as is your assumption. >>>>> I had not yet considered the way you outline of a range overlapping >>>>> another range. I now see dark future of doing this without looping... >>>>> Regarding your query on sample data, this is what I get: >>>>> 33996344,33996351,8 >>>>> 67276832,67276847,16 >>>>> 67276848,67277023,176 >>>>> 67277024,67277031,8 >>>> Hmmz, just an error in typing it up, the second JOIN clause should >>>> offcourse be on colB: >>>> mysql> SELECT >>>> -> a.colA, a.colB, (a.colB-a.colA+1) >>>> -> FROM tester a >>>> -> LEFT JOIN tester b >>>> -> ON b.colA <= a.colA >>>> -> AND b.colB >= a.colB >>>> -> # having a pk would help >>>> -> # now we have to make sure it >>>> -> # doesn't match on it's own row: >>>> -> AND NOT (a.colA = b.colA AND a.colB = b.colB) >>>> -> WHERE b.colA IS NULL; >>>> +----------+----------+-------------------+ >>>> | colA | colB | (a.colB-a.colA+1) | >>>> +----------+----------+-------------------+ >>>> | 33996344 | 33996351 | 8 | >>>> | 50331648 | 68257567 | 17925920 | >>>> +----------+----------+-------------------+ >>>> 2 rows in set (0.00 sec) >>> Yours actually work faster than Luuk's. >>> However the time it takes to sum up unique values on 10.000 rows takes >>> 94sec, which isnt fast in any way. I dare not think of how much time >>> it will take for millions of rows. >> Could you run it with an EXPLAIN before the SELECT and give us the output? >> >>> SELECT >>> sum(a.colB-a.colA+1) >>> FROM tbl a >>> LEFT JOIN tbl b >>> ON b.colA <= a.colA >>> AND b.colB >= a.colB >>> AND NOT (a.colA = b.colA AND a.colB = b.colB) >>> WHERE b.colA IS NULL >>> Why is the where clause there ? If its avoid NULL values, then its not >>> a problem since there are none in tbl. >> The WHERE clause is exactly there and works because colA is never NULL >> Translated into human readable language this query is something like: >> >> Give the result of sum(a.colB-a.colA+1) from tbl, where there are no rows >> (colA IS NULL) in tbl where colA-colB falls into the range of this row >> (the join clause). >> >>> Therefor am I trying the SP way I described in my last post. >> One query with the right indexes (I sure hope you have some...) is often >> way faster then an SP. >> -- >> Rik Wasmus > >> Could you run it with an EXPLAIN before the SELECT and give us the output? > id select_type table type possible_keys key key_len ref rows > Extra > 1 SIMPLE a ALL NULL NULL NULL NULL 10000 > 1 SIMPLE b ALL NULL NULL NULL NULL 10000 Using where; Not > exists > >> One query with the right indexes (I sure hope you have some...) is often >> way faster then an SP. > There is only 1 normal index in the tbl, and its on colC (containing > an ID). > Upon creation of tbl it is sorted by colA, colB anyway. Adding any > sort of index on colA and/or colB will in fact make queries slower as > much as 500% ! So I have tried more indexes. > Wrong. Tables are NEVER sorted. Never assume that rows will be in the order they were inserted. Currently you are doing a table scan. I would suggest an index, probably on colA, but you could try with colB and explain both ways. > Still working on the SP, but still with errors. > Thanks for the help > -- ================== Remove the "x" from my email address Jerry Stuckle JDS Computer Training Corp. jstucklex@attglobal.net ================== |
| |||
| On Mar 12, 2:47 pm, Jerry Stuckle <jstuck...@attglobal.net> wrote: > Kim wrote: > > On Mar 12, 12:18 pm, "Rik Wasmus" <luiheidsgoe...@hotmail.com> wrote: > >> On Wed, 12 Mar 2008 10:57:54 +0100, Kim <kims...@gmail.com> wrote: > >>> On Mar 11, 3:17 pm, "Rik Wasmus" <luiheidsgoe...@hotmail.com> wrote: > >>>> On Tue, 11 Mar 2008 14:55:10 +0100, Kim <kims...@gmail.com> wrote: > >>>>> On Mar 11, 2:11 pm, "Rik Wasmus" <luiheidsgoe...@hotmail.com> wrote: > >>>>>> On Tue, 11 Mar 2008 13:57:37 +0100, Kim <kims...@gmail.com> wrote: > >>>>>>> On Mar 11, 11:25 am, "Rik Wasmus" <luiheidsgoe...@hotmail.com> > >>>> wrote: > >>>>>>>> On Tue, 11 Mar 2008 11:20:06 +0100, Kim <kims...@gmail.com> wrote: > >>>>>>>>> Sample data: > >>>>>>>>> colA,colB,sum > >>>>>>>>> 33996344,33996351,8 > >>>>>>>>> 50331648,67276831,16945184 > >>>>>>>>> 50331648,68257567,17925920 > >>>>>>>>> 67276832,67276847,16 > >>>>>>>>> 67276848,67277023,176 > >>>>>>>>> 67277024,67277031,8 > >>>>>>>>> Wanted output: > >>>>>>>>> 33996344,33996351,8 > >>>>>>>>> 50331648,68257567,17925920 > >>>>>>>>> Sum is done as follows (colB-colA+1). > >>>>>>>>> Any idea how to get this outcome ? > >>>>>>>>> This will give incorrect unique count, because it simply sums > >>>> all > >>>>>> up. > >>>>>>>>> SELECT sum(colB-colA+1) FROM tbl > >>>>>>>> If this is not what you want: > >>>>>>>> SELECT sum(colB-colA+1) FROM tbl > >>>>>>>> GROUP BY colA,colB > >>>>>>>> ... you'll have to explain the logic further, as I can't really > >>>> see > >>>>>> it. > >>>>>>>> What overlaps what, how, an how should that be handled? > >>>>>>> To Rik Wasmus: > >>>>>>> Its not what I want. > >>>>>>> If you look at the sample when you should see 6 rows. Of these only > >>>>>>> the 2 listed in "wanted output" is unique. > >>>>>>> Here is why: > >>>>>>> Row 1 is unique already, so no filtering needed. > >>>>>>> Row 2 is not unique because row 3 contains the same + more and > >>>>>>> therefor needs to be filtered out. > >>>>>>> Row 3 is unique because it has the largest range with no > >>>> overlapping. > >>>>>>> Row 4-6 are all in the range which row 3 has and therefor needs to > >>>> be > >>>>>>> filtered out. > >>>>>> So, if I understand correctly, you want the 'largest' range, and skip > >>>>>> all > >>>>>> records that fall in a bigger range? > >>>>>> If that's the case, what should happen with: > >>>>>> 1 4 > >>>>>> 2 6 > >>>>>> 1 3 > >>>>>> 8 12 > >>>>>> 9 10 > >>>>>> The problem here is row 2 falls partly in row 1, but I assume: > >>>>>> 1 4 > >>>>>> 2 6 > >>>>>> 8 12 > >>>>>> If I assume these should count as different rows, I'd say this one: > >>>>>> SELECT > >>>>>> a.colA, a.colB, (a.colB-a.colA+1) > >>>>>> FROM tablename a > >>>>>> LEFT JOIN tablename b > >>>>>> ON b.colA <= a.colA > >>>>>> AND b.colA >= a.colA > >>>>>> # having a pk would help > >>>>>> # now we have to make sure it > >>>>>> # doesn't match on it's own row: > >>>>>> AND NOT (a.colA = b.colA AND a.colB = b.colB) > >>>>>> WHERE b.colA IS NULL > >>>>> Your understanding is correct, as is your assumption. > >>>>> I had not yet considered the way you outline of a range overlapping > >>>>> another range. I now see dark future of doing this without looping... > >>>>> Regarding your query on sample data, this is what I get: > >>>>> 33996344,33996351,8 > >>>>> 67276832,67276847,16 > >>>>> 67276848,67277023,176 > >>>>> 67277024,67277031,8 > >>>> Hmmz, just an error in typing it up, the second JOIN clause should > >>>> offcourse be on colB: > >>>> mysql> SELECT > >>>> -> a.colA, a.colB, (a.colB-a.colA+1) > >>>> -> FROM tester a > >>>> -> LEFT JOIN tester b > >>>> -> ON b.colA <= a.colA > >>>> -> AND b.colB >= a.colB > >>>> -> # having a pk would help > >>>> -> # now we have to make sure it > >>>> -> # doesn't match on it's own row: > >>>> -> AND NOT (a.colA = b.colA AND a.colB = b.colB) > >>>> -> WHERE b.colA IS NULL; > >>>> +----------+----------+-------------------+ > >>>> | colA | colB | (a.colB-a.colA+1) | > >>>> +----------+----------+-------------------+ > >>>> | 33996344 | 33996351 | 8 | > >>>> | 50331648 | 68257567 | 17925920 | > >>>> +----------+----------+-------------------+ > >>>> 2 rows in set (0.00 sec) > >>> Yours actually work faster than Luuk's. > >>> However the time it takes to sum up unique values on 10.000 rows takes > >>> 94sec, which isnt fast in any way. I dare not think of how much time > >>> it will take for millions of rows. > >> Could you run it with an EXPLAIN before the SELECT and give us the output? > > >>> SELECT > >>> sum(a.colB-a.colA+1) > >>> FROM tbl a > >>> LEFT JOIN tbl b > >>> ON b.colA <= a.colA > >>> AND b.colB >= a.colB > >>> AND NOT (a.colA = b.colA AND a.colB = b.colB) > >>> WHERE b.colA IS NULL > >>> Why is the where clause there ? If its avoid NULL values, then its not > >>> a problem since there are none in tbl. > >> The WHERE clause is exactly there and works because colA is never NULL > >> Translated into human readable language this query is something like: > > >> Give the result of sum(a.colB-a.colA+1) from tbl, where there are no rows > >> (colA IS NULL) in tbl where colA-colB falls into the range of this row > >> (the join clause). > > >>> Therefor am I trying the SP way I described in my last post. > >> One query with the right indexes (I sure hope you have some...) is often > >> way faster then an SP. > >> -- > >> Rik Wasmus > > >> Could you run it with an EXPLAIN before the SELECT and give us the output? > > id select_type table type possible_keys key key_len ref rows > > Extra > > 1 SIMPLE a ALL NULL NULL NULL NULL 10000 > > 1 SIMPLE b ALL NULL NULL NULL NULL 10000 Using where; Not > > exists > > >> One query with the right indexes (I sure hope you have some...) is often > >> way faster then an SP. > > There is only 1 normal index in the tbl, and its on colC (containing > > an ID). > > Upon creation of tbl it is sorted by colA, colB anyway. Adding any > > sort of index on colA and/or colB will in fact make queries slower as > > much as 500% ! So I have tried more indexes. > > Wrong. Tables are NEVER sorted. Never assume that rows will be in the > order they were inserted. > > Currently you are doing a table scan. I would suggest an index, > probably on colA, but you could try with colB and explain both ways. > > > Still working on the SP, but still with errors. > > Thanks for the help > > -- > ================== > Remove the "x" from my email address > Jerry Stuckle > JDS Computer Training Corp. > jstuck...@attglobal.net > ================== > Wrong. Tables are NEVER sorted. Never assume that rows will be in the > order they were inserted. You misunderstand. I dont assume it is sorted - I know it is. Because when the table tbl is created the data is sorted on insertion. |
| |||
| Kim schreef: > On Mar 12, 2:47 pm, Jerry Stuckle <jstuck...@attglobal.net> wrote: >> Kim wrote: >>> On Mar 12, 12:18 pm, "Rik Wasmus" <luiheidsgoe...@hotmail.com> wrote: >>>> On Wed, 12 Mar 2008 10:57:54 +0100, Kim <kims...@gmail.com> wrote: >>>>> On Mar 11, 3:17 pm, "Rik Wasmus" <luiheidsgoe...@hotmail.com> wrote: >>>>>> On Tue, 11 Mar 2008 14:55:10 +0100, Kim <kims...@gmail.com> wrote: >>>>>>> On Mar 11, 2:11 pm, "Rik Wasmus" <luiheidsgoe...@hotmail.com> wrote: >>>>>>>> On Tue, 11 Mar 2008 13:57:37 +0100, Kim <kims...@gmail.com> wrote: >>>>>>>>> On Mar 11, 11:25 am, "Rik Wasmus" <luiheidsgoe...@hotmail.com> >>>>>> wrote: >>>>>>>>>> On Tue, 11 Mar 2008 11:20:06 +0100, Kim <kims...@gmail.com> wrote: >>>>>>>>>>> Sample data: >>>>>>>>>>> colA,colB,sum >>>>>>>>>>> 33996344,33996351,8 >>>>>>>>>>> 50331648,67276831,16945184 >>>>>>>>>>> 50331648,68257567,17925920 >>>>>>>>>>> 67276832,67276847,16 >>>>>>>>>>> 67276848,67277023,176 >>>>>>>>>>> 67277024,67277031,8 >>>>>>>>>>> Wanted output: >>>>>>>>>>> 33996344,33996351,8 >>>>>>>>>>> 50331648,68257567,17925920 >>>>>>>>>>> Sum is done as follows (colB-colA+1). >>>>>>>>>>> Any idea how to get this outcome ? >>>>>>>>>>> This will give incorrect unique count, because it simply sums >>>>>> all >>>>>>>> up. >>>>>>>>>>> SELECT sum(colB-colA+1) FROM tbl >>>>>>>>>> If this is not what you want: >>>>>>>>>> SELECT sum(colB-colA+1) FROM tbl >>>>>>>>>> GROUP BY colA,colB >>>>>>>>>> ... you'll have to explain the logic further, as I can't really >>>>>> see >>>>>>>> it. >>>>>>>>>> What overlaps what, how, an how should that be handled? >>>>>>>>> To Rik Wasmus: >>>>>>>>> Its not what I want. >>>>>>>>> If you look at the sample when you should see 6 rows. Of these only >>>>>>>>> the 2 listed in "wanted output" is unique. >>>>>>>>> Here is why: >>>>>>>>> Row 1 is unique already, so no filtering needed. >>>>>>>>> Row 2 is not unique because row 3 contains the same + more and >>>>>>>>> therefor needs to be filtered out. >>>>>>>>> Row 3 is unique because it has the largest range with no >>>>>> overlapping. >>>>>>>>> Row 4-6 are all in the range which row 3 has and therefor needs to >>>>>> be >>>>>>>>> filtered out. >>>>>>>> So, if I understand correctly, you want the 'largest' range, and skip >>>>>>>> all >>>>>>>> records that fall in a bigger range? >>>>>>>> If that's the case, what should happen with: >>>>>>>> 1 4 >>>>>>>> 2 6 >>>>>>>> 1 3 >>>>>>>> 8 12 >>>>>>>> 9 10 >>>>>>>> The problem here is row 2 falls partly in row 1, but I assume: >>>>>>>> 1 4 >>>>>>>> 2 6 >>>>>>>> 8 12 >>>>>>>> If I assume these should count as different rows, I'd say this one: >>>>>>>> SELECT >>>>>>>> a.colA, a.colB, (a.colB-a.colA+1) >>>>>>>> FROM tablename a >>>>>>>> LEFT JOIN tablename b >>>>>>>> ON b.colA <= a.colA >>>>>>>> AND b.colA >= a.colA >>>>>>>> # having a pk would help >>>>>>>> # now we have to make sure it >>>>>>>> # doesn't match on it's own row: >>>>>>>> AND NOT (a.colA = b.colA AND a.colB = b.colB) >>>>>>>> WHERE b.colA IS NULL >>>>>>> Your understanding is correct, as is your assumption. >>>>>>> I had not yet considered the way you outline of a range overlapping >>>>>>> another range. I now see dark future of doing this without looping... >>>>>>> Regarding your query on sample data, this is what I get: >>>>>>> 33996344,33996351,8 >>>>>>> 67276832,67276847,16 >>>>>>> 67276848,67277023,176 >>>>>>> 67277024,67277031,8 >>>>>> Hmmz, just an error in typing it up, the second JOIN clause should >>>>>> offcourse be on colB: >>>>>> mysql> SELECT >>>>>> -> a.colA, a.colB, (a.colB-a.colA+1) >>>>>> -> FROM tester a >>>>>> -> LEFT JOIN tester b >>>>>> -> ON b.colA <= a.colA >>>>>> -> AND b.colB >= a.colB >>>>>> -> # having a pk would help >>>>>> -> # now we have to make sure it >>>>>> -> # doesn't match on it's own row: >>>>>> -> AND NOT (a.colA = b.colA AND a.colB = b.colB) >>>>>> -> WHERE b.colA IS NULL; >>>>>> +----------+----------+-------------------+ >>>>>> | colA | colB | (a.colB-a.colA+1) | >>>>>> +----------+----------+-------------------+ >>>>>> | 33996344 | 33996351 | 8 | >>>>>> | 50331648 | 68257567 | 17925920 | >>>>>> +----------+----------+-------------------+ >>>>>> 2 rows in set (0.00 sec) >>>>> Yours actually work faster than Luuk's. >>>>> However the time it takes to sum up unique values on 10.000 rows takes >>>>> 94sec, which isnt fast in any way. I dare not think of how much time >>>>> it will take for millions of rows. >>>> Could you run it with an EXPLAIN before the SELECT and give us the output? >>>>> SELECT >>>>> sum(a.colB-a.colA+1) >>>>> FROM tbl a >>>>> LEFT JOIN tbl b >>>>> ON b.colA <= a.colA >>>>> AND b.colB >= a.colB >>>>> AND NOT (a.colA = b.colA AND a.colB = b.colB) >>>>> WHERE b.colA IS NULL >>>>> Why is the where clause there ? If its avoid NULL values, then its not >>>>> a problem since there are none in tbl. >>>> The WHERE clause is exactly there and works because colA is never NULL >>>> Translated into human readable language this query is something like: >>>> Give the result of sum(a.colB-a.colA+1) from tbl, where there are no rows >>>> (colA IS NULL) in tbl where colA-colB falls into the range of this row >>>> (the join clause). >>>>> Therefor am I trying the SP way I described in my last post. >>>> One query with the right indexes (I sure hope you have some...) is often >>>> way faster then an SP. >>>> -- >>>> Rik Wasmus >>>> Could you run it with an EXPLAIN before the SELECT and give us the output? >>> id select_type table type possible_keys key key_len ref rows >>> Extra >>> 1 SIMPLE a ALL NULL NULL NULL NULL 10000 >>> 1 SIMPLE b ALL NULL NULL NULL NULL 10000 Using where; Not >>> exists >>>> One query with the right indexes (I sure hope you have some...) is often >>>> way faster then an SP. >>> There is only 1 normal index in the tbl, and its on colC (containing >>> an ID). >>> Upon creation of tbl it is sorted by colA, colB anyway. Adding any >>> sort of index on colA and/or colB will in fact make queries slower as >>> much as 500% ! So I have tried more indexes. >> Wrong. Tables are NEVER sorted. Never assume that rows will be in the >> order they were inserted. >> >> Currently you are doing a table scan. I would suggest an index, >> probably on colA, but you could try with colB and explain both ways. >> >>> Still working on the SP, but still with errors. >>> Thanks for the help >> -- >> ================== >> Remove the "x" from my email address >> Jerry Stuckle >> JDS Computer Training Corp. >> jstuck...@attglobal.net >> ================== > >> Wrong. Tables are NEVER sorted. Never assume that rows will be in the >> order they were inserted. > You misunderstand. I dont assume it is sorted - I know it is. Because > when the table tbl is created the data is sorted on insertion. but Jerry is right, you assume its sorted, because you did not see any values 'out of sort'... but, as far as documentation goes, it CAN happen that thing get out of order, that's why MySQL does a full table scan, and thats' why you should add a key, to make sure MySQL knows that things are in a specific order. without key (on ColA) typeing something like select * from tbls where ColA='12345'; scans the complete table for the value '12345' with a key (on ColA) typein the same statement, produces an instant result. -- Luuk |
| ||||
| On Mar 11, 11:20 am, Kim <kims...@gmail.com> wrote: > Sample data: > colA,colB,sum > 33996344,33996351,8 > 50331648,67276831,16945184 > 50331648,68257567,17925920 > 67276832,67276847,16 > 67276848,67277023,176 > 67277024,67277031,8 > > Wanted output: > 33996344,33996351,8 > 50331648,68257567,17925920 > > Sum is done as follows (colB-colA+1). > Any idea how to get this outcome ? > > This will give incorrect unique count, because it simply sums all up. > SELECT sum(colB-colA+1) FROM tbl Had to change my wanted output to this: 33996344,33996351,8 50331648,67276831,16945184 67276832,68257567,980735 I still sum up after all output is generated to get the desired unique count. I ended up using a store procedure (code below), which became possible to do in MySQL 5.0. DELIMITER $$ CREATE PROCEDURE count_unique_values() BEGIN DECLARE tip_last BOOL DEFAULT FALSE; DECLARE tip_a, tip_b BIGINT unsigned DEFAULT 0; DECLARE cur1 CURSOR FOR SELECT colA,colB FROM `tbl_src` ORDER BY colA ASC, colB ASC; DECLARE CONTINUE HANDLER FOR NOT FOUND SET tip_last = TRUE; DROP TABLE IF EXISTS `tmp_tbl`; CREATE TABLE `tmp_tbl`( `a` bigint(20) unsigned NOT NULL, `b` bigint(20) unsigned NOT NULL, `sum` int(10) unsigned NOT NULL ); SET @tip_ap = 0; SET @tip_bp = 0; SET @tip_sum = 0; PREPARE stmt_insert FROM 'INSERT INTO `tmp_tbl` VALUES(?,?,?)'; OPEN cur1; WHILE NOT tip_last DO FETCH cur1 INTO tip_a, tip_b; # SELECT concat('s=',tip_a,', ap=',@tip_ap,', b=',tip_b,', bp=',@tip_bp,', sum=',@tip_sum) debug; IF (tip_a > @tip_ap) THEN IF (tip_b > @tip_bp) THEN SET @tip_ap = tip_a; SET @tip_bp = tip_b; SET @tip_sum = tip_b - tip_s +1; EXECUTE stmt_insert USING @tip_ap,@tip_bp,@tip_sum; END IF; ELSEIF (tip_a = @tip_ap) THEN IF (tip_b > @tip_bp) THEN SET @tip_ap = @tip_bp +1; SET @tip_bp = tip_b; SET @tip_sum = @tip_bp - @tip_ap; EXECUTE stmt_insert USING @tip_ap,@tip_bp,@tip_sum; END IF; END IF; END WHILE; CLOSE cur1; DEALLOCATE PREPARE stmt_insert; SELECT sum(sum) 'sum' FROM `tmp_tbl`; END $$ DELIMITER ; Using that SP am I able to get a result in about 50-60 seconds from ~3 million rows in tbl_src. Much better the 300days I got earlier I hope somebody else can use it as well. |