Unix Technical Forum

RE: data distribution question

This is a discussion on RE: data distribution question within the Informix forums, part of the Database Server Software category; --> >> Taking bin 3, out of 1.9 million rows in it, there are only 9.9k rows that are unique. ...


Go Back   Unix Technical Forum > Database Server Software > Informix

FAQ Members List Calendar Search Today's Posts Mark Forums Read
  #1 (permalink)  
Old 04-20-2008, 12:43 PM
Ford, Andrew G
 
Posts: n/a
Default RE: data distribution question

>> Taking bin 3, out of 1.9 million rows in it, there are only 9.9k rows
that are unique. The rest of the rows are duplicate.

Not really.

bin 3 represents 1.9 million rows having 9.9K different values between
25763 and 35887.

Think of it this way, if you executed the following query against your
data:

select
value,
count(*)
from
table
where
value between 25764 and 35887
group by
value
order by
value asc;

you would expect to see the following results:

value count(*)
25764 199
25765 199
25766 199
25767 199
25768 199
..
..
..
35885 199
35886 199
35887 199

Of course the counts are not going to be exact when you execute the
query but this is what the optimizer thinks your data looks like.

Andrew Ford


-----Original Message-----
From: informix-list-bounces@iiug.org
[mailto:informix-list-bounces@iiug.org] On Behalf Of Floyd Wellershaus
Sent: Thursday, August 03, 2006 4:58 AM
To: Andrew Ford; informix-list@iiug.org
Subject: Re: data distribution question


Thank you for the reply.
Let me try and understand.
Taking bin 3, out of 1.9 million rows in it, there are only 9.9k rows
that are unique. The rest of the rows are duplicate.
The range for that bin contains values between 25763 and 35887.
Since I don't see any values in that range, in the overflow, then I
guess I can assume that there are no major "skewers" of the value in
that bin, and that there is just a very high percentage of values that
are duplicated.

Is that right so far ?

So doesn't that mean that the index would have to do a lot of extra
reads to find a value ?

Thanks for your patience.





========================
-<<Floyd Wellershaus>>-
Database Administrator
Unix Administrator


email: fwellers@yahoo.com

Home: 703-430-0805

Cell: 703-477-6045
========================

http://www.one.org/


----- Original Message ----
From: Andrew Ford <andrewford@austin.rr.com>
To: Floyd Wellershaus <fwellers@yahoo.com>; informix-list@iiug.org
Sent: Wednesday, August 2, 2006 10:50:32 PM
Subject: Re: data distribution question


By 25% duplicate values do you mean you have 100,000 rows in the table
and 25,000 have a value of "A", 25,000 have a value of "B", 25,000 have
a value of "C" and 25,000 have a value of "D"?

If so, I would think that an index on that column might not serve much
purpose unless it was for a foreign key.

If you mean you have 75% unique values (or almost unique values) and 25%
of the data has the same value then the optimizer will (hopefully)
choose an index when you are looking for a few rows living in the 75% of
the data that is unique and might choose a scan if you're looking for a
few rows in the 25% of the data that is a duplicate and probably will
choose a scan if you are looking for a lot of rows in the table.

Based on your distributions it looks like your data has some good
uniqueness. If the value for that column isn't defined in the overflow
buckets you can be pretty sure you only have 0.5% duplicates based on
the data from bin #3.

You are right, the bins don't have anything to do with index traversal.
The optimizer uses the data distributions to make a guess as to which
access method will be the best: one of the many indexes you have created
or a sequential scan. Once the decision is made you either traverse the
index tree looking for data or sequentially scan the table.

Andrew

----- Original Message -----
From: Floyd Wellershaus <mailto:fwellers@yahoo.com>
Newsgroups: comp.databases.informix
To: Ford, Andrew G <mailto:Andrew_G_Ford@homedepot.com> ;
informix-list@iiug.org
Sent: Wednesday, August 02, 2006 7:29 PM
Subject: Re: data distribution question

Hmm. Thanks. I like the way you explain it. It helps me to see
it more clearly. I'm still a bit confused though. I was taught that a
general rule of thumb is that if a table has more than about 25%
duplicate values, it would take too many index reads to make the index
worthwhile.
In your scenario, the index would seem to narrow down to 199
values, but isn't that only if the index knows to go to the correct bin
in the first place ? Maybe they do, but I don't know. I didn't think
the bins had anything to do with index traversal to find the value.







========================
-<<Floyd Wellershaus>>-
Database Administrator
Unix Administrator


email: fwellers@yahoo.com

Home: 703-430-0805

Cell: 703-477-6045
========================

http://www.one.org/


----- Original Message ----
From: "Ford, Andrew G" <Andrew_G_Ford@homedepot.com>
To: Floyd Wellershaus <fwellers@yahoo.com>;
informix-list@iiug.org
Sent: Wednesday, August 2, 2006 6:38:34 PM
Subject: RE: data distribution question





bin# # rows in bucket # unique values highest
value in this bin

2: ( 1973055, 9364,
25763)

3: ( 1973055, 9915,
35887)

Doesn't look too bad. You've got 1973055 rows in bin #3
representing values between 25764 and 35887 and have 9915 unique values
in this range. Calculate this out and you have 1973055 / 9915 = 199
rows for each value between 25764 and 35887.

If I'm looking for a row with value of 28742 this index would
narrow the search down to approx. 199 rows.

Now, if I am looking for a row with even numbered values a
sequential scan would be better. It all depends on what info you are
trying to get to.

Check out this link for some good info on distributions and
update statistics by some guy,
http://www-1.ibm.com/partnerworld/pw...bizinformix_08
mar.html


(Did I get it right John?)




Andrew Ford


-----Original Message-----
From: informix-list-bounces@iiug.org
[mailto:informix-list-bounces@iiug.org] On Behalf Of Floyd Wellershaus
Sent: Wednesday, August 02, 2006 6:08 PM
To: informix-list@iiug.org
Subject: data distribution question


With the below values for the mbntfhst_token column, which is
being joined to in the query, I would think these distributions show way
too much duplication for an index to be effective. Yet, when I force a
scan on that table, the cost goes way way up.

What am I missing here ?

Thank you.

Floyd


1: ( 1973055, 13906, 14381)

2: ( 1973055, 9364, 25763)

3: ( 1973055, 9915, 35887)

4: ( 1973055, 5803, 41746)

5: ( 1973055, 4832, 46654)

6: ( 1973055, 5731, 52437)

7: ( 1973055, 5548, 58053)

8: ( 1973055, 6483, 65065)

9: ( 1973055, 11890, 77038)

10: ( 1973055, 10660, 88268)

11: ( 1973055, 5761, 94404)

12: ( 1973055, 4753, 99639)

13: ( 1973055, 8261, 107909)

14: ( 1973055, 7536, 116822)

15: ( 1973055, 4875, 123111)

16: ( 1973055, 3135, 127073)

17: ( 1973055, 5571, 134148)

...

...



Also, take a look at the overflow bucket for that column.

1: ( 1509940, 206221)

2: ( 499152, 249386)

3: ( 568126, 250016)

4: ( 522169, 276801)

5: ( 495874, 277207)

6: ( 564699, 305435)

7: ( 584721, 305658)

8: ( 507669, 305889)

9: ( 496176, 324590)

10: ( 545764, 335032)

11: ( 575454, 337654)

12: ( 662279, 338354)

13: ( 655045, 357904)

14: ( 1352666, 359863)

15: ( 684749, 368087)

16: ( 521739, 368452)

17: ( 687000, 371474)

18: ( 528608, 371973)

19: ( 679561, 388584)

20: ( 538942, 392313)

21: ( 494174, 393604)

22: ( 546620, 403774)

...

...







========================
-<<Floyd Wellershaus>>-
Database Administrator
Unix Administrator


email: fwellers@yahoo.com

Home: 703-430-0805

Cell: 703-477-6045
========================

http://www.one.org/




Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
Reply


Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are On
Forum Jump


All times are GMT. The time now is 10:24 AM.


Powered by vBulletin® Version 3.6.5
Copyright ©2000 - 2008, Jelsoft Enterprises Ltd.
SEO by vBSEO 3.2.0
www.UnixAdminTalk.com