This is a discussion on RE: data distribution question within the Informix forums, part of the Database Server Software category; --> >> Taking bin 3, out of 1.9 million rows in it, there are only 9.9k rows that are unique. ...
| |||||||
| FAQ | Members List | Calendar | Search | Today's Posts | Mark Forums Read |
| ||||
| >> Taking bin 3, out of 1.9 million rows in it, there are only 9.9k rows that are unique. The rest of the rows are duplicate. Not really. bin 3 represents 1.9 million rows having 9.9K different values between 25763 and 35887. Think of it this way, if you executed the following query against your data: select value, count(*) from table where value between 25764 and 35887 group by value order by value asc; you would expect to see the following results: value count(*) 25764 199 25765 199 25766 199 25767 199 25768 199 .. .. .. 35885 199 35886 199 35887 199 Of course the counts are not going to be exact when you execute the query but this is what the optimizer thinks your data looks like. Andrew Ford -----Original Message----- From: informix-list-bounces@iiug.org [mailto:informix-list-bounces@iiug.org] On Behalf Of Floyd Wellershaus Sent: Thursday, August 03, 2006 4:58 AM To: Andrew Ford; informix-list@iiug.org Subject: Re: data distribution question Thank you for the reply. Let me try and understand. Taking bin 3, out of 1.9 million rows in it, there are only 9.9k rows that are unique. The rest of the rows are duplicate. The range for that bin contains values between 25763 and 35887. Since I don't see any values in that range, in the overflow, then I guess I can assume that there are no major "skewers" of the value in that bin, and that there is just a very high percentage of values that are duplicated. Is that right so far ? So doesn't that mean that the index would have to do a lot of extra reads to find a value ? Thanks for your patience. ======================== -<<Floyd Wellershaus>>- Database Administrator Unix Administrator email: fwellers@yahoo.com Home: 703-430-0805 Cell: 703-477-6045 ======================== http://www.one.org/ ----- Original Message ---- From: Andrew Ford <andrewford@austin.rr.com> To: Floyd Wellershaus <fwellers@yahoo.com>; informix-list@iiug.org Sent: Wednesday, August 2, 2006 10:50:32 PM Subject: Re: data distribution question By 25% duplicate values do you mean you have 100,000 rows in the table and 25,000 have a value of "A", 25,000 have a value of "B", 25,000 have a value of "C" and 25,000 have a value of "D"? If so, I would think that an index on that column might not serve much purpose unless it was for a foreign key. If you mean you have 75% unique values (or almost unique values) and 25% of the data has the same value then the optimizer will (hopefully) choose an index when you are looking for a few rows living in the 75% of the data that is unique and might choose a scan if you're looking for a few rows in the 25% of the data that is a duplicate and probably will choose a scan if you are looking for a lot of rows in the table. Based on your distributions it looks like your data has some good uniqueness. If the value for that column isn't defined in the overflow buckets you can be pretty sure you only have 0.5% duplicates based on the data from bin #3. You are right, the bins don't have anything to do with index traversal. The optimizer uses the data distributions to make a guess as to which access method will be the best: one of the many indexes you have created or a sequential scan. Once the decision is made you either traverse the index tree looking for data or sequentially scan the table. Andrew ----- Original Message ----- From: Floyd Wellershaus <mailto:fwellers@yahoo.com> Newsgroups: comp.databases.informix To: Ford, Andrew G <mailto:Andrew_G_Ford@homedepot.com> ; informix-list@iiug.org Sent: Wednesday, August 02, 2006 7:29 PM Subject: Re: data distribution question Hmm. Thanks. I like the way you explain it. It helps me to see it more clearly. I'm still a bit confused though. I was taught that a general rule of thumb is that if a table has more than about 25% duplicate values, it would take too many index reads to make the index worthwhile. In your scenario, the index would seem to narrow down to 199 values, but isn't that only if the index knows to go to the correct bin in the first place ? Maybe they do, but I don't know. I didn't think the bins had anything to do with index traversal to find the value. ======================== -<<Floyd Wellershaus>>- Database Administrator Unix Administrator email: fwellers@yahoo.com Home: 703-430-0805 Cell: 703-477-6045 ======================== http://www.one.org/ ----- Original Message ---- From: "Ford, Andrew G" <Andrew_G_Ford@homedepot.com> To: Floyd Wellershaus <fwellers@yahoo.com>; informix-list@iiug.org Sent: Wednesday, August 2, 2006 6:38:34 PM Subject: RE: data distribution question bin# # rows in bucket # unique values highest value in this bin 2: ( 1973055, 9364, 25763) 3: ( 1973055, 9915, 35887) Doesn't look too bad. You've got 1973055 rows in bin #3 representing values between 25764 and 35887 and have 9915 unique values in this range. Calculate this out and you have 1973055 / 9915 = 199 rows for each value between 25764 and 35887. If I'm looking for a row with value of 28742 this index would narrow the search down to approx. 199 rows. Now, if I am looking for a row with even numbered values a sequential scan would be better. It all depends on what info you are trying to get to. Check out this link for some good info on distributions and update statistics by some guy, http://www-1.ibm.com/partnerworld/pw...bizinformix_08 mar.html (Did I get it right John?) Andrew Ford -----Original Message----- From: informix-list-bounces@iiug.org [mailto:informix-list-bounces@iiug.org] On Behalf Of Floyd Wellershaus Sent: Wednesday, August 02, 2006 6:08 PM To: informix-list@iiug.org Subject: data distribution question With the below values for the mbntfhst_token column, which is being joined to in the query, I would think these distributions show way too much duplication for an index to be effective. Yet, when I force a scan on that table, the cost goes way way up. What am I missing here ? Thank you. Floyd 1: ( 1973055, 13906, 14381) 2: ( 1973055, 9364, 25763) 3: ( 1973055, 9915, 35887) 4: ( 1973055, 5803, 41746) 5: ( 1973055, 4832, 46654) 6: ( 1973055, 5731, 52437) 7: ( 1973055, 5548, 58053) 8: ( 1973055, 6483, 65065) 9: ( 1973055, 11890, 77038) 10: ( 1973055, 10660, 88268) 11: ( 1973055, 5761, 94404) 12: ( 1973055, 4753, 99639) 13: ( 1973055, 8261, 107909) 14: ( 1973055, 7536, 116822) 15: ( 1973055, 4875, 123111) 16: ( 1973055, 3135, 127073) 17: ( 1973055, 5571, 134148) ... ... Also, take a look at the overflow bucket for that column. 1: ( 1509940, 206221) 2: ( 499152, 249386) 3: ( 568126, 250016) 4: ( 522169, 276801) 5: ( 495874, 277207) 6: ( 564699, 305435) 7: ( 584721, 305658) 8: ( 507669, 305889) 9: ( 496176, 324590) 10: ( 545764, 335032) 11: ( 575454, 337654) 12: ( 662279, 338354) 13: ( 655045, 357904) 14: ( 1352666, 359863) 15: ( 684749, 368087) 16: ( 521739, 368452) 17: ( 687000, 371474) 18: ( 528608, 371973) 19: ( 679561, 388584) 20: ( 538942, 392313) 21: ( 494174, 393604) 22: ( 546620, 403774) ... ... ======================== -<<Floyd Wellershaus>>- Database Administrator Unix Administrator email: fwellers@yahoo.com Home: 703-430-0805 Cell: 703-477-6045 ======================== http://www.one.org/ |