This is a discussion on improving random record selection within the MySQL General forum forums, part of the MySQL category; --> I posted this a month or so ago, and was helped a little, but I am now back. Currently ...
| |||||||
| Register | FAQ | Members List | Calendar | Search | Today's Posts | Mark Forums Read |
| ||||
| I posted this a month or so ago, and was helped a little, but I am now back. Currently I use select x, y, z from images where (condition) order by rand() limit 1; As most know, it is slow, depending on the record set, and what I compare it to, it can be from one order of magnitude slower, to several. I have cobbled together a solution, but it on occasion, returns an empty result set, which is causing me grief. I should mention, there are holes in my id column, and I am needing to select a set based on a condition. The below sql I do not fully understand either, if someone could step me through it, I would certainly appreciate it, though my main goal is to figure out why I get an empty set on occasion. $sql = " SELECT storage_path, image_md5, t.id FROM images AS t JOIN (SELECT CEIL(MAX(id)*RAND()) AS id FROM images) AS x ON (t.id >= x.id) AND (t.approved = 1) AND (t.ip_address != '$exclude_ip') LIMIT 1"; * I could almost live with the slow speed of an order by rand() but I find it has a less than even distribution. While it indeed may be very random, I am looking for a pretty flat response in distribution of returned records over time. -- Scott talklists@newgeo.com |
| |||
| On Sat, May 17, 2008 at 2:32 PM, Scott Haneda <talklists@newgeo.com> wrote: > $sql = " > SELECT storage_path, image_md5, t.id > FROM images AS t > JOIN > (SELECT CEIL(MAX(id)*RAND()) AS id FROM images) AS x > ON (t.id >= x.id) > AND (t.approved = 1) AND (t.ip_address != > '$exclude_ip') LIMIT 1"; I am going to reformat (whitespace only) your query a bit to start out with: SELECT storage_path, image_md5, t.id FROM images AS t JOIN ( SELECT CEIL( MAX(id)*RAND() ) AS id FROM images ) AS x ON (t.id >= x.id) AND (t.approved = 1) AND (t.ip_address != '$exclude_ip') LIMIT 1 I am going to break this up a bit: SELECT storage_path, image_md5, t.id FROM images AS t JOIN .... This should be mostly self explanatory. t.id specifies the table because id is ambiguous (x.id is created later on). Please note that I never use JOIN by itself. I would make this an INNER JOIN. SELECT CEIL( MAX(id)*RAND() ) AS id FROM images AS x MAX(id) find the largest id that currently exists. This value is then multiplied by whatever rand returns, which would be a between 0 and 1. The result of the multiplication is then rounded up, and aliased as id. The the table (of one row) is then aliased as x. So you now have x.id which is a random number between 0 and the largest id value that currently exists. ON (t.id >= x.id) AND (t.approved = 1) AND (t.ip_address != '$exclude_ip') Finally we have your JOIN condition. It says, for the table aliased as t, the id must be great than or equal to x.id (which was explained above). This will eliminate some portion of the images table from the possibility of being selected. Next all rows in the same table where approved is not equal to 1 should be removed. Finally all rows that fail t.ip_address != '$exclude_ip' get excluded. LIMIT 1 Only return one row. Problems: 1. You should be using: AND (t.approved = 1) AND (t.ip_address != '$exclude_ip') in the subquery. If x.id is larger than the largest row that fits those conditions you will get no results. 2. There is no ORDER BY clause. There is nothing telling MySQL use the t.id which is next largest value above x.id. MySQL will probably pick out the right row, because they are probably stored in order. You probably can get away with not having the ORDER BY clause, and it will cost you extra cycles. How many extra cycles depends on how out of order the table is. You can reorder the row by id using: ALTER TABLE images ORDER BY id; 3. If the holes in your data are not distributed equally... Suggested new query: SELECT storage_path, image_md5, t.id FROM images AS t INNER JOIN ( SELECT CEIL( MAX(id)*RAND() ) AS id FROM images WHERE x.approved = 1 AND x.ip_address != '$exclude_ip' ) AS x ON (t.id >= x.id) ORDER BY t.id ASC LIMIT 1 -- Rob Wultsch wultsch@gmail.com wultsch (aim) |
| |||
| >From: Scott Haneda [mailto:talklists@newgeo.com] >Sent: Saturday, May 17, 2008 5:32 PM >To: mysql@lists.mysql.com >Subject: improving random record selection > >I posted this a month or so ago, and was helped a little, but I am now >back. > >Currently I use select x, y, z from images where (condition) order by >rand() limit 1; > >As most know, it is slow, depending on the record set, and what I >compare it to, it can be from one order of magnitude slower, to several. > >I have cobbled together a solution, but it on occasion, returns an >empty result set, which is causing me grief. I should mention, there >are holes in my id column, and I am needing to select a set based on a >condition. > >The below sql I do not fully understand either, if someone could step >me through it, I would certainly appreciate it, though my main goal is >to figure out why I get an empty set on occasion. > > $sql = " > SELECT storage_path, image_md5, t.id > FROM images AS t > JOIN > (SELECT CEIL(MAX(id)*RAND()) AS id FROM images) AS x >ON (t.id >= >x.id) > AND (t.approved = 1) AND (t.ip_address != >'$exclude_ip') LIMIT 1"; > >* I could almost live with the slow speed of an order by rand() but I >find it has a less than even distribution. While it indeed may be >very random, I am looking for a pretty flat response in distribution >of returned records over time. [JS] You say you want a "flat" distribution; by that I think you mean that the probability of selecting any given record is the same. If you have gaps in your data, I can't think of any way to do that other than be assigning a unique and sequential ID to each record. If you ever delete a record, you'd have to renumber the remaining ones. Then you'd pick off a random value for this unique ID. At first glance, this seems to be the only way to avoid sampling errors. If there were some way of setting a cursor to an arbitrary record, that would work very well; but you don't want to be stepping sequentially through (on average) half or your records. >-- >Scott >talklists@newgeo.com > > >-- >MySQL General Mailing List >For list archives: http://lists.mysql.com/mysql >To unsubscribe: http://lists.mysql.com/mysql?unsub=jschwartz@the- >infoshop.com |
| |||
| >-----Original Message----- >From: Rob Wultsch [mailto:wultsch@gmail.com] >Sent: Saturday, May 17, 2008 6:47 PM >To: Scott Haneda >Cc: mysql@lists.mysql.com >Subject: Re: improving random record selection > >On Sat, May 17, 2008 at 2:32 PM, Scott Haneda <talklists@newgeo.com> >wrote: >> $sql = " >> SELECT storage_path, image_md5, t.id >> FROM images AS t >> JOIN >> (SELECT CEIL(MAX(id)*RAND()) AS id FROM images) >AS x >> ON (t.id >= x.id) >> AND (t.approved = 1) AND (t.ip_address != >> '$exclude_ip') LIMIT 1"; >I am going to reformat (whitespace only) your query a bit to start out >with: >SELECT storage_path, image_md5, t.id >FROM images AS t > JOIN ( > SELECT CEIL( > MAX(id)*RAND() > ) AS id > FROM images > ) AS x ON (t.id >= x.id) > AND (t.approved = 1) > AND (t.ip_address != '$exclude_ip') >LIMIT 1 > >I am going to break this up a bit: >SELECT storage_path, image_md5, t.id >FROM images AS t >JOIN .... >This should be mostly self explanatory. t.id specifies the table >because id is ambiguous (x.id is created later on). Please note that I >never use JOIN by itself. I would make this an INNER JOIN. > >SELECT CEIL( > MAX(id)*RAND() > ) AS id >FROM images AS x > >MAX(id) find the largest id that currently exists. This value is then >multiplied by whatever rand returns, which would be a between 0 and 1. >The result of the multiplication is then rounded up, and aliased as >id. The the table (of one row) is then aliased as x. So you now have >x.id which is a random number between 0 and the largest id value that >currently exists. > >ON (t.id >= x.id) > AND (t.approved = 1) > AND (t.ip_address != '$exclude_ip') >Finally we have your JOIN condition. It says, for the table aliased as >t, the id must be great than or equal to x.id (which was explained >above). This will eliminate some portion of the images table from the >possibility of being selected. Next all rows in the same table where >approved is not equal to 1 should be removed. Finally all rows that >fail t.ip_address != '$exclude_ip' get excluded. > >LIMIT 1 >Only return one row. > >Problems: >1. You should be using: > AND (t.approved = 1) > AND (t.ip_address != '$exclude_ip') >in the subquery. If x.id is larger than the largest row that fits >those conditions you will get no results. >2. There is no ORDER BY clause. There is nothing telling MySQL use the >t.id which is next largest value above x.id. MySQL will probably pick >out the right row, because they are probably stored in order. You >probably can get away with not having the ORDER BY clause, and it will >cost you extra cycles. How many extra cycles depends on how out of >order the table is. You can reorder the row by id using: >ALTER TABLE images ORDER BY id; >3. If the holes in your data are not distributed equally... > >Suggested new query: >SELECT storage_path, image_md5, t.id >FROM images AS t > INNER JOIN ( > SELECT CEIL( > MAX(id)*RAND() > ) AS id > FROM images > WHERE x.approved = 1 > AND x.ip_address != '$exclude_ip' > ) AS x ON (t.id >= x.id) >ORDER BY t.id ASC >LIMIT 1 > [JS] I might not understand what this is doing, but I think it will preferentially sample the ids that are at the end of a gap. >-- >Rob Wultsch >wultsch@gmail.com >wultsch (aim) > >-- >MySQL General Mailing List >For list archives: http://lists.mysql.com/mysql >To unsubscribe: http://lists.mysql.com/mysql?unsub=jschwartz@the- >infoshop.com |
| |||
| On Mon, May 19, 2008 at 7:24 AM, Jerry Schwartz <jschwartz@the-infoshop.com> wrote: > I might not understand what this is doing, but I think it will preferentially sample the ids that are at the end of a gap. What don't you understand about the query or the way I described it? > You say you want a "flat" distribution; by that I think you mean that the probability of selecting any given record is the same. If you have gaps in your data, I can't think of any way to do that other than be assigning a unique and sequential ID to each record. If you ever delete a record, you'd have to renumber the remaining ones. Then you'd pick off a random value for this unique ID. There are alternatives. (generating a random number for each row for example, take a look at the original conversation). Having to keep the sequence holeless would be a pain in the back side, but could be done with a trigger running something like I describe in the thread -> http://lists.mysql.com/mysql/212838 . -- Rob Wultsch wultsch@gmail.com wultsch (aim) |
| ||||
| >-----Original Message----- >From: Rob Wultsch [mailto:wultsch@gmail.com] >Sent: Monday, May 19, 2008 11:20 AM >To: Jerry Schwartz >Cc: Scott Haneda; mysql@lists.mysql.com >Subject: Re: improving random record selection > >On Mon, May 19, 2008 at 7:24 AM, Jerry Schwartz ><jschwartz@the-infoshop.com> wrote: >> I might not understand what this is doing, but I think it will >preferentially sample the ids that are at the end of a gap. > >What don't you understand about the query or the way I described it? [JS] I was being cautious, I didn't have the wit or time to go over it in detail. > >> You say you want a "flat" distribution; by that I think you mean that >the probability of selecting any given record is the same. If you have >gaps in your data, I can't think of any way to do that other than be >assigning a unique and sequential ID to each record. If you ever delete >a record, you'd have to renumber the remaining ones. Then you'd pick off >a random value for this unique ID. > >There are alternatives. (generating a random number for each row for >example, take a look at the original conversation). Having to keep the >sequence holeless would be a pain in the back side, but could be done >with a trigger running something like I describe in the thread -> >http://lists.mysql.com/mysql/212838 . [JS] I think this would work: SET @rand_rec_num = (SELECT CAST(FLOOR(RAND() * COUNT(*) + 1) AS UNSIGNED) FROM bunya_map); PREPARE get_rand_rec FROM "SELECT * FROM bunya_map LIMIT ?, 1"; EXECUTE get_rand_rec USING @rand_rec_num; I suppose this could be put into a user function, if you only need a single value passed back. > >-- >Rob Wultsch >wultsch@gmail.com >wultsch (aim) |