Unix Technical Forum

improving random record selection

This is a discussion on improving random record selection within the MySQL General forum forums, part of the MySQL category; --> I posted this a month or so ago, and was helped a little, but I am now back. Currently ...


Go Back   Unix Technical Forum > Database Server Software > MySQL > MySQL General forum

Register FAQ Members List Calendar Search Today's Posts Mark Forums Read
  #1 (permalink)  
Old 05-18-2008, 10:02 PM
Scott Haneda
 
Posts: n/a
Default improving random record selection

I posted this a month or so ago, and was helped a little, but I am now
back.

Currently I use select x, y, z from images where (condition) order by
rand() limit 1;

As most know, it is slow, depending on the record set, and what I
compare it to, it can be from one order of magnitude slower, to several.

I have cobbled together a solution, but it on occasion, returns an
empty result set, which is causing me grief. I should mention, there
are holes in my id column, and I am needing to select a set based on a
condition.

The below sql I do not fully understand either, if someone could step
me through it, I would certainly appreciate it, though my main goal is
to figure out why I get an empty set on occasion.

$sql = "
SELECT storage_path, image_md5, t.id
FROM images AS t
JOIN
(SELECT CEIL(MAX(id)*RAND()) AS id FROM images) AS x ON (t.id >=
x.id)
AND (t.approved = 1) AND (t.ip_address != '$exclude_ip') LIMIT 1";

* I could almost live with the slow speed of an order by rand() but I
find it has a less than even distribution. While it indeed may be
very random, I am looking for a pretty flat response in distribution
of returned records over time.
--
Scott
talklists@newgeo.com

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #2 (permalink)  
Old 05-18-2008, 10:02 PM
Rob Wultsch
 
Posts: n/a
Default Re: improving random record selection

On Sat, May 17, 2008 at 2:32 PM, Scott Haneda <talklists@newgeo.com> wrote:
> $sql = "
> SELECT storage_path, image_md5, t.id
> FROM images AS t
> JOIN
> (SELECT CEIL(MAX(id)*RAND()) AS id FROM images) AS x
> ON (t.id >= x.id)
> AND (t.approved = 1) AND (t.ip_address !=
> '$exclude_ip') LIMIT 1";

I am going to reformat (whitespace only) your query a bit to start out with:
SELECT storage_path, image_md5, t.id
FROM images AS t
JOIN (
SELECT CEIL(
MAX(id)*RAND()
) AS id
FROM images
) AS x ON (t.id >= x.id)
AND (t.approved = 1)
AND (t.ip_address != '$exclude_ip')
LIMIT 1

I am going to break this up a bit:
SELECT storage_path, image_md5, t.id
FROM images AS t
JOIN ....
This should be mostly self explanatory. t.id specifies the table
because id is ambiguous (x.id is created later on). Please note that I
never use JOIN by itself. I would make this an INNER JOIN.

SELECT CEIL(
MAX(id)*RAND()
) AS id
FROM images AS x

MAX(id) find the largest id that currently exists. This value is then
multiplied by whatever rand returns, which would be a between 0 and 1.
The result of the multiplication is then rounded up, and aliased as
id. The the table (of one row) is then aliased as x. So you now have
x.id which is a random number between 0 and the largest id value that
currently exists.

ON (t.id >= x.id)
AND (t.approved = 1)
AND (t.ip_address != '$exclude_ip')
Finally we have your JOIN condition. It says, for the table aliased as
t, the id must be great than or equal to x.id (which was explained
above). This will eliminate some portion of the images table from the
possibility of being selected. Next all rows in the same table where
approved is not equal to 1 should be removed. Finally all rows that
fail t.ip_address != '$exclude_ip' get excluded.

LIMIT 1
Only return one row.

Problems:
1. You should be using:
AND (t.approved = 1)
AND (t.ip_address != '$exclude_ip')
in the subquery. If x.id is larger than the largest row that fits
those conditions you will get no results.
2. There is no ORDER BY clause. There is nothing telling MySQL use the
t.id which is next largest value above x.id. MySQL will probably pick
out the right row, because they are probably stored in order. You
probably can get away with not having the ORDER BY clause, and it will
cost you extra cycles. How many extra cycles depends on how out of
order the table is. You can reorder the row by id using:
ALTER TABLE images ORDER BY id;
3. If the holes in your data are not distributed equally...

Suggested new query:
SELECT storage_path, image_md5, t.id
FROM images AS t
INNER JOIN (
SELECT CEIL(
MAX(id)*RAND()
) AS id
FROM images
WHERE x.approved = 1
AND x.ip_address != '$exclude_ip'
) AS x ON (t.id >= x.id)
ORDER BY t.id ASC
LIMIT 1

--
Rob Wultsch
wultsch@gmail.com
wultsch (aim)
Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #3 (permalink)  
Old 05-20-2008, 05:55 PM
Jerry Schwartz
 
Posts: n/a
Default RE: improving random record selection

>From: Scott Haneda [mailto:talklists@newgeo.com]
>Sent: Saturday, May 17, 2008 5:32 PM
>To: mysql@lists.mysql.com
>Subject: improving random record selection
>
>I posted this a month or so ago, and was helped a little, but I am now
>back.
>
>Currently I use select x, y, z from images where (condition) order by
>rand() limit 1;
>
>As most know, it is slow, depending on the record set, and what I
>compare it to, it can be from one order of magnitude slower, to several.
>
>I have cobbled together a solution, but it on occasion, returns an
>empty result set, which is causing me grief. I should mention, there
>are holes in my id column, and I am needing to select a set based on a
>condition.
>
>The below sql I do not fully understand either, if someone could step
>me through it, I would certainly appreciate it, though my main goal is
>to figure out why I get an empty set on occasion.
>
> $sql = "
> SELECT storage_path, image_md5, t.id
> FROM images AS t
> JOIN
> (SELECT CEIL(MAX(id)*RAND()) AS id FROM images) AS x
>ON (t.id >=
>x.id)
> AND (t.approved = 1) AND (t.ip_address !=
>'$exclude_ip') LIMIT 1";
>
>* I could almost live with the slow speed of an order by rand() but I
>find it has a less than even distribution. While it indeed may be
>very random, I am looking for a pretty flat response in distribution
>of returned records over time.

[JS] You say you want a "flat" distribution; by that I think you mean that
the probability of selecting any given record is the same. If you have gaps
in your data, I can't think of any way to do that other than be assigning a
unique and sequential ID to each record. If you ever delete a record, you'd
have to renumber the remaining ones. Then you'd pick off a random value for
this unique ID.

At first glance, this seems to be the only way to avoid sampling errors.

If there were some way of setting a cursor to an arbitrary record, that
would work very well; but you don't want to be stepping sequentially through
(on average) half or your records.
>--
>Scott
>talklists@newgeo.com
>
>
>--
>MySQL General Mailing List
>For list archives: http://lists.mysql.com/mysql
>To unsubscribe: http://lists.mysql.com/mysql?unsub=jschwartz@the-
>infoshop.com





Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #4 (permalink)  
Old 05-20-2008, 05:55 PM
Jerry Schwartz
 
Posts: n/a
Default RE: improving random record selection

>-----Original Message-----
>From: Rob Wultsch [mailto:wultsch@gmail.com]
>Sent: Saturday, May 17, 2008 6:47 PM
>To: Scott Haneda
>Cc: mysql@lists.mysql.com
>Subject: Re: improving random record selection
>
>On Sat, May 17, 2008 at 2:32 PM, Scott Haneda <talklists@newgeo.com>
>wrote:
>> $sql = "
>> SELECT storage_path, image_md5, t.id
>> FROM images AS t
>> JOIN
>> (SELECT CEIL(MAX(id)*RAND()) AS id FROM images)

>AS x
>> ON (t.id >= x.id)
>> AND (t.approved = 1) AND (t.ip_address !=
>> '$exclude_ip') LIMIT 1";

>I am going to reformat (whitespace only) your query a bit to start out
>with:
>SELECT storage_path, image_md5, t.id
>FROM images AS t
> JOIN (
> SELECT CEIL(
> MAX(id)*RAND()
> ) AS id
> FROM images
> ) AS x ON (t.id >= x.id)
> AND (t.approved = 1)
> AND (t.ip_address != '$exclude_ip')
>LIMIT 1
>
>I am going to break this up a bit:
>SELECT storage_path, image_md5, t.id
>FROM images AS t
>JOIN ....
>This should be mostly self explanatory. t.id specifies the table
>because id is ambiguous (x.id is created later on). Please note that I
>never use JOIN by itself. I would make this an INNER JOIN.
>
>SELECT CEIL(
> MAX(id)*RAND()
> ) AS id
>FROM images AS x
>
>MAX(id) find the largest id that currently exists. This value is then
>multiplied by whatever rand returns, which would be a between 0 and 1.
>The result of the multiplication is then rounded up, and aliased as
>id. The the table (of one row) is then aliased as x. So you now have
>x.id which is a random number between 0 and the largest id value that
>currently exists.
>
>ON (t.id >= x.id)
> AND (t.approved = 1)
> AND (t.ip_address != '$exclude_ip')
>Finally we have your JOIN condition. It says, for the table aliased as
>t, the id must be great than or equal to x.id (which was explained
>above). This will eliminate some portion of the images table from the
>possibility of being selected. Next all rows in the same table where
>approved is not equal to 1 should be removed. Finally all rows that
>fail t.ip_address != '$exclude_ip' get excluded.
>
>LIMIT 1
>Only return one row.
>
>Problems:
>1. You should be using:
> AND (t.approved = 1)
> AND (t.ip_address != '$exclude_ip')
>in the subquery. If x.id is larger than the largest row that fits
>those conditions you will get no results.
>2. There is no ORDER BY clause. There is nothing telling MySQL use the
>t.id which is next largest value above x.id. MySQL will probably pick
>out the right row, because they are probably stored in order. You
>probably can get away with not having the ORDER BY clause, and it will
>cost you extra cycles. How many extra cycles depends on how out of
>order the table is. You can reorder the row by id using:
>ALTER TABLE images ORDER BY id;
>3. If the holes in your data are not distributed equally...
>
>Suggested new query:
>SELECT storage_path, image_md5, t.id
>FROM images AS t
> INNER JOIN (
> SELECT CEIL(
> MAX(id)*RAND()
> ) AS id
> FROM images
> WHERE x.approved = 1
> AND x.ip_address != '$exclude_ip'
> ) AS x ON (t.id >= x.id)
>ORDER BY t.id ASC
>LIMIT 1
>

[JS] I might not understand what this is doing, but I think it will
preferentially sample the ids that are at the end of a gap.
>--
>Rob Wultsch
>wultsch@gmail.com
>wultsch (aim)
>
>--
>MySQL General Mailing List
>For list archives: http://lists.mysql.com/mysql
>To unsubscribe: http://lists.mysql.com/mysql?unsub=jschwartz@the-
>infoshop.com





Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #5 (permalink)  
Old 05-20-2008, 05:55 PM
Rob Wultsch
 
Posts: n/a
Default Re: improving random record selection

On Mon, May 19, 2008 at 7:24 AM, Jerry Schwartz
<jschwartz@the-infoshop.com> wrote:
> I might not understand what this is doing, but I think it will preferentially sample the ids that are at the end of a gap.


What don't you understand about the query or the way I described it?

> You say you want a "flat" distribution; by that I think you mean that the probability of selecting any given record is the same. If you have gaps in your data, I can't think of any way to do that other than be assigning a unique and sequential ID to each record. If you ever delete a record, you'd have to renumber the remaining ones. Then you'd pick off a random value for this unique ID.


There are alternatives. (generating a random number for each row for
example, take a look at the original conversation). Having to keep the
sequence holeless would be a pain in the back side, but could be done
with a trigger running something like I describe in the thread ->
http://lists.mysql.com/mysql/212838 .

--
Rob Wultsch
wultsch@gmail.com
wultsch (aim)
Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #6 (permalink)  
Old 05-20-2008, 05:55 PM
Jerry Schwartz
 
Posts: n/a
Default RE: improving random record selection

>-----Original Message-----
>From: Rob Wultsch [mailto:wultsch@gmail.com]
>Sent: Monday, May 19, 2008 11:20 AM
>To: Jerry Schwartz
>Cc: Scott Haneda; mysql@lists.mysql.com
>Subject: Re: improving random record selection
>
>On Mon, May 19, 2008 at 7:24 AM, Jerry Schwartz
><jschwartz@the-infoshop.com> wrote:
>> I might not understand what this is doing, but I think it will

>preferentially sample the ids that are at the end of a gap.
>
>What don't you understand about the query or the way I described it?

[JS] I was being cautious, I didn't have the wit or time to go over it in
detail.
>
>> You say you want a "flat" distribution; by that I think you mean that

>the probability of selecting any given record is the same. If you have
>gaps in your data, I can't think of any way to do that other than be
>assigning a unique and sequential ID to each record. If you ever delete
>a record, you'd have to renumber the remaining ones. Then you'd pick off
>a random value for this unique ID.
>
>There are alternatives. (generating a random number for each row for
>example, take a look at the original conversation). Having to keep the
>sequence holeless would be a pain in the back side, but could be done
>with a trigger running something like I describe in the thread ->
>http://lists.mysql.com/mysql/212838 .

[JS] I think this would work:

SET @rand_rec_num = (SELECT CAST(FLOOR(RAND() * COUNT(*) + 1)
AS UNSIGNED) FROM bunya_map);

PREPARE get_rand_rec FROM "SELECT * FROM bunya_map LIMIT ?, 1";

EXECUTE get_rand_rec USING @rand_rec_num;

I suppose this could be put into a user function, if you only need a single
value passed back.


>
>--
>Rob Wultsch
>wultsch@gmail.com
>wultsch (aim)




Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
Reply


Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are On
Forum Jump


All times are GMT. The time now is 03:04 PM.


Powered by vBulletin® Version 3.6.5
Copyright ©2000 - 2008, Jelsoft Enterprises Ltd.
SEO by vBSEO 3.2.0
www.UnixAdminTalk.com