Unix Technical Forum

How to group Typo entries in a database

This is a discussion on How to group Typo entries in a database within the MySQL forums, part of the Database Server Software category; --> Hi, I have a big database of about 14000 records. The entries have a lot of typos like punctuation. ...


Go Back   Unix Technical Forum > Database Server Software > MySQL

FAQ Members List Calendar Search Today's Posts Mark Forums Read
  #1 (permalink)  
Old 02-28-2008, 08:52 AM
premgrps@gmail.com
 
Posts: n/a
Default How to group Typo entries in a database

Hi,
I have a big database of about 14000 records. The entries have a lot
of typos like punctuation. ie. same name has punctuation in one entry
and no punctuation in another. These are showing up as unique records
when I do a search. Is there a way to do it using PERL/Mysql where I
can check for a percentage match of the strings and then do a replace?
I would like to have all these records to show up as a group instead of
unique records.

Manual correction is not too easy.

Thanks.

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #2 (permalink)  
Old 02-28-2008, 08:52 AM
strawberry
 
Posts: n/a
Default Re: How to group Typo entries in a database

Without knowing a bit more about the kind of entries I can't really say
- but I think some manual correction is likely to be unavoidable!

That aside, have a look at the soundex() function. Depending on the
size of the entries it may be offer a pracitcal solution - at least for
grouping similar entries - something like;

SELECT DISTINCT(SOUNDEX(t1.entry)),t2.entry
FROM table t1
LEFT JOIN table t2
ON SOUNDEX(t2.entry) = SOUNDEX(t1.entry)
LIMIT 10;

The LIMIT clause may be significant here because this kind of
comparison (without an index) could take a very long time! You could
also select the reuslts into a temporary table first - and then create
the indexes necessary to speed up the comparison.

premgrps@gmail.com wrote:
> Hi,
> I have a big database of about 14000 records. The entries have a lot
> of typos like punctuation. ie. same name has punctuation in one entry
> and no punctuation in another. These are showing up as unique records
> when I do a search. Is there a way to do it using PERL/Mysql where I
> can check for a percentage match of the strings and then do a replace?
> I would like to have all these records to show up as a group instead of
> unique records.
>
> Manual correction is not too easy.
>
> Thanks.


Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #3 (permalink)  
Old 02-28-2008, 08:52 AM
Bill Karwin
 
Posts: n/a
Default Re: How to group Typo entries in a database

premgrps@gmail.com wrote:
> I have a big database of about 14000 records. The entries have a lot
> of typos like punctuation. ie. same name has punctuation in one entry
> and no punctuation in another. These are showing up as unique records
> when I do a search.


This is a very common problem. Database workers are frequently called
upon to perform laborious "data cleanup" projects that take months, and
never quite resolve 100% of cases.

> Is there a way to do it using PERL/Mysql where I
> can check for a percentage match of the strings and then do a replace?


Look into the SOUNDEX() function. It should help with certain types of
string comparisons. It won't help find cases like "Bob Smith" should be
treated the same as "Robert C. Smith", but it'll help in some types of
cases like punctuation differences.

http://dev.mysql.com/doc/refman/5.0/...functions.html

Regards,
Bill K.
Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
Reply


Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are On
Forum Jump


All times are GMT. The time now is 04:51 PM.


Powered by vBulletin® Version 3.6.5
Copyright ©2000 - 2008, Jelsoft Enterprises Ltd.
SEO by vBSEO 3.2.0
www.UnixAdminTalk.com