This is a discussion on How to group Typo entries in a database within the MySQL forums, part of the Database Server Software category; --> Hi, I have a big database of about 14000 records. The entries have a lot of typos like punctuation. ...
| |||||||
| FAQ | Members List | Calendar | Search | Today's Posts | Mark Forums Read |
| ||||
| Hi, I have a big database of about 14000 records. The entries have a lot of typos like punctuation. ie. same name has punctuation in one entry and no punctuation in another. These are showing up as unique records when I do a search. Is there a way to do it using PERL/Mysql where I can check for a percentage match of the strings and then do a replace? I would like to have all these records to show up as a group instead of unique records. Manual correction is not too easy. Thanks. |
| |||
| Without knowing a bit more about the kind of entries I can't really say - but I think some manual correction is likely to be unavoidable! That aside, have a look at the soundex() function. Depending on the size of the entries it may be offer a pracitcal solution - at least for grouping similar entries - something like; SELECT DISTINCT(SOUNDEX(t1.entry)),t2.entry FROM table t1 LEFT JOIN table t2 ON SOUNDEX(t2.entry) = SOUNDEX(t1.entry) LIMIT 10; The LIMIT clause may be significant here because this kind of comparison (without an index) could take a very long time! You could also select the reuslts into a temporary table first - and then create the indexes necessary to speed up the comparison. premgrps@gmail.com wrote: > Hi, > I have a big database of about 14000 records. The entries have a lot > of typos like punctuation. ie. same name has punctuation in one entry > and no punctuation in another. These are showing up as unique records > when I do a search. Is there a way to do it using PERL/Mysql where I > can check for a percentage match of the strings and then do a replace? > I would like to have all these records to show up as a group instead of > unique records. > > Manual correction is not too easy. > > Thanks. |
| ||||
| premgrps@gmail.com wrote: > I have a big database of about 14000 records. The entries have a lot > of typos like punctuation. ie. same name has punctuation in one entry > and no punctuation in another. These are showing up as unique records > when I do a search. This is a very common problem. Database workers are frequently called upon to perform laborious "data cleanup" projects that take months, and never quite resolve 100% of cases. > Is there a way to do it using PERL/Mysql where I > can check for a percentage match of the strings and then do a replace? Look into the SOUNDEX() function. It should help with certain types of string comparisons. It won't help find cases like "Bob Smith" should be treated the same as "Robert C. Smith", but it'll help in some types of cases like punctuation differences. http://dev.mysql.com/doc/refman/5.0/...functions.html Regards, Bill K. |