vBulletin Search Engine Optimization
| |||||||
| Register | FAQ | Members List | Calendar | Search | Today's Posts | Mark Forums Read |
| ||||
| I have a side project that needs to "intelligently" know if two strings are contextually similar. Think about how CDDB information is collected and sorted. It isn't perfect, but there should be enough information to be usable. Think about this: "pink floyd - dark side of the moon - money" "dark side of the moon - pink floyd - money" "money - dark side of the moon - pink floyd" etc. To a human, these strings are almost identical. Similarly: "dark floyd of money moon pink side the" Is a puzzle to be solved by 13 year old children before the movie starts. My post has three questions: (1) Does anyone know of an efficient and numerically quantified method of detecting these sorts of things? I currently have a fairly inefficient and numerically bogus solution that may be the only non-impossible solution for the problem. (2) Does any one see a need for this feature in PostgreSQL? If so, what kind of interface would be best accepted as a patch? I am currently returning a match liklihood between 0 and 100; (3) Is there also a desire for a Levenshtein distence function for text and varchars? I experimented with it, and was forced to write the function in item #1. ---------------------------(end of broadcast)--------------------------- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to majordomo@postgresql.org so that your message can get through to the mailing list cleanly |
| |||
| On Fri, May 19, 2006 at 04:00:48PM -0400, Mark Woodward wrote: > (3) Is there also a desire for a Levenshtein distence function for text > and varchars? I experimented with it, and was forced to write the function > in item #1. Postgres already has a Levenshtein distence function, see fuzzystrmatch in contrib. Whatever you come up with might fit in well there... Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > From each according to his ability. To each according to his ability to litigate. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.1 (GNU/Linux) iD8DBQFEbiJ4IB7bNG8LQkwRAhuKAJ9SEPT36opokoZ73ngS3F RWn0KznACeJ/8T ks/jWPMuUcXE7J0rXCj2X1c= =CMop -----END PGP SIGNATURE----- |
| |||
| Mark Woodward wrote: > > (3) Is there also a desire for a Levenshtein distence function for text > and varchars? I experimented with it, and was forced to write the function > in item #1. > > fuzzystrmatch in contrib already has a Levenshtein function. cheers andrew ---------------------------(end of broadcast)--------------------------- TIP 6: explain analyze is your friend |
| |||
| Mark Woodward wrote: > I have a side project that needs to "intelligently" know if two strings > are contextually similar. Think about how CDDB information is collected > and sorted. It isn't perfect, but there should be enough information to be > usable. > > Think about this: > > "pink floyd - dark side of the moon - money" > "dark side of the moon - pink floyd - money" > "money - dark side of the moon - pink floyd" > etc. > > To a human, these strings are almost identical. Similarly: > > "dark floyd of money moon pink side the" > > Is a puzzle to be solved by 13 year old children before the movie starts. > > My post has three questions: > > (1) Does anyone know of an efficient and numerically quantified method of > detecting these sorts of things? I currently have a fairly inefficient and > numerically bogus solution that may be the only non-impossible solution > for the problem. > > (2) Does any one see a need for this feature in PostgreSQL? If so, what > kind of interface would be best accepted as a patch? I am currently > returning a match liklihood between 0 and 100; > > (3) Is there also a desire for a Levenshtein distence function for text > and varchars? I experimented with it, and was forced to write the function > in item #1. The Levenshtein distance (also known as "edit distance") won't really give you what you want above, because operations to transplant whole chunks of the string aren't supported. (You can simulate it with inserts and deletes, but you pay individually for each of them.) Also, Levenshtein distances don't charge much for changing a word into a similarly spelled but semantically distinct word, such as "word" => "work". What you would want, I think, is some function that recognizes that the whole substring "pink floyd" has been moved from the beginning to the middle of the string, and only charges you a small edit cost for having done so. It would need to recognize both the word boundaries and the transplants. Off the top of my head, I'm not sure how you would achieve that with good runtime characteristics. You can go even further and allow synonyms, so that "pink floyd" is more related to "red floyd" than it is to "large floyd", but for that sort of thing you would probably need to pull in wordnet. If you want to notice that two strings contain local similarity, but don't have an overall good Levenshtein distance, take a look at global vs. local alignment algorithms used in biological applications. Local alignment can be achieved in O(n*m) time, where n and m are the lengths of the two strings, using the Smith-Waterman algorithm. (Temple Smith and Michael Waterman). There are faster heuristic algorithms, but they don't have the same guarantees. These local alignments might tell you something useful as a part of the overall solution. Hmmm... I think I like this problem. Maybe I'll work on it a bit as a contrib module. mark |
| |||
| > Mark Woodward wrote: >> I have a side project that needs to "intelligently" know if two strings >> are contextually similar. Think about how CDDB information is collected >> and sorted. It isn't perfect, but there should be enough information to >> be >> usable. >> >> Think about this: >> >> "pink floyd - dark side of the moon - money" >> "dark side of the moon - pink floyd - money" >> "money - dark side of the moon - pink floyd" >> etc. >> >> To a human, these strings are almost identical. Similarly: >> >> "dark floyd of money moon pink side the" >> >> Is a puzzle to be solved by 13 year old children before the movie >> starts. [snip] > > Hmmm... I think I like this problem. Maybe I'll work on it a bit as a > contrib > module. I *have* a working function, but it is not very efficient and it is not what I would call numerically predictable. And it does find the various sub-strings between the two strings in question. Email me offline and we can make something for contrib. ---------------------------(end of broadcast)--------------------------- TIP 6: explain analyze is your friend |
| |||
| -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 > I have a side project that needs to "intelligently" know if two strings > are contextually similar. The examples you gave seem heavy on word order and whitespace consideration, before applying any algorithms. Here's a quick perl version that does the job: CREATE OR REPLACE FUNCTION matchval(text,text) RETURNS INT LANGUAGE plperlu AS $$ use strict; use String::Approx 'adist'; my $uno = join ' ', sort split /\s+/ => lc shift; my $dos = join ' ', sort split /\s+/ => lc shift; return adist(length $uno<length $dos ? ($uno,$dos) : ($dos,$uno)); $$; Some sample runs: SELECT matchval('pink floyd - dark side of the moon - money', 'dark side of the moon - pink floyd - money'); SELECT matchval('dark floyd of money moon pink side the', 'Money - dark side of the moon - Pink Floyd'); SELECT matchval('dark floyd of money moon pink side the', 'monee - drk sidez of da moon - pink floyd'); SELECT matchval('dark floyd of money moon pink side the', 'pink floyd - animals'); SELECT matchval('dark floyd of money moon pink side the', 'walking on the moon - the police'); The above returns 0, 0, 6, 10, and 17; a score of 0 is an exact match. - -- Greg Sabino Mullane greg@turnstep.com PGP Key: 0x14964AC8 200605191835 http://biglumber.com/x/web?pk=2529DF...9B906714964AC8 -----BEGIN PGP SIGNATURE----- iD8DBQFEbktUvJuQZxSWSsgRAiCtAJ9nlpqGxlYnimDPp8t5XQ sc8y9RywCfZZL6 iU9iPnxHaWOvYCUD7+rK8Do= =zo3T -----END PGP SIGNATURE----- ---------------------------(end of broadcast)--------------------------- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq |
| |||
| > > I have a side project that needs to "intelligently" know if two > > strings are contextually similar. Also check out the "fuzzystrmatch" module in /contrib, which offers soundex, metaphone and levenschtein functions. -- --Josh Josh Berkus PostgreSQL @ Sun San Francisco ---------------------------(end of broadcast)--------------------------- TIP 6: explain analyze is your friend |
| |||
| > > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > > >> I have a side project that needs to "intelligently" know if two strings >> are contextually similar. > > The examples you gave seem heavy on word order and whitespace > consideration, > before applying any algorithms. Here's a quick perl version that does the > job: [SNIP] This is a case where the example was too simple to explain the problem, sorry. I have an implementation of Oracle's "contains" function for PostgreSQL, and it does basically what you are doing, and, in fact, also has Mohawk Software Extensions (LOL) that provide metaphone. The problem is that parsing white space realy isn't reliable. Sometimes it is pinkfloyd-darksideofthemoon. Also, I have been thinking of other applications. I have a piece of code that does this: apps$ ./stratest "pink foyd dark side of the moon money" "money dark side of the moon pink floyd" Match: dark side of the moon Match: pink f Match: money Match: oyd apps$ ./stratest "pinkfoyddarksideofthemoonmoney" "moneydarksideofthemoonpinkfloyd" Match: darksideofthemoon Match: pinkf Match: money Match: oyd I need to come up with a numerically sane way of taking this information and understanding overall "similarity." ---------------------------(end of broadcast)--------------------------- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq |
| |||
| Get pg_trgm http://www.sai.msu.su/~megera/oddmus...cgi/ReadmeTrgm It doesn't depends on language. Oleg On Fri, 19 May 2006, Mark Woodward wrote: > I have a side project that needs to "intelligently" know if two strings > are contextually similar. Think about how CDDB information is collected > and sorted. It isn't perfect, but there should be enough information to be > usable. > > Think about this: > > "pink floyd - dark side of the moon - money" > "dark side of the moon - pink floyd - money" > "money - dark side of the moon - pink floyd" > etc. > > To a human, these strings are almost identical. Similarly: > > "dark floyd of money moon pink side the" > > Is a puzzle to be solved by 13 year old children before the movie starts. > > My post has three questions: > > (1) Does anyone know of an efficient and numerically quantified method of > detecting these sorts of things? I currently have a fairly inefficient and > numerically bogus solution that may be the only non-impossible solution > for the problem. > > (2) Does any one see a need for this feature in PostgreSQL? If so, what > kind of interface would be best accepted as a patch? I am currently > returning a match liklihood between 0 and 100; > > (3) Is there also a desire for a Levenshtein distence function for text > and varchars? I experimented with it, and was forced to write the function > in item #1. > > > ---------------------------(end of broadcast)--------------------------- > TIP 1: if posting/reading through Usenet, please send an appropriate > subscribe-nomail command to majordomo@postgresql.org so that your > message can get through to the mailing list cleanly > Regards, Oleg __________________________________________________ ___________ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83 ---------------------------(end of broadcast)--------------------------- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match |
| ||||
| > Get pg_trgm http://www.sai.msu.su/~megera/oddmus...cgi/ReadmeTrgm > It doesn't depends on language. That's an interesting approach. This is what I got: apps$ ./stratest "pink floyd dark side of the moon money" "dark side of the moon pink floyd" Match: dark side of the moon Match: pink floyd Similarity: 89 One function finds the substring runs, in descending order of length, between the two strings. After the function, I have number of runs, length of best run, total number of characters matched. Without going into too lengthy description, while space and punctuation are not reliable. Like this "pinkfloyd" or "pink floyd" "darkside" or "dark side" Humans are VERY good at seeing these things, computers, pardon, suck. What I was hoping someone had was a function that could find the substring runs in something less than a strlen1*strlen2 number of operations and a numerically sane way of representing the similarity or difference. ---------------------------(end of broadcast)--------------------------- TIP 4: Have you searched our list archives? http://archives.postgresql.org |