This is a discussion on Table Design Question within the Informix forums, part of the Database Server Software category; --> I've got this little 30million row table, the entry point wrt to the app will be via a char(12) ...
| |||||||
| FAQ | Members List | Calendar | Search | Today's Posts | Mark Forums Read |
| ||||
| I've got this little 30million row table, the entry point wrt to the app will be via a char(12) unique field. No real hassles, a little big of fragmentation and it's all fine but ... The typical search is <str>* [25%] *<str> [25%] and *<str>* [50%], and to make life more awkward it has to treat certain characters as identical so, for example, 0 [zero] = O [letter] and vice versa. But at least it's not case sensitive. It has a major update applied once per week (about 10%) but apart from that it is static. All smart arse ideas accepted:-)) -- Paul Watson # Oninit Ltd # Growing old is mandatory Tel: +44 1436 672201 # Growing up is optional Fax: +44 1436 678693 # Mob: +44 7818 003457 # www.oninit.com # |
| |||
| On Fri, 23 Jan 2004 06:28:18 -0500, Paul Watson wrote: > I've got this little 30million row table, the entry point wrt to the app > will be via a char(12) unique field. No real hassles, a little big of > fragmentation and it's all fine but ... > > The typical search is <str>* [25%] *<str> [25%] and *<str>* [50%], and to > make life more awkward it has to treat certain characters as identical so, > for example, 0 [zero] = O [letter] and vice versa. But at least it's not > case sensitive. > > It has a major update applied once per week (about 10%) but apart from that > it is static. > > All smart arse ideas accepted:-)) > > Create a functional index to map the equivalent chars for that part. For the partial searches, You might use two mapping tables (or one) containing trailing and contained substrings with the full key column. Then a functional index on the substring column to effect the equivalence mappings. Lead substrings are simple and handled by the functional index on the original column directly. Art S. Kagel |
| |||
| Paul Watson wrote: > I've got this little 30million row table, the entry point wrt to > the app will be via a char(12) unique field. No real hassles, a little > big of fragmentation and it's all fine but ... > > The typical search is <str>* [25%] *<str> [25%] and *<str>* [50%], > and to make life more awkward it has to treat certain characters > as identical so, for example, 0 [zero] = O [letter] and vice versa. > But at least it's not case sensitive. > > It has a major update applied once per week (about 10%) but apart from > that it is static. > > All smart arse ideas accepted:-)) Create a UDT that matches 0 to o and any other stupidities like that. You might want to use Excalibur text index or VTS to do the indexing. -- "C'est pas parce qu'on n'a rien à dire qu'il faut fermer sa gueule" - Coluche |
| |||
| To get round the *<str> I was going to store the string backwards, index and then search <rts>*, this is blindingly fast. Functional indexes get round the odd character mapping, but the problem query is still *<str>* which unfortunately is the main search type. OTC suggest the textblade, which might be an idea and needs to be investigated further "Art S. Kagel" wrote: > > On Fri, 23 Jan 2004 06:28:18 -0500, Paul Watson wrote: > > > I've got this little 30million row table, the entry point wrt to the app > > will be via a char(12) unique field. No real hassles, a little big of > > fragmentation and it's all fine but ... > > > > The typical search is <str>* [25%] *<str> [25%] and *<str>* [50%], and to > > make life more awkward it has to treat certain characters as identical so, > > for example, 0 [zero] = O [letter] and vice versa. But at least it's not > > case sensitive. > > > > It has a major update applied once per week (about 10%) but apart from that > > it is static. > > > > All smart arse ideas accepted:-)) > > > > > Create a functional index to map the equivalent chars for that part. For the > partial searches, You might use two mapping tables (or one) containing > trailing and contained substrings with the full key column. Then a functional > index on the substring column to effect the equivalence mappings. Lead > substrings are simple and handled by the functional index on the original > column directly. > > Art S. Kagel -- Paul Watson # Oninit Ltd # Growing old is mandatory Tel: +44 1436 672201 # Growing up is optional Fax: +44 1436 678693 # Mob: +44 7818 003457 # www.oninit.com # |
| |||
| Paul Watson <paul@oninit.com> wrote in message news:<40110552.6EB20A52@oninit.com>... > I've got this little 30million row table, the entry point wrt to > the app will be via a char(12) unique field. No real hassles, a little > big of fragmentation and it's all fine but ... > > The typical search is <str>* [25%] *<str> [25%] and *<str>* [50%], > and to make life more awkward it has to treat certain characters > as identical so, for example, 0 [zero] = O [letter] and vice versa. > But at least it's not case sensitive. > > It has a major update applied once per week (about 10%) but apart from > that it is static. > > All smart arse ideas accepted:-)) To be really clever about this one would have to know more about the data and queries. I'll suggest some ideas some will be quite insane based on the realities of the situation, but one might work for you. Or it might give you an idea for something that would work. Idea one call it cache searching: if there are repeated queires with the same *<string>* values then cache the results in a relationship table: create table cache_query_results( fk_id integer, str varchar(255) ) ; create unique index cache_query_results_1x on cache_query_results( str, fk_id ) ; -- Only store one STR for each fk_id. This can of course be prepopulated with values that you know are searched for often. Your cqr table can be of a manageable size and since you are mostly static this might work for you. Do some calculations and you should probably limit the size of what you store: for example don't store *e* in your cqr table, don't store words over a certain size in the table or break the *<string>* into words so that you have a dictionary of where things are, etc. you might only want to store strings above some number of characters that appears to keep the size down. Of course if you don't have predictable queries this idea is quite insane. You can even add a cache management table like the following: create table cache( str varchar(255) count integer ) ; to trim the cqr table of things that aren't used much or recently ( change count to timestamp ) another idea is to index on some subset of the key I have used all 2 letter combinations before ( in your data there may not be as many valid two letter combinations as you might think especially because your data is not case sensitive and some letters are treated the same) Of course again this can be awful depending on the data so if each key has 60 2 letter combos you would have a table with 60*30,000,000 rows ouch!( it might still work depending on your hardware) But I think in practice that would not be the case you will have to run counts to see if it is manageable. create table cqr( fk_id int, char_comb char(2) ) ; create unique index cqr_1x( char_comb, fk_id ) ; So a query of *John Smith* would look for: would look like: "Jo" "hn" " S" "mi" "th" ( you can add the other two letter combinations but this should narrow your searh just fine ) select * from cqr where char_comb = "Jo" and char_comb = "hn and char_comb = " S" and char_comb = "mi" and char_comb = "th" and fk_id = id and key matches( "*John Smith*" ) again this is crazy under a lot of circumstances but can be managed with some limits to reduce your search. I know that these ideas are hacks but you are trying to speed up something that is hard to speed up. |
| |||
| Paul Watson <paul@oninit.com> wrote in message news:<40110552.6EB20A52@oninit.com>... > I've got this little 30million row table, the entry point wrt to > the app will be via a char(12) unique field. No real hassles, a little > big of fragmentation and it's all fine but ... > > The typical search is <str>* [25%] *<str> [25%] and *<str>* [50%], > and to make life more awkward it has to treat certain characters > as identical so, for example, 0 [zero] = O [letter] and vice versa. > But at least it's not case sensitive. > > It has a major update applied once per week (about 10%) but apart from > that it is static. > > All smart arse ideas accepted:-)) To be really clever about this one would have to know more about the data and queries. I'll suggest some ideas some will be quite insane based on the realities of the situation, but one might work for you. Or it might give you an idea for something that would work. Idea one call it cache searching: if there are repeated queires with the same *<string>* values then cache the results in a relationship table: create table cache_query_results( fk_id integer, str varchar(255) ) ; create unique index cache_query_results_1x on cache_query_results( str, fk_id ) ; -- Only store one STR for each fk_id. This can of course be prepopulated with values that you know are searched for often. Your cqr table can be of a manageable size and since you are mostly static this might work for you. Do some calculations and you should probably limit the size of what you store: for example don't store *e* in your cqr table, don't store words over a certain size in the table or break the *<string>* into words so that you have a dictionary of where things are, etc. you might only want to store strings above some number of characters that appears to keep the size down. Of course if you don't have predictable queries this idea is quite insane. You can even add a cache management table like the following: create table cache( str varchar(255) count integer ) ; to trim the cqr table of things that aren't used much or recently ( change count to timestamp ) another idea is to index on some subset of the key I have used all 2 letter combinations before ( in your data there may not be as many valid two letter combinations as you might think especially because your data is not case sensitive and some letters are treated the same) Of course again this can be awful depending on the data so if each key has 60 2 letter combos you would have a table with 60*30,000,000 rows ouch!( it might still work depending on your hardware) But I think in practice that would not be the case you will have to run counts to see if it is manageable. create table cqr( fk_id int, char_comb char(2) ) ; create unique index cqr_1x( char_comb, fk_id ) ; So a query of *John Smith* would look for: would look like: "Jo" "hn" " S" "mi" "th" ( you can add the other two letter combinations but this should narrow your searh just fine ) select * from cqr where char_comb = "Jo" and char_comb = "hn and char_comb = " S" and char_comb = "mi" and char_comb = "th" and fk_id = id and key matches( "*John Smith*" ) again this is crazy under a lot of circumstances but can be managed with some limits to reduce your search. I know that these ideas are hacks but you are trying to speed up something that is hard to speed up. |
| ||||
| Paul Watson <paul@oninit.com> wrote in message news:<40110552.6EB20A52@oninit.com>... > I've got this little 30million row table, the entry point wrt to > the app will be via a char(12) unique field. No real hassles, a little > big of fragmentation and it's all fine but ... > > The typical search is <str>* [25%] *<str> [25%] and *<str>* [50%], > and to make life more awkward it has to treat certain characters > as identical so, for example, 0 [zero] = O [letter] and vice versa. > But at least it's not case sensitive. > > It has a major update applied once per week (about 10%) but apart from > that it is static. > > All smart arse ideas accepted:-)) My earlier query in the post was wrong for the 2 letter selects. select t1.fk_id from real_table rt, cqr t1, cqr t2, cqr t3, cqr t4, cqr t6 where t1.fk_id = t2.fk_id and t2.fk_id = t3.fk_id and t3.fk_id = t4.fk_id and t5.fk_id = t6.fk_id and t1.keypart = 'Jo' and t2.keypart = 'hn' and t3.keypart = ' S' and t4.keypart = 'mi' and t5.keypart = 'th' and rt.key matches('*John Smith*') and rt.id = t1.fk_id and rt.id = t2.fk_id and -- Not strictly needed rt.id = t3.fk_id and -- " " " rt.id = t4.fk_id and -- " " " rt.id = t5.fk_id and -- " " " rt.id = t6.fk_id -- " " " ; |