This is a discussion on 10g - Need advise on large lookup table and optimizing io within the Oracle Database forums, part of the Database Server Software category; --> Folks, I inherited an application that has 1 very large lookup table. This lookup table has a 90/10 read/write ...
| |||||||
| FAQ | Members List | Calendar | Search | Today's Posts | Mark Forums Read |
| ||||
| Folks, I inherited an application that has 1 very large lookup table. This lookup table has a 90/10 read/write ratio. Most of the time it's read heavily, but during uploads records that don't currently exist in that table get inserted. This table looks like this. USER_ID| EMAIL | CLIENT_ID | DUPLICATE_ID ------------------------------------------------------------------------ 1000000 'email1' 12312 NULL 1000001 'email2' 13434 1 Anyway this table has 400 million records an is not partitioned. He way 2 main types of queries against this table. SELECT email from THISTABLE where DUPLICATE_ID IS NULL and user_id = ? The other type of query executed is a SELECT where this table is simply joined on USER_PROF_ID to pull the email address. SELECT A.EMAIL, B.COL1,B.COL2 FROM DEMOGRAPHICS B, THISTABLE A WHERE B.USER_PROF_ID = A.USER_PROF_ID We now have indexes on the "THISTABLE" so that the actual data is never read. Only the indexes in place are used to satisfy all queries. The above SELECT query like many typical queries on our system pulls lots of data (500K records to 1 million records). We have one specific index on the columns USER_ID|EMAIL|DUPLICATE_ID|CLIENT_ID that we use to satisfy all the large select queries...these queries do an INDEX RANGE SCAN on the index columns I just described ..and so never need to access the table rows. Because the table/indexes have nearly 400 million records in them, and this index is read my all our "larger" select queries, this index is contented for and the IO waits are high. I know that this data model is not great, but i inherited it and can't really change too mch now. But I'm wondering what I can do to speed up the lookups during the SELECT queries against that very large index ( Index now has 3 levels). I've been doing some testing with a single table hash cluster, but I'm not sure if the benefits of a hash cluster outperform the benefits (in terms of IO) of looking up the rows in the large index and not having to read the table data. If anyone out there has any input, I'd greatly appreciate it. --peter |
| |||
| peter wrote: > Folks, > I inherited an application that has 1 very large lookup table. This > lookup table has > a 90/10 read/write ratio. Most of the time it's read heavily, but > during uploads > records that don't currently exist in that table get inserted. > This table looks like this. > > USER_ID| EMAIL | CLIENT_ID | DUPLICATE_ID > > ------------------------------------------------------------------------ > 1000000 'email1' 12312 NULL > 1000001 'email2' 13434 1 > > Anyway this table has 400 million records an is not partitioned. He > way 2 main types of > queries against this table. > > SELECT email from THISTABLE where DUPLICATE_ID IS NULL and user_id = > ? > > The other type of query executed is a SELECT where this table is > simply joined on > USER_PROF_ID to pull the email address. > > SELECT A.EMAIL, B.COL1,B.COL2 > FROM DEMOGRAPHICS B, THISTABLE A > WHERE B.USER_PROF_ID = A.USER_PROF_ID > > We now have indexes on the "THISTABLE" so that the actual data is never > read. Only the indexes in place are used to satisfy all queries. The > above SELECT query like many typical queries on our system pulls lots > of data (500K records to 1 million records). > > We have one specific index on the columns > USER_ID|EMAIL|DUPLICATE_ID|CLIENT_ID > that we use to satisfy all the large select queries...these queries do > an INDEX RANGE SCAN on the index columns I just described ..and so > never need to access the table rows. > > Because the table/indexes have nearly 400 million records in them, and > this index is read my all our "larger" select queries, this index is > contented for and the IO waits are high. > > I know that this data model is not great, but i inherited it and can't > really change too mch now. But I'm wondering what I can do to speed up > the lookups during the SELECT queries against that very large index ( > Index now has 3 levels). > > I've been doing some testing with a single table hash cluster, but I'm > not sure if the benefits of a hash cluster outperform the benefits (in > terms of IO) of looking up the rows in the large index and not having > to read the table data. > > If anyone out there has any input, I'd greatly appreciate it. > > --peter A couple of things I'd like to know about the tables ... how many different user_ids and user_prof_ids in those 400M records. Is the data skewed? And I'd be inclined to take a very harsh look at "DUPLICATE_ID IS NULL" and this is forcing a full table scan. Starting with DUPLICATE_ID ... can you use a default value in place NULL? If not consider a function based index and query for the value created by the function used to build the index. With respect to the USER_ID = ... my first instinct would be to convert the table to either a sorted hash cluster (10g) or Index-Organized Table. With respect ot USER_PROF_ID ... I'd need to know more about the data. All of the above is just pure instinct and I wouldn't do any of it without tracing and testing extensively. HTH -- Daniel A. Morgan http://www.psoug.org damorgan@x.washington.edu (replace x with u to respond) |
| |||
| duplicate_id is null is ok in this case, as it is used in composite index. you used user_prof_id and user_id in different two querys, are they same column? if you are doing index only query, rebuiild the index does help if there is also dml on this table. to help improve sga efficiency. index level =3 should be fine. can you list a full structure of the two tables, and the SQLs? |
| |||
| Can you put the index in the keep pool and add to your SGA (DB_KEEP_CACHE_SIZE) the number of blocks v$bh shows it has been using? Something under a gig, I'd guess from your post, and if it is that heavily used, you might be able to shrink your regular buffers some. Helps to post version and platform and appropriate configuration parameters. jg -- @home.com is bogus. http://catless.ncl.ac.uk/Risks/24.14.html#subj11 |
| |||
| Here's a quick update... This 400MM record table is mostly used in an equality join to pull the EMAIL address of a user. So every customer of ours has a DATA table with a USER_ID, and this lookup to pull the EMAIL is typically an equality join in a 3 table join... ie DRIVING_TABLE a, DATA_TABLE b, LARGE_LOOKUP_TABLE c WHERE a.userid = b.userid and a.userid = c.userid Every query in the system (large queries which pull millions of records) pull the EMAIL from that large lookup table.. and typically the wait time on those datafiles avg about 30-150 ms. I was able to convince my management to move the EMAIL address into each customers DATA_TABLE so that everyone is no longer contenting for that LARGE_LOOKUP_TABLE and the indexes used to satisfy queries... I think this simple change will dramatically decrease contention, plus it breaks that dependency on a large lookup table to individual smaller tables. This gives us some flexibility as we grow (either multiple customer databases or RAC). -- thanks to everyone for the great feedback as always. |
| |||
| "peter" <p_msantos@yahoo.com> wrote in message news:1136566558.089817.287670@g49g2000cwa.googlegr oups.com... > Folks, > I inherited an application that has 1 very large lookup table. This > lookup table has > a 90/10 read/write ratio. Most of the time it's read heavily, but > during uploads > records that don't currently exist in that table get inserted. > This table looks like this. > > USER_ID| EMAIL | CLIENT_ID | DUPLICATE_ID > > ------------------------------------------------------------------------ > 1000000 'email1' 12312 NULL > 1000001 'email2' 13434 1 > > Anyway this table has 400 million records an is not partitioned. He > way 2 main types of > queries against this table. > > SELECT email from THISTABLE where DUPLICATE_ID IS NULL and user_id = > ? > > The other type of query executed is a SELECT where this table is > simply joined on > USER_PROF_ID to pull the email address. > > SELECT A.EMAIL, B.COL1,B.COL2 > FROM DEMOGRAPHICS B, THISTABLE A > WHERE B.USER_PROF_ID = A.USER_PROF_ID > > We now have indexes on the "THISTABLE" so that the actual data is never > read. Only the indexes in place are used to satisfy all queries. The > above SELECT query like many typical queries on our system pulls lots > of data (500K records to 1 million records). > > We have one specific index on the columns > USER_ID|EMAIL|DUPLICATE_ID|CLIENT_ID > that we use to satisfy all the large select queries...these queries do > an INDEX RANGE SCAN on the index columns I just described ..and so > never need to access the table rows. > > Because the table/indexes have nearly 400 million records in them, and > this index is read my all our "larger" select queries, this index is > contented for and the IO waits are high. > > I know that this data model is not great, but i inherited it and can't > really change too mch now. But I'm wondering what I can do to speed up > the lookups during the SELECT queries against that very large index ( > Index now has 3 levels). > > I've been doing some testing with a single table hash cluster, but I'm > not sure if the benefits of a hash cluster outperform the benefits (in > terms of IO) of looking up the rows in the large index and not having > to read the table data. > > If anyone out there has any input, I'd greatly appreciate it. > > --peter > You have shown a query on user_prof_id, which is a column that doesn't appear to exist, so I assume it is really user_id. Assuming you have a primary key (user_id, email_id seems to be the probably key) this looks like a good candidate for a hash partitioned IOT. Gets rid of the table, automatically satisfies the indexed access requirement, and takes the heat off the root block. I am, of course, guessing about the intent and statistics. -- Regards Jonathan Lewis http://www.jlcomp.demon.co.uk/faq/ind_faq.html The Co-operative Oracle Users' FAQ http://www.jlcomp.demon.co.uk/cbo_book/ind_book.html Cost Based Oracle: Fundamentals http://www.jlcomp.demon.co.uk/appearances.html Public Appearances - schedule updated 10th Jan 2006 |
| ||||
| Thanks for the feedback Jonathan. I'm currently testing this right now, and it seems like a good idea. However I'm testing another solution also which I think might be even better. This large lookup table is joined into a 3 table total query that pulls lots of data out of our database. The index is currently there to assist this so that the table does not have to read, but because our queries select an avg of 50-70 million records in a 24hr time span, this index is very contened for... The average waits on these index files avg from 30ms - 150ms. If I move the column EMAIL into a customer specific table..then this large table is not needed and now every customer query is only 2 table without the need to join on this large table... It performs better. I've even tested setting a parallel degree of 2 for each table in the 2 table SELECT query and the performance and IO reductions from the 3 table query is enourmous... We have CPU queues that are generally at 7-8 while the machine has 12 CPUs... But I do have a few concerns about the memory requirements,PGA and how memory is allocated since these queries come from shared/MTS connections while other things in our system are a mix of dedicate/shared connections.. everyone's feedback is always appreciated.. BTW, read your "Snark Research" posting on your website. Really good stuff, but some pieces of memory allocation,shared/MTS connections and (in my case parallelism) are still unclear. I posted a thread about parallelism concerns...if you have some time to quickly browse it I'd appreciate it (posted Jan 25 3:06 pm) http://groups.google.com/group/comp....08a7fb9b977212 thanks again. -peter |