This is a discussion on Indexing normalized db within the MySQL forums, part of the Database Server Software category; --> Hi all! I have a main table with most columns pointing to other secondary tables (for normalization). I indexed ...
| |||||||
| FAQ | Members List | Calendar | Search | Today's Posts | Mark Forums Read |
| ||||
| Hi all! I have a main table with most columns pointing to other secondary tables (for normalization). I indexed some columns on main table utilizing type BTREE. On secondary tables there is no indexes (only mandatory primary keys), since I only need to find records by PK on it. I`m getting select response times too slow. Is there some smarter way for optimize it? schema: http://www.tinecon.com.br/sqlog.sql.html details: I`m building a db for squid logs (www.squid-cache.org). Basically I have to save each line of log (with 13 fields: time, size, user, src ip, dst domain, file type, etc...). For optimization, I`ve normalized 10 columns so the main table stores only ID`s of other tables. I don`t know if the indexes that I created is the best way for optimize the selects (regarding the INSERTs, there is no priority for performance, as it is done with scheduled at night). Thank you |
| |||
| On Sat, 16 Feb 2008 09:27:37 -0800 (PST), Tom wrote: > Hi all! > > I have a main table with most columns pointing to other secondary > tables (for normalization). I indexed some columns on main table > utilizing type BTREE. On secondary tables there is no indexes (only > mandatory primary keys), since I only need to find records by PK on > it. This is backwards from how most normalization happens, IMHO. Normally, you'll see child tables referring to parent tables by the parent's ID, not IDs of child records in the parent. > I`m getting select response times too slow. > Is there some smarter way for optimize it? > > schema: http://www.tinecon.com.br/sqlog.sql.html > > details: > I`m building a db for squid logs (www.squid-cache.org). Basically I > have to save each line of log (with 13 fields: time, size, user, src > ip, dst domain, file type, etc...). For optimization, I`ve normalized > 10 columns so the main table stores only ID`s of other tables. > > I don`t know if the indexes that I created is the best way for > optimize the selects (regarding the INSERTs, there is no priority for > performance, as it is done with scheduled at night). Sometimes there's good reason NOT to normalize. I think showing us a mysqldump --no-data of this database may be helpful. Also using EXPLAIN in some of the slow queries and posting the results would be useful. -- 2. My ventilation ducts will be too small to crawl through. --Peter Anspach's list of things to do as an Evil Overlord |
| |||
| Peter H. Coffin wrote: > On Sat, 16 Feb 2008 09:27:37 -0800 (PST), Tom wrote: >> Hi all! >> >> I have a main table with most columns pointing to other secondary >> tables (for normalization). I indexed some columns on main table >> utilizing type BTREE. On secondary tables there is no indexes (only >> mandatory primary keys), since I only need to find records by PK on >> it. > > This is backwards from how most normalization happens, IMHO. Normally, > you'll see child tables referring to parent tables by the parent's ID, > not IDs of child records in the parent. This all depends Peter, as I would view this as a "star-schema" design used in a lot of DW applications. One fact table with many other tables describing data in this fact table. http://en.wikipedia.org/wiki/Star_schema (aka snowflake) > >> I`m getting select response times too slow. >> Is there some smarter way for optimize it? >> >> schema: http://www.tinecon.com.br/sqlog.sql.html >> >> details: >> I`m building a db for squid logs (www.squid-cache.org). Basically I >> have to save each line of log (with 13 fields: time, size, user, src >> ip, dst domain, file type, etc...). For optimization, I`ve normalized >> 10 columns so the main table stores only ID`s of other tables. >> >> I don`t know if the indexes that I created is the best way for >> optimize the selects (regarding the INSERTs, there is no priority for >> performance, as it is done with scheduled at night). > > Sometimes there's good reason NOT to normalize. I think showing us a > mysqldump --no-data of this database may be helpful. Also using EXPLAIN > in some of the slow queries and posting the results would be useful. > Sometimes you will need to index your primary table to have indexes on those "foreign keys" - again, dependent upon your data and how you are trying to "join" the data together. Seeing the explain plans would help in determining what needs to be indexed and which table it needs to be indexed on... |
| |||
| Hi Peter, Michael! Thank you for the help. I populated the db with about 3.500.000 records (two weeks of log) on main table (access). It is using about 285 MB (170 MB for indexes), with MYISAM storage. Now it is acceptable, but I pretend to save maybe one year or more, if it is possible. Maybe if I prepare some views of the more probably queries, but sometimes a query will have no precedent. This SELECT takes about 5 seconds to return: SELECT user.value, count( DISTINCT domain.value ) as sites, sum(size) as downsize FROM access LEFT JOIN user ON access.user_id = user.id LEFT JOIN domain ON access.domain_id = domain.id WHERE day(time)=25 GROUP BY user.value ORDER BY sites explain: | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra +----+-------------+--------+--------+---------------+--------- +----------+---------------------------+--------- +-------------------------------------------------+ | 1 | SIMPLE | access | ALL | NULL | NULL | NULL | NULL | 3404948 | Using where, temporary, filesort | | 1 | SIMPLE | user | eq_ref | PRIMARY | PRIMARY | 2 | squidlog.access.user_id | 1 | | | 1 | SIMPLE | domain | eq_ref | PRIMARY| PRIMARY | 3 | squidlog.access.domain_id | 1 | Another SELECT, this takes about 10 sec (only reads main table: access): SELECT DISTINCT date(time) as day, count(DISTINCT access.user_id) as users,sum(size)/1024/1024 as downsize FROM access GROUP BY day ORDER BY downsize DESC; explain: | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | | 1 | SIMPLE | access | ALL | NULL | NULL | NULL | NULL | 3404948 | Using temporary; Using filesort | Here is the schema. Also It is available on http://www.tinecon.com.br/sqlog.sql.html CREATE TABLE `access` ( `time` timestamp NOT NULL default '0000-00-00 00:00:00', `size` int(10) unsigned NOT NULL, `ip_id` smallint(6) unsigned NOT NULL, `result_id` tinyint(4) unsigned NOT NULL, `duration` mediumint(8) unsigned NOT NULL, `method_id` tinyint(4) unsigned NOT NULL, `proto_id` tinyint(4) unsigned NOT NULL, `domain_id` mediumint(9) unsigned NOT NULL, `path_id` int(10) unsigned NOT NULL, `user_id` smallint(6) unsigned NOT NULL, `hier_code_id` tinyint(4) unsigned NOT NULL, `hier_ip_id` mediumint(9) unsigned NOT NULL, `mime_id` tinyint(4) unsigned NOT NULL, KEY `user` USING BTREE (`user_id`), KEY `path` USING BTREE (`path_id`), KEY `domain` USING BTREE (`domain_id`), KEY `ip` USING BTREE (`ip_id`), KEY `time` USING BTREE (`time`) ) ENGINE=MyISAM DEFAULT CHARSET=latin1; CREATE TABLE `ip` ( `id` smallint(6) unsigned NOT NULL auto_increment, `value` varchar(15) NOT NULL, PRIMARY KEY (`id`) ) ENGINE=MyISAM AUTO_INCREMENT=275 DEFAULT CHARSET=latin1; CREATE TABLE `result` ( `id` tinyint(4) unsigned NOT NULL auto_increment, `value` varchar(30) NOT NULL, PRIMARY KEY (`id`) ) ENGINE=MyISAM AUTO_INCREMENT=67 DEFAULT CHARSET=latin1; CREATE TABLE `method` ( `id` tinyint(4) unsigned NOT NULL auto_increment, `value` varchar(15) NOT NULL, PRIMARY KEY (`id`) ) ENGINE=MyISAM AUTO_INCREMENT=16 DEFAULT CHARSET=latin1; CREATE TABLE `proto` ( `id` tinyint(4) unsigned NOT NULL auto_increment, `value` varchar(10) NOT NULL, PRIMARY KEY (`id`) ) ENGINE=MyISAM AUTO_INCREMENT=3 DEFAULT CHARSET=latin1; CREATE TABLE `domain` ( `id` mediumint(9) unsigned NOT NULL auto_increment, `value` varchar(100) NOT NULL, PRIMARY KEY (`id`) ) ENGINE=MyISAM AUTO_INCREMENT=14991 DEFAULT CHARSET=latin1; CREATE TABLE `path` ( `id` int(10) unsigned NOT NULL auto_increment, `value` varchar(200) NOT NULL, PRIMARY KEY (`id`) ) ENGINE=MyISAM AUTO_INCREMENT=287047 DEFAULT CHARSET=latin1; CREATE TABLE `user` ( `id` smallint(6) unsigned NOT NULL auto_increment, `value` varchar(50) NOT NULL, PRIMARY KEY (`id`) ) ENGINE=MyISAM AUTO_INCREMENT=323 DEFAULT CHARSET=latin1; CREATE TABLE `hier_code` ( `id` tinyint(4) unsigned NOT NULL auto_increment, `value` varchar(30) NOT NULL, PRIMARY KEY (`id`) ) ENGINE=MyISAM AUTO_INCREMENT=3 DEFAULT CHARSET=latin1; CREATE TABLE `hier_ip` ( `id` mediumint(9) unsigned NOT NULL auto_increment, `value` varchar(50) NOT NULL, PRIMARY KEY (`id`) ) ENGINE=MyISAM AUTO_INCREMENT=8556 DEFAULT CHARSET=latin1; CREATE TABLE `mime` ( `id` tinyint(4) unsigned NOT NULL auto_increment, `value` varchar(40) NOT NULL, PRIMARY KEY (`id`) ) ENGINE=MyISAM AUTO_INCREMENT=141 DEFAULT CHARSET=latin1 ROW_FORMAT=DYNAMIC; |
| |||
| Since the explains output that I write on last email are showing wrong because google formatation, now it is available on http://www.tinecon.com.br/sqlog.sql.html |
| |||
| Tom wrote: > Since the explains output that I write on last email are showing wrong > because google formatation, now it is available on http://www.tinecon.com.br/sqlog.sql.html Link is verified as non-threatening - I have a way of reading these things without endangering my Windoze box...) Looking at your "description tables", they are essentially worthless. the data is duplicated from your 'access' table and most add a useless "id" column that is auto-increment. Save the processing power and storage and just drop these. Now, if you want to insert data into these tables first and use them as foreign-key fields where this data must exist before you can insert into the primary table and use the [description table].id as the data in the 'access' table, that may be acceptable. But my first overview shows they are not providing ANY useful purpose. you do realize that only using "WHERE day(time)=25" you are returning all rows where the day = the 25th of ALL months because you want "to save maybe one year or more, if it is possible." What this means is that if you have data from Jan 25, Feb 25, March 25, etc...) it will get counted in this query. The way you are structuring your query on the main 'access' table, you are not giving it any useful columns to work with, therefore these queries will ALWAYS do a full table scan which translates to LONG query times. Add a column to 'access' called transday (transaction day) and insert it's value if you are going to be needing this type of query on a continuous basis. (you can do it programatically or using a trigger. Then the query will use the index and not need to evaluate all 3.4M rows. When creating databases stay away from table and column names that are considered "reserved words', like date, time, etc... |
| |||
| On Sun, 17 Feb 2008 16:20:52 -0800 (PST), Tom wrote: > > Hi Peter, Michael! > Thank you for the help. > > I populated the db with about 3.500.000 records (two weeks of log) on > main table (access). It is using about 285 MB (170 MB for indexes), > with MYISAM storage. Now it is acceptable, but I pretend to save maybe > one year or more, if it is possible. > > Maybe if I prepare some views of the more probably queries, but > sometimes a query will have no precedent. > > This SELECT takes about 5 seconds to return: > > SELECT user.value, count( DISTINCT domain.value ) as sites, sum(size) > as downsize > FROM access > LEFT JOIN user ON access.user_id = user.id > LEFT JOIN domain ON access.domain_id = domain.id > WHERE day(time)=25 > GROUP BY user.value > ORDER BY sites > > explain: >| id | select_type | table | type | possible_keys | key >| key_len | ref | rows | Extra > +----+-------------+--------+--------+---------------+--------- > +----------+---------------------------+--------- > +-------------------------------------------------+ >| 1 | SIMPLE | access | ALL | NULL | NULL | > NULL | NULL | 3404948 | Using where, temporary, > filesort | >| 1 | SIMPLE | user | eq_ref | PRIMARY | PRIMARY | 2 >| squidlog.access.user_id | 1 >| | >| 1 | SIMPLE | domain | eq_ref | PRIMARY| PRIMARY | 3 | > squidlog.access.domain_id | 1 >| > > > Another SELECT, this takes about 10 sec (only reads main table: > access): > SELECT DISTINCT date(time) as day, count(DISTINCT access.user_id) as > users,sum(size)/1024/1024 as downsize > FROM access > GROUP BY day > ORDER BY downsize DESC; > > explain: > >| id | select_type | table | type | possible_keys | key | > key_len | ref | rows | Extra | >| 1 | SIMPLE | access | ALL | NULL | NULL | > NULL | NULL | 3404948 | Using temporary; Using filesort | Okay! Good, solid info. While the day() function is fairly cheap, it's still not entirely free, and it looks like something you're using fairly often, so some careful denormalization is probably not a bad tradoff. You've got indexes over the pretty much everything else you're selecting with, but that filesort is essentially reading all of `access` at least once, doing day() on every record, building an index on the fly. If you DO use the day() function in many quesries, it may be worth adding that value into your `access` table itself, throwing on an index over that, and populating it whenever you're about to start running queries. -- I don't see what C++ has to do with keeping people from shooting themselves in the foot. C++ will happily load the gun, offer you a drink to steady your nerves, and help you aim. -- Peter da Silva |
| |||
| Hi Michael! On Feb 17, 10:06 pm, Michael Austin <maus...@firstdbasource.com> wrote: > Looking at your "description tables", they are essentially worthless. > the data is duplicated from your 'access' table and most add a useless > "id" column that is auto-increment. Save the processing power and > storage and just drop these. Please, explain me. As I can understand (I a have only basic knowledge in DBMS), I made a star schema as you mentioned. Removing almost meaning data from from main table (access), leaving on it only FKs, that points to the secondary tables (domain, user, path, mime, etc...) PKs, which, in turn, has the real data. > Now, if you want to insert data into these tables first and use them as > foreign-key fields where this data must exist before you can insert into > the primary table and use the [description table].id as the data in the > 'access' table, that may be acceptable. But my first overview shows > they are not providing ANY useful purpose. In really, it is what I`m doing. Before to start loop in logfile (from where comes data to DB), I build a list of each secondary table itens plus its ID. So, while lopping on each line of the logfile, I check if the item already exist, if yes, I insert its ID on access record, if no, first I insert the item on secondary table, then get last_inserted_id and finally insert its id to access table. A little cumbersome, but works well, since the inserts take work at night, via scheduled jobs. As a example, 500MB of logfile (15 days of logging) takes 20 min to be inserted. For who likes the details =) Here is the populdb.pl: http://www.tinecon.com.br/sqlog.sql.html > you do realize that only using "WHERE day(time)=25" you are returning > all rows where the day = the 25th of ALL months because you want "to > save maybe one year or more, if it is possible." What this means is > that if you have data from Jan 25, Feb 25, March 25, etc...) it will get > counted in this query. yes, in really I`m only using this SELECT during the tests, since there is only 15 days of log, serially. > The way you are structuring your query on the main 'access' table, you > are not giving it any useful columns to work with, therefore these > queries will ALWAYS do a full table scan which translates to LONG query > times. > > Add a column to 'access' called transday (transaction day) and insert > it's value if you are going to be needing this type of query on a > continuous basis. (you can do it programatically or using a trigger. > Then the query will use the index and not need to evaluate all 3.4M rows. > > When creating databases stay away from table and column names that are > considered "reserved words', like date, time, etc... good, I will do it. Thank you very much for the help. |
| |||
| On Feb 17, 10:12 pm, "Peter H. Coffin" <hell...@ninehells.com> wrote: > Okay! Good, solid info. > > While the day() function is fairly cheap, it's still not entirely free, > and it looks like something you're using fairly often, so some careful > denormalization is probably not a bad tradoff. You've got indexes over > the pretty much everything else you're selecting with, but that filesort > is essentially reading all of `access` at least once, doing day() on > every record, building an index on the fly. If you DO use the day() > function in many quesries, it may be worth adding that value into your > `access` table itself, throwing on an index over that, and populating it > whenever you're about to start running queries. good, I will consider this denormalization. Also, I will test using time > 20080224 AND time < 20080225 What about break YYYYMMDDHHMMSS in two columns YYYMMDD and HHMMSS? |
| ||||
| On Sun, 17 Feb 2008 19:08:28 -0800 (PST), Tom wrote: > On Feb 17, 10:12 pm, "Peter H. Coffin" <hell...@ninehells.com> wrote: >> Okay! Good, solid info. >> >> While the day() function is fairly cheap, it's still not entirely free, >> and it looks like something you're using fairly often, so some careful >> denormalization is probably not a bad tradoff. You've got indexes over >> the pretty much everything else you're selecting with, but that filesort >> is essentially reading all of `access` at least once, doing day() on >> every record, building an index on the fly. If you DO use the day() >> function in many quesries, it may be worth adding that value into your >> `access` table itself, throwing on an index over that, and populating it >> whenever you're about to start running queries. > > good, I will consider this denormalization. > Also, I will test using time > 20080224 AND time < 20080225 > > What about break YYYYMMDDHHMMSS in two columns > YYYMMDD and HHMMSS? It may be worthwhile. That's something you can try, then test, and see who much improvement it brings. One of the nice things about working with fairly static data (like log data or archived transaction, etc), is that you do have the luxury to work on the process and do preparation work on the data to make it process quickly on subsequent queries. This is not the case with more immediate reporting, such as "how many widgets do we have scheduled to ship overseas RIGHT NOW". -- The Web brings people together because no matter what kind of a twisted sexual mutant you happen to be, you've got millions of pals out there. Type in 'Find people that have sex with goats that are on fire' and the computer will ask, 'Specify type of goat.' -- Rich Jeni |