Unix Technical Forum

Indexing normalized db

This is a discussion on Indexing normalized db within the MySQL forums, part of the Database Server Software category; --> Hi all! I have a main table with most columns pointing to other secondary tables (for normalization). I indexed ...


Go Back   Unix Technical Forum > Database Server Software > MySQL

FAQ Members List Calendar Search Today's Posts Mark Forums Read
  #1 (permalink)  
Old 02-28-2008, 11:33 AM
Tom
 
Posts: n/a
Default Indexing normalized db

Hi all!

I have a main table with most columns pointing to other secondary
tables (for normalization). I indexed some columns on main table
utilizing type BTREE. On secondary tables there is no indexes (only
mandatory primary keys), since I only need to find records by PK on
it.

I`m getting select response times too slow.
Is there some smarter way for optimize it?

schema: http://www.tinecon.com.br/sqlog.sql.html

details:
I`m building a db for squid logs (www.squid-cache.org). Basically I
have to save each line of log (with 13 fields: time, size, user, src
ip, dst domain, file type, etc...). For optimization, I`ve normalized
10 columns so the main table stores only ID`s of other tables.

I don`t know if the indexes that I created is the best way for
optimize the selects (regarding the INSERTs, there is no priority for
performance, as it is done with scheduled at night).


Thank you
Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #2 (permalink)  
Old 02-28-2008, 11:33 AM
Peter H. Coffin
 
Posts: n/a
Default Re: Indexing normalized db

On Sat, 16 Feb 2008 09:27:37 -0800 (PST), Tom wrote:
> Hi all!
>
> I have a main table with most columns pointing to other secondary
> tables (for normalization). I indexed some columns on main table
> utilizing type BTREE. On secondary tables there is no indexes (only
> mandatory primary keys), since I only need to find records by PK on
> it.


This is backwards from how most normalization happens, IMHO. Normally,
you'll see child tables referring to parent tables by the parent's ID,
not IDs of child records in the parent.

> I`m getting select response times too slow.
> Is there some smarter way for optimize it?
>
> schema: http://www.tinecon.com.br/sqlog.sql.html
>
> details:
> I`m building a db for squid logs (www.squid-cache.org). Basically I
> have to save each line of log (with 13 fields: time, size, user, src
> ip, dst domain, file type, etc...). For optimization, I`ve normalized
> 10 columns so the main table stores only ID`s of other tables.
>
> I don`t know if the indexes that I created is the best way for
> optimize the selects (regarding the INSERTs, there is no priority for
> performance, as it is done with scheduled at night).


Sometimes there's good reason NOT to normalize. I think showing us a
mysqldump --no-data of this database may be helpful. Also using EXPLAIN
in some of the slow queries and posting the results would be useful.

--
2. My ventilation ducts will be too small to crawl through.
--Peter Anspach's list of things to do as an Evil Overlord
Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #3 (permalink)  
Old 02-28-2008, 11:33 AM
Michael Austin
 
Posts: n/a
Default Re: Indexing normalized db

Peter H. Coffin wrote:
> On Sat, 16 Feb 2008 09:27:37 -0800 (PST), Tom wrote:
>> Hi all!
>>
>> I have a main table with most columns pointing to other secondary
>> tables (for normalization). I indexed some columns on main table
>> utilizing type BTREE. On secondary tables there is no indexes (only
>> mandatory primary keys), since I only need to find records by PK on
>> it.

>
> This is backwards from how most normalization happens, IMHO. Normally,
> you'll see child tables referring to parent tables by the parent's ID,
> not IDs of child records in the parent.


This all depends Peter, as I would view this as a "star-schema" design
used in a lot of DW applications.

One fact table with many other tables describing data in this fact table.

http://en.wikipedia.org/wiki/Star_schema (aka snowflake)


>
>> I`m getting select response times too slow.
>> Is there some smarter way for optimize it?
>>
>> schema: http://www.tinecon.com.br/sqlog.sql.html
>>
>> details:
>> I`m building a db for squid logs (www.squid-cache.org). Basically I
>> have to save each line of log (with 13 fields: time, size, user, src
>> ip, dst domain, file type, etc...). For optimization, I`ve normalized
>> 10 columns so the main table stores only ID`s of other tables.
>>
>> I don`t know if the indexes that I created is the best way for
>> optimize the selects (regarding the INSERTs, there is no priority for
>> performance, as it is done with scheduled at night).

>
> Sometimes there's good reason NOT to normalize. I think showing us a
> mysqldump --no-data of this database may be helpful. Also using EXPLAIN
> in some of the slow queries and posting the results would be useful.
>


Sometimes you will need to index your primary table to have indexes on
those "foreign keys" - again, dependent upon your data and how you are
trying to "join" the data together. Seeing the explain plans would help
in determining what needs to be indexed and which table it needs to be
indexed on...
Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #4 (permalink)  
Old 02-28-2008, 11:33 AM
Tom
 
Posts: n/a
Default Re: Indexing normalized db


Hi Peter, Michael!
Thank you for the help.

I populated the db with about 3.500.000 records (two weeks of log) on
main table (access). It is using about 285 MB (170 MB for indexes),
with MYISAM storage. Now it is acceptable, but I pretend to save maybe
one year or more, if it is possible.

Maybe if I prepare some views of the more probably queries, but
sometimes a query will have no precedent.

This SELECT takes about 5 seconds to return:

SELECT user.value, count( DISTINCT domain.value ) as sites, sum(size)
as downsize
FROM access
LEFT JOIN user ON access.user_id = user.id
LEFT JOIN domain ON access.domain_id = domain.id
WHERE day(time)=25
GROUP BY user.value
ORDER BY sites

explain:
| id | select_type | table | type | possible_keys | key
| key_len | ref | rows | Extra
+----+-------------+--------+--------+---------------+---------
+----------+---------------------------+---------
+-------------------------------------------------+
| 1 | SIMPLE | access | ALL | NULL | NULL |
NULL | NULL | 3404948 | Using where, temporary,
filesort |
| 1 | SIMPLE | user | eq_ref | PRIMARY | PRIMARY | 2
| squidlog.access.user_id | 1
| |
| 1 | SIMPLE | domain | eq_ref | PRIMARY| PRIMARY | 3 |
squidlog.access.domain_id | 1
|


Another SELECT, this takes about 10 sec (only reads main table:
access):
SELECT DISTINCT date(time) as day, count(DISTINCT access.user_id) as
users,sum(size)/1024/1024 as downsize
FROM access
GROUP BY day
ORDER BY downsize DESC;

explain:

| id | select_type | table | type | possible_keys | key |
key_len | ref | rows | Extra |
| 1 | SIMPLE | access | ALL | NULL | NULL |
NULL | NULL | 3404948 | Using temporary; Using filesort |


Here is the schema. Also It is available on http://www.tinecon.com.br/sqlog.sql.html

CREATE TABLE `access` (
`time` timestamp NOT NULL default '0000-00-00 00:00:00',
`size` int(10) unsigned NOT NULL,
`ip_id` smallint(6) unsigned NOT NULL,
`result_id` tinyint(4) unsigned NOT NULL,
`duration` mediumint(8) unsigned NOT NULL,
`method_id` tinyint(4) unsigned NOT NULL,
`proto_id` tinyint(4) unsigned NOT NULL,
`domain_id` mediumint(9) unsigned NOT NULL,
`path_id` int(10) unsigned NOT NULL,
`user_id` smallint(6) unsigned NOT NULL,
`hier_code_id` tinyint(4) unsigned NOT NULL,
`hier_ip_id` mediumint(9) unsigned NOT NULL,
`mime_id` tinyint(4) unsigned NOT NULL,
KEY `user` USING BTREE (`user_id`),
KEY `path` USING BTREE (`path_id`),
KEY `domain` USING BTREE (`domain_id`),
KEY `ip` USING BTREE (`ip_id`),
KEY `time` USING BTREE (`time`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;

CREATE TABLE `ip` (
`id` smallint(6) unsigned NOT NULL auto_increment,
`value` varchar(15) NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=MyISAM AUTO_INCREMENT=275 DEFAULT CHARSET=latin1;

CREATE TABLE `result` (
`id` tinyint(4) unsigned NOT NULL auto_increment,
`value` varchar(30) NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=MyISAM AUTO_INCREMENT=67 DEFAULT CHARSET=latin1;

CREATE TABLE `method` (
`id` tinyint(4) unsigned NOT NULL auto_increment,
`value` varchar(15) NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=MyISAM AUTO_INCREMENT=16 DEFAULT CHARSET=latin1;

CREATE TABLE `proto` (
`id` tinyint(4) unsigned NOT NULL auto_increment,
`value` varchar(10) NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=MyISAM AUTO_INCREMENT=3 DEFAULT CHARSET=latin1;

CREATE TABLE `domain` (
`id` mediumint(9) unsigned NOT NULL auto_increment,
`value` varchar(100) NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=MyISAM AUTO_INCREMENT=14991 DEFAULT CHARSET=latin1;

CREATE TABLE `path` (
`id` int(10) unsigned NOT NULL auto_increment,
`value` varchar(200) NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=MyISAM AUTO_INCREMENT=287047 DEFAULT CHARSET=latin1;

CREATE TABLE `user` (
`id` smallint(6) unsigned NOT NULL auto_increment,
`value` varchar(50) NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=MyISAM AUTO_INCREMENT=323 DEFAULT CHARSET=latin1;

CREATE TABLE `hier_code` (
`id` tinyint(4) unsigned NOT NULL auto_increment,
`value` varchar(30) NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=MyISAM AUTO_INCREMENT=3 DEFAULT CHARSET=latin1;

CREATE TABLE `hier_ip` (
`id` mediumint(9) unsigned NOT NULL auto_increment,
`value` varchar(50) NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=MyISAM AUTO_INCREMENT=8556 DEFAULT CHARSET=latin1;

CREATE TABLE `mime` (
`id` tinyint(4) unsigned NOT NULL auto_increment,
`value` varchar(40) NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=MyISAM AUTO_INCREMENT=141 DEFAULT CHARSET=latin1
ROW_FORMAT=DYNAMIC;


Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #5 (permalink)  
Old 02-28-2008, 11:33 AM
Tom
 
Posts: n/a
Default Re: Indexing normalized db

Since the explains output that I write on last email are showing wrong
because google formatation, now it is available on http://www.tinecon.com.br/sqlog.sql.html
Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #6 (permalink)  
Old 02-28-2008, 11:33 AM
Michael Austin
 
Posts: n/a
Default Re: Indexing normalized db

Tom wrote:
> Since the explains output that I write on last email are showing wrong
> because google formatation, now it is available on http://www.tinecon.com.br/sqlog.sql.html


Link is verified as non-threatening (You never know with the internet
- I have a way of reading these things without endangering my Windoze
box...)

Looking at your "description tables", they are essentially worthless.
the data is duplicated from your 'access' table and most add a useless
"id" column that is auto-increment. Save the processing power and
storage and just drop these.

Now, if you want to insert data into these tables first and use them as
foreign-key fields where this data must exist before you can insert into
the primary table and use the [description table].id as the data in the
'access' table, that may be acceptable. But my first overview shows
they are not providing ANY useful purpose.

you do realize that only using "WHERE day(time)=25" you are returning
all rows where the day = the 25th of ALL months because you want "to
save maybe one year or more, if it is possible." What this means is
that if you have data from Jan 25, Feb 25, March 25, etc...) it will get
counted in this query.

The way you are structuring your query on the main 'access' table, you
are not giving it any useful columns to work with, therefore these
queries will ALWAYS do a full table scan which translates to LONG query
times.

Add a column to 'access' called transday (transaction day) and insert
it's value if you are going to be needing this type of query on a
continuous basis. (you can do it programatically or using a trigger.
Then the query will use the index and not need to evaluate all 3.4M rows.

When creating databases stay away from table and column names that are
considered "reserved words', like date, time, etc...
Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #7 (permalink)  
Old 02-28-2008, 11:33 AM
Peter H. Coffin
 
Posts: n/a
Default Re: Indexing normalized db

On Sun, 17 Feb 2008 16:20:52 -0800 (PST), Tom wrote:
>
> Hi Peter, Michael!
> Thank you for the help.
>
> I populated the db with about 3.500.000 records (two weeks of log) on
> main table (access). It is using about 285 MB (170 MB for indexes),
> with MYISAM storage. Now it is acceptable, but I pretend to save maybe
> one year or more, if it is possible.
>
> Maybe if I prepare some views of the more probably queries, but
> sometimes a query will have no precedent.
>
> This SELECT takes about 5 seconds to return:
>
> SELECT user.value, count( DISTINCT domain.value ) as sites, sum(size)
> as downsize
> FROM access
> LEFT JOIN user ON access.user_id = user.id
> LEFT JOIN domain ON access.domain_id = domain.id
> WHERE day(time)=25
> GROUP BY user.value
> ORDER BY sites
>
> explain:
>| id | select_type | table | type | possible_keys | key
>| key_len | ref | rows | Extra
> +----+-------------+--------+--------+---------------+---------
> +----------+---------------------------+---------
> +-------------------------------------------------+
>| 1 | SIMPLE | access | ALL | NULL | NULL |
> NULL | NULL | 3404948 | Using where, temporary,
> filesort |
>| 1 | SIMPLE | user | eq_ref | PRIMARY | PRIMARY | 2
>| squidlog.access.user_id | 1
>| |
>| 1 | SIMPLE | domain | eq_ref | PRIMARY| PRIMARY | 3 |
> squidlog.access.domain_id | 1
>|
>
>
> Another SELECT, this takes about 10 sec (only reads main table:
> access):
> SELECT DISTINCT date(time) as day, count(DISTINCT access.user_id) as
> users,sum(size)/1024/1024 as downsize
> FROM access
> GROUP BY day
> ORDER BY downsize DESC;
>
> explain:
>
>| id | select_type | table | type | possible_keys | key |
> key_len | ref | rows | Extra |
>| 1 | SIMPLE | access | ALL | NULL | NULL |
> NULL | NULL | 3404948 | Using temporary; Using filesort |


Okay! Good, solid info.

While the day() function is fairly cheap, it's still not entirely free,
and it looks like something you're using fairly often, so some careful
denormalization is probably not a bad tradoff. You've got indexes over
the pretty much everything else you're selecting with, but that filesort
is essentially reading all of `access` at least once, doing day() on
every record, building an index on the fly. If you DO use the day()
function in many quesries, it may be worth adding that value into your
`access` table itself, throwing on an index over that, and populating it
whenever you're about to start running queries.

--
I don't see what C++ has to do with keeping people from shooting themselves
in the foot. C++ will happily load the gun, offer you a drink to steady
your nerves, and help you aim.
-- Peter da Silva
Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #8 (permalink)  
Old 02-28-2008, 11:33 AM
Tom
 
Posts: n/a
Default Re: Indexing normalized db


Hi Michael!

On Feb 17, 10:06 pm, Michael Austin <maus...@firstdbasource.com>
wrote:
> Looking at your "description tables", they are essentially worthless.
> the data is duplicated from your 'access' table and most add a useless
> "id" column that is auto-increment. Save the processing power and
> storage and just drop these.


Please, explain me. As I can understand (I a have only basic knowledge
in
DBMS), I made a star schema as you mentioned. Removing almost meaning
data from from main table (access), leaving on it only FKs, that
points
to the secondary tables (domain, user, path, mime, etc...) PKs, which,
in turn, has the real data.

> Now, if you want to insert data into these tables first and use them as
> foreign-key fields where this data must exist before you can insert into
> the primary table and use the [description table].id as the data in the
> 'access' table, that may be acceptable. But my first overview shows
> they are not providing ANY useful purpose.


In really, it is what I`m doing. Before to start loop in logfile (from
where
comes data to DB), I build a list of each secondary table itens plus
its ID.
So, while lopping on each line of the logfile, I check if the item
already exist,
if yes, I insert its ID on access record, if no, first I insert the
item on
secondary table, then get last_inserted_id and finally insert its id
to access
table.

A little cumbersome, but works well, since the inserts take work at
night, via
scheduled jobs. As a example, 500MB of logfile (15 days of logging)
takes
20 min to be inserted.

For who likes the details =) Here is the populdb.pl:
http://www.tinecon.com.br/sqlog.sql.html

> you do realize that only using "WHERE day(time)=25" you are returning
> all rows where the day = the 25th of ALL months because you want "to
> save maybe one year or more, if it is possible." What this means is
> that if you have data from Jan 25, Feb 25, March 25, etc...) it will get
> counted in this query.


yes, in really I`m only using this SELECT during the tests, since
there is
only 15 days of log, serially.

> The way you are structuring your query on the main 'access' table, you
> are not giving it any useful columns to work with, therefore these
> queries will ALWAYS do a full table scan which translates to LONG query
> times.
>
> Add a column to 'access' called transday (transaction day) and insert
> it's value if you are going to be needing this type of query on a
> continuous basis. (you can do it programatically or using a trigger.
> Then the query will use the index and not need to evaluate all 3.4M rows.
>
> When creating databases stay away from table and column names that are
> considered "reserved words', like date, time, etc...


good, I will do it.


Thank you very much for the help.
Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #9 (permalink)  
Old 02-28-2008, 11:33 AM
Tom
 
Posts: n/a
Default Re: Indexing normalized db

On Feb 17, 10:12 pm, "Peter H. Coffin" <hell...@ninehells.com> wrote:
> Okay! Good, solid info.
>
> While the day() function is fairly cheap, it's still not entirely free,
> and it looks like something you're using fairly often, so some careful
> denormalization is probably not a bad tradoff. You've got indexes over
> the pretty much everything else you're selecting with, but that filesort
> is essentially reading all of `access` at least once, doing day() on
> every record, building an index on the fly. If you DO use the day()
> function in many quesries, it may be worth adding that value into your
> `access` table itself, throwing on an index over that, and populating it
> whenever you're about to start running queries.


good, I will consider this denormalization.
Also, I will test using time > 20080224 AND time < 20080225

What about break YYYYMMDDHHMMSS in two columns
YYYMMDD and HHMMSS?


Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #10 (permalink)  
Old 02-28-2008, 11:33 AM
Peter H. Coffin
 
Posts: n/a
Default Re: Indexing normalized db

On Sun, 17 Feb 2008 19:08:28 -0800 (PST), Tom wrote:
> On Feb 17, 10:12 pm, "Peter H. Coffin" <hell...@ninehells.com> wrote:
>> Okay! Good, solid info.
>>
>> While the day() function is fairly cheap, it's still not entirely free,
>> and it looks like something you're using fairly often, so some careful
>> denormalization is probably not a bad tradoff. You've got indexes over
>> the pretty much everything else you're selecting with, but that filesort
>> is essentially reading all of `access` at least once, doing day() on
>> every record, building an index on the fly. If you DO use the day()
>> function in many quesries, it may be worth adding that value into your
>> `access` table itself, throwing on an index over that, and populating it
>> whenever you're about to start running queries.

>
> good, I will consider this denormalization.
> Also, I will test using time > 20080224 AND time < 20080225
>
> What about break YYYYMMDDHHMMSS in two columns
> YYYMMDD and HHMMSS?


It may be worthwhile. That's something you can try, then test, and see
who much improvement it brings. One of the nice things about working
with fairly static data (like log data or archived transaction, etc), is
that you do have the luxury to work on the process and do preparation
work on the data to make it process quickly on subsequent queries. This
is not the case with more immediate reporting, such as "how many widgets
do we have scheduled to ship overseas RIGHT NOW".

--
The Web brings people together because no matter what kind of a twisted
sexual mutant you happen to be, you've got millions of pals out there.
Type in 'Find people that have sex with goats that are on fire' and the
computer will ask, 'Specify type of goat.' -- Rich Jeni
Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
Reply


Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are On
Forum Jump


All times are GMT. The time now is 07:01 PM.


Powered by vBulletin® Version 3.6.5
Copyright ©2000 - 2008, Jelsoft Enterprises Ltd.
SEO by vBSEO 3.2.0
www.UnixAdminTalk.com