vBulletin Search Engine Optimization
| |||||||
| Register | FAQ | Members List | Calendar | Search | Today's Posts | Mark Forums Read |
| ||||
| Hi All, I'm working on a project that is using a MySQL database. Every hour some bulk data is processed and then inserted into a database. The database looks like this: CREATE TABLE mytable { data1 char(255), data2 char(255), int1 tinyint unsigned, int2 int unsigned, int3 int unsigned, PRIMARY KEY(data1, data2, int1) ) TYPE=MyISAM; Now, typically I'm inserting all this data into the database nicely. The problem happens when I already have an entry in the database (with the same primary key). In that case I have to assign int2 to the one from bulk data and add int3 with the one from bulk data. Thanks to ON DUPLICATE KEY, this is straightforward: INSERT INTO mytable (data1, data2, int1, int2, int3) VALUES (newdata1, newdata2, newint1, newint2, newint3) ON DUPLICATE KEY UPDATE int2 = newint2, int3 = int3 + newint3; The problem here is that when the table gets pretty big, the performance goes down - the table currently has about 9 million rows and my typical bulk imports have around 60.000 rows - this takes about 10 minutes to import. What would be the best way to optimize this? I tried creating a temporary table in memory and then using select from that table to update the real one, but I didn't get anything improved speed wise. What would experts here recommend? Thanks a lot. |
| |||
| user wrote: > Hi All, > > I'm working on a project that is using a MySQL database. Every hour some > bulk data is processed and then inserted into a database. > > The database looks like this: > > CREATE TABLE mytable { > data1 char(255), > data2 char(255), > int1 tinyint unsigned, > int2 int unsigned, > int3 int unsigned, > PRIMARY KEY(data1, data2, int1) > ) TYPE=MyISAM; > > > Now, typically I'm inserting all this data into the database nicely. The > problem happens when I already have an entry in the database (with the > same primary key). In that case I have to assign int2 to the one from > bulk data and add int3 with the one from bulk data. > Thanks to ON DUPLICATE KEY, this is straightforward: > > INSERT INTO mytable (data1, data2, int1, int2, int3) VALUES (newdata1, > newdata2, newint1, newint2, newint3) ON DUPLICATE KEY UPDATE int2 = > newint2, int3 = int3 + newint3; > > > The problem here is that when the table gets pretty big, the performance > goes down - the table currently has about 9 million rows and my typical > bulk imports have around 60.000 rows - this takes about 10 minutes to > import. > > > What would be the best way to optimize this? I tried creating a > temporary table in memory and then using select from that table to > update the real one, but I didn't get anything improved speed wise. > > > What would experts here recommend? > > > Thanks a lot. > Sounds like a bad design to have a primary key on a bulk load-type table - especially if you *expect* more than one record for that primary key. Drop the primary key and just add a "duplicates allowed" index on (data1, data2, int1). Now you will be able to process this data and aggregate the "duplicates" during post-processing. If you need "order" add a timestamp field with default value of the current system time. You could now potentially drop columns int2 and int3. "been there, done that!" -- Michael Austin. Database Consultant |
| ||||
| On Mon, 1 Jan 2007 00:01:18 +0000 (UTC), user wrote: > Hi All, > > I'm working on a project that is using a MySQL database. Every hour some > bulk data is processed and then inserted into a database. > > The database looks like this: > > CREATE TABLE mytable { > data1 char(255), > data2 char(255), > int1 tinyint unsigned, > int2 int unsigned, > int3 int unsigned, > PRIMARY KEY(data1, data2, int1) > ) TYPE=MyISAM; > > > Now, typically I'm inserting all this data into the database nicely. The > problem happens when I already have an entry in the database (with the > same primary key). In that case I have to assign int2 to the one from > bulk data and add int3 with the one from bulk data. > Thanks to ON DUPLICATE KEY, this is straightforward: > > INSERT INTO mytable (data1, data2, int1, int2, int3) VALUES (newdata1, > newdata2, newint1, newint2, newint3) ON DUPLICATE KEY UPDATE int2 = > newint2, int3 = int3 + newint3; Compound primary keys may be a good idea somewhere, but I haven't been to that place yet. (This is a mild form of "This is never a good idea.") My usual primary key is a nice integer field of an appropriate width, that has absolutely no meaning other than being the primary key for that particular table. All the intertable relationships are maintained with keys to different fields than the primary. This means a little more manual work, but its work that's done more or less only once. Data gets backed up with the primary key, but never exported with it. This means your loading process will quickly and happily assign new primary keys to records, and your secondary keys (or "relationship keys") point you at the potential duplicate records you may need to reconcile. > The problem here is that when the table gets pretty big, the performance > goes down - the table currently has about 9 million rows and my typical > bulk imports have around 60.000 rows - this takes about 10 minutes to > import. I wouldn't classify that as horrible -- You are, after all, doing one of the most complicated and least optimized way of getting data in. > What would be the best way to optimize this? I tried creating a > temporary table in memory and then using select from that table to > update the real one, but I didn't get anything improved speed wise. Separate out the primary key from your actually-significant data. Use "LOAD DATA INFILE" instead. Your 10-minute load will drop to a couple of operations that take mere seconds. -- 186,000 Miles per Second. It's not just a good idea. IT'S THE LAW. |