vBulletin Search Engine Optimization
| |||||||
| Register | FAQ | Members List | Calendar | Search | Today's Posts | Mark Forums Read |
| ||||
| Hi all, I have a table like this, CREATE TABLE mytable( a varchar(40), b text, c text, PRIMARY KEY (a, b) ); What I want to do is: insert a record into a table, and when the record already exists(according to the primary key), update it. I know that there is a ON DUPLICATE clause with MySQL, so I'm wondering is there a quick and clean way to do this in PostgreSQL ? I have googled and currently the only way I can find is do query first and then update or insert. Thanks a lot. Regards, ---------------------------(end of broadcast)--------------------------- TIP 5: don't forget to increase your free space map settings |
| |||
| On Dec 24, 2007, at 22:03 , fdu.xiaojf@gmail.com wrote: > I have googled and currently the only way I can find is do query > first and then update or insert. Or alternatively, UPDATE and see if you've affected any rows. If not, insert. Michael Glaesemann grzm seespotcode net ---------------------------(end of broadcast)--------------------------- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to majordomo@postgresql.org so that your message can get through to the mailing list cleanly |
| |||
| Michael Glaesemann wrote: > > On Dec 24, 2007, at 22:03 , fdu.xiaojf@gmail.com wrote: > >> I have googled and currently the only way I can find is do query first >> and then update or insert. > > Or alternatively, UPDATE and see if you've affected any rows. If not, > insert. > > Michael Glaesemann > grzm seespotcode net > > > Thanks for your quick reply! Is the update and insert method quick? I have tried the query and update/insert way, and it was very slow when more than 1 million records have been inserted. (I have more than 20 million records to insert.) Thanks again! Xiao Jianfeng ---------------------------(end of broadcast)--------------------------- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq |
| |||
| On Tue, 25 Dec 2007, fdu.xiaojf@gmail.com wrote: > insert a record into a table, and when the record already > exists(according to the primary key), update it. There is an example that does exactly that, 37-1, in the documentation at http://www.postgresql.org/docs/curre...tructures.html It actually does the update first and only if that fails does the insert, which avoids the whole duplicate key issue altogether. > I have tried the query and update/insert way, and it was very slow when > more than 1 million records have been inserted. (I have more than 20 > million records to insert.) This may be better because it isn't doing the query first. You may discover that you need to aggressively run one of the VACUUM processes (I'd guess regular and ANALYZE but not FULL) in order to keep performance steady as the number of records grows. Anytime you update a row, that becomes a dead row that's still taking up space, and if you do a lot of those they get in the way of finding the rows that are still live. Take a look at http://www.postgresql.org/docs/curre...vacuuming.html to get an idea of the process. -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD ---------------------------(end of broadcast)--------------------------- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq |
| |||
| fdu.xiaojf@gmail.com wrote: > I have a table like this, > CREATE TABLE mytable( > a varchar(40), > b text, > c text, > PRIMARY KEY (a, b) > ); > Is the update and insert method quick? > > I have tried the query and update/insert way, and it was very slow when > more than 1 million records have been inserted. (I have more than 20 > million records to insert.) Your data structure is not conducive to "quick". You have a primary key over two columns, a VARCHAR and a TEXT. That's going to be a very slow comparison on INSERT when checking for duplicate keys. Is it really true that column b contributes to the identity of the row? Find a more compact way to do the PRIMARY KEY. Perhaps there is a way to label the row uniquely, if column a by itself cannot do it. It'd be good to avoid having an unbounded character column in the PK. Can column b get quite large? -- Lew |
| |||
| On Dec 24, 2007, at 11:15 PM, Greg Smith wrote: >> > > This may be better because it isn't doing the query first. You may > discover that you need to aggressively run one of the VACUUM > processes (I'd guess regular and ANALYZE but not FULL) in order to > keep performance steady as the number of records grows. Anytime you > update a row, that becomes a dead row that's still taking up space, > and if you do a lot of those they get in the way of finding the rows > that are still live. Take a look at http://www.postgresql.org/docs/curre...vacuuming.html > to get an idea of the process. Whoa. I am going to have to dig into the implementation. What is wrong with update in place, concurrency issues? The dead row presumably is no longer indexed, right? Since it is known to be dead is it automatically removed when there are no live transaction that reference or may reference it and its data page space marked available for new rows? If not, why not? I'm dredging my mind for stuff from my RDBMS implementation grad course a very long time ago. I would imagine that vacuuming often in a huge insert update would be a pretty poor performer depending on implementation. How is this implemented? Why would it be heavy IO if a list of pointers effectively is being kept to the dead rows to simply be added to the free list? What else is vacuum doing? Lookup implemented removal from indices, something else? - samantha ---------------------------(end of broadcast)--------------------------- TIP 5: don't forget to increase your free space map settings |
| |||
| On Thursday 27 December 2007 12:23, Samantha Atkins wrote: > On Dec 24, 2007, at 11:15 PM, Greg Smith wrote: > > This may be better because it isn't doing the query first. You may > > discover that you need to aggressively run one of the VACUUM > > processes (I'd guess regular and ANALYZE but not FULL) in order to > > keep performance steady as the number of records grows. Anytime you > > update a row, that becomes a dead row that's still taking up space, > > and if you do a lot of those they get in the way of finding the rows > > that are still live. Take a look at > > http://www.postgresql.org/docs/curre...vacuuming.html > > to get an idea of the process. > > Whoa. I am going to have to dig into the implementation. What is > wrong with update in place, concurrency issues? The dead row > presumably is no longer indexed, right? At the time your transaction commits, it cannot update in place, since someone else may be looking at the old version of the row in the middle of thier transaction, so you need two copies. Even after updated you still need some pointer in the index for the old version of the row, in case it its referenced again. > Since it is known to be dead > is it automatically removed when there are no live transaction that > reference or may reference it and its data page space marked available > for new rows? If not, why not? I'm dredging my mind for stuff from > my RDBMS implementation grad course a very long time ago. > The problem is you have determined in your mind that a row is "known dead" without explination of how that would actually be determined. A given transaction doesn't have a way to determine if there are live transaction looking at the row, that would require quite a bit of knowledge about what else is occuring in the system to be able to determine that. That level of knowledge/complexity is what vacuum takes care of. -- Robert Treat Build A Brighter LAMP :: Linux Apache {middleware} PostgreSQL ---------------------------(end of broadcast)--------------------------- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to majordomo@postgresql.org so that your message can get through to the mailing list cleanly |
| |||
| On 12/28/07, Robert Treat <xzilla@users.sourceforge.net> wrote: > On Thursday 27 December 2007 12:23, Samantha Atkins wrote: > > Since it is known to be dead > > is it automatically removed when there are no live transaction that > > reference or may reference it and its data page space marked available > > for new rows? If not, why not? I'm dredging my mind for stuff from > > my RDBMS implementation grad course a very long time ago. > The problem is you have determined in your mind that a row is "known dead" > without explination of how that would actually be determined. A given > transaction doesn't have a way to determine if there are live transaction > looking at the row, that would require quite a bit of knowledge about what > else is occuring in the system to be able to determine that. That level of > knowledge/complexity is what vacuum takes care of. If you're familiar with the intricacies of the garbage collection vs realloc/free debate surrounding programming languages, especially in multithreaded environments, this is basically the same thing applied to disk storage. PostgreSQL implements garbage collection. Its methods for doing it automatically are still being refined, hence the advice on manual tuning for specific workloads. ---------------------------(end of broadcast)--------------------------- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to majordomo@postgresql.org so that your message can get through to the mailing list cleanly |
| ||||
| On Fri, 28 Dec 2007, Robert Treat wrote: > A given transaction doesn't have a way to determine if there are live > transaction looking at the row, that would require quite a bit of > knowledge about what else is occuring in the system to be able to > determine that. That level of knowledge/complexity is what vacuum takes > care of. One reason it doesn't happen automatically is that it would slow down the client doing the update. Sorting through to figure out which rows can be seen by which clients is better done later for a number of reasons; some samples: -It's more likely there won't be any transactions still referencing the original row as time moves forward. If you do it the instant the row is dead, odds are higher there's still somebody using the original one and you can't prune it yet anyway. -It's more efficicent to sort through a bunch of these in bulk than to do them one at a time. -You need to have a similar vacuum process happening regularly anyway to analyze your tables and keep statistics about them up to date, so might as well do both of those things at once. The downside is that vacuum can have a relatively high impact on the system, but the answer there is to do it more often so that any individual vacuum is less difficult. It'a also worth mentioning that some of this update row reuse happens more automatically in V8.3 with a new feature called HOT, so in some cases this particular issue has already has a workaround everyone can get in the near future. -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD ---------------------------(end of broadcast)--------------------------- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match |