This is a discussion on Re: within the Pgsql Performance forums, part of the PostgreSQL category; --> Steven Rosenstein <srosenst@us.ibm.com> writes: > My question is, are there any advantages, drawbacks, or outright > restrictions to using ...
| |||||||
| FAQ | Members List | Calendar | Search | Today's Posts | Mark Forums Read |
| ||||
| Steven Rosenstein <srosenst@us.ibm.com> writes: > My question is, are there any advantages, drawbacks, or outright > restrictions to using multiple simultaneous COPY commands to load data into > the same table? It will work; not sure about whether there is any performance benefit. I vaguely recall someone having posted about doing this, so you might check the archives. regards, tom lane ---------------------------(end of broadcast)--------------------------- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to majordomo@postgresql.org so that your message can get through to the mailing list cleanly |
| |||
| > Steven Rosenstein <srosenst@us.ibm.com> writes: > > My question is, are there any advantages, drawbacks, or outright > > restrictions to using multiple simultaneous COPY commands to load > data into > > the same table? Do you mean, multiple COPY commands (connections) being putline'd from the same thread (process)? I have indirect evidence that this may hurt. Two copy commands from different threads/processes are fine, and can help, if they alternate contention on some other resource (disk/CPU). I'm basing this on being at the third generation of a COPY implementation. The app loads about 1M objects/hour from 6 servers. Each object is split across four tables. The batch load opens four connections and firehoses records down each. A batch is 10K objects. COPY invokes all the same logic as INSERT on the server side (rowexclusive locking, transaction log, updating indexes, rules). The difference is that all the rows are inserted as a single transaction. This reduces the number of fsync's on the xlog, which may be a limiting factor for you. You'll want to crank WAL_BUFFERS and CHECKPOINT_SEGMENTS to match, though. One of my streams has 6K records; I run with WB=1000, CS=128. The downside I found with multiple clients inserting large blocks of rows was, that they serialized. I THINK that's because at some point they all needed to lock the same portions of the same indexes. I'm still working on how to avoid that, tuning the batch size and inserting into a "queue" table with fewer indexes. COPY (via putline) didn't do measurably better than INSERT until I batched 40 newline-separate rows into one putline call, which improved it 2-3:1. The suspect problem was stalling on the TCP stream; the driver was flushing small packets. This may or may not be relevant to you; depends on how much processing (waiting) your app does between posting of rows. In such a case, writing alternately to two TCP streams from the same process increases the likelihood of a stall. I've never tested that set-up; it would have been heading AWAY from the solution in my case. Hope that helps. -- Engineers think equations approximate reality. Physicists think reality approximates the equations. Mathematicians never make the connection. ---------------------------(end of broadcast)--------------------------- TIP 4: Don't 'kill -9' the postmaster |
| |||
| On 5/3/05, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Steven Rosenstein <srosenst@us.ibm.com> writes: > > My question is, are there any advantages, drawbacks, or outright > > restrictions to using multiple simultaneous COPY commands to load data into > > the same table? > > It will work; not sure about whether there is any performance benefit. > I vaguely recall someone having posted about doing this, so you might > check the archives. > I may be one of Tom's vague "voices". you need to remove all you UNIQUE constraints before sending multiple COPYs to the server. This includes the PRIMARY KEY constraint. To the backend, COPY is just like INSERT and all constraints need to be checked and this will block the commit of one of the COPY streams. However, multiple COPYs may no be needed. I regularly load several table totaling around 50M rows with a single COPY per table. I drop (actually, this is during DB reload, so I don't yet create...) all fkeys, constraints and indexes and the data loads in a matter of 5 minutes or so. Hope that helps! -- Mike Rylander mrylander@gmail.com GPLS -- PINES Development Database Developer http://open-ils.org ---------------------------(end of broadcast)--------------------------- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq |
| |||
| > COPY invokes all the same logic as INSERT on the server side > (rowexclusive locking, transaction log, updating indexes, rules). > The difference is that all the rows are inserted as a single > transaction. This reduces the number of fsync's on the xlog, > which may be a limiting factor for you. You'll want to crank > WAL_BUFFERS and CHECKPOINT_SEGMENTS to match, though. > One of my streams has 6K records; I run with WB=1000, CS=128. So what's the difference between a COPY and a batch of INSERT statements. Also, surely, fsyncs only occur at the end of a transaction, no need to fsync before a commit has been issued, right? David ---------------------------(end of broadcast)--------------------------- TIP 6: Have you searched our list archives? http://archives.postgresql.org |
| |||
| Quoting David Roussel <pgsql-performance@diroussel.xsmail.com>: > > COPY invokes all the same logic as INSERT on the server side > > (rowexclusive locking, transaction log, updating indexes, rules). > > The difference is that all the rows are inserted as a single > > transaction. This reduces the number of fsync's on the xlog, > > which may be a limiting factor for you. You'll want to crank > > WAL_BUFFERS and CHECKPOINT_SEGMENTS to match, though. > > One of my streams has 6K records; I run with WB=1000, CS=128. > > So what's the difference between a COPY and a batch of INSERT > statements. Also, surely, fsyncs only occur at the end of a > transaction, no need to fsync before a commit has been issued, > right? Sorry, I was comparing granularities the other way araound. As far as xlog is concerned, a COPY is ALWAYS one big txn, no matter how many putline commands you use to feed the copy. With inserts, you can choose whether to commit every row, every nth row, etc. Copy makes better use of the TCP connection for transmission. COPY uses the TCP connection like a one-way pipe. INSERT is like an RPC: the sender has to wait until the insert's return status roundtrips. -- Engineers think equations approximate reality. Physicists think reality approximates the equations. Mathematicians never make the connection. ---------------------------(end of broadcast)--------------------------- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq |
| |||
| David Roussel wrote: >>COPY invokes all the same logic as INSERT on the server side >>(rowexclusive locking, transaction log, updating indexes, rules). >>The difference is that all the rows are inserted as a single >>transaction. This reduces the number of fsync's on the xlog, >>which may be a limiting factor for you. You'll want to crank >>WAL_BUFFERS and CHECKPOINT_SEGMENTS to match, though. >>One of my streams has 6K records; I run with WB=1000, CS=128. > > > So what's the difference between a COPY and a batch of INSERT > statements. Also, surely, fsyncs only occur at the end of a > transaction, no need to fsync before a commit has been issued, right? I think COPY also has the advantage that for index updates it only grabs the lock once, rather than grabbing and releasing for each row. But I believe you are right that fsync only happens on COMMIT. > > David John =:-> -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.1 (GNU/Linux) Comment: Using GnuPG with Fedora - http://enigmail.mozdev.org iD8DBQFCeVmLJdeBCYSNAAMRAuWrAKCbyhxIXWmpXWuKbPGRhj yloIU7swCgxjPE Gn0DQu9Wj5m+Mr3Wv5Zzokk= =pkfD -----END PGP SIGNATURE----- |
| |||
| On 5/4/05, Mischa Sandberg <mischa.sandberg@telus.net> wrote: > Quoting David Roussel <pgsql-performance@diroussel.xsmail.com>: > > > > COPY invokes all the same logic as INSERT on the server side > > > (rowexclusive locking, transaction log, updating indexes, rules). > > > The difference is that all the rows are inserted as a single > > > transaction. This reduces the number of fsync's on the xlog, > > > which may be a limiting factor for you. You'll want to crank > > > WAL_BUFFERS and CHECKPOINT_SEGMENTS to match, though. > > > One of my streams has 6K records; I run with WB=1000, CS=128. > > > > So what's the difference between a COPY and a batch of INSERT > > statements. Also, surely, fsyncs only occur at the end of a > > transaction, no need to fsync before a commit has been issued, > > right? > > Sorry, I was comparing granularities the other way araound. As far as > xlog is concerned, a COPY is ALWAYS one big txn, no matter how many > putline commands you use to feed the copy. With inserts, you can choose > whether to commit every row, every nth row, etc. > > Copy makes better use of the TCP connection for transmission. COPY uses > the TCP connection like a one-way pipe. INSERT is like an RPC: the > sender has to wait until the insert's return status roundtrips. I have found even greater performance increases by using COPY FROM <filename> not COPY FROM STDIN. This allows the backend process to directly read the file, rather than shoving it over a pipe (thereby potentially hitting the CPU multiple times). My experience is that this is anywhere from 5-10x faster than INSERT statements on the whole, and sometimes 200x. Chris -- | Christopher Petrilli | petrilli@gmail.com ---------------------------(end of broadcast)--------------------------- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq |
| |||
| On Wed, 4 May 2005, Mischa Sandberg wrote: > Copy makes better use of the TCP connection for transmission. COPY uses > the TCP connection like a one-way pipe. INSERT is like an RPC: the > sender has to wait until the insert's return status roundtrips. Not true. A client may send any number of Bind/Execute messages on a prepared statement before a Sync message. So multiple inserts may be sent in one network roundtrip. This is exactly how the JDBC driver implements batch statements. There is some limit to the number of queries in flight at any given moment because there is the potential to deadlock if both sides of network buffers are filled up and each side is blocked waiting on a write. The JDBC driver has conservatively selected 256 as the maximum number of queries to send at once. Kris Jurka ---------------------------(end of broadcast)--------------------------- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to majordomo@postgresql.org so that your message can get through to the mailing list cleanly |
| |||
| > So what's the difference between a COPY and a batch of INSERT > statements. Also, surely, fsyncs only occur at the end of a > transaction, no need to fsync before a commit has been issued, right? With COPY, the data being inserted itself does not have to pass through the postgresql parser. Chris ---------------------------(end of broadcast)--------------------------- TIP 7: don't forget to increase your free space map settings |
| ||||
| Christopher Kings-Lynne <chriskl@familyhealth.com.au> writes: >> So what's the difference between a COPY and a batch of INSERT >> statements. Also, surely, fsyncs only occur at the end of a >> transaction, no need to fsync before a commit has been issued, right? > With COPY, the data being inserted itself does not have to pass through > the postgresql parser. Also, there is a whole lot of one-time-per-statement overhead that can be amortized across many rows instead of only one. Stuff like opening the target table, looking up the per-column I/O conversion functions, identifying trigger functions if any, yadda yadda. It's not *that* expensive, but compared to an operation as small as inserting a single row, it's significant. regards, tom lane ---------------------------(end of broadcast)--------------------------- TIP 8: explain analyze is your friend |