Unix Technical Forum

Re: COPY FROM performance improvements

This is a discussion on Re: COPY FROM performance improvements within the Pgsql Patches forums, part of the PostgreSQL category; --> On Thu, 14 Jul 2005 17:22:18 -0700 "Alon Goldshuv" <agoldshuv@greenplum.com> wrote: > I revisited my patch and removed the ...


Go Back   Unix Technical Forum > Database Server Software > PostgreSQL > Pgsql Patches

FAQ Members List Calendar Search Today's Posts Mark Forums Read
  #1 (permalink)  
Old 04-18-2008, 12:42 AM
Mark Wong
 
Posts: n/a
Default Re: COPY FROM performance improvements

On Thu, 14 Jul 2005 17:22:18 -0700
"Alon Goldshuv" <agoldshuv@greenplum.com> wrote:

> I revisited my patch and removed the code duplications that were there, and
> added support for CSV with buffered input, so CSV now runs faster too
> (although it is not as optimized as the TEXT format parsing). So now
> TEXT,CSV and BINARY are all parsed in CopyFrom(), like in the original file.


Hi Alon,

I'm curious, what kind of system are you testing this on? I'm trying to
load 100GB of data in our dbt3 workload on a 4-way itanium2. I'm
interested in the results you would expect.

Mark

---------------------------(end of broadcast)---------------------------
TIP 3: Have you checked our extensive FAQ?

http://www.postgresql.org/docs/faq

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #2 (permalink)  
Old 04-18-2008, 12:42 AM
Alon Goldshuv
 
Posts: n/a
Default Re: COPY FROM performance improvements

Hi Mark,

I improved the data *parsing* capabilities of COPY, and didn't touch the
data conversion or data insertion parts of the code. The parsing improvement
will vary largely depending on the ratio of parsing -to- converting and
inserting.

Therefore, the speed increase really depends on the nature of your data:

100GB file with
long data rows (lots of parsing)
Small number of columns (small number of attr conversions per row)
less rows (less tuple insertions)

Will show the best performance improvements.

However, same file size 100GB with
Short data rows (minimal parsing)
large number of columns (large number of attr conversions per row)
AND/OR
more rows (more tuple insertions)

Will show improvements but not as significant.
In general I'll estimate 40%-95% improvement in load speed for the 1st case
and 10%-40% for the 2nd. But that also depends on the hardware, disk speed
etc... This is for TEXT format. As for CSV, it may be faster but not as much
as I specified here. BINARY will stay the same as before.

HTH
Alon.






On 7/19/05 12:54 PM, "Mark Wong" <markw@osdl.org> wrote:

> On Thu, 14 Jul 2005 17:22:18 -0700
> "Alon Goldshuv" <agoldshuv@greenplum.com> wrote:
>
>> I revisited my patch and removed the code duplications that were there, and
>> added support for CSV with buffered input, so CSV now runs faster too
>> (although it is not as optimized as the TEXT format parsing). So now
>> TEXT,CSV and BINARY are all parsed in CopyFrom(), like in the original file.

>
> Hi Alon,
>
> I'm curious, what kind of system are you testing this on? I'm trying to
> load 100GB of data in our dbt3 workload on a 4-way itanium2. I'm
> interested in the results you would expect.
>
> Mark
>




---------------------------(end of broadcast)---------------------------
TIP 4: Have you searched our list archives?

http://archives.postgresql.org

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #3 (permalink)  
Old 04-18-2008, 12:42 AM
Mark Wong
 
Posts: n/a
Default Re: COPY FROM performance improvements

Hi Alon,

Yeah, that helps. I just need to break up my scripts a little to just
load the data and not build indexes.

Is the following information good enough to give a guess about the data
I'm loading, if you don't mind? Here's a link to my script to create
tables:
http://developer.osdl.org/markw/mt/g...e_tables.sh.in

File sizes:
-rw-r--r-- 1 markw 50 2.3G Jul 8 15:03 customer.tbl
-rw-r--r-- 1 markw 50 74G Jul 8 15:03 lineitem.tbl
-rw-r--r-- 1 markw 50 2.1K Jul 8 15:03 nation.tbl
-rw-r--r-- 1 markw 50 17G Jul 8 15:03 orders.tbl
-rw-r--r-- 1 markw 50 2.3G Jul 8 15:03 part.tbl
-rw-r--r-- 1 markw 50 12G Jul 8 15:03 partsupp.tbl
-rw-r--r-- 1 markw 50 391 Jul 8 15:03 region.tbl
-rw-r--r-- 1 markw 50 136M Jul 8 15:03 supplier.tbl

Number of rows:
# wc -l *.tbl
15000000 customer.tbl
600037902 lineitem.tbl
25 nation.tbl
150000000 orders.tbl
20000000 part.tbl
80000000 partsupp.tbl
5 region.tbl
1000000 supplier.tbl

Thanks,
Mark

On Tue, 19 Jul 2005 14:05:56 -0700
"Alon Goldshuv" <agoldshuv@greenplum.com> wrote:

> Hi Mark,
>
> I improved the data *parsing* capabilities of COPY, and didn't touch the
> data conversion or data insertion parts of the code. The parsing improvement
> will vary largely depending on the ratio of parsing -to- converting and
> inserting.
>
> Therefore, the speed increase really depends on the nature of your data:
>
> 100GB file with
> long data rows (lots of parsing)
> Small number of columns (small number of attr conversions per row)
> less rows (less tuple insertions)
>
> Will show the best performance improvements.
>
> However, same file size 100GB with
> Short data rows (minimal parsing)
> large number of columns (large number of attr conversions per row)
> AND/OR
> more rows (more tuple insertions)
>
> Will show improvements but not as significant.
> In general I'll estimate 40%-95% improvement in load speed for the 1st case
> and 10%-40% for the 2nd. But that also depends on the hardware, disk speed
> etc... This is for TEXT format. As for CSV, it may be faster but not as much
> as I specified here. BINARY will stay the same as before.
>
> HTH
> Alon.
>
>
>
>
>
>
> On 7/19/05 12:54 PM, "Mark Wong" <markw@osdl.org> wrote:
>
> > On Thu, 14 Jul 2005 17:22:18 -0700
> > "Alon Goldshuv" <agoldshuv@greenplum.com> wrote:
> >
> >> I revisited my patch and removed the code duplications that were there, and
> >> added support for CSV with buffered input, so CSV now runs faster too
> >> (although it is not as optimized as the TEXT format parsing). So now
> >> TEXT,CSV and BINARY are all parsed in CopyFrom(), like in the original file.

> >
> > Hi Alon,
> >
> > I'm curious, what kind of system are you testing this on? I'm trying to
> > load 100GB of data in our dbt3 workload on a 4-way itanium2. I'm
> > interested in the results you would expect.
> >
> > Mark
> >

>


---------------------------(end of broadcast)---------------------------
TIP 3: Have you checked our extensive FAQ?

http://www.postgresql.org/docs/faq

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #4 (permalink)  
Old 04-18-2008, 12:42 AM
Alon Goldshuv
 
Posts: n/a
Default Re: COPY FROM performance improvements

Mark,

Thanks for the info.

Yes, isolating indexes out of the picture is a good idea for this purpose.

I can't really give a guess to how fast the load rate should be. I don't
know how your system is configured, and all the hardware characteristics
(and even if I knew that info I may not be able to guess...). I am pretty
confident that the load will be faster than before, I'll risk that ;-)
Looking into your TPC-H size and metadata I'll estimate that
partsupp,customer and orders will have the most significant increase in load
rate. You could start with those.

I guess the only way to really know is to try... Load several times with the
existing PG-COPY and then load several times with the patched COPY and
compare. I'll be curious to hear your results.

Thx,
Alon.




On 7/19/05 2:37 PM, "Mark Wong" <markw@osdl.org> wrote:

> Hi Alon,
>
> Yeah, that helps. I just need to break up my scripts a little to just
> load the data and not build indexes.
>
> Is the following information good enough to give a guess about the data
> I'm loading, if you don't mind? Here's a link to my script to create
> tables:
> http://developer.osdl.org/markw/mt/g...9780645b2bb44f
> 7f23437e432&path=scripts/pgsql/create_tables.sh.in
>
> File sizes:
> -rw-r--r-- 1 markw 50 2.3G Jul 8 15:03 customer.tbl
> -rw-r--r-- 1 markw 50 74G Jul 8 15:03 lineitem.tbl
> -rw-r--r-- 1 markw 50 2.1K Jul 8 15:03 nation.tbl
> -rw-r--r-- 1 markw 50 17G Jul 8 15:03 orders.tbl
> -rw-r--r-- 1 markw 50 2.3G Jul 8 15:03 part.tbl
> -rw-r--r-- 1 markw 50 12G Jul 8 15:03 partsupp.tbl
> -rw-r--r-- 1 markw 50 391 Jul 8 15:03 region.tbl
> -rw-r--r-- 1 markw 50 136M Jul 8 15:03 supplier.tbl
>
> Number of rows:
> # wc -l *.tbl
> 15000000 customer.tbl
> 600037902 lineitem.tbl
> 25 nation.tbl
> 150000000 orders.tbl
> 20000000 part.tbl
> 80000000 partsupp.tbl
> 5 region.tbl
> 1000000 supplier.tbl
>
> Thanks,
> Mark
>
> On Tue, 19 Jul 2005 14:05:56 -0700
> "Alon Goldshuv" <agoldshuv@greenplum.com> wrote:
>
>> Hi Mark,
>>
>> I improved the data *parsing* capabilities of COPY, and didn't touch the
>> data conversion or data insertion parts of the code. The parsing improvement
>> will vary largely depending on the ratio of parsing -to- converting and
>> inserting.
>>
>> Therefore, the speed increase really depends on the nature of your data:
>>
>> 100GB file with
>> long data rows (lots of parsing)
>> Small number of columns (small number of attr conversions per row)
>> less rows (less tuple insertions)
>>
>> Will show the best performance improvements.
>>
>> However, same file size 100GB with
>> Short data rows (minimal parsing)
>> large number of columns (large number of attr conversions per row)
>> AND/OR
>> more rows (more tuple insertions)
>>
>> Will show improvements but not as significant.
>> In general I'll estimate 40%-95% improvement in load speed for the 1st case
>> and 10%-40% for the 2nd. But that also depends on the hardware, disk speed
>> etc... This is for TEXT format. As for CSV, it may be faster but not as much
>> as I specified here. BINARY will stay the same as before.
>>
>> HTH
>> Alon.
>>
>>
>>
>>
>>
>>
>> On 7/19/05 12:54 PM, "Mark Wong" <markw@osdl.org> wrote:
>>
>>> On Thu, 14 Jul 2005 17:22:18 -0700
>>> "Alon Goldshuv" <agoldshuv@greenplum.com> wrote:
>>>
>>>> I revisited my patch and removed the code duplications that were there, and
>>>> added support for CSV with buffered input, so CSV now runs faster too
>>>> (although it is not as optimized as the TEXT format parsing). So now
>>>> TEXT,CSV and BINARY are all parsed in CopyFrom(), like in the original
>>>> file.
>>>
>>> Hi Alon,
>>>
>>> I'm curious, what kind of system are you testing this on? I'm trying to
>>> load 100GB of data in our dbt3 workload on a 4-way itanium2. I'm
>>> interested in the results you would expect.
>>>
>>> Mark
>>>

>>

>




---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #5 (permalink)  
Old 04-18-2008, 12:42 AM
Andrew Dunstan
 
Posts: n/a
Default Re: COPY FROM performance improvements

Mark,

You should definitely not be doing this sort of thing, I believe:

CREATE TABLE orders (
o_orderkey INTEGER,
o_custkey INTEGER,
o_orderstatus CHAR(1),
o_totalprice REAL,
o_orderDATE DATE,
o_orderpriority CHAR(15),
o_clerk CHAR(15),
o_shippriority INTEGER,
o_comment VARCHAR(79),
PRIMARY KEY (o_orderkey))

Create the table with no constraints, load the data, then set up primary keys and whatever other constraints you want using ALTER TABLE. Last time I did a load like this (albeit 2 orders of magnitude smaller) I saw a 50% speedup from deferring constarint creation.


cheers

andrew



Mark Wong wrote:

>Hi Alon,
>
>Yeah, that helps. I just need to break up my scripts a little to just
>load the data and not build indexes.
>
>Is the following information good enough to give a guess about the data
>I'm loading, if you don't mind? Here's a link to my script to create
>tables:
>http://developer.osdl.org/markw/mt/g...e_tables.sh.in
>
>File sizes:
>-rw-r--r-- 1 markw 50 2.3G Jul 8 15:03 customer.tbl
>-rw-r--r-- 1 markw 50 74G Jul 8 15:03 lineitem.tbl
>-rw-r--r-- 1 markw 50 2.1K Jul 8 15:03 nation.tbl
>-rw-r--r-- 1 markw 50 17G Jul 8 15:03 orders.tbl
>-rw-r--r-- 1 markw 50 2.3G Jul 8 15:03 part.tbl
>-rw-r--r-- 1 markw 50 12G Jul 8 15:03 partsupp.tbl
>-rw-r--r-- 1 markw 50 391 Jul 8 15:03 region.tbl
>-rw-r--r-- 1 markw 50 136M Jul 8 15:03 supplier.tbl
>
>Number of rows:
># wc -l *.tbl
> 15000000 customer.tbl
> 600037902 lineitem.tbl
> 25 nation.tbl
> 150000000 orders.tbl
> 20000000 part.tbl
> 80000000 partsupp.tbl
> 5 region.tbl
> 1000000 supplier.tbl
>
>Thanks,
>Mark
>
>On Tue, 19 Jul 2005 14:05:56 -0700
>"Alon Goldshuv" <agoldshuv@greenplum.com> wrote:
>
>
>
>>Hi Mark,
>>
>>I improved the data *parsing* capabilities of COPY, and didn't touch the
>>data conversion or data insertion parts of the code. The parsing improvement
>>will vary largely depending on the ratio of parsing -to- converting and
>>inserting.
>>
>>Therefore, the speed increase really depends on the nature of your data:
>>
>>100GB file with
>>long data rows (lots of parsing)
>>Small number of columns (small number of attr conversions per row)
>>less rows (less tuple insertions)
>>
>>Will show the best performance improvements.
>>
>>However, same file size 100GB with
>>Short data rows (minimal parsing)
>>large number of columns (large number of attr conversions per row)
>>AND/OR
>>more rows (more tuple insertions)
>>
>>Will show improvements but not as significant.
>>In general I'll estimate 40%-95% improvement in load speed for the 1st case
>>and 10%-40% for the 2nd. But that also depends on the hardware, disk speed
>>etc... This is for TEXT format. As for CSV, it may be faster but not as much
>>as I specified here. BINARY will stay the same as before.
>>
>>HTH
>>Alon.
>>
>>
>>
>>
>>
>>
>>On 7/19/05 12:54 PM, "Mark Wong" <markw@osdl.org> wrote:
>>
>>
>>
>>>On Thu, 14 Jul 2005 17:22:18 -0700
>>>"Alon Goldshuv" <agoldshuv@greenplum.com> wrote:
>>>
>>>
>>>
>>>>I revisited my patch and removed the code duplications that were there, and
>>>>added support for CSV with buffered input, so CSV now runs faster too
>>>>(although it is not as optimized as the TEXT format parsing). So now
>>>>TEXT,CSV and BINARY are all parsed in CopyFrom(), like in the original file.
>>>>
>>>>
>>>Hi Alon,
>>>
>>>I'm curious, what kind of system are you testing this on? I'm trying to
>>>load 100GB of data in our dbt3 workload on a 4-way itanium2. I'm
>>>interested in the results you would expect.
>>>
>>>Mark
>>>
>>>
>>>

>
>---------------------------(end of broadcast)---------------------------
>TIP 3: Have you checked our extensive FAQ?
>
> http://www.postgresql.org/docs/faq
>
>
>


---------------------------(end of broadcast)---------------------------
TIP 4: Have you searched our list archives?

http://archives.postgresql.org

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #6 (permalink)  
Old 04-18-2008, 12:42 AM
Mark Wong
 
Posts: n/a
Default Re: COPY FROM performance improvements

Whoopsies, yeah good point about the PRIMARY KEY. I'll fix that.

Mark

On Tue, 19 Jul 2005 18:17:52 -0400
Andrew Dunstan <andrew@dunslane.net> wrote:

> Mark,
>
> You should definitely not be doing this sort of thing, I believe:
>
> CREATE TABLE orders (
> o_orderkey INTEGER,
> o_custkey INTEGER,
> o_orderstatus CHAR(1),
> o_totalprice REAL,
> o_orderDATE DATE,
> o_orderpriority CHAR(15),
> o_clerk CHAR(15),
> o_shippriority INTEGER,
> o_comment VARCHAR(79),
> PRIMARY KEY (o_orderkey))
>
> Create the table with no constraints, load the data, then set up primary keys and whatever other constraints you want using ALTER TABLE. Last time I did a load like this (albeit 2 orders of magnitude smaller) I saw a 50% speedup from deferring constarint creation.
>
>
> cheers
>
> andrew
>
>
>
> Mark Wong wrote:
>
> >Hi Alon,
> >
> >Yeah, that helps. I just need to break up my scripts a little to just
> >load the data and not build indexes.
> >
> >Is the following information good enough to give a guess about the data
> >I'm loading, if you don't mind? Here's a link to my script to create
> >tables:
> >http://developer.osdl.org/markw/mt/g...e_tables.sh.in
> >
> >File sizes:
> >-rw-r--r-- 1 markw 50 2.3G Jul 8 15:03 customer.tbl
> >-rw-r--r-- 1 markw 50 74G Jul 8 15:03 lineitem.tbl
> >-rw-r--r-- 1 markw 50 2.1K Jul 8 15:03 nation.tbl
> >-rw-r--r-- 1 markw 50 17G Jul 8 15:03 orders.tbl
> >-rw-r--r-- 1 markw 50 2.3G Jul 8 15:03 part.tbl
> >-rw-r--r-- 1 markw 50 12G Jul 8 15:03 partsupp.tbl
> >-rw-r--r-- 1 markw 50 391 Jul 8 15:03 region.tbl
> >-rw-r--r-- 1 markw 50 136M Jul 8 15:03 supplier.tbl
> >
> >Number of rows:
> ># wc -l *.tbl
> > 15000000 customer.tbl
> > 600037902 lineitem.tbl
> > 25 nation.tbl
> > 150000000 orders.tbl
> > 20000000 part.tbl
> > 80000000 partsupp.tbl
> > 5 region.tbl
> > 1000000 supplier.tbl
> >
> >Thanks,
> >Mark
> >
> >On Tue, 19 Jul 2005 14:05:56 -0700
> >"Alon Goldshuv" <agoldshuv@greenplum.com> wrote:
> >
> >
> >
> >>Hi Mark,
> >>
> >>I improved the data *parsing* capabilities of COPY, and didn't touch the
> >>data conversion or data insertion parts of the code. The parsing improvement
> >>will vary largely depending on the ratio of parsing -to- converting and
> >>inserting.
> >>
> >>Therefore, the speed increase really depends on the nature of your data:
> >>
> >>100GB file with
> >>long data rows (lots of parsing)
> >>Small number of columns (small number of attr conversions per row)
> >>less rows (less tuple insertions)
> >>
> >>Will show the best performance improvements.
> >>
> >>However, same file size 100GB with
> >>Short data rows (minimal parsing)
> >>large number of columns (large number of attr conversions per row)
> >>AND/OR
> >>more rows (more tuple insertions)
> >>
> >>Will show improvements but not as significant.
> >>In general I'll estimate 40%-95% improvement in load speed for the 1st case
> >>and 10%-40% for the 2nd. But that also depends on the hardware, disk speed
> >>etc... This is for TEXT format. As for CSV, it may be faster but not as much
> >>as I specified here. BINARY will stay the same as before.
> >>
> >>HTH
> >>Alon.
> >>
> >>
> >>
> >>
> >>
> >>
> >>On 7/19/05 12:54 PM, "Mark Wong" <markw@osdl.org> wrote:
> >>
> >>
> >>
> >>>On Thu, 14 Jul 2005 17:22:18 -0700
> >>>"Alon Goldshuv" <agoldshuv@greenplum.com> wrote:
> >>>
> >>>
> >>>
> >>>>I revisited my patch and removed the code duplications that were there, and
> >>>>added support for CSV with buffered input, so CSV now runs faster too
> >>>>(although it is not as optimized as the TEXT format parsing). So now
> >>>>TEXT,CSV and BINARY are all parsed in CopyFrom(), like in the original file.
> >>>>
> >>>>
> >>>Hi Alon,
> >>>
> >>>I'm curious, what kind of system are you testing this on? I'm trying to
> >>>load 100GB of data in our dbt3 workload on a 4-way itanium2. I'm
> >>>interested in the results you would expect.
> >>>
> >>>Mark
> >>>
> >>>
> >>>

> >


---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #7 (permalink)  
Old 04-18-2008, 12:42 AM
Luke Lonergan
 
Posts: n/a
Default Re: COPY FROM performance improvements

Good points on all, another element in the performance expectations is the
ratio of CPU speed to I/O subsystem speed, as Alon had hinted earlier.

This patch substantially (500%) improves the efficiency of parsing in the
COPY path, which, on a 3GHz P4 desktop with a commodity disk drive
represents 8 of a total of 30 seconds of processing time. So, by reducing
the parsing time from 8 seconds to 1.5 seconds, the overall COPY time is
reduced from 30 seconds to 23.5 seconds, or a speedup of about 20%.

On a dual 2.2GHz Opteron machine with a 6-disk SCSI RAID subsystem capable
of 240MB/s sequential read and writes, the ratios change and we see between
35% and 95% increase in COPY performance, with the bottleneck being CPU.
The disk is only running at about 90MB/s during this period.

I'd expect that as your CPUs slow down relative to your I/O speed, and
Itaniums or IT2s are quite slow, you should see an increased effect of the
parsing improvements.

One good way to validate the effect is to watch the I/O bandwidth using
vmstat 1 (on Linux) while the load is progressing. When you watch that with
the unpatched source and with the patched source, if they are the same, you
should see no benefit from the patch (you are I/O limited).

If you check your underlying sequential write speed, you will be
bottlenecked at roughly half that in performing COPY because of the
write-through the WAL.

- Luke

On 7/19/05 3:51 PM, "Mark Wong" <markw@osdl.org> wrote:

> Whoopsies, yeah good point about the PRIMARY KEY. I'll fix that.
>
> Mark
>
> On Tue, 19 Jul 2005 18:17:52 -0400
> Andrew Dunstan <andrew@dunslane.net> wrote:
>
>> Mark,
>>
>> You should definitely not be doing this sort of thing, I believe:
>>
>> CREATE TABLE orders (
>> o_orderkey INTEGER,
>> o_custkey INTEGER,
>> o_orderstatus CHAR(1),
>> o_totalprice REAL,
>> o_orderDATE DATE,
>> o_orderpriority CHAR(15),
>> o_clerk CHAR(15),
>> o_shippriority INTEGER,
>> o_comment VARCHAR(79),
>> PRIMARY KEY (o_orderkey))
>>
>> Create the table with no constraints, load the data, then set up primary keys
>> and whatever other constraints you want using ALTER TABLE. Last time I did a
>> load like this (albeit 2 orders of magnitude smaller) I saw a 50% speedup
>> from deferring constarint creation.
>>
>>
>> cheers
>>
>> andrew
>>
>>
>>
>> Mark Wong wrote:
>>
>>> Hi Alon,
>>>
>>> Yeah, that helps. I just need to break up my scripts a little to just
>>> load the data and not build indexes.
>>>
>>> Is the following information good enough to give a guess about the data
>>> I'm loading, if you don't mind? Here's a link to my script to create
>>> tables:
>>> http://developer.osdl.org/markw/mt/g...729780645b2bb4
>>> 4f7f23437e432&path=scripts/pgsql/create_tables.sh.in
>>>
>>> File sizes:
>>> -rw-r--r-- 1 markw 50 2.3G Jul 8 15:03 customer.tbl
>>> -rw-r--r-- 1 markw 50 74G Jul 8 15:03 lineitem.tbl
>>> -rw-r--r-- 1 markw 50 2.1K Jul 8 15:03 nation.tbl
>>> -rw-r--r-- 1 markw 50 17G Jul 8 15:03 orders.tbl
>>> -rw-r--r-- 1 markw 50 2.3G Jul 8 15:03 part.tbl
>>> -rw-r--r-- 1 markw 50 12G Jul 8 15:03 partsupp.tbl
>>> -rw-r--r-- 1 markw 50 391 Jul 8 15:03 region.tbl
>>> -rw-r--r-- 1 markw 50 136M Jul 8 15:03 supplier.tbl
>>>
>>> Number of rows:
>>> # wc -l *.tbl
>>> 15000000 customer.tbl
>>> 600037902 lineitem.tbl
>>> 25 nation.tbl
>>> 150000000 orders.tbl
>>> 20000000 part.tbl
>>> 80000000 partsupp.tbl
>>> 5 region.tbl
>>> 1000000 supplier.tbl
>>>
>>> Thanks,
>>> Mark
>>>
>>> On Tue, 19 Jul 2005 14:05:56 -0700
>>> "Alon Goldshuv" <agoldshuv@greenplum.com> wrote:
>>>
>>>
>>>
>>>> Hi Mark,
>>>>
>>>> I improved the data *parsing* capabilities of COPY, and didn't touch the
>>>> data conversion or data insertion parts of the code. The parsing
>>>> improvement
>>>> will vary largely depending on the ratio of parsing -to- converting and
>>>> inserting.
>>>>
>>>> Therefore, the speed increase really depends on the nature of your data:
>>>>
>>>> 100GB file with
>>>> long data rows (lots of parsing)
>>>> Small number of columns (small number of attr conversions per row)
>>>> less rows (less tuple insertions)
>>>>
>>>> Will show the best performance improvements.
>>>>
>>>> However, same file size 100GB with
>>>> Short data rows (minimal parsing)
>>>> large number of columns (large number of attr conversions per row)
>>>> AND/OR
>>>> more rows (more tuple insertions)
>>>>
>>>> Will show improvements but not as significant.
>>>> In general I'll estimate 40%-95% improvement in load speed for the 1st case
>>>> and 10%-40% for the 2nd. But that also depends on the hardware, disk speed
>>>> etc... This is for TEXT format. As for CSV, it may be faster but not as
>>>> much
>>>> as I specified here. BINARY will stay the same as before.
>>>>
>>>> HTH
>>>> Alon.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On 7/19/05 12:54 PM, "Mark Wong" <markw@osdl.org> wrote:
>>>>
>>>>
>>>>
>>>>> On Thu, 14 Jul 2005 17:22:18 -0700
>>>>> "Alon Goldshuv" <agoldshuv@greenplum.com> wrote:
>>>>>
>>>>>
>>>>>
>>>>>> I revisited my patch and removed the code duplications that were there,
>>>>>> and
>>>>>> added support for CSV with buffered input, so CSV now runs faster too
>>>>>> (although it is not as optimized as the TEXT format parsing). So now
>>>>>> TEXT,CSV and BINARY are all parsed in CopyFrom(), like in the original
>>>>>> file.
>>>>>>
>>>>>>
>>>>> Hi Alon,
>>>>>
>>>>> I'm curious, what kind of system are you testing this on? I'm trying to
>>>>> load 100GB of data in our dbt3 workload on a 4-way itanium2. I'm
>>>>> interested in the results you would expect.
>>>>>
>>>>> Mark
>>>>>
>>>>>
>>>>>
>>>

>
> ---------------------------(end of broadcast)---------------------------
> TIP 6: explain analyze is your friend
>




---------------------------(end of broadcast)---------------------------
TIP 1: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
Reply


Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are On
Forum Jump


All times are GMT. The time now is 05:25 PM.


Powered by vBulletin® Version 3.6.5
Copyright ©2000 - 2008, Jelsoft Enterprises Ltd.
SEO by vBSEO 3.2.0
www.UnixAdminTalk.com