Unix Technical Forum

Re: [PERFORM] A Better External Sort?

This is a discussion on Re: [PERFORM] A Better External Sort? within the pgsql Hackers forums, part of the PostgreSQL category; --> You have not said anything about what HW, OS version, and pg version used here, but even at that ...


Go Back   Unix Technical Forum > Database Server Software > PostgreSQL > pgsql Hackers

FAQ Members List Calendar Search Today's Posts Mark Forums Read
  #1 (permalink)  
Old 04-11-2008, 07:02 AM
Ron Peacetree
 
Posts: n/a
Default Re: [PERFORM] A Better External Sort?

You have not said anything about what HW, OS version, and pg version
used here, but even at that can't you see that something Smells Wrong?

The most common CPUs currently shipping have clock rates of ~2-3GHz
and have 8B-16B internal pathways. SPARCs and other like CPUs are
clocked slower but have 16B-32B internal pathways. In short, these
CPU's have an internal bandwidth of 16+ GBps.

The most common currently shipping mainboards have 6.4GBps RAM
subsystems. ITRW, their peak is ~80% of that, or ~5.1GBps.

In contrast, the absolute peak bandwidth of a 133MHx 8B PCI-X bus is
1GBps, and ITRW it peaks at ~800-850MBps. Should anyone ever build
a RAID system that can saturate a PCI-Ex16 bus, that system will be
maxing ITRW at ~3.2GBps.

CPUs should NEVER be 100% utilized during copy IO. They should be
idling impatiently waiting for the next piece of data to finish being
processed even when the RAM IO subsystem is pegged; and they
definitely should be IO starved rather than CPU bound when doing
HD IO.

Those IO rates are also alarming in all but possibly the first case. A
single ~50MBps HD doing 21MBps isn't bad, but for even a single
~80MBps HD it starts to be of concern. If any these IO rates came
from any reasonable 300+MBps RAID array, then they are BAD.

What your simple experiment really does is prove We Have A
Problem (tm) with our IO code at either or both of the OS or the pg
level(s).

Ron


-----Original Message-----
From: Martijn van Oosterhout <kleptog@svana.org>
Sent: Oct 1, 2005 12:19 PM
Subject: Re: [HACKERS] [PERFORM] A Better External Sort?

On Sat, Oct 01, 2005 at 10:22:40AM -0400, Ron Peacetree wrote:
> Assuming we get the abyssmal physical IO performance fixed...
> (because until we do, _nothing_ is going to help us as much)


I'm still not convinced this is the major problem. For example, in my
totally unscientific tests on an oldish machine I have here:

Direct filesystem copy to /dev/null
21MB/s 10% user 50% system (dual cpu, so the system is using a whole CPU)

COPY TO /dev/null WITH binary
13MB/s 55% user 45% system (ergo, CPU bound)

COPY TO /dev/null
4.4MB/s 60% user 40% system

\copy to /dev/null in psql
6.5MB/s 60% user 40% system

This machine is a bit strange setup, not sure why fs copy is so slow.
As to why \copy is faster than COPY, I have no idea, but it is
repeatable. And actually turning the tuples into a printable format is
the most expensive. But it does point out that the whole process is
probably CPU bound more than anything else.

So, I don't think physical I/O is the problem. It's something further
up the call tree. I wouldn't be surprised at all it it had to do with
the creation and destruction of tuples. The cost of comparing tuples
should not be underestimated.

---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #2 (permalink)  
Old 04-11-2008, 07:03 AM
Martijn van Oosterhout
 
Posts: n/a
Default Re: [PERFORM] A Better External Sort?

[removed -performance, not subscribed]

On Sat, Oct 01, 2005 at 01:42:32PM -0400, Ron Peacetree wrote:
> You have not said anything about what HW, OS version, and pg version
> used here, but even at that can't you see that something Smells Wrong?


Somewhat old machine running 7.3 on Linux 2.4. Not exactly speed
daemons but it's still true that the whole process would be CPU bound
*even* if the O/S could idle while it's waiting. PostgreSQL used a
*whole CPU* which is its limit. My point is that trying to reduce I/O
by increasing CPU usage is not going to be benficial, we need CPU usage
down also.

Anyway, to bring some real info I just profiled PostgreSQL 8.1beta
doing an index create on a 2960296 row table (3 columns, table size
317MB).

The number 1 bottleneck with 41% of user time is comparetup_index. It
was called 95,369,361 times (about 2*ln(N)*N). It used 3 tapes. Another
15% of time went to tuplesort_heap_siftup.

The thing is, I can't see anything in comparetup_index() that could
take much time. The actual comparisons are accounted elsewhere
(inlineApplySortFunction) which amounted to <10% of total time. Since
nocache_index_getattr doesn't feature I can't imagine index_getattr
being a big bottleneck. Any ideas what's going on here?

Other interesting features:
- ~4 memory allocations per tuple, nearly all of which were explicitly
freed
- Things I though would be expensive, like: heapgettup and
myFunctionCall2 didn't really count for much.

Have a nice weekend,

% cumulative self self total
time seconds seconds calls s/call s/call name
43.63 277.81 277.81 95370055 0.00 0.00 comparetup_index
16.24 381.24 103.43 5920592 0.00 0.00 tuplesort_heap_siftup
3.76 405.17 23.93 95370055 0.00 0.00 inlineApplySortFunction
3.18 425.42 20.26 95370056 0.00 0.00 btint4cmp
2.82 443.37 17.95 11856219 0.00 0.00 AllocSetAlloc
2.52 459.44 16.07 95370055 0.00 0.00 myFunctionCall2
1.71 470.35 10.91 2960305 0.00 0.00 heapgettup
1.26 478.38 8.03 11841204 0.00 0.00 GetMemoryChunkSpace
1.14 485.67 7.29 5920592 0.00 0.00 tuplesort_heap_insert
1.11 492.71 7.04 2960310 0.00 0.00 index_form_tuple
1.09 499.67 6.96 11855105 0.00 0.00 AllocSetFree
0.97 505.83 6.17 23711355 0.00 0.00 AllocSetFreeIndex
0.84 511.19 5.36 5920596 0.00 0.00 LogicalTapeWrite
0.84 516.51 5.33 2960314 0.00 0.00 slot_deform_tuple
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.


-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.0.6 (GNU/Linux)
Comment: For info see http://www.gnupg.org

iD8DBQFDPwXxIB7bNG8LQkwRAmA3AJ0SJrScvMeXv99OsRZivZ cxojUfhgCgiLzQ
adITHNOK9MrxSou0G5gNFko=
=PdSb
-----END PGP SIGNATURE-----

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #3 (permalink)  
Old 04-11-2008, 07:03 AM
Tom Lane
 
Posts: n/a
Default Re: [PERFORM] A Better External Sort?

Martijn van Oosterhout <kleptog@svana.org> writes:
> Anyway, to bring some real info I just profiled PostgreSQL 8.1beta
> doing an index create on a 2960296 row table (3 columns, table size
> 317MB).


3 columns in the index you mean? What were the column datatypes?
Any null values?

> The number 1 bottleneck with 41% of user time is comparetup_index.
> ...
> The thing is, I can't see anything in comparetup_index() that could
> take much time.


The index_getattr and heap_getattr macros can be annoyingly expensive.

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 4: Have you searched our list archives?

http://archives.postgresql.org

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #4 (permalink)  
Old 04-11-2008, 07:03 AM
Martijn van Oosterhout
 
Posts: n/a
Default Re: [PERFORM] A Better External Sort?

On Sat, Oct 01, 2005 at 11:26:07PM -0400, Tom Lane wrote:
> Martijn van Oosterhout <kleptog@svana.org> writes:
> > Anyway, to bring some real info I just profiled PostgreSQL 8.1beta
> > doing an index create on a 2960296 row table (3 columns, table size
> > 317MB).

>
> 3 columns in the index you mean? What were the column datatypes?
> Any null values?


Nope, three columns in the table, one column in the index, no nulls.
The indexed column was integer. I did it once with around 6500 values
repeated over and over, lots of duplicate kays. And once on a serial
column but it made no descernable difference either way. Although the
comparison function was called less (only 76 million times), presumably
because it was mostly sorted already.

> > The number 1 bottleneck with 41% of user time is comparetup_index.
> > ...
> > The thing is, I can't see anything in comparetup_index() that could
> > take much time.

>
> The index_getattr and heap_getattr macros can be annoyingly expensive.


And yet they are optimised for the common case. nocache_index_getattr
was only called 7 times, which is about what you expect. I'm getting
annotated output now, to determine which line takes the time...
Actually, my previous profile overstated stuff a bit. Profiling turned
off optimisation so I put it back and you get better results but the
order doesn't change much. By line results are below.

The top two are the index_getattr calls in comparetup_index. Third and
fourth are the HEAPCOMPARES in tuplesort_heap_siftup. Then comes the
inlineApplySortFunction call (which isn't being inlined, despite
suggesting it should be, -Winline warns about this).

Looks to me that there are no real gains to be made in this function.
What is needed is an algorithmic change to call this function less
often...

Have a nice weekend,

% cumulative self self total
time seconds seconds calls ms/call ms/call name
9.40 22.56 22.56 comparetup_index (tuplesort.c:2042 @ 8251060)
5.07 34.73 12.17 comparetup_index (tuplesort.c:2043 @ 82510c0)
4.73 46.09 11.36 tuplesort_heap_siftup(tuplesort.c:1648 @ 825074d)
3.48 54.45 8.36 tuplesort_heap_siftup(tuplesort.c:1661 @ 82507a9)
2.80 61.18 6.73 comparetup_index (tuplesort.c:2102 @ 8251201)
2.68 67.62 6.44 comparetup_index (tuplesort.c:2048 @ 8251120)
2.16 72.82 5.20 tuplesort_heap_siftup(tuplesort.c:1652 @ 825076d)
1.88 77.34 4.52 76025782 0.00 0.00 comparetup_index (tuplesort.c:2016 @ 8251010)
1.82 81.70 4.36 76025782 0.00 0.00 inlineApplySortFunction (tuplesort.c:1833 @ 8251800)
1.73 85.85 4.15 readtup_heap (tuplesort.c:2000 @ 8250fd8)
1.67 89.86 4.01 AllocSetAlloc (aset.c:568 @ 824bec0)
1.61 93.72 3.86 comparetup_index (tuplesort.c:2025 @ 825102f)
1.47 97.25 3.53 76025785 0.00 0.00 btint4cmp (nbtcompare..c:74 @ 80924a0)
1.11 99.92 2.67 readtup_datum (tuplesort.c:2224 @ 82517c4)
1.10 102.55 2.64 comparetup_index (tuplesort.c:2103 @ 82511e7)

% cumulative self self total
time seconds seconds calls s/call s/call name
28.34 68.01 68.01 76025782 0.00 0.00 comparetup_index
13.56 100.54 32.53 7148934 0.00 0.00 tuplesort_heap_siftup
8.66 121.33 20.79 76025782 0.00 0.00 inlineApplySortFunction
4.43 131.96 10.63 13084567 0.00 0.00 AllocSetAlloc
3.73 140.90 8.94 76025785 0.00 0.00 btint4cmp
2.15 146.07 5.17 6095625 0.00 0.00 LWLockAcquire
2.02 150.92 4.85 2960305 0.00 0.00 heapgettup
1.98 155.66 4.74 7148934 0.00 0.00 tuplesort_heap_insert
1.78 159.94 4.28 2960312 0.00 0.00 slot_deform_tuple
1.73 164.09 4.15 readtup_heap
1.67 168.09 4.00 6095642 0.00 0.00 LWLockRelease
1.53 171.76 3.68 2960308 0.00 0.00 index_form_tuple
1.44 175.21 3.45 13083442 0.00 0.00 AllocSetFree
1.28 178.28 3.07 8377285 0.00 0.00 LogicalTapeWrite
1.25 181.29 3.01 8377285 0.00 0.00 LogicalTapeRead
1.11 183.96 2.67 readtup_datum
1.06 186.51 2.55 1 2.55 123.54 IndexBuildHeapScan

--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.


-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.0.6 (GNU/Linux)
Comment: For info see http://www.gnupg.org

iD8DBQFDP9NnIB7bNG8LQkwRAh0TAJ0UnvSw9wBY+EdineD6QS WRUGh2EwCfVwq1
GVyJz5jgohzQzP6inojA9G4=
=s43K
-----END PGP SIGNATURE-----

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #5 (permalink)  
Old 04-11-2008, 07:03 AM
Martijn van Oosterhout
 
Posts: n/a
Default Re: [PERFORM] A Better External Sort?

Ok, I tried two optimisations:

1. By creating a special version of comparetup_index for single key
integer indexes. Create an index_get_attr with byval and len args. By
using fetch_att and specifying the values at compile time, gcc
optimises the whole call to about 12 instructions of assembly rather
than the usual mess.

2. By specifying: -Winline -finline-limit-1500 (only on tuplesort.c).
This causes inlineApplySortFunction() to be inlined, like the code
obviously expects it to be.

default build (baseline) 235 seconds
-finline only 217 seconds (7% better)
comparetup_index_fastbyval4 only 221 seconds (6% better)
comparetup_index_fastbyval4 and -finline 203 seconds (13.5% better)

This is indexing the integer sequence column on a 2.7 million row
table. The times are as given by gprof and so exclude system call time.

Basically, I recommend adding "-Winline -finline-limit-1500" to the
default build while we discuss other options.

comparetup_index_fastbyval4 patch attached per example.
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.


-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.0.6 (GNU/Linux)
Comment: For info see http://www.gnupg.org

iD8DBQFDQDc6IB7bNG8LQkwRAskpAJ4oX70LIkeobYjuwjbz/pqwCHfxUACgijKl
vxCH8Kf9KpoKYYQecGqdtpg=
=M3LE
-----END PGP SIGNATURE-----

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #6 (permalink)  
Old 04-11-2008, 07:04 AM
Simon Riggs
 
Posts: n/a
Default Re: [PERFORM] A Better External Sort?

On Sun, 2005-10-02 at 21:38 +0200, Martijn van Oosterhout wrote:
> Ok, I tried two optimisations:
>


> 2. By specifying: -Winline -finline-limit-1500 (only on tuplesort.c).
> This causes inlineApplySortFunction() to be inlined, like the code
> obviously expects it to be.
>
> default build (baseline) 235 seconds
> -finline only 217 seconds (7% better)
> comparetup_index_fastbyval4 only 221 seconds (6% better)
> comparetup_index_fastbyval4 and -finline 203 seconds (13.5% better)
>
> This is indexing the integer sequence column on a 2.7 million row
> table. The times are as given by gprof and so exclude system call time.
>
> Basically, I recommend adding "-Winline -finline-limit-1500" to the
> default build while we discuss other options.


I add -Winline but get no warnings. Why would I use -finline-limit-1500?

I'm interested, but uncertain as to what difference this makes. Surely
using -O3 works fine?

Best Regards, Simon Riggs


---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
choose an index scan if your joining column's datatypes do not
match

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #7 (permalink)  
Old 04-11-2008, 07:04 AM
Martijn van Oosterhout
 
Posts: n/a
Default Re: [PERFORM] A Better External Sort?

On Mon, Oct 03, 2005 at 10:51:32PM +0100, Simon Riggs wrote:
> > Basically, I recommend adding "-Winline -finline-limit-1500" to the
> > default build while we discuss other options.

>
> I add -Winline but get no warnings. Why would I use -finline-limit-1500?
>
> I'm interested, but uncertain as to what difference this makes. Surely
> using -O3 works fine?


Different versions of gcc have different ideas of when a function can
be inlined. From my reading of the documentation, this decision is
independant of optimisation level. Maybe your gcc version has a limit
higher than 1500 by default.
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.


-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.0.6 (GNU/Linux)
Comment: For info see http://www.gnupg.org

iD8DBQFDQlO4IB7bNG8LQkwRAi4sAJ9vNeAbCjfuFfVxd49c6p wLDy4eSACeL023
HDxAYHDb4vP8PFsJtwGrgXQ=
=ybxt
-----END PGP SIGNATURE-----

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #8 (permalink)  
Old 04-11-2008, 07:05 AM
Simon Riggs
 
Posts: n/a
Default Re: [PERFORM] A Better External Sort?

On Tue, 2005-10-04 at 12:04 +0200, Martijn van Oosterhout wrote:
> On Mon, Oct 03, 2005 at 10:51:32PM +0100, Simon Riggs wrote:
> > > Basically, I recommend adding "-Winline -finline-limit-1500" to the
> > > default build while we discuss other options.

> >
> > I add -Winline but get no warnings. Why would I use -finline-limit-1500?
> >
> > I'm interested, but uncertain as to what difference this makes. Surely
> > using -O3 works fine?


How did you determine the 1500 figure? Can you give some more info to
surround that recommendation to allow everybody to evaluate it?

Best Regards, Simon Riggs


---------------------------(end of broadcast)---------------------------
TIP 4: Have you searched our list archives?

http://archives.postgresql.org

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #9 (permalink)  
Old 04-11-2008, 07:05 AM
Martijn van Oosterhout
 
Posts: n/a
Default Re: [PERFORM] A Better External Sort?

On Tue, Oct 04, 2005 at 12:24:54PM +0100, Simon Riggs wrote:
> How did you determine the 1500 figure? Can you give some more info to
> surround that recommendation to allow everybody to evaluate it?


kleptog@vali:~/dl/cvs/pgsql-local/src/backend/utils/sort$ gcc -finline-limit-1000 -Winline -O2 -Wall -Wmissing-prototypes -Wpointer-arith -Wendif-labels -fno-strict-aliasing -g -I../../../../src/include -D_GNU_SOURCE -c -o tuplesort.o tuplesort.c
tuplesort.c: In function 'applySortFunction':
tuplesort.c:1833: warning: inlining failed in call to 'inlineApplySortFunction'
tuplesort.c:1906: warning: called from here
tuplesort.c: In function 'comparetup_heap':
tuplesort.c:1833: warning: inlining failed in call to 'inlineApplySortFunction'
tuplesort.c:1937: warning: called from here
tuplesort.c: In function 'comparetup_index':
tuplesort.c:1833: warning: inlining failed in call to 'inlineApplySortFunction'
tuplesort.c:2048: warning: called from here
tuplesort.c: In function 'comparetup_datum':
tuplesort.c:1833: warning: inlining failed in call to 'inlineApplySortFunction'
tuplesort.c:2167: warning: called from here
kleptog@vali:~/dl/cvs/pgsql-local/src/backend/utils/sort$ gcc -finline-limit-1500 -Winline -O2 -Wall -Wmissing-prototypes -Wpointer-arith -Wendif-labels -fno-strict-aliasing -g -I../../../../src/include -D_GNU_SOURCE -c -o tuplesort.o tuplesort.c
<no warnings>

A quick binary search puts the cutoff between 1200 and 1300. Given
version variation I picked a nice round number, 1500.

Ugh, that's for -O2, for -O3 and above it needs to be 4100 to work.
Maybe we should go for 5000 or so.

I'm using: gcc (GCC) 3.3.5 (Debian 1:3.3.5-13)

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.


-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.0.6 (GNU/Linux)
Comment: For info see http://www.gnupg.org

iD8DBQFDQnSKIB7bNG8LQkwRAgpdAJ4u+KmNpMaHY/WknCkvep4N5YGL3QCfS2qR
OU90plt1mOZZ9UjVH/9oTeI=
=Axn9
-----END PGP SIGNATURE-----

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #10 (permalink)  
Old 04-11-2008, 07:05 AM
Tom Lane
 
Posts: n/a
Default Re: [PERFORM] A Better External Sort?

Martijn van Oosterhout <kleptog@svana.org> writes:
> A quick binary search puts the cutoff between 1200 and 1300. Given
> version variation I picked a nice round number, 1500.


> Ugh, that's for -O2, for -O3 and above it needs to be 4100 to work.
> Maybe we should go for 5000 or so.


> I'm using: gcc (GCC) 3.3.5 (Debian 1:3.3.5-13)


I don't know what the units of this number are, but it's apparently far
too gcc-version-dependent to consider putting into our build scripts.
Using gcc version 4.0.1 20050727 (current Fedora Core 4 compiler) on
i386, and compiling tuplesort.c as you did, I find:
-O2: warning goes away between 800 and 900
-O3: warning is always there (tried values up to 10000000)
(the latter behavior may indicate a bug, not sure).

What's even more interesting is that the warning does not appear in
either case if I omit -finline-limit --- so the default value is plenty.

At least on this particular compiler, the proposed switch would be
counterproductive.

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 2: Don't 'kill -9' the postmaster

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
Reply


Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are On
Forum Jump


All times are GMT. The time now is 08:52 PM.


Powered by vBulletin® Version 3.6.5
Copyright ©2000 - 2008, Jelsoft Enterprises Ltd.
SEO by vBSEO 3.2.0
www.UnixAdminTalk.com