This is a discussion on Re: [PERFORM] A Better External Sort? within the pgsql Hackers forums, part of the PostgreSQL category; --> You have not said anything about what HW, OS version, and pg version used here, but even at that ...
| |||||||
| FAQ | Members List | Calendar | Search | Today's Posts | Mark Forums Read |
| ||||
| You have not said anything about what HW, OS version, and pg version used here, but even at that can't you see that something Smells Wrong? The most common CPUs currently shipping have clock rates of ~2-3GHz and have 8B-16B internal pathways. SPARCs and other like CPUs are clocked slower but have 16B-32B internal pathways. In short, these CPU's have an internal bandwidth of 16+ GBps. The most common currently shipping mainboards have 6.4GBps RAM subsystems. ITRW, their peak is ~80% of that, or ~5.1GBps. In contrast, the absolute peak bandwidth of a 133MHx 8B PCI-X bus is 1GBps, and ITRW it peaks at ~800-850MBps. Should anyone ever build a RAID system that can saturate a PCI-Ex16 bus, that system will be maxing ITRW at ~3.2GBps. CPUs should NEVER be 100% utilized during copy IO. They should be idling impatiently waiting for the next piece of data to finish being processed even when the RAM IO subsystem is pegged; and they definitely should be IO starved rather than CPU bound when doing HD IO. Those IO rates are also alarming in all but possibly the first case. A single ~50MBps HD doing 21MBps isn't bad, but for even a single ~80MBps HD it starts to be of concern. If any these IO rates came from any reasonable 300+MBps RAID array, then they are BAD. What your simple experiment really does is prove We Have A Problem (tm) with our IO code at either or both of the OS or the pg level(s). Ron -----Original Message----- From: Martijn van Oosterhout <kleptog@svana.org> Sent: Oct 1, 2005 12:19 PM Subject: Re: [HACKERS] [PERFORM] A Better External Sort? On Sat, Oct 01, 2005 at 10:22:40AM -0400, Ron Peacetree wrote: > Assuming we get the abyssmal physical IO performance fixed... > (because until we do, _nothing_ is going to help us as much) I'm still not convinced this is the major problem. For example, in my totally unscientific tests on an oldish machine I have here: Direct filesystem copy to /dev/null 21MB/s 10% user 50% system (dual cpu, so the system is using a whole CPU) COPY TO /dev/null WITH binary 13MB/s 55% user 45% system (ergo, CPU bound) COPY TO /dev/null 4.4MB/s 60% user 40% system \copy to /dev/null in psql 6.5MB/s 60% user 40% system This machine is a bit strange setup, not sure why fs copy is so slow. As to why \copy is faster than COPY, I have no idea, but it is repeatable. And actually turning the tuples into a printable format is the most expensive. But it does point out that the whole process is probably CPU bound more than anything else. So, I don't think physical I/O is the problem. It's something further up the call tree. I wouldn't be surprised at all it it had to do with the creation and destruction of tuples. The cost of comparing tuples should not be underestimated. ---------------------------(end of broadcast)--------------------------- TIP 6: explain analyze is your friend |
| |||
| [removed -performance, not subscribed] On Sat, Oct 01, 2005 at 01:42:32PM -0400, Ron Peacetree wrote: > You have not said anything about what HW, OS version, and pg version > used here, but even at that can't you see that something Smells Wrong? Somewhat old machine running 7.3 on Linux 2.4. Not exactly speed daemons but it's still true that the whole process would be CPU bound *even* if the O/S could idle while it's waiting. PostgreSQL used a *whole CPU* which is its limit. My point is that trying to reduce I/O by increasing CPU usage is not going to be benficial, we need CPU usage down also. Anyway, to bring some real info I just profiled PostgreSQL 8.1beta doing an index create on a 2960296 row table (3 columns, table size 317MB). The number 1 bottleneck with 41% of user time is comparetup_index. It was called 95,369,361 times (about 2*ln(N)*N). It used 3 tapes. Another 15% of time went to tuplesort_heap_siftup. The thing is, I can't see anything in comparetup_index() that could take much time. The actual comparisons are accounted elsewhere (inlineApplySortFunction) which amounted to <10% of total time. Since nocache_index_getattr doesn't feature I can't imagine index_getattr being a big bottleneck. Any ideas what's going on here? Other interesting features: - ~4 memory allocations per tuple, nearly all of which were explicitly freed - Things I though would be expensive, like: heapgettup and myFunctionCall2 didn't really count for much. Have a nice weekend, % cumulative self self total time seconds seconds calls s/call s/call name 43.63 277.81 277.81 95370055 0.00 0.00 comparetup_index 16.24 381.24 103.43 5920592 0.00 0.00 tuplesort_heap_siftup 3.76 405.17 23.93 95370055 0.00 0.00 inlineApplySortFunction 3.18 425.42 20.26 95370056 0.00 0.00 btint4cmp 2.82 443.37 17.95 11856219 0.00 0.00 AllocSetAlloc 2.52 459.44 16.07 95370055 0.00 0.00 myFunctionCall2 1.71 470.35 10.91 2960305 0.00 0.00 heapgettup 1.26 478.38 8.03 11841204 0.00 0.00 GetMemoryChunkSpace 1.14 485.67 7.29 5920592 0.00 0.00 tuplesort_heap_insert 1.11 492.71 7.04 2960310 0.00 0.00 index_form_tuple 1.09 499.67 6.96 11855105 0.00 0.00 AllocSetFree 0.97 505.83 6.17 23711355 0.00 0.00 AllocSetFreeIndex 0.84 511.19 5.36 5920596 0.00 0.00 LogicalTapeWrite 0.84 516.51 5.33 2960314 0.00 0.00 slot_deform_tuple -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a > tool for doing 5% of the work and then sitting around waiting for someone > else to do the other 95% so you can sue them. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.0.6 (GNU/Linux) Comment: For info see http://www.gnupg.org iD8DBQFDPwXxIB7bNG8LQkwRAmA3AJ0SJrScvMeXv99OsRZivZ cxojUfhgCgiLzQ adITHNOK9MrxSou0G5gNFko= =PdSb -----END PGP SIGNATURE----- |
| |||
| Martijn van Oosterhout <kleptog@svana.org> writes: > Anyway, to bring some real info I just profiled PostgreSQL 8.1beta > doing an index create on a 2960296 row table (3 columns, table size > 317MB). 3 columns in the index you mean? What were the column datatypes? Any null values? > The number 1 bottleneck with 41% of user time is comparetup_index. > ... > The thing is, I can't see anything in comparetup_index() that could > take much time. The index_getattr and heap_getattr macros can be annoyingly expensive. regards, tom lane ---------------------------(end of broadcast)--------------------------- TIP 4: Have you searched our list archives? http://archives.postgresql.org |
| |||
| On Sat, Oct 01, 2005 at 11:26:07PM -0400, Tom Lane wrote: > Martijn van Oosterhout <kleptog@svana.org> writes: > > Anyway, to bring some real info I just profiled PostgreSQL 8.1beta > > doing an index create on a 2960296 row table (3 columns, table size > > 317MB). > > 3 columns in the index you mean? What were the column datatypes? > Any null values? Nope, three columns in the table, one column in the index, no nulls. The indexed column was integer. I did it once with around 6500 values repeated over and over, lots of duplicate kays. And once on a serial column but it made no descernable difference either way. Although the comparison function was called less (only 76 million times), presumably because it was mostly sorted already. > > The number 1 bottleneck with 41% of user time is comparetup_index. > > ... > > The thing is, I can't see anything in comparetup_index() that could > > take much time. > > The index_getattr and heap_getattr macros can be annoyingly expensive. And yet they are optimised for the common case. nocache_index_getattr was only called 7 times, which is about what you expect. I'm getting annotated output now, to determine which line takes the time... Actually, my previous profile overstated stuff a bit. Profiling turned off optimisation so I put it back and you get better results but the order doesn't change much. By line results are below. The top two are the index_getattr calls in comparetup_index. Third and fourth are the HEAPCOMPARES in tuplesort_heap_siftup. Then comes the inlineApplySortFunction call (which isn't being inlined, despite suggesting it should be, -Winline warns about this). Looks to me that there are no real gains to be made in this function. What is needed is an algorithmic change to call this function less often... Have a nice weekend, % cumulative self self total time seconds seconds calls ms/call ms/call name 9.40 22.56 22.56 comparetup_index (tuplesort.c:2042 @ 8251060) 5.07 34.73 12.17 comparetup_index (tuplesort.c:2043 @ 82510c0) 4.73 46.09 11.36 tuplesort_heap_siftup(tuplesort.c:1648 @ 825074d) 3.48 54.45 8.36 tuplesort_heap_siftup(tuplesort.c:1661 @ 82507a9) 2.80 61.18 6.73 comparetup_index (tuplesort.c:2102 @ 8251201) 2.68 67.62 6.44 comparetup_index (tuplesort.c:2048 @ 8251120) 2.16 72.82 5.20 tuplesort_heap_siftup(tuplesort.c:1652 @ 825076d) 1.88 77.34 4.52 76025782 0.00 0.00 comparetup_index (tuplesort.c:2016 @ 8251010) 1.82 81.70 4.36 76025782 0.00 0.00 inlineApplySortFunction (tuplesort.c:1833 @ 8251800) 1.73 85.85 4.15 readtup_heap (tuplesort.c:2000 @ 8250fd8) 1.67 89.86 4.01 AllocSetAlloc (aset.c:568 @ 824bec0) 1.61 93.72 3.86 comparetup_index (tuplesort.c:2025 @ 825102f) 1.47 97.25 3.53 76025785 0.00 0.00 btint4cmp (nbtcompare..c:74 @ 80924a0) 1.11 99.92 2.67 readtup_datum (tuplesort.c:2224 @ 82517c4) 1.10 102.55 2.64 comparetup_index (tuplesort.c:2103 @ 82511e7) % cumulative self self total time seconds seconds calls s/call s/call name 28.34 68.01 68.01 76025782 0.00 0.00 comparetup_index 13.56 100.54 32.53 7148934 0.00 0.00 tuplesort_heap_siftup 8.66 121.33 20.79 76025782 0.00 0.00 inlineApplySortFunction 4.43 131.96 10.63 13084567 0.00 0.00 AllocSetAlloc 3.73 140.90 8.94 76025785 0.00 0.00 btint4cmp 2.15 146.07 5.17 6095625 0.00 0.00 LWLockAcquire 2.02 150.92 4.85 2960305 0.00 0.00 heapgettup 1.98 155.66 4.74 7148934 0.00 0.00 tuplesort_heap_insert 1.78 159.94 4.28 2960312 0.00 0.00 slot_deform_tuple 1.73 164.09 4.15 readtup_heap 1.67 168.09 4.00 6095642 0.00 0.00 LWLockRelease 1.53 171.76 3.68 2960308 0.00 0.00 index_form_tuple 1.44 175.21 3.45 13083442 0.00 0.00 AllocSetFree 1.28 178.28 3.07 8377285 0.00 0.00 LogicalTapeWrite 1.25 181.29 3.01 8377285 0.00 0.00 LogicalTapeRead 1.11 183.96 2.67 readtup_datum 1.06 186.51 2.55 1 2.55 123.54 IndexBuildHeapScan -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a > tool for doing 5% of the work and then sitting around waiting for someone > else to do the other 95% so you can sue them. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.0.6 (GNU/Linux) Comment: For info see http://www.gnupg.org iD8DBQFDP9NnIB7bNG8LQkwRAh0TAJ0UnvSw9wBY+EdineD6QS WRUGh2EwCfVwq1 GVyJz5jgohzQzP6inojA9G4= =s43K -----END PGP SIGNATURE----- |
| |||
| Ok, I tried two optimisations: 1. By creating a special version of comparetup_index for single key integer indexes. Create an index_get_attr with byval and len args. By using fetch_att and specifying the values at compile time, gcc optimises the whole call to about 12 instructions of assembly rather than the usual mess. 2. By specifying: -Winline -finline-limit-1500 (only on tuplesort.c). This causes inlineApplySortFunction() to be inlined, like the code obviously expects it to be. default build (baseline) 235 seconds -finline only 217 seconds (7% better) comparetup_index_fastbyval4 only 221 seconds (6% better) comparetup_index_fastbyval4 and -finline 203 seconds (13.5% better) This is indexing the integer sequence column on a 2.7 million row table. The times are as given by gprof and so exclude system call time. Basically, I recommend adding "-Winline -finline-limit-1500" to the default build while we discuss other options. comparetup_index_fastbyval4 patch attached per example. -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a > tool for doing 5% of the work and then sitting around waiting for someone > else to do the other 95% so you can sue them. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.0.6 (GNU/Linux) Comment: For info see http://www.gnupg.org iD8DBQFDQDc6IB7bNG8LQkwRAskpAJ4oX70LIkeobYjuwjbz/pqwCHfxUACgijKl vxCH8Kf9KpoKYYQecGqdtpg= =M3LE -----END PGP SIGNATURE----- |
| |||
| On Sun, 2005-10-02 at 21:38 +0200, Martijn van Oosterhout wrote: > Ok, I tried two optimisations: > > 2. By specifying: -Winline -finline-limit-1500 (only on tuplesort.c). > This causes inlineApplySortFunction() to be inlined, like the code > obviously expects it to be. > > default build (baseline) 235 seconds > -finline only 217 seconds (7% better) > comparetup_index_fastbyval4 only 221 seconds (6% better) > comparetup_index_fastbyval4 and -finline 203 seconds (13.5% better) > > This is indexing the integer sequence column on a 2.7 million row > table. The times are as given by gprof and so exclude system call time. > > Basically, I recommend adding "-Winline -finline-limit-1500" to the > default build while we discuss other options. I add -Winline but get no warnings. Why would I use -finline-limit-1500? I'm interested, but uncertain as to what difference this makes. Surely using -O3 works fine? Best Regards, Simon Riggs ---------------------------(end of broadcast)--------------------------- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match |
| |||
| On Mon, Oct 03, 2005 at 10:51:32PM +0100, Simon Riggs wrote: > > Basically, I recommend adding "-Winline -finline-limit-1500" to the > > default build while we discuss other options. > > I add -Winline but get no warnings. Why would I use -finline-limit-1500? > > I'm interested, but uncertain as to what difference this makes. Surely > using -O3 works fine? Different versions of gcc have different ideas of when a function can be inlined. From my reading of the documentation, this decision is independant of optimisation level. Maybe your gcc version has a limit higher than 1500 by default. -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a > tool for doing 5% of the work and then sitting around waiting for someone > else to do the other 95% so you can sue them. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.0.6 (GNU/Linux) Comment: For info see http://www.gnupg.org iD8DBQFDQlO4IB7bNG8LQkwRAi4sAJ9vNeAbCjfuFfVxd49c6p wLDy4eSACeL023 HDxAYHDb4vP8PFsJtwGrgXQ= =ybxt -----END PGP SIGNATURE----- |
| |||
| On Tue, 2005-10-04 at 12:04 +0200, Martijn van Oosterhout wrote: > On Mon, Oct 03, 2005 at 10:51:32PM +0100, Simon Riggs wrote: > > > Basically, I recommend adding "-Winline -finline-limit-1500" to the > > > default build while we discuss other options. > > > > I add -Winline but get no warnings. Why would I use -finline-limit-1500? > > > > I'm interested, but uncertain as to what difference this makes. Surely > > using -O3 works fine? How did you determine the 1500 figure? Can you give some more info to surround that recommendation to allow everybody to evaluate it? Best Regards, Simon Riggs ---------------------------(end of broadcast)--------------------------- TIP 4: Have you searched our list archives? http://archives.postgresql.org |
| |||
| On Tue, Oct 04, 2005 at 12:24:54PM +0100, Simon Riggs wrote: > How did you determine the 1500 figure? Can you give some more info to > surround that recommendation to allow everybody to evaluate it? kleptog@vali:~/dl/cvs/pgsql-local/src/backend/utils/sort$ gcc -finline-limit-1000 -Winline -O2 -Wall -Wmissing-prototypes -Wpointer-arith -Wendif-labels -fno-strict-aliasing -g -I../../../../src/include -D_GNU_SOURCE -c -o tuplesort.o tuplesort.c tuplesort.c: In function 'applySortFunction': tuplesort.c:1833: warning: inlining failed in call to 'inlineApplySortFunction' tuplesort.c:1906: warning: called from here tuplesort.c: In function 'comparetup_heap': tuplesort.c:1833: warning: inlining failed in call to 'inlineApplySortFunction' tuplesort.c:1937: warning: called from here tuplesort.c: In function 'comparetup_index': tuplesort.c:1833: warning: inlining failed in call to 'inlineApplySortFunction' tuplesort.c:2048: warning: called from here tuplesort.c: In function 'comparetup_datum': tuplesort.c:1833: warning: inlining failed in call to 'inlineApplySortFunction' tuplesort.c:2167: warning: called from here kleptog@vali:~/dl/cvs/pgsql-local/src/backend/utils/sort$ gcc -finline-limit-1500 -Winline -O2 -Wall -Wmissing-prototypes -Wpointer-arith -Wendif-labels -fno-strict-aliasing -g -I../../../../src/include -D_GNU_SOURCE -c -o tuplesort.o tuplesort.c <no warnings> A quick binary search puts the cutoff between 1200 and 1300. Given version variation I picked a nice round number, 1500. Ugh, that's for -O2, for -O3 and above it needs to be 4100 to work. Maybe we should go for 5000 or so. I'm using: gcc (GCC) 3.3.5 (Debian 1:3.3.5-13) Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a > tool for doing 5% of the work and then sitting around waiting for someone > else to do the other 95% so you can sue them. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.0.6 (GNU/Linux) Comment: For info see http://www.gnupg.org iD8DBQFDQnSKIB7bNG8LQkwRAgpdAJ4u+KmNpMaHY/WknCkvep4N5YGL3QCfS2qR OU90plt1mOZZ9UjVH/9oTeI= =Axn9 -----END PGP SIGNATURE----- |
| ||||
| Martijn van Oosterhout <kleptog@svana.org> writes: > A quick binary search puts the cutoff between 1200 and 1300. Given > version variation I picked a nice round number, 1500. > Ugh, that's for -O2, for -O3 and above it needs to be 4100 to work. > Maybe we should go for 5000 or so. > I'm using: gcc (GCC) 3.3.5 (Debian 1:3.3.5-13) I don't know what the units of this number are, but it's apparently far too gcc-version-dependent to consider putting into our build scripts. Using gcc version 4.0.1 20050727 (current Fedora Core 4 compiler) on i386, and compiling tuplesort.c as you did, I find: -O2: warning goes away between 800 and 900 -O3: warning is always there (tried values up to 10000000) (the latter behavior may indicate a bug, not sure). What's even more interesting is that the warning does not appear in either case if I omit -finline-limit --- so the default value is plenty. At least on this particular compiler, the proposed switch would be counterproductive. regards, tom lane ---------------------------(end of broadcast)--------------------------- TIP 2: Don't 'kill -9' the postmaster |