vBulletin Search Engine Optimization
| |||||||
| Register | FAQ | Members List | Calendar | Search | Today's Posts | Mark Forums Read |
| ||||
| Hi, Below is a patch that tries to determine (for linux) if the pread is emulated or not (using an autoconf test), and if so use it instead of lseek. It doesn't remove all uses of lseek. For example, PostgreSQL uses lseek to find the end of the file when extending. So in a bulk load, there is one seek per block written anyway. Also, the xlog seems to be completely seperate code. Short summary: Actual benefit, (on Linux anyway) not distinguishable from the noise. Longer summary: It reduces CPU usage slightly (few percent user, overall almost nothing), but the benefit is usually swamped by the actual cost of the read from disk (system cache miss). I would actually make the change because it simplifies the code a bit. Comparisons on other systems may be worthwhile. What follows is the detailed analysis of the change. My conclusion is that on Linux at least where the syscall overhead is so low, it's not worth the change for performance. It is cleaner code-wise IMHO. ---- Initially, the table is 1.5 million rows of random integer data. Worst case scenario: ordered output run with seqscan disabled. Table is 66MB, Index is 56MB and includes all three data columns. The queries took 98 +/- 1 seconds. Over the entire query it consistantly loaded about 55MB of data from disk (about 50% system cache miss rate). Query: select count(*) from (select * from bigtable order by a,b,c) Exhibit A below is the vmstat output with 10 second averages. The interesting columns are cpu us (user) and cpu sy (system). Note, system is defined as stuff not user and not idle. Waiting for the I/O to complete is counted as system, even though it's not doing all that much. It shows a very, very slight reduction in user time, though it is measurable and repeatable (both run a number of times with same results). What this tells me is that lseek time is swamped by actual read time, so not a major benefit in most cases. Exhibit B is the same but with only 1 million rows (Table 48MB, Index 29MB), this fitting in cache memory. The original index had some slack from deleting records (I did a vacuum full though). The queries now take 60 +/- 1 seconds. As you can see even without accessing the disk, it still isn't making a big difference. Exhibit C is the output of strace -c on the second test. As you can see there is a difference, but marginal. Exhibit D is the output of a seq scan, in case anyone figured I might be buggering the queries somehow. As you can see, it shows that the index scan is extremely expensive. Only 5900 reads for the seq scan (enough to read the whole table) instead of 890000. The shared buffers is the default 1000. Upping this would close the gap somewhat but that is irrelevent for this test (lseek vs pread). Exhibit E is the strace -c output, for completeness. lseek doesn't appear in the top 10. If you read this far, thanks for the attention! If you're running another system, it would be nice to test. Remember to update the autoconf code for your system! Have a nice day, Misc details: command line: time echo "select count(*) from(select * from bigtable order by a,b,c) x"| \ ./postgres -f s -D /var/lib/postgres/data test System: Linux kleptog 2.6.8-1-686 #1 Thu Nov 25 04:34:30 UTC 2004 i686 GNU/Linux 128MB RAM Intel(R) Celeron(TM) CPU 1000MHz Exhibit (A): Index scan, dataset > memory Using ordinary llseek/read: procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu---- r b swpd free buff cache si so bi bo in cs us sy idwa 1 0 4 1372 940 107564 0 0 534 18 1070 145 18 56 027 1 0 4 1648 992 107260 0 0 548 3 1071 147 18 63 019 1 0 4 1552 984 107348 0 0 542 4 1069 144 20 69 011 1 0 4 1432 980 107496 0 0 550 6 1071 146 19 69 012 1 0 4 1648 664 107600 0 0 560 3 1071 148 19 67 015 1 0 4 1576 632 107728 0 0 597 3 1076 157 19 68 013 2 0 4 1768 640 107528 0 0 618 14 1080 163 19 67 014 Using pread(): procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu---- r b swpd free buff cache si so bi bo in cs us sy idwa 1 0 4 1680 600 107616 0 0 534 18 1070 145 17 56 028 1 0 4 1656 644 107644 0 0 543 5 1070 147 16 64 019 1 0 4 1344 624 107976 0 0 540 2 1068 142 19 70 011 1 0 4 1656 632 107652 0 0 555 6 1071 147 17 70 013 1 0 4 1608 540 107796 0 0 564 13 1074 149 18 68 015 1 0 4 1392 504 108068 0 0 598 4 1076 158 17 69 014 0 1 4 1368 460 108120 0 0 626 3 1079 165 17 69 014 Exhibit (B): Index scan, dataset < memory Using ordinary llseek/read: procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu---- r b swpd free buff cache si so bi bo in cs us sy idwa 1 0 4 7500 3084 98000 0 0 0 25 1002 8 22 78 00 1 0 4 7484 3100 98000 0 0 0 3 1001 5 22 78 00 1 0 4 7468 3116 98000 0 0 0 3 1001 5 22 78 00 1 0 4 7444 3140 98000 0 0 0 10 1003 6 22 78 00 Using pread(): procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu---- r b swpd free buff cache si so bi bo in cs us sy idwa 1 0 4 7116 3452 98000 0 0 0 18 1002 8 21 79 00 1 0 4 7100 3468 98000 0 0 0 8 1001 5 21 79 00 1 0 4 7080 3488 98000 0 0 0 4 1001 5 21 79 00 1 0 4 7064 3504 98000 0 0 0 10 1003 5 22 78 00 Exhibit (C): strace -c output of backend Using ordinary llseek/read: % time seconds usecs/call calls errors syscall ------ ----------- ----------- --------- --------- ---------------- 85.27 51.906570 58 891526 read 14.71 8.956573 10 887770 _llseek 0.00 0.001983 17 114 29 open Using pread(): % time seconds usecs/call calls errors syscall ------ ----------- ----------- --------- --------- ---------------- 99.98 52.977031 59 891479 pread 0.00 0.002164 114 19 write Exhibit (D): sequential scan: dataset < memory Using ordinary llseek/read: $ time echo "select count(*) from(select * from bigtable) x" | ./postgres.read -D /var/lib/postgres/data test PostgreSQL stand-alone backend 8.1beta2 backend> 1: count (typeid = 20, len = 8, typmod = -1, byval = f) ---- 1: count = "1080838" (typeid = 20, len = 8, typmod = -1,byval = f) ---- backend> real 0m1.264s user 0m0.921s sys 0m0.328s Using pread(): $ time echo "select count(*) from(select * from bigtable) x" | ./postgres -D /var/lib/postgres/data test PostgreSQL stand-alone backend 8.1beta2 backend> 1: count (typeid = 20, len = 8, typmod = -1, byval = f) ---- 1: count = "1080838" (typeid = 20, len = 8, typmod = -1,byval = f) ---- backend> real 0m1.213s user 0m0.827s sys 0m0.371s Exhibit (E): strace -c of sequential scan case Using ordinary llseek/read: % time seconds usecs/call calls errors syscall ------ ----------- ----------- --------- --------- ---------------- 96.78 0.318179 53 5950 read 0.68 0.002249 118 19 write 0.57 0.001879 17 109 29 open <snip> 0.06 0.000197 5 40 _llseek Using pread(): % time seconds usecs/call calls errors syscall ------ ----------- ----------- --------- --------- ---------------- 95.92 0.316512 54 5903 pread 0.70 0.002326 122 19 write 0.62 0.002032 406 5 fsync -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a > tool for doing 5% of the work and then sitting around waiting for someone > else to do the other 95% so you can sue them. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.0.6 (GNU/Linux) Comment: For info see http://www.gnupg.org iD8DBQFDR+pjIB7bNG8LQkwRAg2iAJ9iLrM4KoE9bOzEWoSV//TmXb5v0ACfdYzV FRv2DP5nr92oaS0b15oLhLA= =dx0w -----END PGP SIGNATURE----- |
| |||
| "Martijn van Oosterhout" <kleptog@svana.org> wrote > > Below is a patch that tries to determine (for linux) if the pread is > emulated or not (using an autoconf test), and if so use it instead of > lseek. Just FYI, this was proposed 2 years ago: http://archives.postgresql.org/pgsql...2/msg00197.php The basic problem I read from the above thread is that it does not give enough performance proofs on *various* platforms. Regards, Qingqing |
| |||
| Martijn van Oosterhout <kleptog@svana.org> writes: > What follows is the detailed analysis of the change. My conclusion is > that on Linux at least where the syscall overhead is so low, it's not > worth the change for performance. It is cleaner code-wise IMHO. How is it cleaner code-wise to debug and maintain two #ifdef'd code paths instead of one? (And when I say "debug", I mean "I don't believe FileSeek still works". One reason that the patch seems unmaintainable to me as-is is that it creates a subtle, critical, and undocumented difference in the semantic meaning of the seekPos variable: in one case it's tracking the kernel's own seek position, and in the other it isn't.) > What this tells me is that lseek time is swamped by actual read time, > so not a major benefit in most cases. It would be reasonable to check results in fully-cached cases, which would be the best real-world scenario for this to show any improvement in. regards, tom lane ---------------------------(end of broadcast)--------------------------- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to majordomo@postgresql.org so that your message can get through to the mailing list cleanly |
| |||
| On Sat, Oct 08, 2005 at 05:27:11PM -0400, Tom Lane wrote: > How is it cleaner code-wise to debug and maintain two #ifdef'd code > paths instead of one? (And when I say "debug", I mean "I don't believe > FileSeek still works". One reason that the patch seems unmaintainable > to me as-is is that it creates a subtle, critical, and undocumented > difference in the semantic meaning of the seekPos variable: in one case > it's tracking the kernel's own seek position, and in the other it isn't.) I mean in the situation where we would only use pread. As for the difference, no code outside of fd.c can assume anying about the kernel seek position because it can't see it (the file might be closed!). Unless something bypasses the abstraction? > It would be reasonable to check results in fully-cached cases, which > would be the best real-world scenario for this to show any improvement > in. If you look, I did that and even then it simply didn't make a difference. lseek is 10 microseconds, you'd need to do hundreds of thousands of them to make a difference. And any gain would disappear in just the rotational latency of a hard disk read. I satisfied my curiosity and showed it just isn't worth it at this point. -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a > tool for doing 5% of the work and then sitting around waiting for someone > else to do the other 95% so you can sue them. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.0.6 (GNU/Linux) Comment: For info see http://www.gnupg.org iD8DBQFDSEYMIB7bNG8LQkwRAv23AJ9o4IWEQsEhDHywBVi0Zs xn16lLSgCfa3Ll ZUsaRmxsGMFPcKRZpsaqU/o= =i/lT -----END PGP SIGNATURE----- |
| ||||
| On Sun, Oct 09, 2005 at 12:19:57AM +0200, Martijn van Oosterhout wrote: > > It would be reasonable to check results in fully-cached cases, which > > would be the best real-world scenario for this to show any improvement > > in. > > If you look, I did that and even then it simply didn't make a > difference. lseek is 10 microseconds, you'd need to do hundreds of > thousands of them to make a difference. And any gain would disappear in > just the rotational latency of a hard disk read. Just to clarify, I reexamined this point and there is a small difference. The fully cached case went from 72 seconds to 70 over 890,000 (p)read syscalls, indicating that in that case lseek/read is slower than pread by about 2 microseconds per call. The overall benefit is about 2.5%. This doesn't change the fact that any disk access at all renders the benefit almost nil. On Linux anyway, other OSes, who knows? Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a > tool for doing 5% of the work and then sitting around waiting for someone > else to do the other 95% so you can sue them. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.0.6 (GNU/Linux) Comment: For info see http://www.gnupg.org iD8DBQFDSoAYIB7bNG8LQkwRAkFvAJ9/+LP9IXQDxhSmW2Ie1M8G0qrrQQCffRsI ic5gmEHy/z91V6G+RmBLZ50= =eell -----END PGP SIGNATURE----- |
| Thread Tools | |
| Display Modes | |
|
|