This is a discussion on Re: [GENERAL] Slow PITR restore within the pgsql Hackers forums, part of the PostgreSQL category; --> On Thu, 2007-12-13 at 06:27 +0000, Gregory Stark wrote: > "Tom Lane" <tgl@sss.pgh.pa.us> writes: > > > "Joshua D. ...
| |||||||
| Register | FAQ | Members List | Calendar | Search | Today's Posts | Mark Forums Read |
| ||||
| On Thu, 2007-12-13 at 06:27 +0000, Gregory Stark wrote: > "Tom Lane" <tgl@sss.pgh.pa.us> writes: > > > "Joshua D. Drake" <jd@commandprompt.com> writes: > > > >> Exactly. Which is the point I am making. Five minutes of transactions > >> is nothing (speaking generally).. In short, if we are in recovery, and > >> we are not saturated the I/O and at least a single CPU, there is a huge > >> amount of optimization *somewhere* to be done. > > > > You sure about that? I tested CVS HEAD just now, by setting the > > checkpoint_ parameters really high, running pgbench for awhile, and > > then killing the bgwriter to force a recovery cycle over all the WAL > > generated by the pgbench run. What I saw was that the machine was 100% > > disk write bound. Increasing shared_buffers helped, not in that the > > write rate got less according to vmstat, but the completion time did. > > There are at least three definitions of "saturating the I/O" and it sounds > like you two are using two different ones. > > 1) The processor is waiting on I/O all the time > 2) The hard drives are all always handling a request > 3) The hard drives are all handling the full bandwidth they're capable > > You would expect (1) and (2) to be the same for a single drive -- though in > practice there seems to be a gap even between them. But for a raid array there > can be a large difference, and the wider the raid stripe the larger the > difference. > > In Tom's results: > > > procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------ > > r b swpd free buff cache si so bi bo in cs us sy id wa st > > 0 9 70024 29232 19876 824368 0 0 0 3152 1447 233 0 1 0 99 0 > > 0 9 70024 29232 19876 824368 0 0 0 3660 1474 252 0 1 0 99 0 > > 0 8 70024 28960 19876 824404 0 0 0 3176 1448 265 0 2 1 97 0 > > > > I don't see the machine sitting around doing nothing ... > > Note that even though the processor is 99% in wait state the drive is only > handling about 3 MB/s. That translates into a seek time of 2.2ms which is > actually pretty fast. So if this is a single drive (1) and (2) seem to be > pretty much the same here. > > But note that if this were a raid array Postgres's wouldn't be getting any > better results. A Raid array wouldn't improve i/o latency at all and since > it's already 99% waiting for i/o Postgres is not going to be able to issue any > more. But only one drive in the raid array will be busy at a time which would > be far less than the maximum random access i/o the raid array is capable of. Agree with Greg's analysis here. Moving to -hackers now. I've done performance profiling also. My results replicated Tom's, but I hadn't performed them on a big enough system and so didn't realise the I/O scalability issue could be such a large problem. Koichi showed me some results on a much larger server that illustrated the I/O problem. But lets remember its only a problem on large servers with a heavy write workload and a large random I/O requirement. That's an important set of people, but much less than Josh's 10% of people even. > Heikki proposed a while back to use posix_fadvise() when processing logs to > read-ahead blocks which the recover will need before actually attempting to > recover them. On a raid array that would bring the 3MB/s above up to the > maximum number of random accesses the raid array can handle (ie, definition > (2) above). It's a good idea, but it will require more complex code. I prefer the simpler solution of using more processes to solve the I/O problem. Florian's code for Hot Standby introduces a separate recovery process, similar to an autovacuum launcher. I propose a mechanism similar to the AV solution where we have lots of recovery workers, with one recovery master reading the WAL files. We can then distribute WAL records to workers in some manner. It's true that many WAL records depend upon each other, but its also true that the performance problems only occur in the situation when they the WAL records don't depend upon each other. If they did, they would touch the same blocks and it would be cached. So as long as we have a safe mechanism for splitting up the work, everything is fine. We can divide up the WAL records this by looking at the rmgr field, plus looking deeper into the records themselves so we can touch different relations/blocks. So I'm planning to review this *after* Florian has introduced his patch, so we can build upon it. -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com ---------------------------(end of broadcast)--------------------------- TIP 4: Have you searched our list archives? http://archives.postgresql.org |
| |||
| On Thu, 2007-12-13 at 09:45 +0000, Gregory Stark wrote: > "Simon Riggs" <simon@2ndquadrant.com> writes: > > > On Thu, 2007-12-13 at 06:27 +0000, Gregory Stark wrote: > >> Heikki proposed a while back to use posix_fadvise() when processing logs to > >> read-ahead blocks which the recover will need before actually attempting to > >> recover them. On a raid array that would bring the 3MB/s above up to the > >> maximum number of random accesses the raid array can handle (ie, definition > >> (2) above). > > > > It's a good idea, but it will require more complex code. I prefer the > > simpler solution of using more processes to solve the I/O problem. > > Huh, I forgot about that idea. Ironically that was what I suggested when > Heikki described the problem. > > I think it's more complex than using posix_fadvise. Some handwaving... ISTM its just autovacuum launcher + Hot Standby mixed. I guess I've added two new backends now (Tom groans...) so that part seems very straightforward. The harder part is distributing the work and the hardest part is doing that evenly enough to make a difference. It will also require rmgr changes for state handling, but then I think that needs work anyway. Maybe we don't even need a master. We would have readbuffers in shared memory, like wal_buffers in reverse. Each worker would read the next WAL record and check there is no conflict with other concurrent WAL records. If not, it will apply the record immediately, otherwise wait for the conflicting worker to complete. > But it's also more ambitious. Time is the issue, I think, so you may be right. That's always why I grunt so much about freeze dates. Anyway, I'll leave this now, since I think we need to do Florian's work first either way and that is much more eagerly awaited I think. -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com ---------------------------(end of broadcast)--------------------------- TIP 6: explain analyze is your friend |
| |||
| "Simon Riggs" <simon@2ndquadrant.com> writes: > We would have readbuffers in shared memory, like wal_buffers in reverse. > Each worker would read the next WAL record and check there is no > conflict with other concurrent WAL records. If not, it will apply the > record immediately, otherwise wait for the conflicting worker to > complete. Well I guess you would have to bring up the locking infrastructure and lock any blocks in the record you're applying (sorted first to avoid deadlocks). As I understand it we don't use locks during recovery now but I'm not sure if that's just because we don't have to or if there are practical problems which would have to be solved to do so. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's Slony Replication support! ---------------------------(end of broadcast)--------------------------- TIP 6: explain analyze is your friend |
| |||
| Gregory Stark wrote: > "Simon Riggs" <simon@2ndquadrant.com> writes: > >> We would have readbuffers in shared memory, like wal_buffers in reverse. >> Each worker would read the next WAL record and check there is no >> conflict with other concurrent WAL records. If not, it will apply the >> record immediately, otherwise wait for the conflicting worker to >> complete. > > Well I guess you would have to bring up the locking infrastructure and lock > any blocks in the record you're applying (sorted first to avoid deadlocks). As > I understand it we don't use locks during recovery now but I'm not sure if > that's just because we don't have to or if there are practical problems which > would have to be solved to do so. We do use locks during recovery, XLogReadBuffer takes an exclusive lock on the buffer. According to the comments there, it wouldn't be strictly necessary. But I believe we do actually need it to protect from bgwriter writing out a buffer while it's been modified. We only lock one page at a time, which is good enough for WAL replay, but not to protect things like b-tree split from concurrent access. I hacked together a quick & dirty prototype of using posix_fadvise in recovery a while ago. First of all, there's the changes to the buffer manager, which we'd need anyway if we wanted to use posix_fadvise for speeding up other stuff like index scans. Then there's changes to xlog.c, to buffer a number of WAL records, so that you can read ahead the data pages needed by WAL records ahead of the WAL record you're actually replaying. I added a new function, readahead, to the rmgr API. It's similar to the redo function, but it doesn't actually replay the WAL record, but just issues the fadvise calls to the buffer manager for the pages that are needed to replay the WAL record. This needs to be implemented for each resource manager that we want to do readahead for. If we had the list of blocks in the WAL record in a rmgr-independent format, we could do that in a more generic way, like we do the backup block restoration. The multiple-process approach seems a lot more complex to me. You need a lot of bookkeeping to keep the processes from stepping on each others toes, and to choose the next WAL record to replay. I think you have the same problem that you need to have a rmgr-specific function to extract the data blocks #s required to replay that WAL record, or add that list to the WAL record header in a generic format. Multi-process approach is nice because it allows you to parallelize the CPU work of replaying the records as well, but I wonder how much that really scales given all the locking required. Also, I don't think replaying WAL records is very expensive CPU-wise. You'd need a pretty impressive RAID array to read WAL from, to saturate a single CPU. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com ---------------------------(end of broadcast)--------------------------- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to majordomo@postgresql.org so that your message can get through to the mailing list cleanly |
| |||
| Simon, On Dec 13, 2007 11:21 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > Anyway, I'll leave this now, since I think we need to do Florian's work > first either way and that is much more eagerly awaited I think. Speaking of that, is there any news about it and about Florian? It was a really promising work. Thanks. -- Guillaume ---------------------------(end of broadcast)--------------------------- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate |
| |||
| Simon Riggs wrote: > ISTM its just autovacuum launcher + Hot Standby mixed. I don't think you need a launcher at all. Just get the postmaster to start a configurable number of wal-replay processes (currently the number is hardcoded to 1). -- Alvaro Herrera http://www.amazon.com/gp/registry/CTMLCN8V17R4 "Someone said that it is at least an order of magnitude more work to do production software than a prototype. I think he is wrong by at least an order of magnitude." (Brian Kernighan) ---------------------------(end of broadcast)--------------------------- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate |
| |||
| On Thu, 2007-12-13 at 12:28 +0000, Heikki Linnakangas wrote: > Gregory Stark wrote: > > "Simon Riggs" <simon@2ndquadrant.com> writes: > > > >> We would have readbuffers in shared memory, like wal_buffers in reverse. > >> Each worker would read the next WAL record and check there is no > >> conflict with other concurrent WAL records. If not, it will apply the > >> record immediately, otherwise wait for the conflicting worker to > >> complete. > > > > Well I guess you would have to bring up the locking infrastructure and lock > > any blocks in the record you're applying (sorted first to avoid deadlocks). As > > I understand it we don't use locks during recovery now but I'm not sure if > > that's just because we don't have to or if there are practical problems which > > would have to be solved to do so. > > We do use locks during recovery, XLogReadBuffer takes an exclusive lock > on the buffer. According to the comments there, it wouldn't be strictly > necessary. But I believe we do actually need it to protect from > bgwriter writing out a buffer while it's been modified. We only lock one > page at a time, which is good enough for WAL replay, but not to protect > things like b-tree split from concurrent access. > > I hacked together a quick & dirty prototype of using posix_fadvise in > recovery a while ago. First of all, there's the changes to the buffer > manager, which we'd need anyway if we wanted to use posix_fadvise for > speeding up other stuff like index scans. Then there's changes to > xlog.c, to buffer a number of WAL records, so that you can read ahead > the data pages needed by WAL records ahead of the WAL record you're > actually replaying. > > I added a new function, readahead, to the rmgr API. It's similar to the > redo function, but it doesn't actually replay the WAL record, but just > issues the fadvise calls to the buffer manager for the pages that are > needed to replay the WAL record. This needs to be implemented for each > resource manager that we want to do readahead for. If we had the list of > blocks in the WAL record in a rmgr-independent format, we could do that > in a more generic way, like we do the backup block restoration. > > The multiple-process approach seems a lot more complex to me. You need a > lot of bookkeeping to keep the processes from stepping on each others > toes, and to choose the next WAL record to replay. I think you have the > same problem that you need to have a rmgr-specific function to extract > the data blocks #s required to replay that WAL record, or add that list > to the WAL record header in a generic format. Multi-process approach is > nice because it allows you to parallelize the CPU work of replaying the > records as well, but I wonder how much that really scales given all the > locking required. Also, I don't think replaying WAL records is very > expensive CPU-wise. You'd need a pretty impressive RAID array to read > WAL from, to saturate a single CPU. With all this talk, I thought of a much better way. We don't actually need to apply the changes in the order they are received, we just need to apply sufficient ordering to ensure that each block's changes are applied in LSN order. Allocate a recovery cache of size maintenance_work_mem that goes away when recovery ends. For every block mentioned in WAL record that isn't an overwrite, first check shared_buffers. If its in shared_buffers apply immediately and move on. If not in shared_buffers then put in recovery cache. When cache fills, empty it. Qsort WAL records by by rmgrid, rel, blockid, lsn. Then we scan through the records applying them in sequence. That way we will accumulate changes on each block so we only need to request it once rather than thrashing the cache. We may get lucky and pick up some OS readahead also. We would also use buffer recycling when emptying the recovery cache, to ensure that we don't trash the main cache and also gain from L2 cache efficiency. When recovery ends, empty the cache. I think that is better than both methods mentioned, and definitely simpler than my brute-force method. It also lends itself to using both previously mentioned methods as additional techniques if we really needed to. I suspect reordering the I/Os in this way is going to make a huge difference to cache hit rates. Looks like each rmgr_redo call would need to be split into two calls: rmgr_redo_apply() returns bool and rmgr_redo_cache(). The first will apply if possible, otherwise place in cache. The second gets called repeatedly during cache emptying. That sounds like it might not be I/O efficient, in that it would suffer from producer/consumer flip/flopping. But with large main work mem you'll get all the I/O from possibly hundreds of WAL files all accumulated together before it is issued - assuming only a small % of WAL records go into the cache and then many of those will have their I/O reduced to zero because of the sequential cache access. -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com ---------------------------(end of broadcast)--------------------------- TIP 4: Have you searched our list archives? http://archives.postgresql.org |
| |||
| Simon Riggs wrote: > Allocate a recovery cache of size maintenance_work_mem that goes away > when recovery ends. > > For every block mentioned in WAL record that isn't an overwrite, first > check shared_buffers. If its in shared_buffers apply immediately and > move on. If not in shared_buffers then put in recovery cache. > > When cache fills, empty it. Qsort WAL records by by rmgrid, rel, > blockid, lsn. Then we scan through the records applying them in > sequence. That way we will accumulate changes on each block so we only > need to request it once rather than thrashing the cache. We may get > lucky and pick up some OS readahead also. We would also use buffer > recycling when emptying the recovery cache, to ensure that we don't > trash the main cache and also gain from L2 cache efficiency. > > When recovery ends, empty the cache. Hmm. That assumes that nothing else than the WAL replay will read pages into shared buffers. I guess that's true at the moment, but it doesn't seem impossible that something like Florian's read-only queries on a stand by server would change that. > I think that is better than both methods mentioned, and definitely > simpler than my brute-force method. It also lends itself to using both > previously mentioned methods as additional techniques if we really > needed to. I suspect reordering the I/Os in this way is going to make a > huge difference to cache hit rates. But it won't actually do anything to scale the I/O. You're still going to be issuing only one read request at a time. The order of those requests will be better from cache hit point of view, which is good, but the problem remains that if the modified data blocks are scattered around the database, you'll be doing random I/O, one request at a time. It would be interesting to do something like that to speed up replay of long PITR archives, though. You could scan all (or at least far ahead) the WAL records, and make note of where there is full page writes for each page. Whenever there's a full page write further ahead in the log, you could ignore all changes to that page before that, because they're going to be overwritten anyway. It won't help with normal recovery, because there won't be more than one full page image of each page after the last checkpoint, but with PITR it would help. > Looks like each rmgr_redo call would need to be split into two calls: > rmgr_redo_apply() returns bool and rmgr_redo_cache(). The first will > apply if possible, otherwise place in cache. The second gets called > repeatedly during cache emptying. Yeah, much like the split I had to do for the posix_fadvise. It seems that in all the proposed schemes we need to know which blocks a given WAL record will need to access. For multiple recovery processes, you need that to figure out which WAL records you can safely replay. In the posix_fadvise scheme, you need that to issue the posix_fadvises without modifying anything. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com ---------------------------(end of broadcast)--------------------------- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match |
| |||
| On Thu, 2007-12-13 at 16:41 -0500, Tom Lane wrote: > Recovery is inherently one of the least-exercised parts of the system, > and it gets more so with each robustness improvement we make elsewhere. > Moreover, because it is fairly dumb, anything that does go wrong will > likely result in silent data corruption that may not be noted until much > later. Any bugs we introduce into recovery will be very hard to find > ... and timing-dependent ones will be damn near impossible. > > So in my mind the watchword has got to be KISS. If that means that > recovery isn't terribly speedy, so be it. I'd far rather get the > right answer slower. Very much agreed, and really the real reason the main recovery code is essentially untouched for so long. That thought was #1 priority when writing PITR. Thanks for reminding me/us. -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com ---------------------------(end of broadcast)--------------------------- TIP 4: Have you searched our list archives? http://archives.postgresql.org |
| ||||
| Tom Lane wrote: > Also, I have not seen anyone provide a very credible argument why > we should spend a lot of effort on optimizing a part of the system > that is so little-exercised. Don't tell me about warm standby > systems --- they are fine as long as recovery is at least as fast > as the original transactions, and no evidence has been provided to > suggest that it's not. Koichi showed me & Simon graphs of DBT-2 runs in their test lab back in May. They had setup two identical systems, one running the benchmark, and another one as a warm stand-by. The stand-by couldn't keep up; it couldn't replay the WAL as quickly as the primary server produced it. IIRC, replaying WAL generated in a 1h benchmark run took 6 hours. It sounds unbelievable at first, but the problem is that our WAL replay doesn't scale. On the primary server, you can have (and they did) a huge RAID array with dozens of disks, and a lot of concurrent activity keeping it busy. On the standby, we do all the same work, but with a single process. Every time we need to read in a page to modify it, we block. No matter how many disks you have in the array, it won't help, because we only issue one I/O request at a time. That said, I think the change we made in Spring to not read in pages for full page writes will help a lot with that. It would be nice to see some new benchmark results to measure that. However, it didn't fix the underlying scalability problem. One KISS approach would be to just do full page writes more often. It would obviously bloat the WAL, but it would make the replay faster. Another reason you would care about fast recovery is PITR. If you do base backups only once a week, for example, when you need to recover using the archive, you might have to replay a weeks worth of WAL in the worst case. You don't want to wait a week for the replay to finish. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com ---------------------------(end of broadcast)--------------------------- TIP 4: Have you searched our list archives? http://archives.postgresql.org |
| Thread Tools | |
| Display Modes | |
|
|