I don't feel this is quite ready to commit, but here it is if anyone
would like to try some performance testing. Using "pgbench -s 10" on a
single-CPU machine, I find this code a little slower than CVS tip at
shared_buffers = 1000, but noticeably faster (~10% speedup) at 10000
buffers. So it's not a dead loss for single-CPU anyway. What we need
now is some performance measurements on multi-CPU boxes.

The bgwriter algorithm probably needs more work, maybe some more GUC
parameters.

regards, tom lane

*** src/backend/catalog/index.c.orig Mon Jan 10 15:02:19 2005
--- src/backend/catalog/index.c Tue Feb 15 12:05:15 2005
***************
*** 1060,1066 ****
          /* Send out shared cache inval if necessary */
          if (!IsBootstrapProcessingMode())
              CacheInvalidateHeapTuple(pg_class, tuple);
-         BufferSync(-1, -1);
      }
      else if (dirty)
      {
--- 1060,1065 ----
*** src/backend/commands/dbcommands.c.orig Fri Jan 28 10:09:11 2005
--- src/backend/commands/dbcommands.c Tue Feb 15 15:49:12 2005
***************
*** 332,338 ****
       * up-to-date for the copy. (We really only need to flush buffers for
       * the source database, but bufmgr.c provides no API for that.)
       */
!     BufferSync(-1, -1);

      /*
       * Close virtual file descriptors so the kernel has more available for
--- 332,338 ----
       * up-to-date for the copy. (We really only need to flush buffers for
       * the source database, but bufmgr.c provides no API for that.)
       */
!     BufferSync();

      /*
       * Close virtual file descriptors so the kernel has more available for
***************
*** 1206,1212 ****
       * up-to-date for the copy. (We really only need to flush buffers for
       * the source database, but bufmgr.c provides no API for that.)
       */
!     BufferSync(-1, -1);

  #ifndef WIN32

--- 1206,1212 ----
       * up-to-date for the copy. (We really only need to flush buffers for
       * the source database, but bufmgr.c provides no API for that.)
       */
!     BufferSync();

  #ifndef WIN32

*** src/backend/commands/vacuum.c.orig Fri Dec 31 17:45:38 2004
--- src/backend/commands/vacuum.c Sun Feb 13 18:35:03 2005
***************
*** 35,41 ****
  #include "commands/vacuum.h"
  #include "executor/executor.h"
  #include "miscadmin.h"
- #include "storage/buf_internals.h"
  #include "storage/freespace.h"
  #include "storage/sinval.h"
  #include "storage/smgr.h"
--- 35,40 ----
*** src/backend/postmaster/bgwriter.c.orig Mon Jan 10 15:02:20 2005
--- src/backend/postmaster/bgwriter.c Tue Feb 15 15:49:23 2005
***************
*** 116,124 ****
   * GUC parameters
   */
  int BgWriterDelay = 200;
- int BgWriterPercent = 1;
- int BgWriterMaxPages = 100;
-
  int CheckPointTimeout = 300;
  int CheckPointWarning = 30;

--- 116,121 ----
***************
*** 370,376 ****
              n = 1;
          }
          else
!             n = BufferSync(BgWriterPercent, BgWriterMaxPages);

          /*
           * Nap for the configured time or sleep for 10 seconds if there
--- 367,373 ----
              n = 1;
          }
          else
!             n = BgBufferSync();

          /*
           * Nap for the configured time or sleep for 10 seconds if there
*** src/backend/storage/buffer/README.orig Mon Apr 19 19:27:17 2004
--- src/backend/storage/buffer/README Tue Feb 15 15:54:52 2005
***************
*** 4,12 ****
  --------------------------------------

  There are two separate access control mechanisms for shared disk buffers:
! reference counts (a/k/a pin counts) and buffer locks. (Actually, there's
! a third level of access control: one must hold the appropriate kind of
! lock on a relation before one can legally access any page belonging to
  the relation. Relation-level locks are not discussed here.)

  Pins: one must "hold a pin on" a buffer (increment its reference count)
--- 4,12 ----
  --------------------------------------

  There are two separate access control mechanisms for shared disk buffers:
! reference counts (a/k/a pin counts) and buffer content locks. (Actually,
! there's a third level of access control: one must hold the appropriate kind
! of lock on a relation before one can legally access any page belonging to
  the relation. Relation-level locks are not discussed here.)

  Pins: one must "hold a pin on" a buffer (increment its reference count)
***************
*** 26,32 ****
  better hold one first.) Pins may not be held across transaction
  boundaries, however.

! Buffer locks: there are two kinds of buffer locks, shared and exclusive,
  which act just as you'd expect: multiple backends can hold shared locks on
  the same buffer, but an exclusive lock prevents anyone else from holding
  either shared or exclusive lock. (These can alternatively be called READ
--- 26,32 ----
  better hold one first.) Pins may not be held across transaction
  boundaries, however.

! Buffer content locks: there are two kinds of buffer lock, shared and exclusive,
  which act just as you'd expect: multiple backends can hold shared locks on
  the same buffer, but an exclusive lock prevents anyone else from holding
  either shared or exclusive lock. (These can alternatively be called READ
***************
*** 38,49 ****
  Buffer access rules:

  1. To scan a page for tuples, one must hold a pin and either shared or
! exclusive lock. To examine the commit status (XIDs and status bits) of
! a tuple in a shared buffer, one must likewise hold a pin and either shared
  or exclusive lock.

  2. Once one has determined that a tuple is interesting (visible to the
! current transaction) one may drop the buffer lock, yet continue to access
  the tuple's data for as long as one holds the buffer pin. This is what is
  typically done by heap scans, since the tuple returned by heap_fetch
  contains a pointer to tuple data in the shared buffer. Therefore the
--- 38,49 ----
  Buffer access rules:

  1. To scan a page for tuples, one must hold a pin and either shared or
! exclusive content lock. To examine the commit status (XIDs and status bits)
! of a tuple in a shared buffer, one must likewise hold a pin and either shared
  or exclusive lock.

  2. Once one has determined that a tuple is interesting (visible to the
! current transaction) one may drop the content lock, yet continue to access
  the tuple's data for as long as one holds the buffer pin. This is what is
  typically done by heap scans, since the tuple returned by heap_fetch
  contains a pointer to tuple data in the shared buffer. Therefore the
***************
*** 52,60 ****
  of visibility is made.

  3. To add a tuple or change the xmin/xmax fields of an existing tuple,
! one must hold a pin and an exclusive lock on the containing buffer.
  This ensures that no one else might see a partially-updated state of the
! tuple.

  4. It is considered OK to update tuple commit status bits (ie, OR the
  values HEAP_XMIN_COMMITTED, HEAP_XMIN_INVALID, HEAP_XMAX_COMMITTED, or
--- 52,60 ----
  of visibility is made.

  3. To add a tuple or change the xmin/xmax fields of an existing tuple,
! one must hold a pin and an exclusive content lock on the containing buffer.
  This ensures that no one else might see a partially-updated state of the
! tuple while they are doing visibility checks.

  4. It is considered OK to update tuple commit status bits (ie, OR the
  values HEAP_XMIN_COMMITTED, HEAP_XMIN_INVALID, HEAP_XMAX_COMMITTED, or
***************
*** 76,82 ****
  might expect to examine again. Note that another backend might pin the
  buffer (increment the refcount) while one is performing the cleanup, but
  it won't be able to actually examine the page until it acquires shared
! or exclusive lock.


  VACUUM FULL ignores rule #5, because it instead acquires exclusive lock at
--- 76,82 ----
  might expect to examine again. Note that another backend might pin the
  buffer (increment the refcount) while one is performing the cleanup, but
  it won't be able to actually examine the page until it acquires shared
! or exclusive content lock.


  VACUUM FULL ignores rule #5, because it instead acquires exclusive lock at
***************
*** 97,245 ****
  single relation anyway.


! Buffer replacement strategy interface
! -------------------------------------
!
! The file freelist.c contains the buffer cache replacement strategy.
! The interface to the strategy is:
!
! BufferDesc *StrategyBufferLookup(BufferTag *tagPtr, bool recheck,
!         int *cdb_found_index)
!
! This is always the first call made by the buffer manager to check if a disk
! page is in memory. If so, the function returns the buffer descriptor and no
! further action is required. If the page is not in memory,
! StrategyBufferLookup() returns NULL.
!
! The flag recheck tells the strategy that this is a second lookup after
! flushing a dirty block. If the buffer manager has to evict another buffer,
! it will release the bufmgr lock while doing the write IO. During this time,
! another backend could possibly fault in the same page this backend is after,
! so we have to check again after the IO is done if the page is in memory now.
!
! *cdb_found_index is set to the index of the found CDB, or -1 if none.
! This is not intended to be used by the caller, except to pass to
! StrategyReplaceBuffer().
!
! BufferDesc *StrategyGetBuffer(int *cdb_replace_index)
!
! The buffer manager calls this function to get an unpinned cache buffer whose
! content can be evicted. The returned buffer might be empty, clean or dirty.
!
! The returned buffer is only a candidate for replacement. It is possible that
! while the buffer is being written, another backend finds and modifies it, so
! that it is dirty again. The buffer manager will then have to call
! StrategyGetBuffer() again to ask for another candidate.
!
! *cdb_replace_index is set to the index of the candidate CDB, or -1 if none
! (meaning we are using a previously free buffer). This is not intended to be
! used by the caller, except to pass to StrategyReplaceBuffer().
!
! void StrategyReplaceBuffer(BufferDesc *buf, BufferTag *newTag,
!         int cdb_found_index, int cdb_replace_index)
!
! Called by the buffer manager at the time it is about to change the association
! of a buffer with a disk page.
! Before this call, StrategyBufferLookup() still has to find the buffer under
! its old tag, even if it was returned by StrategyGetBuffer() as a candidate
! for replacement.
!
! After this call, this buffer must be returned for a lookup of the new page
! identified by *newTag.
!
! cdb_found_index and cdb_replace_index must be the auxiliary values
! returned by previous calls to StrategyBufferLookup and StrategyGetBuffer.
!
! void StrategyInvalidateBuffer(BufferDesc *buf)
!
! Called by the buffer manager to inform the strategy that the content of this
! buffer is being thrown away. This happens for example in the case of dropping
! a relation. The buffer must be clean and unpinned on call.
!
! If the buffer was associated with a disk page, StrategyBufferLookup()
! must not return it for this page after the call.
!
! void StrategyHintVacuum(bool vacuum_active)
!
! Because VACUUM reads all relations of the entire database through the buffer
! manager, it can greatly disturb the buffer replacement strategy. This function
! is used by VACUUM to inform the strategy that subsequent buffer lookups are
! (or are not) caused by VACUUM scanning relations.


  Buffer replacement strategy
  ---------------------------

! The buffer replacement strategy actually used in freelist.c is a version of
! the Adaptive Replacement Cache (ARC) specially tailored for PostgreSQL.
!
! The algorithm works as follows:

- C is the size of the cache in number of pages (a/k/a shared_buffers or
- NBuffers). ARC uses 2*C Cache Directory Blocks (CDB). A cache directory block
- is always associated with one unique file page. It may point to one shared
- buffer, or may indicate that the file page is not in a buffer but has been
- accessed recently.
-
- All CDB entries are managed in 4 LRU lists named T1, T2, B1 and B2. The T1 and
- T2 lists are the "real" cache entries, linking a file page to a memory buffer
- where the page is currently cached. Consequently T1len+T2len <= C. B1 and B2
- are ghost cache directories that extend T1 and T2 so that the strategy
- remembers pages longer. The strategy tries to keep B1len+T1len and B2len+T2len
- both at C. T1len and T2len vary over the runtime depending on the lookup
- pattern and its resulting cache hits. The desired size of T1len is called
- T1target.
-
- Assuming we have a full cache, one of 5 cases happens on a lookup:
-
- MISS    On a cache miss, depending on T1target and the actual T1len
-         the LRU buffer of either T1 or T2 is evicted. Its CDB is removed
-         from the T list and added as MRU of the corresponding B list.
-         The now free buffer is replaced with the requested page
-         and added as MRU of T1.
-
- T1 hit  The T1 CDB is moved to the MRU position of the T2 list.
-
- T2 hit  The T2 CDB is moved to the MRU position of the T2 list.
-
- B1 hit  This means that a buffer that was evicted from the T1
-         list is now requested again, indicating that T1target is
-         too small (otherwise it would still be in T1 and thus in
-         memory). The strategy raises T1target, evicts a buffer
-         depending on T1target and T1len and places the CDB at
-         MRU of T2.
-
- B2 hit  This means the opposite of B1, the T2 list is probably too
-         small. So the strategy lowers T1target, evicts a buffer
-         and places the CDB at MRU of T2.
-
- Thus, every page that is found on lookup in any of the four lists
- ends up as the MRU of the T2 list. The T2 list therefore is the
- "frequency" cache, holding frequently requested pages.
-
- Every page that is seen for the first time ends up as the MRU of the T1
- list. The T1 list is the "recency" cache, holding recent newcomers.
-
- The tailoring done for PostgreSQL has to do with the way the query executor
- works. A typical UPDATE or DELETE first scans the relation, searching for the
- tuples and then calls heap_update() or heap_delete(). This causes at least 2
- lookups for the block in the same statement. In the case of multiple matches
- in one block even more often. As a result, every block touched in an UPDATE or
- DELETE would directly jump into the T2 cache, which is wrong. To prevent this
- the strategy remembers which transaction added a buffer to the T1 list and
- will not promote it from there into the T2 cache during the same transaction.
-
- Another specialty is the change of the strategy during VACUUM. Lookups during
- VACUUM do not represent application needs, and do not suggest that the page
- will be hit again soon, so it would be wrong to change the cache balance
- T1target due to that or to cause massive cache evictions. Therefore, a page
- read in to satisfy vacuum is placed at the LRU position of the T1 list, for
- immediate reuse. Also, if we happen to get a hit on a CDB entry during
- VACUUM, we do not promote the page above its current position in the list.

  Since VACUUM usually requests many pages very fast, the effect of this is that
  it will get back the very buffers it filled and possibly modified on the next
  call and will therefore do its work in a few shared memory buffers, while
  being able to use whatever it finds in the cache already. This also implies
  that most of the write traffic caused by a VACUUM will be done by the VACUUM
  itself and not pushed off onto other processes.
--- 97,242 ----
  single relation anyway.


! Buffer manager's internal locking
! ---------------------------------
! Before PostgreSQL 8.1, all operations of the shared buffer manager itself
! were protected by a single system-wide lock, the BufMgrLock, which
! unsurprisingly proved to be a source of contention. The new locking scheme
! avoids grabbing system-wide exclusive locks in common code paths. It works
! like this:
!
! * There is a system-wide LWLock, the BufMappingLock, that notionally
! protects the mapping from buffer tags (page identifiers) to buffers.
! (Physically, it can be thought of as protecting the hash table maintained
! by buf_table.c.) To look up whether a buffer exists for a tag, it is
! sufficient to obtain share lock on the BufMappingLock. Note that one
! must pin the found buffer, if any, before releasing the BufMappingLock.
! To alter the page assignment of any buffer, one must hold exclusive lock
! on the BufMappingLock. This lock must be held across adjusting the buffer's
! header fields and changing the buf_table hash table. The only common
! operation that needs exclusive lock is reading in a page that was not
! in shared buffers already, which will require at least a kernel call
! and usually a wait for I/O, so it will be slow anyway.
!
! * A separate system-wide LWLock, the BufFreelistLock, provides mutual
! exclusion for operations that access the buffer free list or select
! buffers for replacement. This is always taken in exclusive mode since
! there are no read-only operations on those data structures. The buffer
! management policy is designed so that BufFreelistLock need not be taken
! except in paths that will require I/O, and thus will be slow anyway.
! (Details appear below.) It is never necessary to hold the BufMappingLock
! and the BufFreelistLock at the same time.
!
! * Each buffer header contains a spinlock that must be taken when examining
! or changing fields of that buffer header. This allows operations such as
! ReleaseBuffer to make local state changes without taking any system-wide
! lock. We use a spinlock, not an LWLock, since there are no cases where
! the lock needs to be held for more than a few instructions.
!
! Note that a buffer header's spinlock does not control access to the data
! held within the buffer. Each buffer header also contains an LWLock, the
! "buffer content lock", that *does* represent the right to access the data
! in the buffer. It is used per the rules above.
!
! There is yet another set of per-buffer LWLocks, the io_in_progress locks,
! that are used to wait for I/O on a buffer to complete. The process doing
! a read or write takes exclusive lock for the duration, and processes that
! need to wait for completion try to take shared locks (which they release
! immediately upon obtaining). XXX on systems where an LWLock represents
! nontrivial resources, it's fairly annoying to need so many locks. Possibly
! we could use per-backend LWLocks instead (a buffer header would then contain
! a field to show which backend is doing its I/O).


  Buffer replacement strategy
  ---------------------------

! There is a "free list" of buffers that are prime candidates for replacement.
! In particular, buffers that are completely free (contain no valid page) are
! always in this list. We may also throw buffers into this list if we
! consider their pages unlikely to be needed soon. The list is singly-linked
! using fields in the buffer headers; we maintain head and tail pointers in
! global variables. (Note: although the list links are in the buffer headers,
! they are considered to be protected by the BufFreelistLock, not the
! buffer-header spinlocks.) To choose a victim buffer to recycle when there
! are no free buffers available, we use a simple clock-sweep algorithm, which
! avoids the need to take system-wide locks during common operations. It
! works like this:
!
! Each buffer header contains a "recently used" flag bit, which is set true
! whenever the buffer is unpinned. (Setting this bit requires only the
! buffer header spinlock, which would have to be taken anyway to decrement
! the buffer reference count, so it's nearly free.)
!
! The "clock hand" is a buffer index, NextVictimBuffer, that moves circularly
! through all the available buffers. NextVictimBuffer is protected by the
! BufFreelistLock.
!
! The algorithm for a process that needs to obtain a victim buffer is:
!
! 1. Obtain BufFreelistLock.
!
! 2. If buffer free list is nonempty, remove its head buffer. If the buffer
! is pinned or has its "recently used" bit set, it cannot be used; ignore
! it and return to the start of step 2. Otherwise, pin the buffer,
! release BufFreelistLock, and return the buffer.
!
! 3. Otherwise, select the buffer pointed to by NextVictimBuffer, and
! circularly advance NextVictimBuffer for next time.
!
! 4. If the selected buffer is pinned or has its "recently used" bit set,
! it cannot be used. Clear its "recently used" bit and return to step 3
! to examine the next buffer.
!
! 5. Pin the selected buffer, release BufFreelistLock, and return the buffer.
!
! (Note that if the selected buffer is dirty, we will have to write it out
! before we can recycle it; if someone else pins the buffer meanwhile we will
! have to give up and try another buffer. This however is not a concern
! of the basic select-a-victim-buffer algorithm.)
!
! This scheme selects only victim buffers that have gone unused since they
! were last passed over by the "clock hand".
!
! A special provision is that while running VACUUM, a backend does not set the
! "recently used" bit on buffers it accesses. In fact, if ReleaseBuffer sees
! that it is dropping the pin count to zero and the "recently used" bit is not
! set, then it appends the buffer to the tail of the free list. (This implies
! that VACUUM, but only VACUUM, must take the BufFreelistLock during
! ReleaseBuffer; this shouldn't create much of a contention problem.) This
! provision encourages VACUUM to work in a relatively small number of buffers
! rather than blowing out the entire buffer cache. It is reasonable since a
! page that has been touched only by VACUUM is unlikely to be needed again
! soon.

  Since VACUUM usually requests many pages very fast, the effect of this is that
  it will get back the very buffers it filled and possibly modified on the next
  call and will therefore do its work in a few shared memory buffers, while
  being able to use whatever it finds in the cache already. This also implies
  that most of the write traffic caused by a VACUUM will be done by the VACUUM
  itself and not pushed off onto other processes.
+
+
+ Background writer's processing
+ ------------------------------
+
+ The background writer is designed to write out pages that are likely to be
+ recycled soon, thereby offloading the writing work from active backends.
+ To do this, it scans forward circularly from the current position of
+ NextVictimBuffer (which it does not change!), looking for buffers that are
+ dirty and not pinned nor marked "recently used". It pins, writes, and
+ releases any such buffer.
+
+ If we can assume that reading NextVictimBuffer is an atomic action, then
+ the writer doesn't even need to take the BufFreelistLock in order to look
+ for buffers to write; it needs only to spinlock each buffer header for long
+ enough to check the dirtybit. Even without that assumption, the writer
+ only needs to take the lock long enough to read the variable value, not
+ while scanning the buffers. (This is a very substantial improvement in
+ the contention cost of the writer compared to PG 8.0.)
+
+ During a checkpoint, the writer's strategy must be to write every dirty
+ buffer (pinned or not!). We may as well make it start this scan from
+ NextVictimBuffer, however, so that the first-to-be-written pages are the
+ ones that backends might otherwise have to write for themselves soon.
*** src/backend/storage/buffer/buf_init.c.orig Thu Feb 3 18:29:11 2005
--- src/backend/storage/buffer/buf_init.c Tue Feb 15 12:52:31 2005
***************
*** 22,27 ****
--- 22,29 ----
  Block *BufferBlockPointers;
  int32 *PrivateRefCount;

+ static char *BufferBlocks;
+
  /* statistics counters */
  long int ReadBufferCount;
  long int ReadLocalBufferCount;
***************
*** 50,65 ****
   *
   * Synchronization/Locking:
   *
-  * BufMgrLock lock -- must be acquired before manipulating the
-  * buffer search datastructures (lookup/freelist, as well as the
-  * flag bits of any buffer). Must be released
-  * before exit and before doing any IO.
-  *
   * IO_IN_PROGRESS -- this is a flag in the buffer descriptor.
   * It must be set when an IO is initiated and cleared at
   * the end of the IO. It is there to make sure that one
   * process doesn't start to use a buffer while another is
!  * faulting it in. see IOWait/IOSignal.
   *
   * refcount -- Counts the number of processes holding pins on a buffer.
   * A buffer is pinned during IO and immediately after a BufferAlloc().
--- 52,62 ----
   *
   * Synchronization/Locking:
   *
   * IO_IN_PROGRESS -- this is a flag in the buffer descriptor.
   * It must be set when an IO is initiated and cleared at
   * the end of the IO. It is there to make sure that one
   * process doesn't start to use a buffer while another is
!  * faulting it in. see WaitIO and related routines.
   *
   * refcount -- Counts the number of processes holding pins on a buffer.
   * A buffer is pinned during IO and immediately after a BufferAlloc().
***************
*** 85,94 ****
  void
  InitBufferPool(void)
  {
-     char *BufferBlocks;
      bool foundBufs,
           foundDescs;
-     int i;

      BufferDescriptors = (BufferDesc *)
          ShmemInitStruct("Buffer Descriptors",
--- 82,89 ----
***************
*** 102,153 ****
      {
          /* both should be present or neither */
          Assert(foundDescs && foundBufs);
      }
      else
      {
          BufferDesc *buf;
!         char *block;
!
!         /*
!          * It's probably not really necessary to grab the lock --- if
!          * there's anyone else attached to the shmem at this point, we've
!          * got problems.
!          */
!         LWLockAcquire(BufMgrLock, LW_EXCLUSIVE);

          buf = BufferDescriptors;
-         block = BufferBlocks;

          /*
           * Initialize all the buffer headers.
           */
!         for (i = 0; i < NBuffers; block += BLCKSZ, buf++, i++)
          {
!             Assert(ShmemIsValid((unsigned long) block));

!             /*
!              * The bufNext fields link together all totally-unused buffers.
!              * Subsequent management of this list is done by
!              * StrategyGetBuffer().
!              */
!             buf->bufNext = i + 1;

-             CLEAR_BUFFERTAG(buf->tag);
              buf->buf_id = i;

!             buf->data = MAKE_OFFSET(block);
!             buf->flags = 0;
!             buf->refcount = 0;
              buf->io_in_progress_lock = LWLockAssign();
!             buf->cntx_lock = LWLockAssign();
!             buf->cntxDirty = false;
!             buf->wait_backend_id = 0;
          }

          /* Correct last entry of linked list */
!         BufferDescriptors[NBuffers - 1].bufNext = -1;
!
!         LWLockRelease(BufMgrLock);
      }

      /* Init other shared buffer-management stuff */
--- 97,137 ----
      {
          /* both should be present or neither */
          Assert(foundDescs && foundBufs);
+         /* note: this path is only taken in EXEC_BACKEND case */
      }
      else
      {
          BufferDesc *buf;
!         int i;

          buf = BufferDescriptors;

          /*
           * Initialize all the buffer headers.
           */
!         for (i = 0; i < NBuffers; buf++, i++)
          {
!             CLEAR_BUFFERTAG(buf->tag);
!             buf->flags = 0;
!             buf->refcount = 0;
!             buf->wait_backend_id = 0;

!             SpinLockInit(&buf->buf_hdr_lock);

              buf->buf_id = i;

!             /*
!              * Initially link all the buffers together as unused.
!              * Subsequent management of this list is done by freelist.c.
!              */
!             buf->freeNext = i + 1;
!
              buf->io_in_progress_lock = LWLockAssign();
!             buf->content_lock = LWLockAssign();
          }

          /* Correct last entry of linked list */
!         BufferDescriptors[NBuffers - 1].freeNext = FREENEXT_END_OF_LIST;
      }

      /* Init other shared buffer-management stuff */
***************
*** 162,173 ****
   * buffer pool.
   *
   * NB: this is called before InitProcess(), so we do not have a PGPROC and
!  * cannot do LWLockAcquire; hence we can't actually access the bufmgr's
   * shared memory yet. We are only initializing local data here.
   */
  void
  InitBufferPoolAccess(void)
  {
      int i;

      /*
--- 146,158 ----
   * buffer pool.
   *
   * NB: this is called before InitProcess(), so we do not have a PGPROC and
!  * cannot do LWLockAcquire; hence we can't actually access stuff in
   * shared memory yet. We are only initializing local data here.
   */
  void
  InitBufferPoolAccess(void)
  {
+     char *block;
      int i;

      /*
***************
*** 179,190 ****
              sizeof(*PrivateRefCount));

      /*
!      * Convert shmem offsets into addresses as seen by this process. This
!      * is just to speed up the BufferGetBlock() macro. It is OK to do this
!      * without any lock since the data pointers never change.
       */
      for (i = 0; i < NBuffers; i++)
!         BufferBlockPointers[i] = (Block) MAKE_PTR(BufferDescriptors[i].data);
  }

  /*
--- 164,181 ----
              sizeof(*PrivateRefCount));

      /*
!      * Construct addresses for the individual buffer data blocks. We do
!      * this just to speed up the BufferGetBlock() macro. (Since the
!      * addresses should be the same in every backend, we could inherit
!      * this data from the postmaster --- but in the EXEC_BACKEND case
!      * that doesn't work.)
       */
+     block = BufferBlocks;
      for (i = 0; i < NBuffers; i++)
!     {
!         BufferBlockPointers[i] = (Block) block;
!         block += BLCKSZ;
!     }
  }

  /*
*** src/backend/storage/buffer/buf_table.c.orig Thu Feb 3 18:29:11 2005
--- src/backend/storage/buffer/buf_table.c Tue Feb 15 12:52:31 2005
***************
*** 3,14 ****
   * buf_table.c
   * routines for mapping BufferTags to buffer indexes.
   *
!  * NOTE: this module is called only by freelist.c, and the "buffer IDs"
!  * it deals with are whatever freelist.c needs them to be; they may not be
!  * directly equivalent to Buffer numbers.
!  *
!  * Note: all routines in this file assume that the BufMgrLock is held
!  * by the caller, so no synchronization is needed.
   *
   *
   * Portions Copyright (c) 1996-2005, PostgreSQL Global Development Group
--- 3,11 ----
   * buf_table.c
   * routines for mapping BufferTags to buffer indexes.
   *
!  * Note: the routines in this file do no locking of their own. The caller
!  * must hold a suitable lock on the BufMappingLock, as specified in the
!  * comments.
   *
   *
   * Portions Copyright (c) 1996-2005, PostgreSQL Global Development Group
***************
*** 74,90 ****
  /*
   * BufTableLookup
   * Lookup the given BufferTag; return buffer ID, or -1 if not found
   */
  int
  BufTableLookup(BufferTag *tagPtr)
  {
      BufferLookupEnt *result;

-     if (tagPtr->blockNum == P_NEW)
-         return -1;
-
      result = (BufferLookupEnt *)
          hash_search(SharedBufHash, (void *) tagPtr, HASH_FIND, NULL);
      if (!result)
          return -1;

--- 71,87 ----
  /*
   * BufTableLookup
   * Lookup the given BufferTag; return buffer ID, or -1 if not found
+  *
+  * Caller must hold at least share lock on BufMappingLock
   */
  int
  BufTableLookup(BufferTag *tagPtr)
  {
      BufferLookupEnt *result;

      result = (BufferLookupEnt *)
          hash_search(SharedBufHash, (void *) tagPtr, HASH_FIND, NULL);
+
      if (!result)
          return -1;

***************
*** 93,106 ****

  /*
   * BufTableInsert
!  * Insert a hashtable entry for given tag and buffer ID
   */
! void
  BufTableInsert(BufferTag *tagPtr, int buf_id)
  {
      BufferLookupEnt *result;
      bool found;

      result = (BufferLookupEnt *)
          hash_search(SharedBufHash, (void *) tagPtr, HASH_ENTER, &found);

--- 90,112 ----

  /*
   * BufTableInsert
!  * Insert a hashtable entry for given tag and buffer ID,
!  * unless an entry already exists for that tag
!  *
!  * Returns -1 on successful insertion. If a conflicting entry exists
!  * already, returns the buffer ID in that entry.
!  *
!  * Caller must hold write lock on BufMappingLock
   */
! int
  BufTableInsert(BufferTag *tagPtr, int buf_id)
  {
      BufferLookupEnt *result;
      bool found;

+     Assert(buf_id >= 0); /* -1 is reserved for not-in-table */
+     Assert(tagPtr->blockNum != P_NEW); /* invalid tag */
+
      result = (BufferLookupEnt *)
          hash_search(SharedBufHash, (void *) tagPtr, HASH_ENTER, &found);

***************
*** 109,123 ****
              (errcode(ERRCODE_OUT_OF_MEMORY),
               errmsg("out of shared memory")));

!     if (found) /* found something already in the table? */
!         elog(ERROR, "shared buffer hash table corrupted");

      result->id = buf_id;
  }

  /*
   * BufTableDelete
   * Delete the hashtable entry for given tag (which must exist)
   */
  void
  BufTableDelete(BufferTag *tagPtr)
--- 115,133 ----
              (errcode(ERRCODE_OUT_OF_MEMORY),
               errmsg("out of shared memory")));

!     if (found) /* found something already in the table */
!         return result->id;

      result->id = buf_id;
+
+     return -1;
  }

  /*
   * BufTableDelete
   * Delete the hashtable entry for given tag (which must exist)
+  *
+  * Caller must hold write lock on BufMappingLock
   */
  void
  BufTableDelete(BufferTag *tagPtr)
*** src/backend/storage/buffer/bufmgr.c.orig Mon Jan 10 15:02:21 2005
--- src/backend/storage/buffer/bufmgr.c Tue Feb 15 15:49:32 2005
***************
*** 25,31 ****
   *
   * WriteBuffer() -- WriteNoReleaseBuffer() + ReleaseBuffer()
   *
!  * BufferSync() -- flush all (or some) dirty buffers in the buffer pool.
   *
   * InitBufferPool() -- Init the buffer module.
   *
--- 25,33 ----
   *
   * WriteBuffer() -- WriteNoReleaseBuffer() + ReleaseBuffer()
   *
!  * BufferSync() -- flush all dirty buffers in the buffer pool.
!  *
!  * BgBufferSync() -- flush some dirty buffers in the buffer pool.
   *
   * InitBufferPool() -- Init the buffer module.
   *
***************
*** 50,65 ****
  #include "pgstat.h"


! #define BufferGetLSN(bufHdr) \
!     (*((XLogRecPtr*) MAKE_PTR((bufHdr)->data)))


! /* GUC variable */
  bool zero_damaged_pages = false;

- #ifdef NOT_USED
- bool ShowPinTrace = false;
- #endif

  long NDirectFileRead; /* some I/O's are direct file access.
                         * bypass bufmgr */
--- 52,71 ----
  #include "pgstat.h"


! /* Note: these two macros only work on shared buffers, not local ones! */
! #define BufHdrGetBlock(bufHdr) BufferBlockPointers[(bufHdr)->buf_id]
! #define BufferGetLSN(bufHdr) (*((XLogRecPtr*) BufHdrGetBlock(bufHdr)))
!
! /* Note: this macro only works on local buffers, not shared ones! */
! #define LocalBufHdrGetBlock(bufHdr) \
!     LocalBufferBlockPointers[-((bufHdr)->buf_id + 2)]


! /* GUC variables */
  bool zero_damaged_pages = false;
+ int BgWriterPercent = 1;
+ int BgWriterMaxPages = 100;


  long NDirectFileRead; /* some I/O's are direct file access.
                         * bypass bufmgr */
***************
*** 73,90 ****
  static BufferDesc *PinCountWaitBuf = NULL;


! static void PinBuffer(BufferDesc *buf, bool fixOwner);
! static void UnpinBuffer(BufferDesc *buf, bool fixOwner);
  static void WaitIO(BufferDesc *buf);
! static void StartBufferIO(BufferDesc *buf, bool forInput);
! static void TerminateBufferIO(BufferDesc *buf, int err_flag);
! static void ContinueBufferIO(BufferDesc *buf, bool forInput);
  static void buffer_write_error_callback(void *arg);
- static Buffer ReadBufferInternal(Relation reln, BlockNumber blockNum,
-             bool bufferLockHeld);
  static BufferDesc *BufferAlloc(Relation reln, BlockNumber blockNum,
              bool *foundPtr);
! static void FlushBuffer(BufferDesc *buf, SMgrRelation reln, bool earlylock);
  static void write_buffer(Buffer buffer, bool unpin);


--- 79,96 ----
  static BufferDesc *PinCountWaitBuf = NULL;


! static bool PinBuffer(BufferDesc *buf);
! static void PinBuffer_Locked(BufferDesc *buf);
! static void UnpinBuffer(BufferDesc *buf, bool fixOwner, bool trashOK);
! static bool SyncOneBuffer(int buf_id, bool skip_pinned);
  static void WaitIO(BufferDesc *buf);
! static bool StartBufferIO(BufferDesc *buf, bool forInput);
! static void TerminateBufferIO(BufferDesc *buf, bool clear_dirty,
!             int set_flag_bits);
  static void buffer_write_error_callback(void *arg);
  static BufferDesc *BufferAlloc(Relation reln, BlockNumber blockNum,
              bool *foundPtr);
! static void FlushBuffer(BufferDesc *buf, SMgrRelation reln);
  static void write_buffer(Buffer buffer, bool unpin);


***************
*** 106,132 ****
  Buffer
  ReadBuffer(Relation reln, BlockNumber blockNum)
  {
-     ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
-     return ReadBufferInternal(reln, blockNum, false);
- }
-
- /*
-  * ReadBufferInternal -- internal version of ReadBuffer with more options
-  *
-  * bufferLockHeld: if true, caller already acquired the bufmgr lock.
-  * (This is assumed never to be true if dealing with a local buffer!)
-  *
-  * The caller must have done ResourceOwnerEnlargeBuffers(CurrentResourceOwner)
-  */
- static Buffer
- ReadBufferInternal(Relation reln, BlockNumber blockNum,
-             bool bufferLockHeld)
- {
      BufferDesc *bufHdr;
      bool found;
      bool isExtend;
      bool isLocalBuf;

      isExtend = (blockNum == P_NEW);
      isLocalBuf = reln->rd_istemp;

--- 112,126 ----
  Buffer
  ReadBuffer(Relation reln, BlockNumber blockNum)
  {
      BufferDesc *bufHdr;
+     Block bufBlock;
      bool found;
      bool isExtend;
      bool isLocalBuf;

+     /* Make sure we will have room to remember the buffer pin */
+     ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
+
      isExtend = (blockNum == P_NEW);
      isLocalBuf = reln->rd_istemp;

***************
*** 137,146 ****
      if (isExtend)
          blockNum = smgrnblocks(reln->rd_smgr);

      if (isLocalBuf)
      {
          ReadLocalBufferCount++;
-         pgstat_count_buffer_read(&reln->pgstat_info, reln);
          bufHdr = LocalBufferAlloc(reln, blockNum, &found);
          if (found)
              LocalBufferHitCount++;
--- 131,141 ----
      if (isExtend)
          blockNum = smgrnblocks(reln->rd_smgr);

+     pgstat_count_buffer_read(&reln->pgstat_info, reln);
+
      if (isLocalBuf)
      {
          ReadLocalBufferCount++;
          bufHdr = LocalBufferAlloc(reln, blockNum, &found);
          if (found)
              LocalBufferHitCount++;
***************
*** 148,167 ****
      else
      {
          ReadBufferCount++;
-         pgstat_count_buffer_read(&reln->pgstat_info, reln);
          /*
           * lookup the buffer. IO_IN_PROGRESS is set if the requested
           * block is not currently in memory.
*/ - if (!bufferLockHeld) - LWLockAcquire(BufMgrLock, LW_EXCLUSIVE); bufHdr = BufferAlloc(reln, blockNum, &found); if (found) BufferHitCount++; } ! /* At this point we do NOT hold the bufmgr lock. */ /* if it was already in the buffer pool, we're done */ if (found) --- 143,159 ---- else { ReadBufferCount++; /* * lookup the buffer. IO_IN_PROGRESS is set if the requested * block is not currently in memory. */ bufHdr = BufferAlloc(reln, blockNum, &found); if (found) BufferHitCount++; } ! /* At this point we do NOT hold any locks. */ /* if it was already in the buffer pool, we're done */ if (found) *************** *** 187,206 **** * same buffer (if it's not been recycled) but come right back here to * try smgrextend again. */ ! Assert(!(bufHdr->flags & BM_VALID)); if (isExtend) { /* new buffers are zero-filled */ ! MemSet((char *) MAKE_PTR(bufHdr->data), 0, BLCKSZ); ! smgrextend(reln->rd_smgr, blockNum, (char *) MAKE_PTR(bufHdr->data), reln->rd_istemp); } else { ! smgrread(reln->rd_smgr, blockNum, (char *) MAKE_PTR(bufHdr->data)); /* check for garbage data */ ! if (!PageHeaderIsValid((PageHeader) MAKE_PTR(bufHdr->data))) { /* * During WAL recovery, the first access to any data page --- 179,200 ---- * same buffer (if it's not been recycled) but come right back here to * try smgrextend again. */ ! Assert(!(bufHdr->flags & BM_VALID)); /* spinlock not needed */ ! ! bufBlock = isLocalBuf ? LocalBufHdrGetBlock(bufHdr) : BufHdrGetBlock(bufHdr); if (isExtend) { /* new buffers are zero-filled */ ! MemSet((char *) bufBlock, 0, BLCKSZ); ! smgrextend(reln->rd_smgr, blockNum, (char *) bufBlock, reln->rd_istemp); } else { ! smgrread(reln->rd_smgr, blockNum, (char *) bufBlock); /* check for garbage data */ ! 
if (!PageHeaderIsValid((PageHeader) bufBlock)) { /* * During WAL recovery, the first access to any data page *************** *** 215,221 **** (errcode(ERRCODE_DATA_CORRUPTED), errmsg("invalid page header in block %u of relation \"%s\"; zeroing out page", blockNum, RelationGetRelationName(reln)))); ! MemSet((char *) MAKE_PTR(bufHdr->data), 0, BLCKSZ); } else ereport(ERROR, --- 209,215 ---- (errcode(ERRCODE_DATA_CORRUPTED), errmsg("invalid page header in block %u of relation \"%s\"; zeroing out page", blockNum, RelationGetRelationName(reln)))); ! MemSet((char *) bufBlock, 0, BLCKSZ); } else ereport(ERROR, *************** *** 232,247 **** } else { ! /* lock buffer manager again to update IO IN PROGRESS */ ! LWLockAcquire(BufMgrLock, LW_EXCLUSIVE); ! ! /* IO Succeeded, so mark data valid */ ! bufHdr->flags |= BM_VALID; ! ! /* If anyone was waiting for IO to complete, wake them up now */ ! TerminateBufferIO(bufHdr, 0); ! ! LWLockRelease(BufMgrLock); } if (VacuumCostActive) --- 226,233 ---- } else { ! /* Set BM_VALID, terminate IO, and wake up any waiters */ ! TerminateBufferIO(bufHdr, false, BM_VALID); } if (VacuumCostActive) *************** *** 263,270 **** * *foundPtr is actually redundant with the buffer's BM_VALID flag, but * we keep it for simplicity in ReadBuffer. * ! * BufMgrLock must be held at entry. When this routine returns, ! * the BufMgrLock is guaranteed NOT to be held. */ static BufferDesc * BufferAlloc(Relation reln, --- 249,255 ---- * *foundPtr is actually redundant with the buffer's BM_VALID flag, but * we keep it for simplicity in ReadBuffer. * ! * No locks are held either at entry or exit. */ static BufferDesc * BufferAlloc(Relation reln, *************** *** 272,492 **** bool *foundPtr) { BufferTag newTag; /* identity of requested block */ ! BufferDesc *buf, ! *buf2; ! int cdb_found_index, ! cdb_replace_index; ! bool inProgress; /* did we already do StartBufferIO? 
*/ /* create a tag so we can lookup the buffer */ INIT_BUFFERTAG(newTag, reln, blockNum); /* see if the block is in the buffer pool already */ ! buf = StrategyBufferLookup(&newTag, false, &cdb_found_index); ! if (buf != NULL) { /* * Found it. Now, pin the buffer so no one can steal it from the ! * buffer pool, and check to see if someone else is still reading ! * data into the buffer. (Formerly, we'd always block here if ! * IO_IN_PROGRESS is set, but there's no need to wait when someone ! * is writing rather than reading.) */ ! *foundPtr = TRUE; ! PinBuffer(buf, true); ! if (!(buf->flags & BM_VALID)) { ! if (buf->flags & BM_IO_IN_PROGRESS) ! { ! /* someone else is reading it, wait for them */ ! WaitIO(buf); ! } ! if (!(buf->flags & BM_VALID)) { /* * If we get here, previous attempts to read the buffer * must have failed ... but we shall bravely try again. */ *foundPtr = FALSE; - StartBufferIO(buf, true); } } - LWLockRelease(BufMgrLock); - return buf; } - *foundPtr = FALSE; - /* * Didn't find it in the buffer pool. We'll have to initialize a new ! * buffer. First, grab one from the free list. If it's dirty, flush ! * it to disk. Remember to unlock BufMgrLock while doing the IO. */ ! inProgress = FALSE; ! do { ! buf = StrategyGetBuffer(&cdb_replace_index); ! /* StrategyGetBuffer will elog if it can't find a free buffer */ ! Assert(buf); /* ! * There should be exactly one pin on the buffer after it is ! * allocated -- ours. If it had a pin it wouldn't have been on ! * the free list. No one else could have pinned it between ! * StrategyGetBuffer and here because we have the BufMgrLock. * ! * (We must pin the buffer before releasing BufMgrLock ourselves, ! * to ensure StrategyGetBuffer won't give the same buffer to someone ! * else.) */ ! Assert(buf->refcount == 0); ! buf->refcount = 1; ! PrivateRefCount[BufferDescriptorGetBuffer(buf) - 1] = 1; ! ResourceOwnerRememberBuffer(CurrentResourceOwner, ! BufferDescriptorGetBuffer(buf)); ! if ((buf->flags & BM_VALID) && ! 
(buf->flags & BM_DIRTY || buf->cntxDirty)) { /* ! * Set BM_IO_IN_PROGRESS to show the buffer is being written. ! * It cannot already be set because the buffer would be pinned ! * if someone were writing it. ! * ! * Note: it's okay to grab the io_in_progress lock while holding ! * BufMgrLock. All code paths that acquire this lock pin the ! * buffer first; since no one had it pinned (it just came off ! * the free list), no one else can have the lock. */ ! StartBufferIO(buf, false); ! inProgress = TRUE; ! /* ! * Write the buffer out, being careful to release BufMgrLock ! * while doing the I/O. We also tell FlushBuffer to share-lock ! * the buffer before releasing BufMgrLock. This is safe because ! * we know no other backend currently has the buffer pinned, ! * therefore no one can have it locked either, so we can always ! * get the lock without blocking. It is necessary because if ! * we release BufMgrLock first, it's possible for someone else ! * to pin and exclusive-lock the buffer before we get to the ! * share-lock, causing us to block. If the someone else then ! * blocks on a lock we hold, deadlock ensues. This has been ! * observed to happen when two backends are both trying to split ! * btree index pages, and the second one just happens to be ! * trying to split the page the first one got from the freelist. ! */ ! FlushBuffer(buf, NULL, true); /* ! * Somebody could have allocated another buffer for the same ! * block we are about to read in. While we flush out the dirty ! * buffer, we don't hold the lock and someone could have ! * allocated another buffer for the same block. The problem is ! * we haven't yet inserted the new tag into the buffer table. ! * So we need to check here. -ay 3/95 ! * ! * Another reason we have to do this is to update ! * cdb_found_index, since the CDB could have disappeared from ! * B1/B2 list while we were writing. */ ! buf2 = StrategyBufferLookup(&newTag, true, &cdb_found_index); ! if (buf2 != NULL) { /* ! * Found it. 
Someone has already done what we were about ! * to do. We'll just handle this as if it were found in ! * the buffer pool in the first place. First, give up the ! * buffer we were planning to use. */ ! TerminateBufferIO(buf, 0); ! UnpinBuffer(buf, true); ! buf = buf2; ! /* remaining code should match code at top of routine */ ! *foundPtr = TRUE; ! PinBuffer(buf, true); ! if (!(buf->flags & BM_VALID)) ! { ! if (buf->flags & BM_IO_IN_PROGRESS) ! { ! /* someone else is reading it, wait for them */ ! WaitIO(buf); ! } ! if (!(buf->flags & BM_VALID)) ! { ! /* ! * If we get here, previous attempts to read the ! * buffer must have failed ... but we shall ! * bravely try again. ! */ ! *foundPtr = FALSE; ! StartBufferIO(buf, true); ! } ! } ! LWLockRelease(BufMgrLock); ! return buf; ! } ! /* ! * Somebody could have pinned the buffer while we were doing ! * the I/O and had given up the BufMgrLock. If so, we can't ! * recycle this buffer --- we need to clear the I/O flags, ! * remove our pin and choose a new victim buffer. Similarly, ! * we have to start over if somebody re-dirtied the buffer. ! */ ! if (buf->refcount > 1 || buf->flags & BM_DIRTY || buf->cntxDirty) ! { ! TerminateBufferIO(buf, 0); ! UnpinBuffer(buf, true); ! inProgress = FALSE; ! buf = NULL; ! } ! } ! } while (buf == NULL); /* ! * At this point we should have the sole pin on a non-dirty buffer and ! * we may or may not already have the BM_IO_IN_PROGRESS flag set. */ /* ! * Tell the buffer replacement strategy that we are replacing the ! * buffer content. Then rename the buffer. Clearing BM_VALID here is ! * necessary, clearing the dirtybits is just paranoia. */ ! StrategyReplaceBuffer(buf, &newTag, cdb_found_index, cdb_replace_index); ! buf->tag = newTag; ! buf->flags &= ~(BM_VALID | BM_DIRTY | BM_JUST_DIRTIED | BM_IO_ERROR); ! buf->cntxDirty = false; /* ! * Buffer contents are currently invalid. Have to mark IO IN PROGRESS ! * so no one fiddles with them until the read completes. We may have ! 
* already marked it, in which case we just flip from write to read ! * status. */ ! if (!inProgress) ! StartBufferIO(buf, true); ! else ! ContinueBufferIO(buf, true); ! LWLockRelease(BufMgrLock); ! return buf; } /* --- 257,587 ---- bool *foundPtr) { BufferTag newTag; /* identity of requested block */ ! BufferTag oldTag; ! BufFlags oldFlags; ! int buf_id; ! BufferDesc *buf; ! bool valid; /* create a tag so we can lookup the buffer */ INIT_BUFFERTAG(newTag, reln, blockNum); /* see if the block is in the buffer pool already */ ! LWLockAcquire(BufMappingLock, LW_SHARED); ! buf_id = BufTableLookup(&newTag); ! if (buf_id >= 0) { /* * Found it.  Now, pin the buffer so no one can steal it from the ! * buffer pool, and check to see if the correct data has been ! * loaded into the buffer. */ ! buf = &BufferDescriptors[buf_id]; ! ! valid = PinBuffer(buf); ! ! /* Can release the mapping lock as soon as we've pinned it */ ! LWLockRelease(BufMappingLock); ! *foundPtr = TRUE; ! if (!valid) { ! /* ! * We can only get here if (a) someone else is still reading ! * in the page, or (b) a previous read attempt failed. We ! * have to wait for any active read attempt to finish, and ! * then set up our own read attempt if the page is still not ! * BM_VALID. StartBufferIO does it all. ! */ ! if (StartBufferIO(buf, true)) { /* * If we get here, previous attempts to read the buffer * must have failed ... but we shall bravely try again. */ *foundPtr = FALSE; } } return buf; } /* * Didn't find it in the buffer pool. We'll have to initialize a new ! * buffer. Remember to unlock BufMappingLock while doing the work. */ ! LWLockRelease(BufMappingLock); ! ! for (;;) { ! bool isdirty; ! ! /* ! * Select a victim buffer. The buffer is returned with its ! * header spinlock still held! Also the BufFreelistLock is ! * still held, since it would be bad to hold the spinlock ! * while possibly waking up other processes. ! */ ! buf = StrategyGetBuffer(); ! Assert(buf->refcount == 0); /* ! 
* If the buffer is dirty, we must share-lock the contents before ! * releasing the spinlock. Otherwise it is possible for someone else ! * to pin and exclusive-lock the buffer before we get the content ! * lock, forcing us to wait for them. If the someone else then blocks ! * on a lock we hold, deadlock ensues. ! * (This has been observed to happen when two backends are both trying ! * to split btree index pages, and the second one just happens to be ! * trying to split the page the first one got from the freelist.) * ! * It's ugly to do this while we hold the buffer header spinlock, but ! * it's safe because we know no other backend currently has the buffer ! * pinned, therefore no one can have it locked either, so we can ! * always get the lock without blocking. */ ! if (buf->flags & BM_DIRTY) ! { ! LWLockAcquire(buf->content_lock, LW_SHARED); ! isdirty = true; ! } ! else ! isdirty = false; ! /* Pin the buffer and then release the buffer spinlock */ ! PinBuffer_Locked(buf); ! /* Now it's safe to release the freelist lock */ ! LWLockRelease(BufFreelistLock); ! ! /* ! * If it's dirty, try to write it out. Note that someone may dirty ! * it again while we're writing (since our share-lock doesn't block ! * hint-bit updates), so this is not certain to succeed. We will ! * check again after re-locking the buffer header. ! */ ! if (isdirty) { + FlushBuffer(buf, NULL); /* ! * Whether FlushBuffer succeeded or not, we need to drop the ! * buffer content share-lock now. */ ! LWLockRelease(buf->content_lock); ! } ! /* ! * Acquire exclusive mapping lock in preparation for changing ! * the buffer's association. ! */ ! LWLockAcquire(BufMappingLock, LW_EXCLUSIVE); ! /* ! * Try to make a hashtable entry for the buffer under its new tag. ! * This could fail because while we were writing someone else ! * allocated another buffer for the same block we want to read in. ! * Note that we have not yet removed the hashtable entry for the ! * old tag. ! */ ! 
buf_id = BufTableInsert(&newTag, buf->buf_id); + if (buf_id >= 0) + { /* ! * Got a collision. Someone has already done what we were about ! * to do. We'll just handle this as if it were found in ! * the buffer pool in the first place. First, give up the ! * buffer we were planning to use. Don't allow it to be ! * thrown in the free list (we don't want to hold both ! * global locks at once). */ ! UnpinBuffer(buf, true, false); ! ! /* remaining code should match code at top of routine */ ! ! buf = &BufferDescriptors[buf_id]; ! ! valid = PinBuffer(buf); ! ! /* Can release the mapping lock as soon as we've pinned it */ ! LWLockRelease(BufMappingLock); ! ! *foundPtr = TRUE; ! ! if (!valid) { /* ! * We can only get here if (a) someone else is still reading ! * in the page, or (b) a previous read attempt failed. We ! * have to wait for any active read attempt to finish, and ! * then set up our own read attempt if the page is still not ! * BM_VALID. StartBufferIO does it all. */ ! if (StartBufferIO(buf, true)) ! { ! /* ! * If we get here, previous attempts to read the buffer ! * must have failed ... but we shall bravely try again. ! */ ! *foundPtr = FALSE; ! } ! } ! return buf; ! } ! /* ! * Need to lock the buffer header too in order to change its tag. ! */ ! LockBufHdr_NoHoldoff(buf); ! /* ! * Somebody could have pinned or re-dirtied the buffer while we were ! * doing the I/O and making the new hashtable entry. If so, we ! * can't recycle this buffer; we must undo everything we've done and ! * start over with a new victim buffer. ! */ ! if (buf->refcount == 1 && !(buf->flags & BM_DIRTY)) ! break; ! UnlockBufHdr_NoHoldoff(buf); ! BufTableDelete(&newTag); ! LWLockRelease(BufMappingLock); ! UnpinBuffer(buf, true, false /* evidently recently used */ ); ! } ! /* ! * Okay, it's finally safe to rename the buffer. ! * ! * Clearing BM_VALID here is necessary, clearing the dirtybits ! * is just paranoia. We also clear BM_RECENTLY_USED since any ! 
* recency of use of the old content is no longer relevant. ! */ ! oldTag = buf->tag; ! oldFlags = buf->flags; ! buf->tag = newTag; ! buf->flags &= ~(BM_VALID | BM_DIRTY | BM_JUST_DIRTIED | ! BM_IO_ERROR | BM_RECENTLY_USED); ! buf->flags |= BM_TAG_VALID; ! UnlockBufHdr_NoHoldoff(buf); ! if (oldFlags & BM_TAG_VALID) ! BufTableDelete(&oldTag); ! LWLockRelease(BufMappingLock); ! ! /* ! * Buffer contents are currently invalid. Try to get the io_in_progress ! * lock. If StartBufferIO returns false, then someone else managed ! * to read it before we did, so there's nothing left for BufferAlloc() ! * to do. ! */ ! if (StartBufferIO(buf, true)) ! *foundPtr = FALSE; ! else ! *foundPtr = TRUE; ! ! return buf; ! } ! ! /* ! * InvalidateBuffer -- mark a shared buffer invalid and return it to the ! * freelist. ! * ! * The buffer header spinlock must be held at entry. We drop it before ! * returning. (This is sane because the caller must have locked the ! * buffer in order to be sure it should be dropped.) ! * ! * This is used only in contexts such as dropping a relation. We assume ! * that no other backend could possibly be interested in using the page, ! * so the only reason the buffer might be pinned is if someone else is ! * trying to write it out. We have to let them finish before we can ! * reclaim the buffer. ! * ! * The buffer could get reclaimed by someone else while we are waiting ! * to acquire the necessary locks; if so, don't mess it up. ! */ ! static void ! InvalidateBuffer(BufferDesc *buf) ! { ! BufferTag oldTag; ! BufFlags oldFlags; ! ! /* Save the original buffer tag before dropping the spinlock */ ! oldTag = buf->tag; + UnlockBufHdr(buf); + + retry: /* ! * Acquire exclusive mapping lock in preparation for changing ! * the buffer's association. 
*/ + LWLockAcquire(BufMappingLock, LW_EXCLUSIVE); + + /* Re-lock the buffer header (NoHoldoff since we have an LWLock) */ + LockBufHdr_NoHoldoff(buf); + + /* If it's changed while we were waiting for lock, do nothing */ + if (!BUFFERTAGS_EQUAL(buf->tag, oldTag)) + { + UnlockBufHdr_NoHoldoff(buf); + LWLockRelease(BufMappingLock); + return; + } /* ! * We assume the only reason for it to be pinned is that someone else ! * is flushing the page out. Wait for them to finish. (This could be ! * an infinite loop if the refcount is messed up... it would be nice ! * to time out after awhile, but there seems no way to be sure how ! * many loops may be needed. Note that if the other guy has pinned ! * the buffer but not yet done StartBufferIO, WaitIO will fall through ! * and we'll effectively be busy-looping here.) */ ! if (buf->refcount != 0) ! { ! UnlockBufHdr_NoHoldoff(buf); ! LWLockRelease(BufMappingLock); ! WaitIO(buf); ! goto retry; ! } /* ! * Clear out the buffer's tag and flags. We must do this to ensure ! * that linear scans of the buffer array don't think the buffer is valid. */ ! oldFlags = buf->flags; ! CLEAR_BUFFERTAG(buf->tag); ! buf->flags = 0; ! UnlockBufHdr_NoHoldoff(buf); ! /* ! * Remove the buffer from the lookup hashtable, if it was in there. ! */ ! if (oldFlags & BM_TAG_VALID) ! BufTableDelete(&oldTag); ! ! /* ! * Avoid accepting a cancel interrupt when we release the mapping lock; ! * that would leave the buffer free but not on the freelist. (Which would ! * not be fatal, since it'd get picked up again by the clock scanning ! * code, but we'd rather be sure it gets to the freelist.) ! */ ! HOLD_INTERRUPTS(); ! ! LWLockRelease(BufMappingLock); ! ! /* ! * Insert the buffer at the head of the list of free buffers. ! */ ! StrategyFreeBuffer(buf, true); ! ! RESUME_INTERRUPTS(); } /* *************** *** 494,500 **** * WriteBuffer and WriteNoReleaseBuffer */ static void ! 
write_buffer(Buffer buffer, bool release) { BufferDesc *bufHdr; --- 589,595 ---- * WriteBuffer and WriteNoReleaseBuffer */ static void ! write_buffer(Buffer buffer, bool unpin) { BufferDesc *bufHdr; *************** *** 503,509 **** if (BufferIsLocal(buffer)) { ! WriteLocalBuffer(buffer, release); return; } --- 598,604 ---- if (BufferIsLocal(buffer)) { ! WriteLocalBuffer(buffer, unpin); return; } *************** *** 511,517 **** Assert(PrivateRefCount[buffer - 1] > 0); ! LWLockAcquire(BufMgrLock, LW_EXCLUSIVE); Assert(bufHdr->refcount > 0); /* --- 606,613 ---- Assert(PrivateRefCount[buffer - 1] > 0); ! LockBufHdr(bufHdr); ! Assert(bufHdr->refcount > 0); /* *************** *** 522,530 **** bufHdr->flags |= (BM_DIRTY | BM_JUST_DIRTIED); ! if (release) ! UnpinBuffer(bufHdr, true); ! LWLockRelease(BufMgrLock); } /* --- 618,627 ---- bufHdr->flags |= (BM_DIRTY | BM_JUST_DIRTIED); ! UnlockBufHdr(bufHdr); ! ! if (unpin) ! UnpinBuffer(bufHdr, true, true); } /* *************** *** 555,575 **** /* * ReleaseAndReadBuffer -- combine ReleaseBuffer() and ReadBuffer() - * to save a lock release/acquire. * ! * Also, if the passed buffer is valid and already contains the desired block ! * number, we simply return it without ever acquiring the lock at all. ! * Since the passed buffer must be pinned, it's OK to examine its block ! * number without getting the lock first. * * Note: it is OK to pass buffer == InvalidBuffer, indicating that no old * buffer actually needs to be released. This case is the same as ReadBuffer, * but can save some tests in the caller. - * - * Also note: while it will work to call this routine with blockNum == P_NEW, - * it's best to avoid doing so, since that would result in calling - * smgrnblocks() while holding the bufmgr lock, hence some loss of - * concurrency. */ Buffer ReleaseAndReadBuffer(Buffer buffer, --- 652,667 ---- /* * ReleaseAndReadBuffer -- combine ReleaseBuffer() and ReadBuffer() * ! 
* Formerly, this saved one cycle of acquiring/releasing the BufMgrLock ! * compared to calling the two routines separately. Now it's mainly just ! * a convenience function. However, if the passed buffer is valid and ! * already contains the desired block, we just return it as-is; and that ! * does save considerable work compared to a full release and reacquire. * * Note: it is OK to pass buffer == InvalidBuffer, indicating that no old * buffer actually needs to be released. This case is the same as ReadBuffer, * but can save some tests in the caller. */ Buffer ReleaseAndReadBuffer(Buffer buffer, *************** *** 588,822 **** RelFileNodeEquals(bufHdr->tag.rnode, relation->rd_node)) return buffer; ResourceOwnerForgetBuffer(CurrentResourceOwner, buffer); - /* owner now has a free slot, so no need for Enlarge() */ LocalRefCount[-buffer - 1]--; } else { Assert(PrivateRefCount[buffer - 1] > 0); bufHdr = &BufferDescriptors[buffer - 1]; if (bufHdr->tag.blockNum == blockNum && RelFileNodeEquals(bufHdr->tag.rnode, relation->rd_node)) return buffer; ! ResourceOwnerForgetBuffer(CurrentResourceOwner, buffer); ! /* owner now has a free slot, so no need for Enlarge() */ ! if (PrivateRefCount[buffer - 1] > 1) ! PrivateRefCount[buffer - 1]--; ! else ! { ! LWLockAcquire(BufMgrLock, LW_EXCLUSIVE); ! UnpinBuffer(bufHdr, false); ! return ReadBufferInternal(relation, blockNum, true); ! } } } - else - ResourceOwnerEnlargeBuffers(CurrentResourceOwner); ! return ReadBufferInternal(relation, blockNum, false); } /* * PinBuffer -- make buffer unavailable for replacement. * * This should be applied only to shared buffers, never local ones. - * Bufmgr lock must be held by caller. * - * Most but not all callers want CurrentResourceOwner to be adjusted. * Note that ResourceOwnerEnlargeBuffers must have been done already. */ static void ! PinBuffer(BufferDesc *buf, bool fixOwner) { ! 
int b = BufferDescriptorGetBuffer(buf) - 1; if (PrivateRefCount[b] == 0) buf->refcount++; PrivateRefCount[b]++; Assert(PrivateRefCount[b] > 0); ! if (fixOwner) ! ResourceOwnerRememberBuffer(CurrentResourceOwner, ! BufferDescriptorGetBuffer(buf)); } /* * UnpinBuffer -- make buffer available for replacement. * * This should be applied only to shared buffers, never local ones. - * Bufmgr lock must be held by caller. * * Most but not all callers want CurrentResourceOwner to be adjusted. */ static void ! UnpinBuffer(BufferDesc *buf, bool fixOwner) { ! int b = BufferDescriptorGetBuffer(buf) - 1; if (fixOwner) ResourceOwnerForgetBuffer(CurrentResourceOwner, BufferDescriptorGetBuffer(buf)); - Assert(buf->refcount > 0); Assert(PrivateRefCount[b] > 0); PrivateRefCount[b]--; if (PrivateRefCount[b] == 0) { ! buf->refcount--; /* I'd better not still hold any locks on the buffer */ ! Assert(!LWLockHeldByMe(buf->cntx_lock)); Assert(!LWLockHeldByMe(buf->io_in_progress_lock)); - } ! if ((buf->flags & BM_PIN_COUNT_WAITER) != 0 && ! buf->refcount == 1) ! { ! /* we just released the last pin other than the waiter's */ ! buf->flags &= ~BM_PIN_COUNT_WAITER; ! ProcSendSignal(buf->wait_backend_id); } ! else { ! /* do nothing */ } } /* ! * BufferSync -- Write out dirty buffers in the pool. * ! * This is called at checkpoint time to write out all dirty shared buffers, ! * and by the background writer process to write out some of the dirty blocks. ! * percent/maxpages should be -1 in the former case, and limit values (>= 0) ! * in the latter. * * Returns the number of buffers written. */ int ! BufferSync(int percent, int maxpages) { ! BufferDesc **dirty_buffers; ! BufferTag *buftags; ! int num_buffer_dirty; ! int i; /* If either limit is zero then we are disabled from doing anything... */ ! if (percent == 0 || maxpages == 0) return 0; ! /* ! * Get a list of all currently dirty buffers and how many there are. ! * We do not flush buffers that get dirtied after we started. They ! 
* have to wait until the next checkpoint. ! */ ! dirty_buffers = (BufferDesc **) palloc(NBuffers * sizeof(BufferDesc *)); ! buftags = (BufferTag *) palloc(NBuffers * sizeof(BufferTag)); ! ! LWLockAcquire(BufMgrLock, LW_EXCLUSIVE); ! num_buffer_dirty = StrategyDirtyBufferList(dirty_buffers, buftags, ! NBuffers); /* ! * If called by the background writer, we are usually asked to only ! * write out some portion of dirty buffers now, to prevent the IO ! * storm at checkpoint time. */ ! if (percent > 0) ! { ! Assert(percent <= 100); ! num_buffer_dirty = (num_buffer_dirty * percent + 99) / 100; ! } ! if (maxpages > 0 && num_buffer_dirty > maxpages) ! num_buffer_dirty = maxpages; ! /* Make sure we can handle the pin inside the loop */ ResourceOwnerEnlargeBuffers(CurrentResourceOwner); /* ! * Loop over buffers to be written. Note the BufMgrLock is held at ! * loop top, but is released and reacquired within FlushBuffer, so we ! * aren't holding it long. */ ! for (i = 0; i < num_buffer_dirty; i++) { ! BufferDesc *bufHdr = dirty_buffers[i]; ! ! /* ! * Check it is still the same page and still needs writing. ! * ! * We can check bufHdr->cntxDirty here *without* holding any lock on ! * buffer context as long as we set this flag in access methods ! * *before* logging changes with XLogInsert(): if someone will set ! * cntxDirty just after our check we don't worry because of our ! * checkpoint.redo points before log record for upcoming changes ! * and so we are not required to write such dirty buffer. ! */ ! if (!(bufHdr->flags & BM_VALID)) ! continue; ! if (!BUFFERTAGS_EQUAL(bufHdr->tag, buftags[i])) ! continue; ! if (!(bufHdr->flags & BM_DIRTY || bufHdr->cntxDirty)) ! continue; ! ! /* ! * IO synchronization. Note that we do it with unpinned buffer to ! * avoid conflicts with FlushRelationBuffers. ! */ ! if (bufHdr->flags & BM_IO_IN_PROGRESS) { ! WaitIO(bufHdr); ! /* Still need writing? */ ! if (!(bufHdr->flags & BM_VALID)) ! continue; ! 
if (!BUFFERTAGS_EQUAL(bufHdr->tag, buftags[i])) ! continue; ! if (!(bufHdr->flags & BM_DIRTY || bufHdr->cntxDirty)) ! continue; } ! /* ! * Here: no one doing IO for this buffer and it's dirty. Pin ! * buffer now and set IO state for it *before* acquiring shlock to ! * avoid conflicts with FlushRelationBuffers. ! */ ! PinBuffer(bufHdr, true); ! StartBufferIO(bufHdr, false); ! ! FlushBuffer(bufHdr, NULL, false); ! TerminateBufferIO(bufHdr, 0); ! UnpinBuffer(bufHdr, true); ! } ! LWLockRelease(BufMgrLock); ! pfree(dirty_buffers); ! pfree(buftags); ! return num_buffer_dirty; } /* ! * WaitIO -- Block until the IO_IN_PROGRESS flag on 'buf' is cleared. * ! * Should be entered with buffer manager lock held; releases it before ! * waiting and re-acquires it afterwards. */ ! static void ! WaitIO(BufferDesc *buf) { /* ! * Changed to wait until there's no IO - Inoue 01/13/2000 * ! * Note this is *necessary* because an error abort in the process doing ! * I/O could release the io_in_progress_lock prematurely. See ! * AbortBufferIO. */ ! while ((buf->flags & BM_IO_IN_PROGRESS) != 0) { ! LWLockRelease(BufMgrLock); ! LWLockAcquire(buf->io_in_progress_lock, LW_SHARED); ! LWLockRelease(buf->io_in_progress_lock); ! LWLockAcquire(BufMgrLock, LW_EXCLUSIVE); } } --- 680,994 ---- RelFileNodeEquals(bufHdr->tag.rnode, relation->rd_node)) return buffer; ResourceOwnerForgetBuffer(CurrentResourceOwner, buffer); LocalRefCount[-buffer - 1]--; + bufHdr->flags |= BM_RECENTLY_USED; } else { Assert(PrivateRefCount[buffer - 1] > 0); bufHdr = &BufferDescriptors[buffer - 1]; + /* we have pin, so it's ok to examine tag without spinlock */ if (bufHdr->tag.blockNum == blockNum && RelFileNodeEquals(bufHdr->tag.rnode, relation->rd_node)) return buffer; ! UnpinBuffer(bufHdr, true, true); } } ! return ReadBuffer(relation, blockNum); } /* * PinBuffer -- make buffer unavailable for replacement. * * This should be applied only to shared buffers, never local ones. 
* * Note that ResourceOwnerEnlargeBuffers must have been done already. + * + * Returns TRUE if buffer is BM_VALID, else FALSE. This provision allows + * some callers to avoid an extra spinlock cycle. + */ + static bool + PinBuffer(BufferDesc *buf) + { + int b = buf->buf_id; + bool result; + + if (PrivateRefCount[b] == 0) + { + /* + * Use NoHoldoff here because we don't want the unlock to be a + * potential place to honor a QueryCancel request. + * (The caller should be holding off interrupts anyway.) + */ + LockBufHdr_NoHoldoff(buf); + buf->refcount++; + result = (buf->flags & BM_VALID) != 0; + UnlockBufHdr_NoHoldoff(buf); + } + else + { + /* If we previously pinned the buffer, it must surely be valid */ + result = true; + } + PrivateRefCount[b]++; + Assert(PrivateRefCount[b] > 0); + ResourceOwnerRememberBuffer(CurrentResourceOwner, + BufferDescriptorGetBuffer(buf)); + return result; + } + + /* + * PinBuffer_Locked -- as above, but caller already locked the buffer header. + * The spinlock is released before return. + * + * Note: use of this routine is frequently mandatory, not just an optimization + * to save a spin lock/unlock cycle, because we need to pin a buffer before + * its state can change under us. */ static void ! PinBuffer_Locked(BufferDesc *buf) { ! int b = buf->buf_id; if (PrivateRefCount[b] == 0) buf->refcount++; + /* NoHoldoff since we mustn't accept cancel interrupt here */ + UnlockBufHdr_NoHoldoff(buf); PrivateRefCount[b]++; Assert(PrivateRefCount[b] > 0); ! ResourceOwnerRememberBuffer(CurrentResourceOwner, ! BufferDescriptorGetBuffer(buf)); ! /* Now we can accept cancel */ ! RESUME_INTERRUPTS(); } /* * UnpinBuffer -- make buffer available for replacement. * * This should be applied only to shared buffers, never local ones. * * Most but not all callers want CurrentResourceOwner to be adjusted. + * + * If we are releasing a buffer during VACUUM, and it's not been otherwise + * used recently, and trashOK is true, send the buffer to the freelist. 
*/ static void ! UnpinBuffer(BufferDesc *buf, bool fixOwner, bool trashOK) { ! int b = buf->buf_id; if (fixOwner) ResourceOwnerForgetBuffer(CurrentResourceOwner, BufferDescriptorGetBuffer(buf)); Assert(PrivateRefCount[b] > 0); PrivateRefCount[b]--; if (PrivateRefCount[b] == 0) { ! bool trash_buffer = false; ! /* I'd better not still hold any locks on the buffer */ ! Assert(!LWLockHeldByMe(buf->content_lock)); Assert(!LWLockHeldByMe(buf->io_in_progress_lock)); ! /* NoHoldoff ensures we don't lose control before sending signal */ ! LockBufHdr_NoHoldoff(buf); ! ! /* Decrement the shared reference count */ ! Assert(buf->refcount > 0); ! buf->refcount--; ! ! /* Mark the buffer recently used, unless we are in VACUUM */ ! if (!strategy_hint_vacuum) ! buf->flags |= BM_RECENTLY_USED; ! else if (trashOK && ! buf->refcount == 0 && ! !(buf->flags & BM_RECENTLY_USED)) ! trash_buffer = true; ! ! if ((buf->flags & BM_PIN_COUNT_WAITER) && ! buf->refcount == 1) ! { ! /* we just released the last pin other than the waiter's */ ! BackendId wait_backend_id = buf->wait_backend_id; ! ! buf->flags &= ~BM_PIN_COUNT_WAITER; ! UnlockBufHdr_NoHoldoff(buf); ! ProcSendSignal(wait_backend_id); ! } ! else ! UnlockBufHdr_NoHoldoff(buf); ! ! /* ! * If VACUUM is releasing an otherwise-unused buffer, send it to ! * the freelist for near-term reuse. We put it at the tail so that ! * it won't be used before any invalid buffers that may exist. ! */ ! if (trash_buffer) ! StrategyFreeBuffer(buf, false); } ! } ! ! /* ! * BufferSync -- Write out all dirty buffers in the pool. ! * ! * This is called at checkpoint time to write out all dirty shared buffers. ! */ ! void ! BufferSync(void) ! { ! int buf_id; ! int num_to_scan; ! ! /* ! * Find out where to start the circular scan. ! */ ! buf_id = StrategySyncStart(); ! ! /* Make sure we can handle the pin inside SyncOneBuffer */ ! ResourceOwnerEnlargeBuffers(CurrentResourceOwner); ! ! /* ! * Loop over all buffers. ! */ ! num_to_scan = NBuffers; ! 
while (num_to_scan-- > 0) { ! (void) SyncOneBuffer(buf_id, false); ! if (++buf_id >= NBuffers) ! buf_id = 0; } } /* ! * BgBufferSync -- Write out some dirty buffers in the pool. * ! * This is called periodically by the background writer process. * * Returns the number of buffers written. */ int ! BgBufferSync(void) { ! static int buf_id1 = 0; ! int percent = BgWriterPercent; ! int maxpages = BgWriterMaxPages; ! int buf_id2; ! int num_to_scan; ! int num_written; /* If either limit is zero then we are disabled from doing anything... */ ! if (percent <= 0 || maxpages <= 0) return 0; ! Assert(percent <= 100); ! num_to_scan = (NBuffers * percent + 99) / 100; /* ! * To minimize work at checkpoint time, we want to try to keep all the ! * buffers clean; this motivates a scan that proceeds sequentially through ! * all buffers. But we are also charged with ensuring that buffers that ! * will be recycled soon are clean when needed; these buffers are the ! * ones just ahead of the StrategySyncStart point. We make a separate ! * scan through those. ! * ! * Currently we divide our attention equally between these tasks, but ! * probably it would be better to have two sets of control parameters. */ ! buf_id2 = StrategySyncStart(); ! /* Make sure we can handle the pin inside SyncOneBuffer */ ResourceOwnerEnlargeBuffers(CurrentResourceOwner); /* ! * Loop over buffers to be written. */ ! num_written = 0; ! while (num_to_scan-- > 0) { ! /* Loop over all buffers, including pinned ones */ ! if (SyncOneBuffer(buf_id1, false)) { ! num_written++; ! if (num_written >= maxpages) ! break; } ! if (++buf_id1 >= NBuffers) ! buf_id1 = 0; ! if (num_to_scan-- == 0) ! break; ! /* Loop over soon-to-be-reused buffers, skipping pinned ones */ ! if (SyncOneBuffer(buf_id2, true)) ! { ! num_written++; ! if (num_written >= maxpages) ! break; ! } ! if (++buf_id2 >= NBuffers) ! buf_id2 = 0; ! } ! return num_written; } /* ! * SyncOneBuffer -- process a single buffer during syncing. ! * ! 
* If skip_pinned is true, we don't write currently-pinned buffers, nor ! * buffers marked RECENTLY_USED, as these are not replacement candidates. ! * ! * Returns true if buffer was written, else false. (This could be in error ! * if FlushBuffer finds the buffer clean after locking it, but we don't ! * care all that much.) * ! * Note: caller must have done ResourceOwnerEnlargeBuffers. */ ! static bool ! SyncOneBuffer(int buf_id, bool skip_pinned) { + BufferDesc *bufHdr = &BufferDescriptors[buf_id]; + /* ! * Check whether buffer needs writing. * ! * We can make this check without taking the buffer content lock ! * so long as we mark pages dirty in access methods *before* logging ! * changes with XLogInsert(): if someone marks the buffer dirty ! * just after our check we don't worry because our checkpoint.redo ! * points before log record for upcoming changes and so we are not ! * required to write such dirty buffer. */ ! LockBufHdr(bufHdr); ! if (!(bufHdr->flags & BM_VALID) || !(bufHdr->flags & BM_DIRTY)) { ! UnlockBufHdr(bufHdr); ! return false; ! } ! if (skip_pinned && ! (bufHdr->refcount != 0 || bufHdr->flags & BM_RECENTLY_USED)) ! { ! UnlockBufHdr(bufHdr); ! return false; } + + /* + * Pin it, share-lock it, write it. (FlushBuffer will do nothing + * if the buffer is clean by the time we've locked it.) + */ + PinBuffer_Locked(bufHdr); + LWLockAcquire(bufHdr->content_lock, LW_SHARED); + + FlushBuffer(bufHdr, NULL); + + LWLockRelease(bufHdr->content_lock); + UnpinBuffer(bufHdr, true, false /* don't change freelist */ ); + + return true; } *************** *** 888,893 **** --- 1060,1068 ---- AtEOXact_LocalBuffers(isCommit); #endif + + /* Make sure we reset the strategy hint in case VACUUM errored out */ + StrategyHintVacuum(false); } /* *************** *** 912,920 **** * here, it suggests that ResourceOwners are messed up. */ PrivateRefCount[i] = 1; /* make sure we release shared pin */ ! LWLockAcquire(BufMgrLock, LW_EXCLUSIVE); ! UnpinBuffer(buf, false); ! 
LWLockRelease(BufMgrLock); Assert(PrivateRefCount[i] == 0); } } --- 1087,1093 ---- * here, it suggests that ResourceOwners are messed up. */ PrivateRefCount[i] = 1; /* make sure we release shared pin */ ! UnpinBuffer(buf, false, false /* don't change freelist */ ); Assert(PrivateRefCount[i] == 0); } } *************** *** 941,946 **** --- 1114,1120 ---- loccount = PrivateRefCount[buffer - 1]; } + /* theoretically we should lock the bufhdr here */ elog(WARNING, "buffer refcount leak: [%03d] " "(rel=%u/%u/%u, blockNum=%u, flags=0x%x, refcount=%u %d)", *************** *** 961,967 **** void FlushBufferPool(void) { ! BufferSync(-1, -1); smgrsync(); } --- 1135,1141 ---- void FlushBufferPool(void) { ! BufferSync(); smgrsync(); } *************** *** 988,999 **** BlockNumber BufferGetBlockNumber(Buffer buffer) { Assert(BufferIsPinned(buffer)); if (BufferIsLocal(buffer)) ! return LocalBufferDescriptors[-buffer - 1].tag.blockNum; else ! return BufferDescriptors[buffer - 1].tag.blockNum; } /* --- 1162,1178 ---- BlockNumber BufferGetBlockNumber(Buffer buffer) { + BufferDesc *bufHdr; + Assert(BufferIsPinned(buffer)); if (BufferIsLocal(buffer)) ! bufHdr = &(LocalBufferDescriptors[-buffer - 1]); else ! bufHdr = &BufferDescriptors[buffer - 1]; ! ! /* pinned, so OK to read tag without spinlock */ ! return bufHdr->tag.blockNum; } /* *************** *** 1013,1019 **** else bufHdr = &BufferDescriptors[buffer - 1]; ! return (bufHdr->tag.rnode); } /* --- 1192,1198 ---- else bufHdr = &BufferDescriptors[buffer - 1]; ! return bufHdr->tag.rnode; } /* *************** *** 1026,1066 **** * However, we will need to force the changes to disk via fsync before * we can checkpoint WAL. * ! * BufMgrLock must be held at entry, and the buffer must be pinned. The ! * caller is also responsible for doing StartBufferIO/TerminateBufferIO. * * If the caller has an smgr reference for the buffer's relation, pass it ! * as the second parameter. If not, pass NULL. (Do not open relation ! 
* while holding BufMgrLock!) ! * ! * When earlylock is TRUE, we grab the per-buffer sharelock before releasing ! * BufMgrLock, rather than after. Normally this would be a bad idea since ! * we might deadlock, but it is safe and necessary when called from ! * BufferAlloc() --- see comments therein. */ static void ! FlushBuffer(BufferDesc *buf, SMgrRelation reln, bool earlylock) { - Buffer buffer = BufferDescriptorGetBuffer(buf); XLogRecPtr recptr; ErrorContextCallback errcontext; - /* Transpose cntxDirty into flags while holding BufMgrLock */ - buf->cntxDirty = false; - buf->flags |= BM_DIRTY; - - /* To check if block content changed while flushing. - vadim 01/17/97 */ - buf->flags &= ~BM_JUST_DIRTIED; - /* ! * If earlylock, grab buffer sharelock before anyone else could re-lock ! * the buffer. */ ! if (earlylock) ! LockBuffer(buffer, BUFFER_LOCK_SHARE); ! ! /* Release BufMgrLock while doing xlog work */ ! LWLockRelease(BufMgrLock); /* Setup error traceback support for ereport() */ errcontext.callback = buffer_write_error_callback; --- 1205,1232 ---- * However, we will need to force the changes to disk via fsync before * we can checkpoint WAL. * ! * The caller must hold a pin on the buffer and have share-locked the ! * buffer contents. (Note: a share-lock does not prevent updates of ! * hint bits in the buffer, so the page could change while the write ! * is in progress, but we assume that that will not invalidate the data ! * written.) * * If the caller has an smgr reference for the buffer's relation, pass it ! * as the second parameter. If not, pass NULL. */ static void ! FlushBuffer(BufferDesc *buf, SMgrRelation reln) { XLogRecPtr recptr; ErrorContextCallback errcontext; /* ! * Acquire the buffer's io_in_progress lock. If StartBufferIO returns ! * false, then someone else flushed the buffer before we could, so ! * we need not do anything. */ ! if (!StartBufferIO(buf, false)) ! 
return; /* Setup error traceback support for ereport() */ errcontext.callback = buffer_write_error_callback; *************** *** 1068,1087 **** errcontext.previous = error_context_stack; error_context_stack = &errcontext; ! /* Find smgr relation for buffer while holding minimal locks */ if (reln == NULL) reln = smgropen(buf->tag.rnode); /* ! * Protect buffer content against concurrent update. (Note that ! * hint-bit updates can still occur while the write is in progress, ! * but we assume that that will not invalidate the data written.) ! */ ! if (!earlylock) ! LockBuffer(buffer, BUFFER_LOCK_SHARE); ! ! /* ! * Force XLOG flush for buffer' LSN. This implements the basic WAL * rule that log updates must hit disk before any of the data-file * changes they describe do. */ --- 1234,1245 ---- errcontext.previous = error_context_stack; error_context_stack = &errcontext; ! /* Find smgr relation for buffer */ if (reln == NULL) reln = smgropen(buf->tag.rnode); /* ! * Force XLOG flush up to buffer's LSN. This implements the basic WAL * rule that log updates must hit disk before any of the data-file * changes they describe do. */ *************** *** 1090,1124 **** /* * Now it's safe to write buffer to disk. Note that no one else should ! * have been able to write it while we were busy with locking and log ! * flushing because caller has set the IO flag. ! * ! * It would be better to clear BM_JUST_DIRTIED right here, but we'd have ! * to reacquire the BufMgrLock and it doesn't seem worth it. */ smgrwrite(reln, buf->tag.blockNum, ! (char *) MAKE_PTR(buf->data), false); - /* Pop the error context stack */ - error_context_stack = errcontext.previous; - - /* - * Release the per-buffer readlock, reacquire BufMgrLock. - */ - LockBuffer(buffer, BUFFER_LOCK_UNLOCK); - - LWLockAcquire(BufMgrLock, LW_EXCLUSIVE); - BufferFlushCount++; /* ! * If this buffer was marked by someone as DIRTY while we were ! * flushing it out we must not clear DIRTY flag - vadim 01/17/97 */ ! 
if (!(buf->flags & BM_JUST_DIRTIED)) ! buf->flags &= ~BM_DIRTY; } /* --- 1248,1277 ---- /* * Now it's safe to write buffer to disk. Note that no one else should ! * have been able to write it while we were busy with log flushing ! * because we have the io_in_progress lock. */ + + /* To check if block content changes while flushing. - vadim 01/17/97 */ + LockBufHdr_NoHoldoff(buf); + buf->flags &= ~BM_JUST_DIRTIED; + UnlockBufHdr_NoHoldoff(buf); + smgrwrite(reln, buf->tag.blockNum, ! (char *) BufHdrGetBlock(buf), false); BufferFlushCount++; /* ! * Mark the buffer as clean (unless BM_JUST_DIRTIED has become set) ! * and end the io_in_progress state. */ ! TerminateBufferIO(buf, true, 0); ! ! /* Pop the error context stack */ ! error_context_stack = errcontext.previous; } /* *************** *** 1210,1271 **** bufHdr->tag.rnode.dbNode, bufHdr->tag.rnode.relNode, LocalRefCount[i]); ! bufHdr->flags &= ~(BM_DIRTY | BM_JUST_DIRTIED); ! bufHdr->cntxDirty = false; ! bufHdr->tag.rnode.relNode = InvalidOid; } } return; } ! LWLockAcquire(BufMgrLock, LW_EXCLUSIVE); ! ! for (i = 1; i <= NBuffers; i++) { ! bufHdr = &BufferDescriptors[i - 1]; ! recheck: if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) && bufHdr->tag.blockNum >= firstDelBlock) ! { ! /* ! * If there is I/O in progress, better wait till it's done; ! * don't want to delete the relation out from under someone ! * who's just trying to flush the buffer! ! */ ! if (bufHdr->flags & BM_IO_IN_PROGRESS) ! { ! WaitIO(bufHdr); ! ! /* ! * By now, the buffer very possibly belongs to some other ! * rel, so check again before proceeding. ! */ ! goto recheck; ! } ! ! /* ! * There should be no pin on the buffer. ! */ ! if (bufHdr->refcount != 0) ! elog(ERROR, "block %u of %u/%u/%u is still referenced (private %d, global %u)", ! bufHdr->tag.blockNum, ! bufHdr->tag.rnode.spcNode, ! bufHdr->tag.rnode.dbNode, ! bufHdr->tag.rnode.relNode, ! PrivateRefCount[i - 1], bufHdr->refcount); ! ! /* Now we can do what we came for */ ! 
bufHdr->flags &= ~(BM_DIRTY | BM_JUST_DIRTIED); ! bufHdr->cntxDirty = false; ! ! /* ! * And mark the buffer as no longer occupied by this rel. ! */ ! StrategyInvalidateBuffer(bufHdr); ! } } - - LWLockRelease(BufMgrLock); } /* --------------------------------------------------------------------- --- 1363,1385 ---- bufHdr->tag.rnode.dbNode, bufHdr->tag.rnode.relNode, LocalRefCount[i]); ! CLEAR_BUFFERTAG(bufHdr->tag); ! bufHdr->flags = 0; } } return; } ! for (i = 0; i < NBuffers; i++) { ! bufHdr = &BufferDescriptors[i]; ! LockBufHdr(bufHdr); if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) && bufHdr->tag.blockNum >= firstDelBlock) ! InvalidateBuffer(bufHdr); /* releases spinlock */ ! else ! UnlockBufHdr(bufHdr); } } /* --------------------------------------------------------------------- *************** *** 1285,1331 **** int i; BufferDesc *bufHdr; ! LWLockAcquire(BufMgrLock, LW_EXCLUSIVE); ! for (i = 1; i <= NBuffers; i++) { ! bufHdr = &BufferDescriptors[i - 1]; ! recheck: if (bufHdr->tag.rnode.dbNode == dbid) ! { ! /* ! * If there is I/O in progress, better wait till it's done; ! * don't want to delete the database out from under someone ! * who's just trying to flush the buffer! ! */ ! if (bufHdr->flags & BM_IO_IN_PROGRESS) ! { ! WaitIO(bufHdr); ! ! /* ! * By now, the buffer very possibly belongs to some other ! * DB, so check again before proceeding. ! */ ! goto recheck; ! } ! /* Now we can do what we came for */ ! bufHdr->flags &= ~(BM_DIRTY | BM_JUST_DIRTIED); ! bufHdr->cntxDirty = false; ! ! /* ! * The thing should be free, if caller has checked that no ! * backends are running in that database. ! */ ! Assert(bufHdr->refcount == 0); ! ! /* ! * And mark the buffer as no longer occupied by this page. ! */ ! StrategyInvalidateBuffer(bufHdr); ! } } - - LWLockRelease(BufMgrLock); } /* ----------------------------------------------------------------- --- 1399,1418 ---- int i; BufferDesc *bufHdr; ! /* ! 
* We needn't consider local buffers, since by assumption the target ! * database isn't our own. ! */ ! for (i = 0; i < NBuffers; i++) { ! bufHdr = &BufferDescriptors[i]; ! LockBufHdr(bufHdr); if (bufHdr->tag.rnode.dbNode == dbid) ! InvalidateBuffer(bufHdr); /* releases spinlock */ ! else ! UnlockBufHdr(bufHdr); } } /* ----------------------------------------------------------------- *************** *** 1342,1373 **** int i; BufferDesc *buf = BufferDescriptors; ! if (IsUnderPostmaster) ! { ! LWLockAcquire(BufMgrLock, LW_EXCLUSIVE); ! for (i = 0; i < NBuffers; ++i, ++buf) ! { ! elog(LOG, ! "[%02d] (freeNext=%d, freePrev=%d, rel=%u/%u/%u, " ! "blockNum=%u, flags=0x%x, refcount=%u %d)", ! i, buf->freeNext, buf->freePrev, ! buf->tag.rnode.spcNode, buf->tag.rnode.dbNode, ! buf->tag.rnode.relNode, ! buf->tag.blockNum, buf->flags, ! buf->refcount, PrivateRefCount[i]); ! } ! LWLockRelease(BufMgrLock); ! } ! else { ! /* interactive backend */ ! for (i = 0; i < NBuffers; ++i, ++buf) ! { ! printf("[%-2d] (%u/%u/%u, %u) flags=0x%x, refcount=%u %d)\n", ! i, buf->tag.rnode.spcNode, buf->tag.rnode.dbNode, ! buf->tag.rnode.relNode, buf->tag.blockNum, ! buf->flags, buf->refcount, PrivateRefCount[i]); ! } } } #endif --- 1429,1445 ---- int i; BufferDesc *buf = BufferDescriptors; ! for (i = 0; i < NBuffers; ++i, ++buf) { ! /* theoretically we should lock the bufhdr here */ ! elog(LOG, ! "[%02d] (freeNext=%d, rel=%u/%u/%u, " ! "blockNum=%u, flags=0x%x, refcount=%u %d)", ! i, buf->freeNext, ! buf->tag.rnode.spcNode, buf->tag.rnode.dbNode, ! buf->tag.rnode.relNode, ! buf->tag.blockNum, buf->flags, ! buf->refcount, PrivateRefCount[i]); } } #endif *************** *** 1379,1398 **** int i; BufferDesc *buf = BufferDescriptors; - LWLockAcquire(BufMgrLock, LW_EXCLUSIVE); for (i = 0; i < NBuffers; ++i, ++buf) { if (PrivateRefCount[i] > 0) ! elog(NOTICE, ! "[%02d] (freeNext=%d, freePrev=%d, rel=%u/%u/%u, " "blockNum=%u, flags=0x%x, refcount=%u %d)", ! 
i, buf->freeNext, buf->freePrev, buf->tag.rnode.spcNode, buf->tag.rnode.dbNode, buf->tag.rnode.relNode, buf->tag.blockNum, buf->flags, buf->refcount, PrivateRefCount[i]); } - LWLockRelease(BufMgrLock); } #endif --- 1451,1471 ---- int i; BufferDesc *buf = BufferDescriptors; for (i = 0; i < NBuffers; ++i, ++buf) { if (PrivateRefCount[i] > 0) ! { ! /* theoretically we should lock the bufhdr here */ ! elog(LOG, ! "[%02d] (freeNext=%d, rel=%u/%u/%u, " "blockNum=%u, flags=0x%x, refcount=%u %d)", ! i, buf->freeNext, buf->tag.rnode.spcNode, buf->tag.rnode.dbNode, buf->tag.rnode.relNode, buf->tag.blockNum, buf->flags, buf->refcount, PrivateRefCount[i]); + } } } #endif *************** *** 1451,1458 **** bufHdr = &LocalBufferDescriptors[i]; if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node)) { ! if ((bufHdr->flags & BM_VALID) && ! (bufHdr->flags & BM_DIRTY || bufHdr->cntxDirty)) { ErrorContextCallback errcontext; --- 1524,1530 ---- bufHdr = &LocalBufferDescriptors[i]; if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node)) { ! if ((bufHdr->flags & BM_VALID) && (bufHdr->flags & BM_DIRTY)) { ErrorContextCallback errcontext; *************** *** 1464,1474 **** smgrwrite(rel->rd_smgr, bufHdr->tag.blockNum, ! (char *) MAKE_PTR(bufHdr->data), true); bufHdr->flags &= ~(BM_DIRTY | BM_JUST_DIRTIED); - bufHdr->cntxDirty = false; /* Pop the error context stack */ error_context_stack = errcontext.previous; --- 1536,1545 ---- smgrwrite(rel->rd_smgr, bufHdr->tag.blockNum, ! (char *) LocalBufHdrGetBlock(bufHdr), true); bufHdr->flags &= ~(BM_DIRTY | BM_JUST_DIRTIED); /* Pop the error context stack */ error_context_stack = errcontext.previous; *************** *** 1478,1484 **** RelationGetRelationName(rel), firstDelBlock, |
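For anyone who wants to poke at the new BgBufferSync scan order from the patch above without building a patched backend, here is a rough stand-alone sketch of the control flow only. It is not the patch code: ToyBuf, pool, NBUFFERS, toy_sync_one and toy_bg_sync are hypothetical names invented for this illustration, the "write" is simulated by clearing a dirty flag, and all locking, pinning and ResourceOwner handling is left out. The second cursor is passed in by the caller, standing in for whatever StrategySyncStart() would return.

```c
#include <assert.h>
#include <stdbool.h>

#define NBUFFERS 8              /* toy pool size (hypothetical) */

/* Hypothetical stand-in for the few BufferDesc fields the scan looks at */
typedef struct ToyBuf
{
    bool        valid;          /* plays the role of BM_VALID */
    bool        dirty;          /* plays the role of BM_DIRTY */
    int         refcount;       /* shared pin count */
    bool        recently_used;  /* plays the role of BM_RECENTLY_USED */
} ToyBuf;

static ToyBuf pool[NBUFFERS];

/*
 * Same decision order as SyncOneBuffer in the patch, minus all locking:
 * skip clean or invalid buffers, optionally skip pinned/recently-used
 * ones, otherwise "write" the buffer.  Returns true if it was written.
 */
static bool
toy_sync_one(int buf_id, bool skip_pinned)
{
    ToyBuf     *b = &pool[buf_id];

    if (!b->valid || !b->dirty)
        return false;
    if (skip_pinned && (b->refcount != 0 || b->recently_used))
        return false;
    b->dirty = false;           /* stands in for FlushBuffer() */
    return true;
}

/*
 * Interleaved two-cursor scan modelled on BgBufferSync.  *cursor1 persists
 * across calls (the all-buffers scan); cursor2 is where the
 * replacement-candidate scan starts on this call.
 */
static int
toy_bg_sync(int *cursor1, int cursor2, int percent, int maxpages)
{
    int         num_to_scan;
    int         num_written = 0;

    if (percent <= 0 || maxpages <= 0)
        return 0;               /* either limit at zero disables the scan */
    assert(percent <= 100);
    num_to_scan = (NBUFFERS * percent + 99) / 100;  /* round up */

    while (num_to_scan-- > 0)
    {
        /* all-buffers scan: pinned buffers are written too */
        if (toy_sync_one(*cursor1, false) && ++num_written >= maxpages)
            break;
        if (++(*cursor1) >= NBUFFERS)
            *cursor1 = 0;

        if (num_to_scan-- == 0)
            break;

        /* soon-to-be-reused scan: pinned/recently-used are skipped */
        if (toy_sync_one(cursor2, true) && ++num_written >= maxpages)
            break;
        if (++cursor2 >= NBUFFERS)
            cursor2 = 0;
    }
    return num_written;
}
```

To try it, mark a few entries of pool valid and dirty, pin one of them by setting its refcount, and call toy_bg_sync with a cursor1 variable starting at 0. The scan budget is split evenly between the two cursors, which is the "divide our attention equally" behaviour the patch comment describes, and a pinned dirty buffer is only passed over when it is reached by the second cursor.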