Unix Technical Forum

  #1 (permalink)  
Old 04-19-2008, 10:45 AM
James Mansion
 
Posts: n/a
Default POSIX file updates

(Declaration of interest: I'm researching for a publication
on OLTP system design)

I have a question about file writes, particularly on POSIX.
This arose while considering the extent to which cache memory
and command queueing on disk drives can help improve performance.

Is it correct that POSIX requires that the updates to a single
file are serialised in the filesystem layer?

So, if we have a number of dirty pages to write back to a single
file in the database (whether a table or index) then we cannot
pass these through the POSIX filesystem layer into the TCQ/NCQ
system on the disk drive, so it can reorder them?

I have seen suggestions that on Solaris this can be relaxed.

I *assume* that PostgreSQL's lack of threads or AIO and the
single bgwriter means that PostgreSQL 8.x does not normally
attempt to make any use of such a relaxation but could do so if the
bgwriter fails to keep up and other backends initiate flushes.

Does anyone know (perhaps from other systems) whether it is
valuable to attempt to take advantage of such a relaxation
where it is available?

Does the serialisation for file update apply in the case
where the file contents have been memory-mapped and we
try an msync (or equivalent)?



--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

  #2 (permalink)  
Old 04-19-2008, 10:46 AM
Mark Mielke
 
Posts: n/a
Default Re: POSIX file updates

James Mansion wrote:
> (Declaration of interest: I'm researching for a publication
> on OLTP system design)
>
> I have a question about file writes, particularly on POSIX.
> This arose while considering the extent to which cache memory
> and command queueing on disk
> drives can help improve performance.
>
> Is it correct that POSIX requires that the updates to a single
> file are serialised in the filesystem layer?


Is there anything in POSIX that seems to suggest this? :-) (i.e. why are
you going under the assumption that the answer is yes - did you read
something?)

> So, if we have a number of dirty pages to write back to a single
> file in the database (whether a table or index) then we cannot
> pass these through the POSIX filesystem layer into the TCQ/NCQ
> system on the disk drive, so it can reorder them?


I don't believe POSIX has any restriction such as you describe - or if
it does, and I don't know about it, then most UNIX file systems (if not
most file systems on any platform) are not POSIX compliant.

Linux itself, even without NCQ, might choose to reorder the writes. If
you use ext2, the pressure to push pages out is based upon last used
time rather than last write time. It can choose to push out pages at any
time, and it's only every 5 seconds or so that the system task (bdflush?)
tries to force out all dirty file system pages. NCQ exaggerates the
situation, but I believe the issue pre-exists NCQ or the SCSI equivalent
of years past.

The rest of your email relies on the premise that POSIX enforces such a
thing, or that systems are POSIX compliant. :-)

Cheers,
mark

--
Mark Mielke <mark@mielke.cc>


  #3 (permalink)  
Old 04-19-2008, 10:46 AM
James Mansion
 
Posts: n/a
Default Re: POSIX file updates

Mark Mielke wrote:
> Is there anything in POSIX that seems to suggest this? :-) (i.e. why
> are you going under the assumption that the answer is yes - did you
> read something?)
>

It was something somewhere on the Sun web site, relating to tuning Solaris
filesystems. Or databases. Or ZFS. :-(

Needless to say I can't find a search string that finds it now. I
remember being surprised, though, since I wasn't aware of it either.
> I don't believe POSIX has any restriction such as you describe - or if
> it does, and I don't know about it, then most UNIX file systems (if
> not most file systems on any platform) are not POSIX compliant.

That, I can believe.

>
> Linux itself, even without NCQ, might choose to reorder the writes. If
> you use ext2, the pressure to push pages out is based upon last used
> time rather than last write time. It can choose to push out pages at
> any time, and it's only every 5 seconds or so that the system task
> (bdflush?) tries to force out all dirty file system pages. NCQ
> exaggerates the situation, but I believe the issue pre-exists NCQ or
> the SCSI equivalent of years past.

Indeed there do seem to be issues with Linux and fsync. It's one of
the things I'm trying to get a handle on as well - the relationship
between fsync and flushes of controller and/or disk caches.
>
> The rest of your email relies on the premise that POSIX enforces such
> a thing, or that systems are POSIX compliant. :-)
>

True. I'm hoping someone (Jignesh?) will be prompted to remember.

It may have been something in a blog related to ZFS vs other
filesystems, but so far I'm coming up empty in Google. It doesn't feel
like something I imagined, though.

James


  #4 (permalink)  
Old 04-19-2008, 10:46 AM
James Mansion
 
Posts: n/a
Default Re: POSIX file updates


> I don't believe POSIX has any restriction such as you describe - or if
> it does, and I don't know about it, then most UNIX file systems (if
> not most file systems on any platform) are not POSIX compliant.
>

I suspect that indeed there are two different issues here, in that the
file mutex relates to updates to the file, not to passing the buffers
through into the drive, which indeed might be delayed.

Been using direct io too much recently. :-(


  #5 (permalink)  
Old 04-19-2008, 10:46 AM
James Mansion
 
Posts: n/a
Default Re: POSIX file updates

Mark Mielke wrote:
> Is there anything in POSIX that seems to suggest this? :-) (i.e. why
> are you going under the assumption that the answer is yes - did you
> read something?)
>

Perhaps it was just this:

http://kevinclosson.wordpress.com/20...rite-ordering/

Which of course isn't on Sun.


  #6 (permalink)  
Old 04-19-2008, 10:46 AM
Greg Smith
 
Posts: n/a
Default Re: POSIX file updates

On Mon, 31 Mar 2008, James Mansion wrote:

> Is it correct that POSIX requires that the updates to a single
> file are serialised in the filesystem layer?


Quoting from Lewine's "POSIX Programmer's Guide":

"After a write() to a regular file has successfully returned, any
successful read() from each byte position in the file that was modified by
that write() will return the data that was written by the write()...a
similar requirement applies to multiple write operations to the same file
position"

That's the "contract" that has to be honored. How your filesystem
actually implements this contract is none of a POSIX write() call's
business, so long as it does.

It is the case that multiple writers to the same file can get serialized
somewhere because of how this call is implemented though, so you're
correct about that aspect of the practical impact being a possibility.
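
To make the contract concrete, here is a minimal Python sketch (not from
the thread; the function name read_after_write and the offset used in the
usage note are illustrative assumptions). It writes through one descriptor
and reads through another without any sync call, which is the visibility
guarantee the quoted passage describes. It says nothing about what has
reached the disk.

```python
import os
import tempfile


def read_after_write(path: str, offset: int, payload: bytes) -> bytes:
    """Write payload at offset via one descriptor, read it back via another."""
    wfd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    rfd = os.open(path, os.O_RDONLY)
    try:
        written = os.pwrite(wfd, payload, offset)
        # No fsync here: the contract is about what read() observes after
        # write() returns, not about whether the bytes are on stable storage.
        return os.pread(rfd, written, offset)
    finally:
        os.close(wfd)
        os.close(rfd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        print(read_after_write(os.path.join(d, "demo.dat"), 8192, b"page-1"))
```

The read sees the new bytes even though nothing has been synced, so a
filesystem is free to delay or reorder the physical writes as long as
this view is preserved.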

> So, if we have a number of dirty pages to write back to a single
> file in the database (whether a table or index) then we cannot
> pass these through the POSIX filesystem layer into the TCQ/NCQ
> system on the disk drive, so it can reorder them?


As long as the reordering mechanism also honors that any reads that come
after a write to a block reflect that write, they can be reordered. The
filesystem and drives are already doing elevator sorting and similar
mechanisms underneath you to optimize things. Unless you use a sync
operation or some sort of write barrier, you don't really know what has
happened.

> I have seen suggestions that on Solaris this can be relaxed.


There's some good notes in this area at:

http://www.solarisinternals.com/wiki...php/Direct_I/O and
http://www.solarisinternals.com/wiki...FS_Performance

It's clear that such relaxation has benefits with some of Oracle's
mechanisms as described. But amusingly, PostgreSQL doesn't even support
Solaris's direct I/O method right now unless you override the filesystem
mounting options, so you end up needing to split it out and hack at that
level regardless.

> I *assume* that PostgreSQL's lack of threads or AIO and the
> single bgwriter means that PostgreSQL 8.x does not normally
> attempt to make any use of such a relaxation but could do so if the
> bgwriter fails to keep up and other backends initiate flushes.


PostgreSQL writes transactions to the WAL. When they have reached disk,
confirmed by a successful f[data]sync or a completed synchronous write,
that transaction is now committed. Eventually the impacted items in the
buffer cache will be written as well. At checkpoint time, things are
reconciled such that all dirty buffers at that point have been written,
and now f[data]sync is called on each touched file to make sure those
changes have made it to disk.
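
A rough sketch of the commit rule described above, in Python. TinyWal,
its length-prefixed record format and read_records are hypothetical names
invented for illustration; this is not PostgreSQL's WAL format or code.
The only point it tries to show is that commit() does not return until
fsync has succeeded.

```python
import os
import struct
import tempfile


class TinyWal:
    """Hypothetical append-only log: a record is committed only after fsync."""

    def __init__(self, path: str):
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)

    def commit(self, record: bytes) -> None:
        # Length-prefix each record so a reader can find record boundaries.
        view = memoryview(struct.pack("<I", len(record)) + record)
        while view:
            # os.write may write fewer bytes than asked; loop until done.
            view = view[os.write(self.fd, view):]
        # Only when fsync returns successfully is the commit acknowledged.
        os.fsync(self.fd)

    def close(self) -> None:
        os.close(self.fd)


def read_records(path: str) -> list:
    """Read back every complete length-prefixed record in the log."""
    records = []
    with open(path, "rb") as f:
        while True:
            header = f.read(4)
            if len(header) < 4:
                break
            (length,) = struct.unpack("<I", header)
            records.append(f.read(length))
    return records
```

Whether fsync itself reaches the platters is a separate question that
depends on the OS and any caches underneath it.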

Writes are assumed to be lost in some memory (kernel, filesystem or disk
cache) until they've been confirmed to be written to disk via the sync
mechanism. When a backend flushes a buffer out, as soon as the OS caches
that write the database backend moves on without being concerned about how
it's eventually going to get to disk one day. As long as the newly
written version comes back again if it's read, the database doesn't worry
about what's happening until it specifically asks for a sync that proves
everything is done. So if the backends or the background writer are
spewing updates out, they don't care if the OS doesn't guarantee the order
they hit disk until checkpoint time; it's only the synchronous WAL writes
that do.
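
The behaviour described above might be sketched like this in Python.
LazyWriter and its methods are hypothetical, and the 8192-byte page size
is assumed for illustration; the real buffer manager is far more
involved. Pages are handed to the OS in whatever order they arrive, and
only the checkpoint step calls fsync on each file touched since the last
one.

```python
import os
import tempfile


class LazyWriter:
    """Hypothetical buffer flusher: write() now, fsync() only at checkpoint."""

    def __init__(self):
        self.fds = {}         # path -> open file descriptor
        self.touched = set()  # paths written since the last checkpoint

    def flush_page(self, path, page_no, data, page_size=8192):
        fd = self.fds.get(path)
        if fd is None:
            fd = self.fds[path] = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
        # Hand the page to the OS and move on. The order in which pages
        # reach the disk is left to the kernel and the drive.
        os.pwrite(fd, data, page_no * page_size)
        self.touched.add(path)

    def checkpoint(self):
        """fsync every file written since the last checkpoint."""
        synced = sorted(self.touched)
        for path in synced:
            os.fsync(self.fds[path])
        self.touched.clear()
        return synced

    def close(self):
        for fd in self.fds.values():
            os.close(fd)
        self.fds.clear()
```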

Also note that it's usually the case that backends write a substantial
percentage of the buffers out themselves. You should assume that's the
case unless you've done some work to prove the background writer is
handling most writes (which is difficult to even know before 8.3, much
less tune for).

That's how I understand everything to work, at least. I will add the
disclaimer that I haven't looked at the archive recovery code much yet.
Maybe it has some expectation about general database write ordering
in order for the WAL replay mechanism to work correctly; I can't imagine
how that could work, though.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

  #7 (permalink)  
Old 04-19-2008, 10:46 AM
Tom Lane
 
Posts: n/a
Default Re: POSIX file updates

Greg Smith <gsmith@gregsmith.com> writes:
> Quoting from Lewine's "POSIX Programmer's Guide":


> "After a write() to a regular file has successfully returned, any
> successful read() from each byte position in the file that was modified by
> that write() will return the data that was written by the write()...a
> similar requirement applies to multiple write operations to the same file
> position"


Yeah, I imagine this is what the OP is thinking of. But note that what
it describes is the behavior of concurrent write() and read() calls
within a normally-functioning system. I see nothing there that
constrains the order in which writes hit physical disk, nor (to put it
another way) that promises anything much about the state of the
filesystem after a crash.

As you stated, PG is largely independent of these issues anyway. As
long as the filesystem honors its spec-required contract that it won't
claim fsync() is complete before all the referenced data is safely on
persistent store, we are OK.

regards, tom lane

  #8 (permalink)  
Old 04-19-2008, 10:46 AM
James Mansion
 
Posts: n/a
Default Re: POSIX file updates

Greg Smith wrote:
> "After a write() to a regular file has successfully returned, any
> successful read() from each byte position in the file that was
> modified by that write() will return the data that was written by the
> write()...a similar requirement applies to multiple write operations
> to the same file position"
>

Yes, but that doesn't say anything about simultaneous read and write
from multiple threads from the same or different processes with
descriptors on the same file.

No matter, I was thinking about a case with direct unbuffered IO. Too
many years using Sybase on raw devices. :-(

Though, some of the performance studies relating to UFS directio suggest
that there are indeed benefits to managing the write-through rather than
using the OS as a poor man's background thread to do it. SQL Server
allows config based on deadline scheduling for checkpoint completion, I
believe. This seems to me a very desirable feature, but it does need
more active scheduling of the write-back.

>
> It's clear that such relaxation has benefits with some of Oracle's
> mechanisms as described. But amusingly, PostgreSQL doesn't even
> support Solaris's direct I/O method right now unless you override the
> filesystem mounting options, so you end up needing to split it out and
> hack at that level regardless.

Indeed that's a shame. Why doesn't it use the directio?
> PostgreSQL writes transactions to the WAL. When they have reached
> disk, confirmed by a successful f[data]sync or a completed synchronous
> write, that transaction is now committed. Eventually the impacted
> items in the buffer cache will be written as well. At checkpoint
> time, things are reconciled such that all dirty buffers at that point
> have been written, and now f[data]sync is called on each touched file
> to make sure those changes have made it to disk.

Yes, but fsync and stable on disk aren't the same thing if there is a
cache anywhere, are they? Hence the fuss a while back about Apple's
control of disk caches. Solaris and Windows do it too.
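
As an illustration of the gap between fsync and stable storage: on macOS,
fsync does not by itself ask the drive to empty its write cache, and the
F_FULLFSYNC fcntl command exists for that purpose. A hedged Python sketch
(flush_to_media is a hypothetical helper name) that uses it where the
platform exposes it and falls back to plain fsync otherwise:

```python
import fcntl
import os
import tempfile


def flush_to_media(fd: int) -> str:
    """Request the strongest flush available; report which call was used."""
    full_sync = getattr(fcntl, "F_FULLFSYNC", None)  # defined on macOS only
    if full_sync is not None:
        try:
            fcntl.fcntl(fd, full_sync)
            return "F_FULLFSYNC"
        except OSError:
            # Some filesystems reject the request; fall back to fsync.
            pass
    # Elsewhere, whether fsync also flushes drive caches depends on the
    # OS, filesystem and mount options, so this is only a best effort.
    os.fsync(fd)
    return "fsync"
```

The return value only says which call was issued, not whether the drive
honoured it.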

Isn't allowing the OS to accumulate an arbitrary number of dirty blocks
without control of the rate at which they spill to media just exposing a
possibility of an IO storm when it comes to checkpoint time? Does
bgwriter attempt to control this with intermediate fsync (and push to
media if available)?
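
For illustration only, the "intermediate fsync" idea in the question
could look like the following Python sketch. write_with_bounded_dirty and
its sync_every_bytes parameter are hypothetical; this is not a
description of what bgwriter actually does.

```python
import os
import tempfile


def write_with_bounded_dirty(fd, chunks, sync_every_bytes):
    """Write chunks in order, calling fsync whenever the amount of
    unsynced data reaches sync_every_bytes. Returns the fsync count."""
    unsynced = 0
    syncs = 0
    for chunk in chunks:
        os.write(fd, chunk)
        unsynced += len(chunk)
        if unsynced >= sync_every_bytes:
            # Push what has accumulated so far instead of leaving it all
            # for one large flush at the end.
            os.fsync(fd)
            syncs += 1
            unsynced = 0
    if unsynced:
        os.fsync(fd)
        syncs += 1
    return syncs
```

Ten 4096-byte chunks with a 16384-byte limit give three fsync calls: two
at the limit and one for the remainder. The trade-off is more sync calls
in exchange for a smaller burst at the end.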

It strikes me as odd that fsync_writethrough isn't the most preferred
option where it is implemented. The postgres approach of *requiring*
that there be no cache below the OS is problematic, especially since the
battery backup on internal array controllers is hardly the handiest
solution when you find the mobo has died. And especially when the
inability to flush caches on modern SATA and SAS drives would appear to
be more a failing in some operating systems than in the drives themselves.

The links I've been accumulating into my bibliography include:

http://www.h2database.com/html/advan...tion_isolation
http://lwn.net/Articles/270891/
http://article.gmane.org/gmane.linux.kernel/646040
http://lists.apple.com/archives/darw.../msg00072.html
http://brad.livejournal.com/2116715.html

And your handy document on wal tuning, of course.

James


  #9 (permalink)  
Old 04-19-2008, 10:46 AM
Andreas Kostyrka
 
Posts: n/a
Default Re: POSIX file updates


On Wednesday, 02.04.2008, at 20:10 +0100, James Mansion wrote:
> It strikes me as odd that fsync_writethrough isn't the most preferred
> option where
> it is implemented. The postgres approach of *requiring* that there be no
> cache
> below the OS is problematic, especially since the battery backup on internal
> array controllers is hardly the handiest solution when you find the mobo
> has died.


Well, this might sound brutal, but I'm having a brutal day today.

There are some items here.

1.) PostgreSQL relies on filesystem semantics, which might be better or
worse than the raw devices other RDBMSs use as an interface, but in the
end it is just an interface. How well that works out depends strongly on
your hardware selection, your OS selection and so on. DB tuning is a
scientific art form. Worse, the fact that raw devices work better than,
say, filesystems on hardware X/OS Y is only of limited interest, and only
if you already happen to have an investment in X or Y. In the end the
questions are: is the performance good enough, is the data safety good
enough, and at what cost (in money, work, ...)?

2.) Data safety requirements vary strongly. In many (if not most) cases,
recovering the data from failed hardware is not critical. Hint: being
down until somebody figures out what failed, whether the rest of the
system is still stable, and so on is not acceptable at all. That means
the moment the database server has any problem, one of the hot standbys
takes over. The thing you worry about is whether all data has made it to
the replication servers, not whether some data might get lost in the
hardware cache of a controller. (Actually, talk to your local computer
forensics guru; there are a number of ways to keep current flowing to
electronics while moving them.)

3.) A controller cache is an issue whether or not you have a filesystem
in your data path. If you do raw IO and the stupid hardware caches
writes, well, that's about as stupid as it would be if it had cached
filesystem writes.

Andreas





  #10 (permalink)  
Old 04-19-2008, 10:46 AM
James Mansion
 
Posts: n/a
Default Re: POSIX file updates

Andreas Kostyrka wrote:
> takes over. The thing you worry about is if all data has made it to the
> replication servers, not if some data might get lost in the hardware
> cache of a controller. (Actually, talk to your local computer forensics
> guru, there are a number of way to keep the current to electronics while
> moving them.)
>

But it doesn't, unless you use synchronous replication at the block
level, which is why we have SRDF. Log-based replication is asynchronous
and will lose committed transactions. Even if you failed over, it's
still extraordinarily useful to be able to see what the primary tried
to do: it's the only place the e-comm transactions are stored, and the
customer will still expect delivery.

I'm well aware that there are battery-backed caches that can be detached
from controllers and moved. But you'd better make darn sure you move all
the drives, plug them in in exactly the right order, and make sure they
all spin up OK with the replaced cache, because it's expecting them to be
exactly as they were last time they were on the bus.

> 3.) a controller cache is an issue if you have a filesystem in your data
> path or not. If you do raw io, and the stupid hardware do cache writes,
> well it's about as stupid as it would be if it would have cached
> filesystem writes.
>

Only if the OS doesn't know how to tell the cache to flush. SATA and
SAS both have that facility. But the semantics of *sync don't seem to be
defined to require it being exercised, at least as far as many operating
systems implement it.

You would think hard drives could have enough capacitor store to dump
cache to flash or the drive - if only to a special dump zone near where
the heads park. They are spinning already, after all.

On small systems in SMEs it's inevitable that large drives will be shared
with filesystem use, even if the database files are on their own slice.
If you can allow the drive to run with writeback cache turned on, then
the users will be a lot happier, even if DBMS commits force *all* that
cache to flush to the platters.


