Unix Technical Forum

#1
04-17-2008, 11:22 PM
Neil Conway
Re: O_DIRECT for WAL writes

ITAGAKI Takahiro wrote:
> The patch adds a new choice "open_direct" to wal_sync_method.
> It uses O_DIRECT flags for WAL writes, like O_SYNC.


Have you looked at what the performance difference of this option is?
For example, these benchmark results seem to indicate that an older
version of the patch is not a performance win, at least for OSDL's workload:

http://www.mail-archive.com/pgsql-pa.../msg07186.html

Is this data still applicable to the revised patch?

-Neil

#2
04-17-2008, 11:23 PM
Neil Conway
Re: O_DIRECT for WAL writes

On Mon, 2005-05-30 at 10:59 +0900, ITAGAKI Takahiro wrote:
> Yes, I've tested pgbench and dbt2 and their performances have improved.
> The two results are as follows:
>
> 1. pgbench -s 100 on one Pentium4, 1GB mem, 2 ATA disks, Linux 2.6.8
> (attached image)
>   tps  | wal_sync_method
> -------+-------------------------------------------------------
>  147.0 | open_direct + write multipage (previous patch)
>  147.2 | open_direct (this patch)
>  109.9 | open_sync


I'm surprised this makes as much of a difference as that benchmark would
suggest. I wonder if we're benchmarking the right thing, though: is
opening a file with O_DIRECT sufficient to ensure that a write(2) does
not return until the data has hit disk? (As would be the case with
O_SYNC.) O_DIRECT means the OS will attempt to minimize caching, but
that is not necessarily the same thing: for example, I can imagine an
implementation in which the kernel would submit the appropriate I/O to
the disk when it sees a write(2) on a file opened with O_DIRECT, but
then let the write(2) return before getting confirmation from the disk
that the I/O has succeeded or failed. From googling, the MySQL
documentation for innodb_flush_method notes:

This option is only relevant on Unix systems. If set to
fdatasync, InnoDB uses fsync() to flush both the data and log
files. If set to O_DSYNC, InnoDB uses O_SYNC to open and flush
the log files, but uses fsync() to flush the datafiles. If
O_DIRECT is specified (available on some GNU/Linux versions
starting from MySQL 4.0.14), InnoDB uses O_DIRECT to open the
datafiles, and uses fsync() to flush both the data and log
files.

That would suggest O_DIRECT by itself is not sufficient to force a flush
to disk -- if anyone has some more definitive evidence that would be
welcome.

Anyway, if the above is true, we'll need to use O_DIRECT as well as one
of the existing wal_sync_methods.
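
A minimal sketch of that combination, assuming Linux/glibc and a 4096-byte alignment. `wal_write_durable` and `WAL_BLOCK` are hypothetical names for illustration, not code from the patch. The idea is that O_DIRECT only bypasses the OS cache, while the explicit fdatasync() is what asks for the data to reach stable storage:

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* exposes O_DIRECT in <fcntl.h> on Linux/glibc */
#endif
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#ifndef O_DIRECT
#define O_DIRECT 0              /* no O_DIRECT here: degrade to buffered I/O */
#endif

#define WAL_BLOCK 4096          /* assumed alignment, as in the patch's TODO */

/* Write one WAL_BLOCK-sized record at offset 0 and force it to stable
 * storage. Returns 0 on success, -1 on failure. */
int wal_write_durable(const char *path, const char *payload)
{
    void *buf = NULL;
    int fd, rc = -1;

    /* O_DIRECT needs an aligned buffer and an aligned transfer size. */
    if (posix_memalign(&buf, WAL_BLOCK, WAL_BLOCK) != 0)
        return -1;
    memset(buf, 0, WAL_BLOCK);
    strncpy(buf, payload, WAL_BLOCK - 1);

    fd = open(path, O_WRONLY | O_CREAT | O_DIRECT, 0600);
    if (fd < 0 && errno == EINVAL)          /* filesystem rejects O_DIRECT */
        fd = open(path, O_WRONLY | O_CREAT, 0600);
    if (fd >= 0) {
        /* The write bypasses the OS cache; fdatasync() requests the flush. */
        if (write(fd, buf, WAL_BLOCK) == WAL_BLOCK && fdatasync(fd) == 0)
            rc = 0;
        close(fd);
    }
    free(buf);
    return rc;
}
```

Calling `wal_write_durable("/tmp/wal_demo.bin", "record 1")` should return 0 on a writable filesystem. Whether the fdatasync() is redundant under O_DIRECT is exactly the open question in this thread, so the sketch keeps it.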

BTW, from the patch:

+ /* TODO: Alignment depends on OS and filesystem. */
+ #define O_DIRECT_BUFFER_ALIGN 4096

I suppose there's no reasonable way to autodetect this, so we'll need to
expose it as a GUC variable (or perhaps a configure option), which is a
bit unfortunate.
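
To illustrate what a configurable alignment might look like, here is a rough sketch in which the alignment is a runtime parameter rather than a compile-time constant. `align_up` and `alloc_direct_buffer` are hypothetical helpers, and 4096 is just the default taken from the patch, not a detected value:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define O_DIRECT_BUFFER_ALIGN 4096  /* default from the patch; not autodetected */

/* Round len up to the next multiple of align; align must be a power of two. */
size_t align_up(size_t len, size_t align)
{
    return (len + align - 1) & ~(align - 1);
}

/*
 * Allocate a zero-filled buffer whose start address and length are both
 * multiples of align. The rounded length is stored in *out_len.
 * Returns NULL if align is unusable or the allocation fails.
 */
void *alloc_direct_buffer(size_t len, size_t align, size_t *out_len)
{
    void *buf = NULL;

    if (align == 0 || (align & (align - 1)) != 0)
        return NULL;                /* not a power of two */
    *out_len = align_up(len, align);
    if (posix_memalign(&buf, align, *out_len) != 0)
        return NULL;
    memset(buf, 0, *out_len);
    return buf;
}
```

A caller would pass the configured value, for example `alloc_direct_buffer(8192, O_DIRECT_BUFFER_ALIGN, &n)`, and release the buffer with free().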

-Neil



#3
04-17-2008, 11:23 PM
Tom Lane
Re: O_DIRECT for WAL writes

Neil Conway <neilc@samurai.com> writes:
> I wonder if we're benchmarking the right thing, though: is
> opening a file with O_DIRECT sufficient to ensure that a write(2) does
> not return until the data has hit disk?


Some googling suggests so, eg
http://www.die.net/doc/linux/man/man2/open.2.html

There are several useful tidbits about O_DIRECT on that page,
including this quote:

> "The thing that has always disturbed me about O_DIRECT is that the whole
> interface is just stupid, and was probably designed by a deranged monkey
> on some serious mind-controlling substances." -- Linus


Somehow I find that less than confidence-building...

regards, tom lane

#4
04-17-2008, 11:23 PM
Neil Conway
Re: O_DIRECT for WAL writes

On Mon, 2005-05-30 at 02:52 -0400, Tom Lane wrote:
> Some googling suggests so, eg
> http://www.die.net/doc/linux/man/man2/open.2.html


Well, that claims that "data is guaranteed to have been transferred",
but transferred to *where* is the question. Transferring data to the
disk's buffers and then not asking for the buffer to be flushed is not
sufficient, for example. IMHO the fact that InnoDB uses both O_DIRECT
and fsync() is more convincing. I'm still looking for a definitive
answer, though.

The other question is whether these semantics are identical among the
various O_DIRECT implementations (e.g. Linux, FreeBSD, AIX, IRIX, and
others).

-Neil



#5
04-17-2008, 11:23 PM
Ron Mayer
Re: O_DIRECT for WAL writes

Tom Lane wrote:
> Neil Conway <neilc@samurai.com> writes:
>>is opening a file with O_DIRECT sufficient to ensure that
>>a write(2) does not return until the data has hit disk?

>
> Some googling suggests so, eg
> http://www.die.net/doc/linux/man/man2/open.2.html


Really? On that page I read:
"O_DIRECT...at the completion of the read(2) or write(2)
system call, data is guaranteed to have been transferred."
which sounds to me like transferred to the device's cache
but not necessarily flushed through the device's cache.
It says nothing about physical media. That wording feels
different to me from O_SYNC which reads:
"O_SYNC will block the calling process until the data has
been physically written to the underlying hardware."
which does suggest to me that it writes to physical media.
Or am I reading that wrong?



PS: I've gotten way out of my depth here, but...

...attempting to browse the Linux source(!!)

Looking at the O_SYNC stuff in ext3:
http://lxr.linux.no/source/fs/ext3/file.c#L67
it looks like in this conditional:
    if (file->f_flags & O_SYNC) {
        ...
        goto force_commit;
    }
the goto branch calls ext3_force_commit() in much the
same way that it seems fsync() does here:
http://lxr.linux.no/source/fs/ext3/fsync.c#L71
so I believe O_SYNC does at least as much as fsync().

However I can't find O_DIRECT anywhere in the ext3 stuff,
so if it does work it's less obvious how or if it could.

Moreover I see O_SYNC used in lots of places:
http://lxr.linux.no/ident?i=O_SYNC
including fs/ext3/, and I don't see O_DIRECT in nearly as many places:
http://lxr.linux.no/ident?i=O_DIRECT
It looks like reiserfs and xfs look at O_DIRECT,
but ext3 doesn't appear to, unless it's somewhere
outside the fs/ext3 directory.


PPS: Of course not even fsync() flushed correctly until very recent kernels:
http://hardware.slashdot.org/comment...9&cid=12519114
In that article Jeff Garzik (the linux SATA driver guy) suggests
that until very recent kernels ext3 did not have write barrier
support that issues the FLUSH CACHE (IDE) or SYNCHRONIZE CACHE
(SCSI) commands even on fsync.


PPPS: No, I don't understand the kernel - I'm just showing what quick
grep commands showed without any deep understanding.
#6
04-17-2008, 11:23 PM
Tom Lane
Re: O_DIRECT for WAL writes

Neil Conway <neilc@samurai.com> writes:
> Well, that claims that "data is guaranteed to have been transferred",
> but transferred to *where* is the question.


Oh, I see what you are worried about. I think you are right: what the
doc promises is only that the DMA transfer has finished (ie, it's safe
to scribble on your buffer again). So you'd still need an fsync;
which makes O_DIRECT orthogonal to wal_sync_method rather than a
valid choice for it. (Hm, I wonder if specifying both O_DIRECT and
O_SYNC works ...)
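
One way to picture that orthogonality is a flag builder in which the sync method and the direct switch are chosen independently. This is only a sketch: `wal_open_flags` and `use_direct` are made-up names, not PostgreSQL APIs, and whether O_DIRECT | O_SYNC behaves sensibly on a given platform is still the open question:

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* exposes O_DIRECT in <fcntl.h> on Linux/glibc */
#endif
#include <fcntl.h>
#include <string.h>

#ifndef O_DIRECT
#define O_DIRECT 0              /* platform without O_DIRECT: flag is a no-op */
#endif

/*
 * Map a wal_sync_method-style name plus an independent "direct" switch to
 * open(2) flags. Returns -1 for an unrecognized method name.
 */
int wal_open_flags(const char *sync_method, int use_direct)
{
    int flags = O_WRONLY;

    if (strcmp(sync_method, "open_sync") == 0)
        flags |= O_SYNC;        /* write(2) itself waits for stable storage */
    else if (strcmp(sync_method, "open_datasync") == 0)
        flags |= O_DSYNC;
    else if (strcmp(sync_method, "fsync") != 0 &&
             strcmp(sync_method, "fdatasync") != 0)
        return -1;
    /* "fsync" and "fdatasync" add no open flag; the caller syncs after writing. */

    if (use_direct)
        flags |= O_DIRECT;      /* orthogonal: only bypasses the OS cache */
    return flags;
}
```

With this shape, "open_sync" plus direct yields O_WRONLY | O_SYNC | O_DIRECT, while "fsync" plus direct yields O_WRONLY | O_DIRECT and leaves the flush to a later fsync() call.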

> The other question is whether these semantics are identical among the
> various O_DIRECT implementations (e.g. Linux, FreeBSD, AIX, IRIX, and
> others).


Wouldn't count on it :-(. One thing I'm particularly worried about is
buffer cache consistency: does the kernel guarantee to flush any buffers
it has that overlap the O_DIRECT write operation? Without this, an
application reading the WAL using normal non-O_DIRECT I/O might see the
wrong data; which is bad news for PITR.
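
A small probe along these lines could check the behavior empirically on a given platform. It is a sketch under assumptions (POSIX pread/pwrite, a 4096-byte block, hypothetical function names), and a passing result on one kernel and filesystem proves nothing about others:

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* exposes O_DIRECT in <fcntl.h> on Linux/glibc */
#endif
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#ifndef O_DIRECT
#define O_DIRECT 0              /* no O_DIRECT here: probe degrades to buffered */
#endif

#define BLK 4096                /* assumed alignment and block size */

/* Open path for writing, preferring O_DIRECT but falling back if refused. */
static int open_direct_or_buffered(const char *path)
{
    int fd = open(path, O_WRONLY | O_DIRECT);
    if (fd < 0 && errno == EINVAL)
        fd = open(path, O_WRONLY);
    return fd;
}

/* Returns 1 if a buffered read sees data written through the direct
 * descriptor, 0 if it sees stale data, -1 on any I/O error. */
int direct_write_visible(const char *path)
{
    char cached[BLK];
    void *buf = NULL;
    int result = -1;
    int rfd = -1, wfd = -1;

    if (posix_memalign(&buf, BLK, BLK) != 0)
        return -1;

    /* Step 1: create the file with old contents and read it back, so the
     * block is likely to be sitting in the OS cache. */
    rfd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (rfd < 0) goto done;
    memset(buf, 'A', BLK);
    if (write(rfd, buf, BLK) != BLK || fsync(rfd) != 0) goto done;
    if (pread(rfd, cached, BLK, 0) != BLK) goto done;

    /* Step 2: overwrite the same block through a second, direct descriptor. */
    wfd = open_direct_or_buffered(path);
    if (wfd < 0) goto done;
    memset(buf, 'B', BLK);
    if (pwrite(wfd, buf, BLK, 0) != BLK) goto done;

    /* Step 3: re-read through the buffered descriptor and compare. */
    if (pread(rfd, cached, BLK, 0) != BLK) goto done;
    result = (memcmp(cached, buf, BLK) == 0);

done:
    if (wfd >= 0) close(wfd);
    if (rfd >= 0) close(rfd);
    free(buf);
    return result;
}
```

A return value of 0 from `direct_write_visible()` would indicate the kind of stale read that would hurt a WAL reader using ordinary buffered I/O.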

regards, tom lane

#7
04-17-2008, 11:23 PM
Neil Conway
Re: O_DIRECT for WAL writes

On Mon, 2005-05-30 at 11:24 -0400, Tom Lane wrote:
> Wouldn't count on it :-(. One thing I'm particularly worried about is
> buffer cache consistency: does the kernel guarantee to flush any buffers
> it has that overlap the O_DIRECT write operation?


At least on Linux I believe the kernel guarantees consistency between
O_DIRECT and non-O_DIRECT operations. From googling, it seems AIX also
provides consistency, albeit not for free[1]:

To avoid consistency issues, if there are multiple calls to open
a file and one or more of the calls did not specify O_DIRECT and
another open specified O_DIRECT, the file stays in the normal
cached I/O mode. Similarly, if the file is mapped into memory
through the shmat() or mmap() system calls, it stays in normal
cached mode. If the last conflicting, non-direct access is
eliminated, then the file system will move the file into direct
I/O mode (either by using the close(), munmap(), or shmdt()
subroutines). Changing from normal mode to direct I/O mode can
be expensive because all modified pages in memory will have to
be flushed to disk at that point.

-Neil

[1]
http://publib16.boulder.ibm.com/pser.../diskperf9.htm


#8
04-17-2008, 11:24 PM
Mary Edie Meredith
Re: O_DIRECT for WAL writes

On Mon, 2005-05-30 at 16:29 +1000, Neil Conway wrote:
> On Mon, 2005-05-30 at 10:59 +0900, ITAGAKI Takahiro wrote:
> > Yes, I've tested pgbench and dbt2 and their performances have improved.
> > The two results are as follows:
> >
> > 1. pgbench -s 100 on one Pentium4, 1GB mem, 2 ATA disks, Linux 2.6.8
> > (attached image)
> >   tps  | wal_sync_method
> > -------+-------------------------------------------------------
> >  147.0 | open_direct + write multipage (previous patch)
> >  147.2 | open_direct (this patch)
> >  109.9 | open_sync

>
> I'm surprised this makes as much of a difference as that benchmark would
> suggest. I wonder if we're benchmarking the right thing, though: is
> opening a file with O_DIRECT sufficient to ensure that a write(2) does
> not return until the data has hit disk? (As would be the case with
> O_SYNC.) O_DIRECT means the OS will attempt to minimize caching, but
> that is not necessarily the same thing: for example, I can imagine an
> implementation in which the kernel would submit the appropriate I/O to
> the disk when it sees a write(2) on a file opened with O_DIRECT, but
> then let the write(2) return before getting confirmation from the disk
> that the I/O has succeeded or failed. From googling, the MySQL
> documentation for innodb_flush_method notes:
>
> This option is only relevant on Unix systems. If set to
> fdatasync, InnoDB uses fsync() to flush both the data and log
> files. If set to O_DSYNC, InnoDB uses O_SYNC to open and flush
> the log files, but uses fsync() to flush the datafiles. If
> O_DIRECT is specified (available on some GNU/Linux versions
> starting from MySQL 4.0.14), InnoDB uses O_DIRECT to open the
> datafiles, and uses fsync() to flush both the data and log
> files.
>
> That would suggest O_DIRECT by itself is not sufficient to force a flush
> to disk -- if anyone has some more definitive evidence that would be
> welcome.


I know I'm late to this discussion, and I haven't made it all the way
through this thread to see if your questions on Linux writes were
resolved. If you are still interested, I recommend reading a very good
one-page description of reliable writes buried in the Data Center Linux
Goals and Capabilities document. It is on page 159 of the document; the
item is "R.ReliableWrites" in this _giant_ PDF file (do a wget and open
it locally; don't try to read it directly):

http://www.osdlab.org/lab_activities...lities_1.1.pdf

The information came from me interviewing Daniel McNeil, an OSDL
Engineer who wrote and tested much of the Linux async IO code, after I
was similarly confused about when a write is "guaranteed". Reliable
writes, as you can imagine, are very important to Data Center folks,
which is how it happens to be in this document.

Hope this helps.
>
> Anyway, if the above is true, we'll need to use O_DIRECT as well as one
> of the existing wal_sync_methods.
>
> BTW, from the patch:
>
> + /* TODO: Alignment depends on OS and filesystem. */
> + #define O_DIRECT_BUFFER_ALIGN 4096
>
> I suppose there's no reasonable way to autodetect this, so we'll need to
> expose it as a GUC variable (or perhaps a configure option), which is a
> bit unfortunate.
>
> -Neil

--
Mary Edie Meredith
maryedie@osdl.org
503-906-1942
Data Center Linux Initiative Manager
Open Source Development Labs


#9
04-17-2008, 11:24 PM
Neil Conway
Re: O_DIRECT for WAL writes

On Wed, 2005-06-01 at 17:08 -0700, Mary Edie Meredith wrote:
> I know I'm late to this discussion, and I haven't made it all the way
> through this thread to see if your questions on Linux writes were
> resolved. If you are still interested, I recommend reading a very good
> one-page description of reliable writes buried in the Data Center Linux
> Goals and Capabilities document.


This suggests that on Linux a write() on a file opened with O_DIRECT has
the same synchronization guarantees as a write() on a file opened with
O_SYNC, which is precisely the opposite of what was concluded down
thread. So now I'm more confused.

(Regardless of behavior on Linux, I would guess O_DIRECT doesn't behave
this way on all platforms -- for example, FreeBSD's open(2) manpage does
not mention I/O synchronization when referring to O_DIRECT. So even if
we can skip the fsync() with O_DIRECT on Linux, I doubt we'll be able to
do that on all platforms.)

-Neil



#10
04-17-2008, 11:24 PM
Mary Edie Meredith
Re: O_DIRECT for WAL writes

On Thu, 2005-06-02 at 11:39 +1000, Neil Conway wrote:
> On Wed, 2005-06-01 at 17:08 -0700, Mary Edie Meredith wrote:
> > I know I'm late to this discussion, and I haven't made it all the way
> > through this thread to see if your questions on Linux writes were
> > resolved. If you are still interested, I recommend reading a very good
> > one-page description of reliable writes buried in the Data Center Linux
> > Goals and Capabilities document.

>
> This suggests that on Linux a write() on a file opened with O_DIRECT has
> the same synchronization guarantees as a write() on a file opened with
> O_SYNC, which is precisely the opposite of what was concluded down
> thread. So now I'm more confused.
>
> (Regardless of behavior on Linux, I would guess O_DIRECT doesn't behave
> this way on all platforms -- for example, FreeBSD's open(2) manpage does
> not mention I/O synchronization when referring to O_DIRECT. So even if
> we can skip the fsync() with O_DIRECT on Linux, I doubt we'll be able to
> do that on all platforms.)


My understanding is that O_DIRECT means "direct" as in "no buffering by
the OS", which implies that if you write from your buffer, the write is
not going to return until the OS thinks the write is completed (unless
you are using async IO). Otherwise you might reuse your buffer (there
_is_ no other buffer, after all), and if the first write were still
incomplete when you refilled your buffer for the next one, the first
write might go out from your buffer with the wrong data.

Now if you want to avoid _waiting_ for the write to complete, you need to
employ async io, which is why most databases that support direct io for
their datafiles also have implemented some form of async io as well
(either via OS calls or some built-in mechanism as is the case with
SAP-DB). With AIO you have to manage your buffers so that you reuse them
only when you are notified the IO is completed. Historically this was
done with raw datafiles, but currently (at least for Linux) you can also
do this with files. For logging, though, I think you want synchronous
IO to guarantee order.

The cool thing about buffering the datafile data yourself is that _you_
(the database engine) can control what stays in (shared) memory and what
does not. You can add configuration options or add intelligence, so
that frequently used data (like hot indexes) can stay in memory
indefinitely. The OS can never do that so specifically. In addition,
you can avoid having data from table scans overwrite hot objects. Of
course, at the moment you are discussing the use for logging, but there
should be benefits to extending this to datafiles as well, assuming you
also implement async io.

Bottom line: if you do not implement direct/async IO so that you
optimize caching of hot database objects and minimize memory utilization
of objects used once, you are probably leaving performance on the table
for datafiles.

Daniel is on vacation, but I will ask him to confirm once he returns.

--
Mary Edie Meredith
maryedie@osdl.org
503-906-1942
Data Center Linux Initiative Manager
Open Source Development Labs

