Ok, just wanted a sanity check, here. That is to say, find out if anyone else has experienced this issue.

Anyway, in my environment, we have a number of different clustered systems. Partners in each cluster are identically configured (same hardware and same OS and patches). However, we do run several different versions of VCS and Sun Cluster.

Occasionally, if a node is really busy and panics, when the filesystem gets re-assembled on the standby node, the fsck sometimes results in some of the active/open files disappearing into the ether.

Anyone ever witnessed this? Is there any documentation out there on this subject (tried Googling, but nothing useful/recent turned up)?

-tom
Thomas H Jones II wrote:
> Ok, just wanted a sanity check, here. That is to say, find out if anyone
> else has experienced this issue.
>
> Anyway, in my environment, we have a number of different clustered systems.
> Partners in each cluster are identically configured (same hardware and same
> OS and patches). However, we do run several different versions of VCS and
> Sun Cluster.
>
> Occasionally, if a node is really busy and panics, when the filesystem gets
> re-assembled on the standby node, the fsck sometimes results in some of the
> active/open files disappearing into the ether.
>
> Anyone ever witnessed this? Is there any documentation out there on this
> subject (tried Googling, but nothing useful/recent turned up).
>
> -tom
>

Tom,

fsck will do its darndest to rescue and sanitize any corrupt/munged filesystems after a panic/powerfail, but exactly how it does it is a black art - its source was not made available during the Solaris 8 Source Code availability program.

Recoverable files are obviously recovered to preallocated slots in the lost+found directory in each filesystem, and named after their inode number. This happens in the case where a directory owning files is lost.

If you can stand the potential overhead, turn on journaling/logging on the filesystems that are crucial to you; it will offer more protection. Also make sure that you are up to date on Solaris patches, and anything Veritas puts out.

Also, if you have a panicking system, make sure you have crashdumps correctly configured (dumpadm), and even consider booting kadb. These will aid analysis.

Last of all, watch out for situations where fsck will not go to completion automatically, and will require manual execution (e.g. from single-user mode or booted off CD). I have seen this if a directory's link count needs to be increased, rather than decreased.

If this is all Master of the Bleedin' Obvious stuff, then apols - I'll probably get flamed by the clever posters in this NG anyways.
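Since the reply above says fsck parks recovered files in lost+found under names derived from their inode numbers, one way to take stock after a failover is to list what ended up there. The following is a minimal sketch, not anything fsck itself provides: the function name `list_recovered` is made up for illustration, the exact entry naming is left to the filesystem, and reading a real lost+found usually needs root.

```python
import os
import stat
from typing import List, Tuple


def list_recovered(lost_found: str) -> List[Tuple[str, int, int, str]]:
    """Return (name, inode, size_bytes, kind) for each entry in a lost+found directory.

    Hypothetical helper: it only reports what is there so an admin can match
    entries against files that went missing; it does not restore anything.
    """
    rows = []
    with os.scandir(lost_found) as entries:
        for entry in sorted(entries, key=lambda e: e.name):
            # Do not follow symlinks: we want the recovered object itself.
            info = entry.stat(follow_symlinks=False)
            kind = "dir" if stat.S_ISDIR(info.st_mode) else "file"
            rows.append((entry.name, info.st_ino, info.st_size, kind))
    return rows
```

Calling `list_recovered("/some/mountpoint/lost+found")` (the path is a placeholder) returns one row per recovered entry; comparing the sizes and inode numbers against a pre-crash listing, if one exists, is left to the reader.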
In comp.unix.solaris Thomas H Jones II <ferric@xanthia.com> wrote:
> Ok, just wanted a sanity check, here. That is to say, find out if anyone
> else has experienced this issue.
> Anyway, in my environment, we have a number of different clustered systems.
> Partners in each cluster are identically configured (same hardware and same
> OS and patches). However, we do run several different versions of VCS and
> Sun Cluster.
> Occasionally, if a node is really busy and panics, when the filesystem gets
> re-assembled on the standby node, the fsck sometimes results in some of the
> active/open files disappearing into the ether.

What filesystem are you using: VxFS or UFS? If you're using UFS, are you mounting it with 'logging' enabled?

--
Darren Dunham                                           ddunham@taos.com
Unix System Administrator                    Taos - The SysAdmin Company
Got some Dr Pepper?                           San Francisco, CA bay area
         < This line left intentionally blank to confuse you. >
Thomas H Jones II wrote:
> Ok, just wanted a sanity check, here. That is to say, find out if anyone
> else has experienced this issue.
>
> Anyway, in my environment, we have a number of different clustered systems.
> Partners in each cluster are identically configured (same hardware and same
> OS and patches). However, we do run several different versions of VCS and
> Sun Cluster.
>
> Occasionally, if a node is really busy and panics, when the filesystem gets
> re-assembled on the standby node, the fsck sometimes results in some of the
> active/open files disappearing into the ether.

Thinking from the Sun Cluster side, and I'm sure VCS too, no amount of regular system "busy-ness" should result in a cluster failover and panic. Unless, that is, your service monitors include a recipe for the system being busy and deem that a reason to fail over (which should still be elegant, without damaging filesystems).

So if you're really witnessing unexplained "busy leads to panic/failover", get in contact with SunService for someone to investigate.

Gavin

> Anyone ever witnessed this? Is there any documentation out there on this
> subject (tried Googling, but nothing useful/recent turned up).
>
> -tom
>
* Gavin Maltby wrote:
> Thinking from the Sun Cluster side, and I'm sure VCS too, no amount of
> regular system "busy-ness" should result in a cluster failover and panic.
> Unless, that is, your service monitors include a recipe for system being
> busy and deem that as a reason to failover (which should still be elegant
> without damaging filesystems).

Can't a stupidly highly-loaded system result in it failing to send heartbeats in time, and the cluster monitor then deciding it's gone away and failing over? Obviously a properly designed system wouldn't suffer from this - apart from anything else you'd put the heartbeat stuff in some kind of scheduling class of its own so it can always get time, but I'm not convinced that this is always done right.

I'm thinking of things like: system becomes catastrophically short of memory, swapping heavily, heartbeat code is swapped out somewhere, fails to get back in in time. I know these things can be coped with in theory, but are they in practice?

--tim
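The failure mode described above can be sketched in a few lines. This is an illustration only, not how VCS or Sun Cluster implement their heartbeats: the class `HeartbeatMonitor`, its `timeout_s` parameter, and the node names are all assumptions made up for the example. The point it shows is that a timeout-based monitor cannot tell a dead node from one that is alive but too starved to send its heartbeat on time.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class HeartbeatMonitor:
    # Hypothetical threshold: silence longer than this means "node is gone".
    timeout_s: float
    last_seen: Dict[str, float] = field(default_factory=dict)

    def beat(self, node: str, now: float) -> None:
        """Record that a heartbeat from `node` arrived at time `now` (seconds)."""
        self.last_seen[node] = now

    def presumed_dead(self, now: float) -> List[str]:
        """Nodes whose last heartbeat is older than the timeout.

        A node that is merely swapped out or starved of CPU lands here too;
        the monitor only ever sees the silence, not the reason for it.
        """
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )
```

With a 10-second timeout, a node that last beat at t=0 is reported by `presumed_dead(12.0)` even if its heartbeat process finally gets scheduled again at t=13, by which time a real cluster may already have started the failover.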
Tim Bradshaw wrote:
> * Gavin Maltby wrote:
>
>
>>Thinking from the Sun Cluster side, and I'm sure VCS too, no amount of
>>regular system "busy-ness" should result in a cluster failover and panic.
>>Unless, that is, your service monitors include a recipe for system being
>>busy and deem that as a reason to failover (which should still be elegant
>>without damaging filesystems).
>
>
> Can't a stupidly highly-loaded system result in it failing to send
> heartbeats in time, and the cluster monitor then deciding it's gone
> away and failing over? Obviously a properly designed system wouldn't
> suffer from this - apart from anything else you'd put the heartbeat
> stuff in some kind of scheduling class on their own so they can always
> get time, but I'm not convinced that this is always done right.

If this happens then it's a bug and you should complain to the cluster vendor. It is possible to knobble some heartbeat designs with device interrupts, but you can design around that.

> I'm thinking of things like: system becomes catastrophically short of
> memory, swapping heavily, heartbeat code is swapped out somewhere,
> fails to get back in in time. I know these things can be coped with
> in theory, but are they in practice?

Any respectable heartbeat would allow for all of that, and Solaris provides all the infrastructure to avoid it all. You can place heartbeat code in realtime class, shelter interrupts, lock your mappings etc.

Gavin

Speaking for myself.
* Gavin Maltby wrote:
> If this happens then it's a bug and you should complain to the
> cluster vendor. It is possible to knobble some heartbeat designs
> with device interrupts, but you can design around that.

Yes, I'm not disputing it's a bug: I'm just wondering whether it can happen in current cluster systems. Complaining to the vendor may not get a fix to something like this very fast, if it happens, because I suspect it's not something a three-line change will fix.

And just to play devil's advocate: *is* it a bug? We've all dealt with systems where some horrible thing has happened (runaway memory use, fork bomb or what have you), and while the system is in theory still up, it's not actually doing any useful work or likely to any time soon, and the kindest thing is to put it out of its misery with the big red button. OK, *those* problems can, in theory, be worked around as well, but they still happen. Maybe the right thing in that case is for the cluster to just decide that machine is gone, and fail over? Of course it should probably make a conscious decision (`load average is 903, memory shortfall is 20GB, last user code ran 10mins ago, time to die') rather than just failing to respond...

--tim
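The "conscious decision" suggested above could look something like the sketch below. Everything in it is an assumption for illustration: the names `NodeHealth`, `Limits`, `failover_reasons` and `should_fail_over`, the threshold values, and the deliberately conservative rule that all three signals must agree before the node gives up. No cluster product is claimed to work this way.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class NodeHealth:
    load_average: float
    memory_shortfall_gb: float
    seconds_since_user_progress: float


@dataclass(frozen=True)
class Limits:
    # Hypothetical thresholds; real values would need tuning per site.
    max_load_average: float = 500.0
    max_memory_shortfall_gb: float = 10.0
    max_stall_seconds: float = 300.0


def failover_reasons(health: NodeHealth, limits: Limits = Limits()) -> List[str]:
    """List every limit the node currently exceeds, in human-readable form."""
    reasons = []
    if health.load_average > limits.max_load_average:
        reasons.append(f"load average is {health.load_average:g}")
    if health.memory_shortfall_gb > limits.max_memory_shortfall_gb:
        reasons.append(f"memory shortfall is {health.memory_shortfall_gb:g}GB")
    if health.seconds_since_user_progress > limits.max_stall_seconds:
        reasons.append(
            f"last user code ran {health.seconds_since_user_progress:g}s ago"
        )
    return reasons


def should_fail_over(health: NodeHealth, limits: Limits = Limits()) -> bool:
    """Assumed policy: give up only when all three signals agree."""
    return len(failover_reasons(health, limits)) == 3
```

Fed the figures from the post, `should_fail_over(NodeHealth(903, 20, 600))` is true under these made-up limits, while a node that is merely busy (high load, no memory shortfall, recent user progress) is left alone and the reasons list says why.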
In article <KNCpb.2510$mo7.75@newssvr25.news.prodigy.com>, Darren Dunham <ddunham@redwood.taos.com> did thusly spew forth:
>In comp.unix.solaris Thomas H Jones II <ferric@xanthia.com> wrote:
>What filesystem are you using: VxFS or UFS? If you're using UFS, are
>you mounting it with 'logging' enabled?

Two of the incidents in question were on Veritas Clusters. Veritas only supports the use of VxFS on their clusters. So, these incidents occurred on VxFS filesystems.

As a side note, all of our Solaris 7 or higher systems (which is probably better than 80%, at this point) have logging enabled as part of our default jumpstart.

-tom
In article <bo7td2$fac$1@new-usenet.uk.sun.com>, Gavin Maltby <G_a_v_i_n.M_a_l_t_b_y@sun.com> did thusly spew forth:
>Thomas H Jones II wrote:
>> Occasionally, if a node is really busy and panics, when the filesystem gets
>> re-assembled on the standby node, the fsck sometimes results in some of the
>> active/open files disappearing into the ether.
>
>Thinking from the Sun Cluster side, and I'm sure VCS too, no amount of
>regular system "busy-ness" should result in a cluster failover and panic.
>Unless, that is, your service monitors include a recipe for system being
>busy and deem that as a reason to failover (which should still be elegant
>without damaging filesystems).
>
>So if you're really witnessing unexplained "busy leads to panic/failover"
>get in contact with SunService for someone to investigate.

Err... looks like I need to clarify my wording a bit. Sorry.

At any rate, what I meant to communicate was: if a node in the cluster panics and happens to be very busy writing to database files, sometimes there will be loss of files on fail-over. In other words, it's not busyness causing the failover; it's simply that if a failover occurs while the filesystem is very active, the described lossage issues can manifest themselves.

-tom