
01-12-2008, 05:50 AM
Gavin Maltby
Re: Missing Files After Panic/Fsck?

Tim Bradshaw wrote:
> * Gavin Maltby wrote:
>
>
>>Thinking from the Sun Cluster side, and I'm sure VCS too, no amount of
>>regular system "busy-ness" should result in a cluster failover and panic.
>>Unless, that is, your service monitors include a recipe for system being
>>busy and deem that as a reason to failover (which should still be elegant
>>without damaging filesystems).

>
>
> Can't a stupidly highly-loaded system result in it failing to send
> heartbeats in time, and the cluster monitor then deciding it's gone
> away and failing over? Obviously a properly designed system wouldn't
> suffer from this - apart from anything else you'd put the heartbeat
> stuff in a scheduling class of its own so it can always
> get time, but I'm not convinced that this is always done right.


If this happens then it's a bug, and you should complain to the cluster vendor.
It is possible to starve some heartbeat designs with a flood of device
interrupts, but you can design around that.

> I'm thinking of things like: system becomes catastrophically short of
> memory, swapping heavily, heartbeat code is swapped out somewhere,
> fails to get back in in time. I know these things can be coped with
> in theory, but are they in practice?


Any respectable heartbeat would allow for all of that, and Solaris
provides all the infrastructure to avoid it. You can run the heartbeat
code in the realtime (RT) scheduling class, shelter its CPUs from device
interrupts, lock its mappings in memory, etc.

Gavin

Speaking for myself.
