==Fault detection==
A watchdog timer provides automatic detection of catastrophic malfunctions that prevent the computer from kicking it. However, computers can have other, less severe types of faults that do not interfere with kicking but still require watchdog oversight. To support these, a computer system is typically designed so that its watchdog timer will be kicked only if the computer deems the system functional. The computer determines whether the system is functional by conducting one or more fault detection tests, and it kicks the watchdog only if all tests have passed.<ref name="ganssle"/>
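
The "kick only if all tests pass" policy can be illustrated with a minimal Python sketch. The function name <code>service_watchdog</code> and the <code>tests</code> and <code>kick</code> parameters are hypothetical names chosen for illustration and do not come from any particular watchdog implementation; the sketch also assumes that a test which raises an exception should count as a failure.

```python
from typing import Callable, Iterable


def service_watchdog(
    tests: Iterable[Callable[[], bool]],
    kick: Callable[[], None],
) -> bool:
    """Kick the watchdog only if every fault detection test passes.

    Returns True if the watchdog was kicked, False otherwise.
    `tests` and `kick` are hypothetical callables supplied by the caller.
    """
    for test in tests:
        try:
            if not test():
                # One failed test is enough to withhold the kick.
                return False
        except Exception:
            # Assumption: a test that crashes is treated as a failed test.
            return False
    kick()
    return True
```

If this function is called periodically and any test fails, the kick is withheld and the watchdog is left to time out. With an empty list of tests, it reduces to unconditional kicking.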

[[File:Wdctl screenshot.png|upright=1.5|thumb|Screenshot of <code>[[wdctl]]</code>, a program that shows watchdog status]]
In computers that are running an operating system and multiple [[Process (computing)|processes]], a single, simple test might be insufficient to guarantee normal operation, as it could fail to detect a subtle fault condition and consequently kick the watchdog even though a fault condition exists. For example, in the case of the Linux operating system, a user-space watchdog [[Daemon (computing)|daemon]] may simply kick the watchdog periodically without performing any tests. As long as the daemon runs normally, the system will be protected against serious system crashes such as a [[kernel panic]]. To detect less severe faults, the daemon<ref name="LinuxWatchdogManpage"/> can perform tests that cover various aspects of the system condition, including resource availability (e.g., [[Computer memory|memory]], [[file handles]], CPU time), evidence of expected process activity (e.g., system daemons running, specific files being present or updated), overheating, and network activity.<ref name="LinuxWatchdogTests"/>
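
As a rough illustration of such tests, the following Python sketch checks memory availability and file freshness. It is not the code of the Linux watchdog daemon; the function names and thresholds are hypothetical, and the memory check assumes text in the format of the Linux <code>/proc/meminfo</code> file, which reports a <code>MemAvailable</code> value in kB.

```python
import os
import time


def parse_mem_available_kib(meminfo_text: str) -> int:
    """Return the MemAvailable value from /proc/meminfo-style text."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            # Line format: "MemAvailable:    123456 kB"
            return int(line.split()[1])
    raise ValueError("MemAvailable not found")


def memory_ok(meminfo_text: str, min_available_kib: int) -> bool:
    """Resource availability test: is enough memory available?"""
    return parse_mem_available_kib(meminfo_text) >= min_available_kib


def file_recently_updated(path: str, max_age_seconds: float) -> bool:
    """Process activity test: was the file modified recently enough?"""
    try:
        age = time.time() - os.stat(path).st_mtime
    except OSError:
        # A missing or unreadable file counts as a failed test.
        return False
    return age <= max_age_seconds
```

On a Linux system, the text passed to <code>memory_ok</code> would typically be read from <code>/proc/meminfo</code>, and the path given to <code>file_recently_updated</code> might be a heartbeat file that a monitored process is expected to rewrite periodically.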

Upon discovery of a failed test, the computer may attempt to perform a sequence of corrective actions under software control, culminating with a software-initiated reboot. If the software fails to invoke a reboot, the hardware watchdog timer, if available, will time out and invoke a hardware reset. In effect, this is a multistage watchdog timer in which the software comprises the first stage and the hardware WDT the final stage. In a Linux system, for example, the watchdog daemon can be configured to attempt to perform a software-initiated reboot, which may be preferable to a hardware reset as it allows file systems to be safely [[Mount (computing)|unmounted]] and fault information to be logged prior to the reboot. It is essential, however, to have the insurance provided by a hardware WDT, to allow for the case in which a fault causes the daemon itself to malfunction, and thus become unable to invoke a reboot.<ref name="ganssle"/>
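
The software stage of this escalation might be sketched in Python as follows. The function <code>handle_fault</code> and its parameters <code>repairs</code>, <code>system_ok</code>, and <code>reboot</code> are hypothetical names used only for illustration; the sketch assumes that the caller stops kicking the hardware watchdog whenever the result is anything other than <code>"recovered"</code>.

```python
from typing import Callable, Sequence


def handle_fault(
    repairs: Sequence[Callable[[], None]],
    system_ok: Callable[[], bool],
    reboot: Callable[[], None],
) -> str:
    """Software stage of a multistage watchdog (illustrative sketch).

    Tries each corrective action in order and re-runs the fault
    detection tests after each one. If none restores the system,
    requests a software-initiated reboot. If even that fails, the
    caller is expected to stop kicking the hardware watchdog so that
    it times out and resets the system.
    """
    for repair in repairs:
        try:
            repair()
            if system_ok():
                return "recovered"
        except Exception:
            # A failed repair or re-test moves on to the next action.
            continue
    try:
        reboot()
        return "reboot requested"
    except Exception:
        # Final stage: left to the hardware watchdog timer.
        return "awaiting hardware reset"
```

Returning a status rather than resetting directly keeps the final stage outside the software: if this code itself hangs or crashes, the hardware timer still expires because nothing kicks it.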