Eliminate software disasters with a self-repairing OS

December 9th, 2013, Published in Articles: EngineerIT

by Sun Lee and Bruce Chen, Moxa

Although reliance on computers is now ubiquitous, in the industrial context these machines are literally of capital significance. Automated system crashes due to software failures are not mere annoyances; they are destructive events that cost money and may threaten a business’ survival.

A failure can occur at any point in a system at any time, so a large part of disaster planning is preparing for a timely recovery from catastrophic software failures. With a little ingenuity and automation, business losses due to software failure can largely become a thing of the past.

Failures in large, networked systems may even spread, increasing the costs and danger. For these reasons, system recovery has become a pressing concern for any business that relies on extensive automation, but is of the greatest importance in remote installations like oil or gas pipelines, mines, offshore environments, or power substations. In places such as these, timely arrival at the equipment site may be difficult or even impossible, so effective disaster recovery plans that take these limitations into account are imperative. For software failures, automation is the answer. The key is to streamline system recoveries so they are rapid, simple, and reliable (particularly for mass deployments) and do not require the on-site presence of maintenance staff.

This paper shows how easily an automated software recovery system can be implemented by exploring how it is designed and operates.

The challenge posed by software failures

System crashes attributable to software failure are far more common than business owners expect, with the worst crashes occurring at remote, hard-to-get-to field sites or across massive, networked deployments. Without properly planned advance preparations in place, either is a nightmare.

Take an environmental monitoring system as an example: such systems often require dozens of computers that connect and manage a wide variety of devices and sensors. When one computer crashes, the data it manages cannot be transmitted to the control centre. Even though some computers are equipped with extra storage capacity for these eventualities, these measures are only temporary, and only meant as a last resort. There is no guarantee that the data will be properly maintained, and the longer recovery is put off, the more likely it becomes that storage capacity is exceeded, drives fail, or other disastrous events occur.
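To put a rough number on how long such a local buffer can last, the following Python sketch divides the free storage by the incoming data rate. The function name and the figures in the example are illustrative assumptions, not measurements from a real deployment.

```python
def hours_until_buffer_full(free_bytes: int, bytes_per_second: float) -> float:
    """Estimate how many hours of sensor data the local buffer can hold."""
    if bytes_per_second <= 0:
        raise ValueError("data rate must be positive")
    return free_bytes / bytes_per_second / 3600


# Hypothetical example: 4 GiB of spare storage, sensors producing 2 KiB/s.
hours = hours_until_buffer_full(4 * 1024**3, 2 * 1024)
print(f"Buffer lasts about {hours / 24:.1f} days")
```

With these assumed figures the buffer fills in a little over three weeks, which gives an idea of the window available for recovery before data is lost.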

Fig. 1: Smart recovery concept.

Up until relatively recently, the only way to cope with any failure, whether hardware or software, was to send an engineer to the field site to troubleshoot and fix the problem. Repairs ranged from simple tweaks to firmware upgrades or complete system re-installs. With the time spent travelling and troubleshooting added in, such procedures could be quite time-consuming; if the technician also needed to restore or recover lost data, the complexity and cost of the event grew very quickly.

Mass deployments complicate things even further. The challenge faced by an engineer single-handedly troubleshooting a large deployment is multiplied by the number of disabled machines, and if the deployment involves many different computing platforms, the problems compound further. Repairs may require bringing all the computers back to the staff worksite, either to perform recoveries one-by-one or to return them to the manufacturer for replacement or repair. Regardless of how the recovery proceeds, it remains an expensive and troublesome task.

Fig. 2: A flowchart showing how an automated software recovery system is engineered.

With computers now forming such a critical part of our economic and business infrastructure, the losses from system downtime are so far-reaching they are hard to calculate. Catastrophic system failures affect productivity, revenue, and reputation, and can even provoke legal action. With the dangers so great, the threat of software failures should be a priority concern for every business.

The automated recovery solution

With the potential costs in time and money so dear, it is clear that every business needs to make preparations for that inevitable, cataclysmic system crash. The need is even more imperative for businesses which rely heavily on automation. Fortunately, there already exists a simple, effective means of coping with software failure: automated recovery systems.

An automated system recovery is built around three fundamental points:

  • Preparation of an independent storage device on which a copy of the entire system may be stored alongside an automated recovery program.

  • Configuration of the computer so that, should the system attempt to reboot and fail, the BIOS will automatically switch to recovery mode and perform a complete system re-write.

  • When the system recovery is done, the BIOS will once again re-start the computer, this time loading the newly re-written operating system from the original, primary storage drive.
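The three points above can be pictured as a simple sequence of events. The Python sketch below is a toy model only: the `Computer` class, the `power_on` function and the log messages are hypothetical names chosen for illustration, and no real BIOS or disk is involved.

```python
from dataclasses import dataclass, field


@dataclass
class Computer:
    primary_bootable: bool                   # can the OS on the primary drive start?
    log: list = field(default_factory=list)  # record of what happened


def power_on(pc: Computer) -> list:
    """Walk through the three points above for a single power-on."""
    if not pc.primary_bootable:
        # Point 2: the failed boot makes the BIOS switch to recovery mode.
        pc.log.append("boot failed, entering recovery mode")
        # Point 1: the stored image is copied back over the primary drive.
        pc.log.append("rewriting primary drive from stored image")
        pc.primary_bootable = True
        # Point 3: the BIOS restarts the computer from the primary drive.
        pc.log.append("restarting from primary drive")
    pc.log.append("operating system running")
    return pc.log


for line in power_on(Computer(primary_bootable=False)):
    print(line)
```

A healthy machine passes straight through to the last message; a machine with a corrupted primary drive takes the recovery detour first.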

Fig. 3: Automated software recovery comes into its own when restoring systems across complex mass deployments with a wide variety of OS and application configurations.

To achieve a fully automated recovery, the computer’s BIOS must be modified. First, the BIOS should monitor the computer as it reboots. By implementing a reboot timeout and a user-defined recovery flag, the system recovery may be configured to begin whenever the computer fails to achieve a full reboot within a specified time. For the sake of argument, let’s say the timeout is 40 seconds. Once the BIOS sees 40 seconds pass without receiving the user-defined recovery flag, it automatically initiates the system recovery process by rebooting the system from the secondary, backup storage device. The backup device then loads the recovery system, which overwrites the primary device with a complete, pre-recorded system image while preserving localised data and system preferences. Then, a reboot is attempted.
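The timeout logic can be sketched in Python with a simulated clock. Real BIOS firmware is not written this way; the class name, the method names and the idea of passing in the current time are assumptions made for illustration, and only the 40 second figure comes from the text.

```python
class BootWatchdog:
    """Toy model of a BIOS timer that waits for a 'boot succeeded' flag."""

    def __init__(self, timeout_s: float = 40.0):
        self.timeout_s = timeout_s
        self.started_at = None
        self.flag_received = False

    def start(self, now: float) -> None:
        """Called when a boot begins; 'now' is the current time in seconds."""
        self.started_at = now
        self.flag_received = False

    def signal_success(self) -> None:
        """Called by the booted system to send the recovery flag."""
        self.flag_received = True

    def should_recover(self, now: float) -> bool:
        """True once the timeout has passed without the flag arriving."""
        if self.started_at is None or self.flag_received:
            return False
        return now - self.started_at >= self.timeout_s


watchdog = BootWatchdog()
watchdog.start(now=0.0)
print(watchdog.should_recover(now=39.0))  # still inside the timeout
print(watchdog.should_recover(now=40.0))  # timeout reached, no flag seen
```

Passing the time in explicitly keeps the sketch deterministic; a real implementation would read a hardware timer instead.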

In the event of a successful auto recovery, the primary user space applications will initialise; at this point, the recovery flag must be sent. Once the BIOS receives the flag, it resets the recovery process and the computer returns to its normal operating state. Should the flag not be sent, the BIOS will reinitialise the entire recovery process by once again restarting the computer and launching the automated recovery system. Should the recovery process continue to fail, system administrators at a remote site will know that the system is suffering from a hardware failure, or from some other malfunction unrelated to software.
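The retry behaviour described here can be expressed as a short loop. In this Python sketch the function name, the two callbacks and the limit of three recovery attempts are all assumptions; the text does not say how many failed recoveries should be tolerated before hardware is suspected.

```python
from typing import Callable


def run_recovery_cycle(
    attempt_boot: Callable[[], bool],
    restore_image: Callable[[], None],
    max_recoveries: int = 3,  # assumed limit, not taken from the article
) -> str:
    """Return 'normal', 'recovered' or 'suspect-hardware'."""
    if attempt_boot():
        return "normal"              # flag received on the first boot
    for _ in range(max_recoveries):
        restore_image()              # rewrite the primary drive
        if attempt_boot():
            return "recovered"       # flag received after a recovery
    # Repeated recoveries did not help, so software is unlikely to be the cause.
    return "suspect-hardware"


# Simulated machine: fails, fails again after one restore, then boots.
boots = iter([False, False, True])
print(run_recovery_cycle(lambda: next(boots), lambda: None))  # prints: recovered
```

Returning a status string rather than looping forever gives a remote administrator something concrete to act on.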

One key component of automated software recovery is the preparation of the system image, and in particular the choice of storage device. This device may be a disk on module, a second storage device like a CompactFlash or SD card, or even a removable USB drive. Any of these storage media may be configured in the BIOS as the image storage device. Even if the computer comes with only a single storage medium, users may still partition the main drive into two sections, configuring one partition to serve the root system and reserving the other solely for storing the system image.
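Choosing among these options can be reduced to a simple preference list. The Python sketch below is a minimal illustration; the device labels and, in particular, the order of preference are assumptions, since the text lists the options without ranking them.

```python
from typing import Iterable, Optional

# Assumed order of preference; dedicated devices first, shared drive last.
PREFERENCE = (
    "disk-on-module",
    "compactflash",
    "sd-card",
    "usb-drive",
    "second-partition",
)


def pick_image_store(available: Iterable[str]) -> Optional[str]:
    """Return the most preferred image storage option that is present."""
    present = set(available)
    for device in PREFERENCE:
        if device in present:
            return device
    return None  # nothing suitable found on this computer


print(pick_image_store(["usb-drive", "sd-card"]))
```

A computer with only its main drive would report `second-partition` as its sole option, matching the fallback described above.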

In addition to these on-board storage devices, there is another option. If the host computer is connected to a network, then the system recovery may be implemented automatically over the LAN or even the Internet. This is of considerable benefit when automating recoveries across large-scale deployments. For such network recovery configurations, a server is designated as the recovery image repository, and all remote computers are configured to look up the image repository automatically once automated recovery has been initialised. Again, this is achieved by configuring the BIOS for a network boot via PXE. Using the FOSS disaster recovery tool Clonezilla, it can be implemented with secure, proven, and reliable software.
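On the server side, the repository needs some way of matching each remote computer to the right image. The Python sketch below shows one possible lookup; it is not Clonezilla or PXE configuration, and the catalogue contents, model names and file names are hypothetical.

```python
from typing import Dict, Tuple

# Hypothetical catalogue on the recovery server: (hardware model, OS) -> image file.
IMAGE_CATALOGUE: Dict[Tuple[str, str], str] = {
    ("model-a", "linux"): "model-a-linux.img",
    ("model-a", "windows-embedded"): "model-a-win.img",
    ("model-b", "linux"): "model-b-linux.img",
}


def image_for(model: str, os_name: str) -> str:
    """Return the image file registered for a given model and OS."""
    try:
        return IMAGE_CATALOGUE[(model, os_name)]
    except KeyError:
        raise LookupError(
            f"no recovery image registered for {model}/{os_name}"
        ) from None


print(image_for("model-b", "linux"))
```

Keying the catalogue on both hardware model and operating system reflects the mixed deployments where network recovery is most useful.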

The benefits of smart recovery

According to PricewaterhouseCoopers’ Information Security Breaches Survey 2010, four-fifths of respondents that suffered a breach had a contingency plan in place, up from three-fifths two years earlier. However, roughly a quarter of these plans proved ineffective at addressing the incident. In the past, large companies have been better at contingency planning than small ones; this gap now appears to have closed. The incidents with the highest total cost were those without an effective contingency plan.

To summarise, the benefits of an automated recovery system are so great that it is hard to estimate precisely how valuable it can be. Such automated systems can prevent downtime caused by minor annoyances such as file system corruption, I/O errors, kernel crashes, and other niggling hazards. In turn, this lowers the RMA rate for online devices, and significantly reduces system maintenance effort and costs. Finally, by providing a large-scale, networked, fully automated system recovery process for mass deployments, installation uptimes are significantly increased. Moxa’s Rcore technology, smart recovery for industrial computers, is a reliable solution that increases system stability while significantly cutting costs from maintenance and downtime.

Contact Sales, RJ Connect, Tel 011 781-0777, info@rjconnect.co.za