News | June 18, 1999

The Softer Side of Hot-Swap

Source: Enea OSE Systems, Inc.

CompactPCI's high-availability requirements present an interesting challenge for RTOS designers.

By: Robert Largren

For years, developers have used proprietary technology to create hot-swappable applications, allowing users to remove, replace and upgrade hardware and software at run time without lapses in system availability. While CompactPCI is a key component of the "Plug and Play" concept and has taken the lead in creating a standard solution that supports run-time hardware swapping, significant challenges remain on the software side. The hot-swap portion of the CompactPCI specification defines pin sequencing and other enabling hardware technologies, as well as the software architecture required to support live insertion and extraction of boards in a running CompactPCI system. However, removing a CompactPCI card also removes the software components running on that card. Well-designed system software must provide the intelligence to monitor and deal with the removal or failure of dependent software components. A real-time operating system must have the appropriate architecture and features to fully support hot swapping and enable continuous operation (high availability).

Hot-swapping system design presents significant obstacles to high-availability CompactPCI configurations, where the system must avoid failures that can cause enormous losses in revenue and/or safety degradation. For example, when swapping boards and switching original executing software to an upgraded version, the transaction must take place without disrupting the application's duties and timing requirements. An upgrade operation needs to include stopping the existing software, starting the new software, notifying the parts of the system that are not being upgraded as to when they can no longer communicate with the original software, and finally, helping them locate and establish communication with the upgraded version.
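The upgrade steps listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not code from any RTOS: the names Service, Registry and upgrade are hypothetical, and the event strings "unavailable" and "available" are invented for the example.

```python
class Service:
    """A software component running on a board (hypothetical model)."""

    def __init__(self, name, version):
        self.name = name
        self.version = version
        self.running = False

    def start(self):
        self.running = True

    def stop(self):
        self.running = False


class Registry:
    """Tracks the active instance of each service and the parts that depend on it."""

    def __init__(self):
        self.active = {}   # service name -> Service currently in use
        self.clients = {}  # service name -> callbacks of dependent parts

    def subscribe(self, name, callback):
        self.clients.setdefault(name, []).append(callback)

    def notify(self, name, event, service):
        for callback in self.clients.get(name, []):
            callback(event, service)


def upgrade(registry, name, new_service):
    """Replace the active instance of `name` with `new_service`."""
    old = registry.active.get(name)
    if old is not None:
        # Tell the rest of the system it can no longer talk to the original.
        registry.notify(name, "unavailable", old)
        old.stop()
    new_service.start()
    registry.active[name] = new_service
    # Help the rest of the system locate the upgraded version.
    registry.notify(name, "available", new_service)
```

In this sketch the dependent parts never look up a board or an address; they only react to the two events, which is what lets the rest of the system keep running while one component is exchanged.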

On the hardware side, special circuitry (hot-plug connectors) must be developed so that a board can be inserted into and removed from a live, operating PCI bus. DC power to boards being hot swapped must be properly ramped up and down to avoid introducing "glitches" or voltage spikes onto the system's power bus. In addition, dual-bus architecture and redundant hardware remain a necessity in order to remove failed or outdated devices while the system keeps running. The CompactPCI model handles all these situations.

On the software side, application software and operating system code must be able to recognize when a board is removed and another inserted, and take appropriate action. Dedicated watchdog modules are required to reset processes when faults are detected and to alert the rest of the system to any problems. Mirror processes or modules also are a necessity to reestablish links or continue operating if parts of the system fail, are removed, or are unavailable. In addition, the software must be able to download device drivers at run time.
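A watchdog of the kind described above can be modeled as a table of last-seen times that is checked periodically. The sketch below is a simplified assumption of how such a module might work; the Watchdog class, its kick and check methods, and the heartbeat scheme are all hypothetical. Time is passed in explicitly so the behavior is easy to follow.

```python
class Watchdog:
    """Flags processes that have not reported in within `timeout` seconds."""

    def __init__(self, timeout, on_fault):
        self.timeout = timeout
        self.on_fault = on_fault  # called with the name of each faulted process
        self.last_seen = {}       # process name -> time of last heartbeat

    def kick(self, process, now):
        """Record a heartbeat from `process` at time `now`."""
        self.last_seen[process] = now

    def check(self, now):
        """Alert on every overdue process and return their names."""
        faulted = [name for name, seen in self.last_seen.items()
                   if now - seen > self.timeout]
        for name in faulted:
            self.on_fault(name)
            # Assume the handler resets the process; give it a fresh deadline.
            self.last_seen[name] = now
        return faulted
```

The on_fault callback is where a real system would reset the process and alert the rest of the system; here it is left to the caller.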

To maximize hardware/software coordination when building high-availability applications capable of hot-swapping on CompactPCI, a system is best designed using a top-down approach that focuses primarily on the application and its requirements.

An important requirement will be check-pointing, which allows for the automatic detection of faults and the notification of the application when a fault occurs. This way, the user does not have to take the time to search for faults. For example, even if a link to another board is down, each board can be automatically notified, and recovery can be initiated asynchronously relative to the application.

More importantly, the system must include extensive and reliable memory protection. Ideally, memory should be protected in conjunction with a memory management unit (MMU). Processes should be grouped into blocks that can be isolated and protected from other parts of the application (Figure 1). This way, if an application generates a memory address that attempts to write to kernel memory or another process's memory, the MMU will detect the fault and raise an error, and the offending process can be restarted without disabling the system.
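The isolation idea can be illustrated with a small software model. Real protection is enforced by MMU hardware, so the following Python sketch only mimics the outcome: a write outside a block's own region raises a fault, and only the offending block is restarted. Block, ProtectionFault, checked_write and run_protected are hypothetical names chosen for the example.

```python
class ProtectionFault(Exception):
    """Raised when a block writes outside its own memory region."""


class Block:
    """A group of processes with its own address range (simplified model)."""

    def __init__(self, name, start, size):
        self.name = name
        self.start = start
        self.end = start + size
        self.memory = {}   # address -> value, standing in for real memory
        self.restarts = 0

    def owns(self, address):
        return self.start <= address < self.end


def checked_write(block, address, value):
    """Stand-in for the MMU check: a block may write only inside its region."""
    if not block.owns(address):
        raise ProtectionFault(f"{block.name} wrote outside its region at {address:#x}")
    block.memory[address] = value


def run_protected(block, writes):
    """Apply the writes; on a fault, restart only the offending block."""
    try:
        for address, value in writes:
            checked_write(block, address, value)
    except ProtectionFault:
        block.memory.clear()  # the block starts over with clean state
        block.restarts += 1
        return False
    return True
```

The point of the model is that a stray write never reaches the other block's memory, so the fault costs one restart of one block and the remaining blocks are untouched.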

An often overlooked part of controlling memory usage is the need to clean up memory after program loading or unloading. In other words, when an application is upgraded or changed in some manner, whether mostly on the hardware side, mostly on the software side, or on both, there is a strong likelihood that certain processes will continue to hold memory that is no longer needed. This problem may be insignificant over the course of a single upgrade operation, but given the size of the systems that often use CompactPCI, it is not unusual to have hundreds of upgrades or maintenance operations in a relatively short period. Multiplying a small memory leak hundreds or thousands of times can add up to a debilitating problem, resulting in costly shutdowns or crashes.

Proprietary and commercial RTOSs provide a variety of ways to handle these critical memory issues. Enea OSE Systems' OSE RTOS offers a Memory Management System (MMS) that uses the block and segment methodology (Figure 1). To enhance the system's capabilities, the MMS works in conjunction with a Program Handler that is charged with carrying out the essential tasks of loading and unloading programs. This way, memory protection, memory allocation and memory de-allocation all become an integral part of the hot-swapping process.
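One common way to guarantee cleanup on unload is to record every allocation against the program that made it, so that unloading the program releases everything it still holds. The sketch below shows that general technique only; it is not the OSE MMS or Program Handler interface, and Pool, alloc and unload are hypothetical names.

```python
class Pool:
    """A fixed-size memory pool that tracks allocations per program."""

    def __init__(self, size):
        self.size = size
        self.owned = {}  # program name -> sizes of its live allocations

    def used(self):
        return sum(sum(sizes) for sizes in self.owned.values())

    def alloc(self, program, nbytes):
        if self.used() + nbytes > self.size:
            raise MemoryError("pool exhausted")
        self.owned.setdefault(program, []).append(nbytes)

    def unload(self, program):
        """Reclaim everything the program still holds; return bytes freed."""
        return sum(self.owned.pop(program, []))
```

To see why this matters, suppose each upgrade strands just 16 bytes. After 200 upgrades that is 3,200 bytes, more than three times a 1,000-byte pool. With owner-tracked reclamation, the pool returns to zero usage after every unload, regardless of how many upgrade cycles have run.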

Proper memory handling must work in conjunction with seamless communication between the parts of the system that are running and the parts that are being upgraded or replaced. During a hot-swap operation, new software operating on the replaced or upgraded board must automatically re-establish communication with the rest of the system, and vice versa. While the system keeps running, the new software should be loaded without interruption. At times, the new software must execute in parallel with the older version in order to provide the most rapid switchover during the transition. Then, facilitated by the system's memory-handling capabilities, resources allocated by the older version are automatically reclaimed and put to use for the new version.

In the OSE RTOS, much of the effectiveness of such an operation can be attributed to the duplication and check-pointing of primary processes, provided in this case by the OSE Link Handler. The Link Handler allows interprocess communication to happen in the same manner whether the application uses a single CPU or a vast number of CPUs distributed throughout a system.

For the application designer, the Link Handler allows the use of "logical" versus "physical" channels. In other words, the designer need only think of which processes must communicate, not how they must communicate. To do this, the Link Handler creates local images of processes happening in remote locations, on other CPUs (Figure 2). The application designer need only deal with the "logical" channel and not worry about where particular processes are located in the distributed system. However, the processes need to know when anything happens that breaks the corresponding physical channel – for example, if a board is removed during an upgrade operation. To make this possible, the local image of a process will only exist when the entire physical channel is functional, such that notification is only necessary when the local image is removed. In this case, supervising a process becomes a local request, easily handled by the operating system. Using local images, the Link Handlers can establish complete supervision of the logical channels between processes.
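The local-image idea can be modeled as follows. This is a conceptual sketch under stated assumptions, not the OSE Link Handler API: the LinkHandler class here, its method names, and the link and process names in the comments are all hypothetical. A local image exists only while its physical link is up, so supervising a remote process reduces to watching for the removal of a local entry.

```python
class LinkHandler:
    """Keeps local images of remote processes, one per functional link."""

    def __init__(self):
        self.images = {}       # remote process name -> name of the link it uses
        self.supervisors = {}  # remote process name -> callbacks to notify

    def link_up(self, link, remote_processes):
        """Create local images for the processes reachable over `link`."""
        for name in remote_processes:
            self.images[name] = link

    def supervise(self, name, callback):
        """Ask to be told when the local image of `name` disappears."""
        if name not in self.images:
            raise LookupError(f"no local image for {name}")
        self.supervisors.setdefault(name, []).append(callback)

    def send(self, name, message, transport):
        """Send by logical name; the physical link is looked up internally."""
        if name not in self.images:
            raise LookupError(f"no local image for {name}")
        transport(self.images[name], name, message)

    def link_down(self, link):
        """Remove the images that used `link` and notify their supervisors."""
        lost = [name for name, used in self.images.items() if used == link]
        for name in lost:
            del self.images[name]
            for callback in self.supervisors.pop(name, []):
                callback(name)
        return lost
```

The sender addresses a process by name only. When a board is pulled, link_down removes the affected images, and every process that asked to supervise one of them is notified locally.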


OSE's Link Handler is only one component of a system that supports hot-swapping in CompactPCI. The system also should have an open interface to protocol stacks, to allow for any kind of physical link and transmission protocol (Figure 3).

CompactPCI architecture has been carefully planned to simplify the integration, exchange and upgrade of peripheral devices. However, while the power and multi-length signal pins on the CompactPCI connector are staged to support hardware hot swapping, it is the RTOS that enables complete hardware and software hot-swap capability and allows design engineers to build high-availability systems quickly and efficiently. The design team using CompactPCI should ensure at the outset that the RTOS to be built or bought is fully able to support hot-swapping. Otherwise, the final application will not benefit from CompactPCI's technology advantages, which were designed specifically for telecommunications and other applications requiring high-speed computing and modular, robust packaging. By spending the time to identify the right RTOS for the application, be it proprietary or commercial, the user will benefit from reduced time-to-market, reduced costs and a more powerful, reliable system.

Contributed by: Enea OSE Systems Inc., 5949 Sherry Lane, Suite 625, Dallas, TX 75225. Tel: (214) 346-9339; Fax: (214) 346-9344; Email: info@enea.com