Flexible Rollback Recovery in Dynamic Heterogeneous Grid Computing (2009)

Abstract

Large applications executing on Grid or cluster architectures consisting of hundreds or thousands of computational nodes create problems with respect to reliability. The sources of the problems are node failures and the need for dynamic configuration over extensive runtime. This paper presents two fault-tolerance mechanisms called Theft-Induced Checkpointing and Systematic Event Logging. These are transparent protocols capable of overcoming problems associated with both benign faults, i.e., crash faults, and node or subnet volatility. Specifically, the protocols base the state of the execution on a dataflow graph, allowing for efficient recovery in dynamic heterogeneous systems as well as multithreaded applications. By allowing recovery even under different numbers of processors, the approaches are especially suitable for applications with a need for adaptive or reactionary configuration control. The low-cost protocols offer the capability of controlling or bounding the overhead. A formal cost model is presented, followed by an experimental evaluation. It is shown that the overhead of the protocol is very small, and that the maximum work lost by a crashed process is small and bounded.

Existing System:

  Network routers occupy a unique role in modern distributed systems. They are responsible for cooperatively shuttling packets amongst themselves in order to provide the illusion of a network with universal point-to-point connectivity. However, this illusion is shattered - as are implicit assumptions of availability, confidentiality, or integrity - when network routers are subverted to act in a malicious fashion. By manipulating, diverting, or dropping packets arriving at a compromised router, an attacker can trivially mount denial-of-service, surveillance, or man-in-the-middle attacks on end host systems.

Consequently, Internet routers have become a choice target for would-be attackers, and thousands have been subverted to these ends. In this paper, we specify the problem of detecting routers with incorrect packet forwarding behavior and explore the design space of protocols that implement such a detector. We further present a concrete protocol that is likely inexpensive enough for practical implementation at scale. Finally, we present a prototype system, called Fatih, that implements this approach on a PC router, and we describe our experiences with it. We show that Fatih is able to detect and isolate a range of malicious router actions with acceptable overhead and complexity. We believe our work is an important step toward tolerating attacks on key network infrastructure components.

Proposed System:

  We have designed, developed, and implemented a compromised router detection protocol that dynamically infers, based on measured traffic rates and buffer sizes, the number of congestive packet losses that will occur.

  Once the ambiguity from congestion is removed, subsequent packet losses can be attributed to malicious actions. We have tested our protocol in Emulab and have studied its effectiveness in differentiating attacks from legitimate network behavior.

Modules:

1.     Network Module

2.     Logging Module

3.     Check-pointing Module

4.     Work Stealing Module

5.     Fault and Fault Free Module

Module Description:

Network Module

Client-server computing is a distributed application architecture that partitions tasks or workloads between service providers, called servers, and service requesters, called clients. Clients and servers often run on separate hardware and communicate over a computer network. A server machine is a high-performance host running one or more server programs that share its resources with clients. A client does not share any of its resources; it requests a server's content or service. Clients therefore initiate communication sessions with servers, which await (listen for) incoming requests.
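The division of roles described above can be illustrated with a minimal loopback sketch. It is not the project's network module: the class name EchoDemo, the method roundTrip, and the "echo: " reply format are hypothetical names chosen for illustration. The only point it makes is that the server listens and the client initiates the session.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

/** Hypothetical demo: one listening server and one initiating client on the loopback interface. */
final class EchoDemo {

    /** Starts a one-shot server on an OS-chosen port, sends one request, and returns the reply. */
    static String roundTrip(String request) {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        try (ServerSocket server = new ServerSocket(0, 1, loopback)) {
            // The server only waits (listens) for an incoming request.
            Thread serverThread = new Thread(() -> serveOnce(server));
            serverThread.start();

            String reply;
            // The client initiates the communication session.
            try (Socket socket = new Socket(loopback, server.getLocalPort());
                 PrintWriter out = new PrintWriter(
                         new OutputStreamWriter(socket.getOutputStream(), StandardCharsets.UTF_8), true);
                 BufferedReader in = new BufferedReader(
                         new InputStreamReader(socket.getInputStream(), StandardCharsets.UTF_8))) {
                out.println(request);
                reply = in.readLine();
            }
            serverThread.join();
            return reply;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for the server", e);
        }
    }

    /** Accepts exactly one connection and answers one line with an "echo: " prefix. */
    private static void serveOnce(ServerSocket server) {
        try (Socket socket = server.accept();
             BufferedReader in = new BufferedReader(
                     new InputStreamReader(socket.getInputStream(), StandardCharsets.UTF_8));
             PrintWriter out = new PrintWriter(
                     new OutputStreamWriter(socket.getOutputStream(), StandardCharsets.UTF_8), true)) {
            out.println("echo: " + in.readLine());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Calling EchoDemo.roundTrip("hello") returns "echo: hello". Binding to port 0 lets the operating system pick a free port, which avoids assuming any particular port number.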

Logging Module

Logging can be classified as pessimistic, optimistic, or causal. It is based on the fact that the execution of a process can be modeled as a sequence of state intervals. Execution within a state interval is deterministic, but each state interval is initiated by a nondeterministic event. Now assume that the system can capture and log sufficient information about the nondeterministic events that initiate the state intervals. This is called the piecewise deterministic (PWD) assumption. Under it, a crashed process can be recovered by 1) restoring it to its initial state and 2) replaying the logged events to it in the same order in which they appeared in the execution before the crash. To avoid a rollback to the initial state of a process and to limit the number of nondeterministic events that need to be replayed, each process periodically saves its local state. Log-based mechanisms in which the only nondeterministic events in a system are message receptions are usually referred to as message logging.
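The two recovery steps under the PWD assumption can be sketched as follows. This is a simplified illustration rather than the paper's Systematic Event Logging protocol. The class LoggedProcess, its methods, and the arithmetic state transition are hypothetical, and an in-memory list stands in for stable storage.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical process whose only nondeterministic events are received messages. */
final class LoggedProcess {
    private long state = 0;               // local state, starting from the initial state
    private final List<Integer> eventLog; // stands in for a log kept on stable storage

    LoggedProcess(List<Integer> stableLog) {
        this.eventLog = stableLog;
    }

    /** Logs the nondeterministic event first, then runs the deterministic state interval. */
    void receive(int message) {
        eventLog.add(message);
        apply(message);
    }

    /** Deterministic and order-sensitive: the same messages in another order give another state. */
    private void apply(int message) {
        state = state * 31 + message;
    }

    long state() {
        return state;
    }

    /** Recovery: 1) start from the initial state, 2) replay the logged events in their original order. */
    static LoggedProcess recover(List<Integer> stableLog) {
        LoggedProcess recovered = new LoggedProcess(new ArrayList<>(stableLog));
        for (int message : stableLog) {
            recovered.apply(message); // replay only; the events are already in the log
        }
        return recovered;
    }
}
```

After receive(3), receive(1), and receive(4), the state is 2918, and LoggedProcess.recover applied to the same log rebuilds exactly that value. Periodically saving the local state, as the text describes, would let recovery replay only the events logged after the last save.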

Check-pointing Module

Rather than logging events, checkpointing relies on periodically saving the state of the computation to stable storage. If a fault occurs, the computation is restarted from one of the previously saved states. Since the computation is distributed, one has to consider the tradeoff space of local and global checkpointing strategies and their resulting recovery cost. Checkpointing-based methods thus differ in the way processes are coordinated and in the derivation of a consistent global state. The consistent global state can be achieved either at the time of checkpointing or at the time of rollback recovery. The two approaches are called coordinated and uncoordinated checkpointing, respectively.
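The basic save-and-restart cycle can be sketched for a single process. This does not implement Theft-Induced Checkpointing or any coordination between processes. The class CheckpointedSum, the text file format, and the simulated crash are assumptions made for illustration, and the code needs Java 11 or later for Files.readString and Files.writeString.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical single-process example: sums 1..n and checkpoints its progress to a file. */
final class CheckpointedSum {

    /** Resumes from the last checkpoint, saves every `interval` steps, crashes at `crashAt` (0 = never). */
    static long run(Path checkpoint, int n, int interval, int crashAt) {
        long[] saved = load(checkpoint); // {next step to execute, partial sum so far}
        long sum = saved[1];
        for (int step = (int) saved[0]; step <= n; step++) {
            if (step == crashAt) {
                throw new IllegalStateException("simulated crash at step " + step);
            }
            sum += step;
            if (step % interval == 0) {
                save(checkpoint, step + 1, sum);
            }
        }
        return sum;
    }

    private static long[] load(Path file) {
        try {
            if (!Files.exists(file)) {
                return new long[] {1, 0};
            }
            String text = Files.readString(file).trim();
            if (text.isEmpty()) {
                return new long[] {1, 0}; // no checkpoint yet: start from the initial state
            }
            String[] parts = text.split(" ");
            return new long[] {Long.parseLong(parts[0]), Long.parseLong(parts[1])};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private static void save(Path file, int nextStep, long sum) {
        try {
            // Simplification: a real implementation would write atomically to stable storage.
            Files.writeString(file, nextStep + " " + sum);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Runs once with a crash at step 57, then restarts from the last saved state. */
    static long demo() {
        try {
            Path file = Files.createTempFile("checkpoint", ".txt");
            try {
                run(file, 100, 10, 57);
            } catch (IllegalStateException crash) {
                // The process died; only the checkpoint file survives.
            }
            long result = run(file, 100, 10, 0);
            Files.deleteIfExists(file);
            return result;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

CheckpointedSum.demo() returns 5050. The restart resumes at step 51, so only steps 51 to 56 are recomputed, which shows how the checkpoint interval bounds the work lost to a crash.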

Work Stealing Module

The runtime environment and primary mechanism for load distribution is based on a scheduling algorithm called work-stealing. The principle is simple: when a process becomes idle, it tries to steal work from another process, called the victim. The initiating process is called the thief. Work-stealing is the only mechanism for distributing the workload that constitutes the application. From a practical point of view, the application starts with the process executing main(), which creates tasks. Typically, some of these tasks are then stolen by idle processes, which are either local or on other processors. Thus, the principal mechanism for dispatching tasks in the distributed environment is task stealing.
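The thief and victim roles can be sketched with threads standing in for processes. This is an illustrative assumption, not the project's runtime: the class WorkStealingDemo and its methods are hypothetical, tasks are plain integers to be squared, and tasks never create further tasks, which keeps termination detection trivial.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedDeque;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical work-stealing sketch: threads play the role of processes. */
final class WorkStealingDemo {

    /** Sums the squares of 1..taskCount using workerCount workers that steal from each other. */
    static long sumOfSquares(int taskCount, int workerCount) {
        List<ConcurrentLinkedDeque<Integer>> queues = new ArrayList<>();
        for (int i = 0; i < workerCount; i++) {
            queues.add(new ConcurrentLinkedDeque<>());
        }
        // As in the text, all tasks are created by the worker that runs main(): worker 0.
        for (int task = 1; task <= taskCount; task++) {
            queues.get(0).addLast(task);
        }

        AtomicLong total = new AtomicLong();
        List<Thread> workers = new ArrayList<>();
        for (int id = 0; id < workerCount; id++) {
            final int self = id;
            Thread worker = new Thread(() -> {
                while (true) {
                    Integer task = queues.get(self).pollLast(); // own work first
                    if (task == null) {
                        task = steal(queues, self);             // idle: act as a thief
                    }
                    if (task == null) {
                        return; // nothing left anywhere; valid because tasks spawn no new tasks
                    }
                    total.addAndGet((long) task * task);
                }
            });
            workers.add(worker);
            worker.start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for workers", e);
            }
        }
        return total.get();
    }

    /** Takes one task from the opposite end of the first victim that has work, or returns null. */
    private static Integer steal(List<ConcurrentLinkedDeque<Integer>> queues, int thief) {
        for (int victim = 0; victim < queues.size(); victim++) {
            if (victim == thief) {
                continue;
            }
            Integer task = queues.get(victim).pollFirst();
            if (task != null) {
                return task;
            }
        }
        return null;
    }
}
```

WorkStealingDemo.sumOfSquares(100, 4) returns 338350 regardless of which worker ends up executing which task. The owner takes tasks from one end of its deque and thieves take from the other, a common way to reduce contention between owner and thief.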

Fault and Fault Free Module

This module adds a checkpointing mechanism and analyzes the overhead it imposes on fault-free execution, which is of special interest because faults are considered the rare exception rather than the norm.

Hardware Requirements:

•         System       : Pentium IV 2.4 GHz.

•         Hard Disk    : 40 GB.

•         Floppy Drive : 1.44 MB.

•         Monitor      : 15-inch VGA Colour.

•         Mouse        : Logitech.

•         RAM          : 256 MB.

Software Requirements:

•         Operating System : Windows XP Professional.

•         Coding Language  : Java.
