Research Groups | Fault-tolerant Systems

The Fault Tolerance group focuses on the development of robust and reliable systems, mainly addressing problems that arise in distributed systems and networks.
The group assumes that both accidental faults and faults caused by intrusion may occur, and proposes solutions that offer properties such as availability, integrity, and confidentiality.



Research Themes

  • Distributed systems for high availability: Development of robust (failure-resilient) algorithms for distributed systems and clusters, and of mechanisms that support fault-tolerant operation, such as participant control, failure detection, and information storage, aiming at high performance, availability, and security.
  • Evaluation and experimental validation of safety in operation: Identification of the impact of failures on distributed architectures, clusters, grids, and peer-to-peer systems through fault injection, simulation, and testing, offering alternatives for achieving the desired availability.
  • Fault injection: Modeling, management, monitoring, and instrumentation of fault injection experiments. Development of new software injectors to evaluate the effect of communication failures in distributed systems. Reduction of the spatial and temporal intrusiveness of injectors on the target system.
  • Security of computer systems: Mechanisms for identifying malicious code and for monitoring and controlling communication between processes, to ensure system security in both conventional and wireless networks.
  • Security policies: Research and development of tools and techniques to ensure integrity, authenticity, confidentiality, and non-repudiation in processes that store and transmit data.
  • Cryptographic protocols: Application of symmetric-key and public-key encryption algorithms to ensure secure communication between processes and/or certified partners, including the use of digital certificates.
  • Live streaming: Research on mechanisms for disseminating data generated in real time over large-scale peer-to-peer networks, aiming to ensure the quality of the received information, to discourage non-cooperative behavior (free riders), and to build systems able to self-adapt to failures and to variation in their functional parameters.
  • Fault tolerance in networks: Use of fault tolerance techniques in system components to ensure the connectivity and reliability of the various services made available on networks. Use of peer-to-peer technology in network management, supported by group communication and fault tolerance techniques.

Recent Research Projects

  • AC-RS – Development of a Center for Digital Certification in the State of Rio Grande do Sul. Participants: UFRGS / Procergs / Banrisul / ACRS. Funded by: ITI. Started in 2006.
  • P2PSE – Support for P2P applications, especially multiplayer games, aiming at providing communications and safety middleware. Participants: UFRGS / UFSC / Mimetic Entretenimento. Funded by: FINEP/CT-Info. Started in 2006.
  • PROCAD – Large scale dynamic distributed systems – solutions and validation methods that provide a robust execution environment for distributed applications in large-scale, heterogeneous, and dynamic systems. Participants: UFRGS, UFCG, UFPR. Funded by: CAPES/Procad. Started in 2006.
  • Solutions for energy management through computer networks and for monitoring energy conditioning equipment through intranets and the Internet. Participants: UFRGS / CP Eletrônica. Funded by: CP Eletrônica. Started in 2005.
  • ACERTE – Certification of availability of mission critical systems. Participants: UFRGS/ UNICAMP/ UFSC. Funded by: CNPq. Concluded in 2006.
  • Benchmark of trust in the operation of software components. Participants: UNICAMP/ UFRGS / Universidade de Coimbra. Funded by: CAPES/Grices. Concluded in 2006.
  • CLP-SIL-2 – Programmable Logic Controllers with SIL 2 safety level – Application of safety standards to programmable logic controllers used in critical applications in the oil extraction industry. Calculation of the SIL safety level. Design of input and output modules that comply with safety standards. Participants: UFRGS (Embedded Systems and Fault Tolerance research groups), Altus. Funded by: FINEP/CT-Petro, RBT call for proposals. Concluded in 2006.
  • Dependable Grid – Development and validation of fault tolerance mechanisms for grid-type platforms. Funded by: HP Brazil. Concluded in 2005.

Results from Recent Research

  • Development of several fault injection tool prototypes, including INFIMO and ComFIRM, for real-time communication systems, DBMSs, and high-availability systems, used by the Cofimax company since 1995.
  • Development of management and monitoring software (such as CP Agent and CP Manager) for uninterruptible power supply (no-break) equipment. Technology transferred to CP Eletrônica S. A. through joint development.
  • Development of protocols for secure communication in ATMs and of subsystems (thermal printer) for use in bank tellers. Technology transferred to the Perto S. A. company.