Annex: NFVI Faults
=================================================

Faults in the elements listed below must be reported to the Consumer immediately
so that it can take prompt action, such as live migration or a switch to a hot
standby entity. In addition, the Administrator of the host should trigger a
maintenance action, e.g., rebooting the server or replacing a defective hardware
element.

Faults differ in severity: critical, warning, or info. Critical faults require
immediate action because a severe degradation of the system has occurred or is
expected. Warnings indicate that system performance is degrading; related
actions include closer (e.g., more frequent) monitoring of the affected part of
the system or preparation for a cold migration to a backup VM. Info messages do
not require any action. We also consider a type "maintenance", which is not a
real fault but may trigger maintenance actions such as a reboot of the server or
replacement of a faulty but redundant hardware component.

Faults can be gathered by, e.g., enabling SNMP and installing open source tools
to catch and poll SNMP data. When using Zabbix, for example, an agent can also
be run on the hosts to catch other faults. In every case of failure, the
Administrator should be notified. The following table lists the high-level
faults that are considered within the scope of the Doctor project and require
immediate action by the Consumer.

+------------------+---------------------------------------------------------------------------------------------------------------------------+------------------+-------------------+------------------------------------------------------------------------------------------+----------------------------------------------------------------------+
| Service          | Fault                                                                                                                     | Severity         | How to detect?    | Comment                                                                                  | Action to recover                                                    |
+------------------+---------------------------------------------------------------------------------------------------------------------------+------------------+-------------------+------------------------------------------------------------------------------------------+----------------------------------------------------------------------+
| Compute Hardware | Processor/CPU failure, CPU condition not ok                                                                               | Critical         | Zabbix            |                                                                                          | Switch to hot standby                                                |
+                  +---------------------------------------------------------------------------------------------------------------------------+------------------+-------------------+------------------------------------------------------------------------------------------+----------------------------------------------------------------------+
|                  | Memory failure/Memory condition not ok                                                                                    | Critical         | Zabbix (IPMI)     |                                                                                          | Switch to hot standby                                                |
+                  +---------------------------------------------------------------------------------------------------------------------------+------------------+-------------------+------------------------------------------------------------------------------------------+----------------------------------------------------------------------+
|                  | Network card failure, e.g. network adapter connectivity lost                                                              | Critical         | Zabbix/Ceilometer |                                                                                          | Switch to hot standby                                                |
+                  +---------------------------------------------------------------------------------------------------------------------------+------------------+-------------------+------------------------------------------------------------------------------------------+----------------------------------------------------------------------+
|                  | Disk crash                                                                                                                | Info             | RAID monitoring   | Network storage is very redundant (e.g. RAID system) and can guarantee high availability | Inform OAM                                                           |
+                  +---------------------------------------------------------------------------------------------------------------------------+------------------+-------------------+------------------------------------------------------------------------------------------+----------------------------------------------------------------------+
|                  | Storage controller                                                                                                        | Critical         | Zabbix (IPMI)     |                                                                                          | Live migration if storage is still accessible; otherwise hot standby |
+                  +---------------------------------------------------------------------------------------------------------------------------+------------------+-------------------+------------------------------------------------------------------------------------------+----------------------------------------------------------------------+
|                  | PDU/power failure, power off, server reset                                                                                | Critical         | Zabbix/Ceilometer |                                                                                          | Switch to hot standby                                                |
+                  +---------------------------------------------------------------------------------------------------------------------------+------------------+-------------------+------------------------------------------------------------------------------------------+----------------------------------------------------------------------+
|                  | Power degradation, power redundancy lost, power threshold exceeded                                                        | Warning          | SNMP              |                                                                                          | Live migration                                                       |
+                  +---------------------------------------------------------------------------------------------------------------------------+------------------+-------------------+------------------------------------------------------------------------------------------+----------------------------------------------------------------------+
|                  | Chassis problem (e.g. fan degraded/failed, chassis power degraded), CPU fan problem, temperature/thermal condition not ok | Warning          | SNMP              |                                                                                          | Live migration                                                       |
+                  +---------------------------------------------------------------------------------------------------------------------------+------------------+-------------------+------------------------------------------------------------------------------------------+----------------------------------------------------------------------+
|                  | Mainboard failure                                                                                                         | Critical         | Zabbix (IPMI)     |                                                                                          | Switch to hot standby                                                |
+                  +---------------------------------------------------------------------------------------------------------------------------+------------------+-------------------+------------------------------------------------------------------------------------------+----------------------------------------------------------------------+
|                  | OS crash (e.g. kernel panic)                                                                                              | Critical         | Zabbix            |                                                                                          | Switch to hot standby                                                |
+------------------+---------------------------------------------------------------------------------------------------------------------------+------------------+-------------------+------------------------------------------------------------------------------------------+----------------------------------------------------------------------+
| Hypervisor       | System has restarted                                                                                                      | Critical         | Zabbix            |                                                                                          | Switch to hot standby                                                |
+                  +---------------------------------------------------------------------------------------------------------------------------+------------------+-------------------+------------------------------------------------------------------------------------------+----------------------------------------------------------------------+
|                  | Hypervisor failure                                                                                                        | Warning/Critical | Zabbix/Ceilometer |                                                                                          | Evacuation/switch to hot standby                                     |
+                  +---------------------------------------------------------------------------------------------------------------------------+------------------+-------------------+------------------------------------------------------------------------------------------+----------------------------------------------------------------------+
|                  | Zabbix/Ceilometer is unreachable                                                                                          | Warning          | ?                 |                                                                                          | Live migration                                                       |
+------------------+---------------------------------------------------------------------------------------------------------------------------+------------------+-------------------+------------------------------------------------------------------------------------------+----------------------------------------------------------------------+
| Network          | SDN/OpenFlow switch, controller degraded/failed                                                                           | Critical         | ?                 |                                                                                          | Switch to hot standby or reconfigure virtual network topology        |
+                  +---------------------------------------------------------------------------------------------------------------------------+------------------+-------------------+------------------------------------------------------------------------------------------+----------------------------------------------------------------------+
|                  | Hardware failure of physical switch/router                                                                                | Warning          | SNMP              | Redundancy of physical infrastructure is reduced or no longer available                  | Live migration if possible, otherwise evacuation                     |
+------------------+---------------------------------------------------------------------------------------------------------------------------+------------------+-------------------+------------------------------------------------------------------------------------------+----------------------------------------------------------------------+
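As an illustration only, the severity levels and a few rows of the table above
could be encoded as a small lookup structure, for example inside a monitoring
or notification component. This is a minimal sketch and not part of the Doctor
requirements: the names ``Severity``, ``FaultEntry``, ``FAULT_CATALOG``,
``requires_immediate_action`` and ``recovery_action``, as well as the catalog
keys, are hypothetical and chosen for this example. The rule that only critical
faults require immediate action follows the severity description in this annex.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    CRITICAL = "critical"
    WARNING = "warning"
    INFO = "info"
    MAINTENANCE = "maintenance"  # not a real fault, but may trigger maintenance


@dataclass(frozen=True)
class FaultEntry:
    service: str
    fault: str
    severity: Severity
    detector: str
    action: str


# A few rows copied from the table; the dictionary keys are hypothetical.
FAULT_CATALOG = {
    "cpu_failure": FaultEntry(
        "Compute Hardware", "Processor/CPU failure, CPU condition not ok",
        Severity.CRITICAL, "Zabbix", "Switch to hot standby"),
    "disk_crash": FaultEntry(
        "Compute Hardware", "Disk crash",
        Severity.INFO, "RAID monitoring", "Inform OAM"),
    "power_degradation": FaultEntry(
        "Compute Hardware", "Power degradation, power redundancy lost",
        Severity.WARNING, "SNMP", "Live migration"),
    "switch_hw_failure": FaultEntry(
        "Network", "Hardware failure of physical switch/router",
        Severity.WARNING, "SNMP",
        "Live migration if possible, otherwise evacuation"),
}


def requires_immediate_action(entry: FaultEntry) -> bool:
    """Critical faults require immediate action; other severities do not."""
    return entry.severity is Severity.CRITICAL


def recovery_action(key: str) -> str:
    """Look up the 'Action to recover' column for a catalog key."""
    return FAULT_CATALOG[key].action
```

For instance, ``recovery_action("power_degradation")`` returns
``"Live migration"``, and ``requires_immediate_action`` is true for the
``cpu_failure`` entry but false for ``disk_crash``.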