.. This work is licensed under a Creative Commons Attribution 4.0 International License.
.. http://creativecommons.org/licenses/by/4.0

Annex: NFVI Faults
=================================================

Faults in the listed elements must be reported to the Consumer immediately so
that it can take immediate action, such as live migration or a switch to a hot
standby entity. In addition, the Administrator of the host should trigger a
maintenance action, e.g., rebooting the server or replacing a defective hardware
element.

Faults can be of different severity, i.e., critical, warning, or
info. Critical faults require immediate action because a severe degradation of
the system has happened or is expected. Warnings indicate that system
performance is degrading; related actions include closer (e.g., more frequent)
monitoring of that part of the system or preparation for a cold migration to a
backup VM. Info messages do not require any action. We also consider a type
"maintenance", which is not a real fault but may trigger maintenance actions
such as a reboot of the server or replacement of faulty but redundant hardware.
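
The severity types described above can be sketched as a small lookup. This is
only an illustration under stated assumptions: the names ``Severity``,
``DEFAULT_RESPONSE``, ``requires_immediate_action`` and ``default_response``
are hypothetical, and the response strings merely paraphrase the paragraph
above. None of this is defined by the Doctor project.

```python
from enum import Enum


class Severity(Enum):
    """Fault types named in this annex (identifiers are hypothetical)."""
    CRITICAL = "critical"
    WARNING = "warning"
    INFO = "info"
    MAINTENANCE = "maintenance"  # not a real fault, but may trigger maintenance


# Assumed mapping that paraphrases the text above; not a normative table.
DEFAULT_RESPONSE = {
    Severity.CRITICAL: "act immediately",
    Severity.WARNING: "monitor more closely or prepare a cold migration",
    Severity.INFO: "no action required",
    Severity.MAINTENANCE: "schedule maintenance (e.g. reboot, replace hardware)",
}


def requires_immediate_action(severity: Severity) -> bool:
    """In this sketch, only critical faults demand immediate action."""
    return severity is Severity.CRITICAL


def default_response(severity: Severity) -> str:
    """Return the assumed default response for a severity type."""
    return DEFAULT_RESPONSE[severity]
```

A monitoring component could, for example, call
``requires_immediate_action(Severity.WARNING)`` to decide whether a
notification may be batched rather than sent at once.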

Faults can be gathered by, e.g., enabling SNMP and installing open source
tools to catch SNMP traps and poll SNMP agents. With Zabbix, for example, an
agent can also run on the hosts to catch any other fault. In any case of
failure, the Administrator should be notified. The following tables list the
high-level faults that are considered within the scope of the Doctor project
and require immediate action by the Consumer.

**Compute Hardware**

+-------------------+----------+------------+-----------------+----------------+
| Fault             | Severity | How to     | Comment         | Action to      |
|                   |          | detect?    |                 | recover        |
+===================+==========+============+=================+================+
| Processor/CPU     | Critical | Zabbix     |                 | Switch to      |
| failure, CPU      |          |            |                 | hot standby    |
| condition not ok  |          |            |                 |                |
+-------------------+----------+------------+-----------------+----------------+
| Memory failure/   | Critical | Zabbix     |                 | Switch to      |
| Memory condition  |          | (IPMI)     |                 | hot standby    |
| not ok            |          |            |                 |                |
+-------------------+----------+------------+-----------------+----------------+
| Network card      | Critical | Zabbix/    |                 | Switch to      |
| failure, e.g.     |          | Ceilometer |                 | hot standby    |
| network adapter   |          |            |                 |                |
| connectivity lost |          |            |                 |                |
+-------------------+----------+------------+-----------------+----------------+
| Disk crash        | Info     | RAID       | Network storage | Inform OAM     |
|                   |          | monitoring | is very         |                |
|                   |          |            | redundant (e.g. |                |
|                   |          |            | RAID system)    |                |
|                   |          |            | and can         |                |
|                   |          |            | guarantee high  |                |
|                   |          |            | availability    |                |
+-------------------+----------+------------+-----------------+----------------+
| Storage           | Critical | Zabbix     |                 | Live migration |
| controller        |          | (IPMI)     |                 | if storage     |
|                   |          |            |                 | is still       |
|                   |          |            |                 | accessible;    |
|                   |          |            |                 | otherwise hot  |
|                   |          |            |                 | standby        |
+-------------------+----------+------------+-----------------+----------------+
| PDU/power         | Critical | Zabbix/    |                 | Switch to      |
| failure, power    |          | Ceilometer |                 | hot standby    |
| off, server reset |          |            |                 |                |
+-------------------+----------+------------+-----------------+----------------+
| Power             | Warning  | SNMP       |                 | Live migration |
| degradation,      |          |            |                 |                |
| power redundancy  |          |            |                 |                |
| lost, power       |          |            |                 |                |
| threshold         |          |            |                 |                |
| exceeded          |          |            |                 |                |
+-------------------+----------+------------+-----------------+----------------+
| Chassis problem   | Warning  | SNMP       |                 | Live migration |
| (e.g. fan         |          |            |                 |                |
| degraded/failed,  |          |            |                 |                |
| chassis power     |          |            |                 |                |
| degraded), CPU    |          |            |                 |                |
| fan problem,      |          |            |                 |                |
| temperature/      |          |            |                 |                |
| thermal condition |          |            |                 |                |
| not ok            |          |            |                 |                |
+-------------------+----------+------------+-----------------+----------------+
| Mainboard failure | Critical | Zabbix     |                 | Switch to      |
|                   |          | (IPMI)     |                 | hot standby    |
+-------------------+----------+------------+-----------------+----------------+
| OS crash (e.g.    | Critical | Zabbix     |                 | Switch to      |
| kernel panic)     |          |            |                 | hot standby    |
+-------------------+----------+------------+-----------------+----------------+

**Hypervisor**

+----------------+----------+------------+---------+-------------------+
| Fault          | Severity | How to     | Comment | Action to         |
|                |          | detect?    |         | recover           |
+================+==========+============+=========+===================+
| System has     | Critical | Zabbix     |         | Switch to         |
| restarted      |          |            |         | hot standby       |
+----------------+----------+------------+---------+-------------------+
| Hypervisor     | Warning/ | Zabbix/    |         | Evacuation/switch |
| failure        | Critical | Ceilometer |         | to hot standby    |
+----------------+----------+------------+---------+-------------------+
| Zabbix/        | Warning  | ?          |         | Live migration    |
| Ceilometer     |          |            |         |                   |
| is unreachable |          |            |         |                   |
+----------------+----------+------------+---------+-------------------+

**Network**

+------------------+----------+---------+----------------+---------------------+
| Fault            | Severity | How to  | Comment        | Action to           |
|                  |          | detect? |                | recover             |
+==================+==========+=========+================+=====================+
| SDN/OpenFlow     | Critical | ?       |                | Switch to           |
| switch,          |          |         |                | hot standby         |
| controller       |          |         |                | or reconfigure      |
| degraded/failed  |          |         |                | virtual network     |
|                  |          |         |                | topology            |
+------------------+----------+---------+----------------+---------------------+
| Hardware failure | Warning  | SNMP    | Redundancy of  | Live migration if   |
| of physical      |          |         | physical       | possible, otherwise |
| switch/router    |          |         | infrastructure | evacuation          |
|                  |          |         | is reduced or  |                     |
|                  |          |         | no longer      |                     |
|                  |          |         | available      |                     |
+------------------+----------+---------+----------------+---------------------+
125 |                  |          |         | available      |                     |
126 +------------------+----------+---------+----------------+---------------------+