Fix an issue with the latest build as proposed by Aric in
[doctor.git] / docs / requirements / 07-annex.rst
index dbe41bd..c3a7899 100644 (file)
@@ -1,3 +1,8 @@
+.. This work is licensed under a Creative Commons Attribution 4.0 International License.
+.. http://creativecommons.org/licenses/by/4.0
+
+.. _nfvi_faults:
+
 Annex: NFVI Faults
 =================================================
 
@@ -19,46 +24,106 @@ like a re-boot of the server or replacement of a faulty, but redundant HW.
 Faults can be gathered by, e.g., enabling SNMP and installing some open source
 tools to catch and poll SNMP. When using for example Zabbix one can also put an
 agent running on the hosts to catch any other fault. In any case of failure, the
-Administrator should be notified. The following table provides a list of high
+Administrator should be notified. The following tables provide a list of high
 level faults that are considered within the scope of the Doctor project
 requiring immediate action by the Consumer.
 
+**Compute/Storage**
+
++-------------------+----------+------------+-----------------+------------------+
+| Fault             | Severity | How to     | Comment         | Immediate action |
+|                   |          | detect?    |                 | to recover       |
++===================+==========+============+=================+==================+
+| Processor/CPU     | Critical | Zabbix     |                 | Switch to hot    |
+| failure, CPU      |          |            |                 | standby          |
+| condition not ok  |          |            |                 |                  |
++-------------------+----------+------------+-----------------+------------------+
+| Memory failure/   | Critical | Zabbix     |                 | Switch to hot    |
+| Memory condition  |          | (IPMI)     |                 | standby          |
+| not ok            |          |            |                 |                  |
++-------------------+----------+------------+-----------------+------------------+
+| Network card      | Critical | Zabbix/    |                 | Switch to hot    |
+| failure, e.g.     |          | Ceilometer |                 | standby          |
+| network adapter   |          |            |                 |                  |
+| connectivity lost |          |            |                 |                  |
++-------------------+----------+------------+-----------------+------------------+
+| Disk crash        | Info     | RAID       | Network storage | Inform OAM       |
+|                   |          | monitoring | is very         |                  |
+|                   |          |            | redundant (e.g. |                  |
+|                   |          |            | RAID system)    |                  |
+|                   |          |            | and can         |                  |
+|                   |          |            | guarantee high  |                  |
+|                   |          |            | availability    |                  |
++-------------------+----------+------------+-----------------+------------------+
+| Storage           | Critical | Zabbix     |                 | Live migration   |
+| controller        |          | (IPMI)     |                 | if storage       |
+|                   |          |            |                 | is still         |
+|                   |          |            |                 | accessible;      |
+|                   |          |            |                 | otherwise hot    |
+|                   |          |            |                 | standby          |
++-------------------+----------+------------+-----------------+------------------+
+| PDU/power         | Critical | Zabbix/    |                 | Switch to hot    |
+| failure, power    |          | Ceilometer |                 | standby          |
+| off, server reset |          |            |                 |                  |
++-------------------+----------+------------+-----------------+------------------+
+| Power             | Warning  | SNMP       |                 | Live migration   |
+| degradation, power|          |            |                 |                  |
+| redundancy lost,  |          |            |                 |                  |
+| power threshold   |          |            |                 |                  |
+| exceeded          |          |            |                 |                  |
++-------------------+----------+------------+-----------------+------------------+
+| Chassis problem   | Warning  | SNMP       |                 | Live migration   |
+| (e.g. fan         |          |            |                 |                  |
+| degraded/failed,  |          |            |                 |                  |
+| chassis power     |          |            |                 |                  |
+| degraded), CPU    |          |            |                 |                  |
+| fan problem,      |          |            |                 |                  |
+| temperature/      |          |            |                 |                  |
+| thermal condition |          |            |                 |                  |
+| not ok            |          |            |                 |                  |
++-------------------+----------+------------+-----------------+------------------+
+| Mainboard failure | Critical | Zabbix     | e.g. PCIe, SAS  | Switch to hot    |
+|                   |          | (IPMI)     | link failure    | standby          |
++-------------------+----------+------------+-----------------+------------------+
+| OS crash (e.g.    | Critical | Zabbix     |                 | Switch to hot    |
+| kernel panic)     |          |            |                 | standby          |
++-------------------+----------+------------+-----------------+------------------+
 
+**Hypervisor**
 
-+------------------+---------------------------------------------------------------------------------------------------------------------------+------------------+-------------------+------------------------------------------------------------------------------------------+----------------------------------------------------------------------+
-| Service          | Fault                                                                                                                     | Severity         | How to detect?    | Comment                                                                                  | Action to recover                                                    |
-+------------------+---------------------------------------------------------------------------------------------------------------------------+------------------+-------------------+------------------------------------------------------------------------------------------+----------------------------------------------------------------------+
-| Compute Hardware | Processor/CPU failure, CPU condition not ok                                                                               | Critical         | Zabbix            |                                                                                          | Switch to hot standby                                                |
-+                  +---------------------------------------------------------------------------------------------------------------------------+------------------+-------------------+------------------------------------------------------------------------------------------+----------------------------------------------------------------------+
-|                  | Memory failure/Memory condition not ok                                                                                    | Critical         | Zabbix (IPMI)     |                                                                                          | Switch to hot standby                                                |
-+                  +---------------------------------------------------------------------------------------------------------------------------+------------------+-------------------+------------------------------------------------------------------------------------------+----------------------------------------------------------------------+
-|                  | Network card failure, e.g. network adapter connectivity lost                                                              | Critical         | Zabbix/Ceilometer |                                                                                          | Switch to hot standby                                                |
-+                  +---------------------------------------------------------------------------------------------------------------------------+------------------+-------------------+------------------------------------------------------------------------------------------+----------------------------------------------------------------------+
-|                  | Disk crash                                                                                                                | Info             | RAID monitoring   | Network storage is very redundant (e.g. RAID system) and can guarantee high availability | Inform OAM                                                           |
-+                  +---------------------------------------------------------------------------------------------------------------------------+------------------+-------------------+------------------------------------------------------------------------------------------+----------------------------------------------------------------------+
-|                  | Storage controller                                                                                                        | Critical         | Zabbix (IPMI)     |                                                                                          | Live migration if storage is still accessible; otherwise hot standby |
-+                  +---------------------------------------------------------------------------------------------------------------------------+------------------+-------------------+------------------------------------------------------------------------------------------+----------------------------------------------------------------------+
-|                  | PDU/power failure, power off, server reset                                                                                | Critical         | Zabbix/Ceilometer |                                                                                          | Switch to hot standby                                                |
-+                  +---------------------------------------------------------------------------------------------------------------------------+------------------+-------------------+------------------------------------------------------------------------------------------+----------------------------------------------------------------------+
-|                  | Power degradation, power redundancy lost, power threshold exceeded                                                        | Warning          | SNMP              |                                                                                          | Live migration                                                       |
-+                  +---------------------------------------------------------------------------------------------------------------------------+------------------+-------------------+------------------------------------------------------------------------------------------+----------------------------------------------------------------------+
-|                  | Chassis problem (.e.g fan degraded/failed, chassis power degraded), CPU fan problem, temperature/thermal condition not ok | Warning          | SNMP              |                                                                                          | Live migration                                                       |
-+                  +---------------------------------------------------------------------------------------------------------------------------+------------------+-------------------+------------------------------------------------------------------------------------------+----------------------------------------------------------------------+
-|                  | Mainboard failure                                                                                                         | Critical         | Zabbix (IPMI)     |                                                                                          | Switch to hot standby                                                |
-+                  +---------------------------------------------------------------------------------------------------------------------------+------------------+-------------------+------------------------------------------------------------------------------------------+----------------------------------------------------------------------+
-|                  | OS crash (e.g. kernel panic)                                                                                              | Critical         | Zabbix            |                                                                                          | Switch to hot standby                                                |
-+------------------+---------------------------------------------------------------------------------------------------------------------------+------------------+-------------------+------------------------------------------------------------------------------------------+----------------------------------------------------------------------+
-| Hypervisor       | System has restarted                                                                                                      | Critical         | Zabbix            |                                                                                          | Switch to hot standby                                                |
-+                  +---------------------------------------------------------------------------------------------------------------------------+------------------+-------------------+------------------------------------------------------------------------------------------+----------------------------------------------------------------------+
-|                  | Hypervisor failure                                                                                                        | Warning/Critical | Zabbix/Ceilometer |                                                                                          | Evacuation/switch to hot standby                                     |
-+                  +---------------------------------------------------------------------------------------------------------------------------+------------------+-------------------+------------------------------------------------------------------------------------------+----------------------------------------------------------------------+
-|                  | Zabbix/Ceilometer is unreachable                                                                                          | Warning          | ?                 |                                                                                          | Live migration                                                       |
-+------------------+---------------------------------------------------------------------------------------------------------------------------+------------------+-------------------+------------------------------------------------------------------------------------------+----------------------------------------------------------------------+
-| Network          | SDN/OpenFlow switch, controller degraded/failed                                                                           | Critical         | ?                 |                                                                                          | Switch to hot standby or reconfigure virtual network topology        |
-+                  +---------------------------------------------------------------------------------------------------------------------------+------------------+-------------------+------------------------------------------------------------------------------------------+----------------------------------------------------------------------+
-|                  | Hardware failure of physical switch/router                                                                                | Warning          | SNMP              | Redundancy of physical infrastructure is reduced or no longer available                  | Live migration if possible, otherwise evacuation                     |
-+------------------+---------------------------------------------------------------------------------------------------------------------------+------------------+-------------------+------------------------------------------------------------------------------------------+----------------------------------------------------------------------+
++----------------+----------+------------+-------------+-------------------+
+| Fault          | Severity | How to     | Comment     | Immediate action  |
+|                |          | detect?    |             | to recover        |
++================+==========+============+=============+===================+
+| System has     | Critical | Zabbix     |             | Switch to hot     |
+| restarted      |          |            |             | standby           |
++----------------+----------+------------+-------------+-------------------+
+| Hypervisor     | Warning/ | Zabbix/    |             | Evacuation/switch |
+| failure        | Critical | Ceilometer |             | to hot standby    |
++----------------+----------+------------+-------------+-------------------+
+| Hypervisor     | Warning  | Alarming   | Zabbix/     | Rebuild VM        |
+| status not     |          | service    | Ceilometer  |                   |
+| retrievable    |          |            | unreachable |                   |
+| after certain  |          |            |             |                   |
+| period         |          |            |             |                   |
++----------------+----------+------------+-------------+-------------------+
 
-..
- vim: set tabstop=4 expandtab textwidth=80:
+**Network**
 
++------------------+----------+---------+----------------+---------------------+
+| Fault            | Severity | How to  | Comment        | Immediate action to |
+|                  |          | detect? |                | recover             |
++==================+==========+=========+================+=====================+
+| SDN/OpenFlow     | Critical | Ceilo-  |                | Switch to           |
+| switch,          |          | meter   |                | hot standby         |
+| controller       |          |         |                | or reconfigure      |
+| degraded/failed  |          |         |                | virtual network     |
+|                  |          |         |                | topology            |
++------------------+----------+---------+----------------+---------------------+
+| Hardware failure | Warning  | SNMP    | Redundancy of  | Live migration if   |
+| of physical      |          |         | physical       | possible, otherwise |
+| switch/router    |          |         | infrastructure | evacuation          |
+|                  |          |         | is reduced or  |                     |
+|                  |          |         | no longer      |                     |
+|                  |          |         | available      |                     |
++------------------+----------+---------+----------------+---------------------+
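
Note (not part of the patch): the fault tables above map each fault to a severity, a detection source and an immediate recovery action. A minimal sketch of how a consumer might encode a few of those rows as a lookup is shown below. The keys, the `FaultPolicy` type and the `immediate_action` helper are hypothetical names chosen for illustration; only the severity, detector and action strings are taken from the tables.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FaultPolicy:
    """One table row: severity, detection source, immediate action."""
    severity: str
    detector: str
    action: str


# Hypothetical keys; the values are copied from the tables in this annex.
FAULT_POLICIES = {
    "cpu_failure": FaultPolicy("Critical", "Zabbix", "Switch to hot standby"),
    "disk_crash": FaultPolicy("Info", "RAID monitoring", "Inform OAM"),
    "power_degradation": FaultPolicy("Warning", "SNMP", "Live migration"),
    "hypervisor_status_not_retrievable": FaultPolicy(
        "Warning", "Alarming service", "Rebuild VM"
    ),
    "physical_switch_failure": FaultPolicy(
        "Warning", "SNMP", "Live migration if possible, otherwise evacuation"
    ),
}


def immediate_action(fault_key: str) -> str:
    """Return the immediate recovery action listed for a known fault."""
    try:
        return FAULT_POLICIES[fault_key].action
    except KeyError:
        # Faults outside the tables are not handled by this sketch.
        raise LookupError(f"no policy defined for fault: {fault_key}") from None


if __name__ == "__main__":
    for key in ("cpu_failure", "disk_crash"):
        policy = FAULT_POLICIES[key]
        print(f"{key}: {policy.severity} -> {immediate_action(key)}")
```

Conditional rows, such as the storage controller entry (live migration only if storage is still accessible), would need an extra input describing the current state and are left out of this sketch.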