Revisit fault table 85/14685/7
authorYujun Zhang <zhang.yujunz@zte.com.cn>
Thu, 26 May 2016 03:36:12 +0000 (11:36 +0800)
committerYujun Zhang <zhang.yujunz@zte.com.cn>
Fri, 3 Jun 2016 01:01:06 +0000 (09:01 +0800)
- add storage fault (same content as compute)
- add example of mainboard error
- rename fault Hypervisor status not retrievable
- editorial fix
- complete "how to detect"

Change-Id: I9ec849014d18823f9a309396542e89a4e6da8b6c
Signed-off-by: Yujun Zhang <zhang.yujunz@zte.com.cn>
docs/requirements/07-annex.rst

index bf65ff7..8cb1961 100644 (file)
@@ -26,99 +26,100 @@ Administrator should be notified. The following tables provide a list of high
 level faults that are considered within the scope of the Doctor project
 requiring immediate action by the Consumer.
 
 level faults that are considered within the scope of the Doctor project
 requiring immediate action by the Consumer.
 
-**Compute Hardware**
+**Compute/Storage**
 
 
-+-------------------+----------+------------+-----------------+----------------+
-| Fault             | Severity | How to     | Comment         | Action to      |
-|                   |          | detect?    |                 | recover        |
-+===================+==========+============+=================+================+
-| Processor/CPU     | Critical | Zabbix     |                 | Switch to      |
-| failure, CPU      |          |            |                 | hot standby    |
-| condition not ok  |          |            |                 |                |
-+-------------------+----------+------------+-----------------+----------------+
-| Memory failure/   | Critical | Zabbix     |                 | Switch to      |
-| Memory condition  |          | (IPMI)     |                 | hot standby    |
-| not ok            |          |            |                 |                |
-+-------------------+----------+------------+-----------------+----------------+
-| Network card      | Critical | Zabbix/    |                 | Switch to      |
-| failure, e.g.     |          | Ceilometer |                 | hot standby    |
-| network adapter   |          |            |                 |                |
-| connectivity lost |          |            |                 |                |
-+-------------------+----------+------------+-----------------+----------------+
-| Disk crash        | Info     | RAID       | Network storage | Inform OAM     |
-|                   |          | monitoring | is very         |                |
-|                   |          |            | redundant (e.g. |                |
-|                   |          |            | RAID system)    |                |
-|                   |          |            | and can         |                |
-|                   |          |            | guarantee high  |                |
-|                   |          |            | availability    |                |
-+-------------------+----------+------------+-----------------+----------------+
-| Storage           | Critical | Zabbix     |                 | Live migration |
-| controller        |          | (IPMI)     |                 | if storage     |
-|                   |          |            |                 | is still       |
-|                   |          |            |                 | accessible;    |
-|                   |          |            |                 | otherwise hot  |
-|                   |          |            |                 | standby        |
-+-------------------+----------+------------+-----------------+----------------+
-| PDU/power         | Critical | Zabbix/    |                 | Switch to      |
-| failure, power    |          | Ceilometer |                 | hot standby    |
-| off, server reset |          |            |                 |                |
-+-------------------+----------+------------+-----------------+----------------+
-| Power             | Warning  | SNMP       |                 | Live migration |
-| degration, power  |          |            |                 |                |
-| redundancy lost,  |          |            |                 |                |
-| power threshold   |          |            |                 |                |
-| exceeded          |          |            |                 |                |
-+-------------------+----------+------------+-----------------+----------------+
-| Chassis problem   | Warning  | SNMP       |                 | Live migration |
-| (e.g. fan         |          |            |                 |                |
-| degraded/failed,  |          |            |                 |                |
-| chassis power     |          |            |                 |                |
-| degraded), CPU    |          |            |                 |                |
-| fan problem,      |          |            |                 |                |
-| temperature/      |          |            |                 |                |
-| thermal condition |          |            |                 |                |
-| not ok            |          |            |                 |                |
-+-------------------+----------+------------+-----------------+----------------+
-| Mainboard failure | Critical | Zabbix     |                 | Switch to      |
-|                   |          | (IPMI)     |                 | hot standby    |
-+-------------------+----------+------------+-----------------+----------------+
-| OS crash (e.g.    | Critical | Zabbix     |                 | Switch to      |
-| kernel panic)     |          |            |                 | hot standby    |
-+-------------------+----------+------------+-----------------+----------------+
++-------------------+----------+------------+-----------------+------------------+
+| Fault             | Severity | How to     | Comment         | Immediate action |
+|                   |          | detect?    |                 | to recover       |
++===================+==========+============+=================+==================+
+| Processor/CPU     | Critical | Zabbix     |                 | Switch to hot    |
+| failure, CPU      |          |            |                 | standby          |
+| condition not ok  |          |            |                 |                  |
++-------------------+----------+------------+-----------------+------------------+
+| Memory failure/   | Critical | Zabbix     |                 | Switch to        |
+| Memory condition  |          | (IPMI)     |                 | hot standby      |
+| not ok            |          |            |                 |                  |
++-------------------+----------+------------+-----------------+------------------+
+| Network card      | Critical | Zabbix/    |                 | Switch to        |
+| failure, e.g.     |          | Ceilometer |                 | hot standby      |
+| network adapter   |          |            |                 |                  |
+| connectivity lost |          |            |                 |                  |
++-------------------+----------+------------+-----------------+------------------+
+| Disk crash        | Info     | RAID       | Network storage | Inform OAM       |
+|                   |          | monitoring | is very         |                  |
+|                   |          |            | redundant (e.g. |                  |
+|                   |          |            | RAID system)    |                  |
+|                   |          |            | and can         |                  |
+|                   |          |            | guarantee high  |                  |
+|                   |          |            | availability    |                  |
++-------------------+----------+------------+-----------------+------------------+
+| Storage           | Critical | Zabbix     |                 | Live migration   |
+| controller        |          | (IPMI)     |                 | if storage       |
+|                   |          |            |                 | is still         |
+|                   |          |            |                 | accessible;      |
+|                   |          |            |                 | otherwise hot    |
+|                   |          |            |                 | standby          |
++-------------------+----------+------------+-----------------+------------------+
+| PDU/power         | Critical | Zabbix/    |                 | Switch to        |
+| failure, power    |          | Ceilometer |                 | hot standby      |
+| off, server reset |          |            |                 |                  |
++-------------------+----------+------------+-----------------+------------------+
+| Power             | Warning  | SNMP       |                 | Live migration   |
+| degration, power  |          |            |                 |                  |
+| redundancy lost,  |          |            |                 |                  |
+| power threshold   |          |            |                 |                  |
+| exceeded          |          |            |                 |                  |
++-------------------+----------+------------+-----------------+------------------+
+| Chassis problem   | Warning  | SNMP       |                 | Live migration   |
+| (e.g. fan         |          |            |                 |                  |
+| degraded/failed,  |          |            |                 |                  |
+| chassis power     |          |            |                 |                  |
+| degraded), CPU    |          |            |                 |                  |
+| fan problem,      |          |            |                 |                  |
+| temperature/      |          |            |                 |                  |
+| thermal condition |          |            |                 |                  |
+| not ok            |          |            |                 |                  |
++-------------------+----------+------------+-----------------+------------------+
+| Mainboard failure | Critical | Zabbix     | e.g. PCIe, SAS  | Switch to        |
+|                   |          | (IPMI)     | link failure    | hot standby      |
++-------------------+----------+------------+-----------------+------------------+
+| OS crash (e.g.    | Critical | Zabbix     |                 | Switch to        |
+| kernel panic)     |          |            |                 | hot standby      |
++-------------------+----------+------------+-----------------+------------------+
 
 **Hypervisor**
 
 
 **Hypervisor**
 
-+----------------+----------+------------+---------+-------------------+
-| Fault          | Severity | How to     | Comment | Action to         |
-|                |          | detect?    |         | recover           |
-+================+==========+============+=========+===================+
-| System has     | Critical | Zabbix     |         | Switch to         |
-| restarted      |          |            |         | hot standby       |
-+----------------+----------+------------+---------+-------------------+
-| Hypervisor     | Warning/ | Zabbix/    |         | Evacuation/switch |
-| failure        | Critical | Ceilometer |         | to hot standby    |
-+----------------+----------+------------+---------+-------------------+
-| Zabbix/        | Warning  | ?          |         | Live migration    |
-| Ceilometer     |          |            |         |                   |
-| is unreachable |          |            |         |                   |
-+----------------+----------+------------+---------+-------------------+
++----------------+----------+------------+-------------+-------------------+
+| Fault          | Severity | How to     | Comment     | Immediate action  |
+|                |          | detect?    |             | to recover        |
++================+==========+============+=============+===================+
+| System has     | Critical | Zabbix     |             | Switch to         |
+| restarted      |          |            |             | hot standby       |
++----------------+----------+------------+-------------+-------------------+
+| Hypervisor     | Warning/ | Zabbix/    |             | Evacuation/switch |
+| failure        | Critical | Ceilometer |             | to hot standby    |
++----------------+----------+------------+-------------+-------------------+
+| Hypervisor     | Warning  | Alarming   | Zabbix/     | Rebuild VM        |
+| status not     |          | service    | Ceilometer  |                   |
+| retrievable    |          |            | unreachable |                   |
+| after certain  |          |            |             |                   |
+| period         |          |            |             |                   |
++----------------+----------+------------+-------------+-------------------+
 
 **Network**
 
 
 **Network**
 
-
 +------------------+----------+---------+----------------+---------------------+
 +------------------+----------+---------+----------------+---------------------+
-| Fault            | Severity | How to  | Comment        | Action to           |
+| Fault            | Severity | How to  | Comment        | Immediate action to |
 |                  |          | detect? |                | recover             |
 +==================+==========+=========+================+=====================+
 |                  |          | detect? |                | recover             |
 +==================+==========+=========+================+=====================+
-| SDN/OpenFlow     | Critical | ?       |                | Switch to           |
-| switch,          |          |         |                | hot standby         |
+| SDN/OpenFlow     | Critical | Ceilo-  |                | Switch to           |
+| switch,          |          | meter   |                | hot standby         |
 | controller       |          |         |                | or reconfigure      |
 | degraded/failed  |          |         |                | virtual network     |
 |                  |          |         |                | topology            |
 +------------------+----------+---------+----------------+---------------------+
 | Hardware failure | Warning  | SNMP    | Redundancy of  | Live migration if   |
 | controller       |          |         |                | or reconfigure      |
 | degraded/failed  |          |         |                | virtual network     |
 |                  |          |         |                | topology            |
 +------------------+----------+---------+----------------+---------------------+
 | Hardware failure | Warning  | SNMP    | Redundancy of  | Live migration if   |
-| of physical      |          |         | physical       | possible  otherwise |
+| of physical      |          |         | physical       | possible otherwise  |
 | switch/router    |          |         | infrastructure | evacuation          |
 |                  |          |         | is reduced or  |                     |
 |                  |          |         | no longer      |                     |
 | switch/router    |          |         | infrastructure | evacuation          |
 |                  |          |         | is reduced or  |                     |
 |                  |          |         | no longer      |                     |