From: Yujun Zhang
Date: Thu, 26 May 2016 03:36:12 +0000 (+0800)
Subject: Revisit fault table
X-Git-Tag: colorado.1.0~64^2
X-Git-Url: https://gerrit.opnfv.org/gerrit/gitweb?p=doctor.git;a=commitdiff_plain;h=b4180a43b1350e02ddc124dbaaa8b3885e78a26e

Revisit fault table

- add storage fault (same content as compute)
- add example of mainboard error
- rename fault Hypervisor status not retrievable
- editorial fix
- complete "how to detect"

Change-Id: I9ec849014d18823f9a309396542e89a4e6da8b6c
Signed-off-by: Yujun Zhang
---

diff --git a/docs/requirements/07-annex.rst b/docs/requirements/07-annex.rst
index bf65ff7c..8cb19612 100644
--- a/docs/requirements/07-annex.rst
+++ b/docs/requirements/07-annex.rst
@@ -26,99 +26,100 @@ Administrator should be notified.
 The following tables provide a list of high level faults that are considered
 within the scope of the Doctor project requiring immediate action by the
 Consumer.
 
-**Compute Hardware**
+**Compute/Storage**
 
-+-------------------+----------+------------+-----------------+----------------+
-| Fault             | Severity | How to     | Comment         | Action to      |
-|                   |          | detect?    |                 | recover        |
-+===================+==========+============+=================+================+
-| Processor/CPU     | Critical | Zabbix     |                 | Switch to      |
-| failure, CPU      |          |            |                 | hot standby    |
-| condition not ok  |          |            |                 |                |
-+-------------------+----------+------------+-----------------+----------------+
-| Memory failure/   | Critical | Zabbix     |                 | Switch to      |
-| Memory condition  |          | (IPMI)     |                 | hot standby    |
-| not ok            |          |            |                 |                |
-+-------------------+----------+------------+-----------------+----------------+
-| Network card      | Critical | Zabbix/    |                 | Switch to      |
-| failure, e.g.     |          | Ceilometer |                 | hot standby    |
-| network adapter   |          |            |                 |                |
-| connectivity lost |          |            |                 |                |
-+-------------------+----------+------------+-----------------+----------------+
-| Disk crash        | Info     | RAID       | Network storage | Inform OAM     |
-|                   |          | monitoring | is very         |                |
-|                   |          |            | redundant (e.g. |                |
-|                   |          |            | RAID system)    |                |
-|                   |          |            | and can         |                |
-|                   |          |            | guarantee high  |                |
-|                   |          |            | availability    |                |
-+-------------------+----------+------------+-----------------+----------------+
-| Storage           | Critical | Zabbix     |                 | Live migration |
-| controller        |          | (IPMI)     |                 | if storage     |
-|                   |          |            |                 | is still       |
-|                   |          |            |                 | accessible;    |
-|                   |          |            |                 | otherwise hot  |
-|                   |          |            |                 | standby        |
-+-------------------+----------+------------+-----------------+----------------+
-| PDU/power         | Critical | Zabbix/    |                 | Switch to      |
-| failure, power    |          | Ceilometer |                 | hot standby    |
-| off, server reset |          |            |                 |                |
-+-------------------+----------+------------+-----------------+----------------+
-| Power             | Warning  | SNMP       |                 | Live migration |
-| degration, power  |          |            |                 |                |
-| redundancy lost,  |          |            |                 |                |
-| power threshold   |          |            |                 |                |
-| exceeded          |          |            |                 |                |
-+-------------------+----------+------------+-----------------+----------------+
-| Chassis problem   | Warning  | SNMP       |                 | Live migration |
-| (e.g. fan         |          |            |                 |                |
-| degraded/failed,  |          |            |                 |                |
-| chassis power     |          |            |                 |                |
-| degraded), CPU    |          |            |                 |                |
-| fan problem,      |          |            |                 |                |
-| temperature/      |          |            |                 |                |
-| thermal condition |          |            |                 |                |
-| not ok            |          |            |                 |                |
-+-------------------+----------+------------+-----------------+----------------+
-| Mainboard failure | Critical | Zabbix     |                 | Switch to      |
-|                   |          | (IPMI)     |                 | hot standby    |
-+-------------------+----------+------------+-----------------+----------------+
-| OS crash (e.g.    | Critical | Zabbix     |                 | Switch to      |
-| kernel panic)     |          |            |                 | hot standby    |
-+-------------------+----------+------------+-----------------+----------------+
++-------------------+----------+------------+-----------------+------------------+
+| Fault             | Severity | How to     | Comment         | Immediate action |
+|                   |          | detect?    |                 | to recover       |
++===================+==========+============+=================+==================+
+| Processor/CPU     | Critical | Zabbix     |                 | Switch to hot    |
+| failure, CPU      |          |            |                 | standby          |
+| condition not ok  |          |            |                 |                  |
++-------------------+----------+------------+-----------------+------------------+
+| Memory failure/   | Critical | Zabbix     |                 | Switch to        |
+| Memory condition  |          | (IPMI)     |                 | hot standby      |
+| not ok            |          |            |                 |                  |
++-------------------+----------+------------+-----------------+------------------+
+| Network card      | Critical | Zabbix/    |                 | Switch to        |
+| failure, e.g.     |          | Ceilometer |                 | hot standby      |
+| network adapter   |          |            |                 |                  |
+| connectivity lost |          |            |                 |                  |
++-------------------+----------+------------+-----------------+------------------+
+| Disk crash        | Info     | RAID       | Network storage | Inform OAM       |
+|                   |          | monitoring | is very         |                  |
+|                   |          |            | redundant (e.g. |                  |
+|                   |          |            | RAID system)    |                  |
+|                   |          |            | and can         |                  |
+|                   |          |            | guarantee high  |                  |
+|                   |          |            | availability    |                  |
++-------------------+----------+------------+-----------------+------------------+
+| Storage           | Critical | Zabbix     |                 | Live migration   |
+| controller        |          | (IPMI)     |                 | if storage       |
+|                   |          |            |                 | is still         |
+|                   |          |            |                 | accessible;      |
+|                   |          |            |                 | otherwise hot    |
+|                   |          |            |                 | standby          |
++-------------------+----------+------------+-----------------+------------------+
+| PDU/power         | Critical | Zabbix/    |                 | Switch to        |
+| failure, power    |          | Ceilometer |                 | hot standby      |
+| off, server reset |          |            |                 |                  |
++-------------------+----------+------------+-----------------+------------------+
+| Power             | Warning  | SNMP       |                 | Live migration   |
+| degration, power  |          |            |                 |                  |
+| redundancy lost,  |          |            |                 |                  |
+| power threshold   |          |            |                 |                  |
+| exceeded          |          |            |                 |                  |
++-------------------+----------+------------+-----------------+------------------+
+| Chassis problem   | Warning  | SNMP       |                 | Live migration   |
+| (e.g. fan         |          |            |                 |                  |
+| degraded/failed,  |          |            |                 |                  |
+| chassis power     |          |            |                 |                  |
+| degraded), CPU    |          |            |                 |                  |
+| fan problem,      |          |            |                 |                  |
+| temperature/      |          |            |                 |                  |
+| thermal condition |          |            |                 |                  |
+| not ok            |          |            |                 |                  |
++-------------------+----------+------------+-----------------+------------------+
+| Mainboard failure | Critical | Zabbix     | e.g. PCIe, SAS  | Switch to        |
+|                   |          | (IPMI)     | link failure    | hot standby      |
++-------------------+----------+------------+-----------------+------------------+
+| OS crash (e.g.    | Critical | Zabbix     |                 | Switch to        |
+| kernel panic)     |          |            |                 | hot standby      |
++-------------------+----------+------------+-----------------+------------------+
 
 **Hypervisor**
 
-+----------------+----------+------------+---------+-------------------+
-| Fault          | Severity | How to     | Comment | Action to         |
-|                |          | detect?    |         | recover           |
-+================+==========+============+=========+===================+
-| System has     | Critical | Zabbix     |         | Switch to         |
-| restarted      |          |            |         | hot standby       |
-+----------------+----------+------------+---------+-------------------+
-| Hypervisor     | Warning/ | Zabbix/    |         | Evacuation/switch |
-| failure        | Critical | Ceilometer |         | to hot standby    |
-+----------------+----------+------------+---------+-------------------+
-| Zabbix/        | Warning  | ?          |         | Live migration    |
-| Ceilometer     |          |            |         |                   |
-| is unreachable |          |            |         |                   |
-+----------------+----------+------------+---------+-------------------+
++----------------+----------+------------+-------------+-------------------+
+| Fault          | Severity | How to     | Comment     | Immediate action  |
+|                |          | detect?    |             | to recover        |
++================+==========+============+=============+===================+
+| System has     | Critical | Zabbix     |             | Switch to         |
+| restarted      |          |            |             | hot standby       |
++----------------+----------+------------+-------------+-------------------+
+| Hypervisor     | Warning/ | Zabbix/    |             | Evacuation/switch |
+| failure        | Critical | Ceilometer |             | to hot standby    |
++----------------+----------+------------+-------------+-------------------+
+| Hypervisor     | Warning  | Alarming   | Zabbix/     | Rebuild VM        |
+| status not     |          | service    | Ceilometer  |                   |
+| retrievable    |          |            | unreachable |                   |
+| after certain  |          |            |             |                   |
+| period         |          |            |             |                   |
++----------------+----------+------------+-------------+-------------------+
 
 **Network**
 
-
 +------------------+----------+---------+----------------+---------------------+
-| Fault            | Severity | How to  | Comment        | Action to           |
+| Fault            | Severity | How to  | Comment        | Immediate action to |
 |                  |          | detect? |                | recover             |
 +==================+==========+=========+================+=====================+
-| SDN/OpenFlow     | Critical | ?       |                | Switch to           |
-| switch,          |          |         |                | hot standby         |
+| SDN/OpenFlow     | Critical | Ceilo-  |                | Switch to           |
+| switch,          |          | meter   |                | hot standby         |
 | controller       |          |         |                | or reconfigure      |
 | degraded/failed  |          |         |                | virtual network     |
 |                  |          |         |                | topology            |
 +------------------+----------+---------+----------------+---------------------+
 | Hardware failure | Warning  | SNMP    | Redundancy of  | Live migration if   |
-| of physical      |          |         | physical       | possible otherwise  |
+| of physical      |          |         | physical       | possible otherwise  |
 | switch/router    |          |         | infrastructure | evacuation          |
 |                  |          |         | is reduced or  |                     |
 |                  |          |         | no longer      |                     |