Merge "Fencing requirement."
[doctor.git] / requirements / 03-architecture.rst
index 984f254..9b618e0 100644 (file)
@@ -47,7 +47,7 @@ Architecture Overview
 ---------------------
 
 NFV and the Cloud platform provide virtual resources and related control
-functionality to users and administrators. :num:`Figure #figure3` shows the high
+functionality to users and administrators. :numref:`figure3` shows the high
 level architecture of NFV focusing on the NFVI, i.e., the virtualized
 infrastructure. The NFVI provides virtual resources, such as virtual machines
 (VM) and virtual networks. Those virtual resources are used to run applications,
@@ -79,12 +79,10 @@ applications (e.g., MME, S/P-GW) and the Network Services:
 
 The time interval between the instant that an event is detected by the
 monitoring system and the Consumer notification of unavailable resources shall
-be < 1 second (e.g., Step 1 to Step 4 in :num:`Figure #figure4` and :num:`Figure
-#figure5`).
-
-.. _figure3:
+be < 1 second (e.g., Step 1 to Step 4 in :numref:`figure4` and :numref:`figure5`).
 
 .. figure:: images/figure3.png
+   :name: figure3
    :width: 100%
 
    High level architecture
@@ -128,9 +126,12 @@ affected by failures on the physical resources under them. Unavailability of a
 virtualized resource is determined by referring to the mapping of physical and
 virtualized resources.
 
-The relation from physical resources to virtualized resources shall be
-configurable, as the cause of unavailability of virtualized resources can be
-different in technologies and policies of deployment.
+VIM shall allow configuration of fault correlation between physical and
+virtual resources. VIM shall support correlating faults:
+
+* between a physical resource and another physical resource
+* between a physical resource and a virtual resource
+* between a virtual resource and another virtual resource
 
 Failure aggregation is also required in this feature, e.g., a user may request
 to be only notified if failures on more than two standby VMs in an (N+M)
@@ -158,15 +159,19 @@ would lead to heavy signaling traffic. Thus, a publication/subscription
 messaging model is better suited for these notifications, as notifications are
 only sent to subscribed consumers.
 
-Note: the VIM should only accept individual notification URLs for each resource
-by its owner or administrator.
+Notifications will be sent out according to the configuration provided by the
+consumer. The configuration includes one or more endpoints, so that the
+consumer can specify multiple targets for a notification subscription and
+various receiver functions can consume the notification message.
+Also, the conditions for notifications shall be configurable, such that
+the consumer can set corresponding policies, e.g. whether it wants to receive
+fault notifications or not.
 
+Note: the VIM should only accept notification subscriptions for each resource
+by its owner or administrator.
 Notifications to the Consumer about the unavailability of virtualized
 resources will include a description of the fault, preferably with sufficient
-abstraction rather than detailed physical fault information. Flexibility in
-notifications is important. For example, the receiver function in the
-consumer-side implementation could have different schema, location, and policies
-(e.g. receive or not, aggregate events with the same cause, etc.).
+abstraction rather than detailed physical fault information.
 
 .. _fencing:
 
@@ -178,16 +183,25 @@ Without fencing -- when the perceived disconnection is due to some transient
 or partial failure -- the evacuation might lead into two identical instances
 running together and having a dangerous conflict.
 
-There is a cross-project effort in OpenStack ongoing to implement fencing. A
-general description of fencing in OpenStack is available here:
-https://wiki.openstack.org/wiki/Fencing_Instances_of_an_Unreachable_Host .
+There is a cross-project definition in OpenStack of how to implement
+fencing, but there has not been any progress. The general description is
+available here:
+https://wiki.openstack.org/wiki/Fencing_Instances_of_an_Unreachable_Host
+
+As OpenStack does not cover fencing, it is the responsibility of the Doctor
+project to make sure fencing is done by using tools like Pacemaker and by
+calling OpenStack APIs. Only after fencing is done can OpenStack resources be
+marked as down. If gaps in OpenStack projects prevent all relevant resources
+from being marked as down, those gaps need to be identified and fixed.
+The Doctor Inspector component will be responsible for marking resources down
+in OpenStack and back up if necessary.
 
 Recovery Action
 ^^^^^^^^^^^^^^^
 
-In the basic "Fault management using ACT-STBY configuration" use case, no
-automatic actions will be taken by the VIM, but all recovery actions executed by
-the VIM and the NFVI will be instructed and coordinated by the Consumer.
+In the basic :ref:`uc-fault1` use case, no automatic actions will be taken by
+the VIM, but all recovery actions executed by the VIM and the NFVI will be
+instructed and coordinated by the Consumer.
 
 In a more advanced use case, the VIM shall be able to recover the failed virtual
 resources according to a pre-defined behavior for that resource. In principle
@@ -211,15 +225,14 @@ request/response message exchange allows the Consumer to find out about active
 alarms at the VIM. A filter can be used to narrow down the alarms returned in
 the response message.
 
-.. _figure4:
-
 .. figure:: images/figure4.png
+   :name: figure4
    :width: 100%
 
    High-level message flow for fault management
 
 The high level message flow for the fault management use case is shown in
-:num:`Figure #figure4`.
+:numref:`figure4`.
 It consists of the following steps:
 
 1. The VIM monitors the physical and virtual resources and the fault management
@@ -249,15 +262,14 @@ maintenance action to be executed. After the request was executed successfully
 error state, the VIM sends a MaintenanceResponse message back to the
 Administrator.
 
-.. _figure5:
-
 .. figure:: images/figure5.png
+   :name: figure5
    :width: 100%
 
    High-level message flow for NFVI maintenance
 
 The high level message flow for the NFVI maintenance use case is shown in
-:num:`Figure #figure5`.
+:numref:`figure5`.
 It consists of the following steps:
 
 1. Maintenance trigger received from administrator.
@@ -281,65 +293,6 @@ It consists of the following steps:
 8. The Administrator is coordinating and executing the maintenance
    operation/work on the NFVI. Note: this step is out of scope of Doctor.
 
-Faults
-------
-
-Faults in the listed elements need to be immediately notified to the Consumer in
-order to perform an immediate action like live migration or switch to a hot
-standby entity. In addition, the Administrator of the host should trigger a
-maintenance action to, e.g., reboot the server or replace a defective hardware
-element.
-
-Faults can be of different severity, i.e., critical, warning, or
-info. Critical faults require immediate action as a severe degradation of the
-system has happened or is expected. Warnings indicate that the system
-performance is going down: related actions include closer (e.g. more frequent)
-monitoring of that part of the system or preparation for a cold migration to a
-backup VM. Info messages do not require any action. We also consider a type
-"maintenance", which is no real fault, but may trigger maintenance actions
-like a re-boot of the server or replacement of a faulty, but redundant HW.
-
-Faults can be gathered by, e.g., enabling SNMP and installing some open source
-tools to catch and poll SNMP. When using for example Zabbix one can also put an
-agent running on the hosts to catch any other fault. In any case of failure, the
-Administrator should be notified. Table 1 provides a list of high level faults
-that are considered within the scope of the Doctor project requiring immediate
-action by the Consumer.
-
-
-+------------------+---------------------------------------------------------------------------------------------------------------------------+------------------+-------------------+------------------------------------------------------------------------------------------+----------------------------------------------------------------------+
-| Service          | Fault                                                                                                                     | Severity         | How to detect?    | Comment                                                                                  | Action to recover                                                    |
-+------------------+---------------------------------------------------------------------------------------------------------------------------+------------------+-------------------+------------------------------------------------------------------------------------------+----------------------------------------------------------------------+
-| Compute Hardware | Processor/CPU failure, CPU condition not ok                                                                               | Critical         | Zabbix            |                                                                                          | Switch to hot standby                                                |
-+                  +---------------------------------------------------------------------------------------------------------------------------+------------------+-------------------+------------------------------------------------------------------------------------------+----------------------------------------------------------------------+
-|                  | Memory failure/Memory condition not ok                                                                                    | Critical         | Zabbix (IPMI)     |                                                                                          | Switch to hot standby                                                |
-+                  +---------------------------------------------------------------------------------------------------------------------------+------------------+-------------------+------------------------------------------------------------------------------------------+----------------------------------------------------------------------+
-|                  | Network card failure, e.g. network adapter connectivity lost                                                              | Critical         | Zabbix/Ceilometer |                                                                                          | Switch to hot standby                                                |
-+                  +---------------------------------------------------------------------------------------------------------------------------+------------------+-------------------+------------------------------------------------------------------------------------------+----------------------------------------------------------------------+
-|                  | Disk crash                                                                                                                | Info             | RAID monitoring   | Network storage is very redundant (e.g. RAID system) and can guarantee high availability | Inform OAM                                                           |
-+                  +---------------------------------------------------------------------------------------------------------------------------+------------------+-------------------+------------------------------------------------------------------------------------------+----------------------------------------------------------------------+
-|                  | Storage controller                                                                                                        | Critical         | Zabbix (IPMI)     |                                                                                          | Live migration if storage is still accessible; otherwise hot standby |
-+                  +---------------------------------------------------------------------------------------------------------------------------+------------------+-------------------+------------------------------------------------------------------------------------------+----------------------------------------------------------------------+
-|                  | PDU/power failure, power off, server reset                                                                                | Critical         | Zabbix/Ceilometer |                                                                                          | Switch to hot standby                                                |
-+                  +---------------------------------------------------------------------------------------------------------------------------+------------------+-------------------+------------------------------------------------------------------------------------------+----------------------------------------------------------------------+
-|                  | Power degradation, power redundancy lost, power threshold exceeded                                                        | Warning          | SNMP              |                                                                                          | Live migration                                                       |
-+                  +---------------------------------------------------------------------------------------------------------------------------+------------------+-------------------+------------------------------------------------------------------------------------------+----------------------------------------------------------------------+
-|                  | Chassis problem (.e.g fan degraded/failed, chassis power degraded), CPU fan problem, temperature/thermal condition not ok | Warning          | SNMP              |                                                                                          | Live migration                                                       |
-+                  +---------------------------------------------------------------------------------------------------------------------------+------------------+-------------------+------------------------------------------------------------------------------------------+----------------------------------------------------------------------+
-|                  | Mainboard failure                                                                                                         | Critical         | Zabbix (IPMI)     |                                                                                          | Switch to hot standby                                                |
-+                  +---------------------------------------------------------------------------------------------------------------------------+------------------+-------------------+------------------------------------------------------------------------------------------+----------------------------------------------------------------------+
-|                  | OS crash (e.g. kernel panic)                                                                                              | Critical         | Zabbix            |                                                                                          | Switch to hot standby                                                |
-+------------------+---------------------------------------------------------------------------------------------------------------------------+------------------+-------------------+------------------------------------------------------------------------------------------+----------------------------------------------------------------------+
-| Hypervisor       | System has restarted                                                                                                      | Critical         | Zabbix            |                                                                                          | Switch to hot standby                                                |
-+                  +---------------------------------------------------------------------------------------------------------------------------+------------------+-------------------+------------------------------------------------------------------------------------------+----------------------------------------------------------------------+
-|                  | Hypervisor failure                                                                                                        | Warning/Critical | Zabbix/Ceilometer |                                                                                          | Evacuation/switch to hot standby                                     |
-+                  +---------------------------------------------------------------------------------------------------------------------------+------------------+-------------------+------------------------------------------------------------------------------------------+----------------------------------------------------------------------+
-|                  | Zabbix/Ceilometer is unreachable                                                                                          | Warning          | ?                 |                                                                                          | Live migration                                                       |
-+------------------+---------------------------------------------------------------------------------------------------------------------------+------------------+-------------------+------------------------------------------------------------------------------------------+----------------------------------------------------------------------+
-| Network          | SDN/OpenFlow switch, controller degraded/failed                                                                           | Critical         | ?                 |                                                                                          | Switch to hot standby or reconfigure virtual network topology        |
-+                  +---------------------------------------------------------------------------------------------------------------------------+------------------+-------------------+------------------------------------------------------------------------------------------+----------------------------------------------------------------------+
-|                  | Hardware failure of physical switch/router                                                                                | Warning          | SNMP              | Redundancy of physical infrastructure is reduced or no longer available                  | Live migration if possible, otherwise evacuation                     |
-+------------------+---------------------------------------------------------------------------------------------------------------------------+------------------+-------------------+------------------------------------------------------------------------------------------+----------------------------------------------------------------------+
-
 ..
  vim: set tabstop=4 expandtab textwidth=80:
+
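
The fault correlation requirement introduced by this change (physical to physical, physical to virtual, and virtual to virtual) can be illustrated with a small sketch. This is not Doctor or OpenStack code: the class ``FaultCorrelationMap``, its methods, and all resource names are hypothetical and only show one way a configurable correlation could be used to derive the set of resources affected by a fault.

```python
from collections import defaultdict, deque


class FaultCorrelationMap:
    """Hypothetical, configurable map of fault propagation between resources."""

    def __init__(self):
        # resource id -> resource ids that become unavailable together with it
        self._dependents = defaultdict(set)

    def correlate(self, source, dependent):
        """Declare that a fault on `source` also affects `dependent`."""
        self._dependents[source].add(dependent)

    def affected_by(self, faulty):
        """Return every resource reachable from `faulty`, excluding itself."""
        seen = set()
        queue = deque([faulty])
        while queue:
            current = queue.popleft()
            for dep in self._dependents.get(current, ()):
                if dep not in seen and dep != faulty:
                    seen.add(dep)
                    queue.append(dep)
        return seen


# Assumed example topology; none of these identifiers come from the source.
correlations = FaultCorrelationMap()
correlations.correlate("switch-1", "host-1")  # physical -> physical
correlations.correlate("host-1", "vm-a")      # physical -> virtual
correlations.correlate("host-1", "vm-b")      # physical -> virtual
correlations.correlate("vm-a", "vport-a")     # virtual -> virtual

affected = correlations.affected_by("switch-1")
```

Under these assumptions, a fault on ``switch-1`` is correlated transitively to the host, both VMs, and the virtual port, while a fault on ``vm-b`` affects nothing else. A real VIM would presumably load such correlations from configuration rather than hard-coding them.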