requirements/02-use_cases.rst

   1 Use cases and scenarios
   2 =======================
   3
   4 Telecom services often have very high requirements on service performance. As a
   5 consequence they often utilize redundancy and high availability (HA) mechanisms
   6 for both the service and the platform. The HA support may be built-in or
   7 provided by the platform. In any case, the HA support typically has a very fast
   8 detection and reaction time to minimize service impact. The main changes
   9 proposed in this document are about making a clear distinction between a) fault
  10 management and recovery within the VIM/NFVI on the one hand and b) High
  11 Availability support for VNFs on the other, claiming that HA support within a
  12 VNF or as a service from the platform is outside the scope of Doctor and is
  13 discussed in the High Availability for OPNFV project. Doctor should focus on
  14 detecting and remediating faults in the NFVI. This will ensure that applications
  15 come back to a fully redundant configuration faster than before.
  16
  17 As an example, Telecom services can come with an Active-Standby (ACT-STBY)
  18 configuration which is a (1+1) redundancy scheme. ACT and STBY nodes (aka
  19 Physical Network Function (PNF) in ETSI NFV terminology) are in a hot standby
  20 configuration. If an ACT node is unable to function properly due to fault or any
  21 other reason, the STBY node is instantly made ACT, and affected services can be
  22 provided without any service interruption.
  23
  24 The ACT-STBY configuration needs to be maintained. This means, when a STBY node
  25 is made ACT, either the previously ACT node, after recovery, shall be made STBY,
  26 or, a new STBY node needs to be configured. The actual operations to
  27 instantiate/configure a new STBY are similar to instantiating a new VNF and
  28 therefore are outside the scope of this project.
  29
  30 The NFVI fault management and maintenance requirements aim at providing fast
  31 failure detection of physical and virtualized resources and remediation of the
  32 virtualized resources provided to Consumers according to their predefined
  33 request to enable applications to recover to a fully redundant mode of
  34 operation.
  35
  36 1. Fault management/recovery using ACT-STBY configuration (Triggered by critical
  37    error)
  38 2. Preventive actions based on fault prediction (Preventing service stop by
  39    handling warnings)
  40 3. VM Retirement (Managing service during NFVI maintenance, i.e. H/W,
  41    Hypervisor, Host OS, maintenance)
  42
  43 Faults
  44 ------
  45
  46 Fault management using ACT-STBY configuration
  47 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  48
  49 In :num:`Figure #figure1`, a system-wide view of relevant functional blocks is
  50 presented. OpenStack is considered as the VIM implementation (aka Controller)
  51 which has interfaces with the NFVI and the Consumers. The VNF implementation is
  52 represented as different virtual resources marked by different colors. Consumers
  53 (VNFM or NFVO in ETSI NFV terminology) own/manage the respective virtual
  54 resources (VMs in this example) shown with the same colors.
  55
  56 The first requirement in this use case is that the Controller needs to detect
  57 faults in the NFVI ("1. Fault Notification" in :num:`Figure #figure1`) affecting
  58 the proper functioning of the virtual resources (labelled as VM-x) running on
  59 top of it. It should be possible to configure which relevant fault items should
  60 be detected. The VIM (e.g. OpenStack) itself could be extended to detect such
  61 faults. Alternatively, a third party fault monitoring tool could be used which
  62 then informs the VIM about such faults; this third party fault monitoring
  63 element can be considered as a component of VIM from an architectural point of
  64 view.
  65
  66 Once such a fault is detected, the VIM shall find out which virtual resources
  67 are affected by it. In the example in :num:`Figure #figure1`, VM-4 is
  68 affected by a fault in Hardware Server-3. Such a mapping shall be maintained
  69 in the VIM, depicted as the "Server-VM info" table inside the VIM.
  70
  71 Once the VIM has identified which virtual resources are affected by the fault,
  72 it needs to find out who is the Consumer (i.e. the owner/manager) of the
  73 affected virtual resources (Step 2). In the example shown in :num:`Figure
  74 #figure1`, the VIM knows that for the red VM-4, the manager is the red Consumer
  75 through an Ownership info table. The VIM then notifies (Step 3 "Fault
  76 Notification") the red Consumer about this fault, preferably with sufficient
  77 abstraction rather than detailed physical fault information.
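
The lookup chain in Steps 2 and 3 can be sketched in a few lines of Python. This
is only an illustration, not the Doctor implementation: the two dictionaries
stand in for the "Server-VM info" and "Ownership info" tables, and every entry
other than VM-4 on Server-3 with the red Consumer is a made-up assumption, as
are the names ``FaultNotification`` and ``notifications_for_fault``.

```python
from dataclasses import dataclass

# Hypothetical stand-ins for the "Server-VM info" and "Ownership info" tables.
# Only VM-4 / Server-3 / red Consumer is taken from the example in the text.
SERVER_VM_INFO = {
    "Server-1": ["VM-1"],
    "Server-2": ["VM-2", "VM-3"],
    "Server-3": ["VM-4"],
}
OWNERSHIP_INFO = {"VM-1": "blue", "VM-2": "green", "VM-3": "green", "VM-4": "red"}


@dataclass(frozen=True)
class FaultNotification:
    consumer: str
    affected_vms: tuple
    summary: str  # abstracted description, no physical fault details


def notifications_for_fault(faulty_server):
    """Map a faulty server to one abstracted notification per Consumer."""
    by_consumer = {}
    for vm in SERVER_VM_INFO.get(faulty_server, []):
        by_consumer.setdefault(OWNERSHIP_INFO[vm], []).append(vm)
    return [
        FaultNotification(consumer, tuple(vms), "virtual resource unavailable")
        for consumer, vms in sorted(by_consumer.items())
    ]
```

Calling ``notifications_for_fault("Server-3")`` yields a single notification
addressed to the red Consumer that names VM-4 but not the failed hardware.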
  78
  79 .. _figure1:
  80
  81 .. figure:: images/figure1.png
  82    :width: 100%
  83
  84    Fault management/recovery use case
  85
  86 The Consumer then switches to STBY configuration by switching the STBY node to
  87 ACT state (Step 4). It further initiates a process to instantiate/configure a
  88 new STBY. However, switching to STBY mode and creating a new STBY machine is a
  89 VNFM/NFVO level operation and therefore outside the scope of this project.
  90 The Doctor project does not create interfaces for such VNFM level configuration
  91 operations. Yet, since the total failover time of a consumer service depends on
  92 both the delay of such processes as well as the reaction time of Doctor
  93 components, minimizing Doctor's reaction time is a necessary basic ingredient
  94 for fast failover times in general.
  95
  96 Once the Consumer has switched to STBY configuration, it notifies (Step 5
  97 "Instruction" in :num:`Figure #figure1`) the VIM. The VIM can then take
  98 necessary (e.g. pre-determined by the involved network operator) actions on how
  99 to clean up the fault affected VMs (Step 6 "Execute Instruction").
 100
 101 The key issue in this use case is that a VIM (OpenStack in this context) shall
 102 not take a standalone fault recovery action (e.g. migration of the affected VMs)
 103 before the ACT-STBY switching is complete, as that might violate the ACT-STBY
 104 configuration and render the node out of service.
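
One way to picture this ordering constraint is a small guard that refuses to
run any clean-up action on a VM until the Consumer's instruction (Step 5) has
arrived. The sketch below is an assumption-laden illustration: ``RecoveryGuard``,
``VmState`` and their methods are hypothetical names, not part of OpenStack or
Doctor.

```python
from enum import Enum


class VmState(Enum):
    FAULT_NOTIFIED = "fault_notified"  # Step 3 sent, Consumer has not answered
    INSTRUCTED = "instructed"          # Step 5 received from the Consumer
    CLEANED_UP = "cleaned_up"          # Step 6 executed


class RecoveryGuard:
    """Hypothetical per-VM gate between fault notification and clean-up."""

    def __init__(self):
        self._states = {}

    def state(self, vm):
        return self._states.get(vm)

    def fault_notified(self, vm):
        self._states[vm] = VmState.FAULT_NOTIFIED

    def instruction_received(self, vm):
        if self._states.get(vm) is not VmState.FAULT_NOTIFIED:
            raise RuntimeError(f"no pending fault notification for {vm}")
        self._states[vm] = VmState.INSTRUCTED

    def execute_instruction(self, vm, action):
        # Refuse standalone recovery: the Consumer must have confirmed the
        # ACT-STBY switch before the VIM touches the affected VM.
        if self._states.get(vm) is not VmState.INSTRUCTED:
            raise RuntimeError(f"{vm}: Consumer has not confirmed the switch")
        action(vm)
        self._states[vm] = VmState.CLEANED_UP
```

With this gate, a migration attempted right after ``fault_notified("VM-4")``
raises an error, while the same call succeeds once ``instruction_received`` has
been recorded.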
 105
 106 As an extension of the 1+1 ACT-STBY resilience pattern, a STBY instance can act as
 107 backup to N ACT nodes (N+1). In this case, the basic information flow remains
 108 the same, i.e., the consumer is informed of a failure in order to activate the
 109 STBY node. However, in this case it might be useful for the failure notification
 110 to cover a number of failed instances due to the same fault (e.g., more than one
 111 instance might be affected by a switch failure). The reaction of the consumer
 112 might depend on whether only one active instance has failed (similar to the
 113 ACT-STBY case), or whether several have failed and more instances are needed.
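
The Consumer-side decision is outside the scope of Doctor, but a toy example
shows why a notification covering all failed instances is useful in the N+1
case. The function name ``plan_consumer_reaction`` and the instance labels are
hypothetical.

```python
def plan_consumer_reaction(failed_instances, standby_instances):
    """Pair failed ACT instances with available STBY instances.

    Returns (takeover, uncovered): takeover maps each covered failed instance
    to the STBY instance that replaces it; uncovered lists failed instances
    for which no STBY instance is left and new capacity would be needed.
    """
    failed = sorted(set(failed_instances))
    takeover = dict(zip(failed, standby_instances))
    uncovered = [inst for inst in failed if inst not in takeover]
    return takeover, uncovered
```

With one STBY instance, a single failure is fully covered, whereas a switch
failure that takes down two ACT instances leaves one of them uncovered.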
 114
 115 Preventive actions based on fault prediction
 116 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 117
 118 The fault management scenario explained in Clause 2.1.1 can also be performed
 119 based on fault prediction. In such cases, in VIM, there is an intelligent fault
 120 prediction module which, based on its NFVI monitoring information, can predict
 121 an imminent fault in the elements of the NFVI. A simple example is a rising
 122 temperature of a Hardware Server which might trigger a pre-emptive recovery
 123 action. The requirements of such fault prediction in the VIM are investigated in
 124 the OPNFV project "Data Collection for Failure Prediction" [PRED]_.
 125
 126 This use case is very similar to "Fault management using ACT-STBY configuration"
 127 in Clause 2.1.1. Instead of a fault detection (Step 1 "Fault Notification" in
 128 :num:`Figure #figure1`), the trigger comes from a fault prediction module in the
 129 VIM, or from a third party module which notifies the VIM about an imminent
 130 fault. For Steps 2 to 5, the workflow is the same as in the "Fault management
 131 using ACT-STBY configuration" use case, except in this case, the Consumer of a
 132 VM/VNF switches to STBY configuration based on a predicted fault, rather than an
 133 occurred fault.
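
As a deliberately naive illustration of the rising-temperature example, the
sketch below extrapolates a straight line through the first and last readings
and reports a predicted fault if the limit would be reached within a given
number of future samples. The function name, the limit and the horizon are
assumptions chosen for the example; this is not the method of the failure
prediction project referenced above.

```python
def predicts_overheating(readings, limit_celsius, horizon_samples):
    """Return True if the temperature trend reaches the limit within the horizon.

    readings: temperatures in degrees Celsius, oldest first, evenly spaced.
    horizon_samples: how many future sampling intervals to look ahead.
    """
    if not readings:
        return False
    if readings[-1] >= limit_celsius:
        return True  # already at or above the limit
    if len(readings) < 2:
        return False  # no trend can be derived from a single sample
    # Average change per sampling interval between first and last reading.
    slope = (readings[-1] - readings[0]) / (len(readings) - 1)
    if slope <= 0:
        return False  # stable or cooling
    projected = readings[-1] + slope * horizon_samples
    return projected >= limit_celsius
```

A positive result would take the place of Step 1 and start the same
notification flow as an actual fault.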
 134
 135 NFVI Maintenance
 136 ----------------
 137
 138 VM Retirement
 139 ^^^^^^^^^^^^^
 140
 141 All network operators perform maintenance of their network infrastructure, both
 142 regularly and irregularly. Besides the hardware, virtualization is expected to
 143 increase the number of elements subject to such maintenance as NFVI holds new
 144 elements like the hypervisor and host OS. Maintenance of a particular resource
 145 element e.g. hardware, hypervisor etc. may render a particular server hardware
 146 unusable until the maintenance procedure is complete.
 147
 148 However, the Consumer of VMs needs to know that such resources will be
 149 unavailable because of NFVI maintenance. The following use case is again to
 150 ensure that the ACT-STBY configuration is not violated. A stand-alone action
 151 (e.g. live migration) from VIM/OpenStack to empty a physical machine so that
 152 the subsequent maintenance procedure can be performed may not only violate the
 153 ACT-STBY configuration, but also have impact on real-time processing scenarios
 154 where resources dedicated to virtual resources (e.g. VMs) are necessary and a
 155 pause in operation (e.g. vCPU) is not allowed. The Consumer is in a position to
 156 safely perform the switch between ACT and STBY nodes, or switch to an
 157 alternative VNF forwarding graph so the hardware servers hosting the ACT nodes
 158 can be emptied for the upcoming maintenance operation. Once the target hardware
 159 servers are emptied (i.e. no virtual resources are running on top), the VIM can
 160 mark them with an appropriate flag (i.e. "maintenance" state) such that these
 161 servers are not considered for hosting of virtual machines until the maintenance
 162 flag is cleared (i.e. nodes are back in "normal" status).
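
The effect of the maintenance flag on VM placement can be sketched as follows.
``HostPool`` is a hypothetical toy scheduler written for this example; it does
not reflect any actual OpenStack API.

```python
class HostPool:
    """Toy placement logic that skips hosts flagged for maintenance."""

    def __init__(self, hosts):
        self._vms = {host: set() for host in hosts}
        self._maintenance = set()

    def eligible_hosts(self):
        return sorted(h for h in self._vms if h not in self._maintenance)

    def place(self, vm):
        candidates = self.eligible_hosts()
        if not candidates:
            raise RuntimeError("no host available for placement")
        # Pick the least loaded eligible host; ties go to the first name.
        host = min(candidates, key=lambda h: len(self._vms[h]))
        self._vms[host].add(vm)
        return host

    def remove(self, vm):
        for vms in self._vms.values():
            vms.discard(vm)

    def set_maintenance(self, host):
        # A host may only be flagged once it has been emptied.
        if self._vms[host]:
            raise RuntimeError(f"{host} still hosts virtual resources")
        self._maintenance.add(host)

    def clear_maintenance(self, host):
        self._maintenance.discard(host)
```

Flagging a host fails while VMs are still running on it, and a flagged host
receives no new VMs until ``clear_maintenance`` returns it to normal status.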
 163
 164 A high-level view of the maintenance procedure is presented in :num:`Figure
 165 #figure2`. VIM/OpenStack, through its northbound interface, receives a
 166 maintenance notification (Step 1 "Maintenance Request") from the Administrator
 167 (e.g. a network operator) including information about which hardware is subject
 168 to maintenance. Maintenance operations include replacement/upgrade of hardware,
 169 update/upgrade of the hypervisor/host OS, etc.
 170
 171 The subsequent steps to enable the Consumer to perform ACT-STBY switching are
 172 very similar to the fault management scenario. From VIM/OpenStack's internal
 173 database, it finds out which virtual resources (VM-x) are running on those
 174 particular Hardware Servers and who are the managers of those virtual resources
 175 (Step 2). The VIM then informs the respective Consumer (VNFMs or NFVO) in Step 3
 176 "Maintenance Notification". Based on this, the Consumer takes necessary actions
 177 (Step 4, e.g. switch to STBY configuration or switch VNF forwarding graphs) and
 178 then notifies (Step 5 "Instruction") the VIM. Upon receiving such notification,
 179 the VIM takes necessary actions (Step 6 "Execute Instruction") to empty the
 180 Hardware Servers so that subsequent maintenance operations can be performed.
 181 Due to the similarity of Steps 2 to 6, the maintenance procedure and the fault
 182 management procedure are investigated in the same project.
 183
 184 .. _figure2:
 185
 186 .. figure:: images/figure2.png
 187    :width: 100%
 188
 189    Maintenance use case
 190
 191 ..
 192  vim: set tabstop=4 expandtab textwidth=80: