High level architecture and general features
============================================

Functional overview
-------------------

The Doctor project centers on two distinct use cases: 1) management of
failures of virtualized resources and 2) planned maintenance, e.g. migration, of
virtualized resources. Both may affect a VNF/application and the network service
it provides, but they differ in how frequently they occur and in how they can be
handled.

Failures are spontaneous events that may or may not have an impact on the
virtual resources. The Consumer should react to the failure as soon as possible,
e.g., by switching to the standby (STBY) node. The Consumer will then instruct
the VIM on how to clean up or repair the lost virtual resources, i.e. restore
the VM, VLAN or virtualized storage. How much the applications are affected
varies. Applications with built-in HA support might experience a short decrease
in retainability (e.g. an ongoing session might be lost) while keeping
availability (establishment or re-establishment of sessions is not affected),
whereas the impact on applications without built-in HA may be more serious. How
much the network service is impacted depends on how the service is implemented.
With sufficient network redundancy the service may be unaffected even when a
specific resource fails.

On the other hand, planned maintenance operations impacting virtualized
resources are events that are known in advance. This group includes, e.g.,
migration due to software upgrades of the OS and hypervisor on a compute host.
Some of these might have been requested by the application or its management
solution, but there is also a need for coordination on the actual operations on
the virtual resources. There may be an impact on the applications and the
service, but since these are not spontaneous events there is room for planning
and coordination between the application management organization and the
infrastructure management organization, including performing whatever actions
are required to minimize the problems.

Failure prediction is the process of proactively identifying situations that
may lead to a failure in the future unless acted on by means of maintenance
activities. From the applications' point of view, failure prediction may impact
them in two ways: either the warning time is so short that the application or
its management solution does not have time to react, in which case it is
equivalent to the failure scenario, or there is sufficient time to avoid the
consequences by means of maintenance activities, in which case it is similar to
planned maintenance.

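The two outcomes of a failure prediction can be read as a comparison between
the warning time and the time needed to react. The following is a minimal
sketch of that reading. The names ``Handling`` and ``classify_prediction`` and
the idea of a single ``reaction_time_s`` value are illustrative assumptions and
are not defined by the Doctor requirements.

.. code-block:: python

   from enum import Enum


   class Handling(Enum):
       """How a predicted failure looks from the application's point of view."""
       AS_FAILURE = "failure"                  # too late to react
       AS_MAINTENANCE = "planned maintenance"  # enough time to act in advance


   def classify_prediction(warning_time_s: float,
                           reaction_time_s: float) -> Handling:
       """Classify a prediction by comparing warning time and reaction time.

       Both parameters are hypothetical: warning_time_s is the time left until
       the predicted failure, reaction_time_s is the time the application or
       its management solution needs to take preventive action.
       """
       if warning_time_s < reaction_time_s:
           return Handling.AS_FAILURE
       return Handling.AS_MAINTENANCE

For example, ``classify_prediction(0.5, 30.0)`` falls into the failure case,
while ``classify_prediction(3600.0, 30.0)`` can be handled like planned
maintenance.
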
Architecture Overview
---------------------

NFV and the Cloud platform provide virtual resources and related control
functionality to users and administrators. :numref:`figure3` shows the high
level architecture of NFV focusing on the NFVI, i.e., the virtualized
infrastructure. The NFVI provides virtual resources, such as virtual machines
(VM) and virtual networks. Those virtual resources are used to run applications,
i.e. VNFs, which could be components of a network service that is managed by
the consumer of the NFVI. The VIM provides the consumers, i.e., users and
administrators, with functionality for controlling and viewing virtual
resources on hardware (physical) resources. OpenStack is a prominent candidate
for this VIM. The administrator may also directly control the NFVI without
using the VIM.

Although OpenStack is the target upstream project where the new functional
elements (Controller, Notifier, Monitor, and Inspector) are expected to be
implemented, a particular implementation method is not assumed. Some of these
elements may sit outside of OpenStack and offer a northbound interface to
OpenStack.

General Features and Requirements
---------------------------------

The following features are required for the VIM to achieve high availability of
applications (e.g., MME, S/P-GW) and the Network Services:

1. Monitoring: Monitor physical and virtual resources.
2. Detection: Detect unavailability of physical resources.
3. Correlation and Cognition: Correlate faults and identify affected virtual
   resources.
4. Notification: Notify the Consumer(s) of unavailable virtual resources.
5. Fencing: Shut down or isolate a faulty resource.
6. Recovery action: Execute actions to process fault recovery and maintenance.

The time interval between the instant that an event is detected by the
monitoring system and the notification of the Consumer about unavailable
resources shall be < 1 second (e.g., Step 1 to Step 4 in :numref:`figure4` and
:numref:`figure5`).

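A check of this timing requirement could look like the sketch below. It is only
an illustration: the constant and function names are assumptions, and both
timestamps are expected to come from the same monotonic clock.

.. code-block:: python

   import time

   # Requirement from the text: detection-to-notification takes < 1 second.
   MAX_NOTIFICATION_DELAY_S = 1.0


   def within_deadline(detected_at: float, notified_at: float) -> bool:
       """Return True if the notification left within the 1-second budget."""
       return (notified_at - detected_at) < MAX_NOTIFICATION_DELAY_S


   detected_at = time.monotonic()   # Step 1: monitored event detected
   # ... correlation, lookup and notification (Steps 2 to 4) would run here ...
   notified_at = time.monotonic()   # Step 4: notification sent to the Consumer
   print(within_deadline(detected_at, notified_at))
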
.. figure:: images/figure3.png
   :name: figure3
   :width: 100%

   High level architecture

Monitoring
^^^^^^^^^^

The VIM shall monitor physical and virtual resources for unavailability and
suspicious behavior.

Detection
^^^^^^^^^

The VIM shall detect unavailability and failures of physical resources that
might cause errors/faults in virtual resources running on top of them.
Unavailability of a physical resource is detected by various monitoring and
management tools for hardware and software components. This may also include
predicting upcoming faults. Note that fault prediction is out of scope of this
project and is investigated in the OPNFV "Data Collection for Failure
Prediction" project [PRED]_.

The fault items/events to be detected shall be configurable.

The configuration shall enable Failure Selection and Aggregation. Failure
aggregation means that the VIM determines the unavailability of a physical
resource from more than two non-critical failures related to the same resource.

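Failure selection and aggregation can be sketched as a filter followed by a
per-resource count. The ``FaultEvent`` type, the fault type names, and the rule
that a single critical failure makes a resource unavailable are assumptions made
for this example only. The default threshold of two follows the "more than two
non-critical failures" wording above.

.. code-block:: python

   from collections import Counter
   from typing import Iterable, NamedTuple, Set


   class FaultEvent(NamedTuple):
       resource_id: str   # physical resource the raw failure relates to
       fault_type: str    # illustrative names such as "fan-degraded"
       critical: bool


   def unavailable_resources(events: Iterable[FaultEvent],
                             selected_types: Set[str],
                             threshold: int = 2) -> Set[str]:
       """Apply failure selection and aggregation to raw fault events.

       Selection: only fault types listed in selected_types are considered.
       Aggregation: a resource is reported unavailable on any critical failure
       (an assumption) or on more than `threshold` non-critical failures.
       """
       unavailable = set()
       non_critical = Counter()
       for event in events:
           if event.fault_type not in selected_types:
               continue  # not selected by the configuration
           if event.critical:
               unavailable.add(event.resource_id)
           else:
               non_critical[event.resource_id] += 1
       unavailable.update(r for r, n in non_critical.items() if n > threshold)
       return unavailable
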
There are two types of unavailability - immediate and future:

* Immediate unavailability can be detected by setting traps of raw failures on
  hardware monitoring tools.
* Future unavailability can be found by receiving maintenance instructions
  issued by the administrator of the NFVI or by failure prediction mechanisms.

Correlation and Cognition
^^^^^^^^^^^^^^^^^^^^^^^^^

The VIM shall correlate each fault to the impacted virtual resource, i.e., the
VIM shall identify unavailability of virtualized resources that are or will be
affected by failures on the physical resources under them. Unavailability of a
virtualized resource is determined by referring to the mapping of physical and
virtualized resources.

The VIM shall allow configuration of fault correlation between physical and
virtual resources. The VIM shall support correlating faults:

* between a physical resource and another physical resource
* between a physical resource and a virtual resource
* between a virtual resource and another virtual resource

Failure aggregation is also required in this feature, e.g., a user may request
to be notified only if failures have occurred on more than two standby VMs in
an (N+M) deployment model.

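The correlation described in this section can be pictured as a traversal of a
dependency map that covers all three kinds of relation. The sketch below
assumes such a map is available as a plain dictionary. The resource names and
the ``DEPENDENTS`` and ``VIRTUAL`` structures are hypothetical.

.. code-block:: python

   from collections import deque
   from typing import Dict, List, Set

   # Hypothetical dependency map: resource -> resources that depend on it.
   DEPENDENTS: Dict[str, List[str]] = {
       "switch-1": ["host-1", "host-2"],   # physical -> physical
       "host-1": ["vm-a", "vm-b"],         # physical -> virtual
       "host-2": ["vm-c"],
       "vm-a": ["vport-a"],                # virtual -> virtual
   }
   VIRTUAL: Set[str] = {"vm-a", "vm-b", "vm-c", "vport-a"}


   def affected_virtual_resources(faulty: str,
                                  dependents: Dict[str, List[str]] = DEPENDENTS,
                                  virtual: Set[str] = VIRTUAL) -> Set[str]:
       """Walk the dependency map breadth-first from the faulty resource and
       return every virtual resource reached on the way."""
       seen = {faulty}
       queue = deque([faulty])
       while queue:
           current = queue.popleft()
           for child in dependents.get(current, []):
               if child not in seen:
                   seen.add(child)
                   queue.append(child)
       return seen & virtual

With this map, a fault on ``host-1`` affects ``vm-a``, ``vm-b`` and ``vport-a``,
while a fault on ``switch-1`` affects all four virtual resources.
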
Notification
^^^^^^^^^^^^

The VIM shall send the alarm, i.e., the unavailability of virtual resource(s),
to the Consumer owning them over the northbound interface, so that the
Consumers impacted by the failure can take appropriate actions to recover from
it.

The VIM shall also notify its Administrator of the unavailability of physical
resources.

All notifications shall be transferred immediately in order to minimize the
stalling time of the network service and to avoid over-assignment caused by
delayed capability updates.

There may be multiple consumers, so the VIM has to find out the owner of a
faulty resource. Moreover, there may be a large number of virtual and physical
resources in a real deployment, so polling the VIM for the state of all
resources would lead to heavy signaling traffic. Thus, a publish/subscribe
messaging model is better suited for these notifications, as notifications are
only sent to subscribed consumers.

Notifications will be sent out according to the configuration provided by the
consumer. The configuration includes endpoint(s), in which the consumer can
specify multiple targets for the notification subscription, so that various and
multiple receiver functions can consume the notification message.
Also, the conditions for notifications shall be configurable, such that
the consumer can set corresponding policies, e.g. whether it wants to receive
fault notifications or not.

Note: the VIM should only accept notification subscriptions for a resource
from its owner or administrator.

Notifications to the Consumer about the unavailability of virtualized
resources will include a description of the fault, preferably with sufficient
abstraction rather than detailed physical fault information.

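The subscription model described above can be sketched as follows. This is an
in-memory illustration only, not a proposed interface: the ``Notifier`` and
``Subscription`` classes, the ``wants`` condition, and the use of plain
callables as endpoints are all assumptions.

.. code-block:: python

   from dataclasses import dataclass, field
   from typing import Callable, Dict, List, Set

   Notification = Dict[str, str]               # {"resource": ..., "fault": ...}
   Endpoint = Callable[[Notification], None]   # stand-in for a delivery target


   def accept_all(note: Notification) -> bool:
       """Default condition: deliver every notification."""
       return True


   @dataclass
   class Subscription:
       consumer: str
       endpoints: List[Endpoint]                        # multiple targets
       wants: Callable[[Notification], bool] = accept_all


   @dataclass
   class Notifier:
       owners: Dict[str, str]                           # resource -> owner
       admins: Set[str] = field(default_factory=set)
       subs: Dict[str, List[Subscription]] = field(default_factory=dict)

       def subscribe(self, resource: str, sub: Subscription) -> None:
           # Only the owner of the resource or an administrator may subscribe.
           is_owner = sub.consumer == self.owners.get(resource)
           if not is_owner and sub.consumer not in self.admins:
               raise PermissionError(
                   f"{sub.consumer} may not subscribe to {resource}")
           self.subs.setdefault(resource, []).append(sub)

       def publish(self, resource: str, fault: str) -> None:
           note = {"resource": resource, "fault": fault}
           for sub in self.subs.get(resource, []):
               if sub.wants(note):                      # configured condition
                   for endpoint in sub.endpoints:       # fan out to each target
                       endpoint(note)

Only subscribed consumers receive a message, so no polling of the VIM is
needed. A subscription attempt by a consumer that neither owns the resource nor
is an administrator raises ``PermissionError``.
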
.. _fencing:

Fencing
^^^^^^^

Recovery actions, e.g. safe VM evacuation, have to be preceded by fencing the
failed host. Fencing here means isolating or shutting down a faulty resource.
Without fencing, when the perceived disconnection is due to some transient
or partial failure, the evacuation might lead to two identical instances
running at the same time and causing a dangerous conflict.

There is a cross-project definition in OpenStack of how to implement
fencing, but there has not been any progress. The general description is
available here:
https://wiki.openstack.org/wiki/Fencing_Instances_of_an_Unreachable_Host

As OpenStack does not cover fencing, it is the responsibility of the Doctor
project to make sure fencing is done by using tools like pacemaker and by
calling OpenStack APIs. Only after fencing is done can OpenStack resources be
marked as down. In case there are gaps in OpenStack projects that prevent all
relevant resources from being marked as down, those gaps need to be identified
and fixed. The Doctor Inspector component will be responsible for marking
resources down in OpenStack and back up if necessary.

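The ordering constraint (fence first, then mark down, then recover) can be
sketched as below. The three callables are placeholders chosen for this
example: ``fence`` could wrap a tool such as pacemaker, while ``mark_down`` and
``evacuate`` could wrap OpenStack API calls. No specific API is implied.

.. code-block:: python

   from typing import Callable


   def recover_host(host: str,
                    fence: Callable[[str], bool],
                    mark_down: Callable[[str], None],
                    evacuate: Callable[[str], None]) -> bool:
       """Run recovery for a failed host in the required order.

       Returns False, without marking anything down or evacuating, if
       fencing did not succeed.
       """
       if not fence(host):
           return False      # never mark down or evacuate an unfenced host
       mark_down(host)       # only after fencing may resources be marked down
       evacuate(host)        # recovery action, e.g. safe VM evacuation
       return True
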
Recovery Action
^^^^^^^^^^^^^^^

In the basic :ref:`uc-fault1` use case, no automatic actions will be taken by
the VIM, but all recovery actions executed by the VIM and the NFVI will be
instructed and coordinated by the Consumer.

In a more advanced use case, the VIM shall be able to recover the failed virtual
resources according to a pre-defined behavior for that resource. In principle,
this means that the owner of the resource (i.e., its consumer or administrator)
can define which recovery actions shall be taken by the VIM. Examples are a
restart of the VM, migration/evacuation of the VM, or no action.

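A pre-defined recovery behavior per resource could be represented as a simple
policy lookup, as in the sketch below. The ``RecoveryAction`` enum and the
``select_action`` helper are illustrative assumptions. The three actions mirror
the examples given in the text.

.. code-block:: python

   from enum import Enum
   from typing import Dict


   class RecoveryAction(Enum):
       RESTART = "restart"
       EVACUATE = "evacuate"   # migration/evacuation of the VM
       NONE = "none"           # leave recovery to the Consumer


   def select_action(resource: str,
                     policies: Dict[str, RecoveryAction]) -> RecoveryAction:
       """Return the pre-defined action for a resource.

       Resources without a policy get NONE, which matches the basic use case
       where the Consumer instructs and coordinates all recovery actions.
       """
       return policies.get(resource, RecoveryAction.NONE)
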
High level northbound interface specification
---------------------------------------------

Fault management
^^^^^^^^^^^^^^^^

This interface allows the Consumer to subscribe to fault notifications from the
VIM. Using a filter, the Consumer can narrow down which faults it should be
notified about. A fault notification may trigger the Consumer to switch from ACT
to STBY configuration and initiate fault recovery actions. A fault query
request/response message exchange allows the Consumer to find out about active
alarms at the VIM. A filter can be used to narrow down the alarms returned in
the response message.

.. figure:: images/figure4.png
   :name: figure4
   :width: 100%

   High-level message flow for fault management

The high level message flow for the fault management use case is shown in
:numref:`figure4`.
It consists of the following steps:

1. The VIM monitors the physical and virtual resources and the fault management
   workflow is triggered by a monitored fault event.
2. Event correlation, fault detection and aggregation in the VIM. Note: this
   may also happen after Step 3.
3. Database lookup to find the virtual resources affected by the detected fault.
4. Fault notification to the Consumer.
5. The Consumer switches to standby configuration (STBY).
6. Instructions to the VIM requesting certain actions to be performed on the
   affected resources, for example migrate/update/terminate specific
   resource(s). After receiving such instructions, the VIM executes the
   requested action, e.g., it will migrate or terminate a virtual resource.

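The VIM side of the steps above (Steps 1 to 4) can be sketched as a single
handler. The ``INVENTORY`` mapping and the ``notify`` callback are hypothetical
stand-ins for the VIM database and the northbound notification mechanism.

.. code-block:: python

   from typing import Callable, Dict, List, Tuple

   # Hypothetical inventory: host -> (virtual resource, owning Consumer).
   INVENTORY: Dict[str, List[Tuple[str, str]]] = {
       "host-1": [("vm-a", "consumer-1"), ("vm-b", "consumer-2")],
   }


   def handle_fault_event(
           host: str,
           notify: Callable[[str, str], None],
           inventory: Dict[str, List[Tuple[str, str]]] = INVENTORY) -> int:
       """Handle a monitored fault event on a host (Step 1).

       Step 2 (correlation and aggregation) is reduced here to accepting the
       event as a confirmed host fault. Step 3 is the inventory lookup and
       Step 4 the notification of each affected Consumer. Steps 5 and 6
       happen on the Consumer side and are not modelled.
       Returns the number of notifications sent.
       """
       affected = inventory.get(host, [])      # Step 3: database lookup
       for resource, consumer in affected:
           notify(consumer, resource)          # Step 4: fault notification
       return len(affected)
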
NFVI Maintenance
^^^^^^^^^^^^^^^^

The NFVI maintenance interface allows the Administrator to notify the VIM about
a planned maintenance operation on the NFVI. A maintenance operation may for
example be an update of the server firmware or the hypervisor. The
MaintenanceRequest message contains instructions to change the state of the
resource from 'normal' to 'maintenance'. After receiving the MaintenanceRequest,
the VIM will notify the Consumer about the planned maintenance operation,
whereupon the Consumer will switch to standby (STBY) configuration to allow the
maintenance action to be executed. After the request has been executed
successfully (i.e., the physical resources have been emptied) or the operation
has resulted in an error state, the VIM sends a MaintenanceResponse message back
to the Administrator.

.. figure:: images/figure5.png
   :name: figure5
   :width: 100%

   High-level message flow for NFVI maintenance

The high level message flow for the NFVI maintenance use case is shown in
:numref:`figure5`.
It consists of the following steps:

1. Maintenance trigger received from the Administrator.
2. The VIM switches the affected NFVI resources to "maintenance" state, i.e.,
   the NFVI resources are prepared for the maintenance operation. For example,
   the virtual resources should not be used for further allocation/migration
   requests and the VIM will coordinate with the Consumer on how to best empty
   the physical resources.
3. Database lookup to find the virtual resources affected by the maintenance
   operation.
4. StateChange notification to inform the Consumer about the planned
   maintenance operation.
5. The Consumer switches to standby configuration (STBY).
6. Instructions from the Consumer to the VIM requesting certain actions to be
   performed (step 6a). After receiving such instructions, the VIM executes the
   requested action in order to empty the physical resources (step 6b) and
   informs the Consumer about the result of the actions. Note: this step is out
   of scope of Doctor.
7. Maintenance response from the VIM to inform the Administrator that the
   physical machines have been emptied (or the operation resulted in an error
   state).
8. The Administrator coordinates and executes the maintenance operation/work on
   the NFVI. Note: this step is out of scope of Doctor.

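The maintenance flow can be sketched with a toy VIM that tracks the state of
each host. Only the 'normal' and 'maintenance' states and the
MaintenanceResponse message name come from the text. The ``Vim`` class, its
fields, and the two callbacks are assumptions made for illustration.

.. code-block:: python

   from dataclasses import dataclass
   from typing import Callable, Dict, List


   @dataclass
   class MaintenanceResponse:
       host: str
       success: bool   # True once the physical resource has been emptied


   class Vim:
       """Toy VIM that tracks host state and the VMs placed on each host."""

       def __init__(self, placement: Dict[str, List[str]]):
           self.placement = placement
           self.state = {host: "normal" for host in placement}

       def maintenance_request(
               self, host: str,
               notify_consumer: Callable[[str], None],
               empty_host: Callable[[str], bool]) -> MaintenanceResponse:
           self.state[host] = "maintenance"     # Step 2: state change
           for vm in self.placement[host]:      # Step 3: lookup
               notify_consumer(vm)              # Step 4: StateChange notification
           # Steps 5 and 6 (Consumer switch-over and instructions) are stood
           # in for by empty_host, which reports whether the host was emptied.
           emptied = empty_host(host)
           if emptied:
               self.placement[host] = []
           return MaintenanceResponse(host, emptied)   # Step 7: response

The returned ``MaintenanceResponse`` covers both outcomes named in Step 7:
``success`` is ``True`` when the host has been emptied and ``False`` when the
operation ended in an error state.
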
..
 vim: set tabstop=4 expandtab textwidth=80:
