docs/development/overview/functest_scenario/doctor-scenario-in-functest.rst

   1 .. This work is licensed under a Creative Commons Attribution 4.0 International License.
   2 .. http://creativecommons.org/licenses/by/4.0
   3
   4
   5
   6 Platform overview
   7 """""""""""""""""
   8
   9 The Doctor platform has provided the following features since the `Danube Release <https://wiki.opnfv.org/display/SWREL/Danube>`_:
  10
  11 * Immediate Notification
  12 * Consistent resource state awareness for compute host down
  13 * Valid compute host status given to VM owner
  14
  15 These features enable high availability of Network Services on top of
  16 the virtualized infrastructure. Immediate notification allows VNF managers
  17 (VNFM) to process recovery actions promptly once a failure has occurred.
  18 The same framework can also be used to make the VNFM aware of
  19 infrastructure maintenance.
  20
  21 Consistency of resource state is necessary to execute recovery actions
  22 properly in the VIM.
  23
  24 The ability to query host status lets the VM owner get consistent
  25 state information through an API in case of a compute host
  26 fault.
  27
  28 The Doctor platform consists of the following components:
  29
  30 * OpenStack Compute (Nova)
  31 * OpenStack Networking (Neutron)
  32 * OpenStack Telemetry (Ceilometer)
  33 * OpenStack Alarming (AODH)
  34 * Doctor Sample Inspector, OpenStack Congress or OpenStack Vitrage
  35 * Doctor Sample Monitor or any monitor supported by Congress or Vitrage
  36
  37 .. note::
  38     Doctor Sample Monitor is used in Doctor testing. However, real
  39     implementations such as Vitrage support several other monitors.
  40
  41 You can see an overview of the Doctor platform and how components interact in
  42 :numref:`figure-p1`.
  43
  44 .. figure:: ./images/Fault-management-design.png
  45     :name: figure-p1
  46     :width: 100%
  47
  48     Doctor platform and typical sequence
  49
  50 Detailed information on the Doctor architecture can be found in the Doctor
  51 requirements documentation:
  52 http://artifacts.opnfv.org/doctor/docs/requirements/05-implementation.html
  53
  54 Running test cases
  55 """"""""""""""""""
  56
  57 Functest calls "doctor_tests/main.py" in Doctor to run the test job.
  58 Doctor testing can also be triggered by tox on the OPNFV installer jumphost.
  59 Tox is normally used for functional, module and coding style testing in
  60 Python projects.
  61
  62 Currently, the 'Apex', 'Daisy', 'Fuel' and 'local' installers are supported.
  63
  64
  65 Fault management use case
  66 """""""""""""""""""""""""
  67
  68 * A consumer of the NFVI wants to receive immediate notifications about faults
  69   in the NFVI affecting the proper functioning of the virtual resources.
  70   Therefore, such faults have to be detected as quickly as possible, and, when
  71   a critical error is observed, the affected consumer is immediately informed
  72   about the fault and can switch over to the standby (STBY) configuration.
  73
  74 The consumer configures which faults are monitored and at which detection
  75 rate. Once a fault is detected, the Inspector in the Doctor architecture
  76 checks the resource map maintained by the Controller to find out which
  77 virtual resources are affected, and then updates the state of those resources.
  78 The Notifier receives the failure event requests sent from the Controller
  79 and notifies the consumer(s) of the affected resources according to the alarm
  80 configuration.
  81
  82 Detailed workflow information is as follows:
  83
  84 * Consumer (VNFM): (step 0) creates resources (network, server/instance) and an
  85   event alarm on the state down notification of that server/instance or Neutron
  86   port.
  87
  88 * Monitor: (step 1) periodically checks nodes, for example by pinging from/to
  89   each dplane NIC to/from the gateway of the node, and (step 2) once a check
  90   fails, sends an event with "raw" fault event information to the Inspector.
  91
  92 * Inspector: when it receives an event, it will (step 3) mark the host down
  93   ("mark-host-down"), (step 4) map the PM to its VMs and change the VM status
  94   to down. In the network failure case, the Neutron port is also changed to down.
  95
  96 * Controller: (step 5) sends out an instance update event to Ceilometer. In the
  97   network failure case, the Neutron port is also changed to down and the
  98   corresponding event is sent to Ceilometer.
  99
 100 * Notifier: (step 6) Ceilometer transforms and passes the events to AODH,
 101   (step 7) AODH evaluates the events against the registered alarm definitions,
 102   then (step 8) it fires the alarm to the "consumer" who owns the
 103   instance.
 104
 105 * Consumer (VNFM): (step 9) receives the event and (step 10) creates a new
 106   instance.
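
The Inspector part of the workflow above (steps 3 and 4) can be sketched in a
few lines of Python. This is only an illustration, not the Doctor Sample
Inspector code: the ``ResourceMap`` class, the ``handle_raw_fault`` function
and the host and VM names are all hypothetical, and the resource map is a plain
in-memory structure instead of data fetched from the Controller.

```python
from dataclasses import dataclass, field


@dataclass
class ResourceMap:
    """Toy stand-in for the resource map kept by the Controller (hypothetical)."""
    host_state: dict = field(default_factory=dict)  # host name -> "up" / "down"
    vm_host: dict = field(default_factory=dict)     # VM name -> host name
    vm_state: dict = field(default_factory=dict)    # VM name -> "active" / "down"


def handle_raw_fault(resources, failed_host):
    """Steps 3 and 4: mark the host down, then mark every VM on it down."""
    resources.host_state[failed_host] = "down"
    affected = sorted(
        vm for vm, host in resources.vm_host.items() if host == failed_host
    )
    for vm in affected:
        resources.vm_state[vm] = "down"
    return affected


# Hypothetical deployment: two compute hosts and three VMs.
resources = ResourceMap(
    host_state={"compute-0": "up", "compute-1": "up"},
    vm_host={"vm-a": "compute-0", "vm-b": "compute-1", "vm-c": "compute-0"},
    vm_state={"vm-a": "active", "vm-b": "active", "vm-c": "active"},
)
affected_vms = handle_raw_fault(resources, "compute-0")
print(affected_vms)
```

In the real platform the state changes go through the Controller, which then
emits the update events of step 5. The sketch only shows the mapping from a
failed host to the affected virtual resources.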
 107
 108 Fault management test case
 109 """"""""""""""""""""""""""
 110
 111 Functest will call the 'doctor-test' command in Doctor to run the test job.
 112
 113 The following steps are executed:
 114
 115 First, get the installer IP according to the installer type. Then ssh to
 116 the installer node to get the private key for accessing the cloud. For the
 117 'fuel' installer, ssh to the controller node to modify the nova and ceilometer
 118 configurations.
 119
 120 Second, prepare an image for booting the VM, then create a test project and test
 121 user (both default to doctor) for the Doctor tests.
 122
 123 Third, boot a VM under the doctor project and check the VM status to verify
 124 that the VM has launched completely. Then get the info of the compute host where
 125 the VM is launched and verify connectivity to that compute host. Get the consumer
 126 IP according to the route to the compute IP and create an alarm event in Ceilometer
 127 using the consumer IP.
 128
 129 Fourth, the Doctor components are started and, based on the above preparation,
 130 a failure is injected into the system, i.e. the network of the compute host is
 131 disabled for 3 minutes. To ensure the host is down, the status of the host
 132 is checked.
 133
 134 Finally, the notification time, i.e. the time between the execution of step 2
 135 (Monitor detects failure) and step 9 (Consumer receives failure notification)
 136 is calculated.
 137
 138 According to the Doctor requirements, the Doctor test is successful if the
 139 notification time is below 1 second.
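
The pass criterion above is a simple difference of two timestamps. A minimal
sketch, assuming both timestamps are available as seconds in floating point
(the function names and the sample values are hypothetical and not taken from
the Doctor test code):

```python
NOTIFICATION_LIMIT_SECONDS = 1.0  # threshold stated in the Doctor requirements


def notification_time(detected_at, notified_at):
    """Seconds between step 2 (detection) and step 9 (consumer notified)."""
    if notified_at < detected_at:
        raise ValueError("notification cannot precede detection")
    return notified_at - detected_at


def is_successful(detected_at, notified_at):
    """True when the notification time is below the 1 second limit."""
    return notification_time(detected_at, notified_at) < NOTIFICATION_LIMIT_SECONDS


# Hypothetical timestamps in seconds, for example taken from time.time().
fast = is_successful(100.00, 100.25)
slow = is_successful(100.00, 101.50)
print(fast, slow)
```

Both timestamps need to come from clocks that are comparable, otherwise the
difference is not meaningful.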
 140
 141 Maintenance use case
 142 """"""""""""""""""""
 143
 144 * A consumer of the NFVI wants to interact with NFVI maintenance, upgrade and
 145   scaling, and to have graceful retirement. By receiving notifications about these
 146   NFVI events and responding to them within a given time window, the consumer can
 147   guarantee zero downtime for its service.
 148
 149 The maintenance use case adds an `admin tool` and an `app manager` component to
 150 the Doctor platform. An overview of the maintenance components can be seen in
 151 :numref:`figure-p2`.
 152
 153 .. figure:: ./images/Maintenance-design.png
 154     :name: figure-p2
 155     :width: 100%
 156
 157     Doctor platform components in maintenance use case
 158
 159 In the maintenance use case, the `app manager` (VNFM) subscribes to maintenance
 160 notifications triggered by project specific alarms through AODH. This is how
 161 it learns about the NFVI maintenance, upgrade and scaling operations that
 162 affect its instances. The `app manager` can do the actions depicted in `green
 163 color` or tell the `admin tool` to do the admin actions depicted in `orange color`.
 164
 165 Any infrastructure component, such as the `Inspector`, can subscribe to maintenance
 166 notifications triggered by host specific alarms through AODH. Subscribing to these
 167 notifications needs admin privileges. The notifications tell when a host is out of
 168 use for maintenance and when it is taken back into production.
 169
 170 Maintenance test case
 171 """""""""""""""""""""
 172
 173 The maintenance test case currently runs in our Apex CI and is executed by tox.
 174 This is because of the special limitation mentioned below and the fact that we
 175 currently have only a sample implementation as a proof of concept. The environment
 176 variable TEST_CASE='maintenance' needs to be set when executing
 177 "doctor_tests/main.py". The test case workflow can be seen in :numref:`figure-p3`.
 178
 179 .. figure:: ./images/Maintenance-workflow.png
 180     :name: figure-p3
 181     :width: 100%
 182
 183     Maintenance test case workflow
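
The ``TEST_CASE`` setting mentioned above could be prepared from Python as
sketched here. The sketch only builds the command and the environment and does
not start anything. Calling "doctor_tests/main.py" directly with the current
interpreter is an assumption made for illustration (in CI the test case is
executed by tox), and ``build_maintenance_run`` is a hypothetical helper name.

```python
import os
import sys


def build_maintenance_run(base_env=None):
    """Return (command, environment) for a maintenance run without starting it."""
    # Copy the environment so the caller's mapping is left untouched.
    env = dict(os.environ if base_env is None else base_env)
    env["TEST_CASE"] = "maintenance"
    # Assumption: the script is run from the root of the Doctor repository.
    command = [sys.executable, "doctor_tests/main.py"]
    return command, env


command, env = build_maintenance_run({"PATH": "/usr/bin"})
print(command[1], env["TEST_CASE"])
```

The returned pair can be handed to ``subprocess.run(command, env=env)`` on a
host where the Doctor test prerequisites are in place.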
 184
 185 In the test case, all compute capacity is consumed with project (VNF) instances.
 186 To have redundant services on instances and an empty compute for maintenance,
 187 the test case needs at least 3 compute nodes in the system. There will be 2
 188 instances on each compute, so the minimum number of VCPUs is also 2. Regardless of
 189 how many compute nodes there are, the application will always have 2 redundant
 190 instances (ACT-STDBY) on different compute nodes, and the rest of the compute
 191 capacity will be filled with non-redundant instances.
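
The initial instance layout described above can be written down as a small
planning function. This is a sketch under stated assumptions: the function and
role names are hypothetical, and placing the redundant pair on the first two
compute nodes is an arbitrary choice, since the text only requires that the two
instances are on different nodes.

```python
INSTANCES_PER_COMPUTE = 2  # from the text: two instances on each compute
MIN_COMPUTE_NODES = 3      # from the text: at least three compute nodes


def plan_instances(compute_nodes):
    """Return a mapping of compute name to the list of instance roles on it."""
    if len(compute_nodes) < MIN_COMPUTE_NODES:
        raise ValueError("the maintenance test case needs at least 3 compute nodes")
    layout = {
        node: ["non-redundant"] * INSTANCES_PER_COMPUTE for node in compute_nodes
    }
    # Place the redundant pair (ACT-STDBY) on two different compute nodes.
    layout[compute_nodes[0]][0] = "redundant-active"
    layout[compute_nodes[1]][0] = "redundant-standby"
    return layout


layout = plan_instances(["compute-0", "compute-1", "compute-2"])
total = sum(len(roles) for roles in layout.values())
non_redundant = sum(roles.count("non-redundant") for roles in layout.values())
print(total, non_redundant)
```

With three compute nodes this gives six instances in total, two of them
redundant and four non-redundant.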
 192
 193 For each project specific maintenance message there is a time window in which the
 194 `app manager` can take any needed action. This guarantees zero
 195 downtime for its service. All replies are sent by calling the `admin tool` API
 196 given in the message.
 197
 198 The following steps are executed:
 199
 200 The infrastructure admin calls the `admin tool` API to trigger maintenance for
 201 compute hosts that have instances belonging to a VNF.
 202
 203 A project specific `MAINTENANCE` notification is triggered to tell the `app manager`
 204 that its instances are going to be hit by infrastructure maintenance at a specific
 205 point in time. The `app manager` calls the `admin tool` API to answer back
 206 `ACK_MAINTENANCE`.
 207
 208 When the time comes to start the actual maintenance workflow in the `admin tool`,
 209 a `DOWN_SCALE` notification is triggered, as there is no empty compute node for
 210 maintenance (or compute upgrade). The project receives the corresponding alarm, scales
 211 down instances and calls the `admin tool` API to answer back `ACK_DOWN_SCALE`.
 212
 213 As the instances scaled down (removed) might not all come from a single
 214 compute node, the `admin tool` might need to figure out which compute node should be
 215 emptied first and send `PREPARE_MAINTENANCE` to the project, telling which instance
 216 needs to be migrated to get the needed empty compute. The `app manager` makes sure
 217 it is ready to migrate the instance and calls the `admin tool` API to answer back
 218 `ACK_PREPARE_MAINTENANCE`. The `admin tool` performs the migration and answers
 219 `ADMIN_ACTION_DONE`, so the `app manager` knows the instance can be used again.
 220
 221 :numref:`figure-p3` next has a light blue section of actions to be done for each
 222 compute. However, as we now have one empty compute, we maintain/upgrade that
 223 first. So on the first round, we can put the compute straight into maintenance and send
 224 the admin level host specific `IN_MAINTENANCE` message. This is caught by the `Inspector`,
 225 which then knows the host is down for maintenance. The `Inspector` can now disable any automatic
 226 fault management actions for the host, as it can be down on purpose. After the
 227 `admin tool` has completed the maintenance/upgrade, a `MAINTENANCE_COMPLETE` message
 228 is sent to tell that the host is back in production.
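
How an inspector might use the two host specific messages can be sketched as
follows. The class and method names are hypothetical and this is not the Doctor
Sample Inspector code. Only the message names ``IN_MAINTENANCE`` and
``MAINTENANCE_COMPLETE`` come from the text.

```python
class MaintenanceAwareInspector:
    """Toy inspector that skips fault handling for hosts in maintenance."""

    def __init__(self):
        self.hosts_in_maintenance = set()

    def on_host_message(self, host, state):
        """Track the maintenance state announced for a host."""
        if state == "IN_MAINTENANCE":
            self.hosts_in_maintenance.add(host)
        elif state == "MAINTENANCE_COMPLETE":
            self.hosts_in_maintenance.discard(host)

    def should_handle_fault(self, host):
        """Automatic fault management applies only to hosts in production."""
        return host not in self.hosts_in_maintenance


inspector = MaintenanceAwareInspector()
inspector.on_host_message("compute-2", "IN_MAINTENANCE")
during = inspector.should_handle_fault("compute-2")
inspector.on_host_message("compute-2", "MAINTENANCE_COMPLETE")
after = inspector.should_handle_fault("compute-2")
print(during, after)
```

While the host is in maintenance the check returns ``False``, so a host that
is down on purpose does not trigger recovery actions.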
 229
 230 On the next rounds there are always instances on the compute, so a
 231 `PLANNED_MAINTENANCE` message is needed to tell that those instances are now going to be hit
 232 by maintenance. When the `app manager` receives this message, it knows that the instances
 233 to be moved away from the compute will now move to an already maintained/upgraded host.
 234 In the test case, no upgrade is done on the application side to upgrade instances
 235 according to new infrastructure capabilities, but this could be done here, as
 236 this information is also passed in the message. This might mean just upgrading
 237 some RPMs, or totally re-instantiating the instance with a new flavor. If the
 238 application runs the active side of a redundant instance on this compute,
 239 a switch over will be done. After the `app manager` is ready, it calls the
 240 `admin tool` API to answer back `ACK_PLANNED_MAINTENANCE`. In the test case the
 241 answer is `migrate`, so the `admin tool` migrates the instances and replies
 242 `ADMIN_ACTION_DONE`, and then the `app manager` knows the instances can be used again.
 243 Then we are ready to do the actual maintenance as previously, through the
 244 `IN_MAINTENANCE` and `MAINTENANCE_COMPLETE` steps.
 245
 246 After all computes are maintained, the `admin tool` can send `MAINTENANCE_COMPLETE`
 247 to tell that the maintenance/upgrade is now complete. For the `app manager` this means it
 248 can scale back to full capacity.
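
The project specific messages and the acknowledgements described in the steps
above can be summarized as a lookup table. This is a sketch for orientation
only: the function name is hypothetical, the planned maintenance message is
spelled ``PLANNED_MAINTENANCE`` here to match its ``ACK_PLANNED_MAINTENANCE``
reply, and the actual message payloads and the `admin tool` API calls are not
modeled.

```python
# Project specific messages and the reply the app manager sends back,
# using the message names that appear in the steps above.
EXPECTED_REPLY = {
    "MAINTENANCE": "ACK_MAINTENANCE",
    "DOWN_SCALE": "ACK_DOWN_SCALE",
    "PREPARE_MAINTENANCE": "ACK_PREPARE_MAINTENANCE",
    "PLANNED_MAINTENANCE": "ACK_PLANNED_MAINTENANCE",
}


def app_manager_reply(message):
    """Return the reply for a project specific message, or None if no reply is described."""
    return EXPECTED_REPLY.get(message)


replies = [
    app_manager_reply(m)
    for m in ("MAINTENANCE", "DOWN_SCALE", "ADMIN_ACTION_DONE")
]
print(replies)
```

``ADMIN_ACTION_DONE`` and ``MAINTENANCE_COMPLETE`` are informational in the
described flow, so the sketch returns ``None`` for them.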
 249
 250 This is the current sample implementation and test case. A real-life
 251 implementation has been started in the OpenStack Fenix project, where we should
 252 eventually address the requirements more deeply and update the test case with the Fenix
 253 implementation.