.. This work is licensed under a Creative Commons Attribution 4.0 International License.
.. http://creativecommons.org/licenses/by/4.0


Maintenance use case
""""""""""""""""""""

* A consumer of the NFVI wants to interact with NFVI maintenance, upgrade and
  scaling, and to have graceful retirement. By receiving notifications about
  these NFVI events and responding to them within a given time window, the
  consumer can guarantee zero downtime for its service.

The maintenance use case adds an `admin tool` and an `app manager` component
to the Doctor platform. An overview of the maintenance components is shown in
:numref:`figure-p2`.

.. figure:: ./images/Maintenance-design.png
    :name: figure-p2
    :width: 100%

    Doctor platform components in maintenance use case

In the maintenance use case, the `app manager` (VNFM) subscribes to maintenance
notifications triggered by project-specific alarms through AODH. This is how it
learns about the NFVI maintenance, upgrade and scaling operations that affect
its instances. The `app manager` can perform the actions depicted in `green
color` or tell the `admin tool` to perform the admin actions depicted in
`orange color`.

Any infrastructure component, such as the `Inspector`, can subscribe to
maintenance notifications triggered by host-specific alarms through AODH.
Subscribing to these notifications requires admin privileges. The notifications
tell when a host is taken out of use for maintenance and when it is returned to
production.

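The two subscriptions described above could be expressed as AODH event alarm
bodies along the following lines. This is only a sketch: the event type names
(``maintenance.scheduled``, ``maintenance.host``), the alarm names and the
callback URLs are assumptions for illustration and are not taken from this
document.

```python
def maintenance_alarm(name, event_type, callback_url):
    """Build a request body for an event alarm.

    The event_type values and callback URLs passed in below are
    assumptions made for this sketch, not values confirmed by the text.
    """
    return {
        "name": name,
        "type": "event",
        "enabled": True,
        "repeat_actions": True,
        "event_rule": {"event_type": event_type},
        "alarm_actions": [callback_url],
    }


# Project-specific alarm, received by the app manager (VNFM).
project_alarm = maintenance_alarm(
    "app_manager_maintenance",
    "maintenance.scheduled",  # assumed event type
    "http://app-manager.example/maintenance",  # hypothetical endpoint
)

# Host-specific alarm, received by the Inspector (needs admin privileges).
host_alarm = maintenance_alarm(
    "inspector_maintenance",
    "maintenance.host",  # assumed event type
    "http://inspector.example/maintenance",  # hypothetical endpoint
)
```

The resulting dictionaries would be sent to the alarming service by whatever
client the deployment uses; that part is left out here on purpose.
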
Maintenance test case
"""""""""""""""""""""

The maintenance test case currently runs in our Apex CI and is executed by tox.
This is because of the special limitation mentioned below, and also because we
currently have only a sample implementation as a proof of concept and we also
support the unofficial OpenStack project Fenix. The environment variable
TEST_CASE='maintenance' must be set when executing "doctor_tests/main.py",
together with ADMIN_TOOL_TYPE='fenix' to test with Fenix instead of the sample
implementation. The test case workflow is shown in :numref:`figure-p3`.

.. figure:: ./images/Maintenance-workflow.png
    :name: figure-p3
    :width: 100%

    Maintenance test case workflow

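As a minimal sketch of the environment setup described before the figure, the
helpers below prepare the variables and the command for a manual run. The
helper names are hypothetical, and running from the repository root with the
current Python interpreter is an assumption. In CI the test is started through
tox instead.

```python
import os
import sys


def maintenance_env(use_fenix=False, base_env=None):
    """Return a copy of the environment with the test case variables set."""
    env = dict(os.environ if base_env is None else base_env)
    env["TEST_CASE"] = "maintenance"
    if use_fenix:
        # Test with Fenix instead of the sample implementation.
        env["ADMIN_TOOL_TYPE"] = "fenix"
    return env


def maintenance_command(python=sys.executable):
    """Return the command line that starts the test case."""
    return [python, "doctor_tests/main.py"]
```

The two values can then be handed to
``subprocess.run(maintenance_command(), env=maintenance_env(use_fenix=True))``.
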
In the test case, all compute capacity is consumed by project (VNF) instances.
To provide redundant services on instances and an empty compute node for
maintenance, the test case needs at least 3 compute nodes in the system. There
will be 2 instances on each compute node, so the minimum number of VCPUs per
compute node is also 2. Regardless of how many compute nodes there are, the
application always has 2 redundant instances (ACT-STDBY) on different compute
nodes, and the rest of the compute capacity is filled with non-redundant
instances.

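The capacity rules above can be written down as a small calculation. The
function name is hypothetical, and the sketch assumes one VCPU per instance,
which is how the "2 instances, 2 VCPUs" relation is read here.

```python
def plan_instances(compute_nodes, instances_per_node=2):
    """Return the instance layout that fills all compute capacity."""
    if compute_nodes < 3:
        raise ValueError("the test case needs at least 3 compute nodes")
    total = compute_nodes * instances_per_node
    # One ACT-STDBY pair, placed on two different compute nodes.
    redundant = 2
    return {
        "total": total,
        "redundant": redundant,
        "non_redundant": total - redundant,
        # Assumption: each instance uses one VCPU.
        "min_vcpus_per_node": instances_per_node,
    }
```

With the minimum of 3 compute nodes this gives 6 instances in total: 2
redundant and 4 non-redundant.
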
For each project-specific maintenance message there is a time window in which
the `app manager` can take any needed action. This guarantees zero downtime for
its service. All replies are sent by calling the `admin tool` API given in the
message.

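A reply that respects the time window might be prepared as in the sketch below.
The message field names ``reply_url`` and ``reply_at``, the ``state`` key in
the reply body and the ISO 8601 timestamp format with a UTC offset are
assumptions for illustration. The text only states that the message carries
the API to call and that there is a time window.

```python
from datetime import datetime, timezone


def build_reply(message, ack_state, now=None):
    """Return (url, body) for a reply, or raise if the window has closed."""
    now = now or datetime.now(timezone.utc)
    # Assumed field: latest point in time at which a reply is accepted.
    deadline = datetime.fromisoformat(message["reply_at"])
    if now > deadline:
        raise TimeoutError("reply window closed at %s" % message["reply_at"])
    # Assumed field: the admin tool API endpoint given in the message.
    return message["reply_url"], {"state": ack_state}
```

The caller would then send the returned body to the returned URL with the HTTP
client of its choice.
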
The following steps are executed:

The infrastructure admin calls the `admin tool` API to trigger maintenance for
compute hosts that have instances belonging to a VNF.

A project-specific `MAINTENANCE` notification is triggered to tell the
`app manager` that its instances are going to be hit by infrastructure
maintenance at a specific point in time. The `app manager` calls the
`admin tool` API to answer back `ACK_MAINTENANCE`.

When the time comes to start the actual maintenance workflow in the
`admin tool`, a `DOWN_SCALE` notification is triggered, as there is no empty
compute node for maintenance (or compute upgrade). The project receives the
corresponding alarm, scales down its instances and calls the `admin tool` API
to answer back `ACK_DOWN_SCALE`.

As the instances that were scaled down (removed) might not all come from a
single compute node, the `admin tool` might need to figure out which compute
node should be emptied first and send `PREPARE_MAINTENANCE` to the project,
telling which instance needs to be migrated to obtain the needed empty compute
node. The `app manager` makes sure it is ready to migrate the instance and
calls the `admin tool` API to answer back `ACK_PREPARE_MAINTENANCE`. The
`admin tool` performs the migration and answers `ADMIN_ACTION_DONE`, so the
`app manager` knows the instance can be used again.

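The notification and reply pairs from the steps so far can be collected in a
lookup table. This is a sketch of how an `app manager` might pick its answer.
The function and table names are hypothetical, and the work done before each
reply (scaling down, preparing for migration) is left out.

```python
# Reply the app manager sends for each project-specific notification.
REPLY_FOR = {
    "MAINTENANCE": "ACK_MAINTENANCE",
    "DOWN_SCALE": "ACK_DOWN_SCALE",
    "PREPARE_MAINTENANCE": "ACK_PREPARE_MAINTENANCE",
}

# Sent by the admin tool after it has done its action; the steps above
# name no reply for it.
NO_REPLY = {"ADMIN_ACTION_DONE"}


def reply_state(notification):
    """Return the reply state for a notification, or None if none is due."""
    if notification in REPLY_FOR:
        return REPLY_FOR[notification]
    if notification in NO_REPLY:
        return None
    raise ValueError("unknown notification: %s" % notification)
```
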
:numref:`figure-p3` next shows a light blue section of actions to be done for
each compute node. However, as we now have one empty compute node, we
maintain/upgrade that one first. So in the first round we can put the compute
node straight into maintenance and send the admin-level, host-specific
`IN_MAINTENANCE` message. This is caught by the `Inspector`, which then knows
the host is down for maintenance. The `Inspector` can now disable any automatic
fault management actions for the host, as it can be down on purpose. After the
`admin tool` has completed the maintenance/upgrade, a `MAINTENANCE_COMPLETE`
message is sent to tell that the host is back in production.

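The `Inspector` behaviour described above amounts to tracking which hosts are
in maintenance and skipping fault handling for them. The class below is a
hypothetical sketch of that bookkeeping, not the Doctor `Inspector`
implementation.

```python
class MaintenanceAwareInspector:
    """Tracks hosts that are intentionally down for maintenance."""

    def __init__(self):
        self.hosts_in_maintenance = set()

    def on_host_notification(self, host, state):
        """Handle an admin-level, host-specific maintenance message."""
        if state == "IN_MAINTENANCE":
            self.hosts_in_maintenance.add(host)
        elif state == "MAINTENANCE_COMPLETE":
            self.hosts_in_maintenance.discard(host)

    def should_handle_fault(self, host):
        """Return False while the host is down on purpose."""
        return host not in self.hosts_in_maintenance
```
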
In the following rounds there are always instances on the compute node, so a
`PLANNED_MAINTENANCE` message is needed to tell that those instances are now
going to be hit by maintenance. When the `app manager` receives this message,
it knows that the instances to be moved away from the compute node will now
move to an already maintained/upgraded host. In the test case no upgrade is
done on the application side to adapt the instances to new infrastructure
capabilities, but this could be done here, as this information is also passed
in the message. This might mean just upgrading some RPMs, but it could also
mean totally re-instantiating an instance with a new flavor. If the application
runs the active side of a redundant instance on this compute node, a switchover
is done. When the `app manager` is ready, it calls the `admin tool` API to
answer back `ACK_PLANNED_MAINTENANCE`. In the test case the answer is
`migrate`, so the `admin tool` migrates the instances and replies
`ADMIN_ACTION_DONE`, after which the `app manager` knows the instances can be
used again. Then we are ready to do the actual maintenance as before, through
the `IN_MAINTENANCE` and `MAINTENANCE_COMPLETE` steps.

After all compute nodes are maintained, the `admin tool` can send
`MAINTENANCE_COMPLETE` to tell that the maintenance/upgrade is now complete.
For the `app manager` this means it can scale back to full capacity.

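The per-compute rounds can be summarized as a message sequence. The generator
below is a sketch under the assumption that the first host in the list is the
one that was emptied beforehand. The function name and the use of ``None`` for
the final, session-wide message are illustrative choices.

```python
def maintenance_rounds(hosts):
    """Yield (host, message) pairs in the order the rounds are described."""
    for index, host in enumerate(hosts):
        if index > 0:
            # Later rounds: the host still has instances to move away.
            yield host, "PLANNED_MAINTENANCE"
            yield host, "ACK_PLANNED_MAINTENANCE"
            yield host, "ADMIN_ACTION_DONE"
        # The host is now empty and can be maintained/upgraded.
        yield host, "IN_MAINTENANCE"
        yield host, "MAINTENANCE_COMPLETE"
    # Project-level message: the app manager can scale back up.
    yield None, "MAINTENANCE_COMPLETE"
```

For two hosts, the first one produces only the `IN_MAINTENANCE` and
`MAINTENANCE_COMPLETE` pair, while the second one is preceded by the
`PLANNED_MAINTENANCE` exchange.
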
There is currently a sample implementation of the VNFM and of the test case. On
the infrastructure side there is a sample implementation of 'admin_tool', and
there is also support for OpenStack Fenix, which extends the use case to
support 'ETSI FEAT03' for VNFM interaction and to optimize the whole
infrastructure maintenance and upgrade.