.. This work is licensed under a Creative Commons Attribution 4.0 International License.
.. http://creativecommons.org/licenses/by/4.0

====================================
Planned Maintenance Design Guideline
====================================

.. NOTE::
   This is a draft specification of the design guideline for planned
   maintenance. JIRA ticket to track the update and collect comments:
   `DOCTOR-52`_.

This document describes how to implement planned maintenance by using the
`OPNFV Doctor project`_ framework while meeting the set requirements.

Problem Description
===================

A Telco application needs to know when planned maintenance is going to happen
in order to guarantee zero downtime in its operation. The application must be
able to take its own actions to keep running on unaffected resources, or to
give guidance for admin actions such as migration. More details are defined in
the requirement documentation: `use cases`_, `architecture`_ and
`implementation`_. See also the OPNFV summit discussion about the
`planned maintenance session`_.

Guidelines
==========

The cloud admin needs to send a notification about planned maintenance that
includes all the details the application needs in order to make decisions
about its affected service. The application can consume this notification
payload by subscribing to the corresponding event alarm through an alarming
service such as OpenStack AODH.
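
As a rough illustration of such a subscription, the sketch below builds the
request body for an AODH event alarm that an application manager could create
for its project. The event type name ``maintenance.planned``, the
``traits.project_id`` trait, the callback URL and the helper name
``build_maintenance_alarm`` are assumptions made for this example and are not
defined by this guideline. The ``type``, ``event_rule``, ``alarm_actions`` and
``repeat_actions`` fields are those of the AODH v2 alarm API.

.. code-block:: python

   def build_maintenance_alarm(project_id, callback_url,
                               event_type="maintenance.planned"):
       """Return the body of an AODH event alarm for planned maintenance.

       The event type and the ``traits.project_id`` trait are hypothetical
       names chosen for this sketch.
       """
       return {
           "name": "planned-maintenance-%s" % project_id,
           "type": "event",
           "event_rule": {
               "event_type": event_type,
               # Only match events that concern this project.
               "query": [{
                   "field": "traits.project_id",
                   "op": "eq",
                   "type": "string",
                   "value": project_id,
               }],
           },
           # URL that AODH calls when a matching event arrives.
           "alarm_actions": [callback_url],
           # Assumed useful here: several state updates are sent during
           # one maintenance session, and each should reach the project.
           "repeat_actions": True,
       }

The resulting dictionary would be sent as JSON to the AODH alarm creation
endpoint (``POST /v2/alarms``) using the credentials of the project.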

Before maintenance starts, the application needs to be able to switch over its
affected ACT-STBY service, move the service to an unaffected part of the
infrastructure, or give a hint for an admin operation such as migration, which
the admin tool can issue automatically according to the agreed policy.

Flow diagram::

  admin alarming project  controller  inspector
    |   service  app manager   |           |
    |  1.   |         |        |           |
    +------------------------->+           |
    +<-------------------------+           |
    |  2.   |         |        |           |
    +------>+    3.   |        |           |
    |       +-------->+   4.   |           |
    |       |         +------->+           |
    |       |    5.   +<-------+           |
    +<----------------+        |           |
    |                 |   6.   |           |
    +------------------------->+           |
    +<-------------------------+     7.    |
    +------------------------------------->+
    |   8.  |         |        |           |
    +------>+    9.   |        |           |
    |       +-------->+        |           |
    +--------------------------------------+
    |                10.                   |
    +--------------------------------------+
    |  11.  |         |        |           |
    +------------------------->+           |
    +<-------------------------+           |
    |  12.  |         |        |           |
    +------>+-------->+        |    13.    |
    +------------------------------------->+
    +-------+---------+--------+-----------+

Concepts used below:

- `full maintenance`: Maintenance will take a longer time and the resource
  should be emptied, meaning containers or VMs need to be moved or deleted.
  The admin might need to test that the resource works after maintenance.

- `reboot`: Only a reboot is needed and the admin does not need separate
  testing after it. Containers or VMs can be left in place if so desired.

- `notification`: Notification to RabbitMQ.

The admin creates a planned maintenance session and sets a
`maintenance_session_id`, a unique ID covering all the hardware resources that
will be under maintenance at the same time. Maintenance should mostly be done
node by node, meaning that a single compute node at a time is in a single
planned maintenance session with its own unique `maintenance_session_id`. This
ID is carried through the whole session in all places and can be used to query
the maintenance in the admin tool API. A project running a Telco application
should set a specific role so that the admin tool knows it cannot do planned
maintenance unless the project has agreed on the actions to be taken for its
VMs or containers. This means the project has configured itself to get alarms
upon planned maintenance and is capable of agreeing on the needed actions. The
admin is expected to use an admin tool to automate the maintenance process
partially or entirely.
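
A minimal sketch of the session ID and of the role-based gate described above
is given here. It is illustrative only: the guideline does not prescribe how
the ID is generated or how the role is represented, so the class
``MaintenanceSession``, the function ``may_start``, the boolean flags and the
choice of a random UUID are all assumptions.

.. code-block:: python

   import uuid
   from dataclasses import dataclass, field
   from typing import List


   @dataclass
   class MaintenanceSession:
       """One planned maintenance session (hypothetical structure)."""

       # Hardware resources maintained at the same time, usually one node.
       hosts: List[str]
       # A random UUID is an assumption; any unique value would do.
       maintenance_session_id: str = field(
           default_factory=lambda: str(uuid.uuid4()))


   def may_start(requires_agreement: bool, has_agreed: bool) -> bool:
       """Tell whether maintenance may start as far as one project goes.

       A project holding the specific role blocks maintenance until it has
       agreed on the actions; a project without the role never blocks it.
       """
       return has_agreed or not requires_agreement

For example, ``MaintenanceSession(hosts=["compute-1"])`` would represent the
common node-by-node case, and ``may_start(True, False)`` evaluates to
``False`` because a project with the role has not yet agreed.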

The flow of a successful planned maintenance session, using OpenStack as the
example case:

1.  Admin disables nova-compute in order to do planned maintenance on a compute
    host and gets an ACK from the API call. This action is needed to ensure
    that no user can place anything on this compute host. The action is always
    done, regardless of whether the whole compute host will be affected or not.
2.  Admin sends a project-specific maintenance notification with state
    `planned maintenance`. This includes detailed information about the
    maintenance, such as when it is going to start and whether it is `reboot`
    or `full maintenance`, as well as information about the project containers
    or VMs running on the host or on the part of it that will need
    maintenance. The notification also mentions the default action, such as
    migration, that the admin will issue before maintenance starts if the
    project sets no other action. In case the project has a specific role set,
    planned maintenance cannot start unless the project has agreed to the
    admin action. Available admin actions are also listed in the notification.
3.  Application manager of the project receives an AODH alarm about the same.
4.  Application manager can switch over to its ACT-STBY service, or delete and
    re-instantiate its service on an unaffected resource if so desired.
5.  Application manager may call the admin tool API to give preferred
    instructions for leaving VMs and containers in place or for the admin
    action to migrate them. In case the admin does not receive this instruction
    before maintenance is to start, it will do the pre-configured default
    action, such as migration, for projects that do not have the specific role
    stating that the project needs to agree to the action. VMs or containers
    can be left on the host if the type of maintenance is just `reboot`.
6.  Admin performs the possible actions on VMs and containers and receives an
    ACK.
7.  In case everything went OK, Admin sends an admin-type maintenance
    notification with state `in maintenance`. This notification can be consumed
    by the Inspector and other cloud services to know that there is ongoing
    maintenance, which means that things like automatic fault management
    actions for the hardware resources should be disabled.
8.  If the maintenance type is `reboot` and the project still has containers or
    VMs running on the affected hardware resource, Admin sends a
    project-specific maintenance notification with state updated to
    `in maintenance`. If the project does not have anything left running on the
    affected hardware resource, the state will be `maintenance over` instead.
    If maintenance cannot be performed for some reason, the state should be
    `maintenance cancelled`. In this case the last remaining operations for the
    admin are to re-enable the nova-compute service and ensure everything is
    running; the admin does not proceed to any further steps.
9.  Application manager of the project receives an AODH alarm about the same.
10. Admin does the maintenance. This is out of Doctor scope.
11. Admin enables the nova-compute service when maintenance is over and the
    host can be put back into production. An ACK is received from the API call.
12. In case the project had left containers or VMs on the hardware resource
    over the maintenance, Admin sends a project-specific maintenance
    notification with state updated to `maintenance over`.
13. Admin sends an admin-type maintenance notification with state updated to
    `maintenance over`. The Inspector and other cloud services can consume this
    to know the hardware resource is back in use.
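
The decisions taken in steps 5 and 8 of the flow can be sketched as follows.
The state strings are the ones used in this guideline, while the names
``State``, ``resolve_action`` and ``project_state_at_start``, the action
strings and the way the project role is passed in as a boolean are
hypothetical and only serve the example.

.. code-block:: python

   from enum import Enum
   from typing import Optional


   class State(Enum):
       """Notification states named in the flow."""

       PLANNED = "planned maintenance"
       IN_MAINTENANCE = "in maintenance"
       OVER = "maintenance over"
       CANCELLED = "maintenance cancelled"


   def resolve_action(requested: Optional[str], default_action: str,
                      requires_agreement: bool) -> Optional[str]:
       """Pick the admin action for one project before maintenance (step 5).

       Returns None when the project holds the specific role but has not
       answered yet, which means maintenance cannot start.
       """
       if requested is not None:
           return requested
       if requires_agreement:
           return None
       return default_action


   def project_state_at_start(can_proceed: bool, instances_left: int) -> State:
       """Choose the state of the project-specific notification in step 8."""
       if not can_proceed:
           return State.CANCELLED
       if instances_left > 0:
           # Only expected when the maintenance type is `reboot`.
           return State.IN_MAINTENANCE
       return State.OVER

With these helpers, ``resolve_action(None, "migrate", False)`` falls back to
the default action, whereas the same call with ``requires_agreement=True``
yields ``None`` and the session has to wait for the project.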

POC
---

There was a `Maintenance POC`_ for planned maintenance at the OPNFV Beijing
summit to show the basic concept of using the framework defined by the project.

.. _DOCTOR-52: https://jira.opnfv.org/browse/DOCTOR-52
.. _OPNFV Doctor project: https://wiki.opnfv.org/doctor
.. _use cases: http://artifacts.opnfv.org/doctor/docs/requirements/02-use_cases.html#nvfi-maintenance
.. _architecture: http://artifacts.opnfv.org/doctor/docs/requirements/03-architecture.html#nfvi-maintenance
.. _implementation: http://artifacts.opnfv.org/doctor/docs/requirements/05-implementation.html#nfvi-maintenance
.. _planned maintenance session: https://lists.opnfv.org/pipermail/opnfv-tech-discuss/2017-June/016677.html
.. _Maintenance POC: https://wiki.opnfv.org/download/attachments/5046291/Doctor%20Maintenance%20PoC%202017.pptx?version=1&modificationDate=1498182869000&api=v2