.. This work is licensed under a Creative Commons Attribution 4.0 International License.
.. http://creativecommons.org/licenses/by/4.0

====================================
Planned Maintenance Design Guideline
====================================

This document describes how infrastructure maintenance can be implemented in
interaction with the VNFM by utilizing the `OPNFV Doctor project`_ framework
while meeting the set requirements. The document concentrates on OpenStack and
VMs, although the concept is generic enough for any payload or even a different
VIM. The admin tool should also cover controllers and other cloud hardware, but
that is not the main focus in OPNFV Doctor and should be defined better in the
upstream implementation. The same goes for any more detailed work to be done.

Problem Description
===================

A Telco application needs to know when infrastructure maintenance is going to
happen in order to guarantee zero downtime in its operation. It must be able
to take its own actions to keep the application running on unaffected
resources, or to give guidance for admin actions such as migration. More
details are defined in the requirement documentation: `use cases`_,
`architecture`_ and `implementation`_.

Guidelines
==========

Concepts used:

- `event`: Notification to RabbitMQ with a particular event type.

- `state event`: Notification to RabbitMQ with a particular event type,
  including a payload with a variable defined for the state.

- `project event`: Notification to RabbitMQ that is meant for a project. A
  single event type is used with different payload and state information.

- `admin event`: Notification to RabbitMQ that is meant for the admin or for
  any infrastructure service. A single event type is used with different state
  information.

- `rolling maintenance`: Node-by-node rolling maintenance and upgrade, where
  a single node at a time is maintained after any application payload has been
  moved away from the node.

- `project` stands for `application` in the OpenStack context, and both are
  used in this document. `tenant` is often used with the same meaning.

The infrastructure admin needs to send notifications with two different event
types: one meant for the admin and one for the project. The notification
payload can be consumed by the application and the admin by subscribing to the
corresponding event alarm through an alarming service like OpenStack AODH.

- The infrastructure admin needs to send a notification about infrastructure
  maintenance including all the details that the application needs in order to
  make decisions about its affected service. The alarm payload can hold a link
  to the infrastructure admin tool API for the reply and for other possible
  information. There are many steps of communication between the admin tool
  and the application, and the payload needed at each step is very similar.
  Because of this, the same event type can be used, with a variable like
  `state` telling the application what action is needed for each event.
  If a project has not subscribed to the alarm, the admin tool responsible for
  the maintenance will assume it can do maintenance operations without
  interaction with the application on top of it.

- The infrastructure admin needs to send an event about infrastructure
  maintenance telling when the maintenance starts and another when it ends.
  This admin level event should include the host name. It can be consumed by
  any admin level infrastructure entity. In this document it is consumed by
  the `Inspector`, which in `OPNFV Doctor project`_ terms is the
  infrastructure entity responsible for automatic host fault management.
  Automated fault management actions need to be disabled during planned
  maintenance.

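The two notification kinds described above can be sketched as plain Python
dictionaries. This is only an illustration: the event type names
`maintenance.planned` and `maintenance.host` and both helper functions are
hypothetical, since this guideline only requires that two distinct event types
exist.

.. code-block:: python

   # Hypothetical event type names; the guideline does not fix them.
   PROJECT_EVENT_TYPE = "maintenance.planned"
   ADMIN_EVENT_TYPE = "maintenance.host"


   def make_project_event(session_id, project_id, state, **details):
       """Build a project event; `state` tells the application what to do."""
       payload = {"session_id": session_id, "project_id": project_id,
                  "state": state}
       payload.update(details)
       return {"event_type": PROJECT_EVENT_TYPE, "payload": payload}


   def make_admin_event(session_id, project_id, state, host):
       """Build an admin event carrying the host name and its state."""
       return {"event_type": ADMIN_EVENT_TYPE,
               "payload": {"session_id": session_id,
                           "project_id": project_id,
                           "state": state, "host": host}}

Because a single event type is reused for every step, a consumer would branch
on `payload["state"]` rather than on the event type.
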
Before maintenance starts, the application needs to be able to switch over
its affected ACT-STBY service, move the service to an unaffected part of the
infrastructure, or give a hint for an admin operation like migration that the
admin tool can issue automatically according to an agreed policy.

There should be at least one empty host compatible with the host under
maintenance in order for `rolling maintenance` to go smoothly. To make this
possible, down scaling of the application instances should also be possible.

The infrastructure admin should have a tool that is responsible for hosting a
maintenance work flow session with the needed APIs for the admin and for
applications. The group of hosts in a single maintenance session should always
have the same physical capabilities, so that rolling maintenance can be
guaranteed.

The flow diagram is meant to be as high level as possible. It currently does
not try to be complete, but to show the most important interfaces needed
between the VNFM and the infrastructure admin. For example, error handling is
missing and can be defined later on.

Flow diagram:

.. figure:: images/maintenance-workflow.png
   :alt: Work flow in OpenStack

Flow diagram step by step:

- The infrastructure admin creates a maintenance session to maintain and
  upgrade a certain group of hardware. At least the compute hardware in a
  single session should have the same capabilities, like the number of VCPUs,
  to ensure the maintenance can be done node by node in a rolling fashion. A
  maintenance session needs to have a `session_id`, a unique ID that is
  carried throughout all events and can be used in the APIs needed when
  interacting with the session. The maintenance session needs to know when
  maintenance will start and what capabilities the possible upgrade to the
  infrastructure will bring to the application payload on top of it. It is up
  to the implementation to define in more detail whether more data is needed
  when creating a session or whether it is defined in the admin tool
  configuration.

  There can be several parallel maintenance sessions, and a single session can
  include the payload of multiple projects. Typically a maintenance session
  should include similar types of compute hardware, to guarantee that
  instances can be moved between the compute hosts.

- State `MAINTENANCE` `project event` and reply `ACK_MAINTENANCE`. Immediately
  after a maintenance session is created, the infrastructure admin tool sends
  a project specific notification, which the application manager can consume
  by subscribing to the AODH alarm for this event. As explained earlier, a
  `project event` is only sent if the project subscribes to the alarm.
  Otherwise there is no interaction with the application and operations
  could be forced.

  The state `MAINTENANCE` event should at least include:

    - `session_id` to reference the correct maintenance session.
    - `state` as `MAINTENANCE` to identify the event action needed.
    - `instance_ids` to tell the project which of its instances will be
      affected by the maintenance. This might be a link to the admin tool
      project specific API, as AODH variables are limited to strings of 255
      characters.
    - `reply_url` for the application to call the admin tool project specific
      API to answer `ACK_MAINTENANCE` including the `session_id`.
    - `project_id` to identify the project.
    - `actions_at` time stamp to indicate when the maintenance work flow will
      start. The `ACK_MAINTENANCE` reply is needed before that time.
    - `metadata` to include key value pairs of the capabilities coming with
      the maintenance operation, like 'openstack_version': 'Queens'.

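  As a minimal sketch of the payload listed above, the following Python builds
  a `MAINTENANCE` event and the matching reply. The helper names, the shape of
  the reply body and the comma-separated encoding of `instance_ids` are
  assumptions made for illustration; only the field names and the 255
  character limit come from this document.

  .. code-block:: python

     MAX_AODH_STRING = 255  # AODH string limit as stated in this document


     def instance_ids_field(instance_ids, link):
         """Return the ids inline if they fit, otherwise the admin tool link."""
         inline = ",".join(instance_ids)
         return inline if len(inline) <= MAX_AODH_STRING else link


     def maintenance_event(session_id, project_id, instance_ids, link,
                           reply_url, actions_at, metadata):
         """Collect the minimum fields of a state MAINTENANCE event."""
         return {
             "session_id": session_id,
             "state": "MAINTENANCE",
             "instance_ids": instance_ids_field(instance_ids, link),
             "reply_url": reply_url,
             "project_id": project_id,
             "actions_at": actions_at,
             "metadata": metadata,
         }


     def ack_for(event):
         """Body the application could send to `reply_url` (assumed shape)."""
         return {"session_id": event["session_id"],
                 "state": "ACK_" + event["state"]}

  With a handful of instances the ids travel inline; with many instances the
  field falls back to the link, which the application would then have to
  fetch from the admin tool.
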
- Optional state `DOWN_SCALE` `project event` and reply `ACK_DOWN_SCALE`. When
  it is time to start the maintenance work flow, as the time reaches the
  `actions_at` defined in the previous `state event`, the admin tool needs to
  check whether there is already an empty compute host as needed by the
  `rolling maintenance`. If there is no empty host, the admin tool can ask the
  application to down scale by sending a project specific `DOWN_SCALE`
  `state event`.

  The state `DOWN_SCALE` event should at least include:

    - `session_id` to reference the correct maintenance session.
    - `state` as `DOWN_SCALE` to identify the event action needed.
    - `reply_url` for the application to call the admin tool project specific
      API to answer `ACK_DOWN_SCALE` including the `session_id`.
    - `project_id` to identify the project.
    - `actions_at` time stamp to indicate the last moment to send
      `ACK_DOWN_SCALE`. This gives the application time to finish ongoing
      transactions before down scaling its instances, which guarantees zero
      downtime for its service.

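  The empty host check that decides whether `DOWN_SCALE` is needed could look
  roughly like the sketch below. The input mappings (`instances_by_host` and
  `instance_project`) and both function names are hypothetical; a real admin
  tool would obtain this data from the VIM.

  .. code-block:: python

     def empty_hosts(instances_by_host):
         """Return the hosts that currently run no instances."""
         return sorted(host for host, instances in instances_by_host.items()
                       if not instances)


     def projects_to_down_scale(instances_by_host, instance_project):
         """Projects to ask for DOWN_SCALE, or [] if a host is already empty."""
         if empty_hosts(instances_by_host):
             return []
         return sorted({instance_project[instance]
                        for instances in instances_by_host.values()
                        for instance in instances})

  An empty result means the rolling maintenance can start right away without
  sending the optional `DOWN_SCALE` event.
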
- Optional state `PREPARE_MAINTENANCE` `project event` and reply
  `ACK_PREPARE_MAINTENANCE`. If there is still no empty compute host after
  down scaling the applications, the admin tool needs to analyze the
  situation on the compute hosts under maintenance. It needs to choose a
  compute node that is now almost empty or that otherwise has the least
  critical instances running, if possible, for example by checking whether
  instances have floating IPs. When the compute host is chosen, a
  `PREPARE_MAINTENANCE` `state event` can be sent to the projects having
  instances running on this host, so that the instances are migrated to other
  compute hosts. It might also be possible to have another round of
  `DOWN_SCALE` `state event` if necessary, but this is not proposed here.

  The state `PREPARE_MAINTENANCE` event should at least include:

    - `session_id` to reference the correct maintenance session.
    - `state` as `PREPARE_MAINTENANCE` to identify the event action needed.
    - `instance_ids` to tell the project which of its instances will be
      affected by the `state event`. This might be a link to the admin tool
      project specific API, as AODH variables are limited to strings of 255
      characters.
    - `reply_url` for the application to call the admin tool project specific
      API to answer `ACK_PREPARE_MAINTENANCE` including the `session_id` and
      `instance_ids` as a list of key value pairs, with `instance_id` as the
      key and the action chosen from `allowed_actions` as the value.
    - `project_id` to identify the project.
    - `actions_at` time stamp to indicate the last moment to send
      `ACK_PREPARE_MAINTENANCE`. This gives the application time to finish
      ongoing transactions within its instances and to make a possible switch
      over, which guarantees zero downtime for its service.
    - `allowed_actions` to tell which actions the admin tool supports for
      moving instances to another compute host. Typically a list like:
      `['MIGRATE', 'LIVE_MIGRATE']`

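  One possible way to pick the host to empty is sketched below. The ordering
  of the criteria (fewest instances with floating IPs first, then fewest
  instances overall) is an assumption, as this document only names them as
  examples, and the function and parameter names are hypothetical.

  .. code-block:: python

     def choose_host(instances_by_host, instances_with_floating_ip):
         """Pick the host whose instances look cheapest to move away."""
         def cost(host):
             instances = instances_by_host[host]
             critical = sum(1 for instance in instances
                            if instance in instances_with_floating_ip)
             # The host name is the final tie breaker to keep the choice stable.
             return (critical, len(instances), host)
         return min(instances_by_host, key=cost)

  The projects owning instances on the chosen host would then receive the
  `PREPARE_MAINTENANCE` event.
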
- Optional state `INSTANCE_ACTION_DONE` `project event`. If the admin tool
  needed to take an action to move an instance, like migrating it to another
  compute host, this `state event` is sent to tell that the operation is
  complete.

  The state `INSTANCE_ACTION_DONE` event should at least include:

    - `session_id` to reference the correct maintenance session.
    - `instance_ids` to tell the project which of its instances had the admin
      action done.
    - `project_id` to identify the project.

- At this point it is guaranteed that there is an empty compute host. It would
  be maintained first through the `IN_MAINTENANCE` and `MAINTENANCE_COMPLETE`
  steps, but following the flow chart, `PLANNED_MAINTENANCE` is explained
  next.

- Optional state `PLANNED_MAINTENANCE` `project event` and reply
  `ACK_PLANNED_MAINTENANCE`. If the compute host to be maintained has
  instances, the projects owning them should receive this `state event`. When
  a project receives this `state event`, it knows that instances moved to
  another compute host as a result of the chosen actions will now go to a host
  that is already maintained. This means the host might have new capabilities
  that the project can take into use. This also gives the project the
  possibility to upgrade its instances to support the new capabilities as part
  of the action chosen to move the instances.

  The state `PLANNED_MAINTENANCE` event should at least include:

    - `session_id` to reference the correct maintenance session.
    - `state` as `PLANNED_MAINTENANCE` to identify the event action needed.
    - `instance_ids` to tell the project which of its instances will be
      affected by the event. This might be a link to the admin tool project
      specific API, as AODH variables are limited to strings of 255
      characters.
    - `reply_url` for the application to call the admin tool project specific
      API to answer `ACK_PLANNED_MAINTENANCE` including the `session_id` and
      `instance_ids` as a list of key value pairs, with `instance_id` as the
      key and the action chosen from `allowed_actions` as the value.
    - `project_id` to identify the project.
    - `actions_at` time stamp to indicate the last moment to send
      `ACK_PLANNED_MAINTENANCE`. This gives the application time to finish
      ongoing transactions within its instances and to make a possible switch
      over, which guarantees zero downtime for its service.
    - `allowed_actions` to tell which actions the admin tool supports for
      moving instances to another compute host. Typically a list like:
      `['MIGRATE', 'LIVE_MIGRATE', 'OWN_ACTION']`.
      `OWN_ACTION` means that the application may want to re-instantiate its
      instance, perhaps to take into use the new capability coming with the
      infrastructure maintenance. The re-instantiated instance will go to an
      already maintained host having the new capability.
    - `metadata` to include key value pairs of the capabilities coming with
      the maintenance operation, like 'openstack_version': 'Queens'.

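  On the application side, choosing an action per instance and building the
  reply might be sketched as follows. The preference order in
  `choose_action`, the reply body shape and both function names are
  assumptions for illustration; the event is assumed to carry `instance_ids`
  inline rather than as a link.

  .. code-block:: python

     def choose_action(event, wanted_capabilities):
         """Prefer OWN_ACTION when the upgrade brings a wanted capability."""
         allowed = event["allowed_actions"]
         metadata = event.get("metadata", {})
         gains = any(metadata.get(key) == value
                     for key, value in wanted_capabilities.items())
         if gains and "OWN_ACTION" in allowed:
             return "OWN_ACTION"
         if "LIVE_MIGRATE" in allowed:
             return "LIVE_MIGRATE"
         return allowed[0]


     def build_ack(event, actions_by_instance):
         """Build the reply body, rejecting actions the admin tool lacks."""
         allowed = set(event["allowed_actions"])
         for instance_id, action in actions_by_instance.items():
             if action not in allowed:
                 raise ValueError("%s: action %s is not allowed"
                                  % (instance_id, action))
         return {"session_id": event["session_id"],
                 "state": "ACK_" + event["state"],
                 "instance_ids": dict(actions_by_instance)}

  The same reply builder would also fit `PREPARE_MAINTENANCE`, since both
  events ask for an `instance_id` to action mapping.
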
- State `IN_MAINTENANCE` and `MAINTENANCE_COMPLETE` `admin event`. Just before
  a host goes into maintenance, the `IN_MAINTENANCE` `state event` is sent to
  indicate that the host is entering maintenance. The host is then taken out
  of production and can be powered off, replaced, or rebooted during the
  operation. During the maintenance and upgrade, the host might be moved to
  the admin's own host aggregate, so that it can be tested before it is put
  back into production. After maintenance is complete, the
  `MAINTENANCE_COMPLETE` `state event` is sent to tell that the host is back
  in use. Adding or removing a host is not yet included in this concept, but
  can be addressed later.

  The state `IN_MAINTENANCE` and `MAINTENANCE_COMPLETE` events should at least
  include:

    - `session_id` to reference the correct maintenance session.
    - `state` as `IN_MAINTENANCE` or `MAINTENANCE_COMPLETE` to indicate the
      host state.
    - `project_id` to identify the admin project needed by the AODH alarm.
    - `host` to indicate the host name.

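  A consumer of these admin events, such as the `Inspector`, could track hosts
  under maintenance along the lines of the sketch below. The class and method
  names are hypothetical and do not describe the actual Doctor Inspector
  code.

  .. code-block:: python

     class MaintenanceAwareInspector:
         """Skip automatic fault handling for hosts under maintenance."""

         def __init__(self):
             self.hosts_in_maintenance = set()

         def on_admin_event(self, payload):
             """Update the tracked hosts from an admin event payload."""
             if payload["state"] == "IN_MAINTENANCE":
                 self.hosts_in_maintenance.add(payload["host"])
             elif payload["state"] == "MAINTENANCE_COMPLETE":
                 self.hosts_in_maintenance.discard(payload["host"])

         def should_handle_fault(self, host):
             """Return False while the host is in planned maintenance."""
             return host not in self.hosts_in_maintenance
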
- State `MAINTENANCE_COMPLETE` `project event` and reply
  `ACK_MAINTENANCE_COMPLETE`. After all compute nodes in the maintenance
  session have gone through the maintenance operation, this `state event` can
  be sent to all projects that had instances running on any of those nodes. If
  a down scale was done, the application can now scale back up to full
  operation.

  The state `MAINTENANCE_COMPLETE` event should at least include:

    - `session_id` to reference the correct maintenance session.
    - `state` as `MAINTENANCE_COMPLETE` to identify the event action needed.
    - `instance_ids` to tell the project which of its instances are currently
      running on hosts maintained in this maintenance session. This might be a
      link to the admin tool project specific API, as AODH variables are
      limited to strings of 255 characters.
    - `reply_url` for the application to call the admin tool project specific
      API to answer `ACK_MAINTENANCE_COMPLETE` including the `session_id`.
    - `project_id` to identify the project.
    - `actions_at` time stamp to indicate when maintenance work flow will
      start.
    - `metadata` to include key value pairs of the capabilities coming with
      the maintenance operation, like 'openstack_version': 'Queens'.

- At the end, the admin tool maintenance session can enter the
  `MAINTENANCE_COMPLETE` state and the session can be removed.

Benefits
========

- The application is guaranteed zero downtime, as it is aware of the
  maintenance action affecting its payload. The application is made aware of
  the maintenance time window so that it can prepare for it.
- The application gets to know the new capabilities coming with infrastructure
  maintenance and upgrade, and can utilize them (for example, do its own
  upgrade).
- Any application supporting the interaction being defined could run on top of
  the same infrastructure provider. No vendor lock-in for the application.
- Any infrastructure component can be aware of hosts under maintenance via
  `admin event` notifications about host state. No vendor lock-in for
  infrastructure components.
- Generic messaging makes it possible to use the same concept in different
  types of clouds and application payloads. `instance_ids` uniquely identify
  any type of instance, and a similar notification payload can be used
  regardless of whether we are in OpenStack. The work flow just needs to
  support a different cloud infrastructure management to support a different
  cloud.
- No additional hardware is needed during maintenance operations, as down- and
  up-scaling can be supported for the applications. This is optional, for
  cases where no extensive spare capacity is available for the maintenance, as
  is typically the case in Telco environments.
- Parallel maintenance sessions for different groups of hardware. The same
  session should include hardware with the same capabilities to guarantee
  `rolling maintenance` actions.
- Multi-tenancy support. Project specific messaging about maintenance.

Future considerations
=====================

- Pluggable architecture for the infrastructure admin tool to handle different
  clouds and payloads.
- Pluggable architecture to handle specific maintenance/upgrade cases like an
  OpenStack upgrade between specific versions or admin testing before giving
  a host back to production.
- Support for user specific details that need to be taken into account in
  admin side actions (e.g. run a script, ...).
- (Re-)Use of existing implementations like Mistral for work flows.
- Scaling hardware resources. Allow a critical application to be scaled at the
  same time in a controlled fashion, or retire the application.

POC
---

There was a `Maintenance POC`_ demo 'How to gain VNF zero down-time during
Infrastructure Maintenance and Upgrade' at the OCP and ONS summits in March
2018. A similar concept is also being implemented as a new
`OPNFV Doctor project`_ test case scenario.

.. _OPNFV Doctor project: https://wiki.opnfv.org/doctor
.. _use cases: http://artifacts.opnfv.org/doctor/docs/requirements/02-use_cases.html#nvfi-maintenance
.. _architecture: http://artifacts.opnfv.org/doctor/docs/requirements/03-architecture.html#nfvi-maintenance
.. _implementation: http://artifacts.opnfv.org/doctor/docs/requirements/05-implementation.html#nfvi-maintenance
.. _Maintenance POC: https://youtu.be/7q496Tutzlo