.. This work is licensed under a Creative Commons Attribution 4.0 International License.
.. http://creativecommons.org/licenses/by/4.0

====================================
Planned Maintenance Design Guideline
====================================

This is a draft specification of the design guideline for planned maintenance.
The JIRA ticket to track updates and collect comments is `DOCTOR-52`_.

This document describes how to implement planned maintenance by using the
`OPNFV Doctor project`_ framework while meeting the stated requirements.

A Telco application needs to know when planned maintenance is going to happen
in order to guarantee zero downtime in its operation. It must be able to take
its own actions to keep the application running on unaffected resources, or to
give guidance for admin actions such as migration. More details are defined in
the requirement documentation: `use cases`_, `architecture`_ and
`implementation`_. See also the discussion from the OPNFV summit
`planned maintenance session`_.

The cloud admin needs to send a notification about planned maintenance that
includes all the details the application needs in order to make decisions about
its affected service. The application can consume this notification payload by
subscribing to the corresponding event alarm through an alarming service such
as OpenStack AODH.

Before maintenance starts, the application needs to be able to switch over its
affected ACT-STBY service, move the service to an unaffected part of the
infrastructure, or give a hint for an admin operation such as migration, which
the admin tool can issue automatically according to the agreed policy.
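
As a minimal sketch of what such a notification could carry, the following
Python builds a project-specific payload from the details this guideline lists.
The function ``build_project_notification`` and all key names other than
`maintenance_session_id` are hypothetical and chosen for illustration only;
they are not a payload format defined by Doctor or AODH.

.. code-block:: python

   import json
   from datetime import datetime, timezone


   def build_project_notification(session_id, project_id, start_time,
                                  maintenance_type, instances,
                                  default_action, allowed_actions):
       """Return a dict describing planned maintenance for one project.

       All keys are illustrative assumptions, not a defined Doctor format.
       """
       if maintenance_type not in ("reboot", "full maintenance"):
           raise ValueError("unknown maintenance type: %s" % maintenance_type)
       if default_action not in allowed_actions:
           raise ValueError("default action must be one of the allowed actions")
       return {
           "maintenance_session_id": session_id,
           "project_id": project_id,
           "state": "planned maintenance",
           "maintenance_type": maintenance_type,
           "start_time": start_time.isoformat(),
           "instances": list(instances),
           "default_action": default_action,
           "allowed_actions": list(allowed_actions),
       }


   # Example values are made up for the sketch.
   payload = build_project_notification(
       session_id="session-0001",
       project_id="telco-project",
       start_time=datetime(2017, 7, 1, 2, 0, tzinfo=timezone.utc),
       maintenance_type="reboot",
       instances=["vm-1", "vm-2"],
       default_action="migrate",
       allowed_actions=["migrate", "leave in place"],
   )
   print(json.dumps(payload, indent=2, sort_keys=True))

Keeping the payload a plain dictionary means it can be serialized to JSON and
handed to whichever notification mechanism the deployment uses.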

::

  admin alarming project   controller  inspector
    |   service  app manager   |           |
    +------------------------->+           |
    +<-------------------------+           |
    +<----------------+        |           |
    +------------------------->+           |
    +<-------------------------+    7.     |
    +------------------------------------->+
    +--------------------------------------+
    +--------------------------------------+
    +------------------------->+           |
    +<-------------------------+           |
    +------>+-------->+        |    13.    |
    +------------------------------------->+
    +-------+---------+--------+-----------+

- `full maintenance`: Maintenance will take a longer time and the resource
  should be emptied, meaning containers or VMs need to be moved or deleted.
  The admin might need to test that the resource works after maintenance.

- `reboot`: Only a reboot is needed and the admin does not need to run separate
  tests afterwards. Containers or VMs can be left in place if so wanted.

- `notification`: Notification to RabbitMQ.

The admin creates a planned maintenance session and sets a
`maintenance_session_id`, a unique ID covering all the hardware resources that
will be maintained at the same time. Maintenance should mostly be done node by
node, so a single compute node at a time would be in a single planned
maintenance session with its own unique `maintenance_session_id`. This ID is
carried through the whole session in all places and can be used to query the
maintenance in the admin tool API. A project running a Telco application should
set a specific role so that the admin tool knows it cannot do planned
maintenance unless the project has agreed to the actions to be taken on its VMs
or containers. This implies that the project has configured itself to receive
alarms about planned maintenance and is capable of agreeing to the needed
actions. The admin is expected to use an admin tool to automate the maintenance
process partially or entirely.
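
The session ID and the role check can be sketched as follows. This is an
illustration under stated assumptions, not the admin tool's implementation:
the helper names and the dictionary keys ``requires_agreement`` and
``agreed_action`` are hypothetical, and a UUID is just one possible way to
obtain a unique `maintenance_session_id`.

.. code-block:: python

   import uuid


   def new_maintenance_session(hosts):
       """Create a session record with a unique maintenance_session_id."""
       return {"maintenance_session_id": str(uuid.uuid4()),
               "hosts": list(hosts)}


   def may_start_maintenance(projects):
       """Return True only if every project that must agree has done so.

       Each project is a dict with the hypothetical keys
       'requires_agreement' (the specific role is set) and
       'agreed_action' (None until the project has agreed to an action).
       """
       return all(
           p["agreed_action"] is not None
           for p in projects
           if p["requires_agreement"]
       )


   # Node-by-node maintenance: one compute host per session.
   session = new_maintenance_session(["compute-1"])
   projects = [
       {"name": "telco", "requires_agreement": True, "agreed_action": None},
       {"name": "batch", "requires_agreement": False, "agreed_action": None},
   ]
   print(may_start_maintenance(projects))   # False: telco has not agreed yet
   projects[0]["agreed_action"] = "migrate"
   print(may_start_maintenance(projects))   # True

Projects without the role never block the session, which matches the rule that
the default action applies to them when they give no instruction.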

The flow of a successful planned maintenance session, using OpenStack as the
example case, is as follows:

1. The admin disables nova-compute in order to do planned maintenance on a
   compute host and gets an ACK from the API call. This needs to be done to
   ensure that no user can place anything on this compute host. The action is
   always taken, regardless of whether the whole compute host will be affected
   or not.
2. The admin sends a project-specific maintenance notification with state
   `planned maintenance`. It includes detailed information about the
   maintenance, such as when it is going to start and whether it is a `reboot`
   or `full maintenance`, as well as information about the project's
   containers or VMs running on the host, or on the part of it that will need
   maintenance. It also states the default action, such as migration, that the
   admin will issue before maintenance starts if the project sets no other
   action. If the project has the specific role set, planned maintenance cannot
   start unless the project has agreed to the admin action. The available admin
   actions are also listed in the notification.
3. The application manager of the project receives an AODH alarm about the
   same.
4. The application manager can switch over to its ACT-STBY service, or delete
   and re-instantiate its service on an unaffected resource if so wanted.
5. The application manager may call the admin tool API to give preferred
   instructions for leaving VMs and containers in place or for an admin action
   to migrate them. If the admin does not receive this instruction before
   maintenance is due to start, it will perform the pre-configured default
   action, such as migration, for projects that do not have the specific role
   requiring them to agree to the action. VMs or containers can be left on the
   host if the type of maintenance is just `reboot`.
6. The admin performs the possible actions on VMs and containers and receives
   an ACK.
7. If everything went OK, the admin sends an admin-type maintenance
   notification with state `in maintenance`. The Inspector and other cloud
   services can consume this notification to know there is ongoing maintenance,
   which means that things like automatic fault management actions for the
   hardware resources should be disabled.
8. If the maintenance type is `reboot` and the project still has containers or
   VMs running on the affected hardware resource, the admin sends a
   project-specific maintenance notification with state updated to
   `in maintenance`. If the project does not have anything left running on the
   affected hardware resource, the state will be `maintenance over` instead. If
   maintenance cannot be performed for some reason, the state should be
   `maintenance cancelled`. In this case the last remaining operation for the
   admin is to re-enable the nova-compute service and ensure everything is
   running, without proceeding to any further steps.
9. The application manager of the project receives an AODH alarm about the
   same.
10. The admin does the maintenance. This is out of Doctor scope.
11. The admin enables the nova-compute service when maintenance is over and the
    host can be put back into production. An ACK is received from the API call.
12. If the project had left containers or VMs on the hardware resource during
    maintenance, the admin sends a project-specific maintenance notification
    with state updated to `maintenance over`.
13. The admin sends an admin-type maintenance notification with state updated
    to `maintenance over`. The Inspector and other cloud services can consume
    this to know the hardware resource is back in use.
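
The two decision points in the flow above, which action the admin applies
(step 5) and which state the project-specific notification carries (step 8),
can be sketched in Python as below. The function names and parameters are
hypothetical. Rejecting `full maintenance` with instances still on the resource
is an assumption of this sketch, based on the statement that the resource
should be emptied in that case.

.. code-block:: python

   def action_for_project(requires_agreement, requested_action,
                          default_action):
       """Pick the admin action for a project's VMs or containers (step 5).

       Returns None when the project must agree but has not responded,
       meaning planned maintenance cannot start for it yet.
       """
       if requested_action is not None:
           return requested_action
       if requires_agreement:
           return None
       return default_action


   def project_state(maintenance_type, instances_left, cancelled=False):
       """State for the project-specific notification (step 8)."""
       if cancelled:
           return "maintenance cancelled"
       if instances_left:
           if maintenance_type != "reboot":
               # Assumption: full maintenance requires an emptied resource.
               raise ValueError("resource must be emptied first")
           return "in maintenance"
       return "maintenance over"


   print(action_for_project(False, None, "migrate"))
   print(action_for_project(True, None, "migrate"))
   print(project_state("reboot", instances_left=True))
   print(project_state("full maintenance", instances_left=False))

Returning ``None`` rather than falling back to the default keeps the rule
explicit that a project with the specific role must agree before maintenance
can start.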

A `Maintenance POC`_ for planned maintenance was presented at the OPNFV Beijing
summit to show the basic concept of using the framework defined by the project.

.. _DOCTOR-52: https://jira.opnfv.org/browse/DOCTOR-52
.. _OPNFV Doctor project: https://wiki.opnfv.org/doctor
.. _use cases: http://artifacts.opnfv.org/doctor/docs/requirements/02-use_cases.html#nvfi-maintenance
.. _architecture: http://artifacts.opnfv.org/doctor/docs/requirements/03-architecture.html#nfvi-maintenance
.. _implementation: http://artifacts.opnfv.org/doctor/docs/requirements/05-implementation.html#nfvi-maintenance
.. _planned maintenance session: https://lists.opnfv.org/pipermail/opnfv-tech-discuss/2017-June/016677.html
.. _Maintenance POC: https://wiki.opnfv.org/download/attachments/5046291/Doctor%20Maintenance%20PoC%202017.pptx?version=1&modificationDate=1498182869000&api=v2