This is a draft specification of a blueprint proposed for OpenStack Nova
Liberty. It was written by project members and agreed within the project
before being submitted upstream. No further changes will be made to its
content here; please follow it upstream:

* Current version upstream: https://review.openstack.org/#/c/169836/
* Development activity:
  https://blueprints.launchpad.net/nova/+spec/mark-host-down

**The original draft follows:**

====================================================
Report host fault to update server state immediately
====================================================

https://blueprints.launchpad.net/nova/+spec/update-server-state-immediately

A new API is needed to report a host fault so that the state of the compute
node and of its instances changes immediately. This allows the evacuate API
to be used without delay. With the new API, an external monitoring system
that detects any kind of host failure quickly and reliably can inform
OpenStack about it. Nova then updates the compute node state and the states
of the instances, so that the states in the Nova DB are in sync with the
real state of the system.

* Nova's state change for a failed or unreachable host is slow and does not
  reliably indicate whether the compute node is down. This might cause the
  same instance to run twice if action is taken to evacuate it to another
  host.
* The Nova state of instances on a failed compute node does not change; it
  remains active and running. This gives the user false information about
  the instance state. Currently one would need to call "nova reset-state"
  for each instance to put it into the error state.
* An OpenStack user cannot take HA actions quickly and reliably by trusting
  the instance state and compute node state.
* Because the compute node state changes slowly, instances cannot be
  evacuated without delay.

The general use case is that when there is a host fault, the compute node
state should change quickly and reliably when the DB servicegroup backend is
used. On top of this, here are the use cases that are not currently covered
when it comes to changing instance states correctly:

* Management network connectivity lost between controller and compute node.

Generic use case flow:

* The external monitoring system detects a host fault.
* The external monitoring system fences the host if it is not already down.
* The external system calls the new Nova API to force the failed compute
  node, as well as the instances running on it, into the down state.
* Nova updates the compute node state and the state of the affected
  instances in the Nova DB.

Currently the nova-compute state does change to "down", but this takes a
long time. The server state stays at "vm_state: active" and "power_state:
running", which is not correct. With an external tool that detects host
faults quickly, fences the host by powering it down, and then reports the
host down to OpenStack, all these states would reflect the actual situation.
Also, if OpenStack does not implement automatic actions for fault
correlation, an external tool can do that. This could easily be configured,
for example, in server instance METADATA and be read by the external tool.
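
The flow above can be sketched from the point of view of the external tool.
This is a minimal illustration and not part of the proposal:
``is_host_down``, ``fence_host`` and ``report_host_down`` are hypothetical
callables supplied by the monitoring system, and the last one stands for
whatever client ends up wrapping the new Nova API.

```python
from typing import Callable, List


def handle_host_fault(
    host_name: str,
    is_host_down: Callable[[str], bool],
    fence_host: Callable[[str], None],
    report_host_down: Callable[[str], None],
) -> List[str]:
    """Fence a faulty host if needed, then report it down to Nova.

    All three callables are hypothetical hooks provided by the external
    monitoring system; none of them is an existing Nova API.
    Returns the ordered list of steps that were taken.
    """
    steps = []
    # Fence first, so the host cannot keep running instances that are
    # about to be evacuated to another host.
    if not is_host_down(host_name):
        fence_host(host_name)
        steps.append("fenced")
    # Only after fencing is it safe to tell Nova that the host is down.
    report_host_down(host_name)
    steps.append("reported")
    return steps
```

The ordering is the point of the sketch: reporting a host down before it is
fenced would reintroduce the risk of the same instance running twice.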

Liberty priorities have not yet been defined.

There needs to be a new API for the admin to state that a host is down. This
API is used to mark the compute node and the instances running on it as
down, to reflect the real state.

An example for a compute node:

* When the compute node is up and running:
  vm_state: active and power_state: running
  nova-compute state: up status: enabled
* When the compute node goes down and the new API is called to state that
  the host is down:
  vm_state: stopped power_state: shutdown
  nova-compute state: down status: enabled

The vm_state values soft-delete, deleted, resized and error should not be
touched. Whether task_state needs to be touched, and how, still needs to be
worked out.
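
The state handling described above could look roughly like the following.
This is only a sketch under the assumptions of this draft: the function name
``states_after_host_down`` is hypothetical, the state strings are written as
they appear in the text, and task_state is left out because its handling is
still open. Leaving power_state alone for the excluded vm_state values is
also an assumption; the draft only speaks about vm_state.

```python
from typing import Tuple

# vm_state values that the draft says must not be touched.
UNTOUCHED_VM_STATES = frozenset({"soft-delete", "deleted", "resized", "error"})


def states_after_host_down(vm_state: str, power_state: str) -> Tuple[str, str]:
    """Return (vm_state, power_state) once the instance's host is reported down."""
    if vm_state in UNTOUCHED_VM_STATES:
        # Assumption: the whole record is left as it is, power_state included.
        return vm_state, power_state
    # Every other instance is marked the way the example above shows.
    return "stopped", "shutdown"
```

For instance, ``states_after_host_down("active", "running")`` yields the
stopped/shutdown pair, while an instance already in ``error`` is returned
unchanged.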

There are no attractive alternatives for detecting all the different host
faults other than using an external tool. For such a tool to exist, there
needs to be a new API in Nova for reporting a fault. Currently, some kind of
workaround must have been implemented wherever the states cannot be trusted
or obtained from OpenStack fast enough.

* Update the CLI to report that a host is down

  nova host-update command

  ::

    usage: nova host-update [--status <enable|disable>]
                            [--maintenance <enable|disable>]

    Update host settings.

    --status <enable|disable>
        Either enable or disable a host.

    --maintenance <enable|disable>
        Either put or resume host to/from maintenance.

  Report host down to update instance and compute node state in db.

* Update the Compute API to report that a host is down:

  /v2.1/{tenant_id}/os-hosts/{host_name}

  Normal response codes: 200

  =========  =====  ==========  ========================================
  Parameter  Style  Type        Description
  =========  =====  ==========  ========================================
  host_name  URI    xsd:string  The name of the host of interest to you.
  =========  =====  ==========  ========================================

  ::

    {
        "maintenance_mode": "enable",
        "host_down_reported": "true"
    }

  ::

    {
        "host": "65c5d5b7e3bd44308e67fc50f362aee6",
        "maintenance_mode": "enabled",
        "host_down_reported": "true"
    }

* New method in the nova.compute.api module HostAPI class to mark the
  instances related to the host, and the compute node, as down:
  set_host_down(context, host_name)

* class novaclient.v2.hosts.HostManager(api) method update(host, values)
  needs to handle reporting host down.

* The schema does not need changes, as in the db only service and server
  states are to be changed.
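
For illustration, a request to the extended os-hosts resource could be built
as follows. This is a sketch and not a definitive client. It assumes that the
update keeps using HTTP PUT as the existing os-hosts update does, that
authentication uses an ``X-Auth-Token`` header, and that the
``host_down_reported`` key is accepted as drafted above. The function name,
endpoint, tenant ID and token are placeholders. The request is only
constructed, not sent.

```python
import json
import urllib.request
from urllib.parse import quote


def build_host_down_request(endpoint, tenant_id, host_name, token):
    """Build (but do not send) a request that reports a host as down."""
    url = "{}/v2.1/{}/os-hosts/{}".format(
        endpoint.rstrip("/"), quote(tenant_id, safe=""), quote(host_name, safe="")
    )
    # Key taken from the draft request body; the value is a string there.
    body = json.dumps({"host_down_reported": "true"}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",  # assumption: same verb as the existing os-hosts update
        headers={
            "Content-Type": "application/json",
            "X-Auth-Token": token,  # requires admin privileges by default
        },
    )
```

Sending it would be a matter of passing the returned object to
``urllib.request.urlopen``, provided the deployment exposes the API as
drafted here.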

API call needs admin privileges (in the default policy configuration).

Other end user impact
---------------------

The only impact is that the user gets information about instance and compute
node state faster. This also makes faster evacuation possible. There is no
impact that would slow anything down. A host going down should be a rare
occurrence.

Other deployer impact
---------------------

A developer can make use of any external tool to detect a host fault and
report it to OpenStack.

Primary assignee: Tomi Juvonen

Other contributors: Ryota Mibu

Test cases that exist for enabling a host or putting it into maintenance
should be altered, or similar new cases added, to test the new
functionality.

The new API needs to be documented:

* Compute API extensions documentation.
  http://developer.openstack.org/api-ref-compute-v2.1.html
* Nova commands documentation.
  http://docs.openstack.org/user-guide-admin/content/novaclient_commands.html
* Compute command-line client documentation.
  http://docs.openstack.org/cli-reference/content/novaclient_commands.html
* nova.compute.api documentation.
  http://docs.openstack.org/developer/nova/api/nova.compute.api.html
* The High Availability guide might get a page explaining that an external
  tool could provide faster HA because it can update states through the new
  API.
  http://docs.openstack.org/high-availability-guide/content/index.html

* OPNFV Doctor project: https://wiki.opnfv.org/doctor
* OpenStack Instance HA Proposal:
  http://blog.russellbryant.net/2014/10/15/openstack-instance-ha-proposal/
* The Different Facets of OpenStack HA:
  http://blog.russellbryant.net/2015/03/10/the-different-facets-of-openstack-ha/