docs/design/report-host-fault-to-update-server-state-immediately.rst

.. NOTE::
   This is a specification draft of a blueprint proposed for OpenStack Nova
   Liberty. It was written by project member(s) and agreed within the project
   before submitting it upstream. No further changes to its content will be
   made here; please follow it upstream:

   * Current version upstream: https://review.openstack.org/#/c/169836/
   * Development activity:
     https://blueprints.launchpad.net/nova/+spec/mark-host-down

   **Original draft is as follows:**

====================================================
Report host fault to update server state immediately
====================================================

https://blueprints.launchpad.net/nova/+spec/update-server-state-immediately

A new API is needed to report a host fault so that the state of the compute
node and of its instances changes immediately. This allows the evacuate API to
be used without delay. The new API lets an external monitoring system detect
any kind of host failure quickly and reliably and inform OpenStack about it.
Nova then updates the compute node state and the states of the instances, so
that the states in the Nova DB are in sync with the real state of the system.

Problem description
===================
* Nova's state change for a failed or unreachable host is slow and does not
  reliably indicate whether the compute node is down. This might cause the
  same instance to run twice if it is evacuated to another host.
* The Nova state of instances on a failed compute node does not change but
  remains active and running. This gives the user false information about the
  instance state. Currently one would need to call "nova reset-state" for each
  instance to put it into the error state.
* An OpenStack user cannot take HA actions quickly and reliably by trusting
  the instance state and the compute node state.
* Because the compute node state changes slowly, instances cannot be evacuated
  without delay.

Use Cases
---------
In general, when there is a host fault, the compute node state should change
quickly and reliably when the DB servicegroup backend is used. On top of this,
the following use cases are not currently covered, so instance states do not
change correctly:

* Management network connectivity lost between controller and compute node.
* Host HW failed.

Generic use case flow:

* The external monitoring system detects a host fault.
* The external monitoring system fences the host if it is not already down.
* The external system calls the new Nova API to force the failed compute node
  and the instances running on it into the down state.
* Nova updates the compute node state and the state of the affected instances
  in the Nova DB.

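The generic flow above can be sketched from the point of view of the external
monitoring system. This is only an illustration and not part of the proposal:
``handle_host_fault`` and the three callables it takes (``is_powered_off``,
``fence`` and ``report_host_down``) are hypothetical names standing in for
whatever the monitoring tool and the new Nova API eventually provide.

```python
from typing import Callable, List


def handle_host_fault(
    host: str,
    is_powered_off: Callable[[str], bool],
    fence: Callable[[str], None],
    report_host_down: Callable[[str], None],
) -> List[str]:
    """Run the fence-then-report sequence for one faulty host.

    All three callables are hypothetical hooks supplied by the monitoring
    system; none of them is an existing Nova or novaclient function.
    """
    steps = []
    # Fence first, so the host cannot keep running instances after Nova
    # has been told that it is down.
    if not is_powered_off(host):
        fence(host)
        steps.append("fenced")
    # Only then ask Nova to mark the compute node and its instances down.
    report_host_down(host)
    steps.append("reported")
    return steps


# Example run with stand-in hooks that only record what was called.
calls = []
steps = handle_host_fault(
    "compute-1",
    is_powered_off=lambda h: False,
    fence=lambda h: calls.append(("fence", h)),
    report_host_down=lambda h: calls.append(("report", h)),
)
```

Fencing before reporting is the point of the ordering: once Nova considers the
host down, its instances may be evacuated, and a host that is still running
could otherwise end up running the same instance twice.
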
Currently the nova-compute state does change to "down", but this takes a long
time. The server state stays at "vm_state: active" and "power_state: running",
which is not correct. With an external tool that detects host faults quickly,
fences the host by powering it down and then reports the host as down to
OpenStack, all these states would reflect the actual situation. Also, if
OpenStack does not implement automatic actions for fault correlation, an
external tool can do that. This could easily be configured, for example, in
server instance METADATA and be read by the external tool.

Project Priority
----------------
Liberty priorities have not yet been defined.

Proposed change
===============
There needs to be a new API with which an admin can state that a host is down.
This API is used to mark the compute node and the instances running on it as
down, to reflect the real situation.

Example of the states for a compute node:

* When the compute node is up and running:

  * vm_state: active, power_state: running
  * nova-compute state: up, status: enabled

* When the compute node goes down and the new API is called to state that the
  host is down:

  * vm_state: stopped, power_state: shutdown
  * nova-compute state: down, status: enabled

Instances whose vm_state is soft-delete, deleted, resized or error should not
be touched. Whether task_state needs to be touched still has to be worked out.

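The state changes described above can be illustrated with a small sketch. It
works on plain dictionaries rather than real Nova objects, and the names
``mark_host_down`` and ``UNTOUCHED_VM_STATES`` are assumptions made for this
example only; the state values are the ones listed in this section.

```python
from typing import Dict, List

# vm_state values that the draft says must not be touched.
UNTOUCHED_VM_STATES = {"soft-delete", "deleted", "resized", "error"}


def mark_host_down(service: Dict[str, str],
                   instances: List[Dict[str, str]]) -> None:
    """Apply the proposed state changes in place to plain dict records."""
    # The compute service is forced down; its enabled/disabled status is kept.
    service["state"] = "down"
    for instance in instances:
        if instance["vm_state"] in UNTOUCHED_VM_STATES:
            continue
        instance["vm_state"] = "stopped"
        instance["power_state"] = "shutdown"
        # task_state is deliberately left alone: the draft leaves open
        # whether it needs to be touched.


# Example records for one compute node with two instances.
service = {"state": "up", "status": "enabled"}
instances = [
    {"vm_state": "active", "power_state": "running"},
    {"vm_state": "error", "power_state": "running"},
]
mark_host_down(service, instances)
```

After the call, the first instance is stopped and shut down, the instance in
the error state is unchanged, and the service is down but still enabled.
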
Alternatives
------------
There is no attractive alternative for detecting all the different host faults
other than an external tool. For this kind of tool to exist, Nova needs a new
API for reporting a fault. Currently, some kind of workaround must have been
implemented wherever the states cannot be trusted or cannot be obtained from
OpenStack quickly enough.

Data model impact
-----------------
None

REST API impact
---------------
* Update the CLI to report that a host is down:

  nova host-update command::

    usage: nova host-update [--status <enable|disable>]
                            [--maintenance <enable|disable>]
                            [--report-host-down]
                            <hostname>

    Update host settings.

    Positional arguments

    <hostname>
    Name of host.

    Optional arguments

    --status <enable|disable>
    Either enable or disable a host.

    --maintenance <enable|disable>
    Either put or resume host to/from maintenance.

    --report-host-down
    Report host down to update instance and compute node state in db.

* Update the Compute API to report that a host is down:

  /v2.1/{tenant_id}/os-hosts/{host_name}

  Normal response codes: 200

  Request parameters:

  ========= ===== ========== ========================================
  Parameter Style Type       Description
  ========= ===== ========== ========================================
  host_name URI   xsd:string The name of the host of interest to you.
  ========= ===== ========== ========================================

  Example request body::

    {
        "host": {
            "status": "enable",
            "maintenance_mode": "enable",
            "host_down_reported": "true"
        }
    }

  Example response body::

    {
        "host": {
            "host": "65c5d5b7e3bd44308e67fc50f362aee6",
            "maintenance_mode": "enabled",
            "status": "enabled",
            "host_down_reported": "true"
        }
    }

* New method in the HostAPI class of the nova.compute.api module to mark the
  compute node and the instances related to the host as down:
  set_host_down(context, host_name)

* The update(host, values) method of class
  novaclient.v2.hosts.HostManager(api) needs to handle reporting a host as
  down.

* The schema does not need changes, as only service and server states are to
  be changed in the DB.

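As a rough sketch of how the proposed CLI options could map onto the proposed
request body, the following uses only the Python standard library. The helper
names ``build_parser`` and ``build_request`` and the tenant id ``demo`` are
assumptions made for this example; the sketch uses the ``--report-host-down``
spelling from the usage line and does not reflect how novaclient is actually
implemented.

```python
import argparse
import json


def build_parser() -> argparse.ArgumentParser:
    """Parser mirroring the proposed "nova host-update" usage text."""
    parser = argparse.ArgumentParser(prog="nova host-update",
                                     description="Update host settings.")
    parser.add_argument("--status", choices=["enable", "disable"],
                        help="Either enable or disable a host.")
    parser.add_argument("--maintenance", choices=["enable", "disable"],
                        help="Either put or resume host to/from maintenance.")
    parser.add_argument("--report-host-down", action="store_true",
                        help="Report host down to update instance and "
                             "compute node state in db.")
    parser.add_argument("hostname", help="Name of host.")
    return parser


def build_request(args: argparse.Namespace, tenant_id: str):
    """Return (path, body) for the proposed os-hosts update call."""
    path = "/v2.1/{}/os-hosts/{}".format(tenant_id, args.hostname)
    host = {}
    if args.status:
        host["status"] = args.status
    if args.maintenance:
        host["maintenance_mode"] = args.maintenance
    if args.report_host_down:
        # The draft's example sends the flag as the string "true".
        host["host_down_reported"] = "true"
    return path, {"host": host}


# Example: report a host as down without changing its other settings.
args = build_parser().parse_args(["--report-host-down", "compute-1"])
path, body = build_request(args, tenant_id="demo")
print(path)
print(json.dumps(body, sort_keys=True))
```

Only the options given on the command line end up in the body, so reporting a
host as down does not implicitly change its status or maintenance mode.
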
Security impact
---------------
The API call needs admin privileges (in the default policy configuration).

Notifications impact
--------------------
None

Other end user impact
---------------------
None

Performance Impact
------------------
The only impact is that the user gets information about instance and compute
node state faster. This also makes faster evacuation possible. There is no
impact that would slow anything down. A host going down should be a rare
occurrence.

Other deployer impact
---------------------
A deployer can make use of any external tool to detect host faults and report
them to OpenStack.

Developer impact
----------------
None

Implementation
==============
Assignee(s)
-----------
Primary assignee:
  Tomi Juvonen

Other contributors:
  Ryota Mibu

Work Items
----------
* Test cases.
* API changes.
* Documentation.

Dependencies
============
None

Testing
=======
Test cases that exist for enabling a host or putting it into maintenance
should be altered, or similar new cases should be written, to test the new
functionality.

Documentation Impact
====================

The new API needs to be documented:

* Compute API extensions documentation.
  http://developer.openstack.org/api-ref-compute-v2.1.html
* Nova commands documentation.
  http://docs.openstack.org/user-guide-admin/content/novaclient_commands.html
* Compute command-line client documentation.
  http://docs.openstack.org/cli-reference/content/novaclient_commands.html
* nova.compute.api documentation.
  http://docs.openstack.org/developer/nova/api/nova.compute.api.html
* The High Availability guide might get a page explaining that an external
  tool can provide faster HA because it can update states through the new API.
  http://docs.openstack.org/high-availability-guide/content/index.html

References
==========
* OPNFV Doctor project: https://wiki.opnfv.org/doctor
* OpenStack Instance HA Proposal:
  http://blog.russellbryant.net/2014/10/15/openstack-instance-ha-proposal/
* The Different Facets of OpenStack HA:
  http://blog.russellbryant.net/2015/03/10/the-different-facets-of-openstack-ha/