.. This work is licensed under a Creative Commons Attribution 4.0 International License.
.. http://creativecommons.org/licenses/by/4.0

Gap analysis in upstream projects
=================================

This section presents the gaps found in existing VIM platforms. The analysis
focused on identifying gaps with respect to the features and requirements
specified in Section 3.3.

VIM Northbound Interface
------------------------

Immediate Notification
^^^^^^^^^^^^^^^^^^^^^^

* Type: 'deficiency in performance'
* Description

  + To-be

    - VIM has to notify the VIM user of the unavailability of a virtual
      resource (fault) immediately.
    - The notification should be passed within '1 second' after the fault is
      detected/notified by the VIM.
    - Also, the following condition has to be met:

      - Only the owning user can receive notifications of faults related to
        owned virtual resource(s).

  + As-is

    - OpenStack Metering 'Ceilometer' can notify the owner of a virtual
      resource of its unavailability (fault), based on the alarm configuration
      set by the user.

      - Ceilometer Alarm API:
        http://docs.openstack.org/developer/ceilometer/webapi/v2.html#alarms

    - Alarm notifications are triggered by the alarm evaluator instead of
      the notification agents that might receive faults.

      - Ceilometer Architecture:
        http://docs.openstack.org/developer/ceilometer/architecture.html#id1

    - The evaluation interval should be equal to or larger than the configured
      pipeline interval for collection of the underlying metrics.

      - https://github.com/openstack/ceilometer/blob/stable/juno/ceilometer/alarm/service.py#L38-42

    - The collection interval has to be set large enough, which depends on
      the size of the deployment and the number of metrics to be collected.
    - The interval may not be less than one second even in small deployments.
      The default value is 60 seconds.
    - Alternative: OpenStack has a message bus to publish system events.
      The operator can allow the user to connect to it, but there are no
      functions to filter out events that should not be passed to the user
      or that were not requested by the user.

  + Gap

    - Fault notifications cannot be received immediately by Ceilometer.

* Solved by

  + Event Alarm Evaluator:
    https://specs.openstack.org/openstack/ceilometer-specs/specs/liberty/event-alarm-evaluator.html
  + New OpenStack alarms and notifications project AODH:
    http://docs.openstack.org/developer/aodh/

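The timing argument in the as-is description can be made concrete with a rough
worst-case bound. The sketch below assumes, purely for illustration, that a
fault becomes visible at the next metric collection and is then picked up at
the next alarm evaluation, so the worst-case delay is the sum of the two
intervals. The function and parameter names are hypothetical and are not part
of Ceilometer.

```python
def worst_case_notification_delay(pipeline_interval_s, evaluation_interval_s):
    """Upper bound on the fault-to-notification delay under periodic polling.

    Assumption (not taken from Ceilometer code): a fault that occurs just
    after a collection waits one full pipeline interval to be sampled, and
    the sample may then wait one full evaluation interval to be evaluated.
    """
    if evaluation_interval_s < pipeline_interval_s:
        raise ValueError("evaluation interval should be >= pipeline interval")
    return pipeline_interval_s + evaluation_interval_s


def meets_requirement(delay_s, required_s=1.0):
    """Check a delay against the '1 second' to-be requirement."""
    return delay_s <= required_s


# Both intervals at the 60 second default mentioned above.
default_delay = worst_case_notification_delay(60, 60)
print(default_delay, meets_requirement(default_delay))  # 120 False
```

Under these assumptions, any polling-based setup with intervals of one second
or more cannot meet the requirement, which is consistent with the gap stated
above.
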
Maintenance Notification
^^^^^^^^^^^^^^^^^^^^^^^^

* Type: 'missing'
* Description

  + To-be

    - VIM has to notify the VIM user of the unavailability of a virtual
      resource triggered by NFVI maintenance.
    - Also, the following conditions/requirements have to be met:

      - VIM should accept a maintenance message from the administrator and
        mark the target physical resource "in maintenance".
      - Only the owner of a virtual resource hosted by the target physical
        resource can receive the notification, which can trigger some process
        for applications running on the virtual resource (e.g. cut off
        VM).

  + As-is

    - OpenStack: None
    - AWS (just for study)

      - AWS provides API and CLI to view the status of a resource (VM) and to
        create instance status and system status alarms to notify you when an
        instance has a failed status check.
        http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/monitoring-system-instance-status-check.html
      - AWS provides API and CLI to view scheduled events, such as a reboot or
        retirement, for your instances. Users are also notified of those
        events via e-mail.
        http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/monitoring-instances-status-check_sched.html

  + Gap

    - VIM user cannot receive maintenance notifications.

* Solved by

  + https://blueprints.launchpad.net/nova/+spec/service-status-notification

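The owner-scoping condition in the to-be description can be sketched as a
simple lookup. The example below is a minimal illustration, not an OpenStack
API: the function name and the two dictionaries (VM placement and VM
ownership) are hypothetical stand-ins for data the VIM would hold.

```python
from collections import defaultdict


def maintenance_notifications(vm_placement, vm_owner, host):
    """Group the VMs on `host` by owner so each owner sees only their own VMs.

    vm_placement: dict mapping VM id -> hosting physical host
    vm_owner:     dict mapping VM id -> owning user
    """
    per_owner = defaultdict(list)
    for vm, vm_host in vm_placement.items():
        if vm_host == host:
            per_owner[vm_owner[vm]].append(vm)
    return {owner: sorted(vms) for owner, vms in per_owner.items()}


placement = {"vm-1": "host-a", "vm-2": "host-a", "vm-3": "host-b"}
owners = {"vm-1": "alice", "vm-2": "bob", "vm-3": "alice"}
print(maintenance_notifications(placement, owners, "host-a"))
# {'alice': ['vm-1'], 'bob': ['vm-2']}
```

Each key of the result identifies one recipient, and the value lists only the
virtual resources that recipient owns on the host marked "in maintenance".
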
VIM Southbound Interface
------------------------

Normalization of data collection models
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

* Type: 'missing'
* Description

  + To-be

    - A normalized data format needs to be created to cope with the many data
      models from different monitoring solutions.

  + As-is

    - Data can be collected from many places (e.g. Zabbix, Nagios, Cacti,
      Zenoss). Although each solution establishes its own data models, no common
      data abstraction models exist in OpenStack.

  + Gap

    - Normalized data format does not exist.

* Solved by

  + Specification in Section :ref:`southbound`.

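To illustrate what a normalized data format could look like, the sketch below
maps two tool-specific payloads onto one common record. All field names,
payload keys and severity encodings are hypothetical and chosen only for the
example. They do not reproduce the actual Zabbix or Nagios data models, nor
the format specified in Section :ref:`southbound`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NormalizedEvent:
    source: str       # monitoring tool that produced the event
    resource: str     # affected physical or virtual resource
    severity: str     # one of "info", "warning", "critical"
    timestamp: float  # seconds since the epoch


# Hypothetical severity encodings; real tools use their own schemes.
_ZABBIX_LIKE_SEVERITY = {0: "info", 2: "warning", 5: "critical"}
_NAGIOS_LIKE_STATE = {"OK": "info", "WARNING": "warning", "CRITICAL": "critical"}


def from_zabbix_like(raw):
    """Convert a made-up Zabbix-style payload into the common record."""
    return NormalizedEvent("zabbix", raw["host"],
                           _ZABBIX_LIKE_SEVERITY[raw["priority"]],
                           float(raw["clock"]))


def from_nagios_like(raw):
    """Convert a made-up Nagios-style payload into the common record."""
    return NormalizedEvent("nagios", raw["host_name"],
                           _NAGIOS_LIKE_STATE[raw["state"]],
                           float(raw["time"]))


a = from_zabbix_like({"host": "compute-1", "priority": 5, "clock": "1000"})
b = from_nagios_like({"host_name": "compute-1", "state": "CRITICAL", "time": 1000})
print(a.severity == b.severity, a.resource == b.resource)  # True True
```

Once both events share one schema, a consumer can compare or correlate them
without knowing which monitoring solution reported them.
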
OpenStack
---------

Ceilometer
^^^^^^^^^^

OpenStack offers a telemetry service, Ceilometer, for collecting measurements of
the utilization of physical and virtual resources [CEIL]_. Ceilometer can
collect a number of metrics across multiple OpenStack components, watch for
variations, and trigger alarms based on the collected data.

Scalability of fault aggregation
________________________________

* Type: 'scalability issue'
* Description

  + To-be

    - Be able to scale to a large deployment, where thousands of monitoring
      events per second need to be analyzed.

  + As-is

    - Performance issues arise when scaling to medium-sized deployments.

  + Gap

    - Ceilometer seems to be unsuitable for monitoring medium and large scale
      NFVI deployments.

* Solved by

  + Usage of Zabbix for fault aggregation [ZABB]_. Zabbix can support a much
    higher number of fault events (up to 15 thousand events per second), but
    obviously also has some upper bound:
    http://blog.zabbix.com/scalable-zabbix-lessons-on-hitting-9400-nvps/2615/

  + Decentralized/hierarchical deployment with multiple instances, where one
    instance is only responsible for a small NFVI.

Monitoring of hardware and software
___________________________________

* Type: 'missing (lack of functionality)'
* Description

  + To-be

    - OpenStack (as VIM) should monitor various hardware and software in the
      NFVI so that faults on them can be handled by Ceilometer.
    - OpenStack may have monitoring functionality in itself and can be
      integrated with third party monitoring tools.
    - OpenStack needs to be able to detect the faults listed in the Annex.

  + As-is

    - For each deployment of OpenStack, an operator has responsibility to
      configure monitoring tools with relevant scripts or plugins in order to
      monitor hardware and software.
    - OpenStack Ceilometer does not monitor hardware and software to capture
      faults.

  + Gap

    - Ceilometer is not able to detect and handle all faults listed in the Annex.

* Solved by

  + Use of dedicated monitoring tools like Zabbix or Monasca.
    See :ref:`nfvi_faults`.

Nova
^^^^

OpenStack Nova [NOVA]_ is a mature, widely known and widely used component in
OpenStack cloud deployments. It is the main part of an
"infrastructure-as-a-service" system providing a cloud computing fabric
controller, supporting a wide diversity of virtualization and container
technologies.

Nova has proven throughout these past years to be highly available and
fault-tolerant. In addition to its own API, it provides a compatibility API for
Amazon EC2.

Correct states when compute host is down
________________________________________

* Type: 'missing (lack of functionality)'
* Description

  + To-be

    - The API shall support changing the VM power state in case the host has
      failed.
    - The API shall support changing the nova-compute state.
    - There could be a single API to change different VM states for all VMs
      belonging to a specific host.
    - Support external systems that monitor the infrastructure and resources
      and that are able to call the API quickly and reliably.
    - Resource states are reliable such that correlation actions can be fast
      and automated.
    - Users shall be able to read states from OpenStack and trust that they
      are correct.

  + As-is

    - When a VM goes down due to a host HW, host OS or hypervisor failure,
      nothing happens in OpenStack. The VMs of a crashed host/hypervisor are
      reported to be live and OK through the OpenStack API.
    - The nova-compute state might change too slowly, or the state is not
      reliable if VMs are also expected to be down. As a result, VMs can be
      scheduled to a failed host, and the slowness blocks evacuation.

  + Gap

    - OpenStack does not change its states fast and reliably enough.
    - The API does not allow an external system to change states, nor to
      trust that the states are reliable (external system has fenced the
      failed host).
    - Users can neither read all the states from OpenStack nor trust that they
      are correct.

* Solved by

  + https://blueprints.launchpad.net/nova/+spec/mark-host-down
  + https://blueprints.launchpad.net/python-novaclient/+spec/support-force-down-service

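The to-be behaviour can be sketched as a small state model in which an
external monitor, having fenced a failed host, reports it once and all
dependent states are corrected together. This is only an illustrative model
under assumed data structures. The function names, dictionary layout and state
strings are hypothetical and do not reproduce the Nova API or the linked
blueprints.

```python
def mark_host_down(services, vms, host):
    """Apply an externally detected host failure to the state model.

    services: dict host -> {"state": "up" | "down", "forced_down": bool}
    vms:      dict VM id -> {"host": str, "power_state": str}
    Returns the ids of the VMs whose power state was corrected.
    """
    services[host]["state"] = "down"
    services[host]["forced_down"] = True
    affected = []
    for vm_id, vm in vms.items():
        if vm["host"] == host and vm["power_state"] != "shutdown":
            vm["power_state"] = "shutdown"
            affected.append(vm_id)
    return sorted(affected)


def schedulable_hosts(services):
    """Hosts a scheduler may still use: up and not forced down."""
    return sorted(h for h, s in services.items()
                  if s["state"] == "up" and not s["forced_down"])


services = {"host-a": {"state": "up", "forced_down": False},
            "host-b": {"state": "up", "forced_down": False}}
vms = {"vm-1": {"host": "host-a", "power_state": "running"},
       "vm-2": {"host": "host-b", "power_state": "running"}}

print(mark_host_down(services, vms, "host-a"))  # ['vm-1']
print(schedulable_hosts(services))              # ['host-b']
```

In this model a single call updates both the service state and the VM states,
so a reader of the states never sees a running VM on a host that is known to
be down, and the failed host is no longer a scheduling target.
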
Evacuate VMs in Maintenance mode
________________________________

* Type: 'missing'
* Description

  + To-be

    - When maintenance mode for a compute host is set, trigger VM evacuation to
      available compute nodes before bringing the host down for maintenance.

  + As-is

    - When a compute node is set to maintenance mode, OpenStack only schedules
      the evacuation of all VMs to available compute nodes if the
      in-maintenance compute node runs the XenAPI or VMware ESX hypervisor.
      Other hypervisors (e.g. KVM) are not supported and, hence, guest VMs will
      likely stop running due to maintenance actions the administrator may
      perform (e.g. hardware upgrades, OS updates).

  + Gap

    - Nova libvirt hypervisor driver does not implement automatic guest VM
      evacuation when compute nodes are set to maintenance mode (``$ nova
      host-update --maintenance enable <hostname>``).

Monasca
^^^^^^^

Monasca is an open-source monitoring-as-a-service (MONaaS) solution that
integrates with OpenStack. Although it is still in its early days, the
community intends the platform to be multi-tenant, highly scalable,
performant and fault-tolerant. It provides a streaming alarm engine, a
notification engine, and a northbound REST API that users can use to interact
with Monasca. Hundreds of thousands of metrics per second can be processed
[MONA]_.

Anomaly detection
_________________

* Type: 'missing (lack of functionality)'
* Description

  + To-be

    - Detect the failure and perform a root cause analysis to filter out other
      alarms that may be triggered due to their cascading relation.

  + As-is

    - A mechanism to detect root causes of failures is not available.

  + Gap

    - Certain failures can trigger many alarms due to their dependency on the
      underlying root cause of failure. Knowing the root cause can help filter
      out unnecessary and overwhelming alarms.

* Status

  + Monasca as of now lacks this feature, although the community is aware and
    working toward supporting it.

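The filtering idea can be sketched with a dependency graph: an alarm is
suppressed when something its resource depends on, directly or transitively,
is also alarming. This is a minimal illustration of the concept under an
assumed, hand-written dependency map. It is not a Monasca feature, and the
function name is hypothetical.

```python
def root_cause_alarms(alarms, depends_on):
    """Keep only alarms whose resource has no alarming dependency.

    alarms:     set of resources that currently raise an alarm
    depends_on: dict mapping resource -> resources it directly depends on
    """
    def has_alarming_dependency(resource, seen):
        for dep in depends_on.get(resource, ()):
            if dep in seen:
                continue  # guard against dependency cycles
            seen.add(dep)
            if dep in alarms or has_alarming_dependency(dep, seen):
                return True
        return False

    return sorted(r for r in alarms if not has_alarming_dependency(r, {r}))


deps = {"vm-1": ["host-a"], "vm-2": ["host-a"], "host-a": ["switch-1"]}
print(root_cause_alarms({"vm-1", "vm-2", "host-a"}, deps))  # ['host-a']
```

Here the two VM alarms are cascading effects of the host alarm, so only the
host alarm is reported as the root cause.
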
Sensor monitoring
_________________

* Type: 'missing (lack of functionality)'
* Description

  + To-be

    - Monasca should support retrieval of sensor data for monitoring, for
      instance from IPMI.

  + As-is

    - Monasca does not monitor sensor data.

  + Gap

    - Sensor monitoring is very important. It gives operators information
      on the state of the physical infrastructure (e.g. temperature, fans).

* Addressed by

  + Monasca can be configured to use third-party monitoring solutions (e.g.
    Nagios, Cacti) for retrieving additional data.

Hardware monitoring tools
-------------------------

Zabbix
^^^^^^

Zabbix is an open-source solution for monitoring availability and performance of
infrastructure components (i.e. servers and network devices), as well as
applications [ZABB]_. It can be customized for use with OpenStack. It is a
mature tool and has been proven to be able to scale to large systems with
hundreds of thousands of devices.

Delay in execution of actions
_____________________________

* Type: 'deficiency in performance'
* Description

  + To-be

    - After detecting a fault, the monitoring tool should immediately execute
      the appropriate action, e.g. inform the manager through the NB I/F.

  + As-is

    - A delay of around 10 seconds was measured in two independent testbed
      deployments.

  + Gap

    - The cause of the delay is the periodic evaluation and notification. The
      period defaults to 30 s and can be reduced to 5 s, but not below.
      https://github.com/zabbix/zabbix/blob/trunk/conf/zabbix_server.conf#L329


..
 vim: set tabstop=4 expandtab textwidth=80: