docs/requirements/04-gaps.rst

   1 .. This work is licensed under a Creative Commons Attribution 4.0 International License.
   2 .. http://creativecommons.org/licenses/by/4.0
   3
   4 Gap analysis in upstream projects
   5 =================================
   6
   7 This section presents the gaps identified in existing VIM platforms. The
   8 analysis compared each platform against the features and requirements
   9 specified in Section 3.3; the resulting gaps are listed below.
  10
  11 VIM Northbound Interface
  12 ------------------------
  13
  14 Immediate Notification
  15 ^^^^^^^^^^^^^^^^^^^^^^
  16
  17 * Type: 'deficiency in performance'
  18 * Description
  19
  20   + To-be
  21
  22     - VIM has to notify the VIM user immediately when a virtual resource
  23       becomes unavailable (fault).
  24     - The notification should be delivered within '1 second' after the VIM
  25       detects the fault or is notified of it.
  26     - Also, the following condition has to be met:
  27
  28       - Only the owning user can receive fault notifications related to the
  29         virtual resource(s) they own.
  30
  31   + As-is
  32
  33     - OpenStack Metering 'Ceilometer' can notify unavailability of virtual
  34       resource (fault) to the owner of virtual resource based on alarm
  35       configuration by the user.
  36
  37       - Ceilometer Alarm API:
  38         http://docs.openstack.org/developer/ceilometer/webapi/v2.html#alarms
  39
  40     - Alarm notifications are triggered by the alarm evaluator rather than by
  41       the notification agents that might receive faults.
  42
  43       - Ceilometer Architecture:
  44         http://docs.openstack.org/developer/ceilometer/architecture.html#id1
  45
  46     - The evaluation interval should be equal to or larger than the configured
  47       pipeline interval for collection of the underlying metrics.
  48
  49       - https://github.com/openstack/ceilometer/blob/stable/juno/ceilometer/alarm/service.py#L38-42
  50
  51     - The collection interval has to be set large enough for the size of the
  52       deployment and the number of metrics to be collected.
  53     - Even in small deployments, the interval may not be less than one second.
  54       The default value is 60 seconds.
  55     - Alternative: OpenStack has a message bus to publish system events.
  56       The operator can allow the user to connect to it, but there are no
  57       functions to filter out other events that should not be passed to the user
  58       or which were not requested by the user.
  59
  60   + Gap
  61
  62     - Fault notifications cannot be delivered immediately through Ceilometer.
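
The timing argument above can be made concrete with a small calculation. The
sketch below assumes a simplified model that is not taken from Ceilometer
itself: a fault is first captured at the next collection cycle and then seen at
the next alarm evaluation cycle, so the worst-case delay is the sum of both
intervals. The function names are hypothetical; only the 60-second default and
the 1-second target come from the text.

```python
def worst_case_delay(pipeline_interval: float, evaluation_interval: float) -> float:
    """Worst-case seconds between a fault and its alarm (assumed model).

    Assumption: the fault occurs just after a sample was collected and the
    new sample arrives just after an evaluation ran, so both intervals add up.
    """
    if evaluation_interval < pipeline_interval:
        # The text states the evaluation interval should not be smaller
        # than the pipeline interval.
        raise ValueError("evaluation interval must be >= pipeline interval")
    return pipeline_interval + evaluation_interval


def meets_target(pipeline_interval: float, evaluation_interval: float,
                 target: float = 1.0) -> bool:
    """Check the assumed worst case against the notification target."""
    return worst_case_delay(pipeline_interval, evaluation_interval) <= target


if __name__ == "__main__":
    # Default collection interval of 60 seconds, evaluated every 60 seconds.
    print(worst_case_delay(60.0, 60.0))
    print(meets_target(60.0, 60.0))
```

Under this model, even the minimum one-second collection interval gives a
worst case of two seconds, which is above the 1-second target.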
  63
  64 Maintenance Notification
  65 ^^^^^^^^^^^^^^^^^^^^^^^^
  66
  67 * Type: 'missing'
  68 * Description
  69
  70   + To-be
  71
  72     - VIM has to notify the VIM user when a virtual resource becomes
  73       unavailable because of NFVI maintenance.
  74     - Also, the following conditions/requirements have to be met:
  75
  76       - VIM should accept a maintenance message from the administrator and mark
  77         the target physical resource "in maintenance".
  78       - Only the owner of a virtual resource hosted by the target physical
  79         resource can receive the notification, which can trigger some process
  80         for applications running on the virtual resource (e.g. cut off
  81         VM).
  82
  83   + As-is
  84
  85     - OpenStack: None
  86     - AWS (just for study)
  87
  88       - AWS provides API and CLI to view the status of a resource (VM) and to
  89         create instance status and system status alarms that notify the user
  90         when an instance has a failed status check.
  91         http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/monitoring-system-instance-status-check.html
  92       - AWS provides API and CLI to view scheduled events, such as a reboot or
  93         retirement, for the user's instances. Those events are also notified
  94         via e-mail.
  95         http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/monitoring-instances-status-check_sched.html
  96
  97   + Gap
  98
  99     - VIM user cannot receive maintenance notifications.
 100
 101 * Related blueprints
 102
 103   + https://blueprints.launchpad.net/nova/+spec/service-status-notification
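
The owner-only condition in the To-be list can be illustrated with a short
sketch. It assumes a hypothetical in-memory inventory of virtual resources;
the class and function names are illustrative and do not correspond to any
existing OpenStack API.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class VirtualResource:
    """Hypothetical record of a virtual resource and where it runs."""
    resource_id: str
    owner: str
    host: str


def maintenance_notifications(
    host: str, resources: Iterable[VirtualResource]
) -> Dict[str, List[str]]:
    """Map each owner to the resources affected by maintenance on `host`.

    Only owners with a resource on the host marked "in maintenance" appear
    in the result, so no user learns about resources owned by others.
    """
    by_owner: Dict[str, List[str]] = {}
    for res in resources:
        if res.host == host:
            by_owner.setdefault(res.owner, []).append(res.resource_id)
    return by_owner
```

A notifier built on this would send each owner only its own list of affected
resources, leaving the reaction (e.g. cutting off a VM) to the user.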
 104
 105 VIM Southbound Interface
 106 ------------------------
 107
 108 Normalization of data collection models
 109 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 110
 111 * Type: 'missing'
 112 * Description
 113
 114   + To-be
 115
 116     - A normalized data format needs to be created to cope with the many data
 117       models from different monitoring solutions.
 118
 119   + As-is
 120
 121     - Data can be collected from many places (e.g. Zabbix, Nagios, Cacti,
 122       Zenoss). Although each solution establishes its own data models, no common
 123       data abstraction models exist in OpenStack.
 124
 125   + Gap
 126
 127     - A normalized data format does not exist.
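
As an illustration of what a normalized data format could look like, the
sketch below converts two differently shaped payloads into one common record.
Everything here is an assumption made for the example: the field names, the
``tool_a`` and ``tool_b`` payload layouts, and the adapter functions are
hypothetical and do not reflect the actual data models of Zabbix, Nagios,
Cacti or Zenoss.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NormalizedEvent:
    """Hypothetical common record for monitoring data."""
    source: str       # name of the monitoring tool that produced the data
    host: str         # monitored physical or virtual host
    metric: str       # metric name
    value: float      # measured value
    timestamp: float  # seconds since the epoch


def from_tool_a(raw: dict) -> NormalizedEvent:
    """Adapter for an assumed payload with string values and second clocks."""
    return NormalizedEvent(
        source="tool_a",
        host=raw["hostname"],
        metric=raw["item"],
        value=float(raw["val"]),
        timestamp=float(raw["clock"]),
    )


def from_tool_b(raw: dict) -> NormalizedEvent:
    """Adapter for an assumed payload that reports time in milliseconds."""
    return NormalizedEvent(
        source="tool_b",
        host=raw["host"],
        metric=raw["service"],
        value=float(raw["perfdata"]),
        timestamp=raw["time_ms"] / 1000.0,
    )
```

With one adapter per monitoring solution, downstream fault handling only needs
to understand ``NormalizedEvent``.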
 128
 129 OpenStack
 130 ---------
 131
 132 Ceilometer
 133 ^^^^^^^^^^
 134
 135 OpenStack offers a telemetry service, Ceilometer, for collecting measurements of
 136 the utilization of physical and virtual resources [CEIL]_. Ceilometer can
 137 collect a number of metrics across multiple OpenStack components, watch for
 138 variations, and trigger alarms based upon the collected data.
 139
 140 Scalability of fault aggregation
 141 ________________________________
 142
 143 * Type: 'scalability issue'
 144 * Description
 145
 146   + To-be
 147
 148     - Be able to scale to a large deployment, where thousands of monitoring
 149       events per second need to be analyzed.
 150
 151   + As-is
 152
 153     - Performance issues occur when scaling to medium-sized deployments.
 154
 155   + Gap
 156
 157     - Ceilometer seems to be unsuitable for monitoring medium and large scale
 158       NFVI deployments.
 159
 160 * Related blueprints
 161
 162   + Usage of Zabbix for fault aggregation [ZABB]_. Zabbix can support a much
 163     higher number of fault events (up to 15 thousand events per second), but
 164     obviously also has some upper bound:
 165     http://blog.zabbix.com/scalable-zabbix-lessons-on-hitting-9400-nvps/2615/
 166
 167   + Decentralized/hierarchical deployment with multiple instances, where one
 168     instance is only responsible for a small NFVI.
 169
 170 Monitoring of hardware and software
 171 ___________________________________
 172
 173 * Type: 'missing (lack of functionality)'
 174 * Description
 175
 176   + To-be
 177
 178     - OpenStack (as VIM) should monitor various hardware and software in NFVI
 179       so that faults on them can be handled by Ceilometer.
 180     - OpenStack may have monitoring functionality in itself and can be
 181       integrated with third-party monitoring tools.
 182     - OpenStack needs to be able to detect the faults listed in the Annex.
 183
 184   + As-is
 185
 186     - For each deployment of OpenStack, an operator has responsibility to
 187       configure monitoring tools with relevant scripts or plugins in order to
 188       monitor hardware and software.
 189     - OpenStack Ceilometer does not monitor hardware and software to capture
 190       faults.
 191
 192   + Gap
 193
 194     - Ceilometer is not able to detect and handle all faults listed in the Annex.
 195
 196 * Related blueprints / workarounds
 197
 198   + Use other dedicated monitoring tools like Zabbix or Monasca.
 199
 200 Nova
 201 ^^^^
 202
 203 OpenStack Nova [NOVA]_ is a mature, widely known and widely used component in
 204 OpenStack cloud deployments. It is the main part of an
 205 "infrastructure-as-a-service" system providing a cloud computing fabric
 206 controller, supporting a wide diversity of virtualization and container
 207 technologies.
 208
 209 Nova has proven throughout these past years to be highly available and
 210 fault-tolerant. Featuring its own API, it also provides a compatibility API with
 211 Amazon EC2 APIs.
 212
 213 Correct states when compute host is down
 214 ________________________________________
 215
 216 * Type: 'missing (lack of functionality)'
 217 * Description
 218
 219   + To-be
 220
 221     - An API is needed to change the VM power_state in case the host has failed.
 222     - An API is needed to change the nova-compute state.
 223     - There could be a single API to change different VM states for all VMs
 224       belonging to a specific host.
 225     - Because an external system monitoring the infrastructure calls these
 226       APIs, state changes can be fast and reliable.
 227     - Correlation actions can be faster and automated as states are reliable.
 228     - Users will be able to read states from OpenStack and trust that they are
 229       correct.
 230
 231   + As-is
 232
 233     - When a VM goes down due to a host HW, host OS or hypervisor failure,
 234       nothing happens in OpenStack. The VMs of a crashed host/hypervisor are
 235       reported to be live and OK through the OpenStack API.
 236     - The nova-compute state might change too slowly, or the state is not
 237       reliable if VMs are also expected to be down. As a result, VMs can still
 238       be scheduled to a failed host, and the slowness blocks evacuation.
 239
 240   + Gap
 241
 242     - OpenStack does not change its states fast and reliably enough.
 243     - An API is missing that lets an external system change states and trust
 244       that the states are then reliable (the external system has fenced the
 245       failed host).
 246     - Users cannot read all the states from OpenStack nor trust they are right.
 247
 248 * Related blueprints
 249
 250   + https://blueprints.launchpad.net/nova/+spec/mark-host-down
 251   + https://blueprints.launchpad.net/python-novaclient/+spec/support-force-down-service
 252
 253 Evacuate VMs in Maintenance mode
 254 ________________________________
 255
 256 * Type: 'missing'
 257 * Description
 258
 259   + To-be
 260
 261     - When maintenance mode for a compute host is set, trigger VM evacuation to
 262       available compute nodes before bringing the host down for maintenance.
 263
 264   + As-is
 265
 266     - When a compute node is set to maintenance mode, OpenStack only schedules
 267       evacuation of all VMs to available compute nodes if the in-maintenance
 268       compute node runs the XenAPI or VMware ESX hypervisor. Other hypervisors
 269       (e.g. KVM) are not supported and, hence, guest VMs will likely stop
 270       running due to maintenance actions the administrator may perform (e.g.
 271       hardware upgrades, OS updates).
 272
 273   + Gap
 274
 275     - The Nova libvirt hypervisor driver does not implement automatic
 276       evacuation of guest VMs when compute nodes are set to maintenance mode
 277       (``$ nova host-update --maintenance enable <hostname>``).
 278
 279 Monasca
 280 ^^^^^^^
 281
 282 Monasca is an open-source monitoring-as-a-service (MONaaS) solution that
 283 integrates with OpenStack. Even though it is still in its early days, the
 284 community intends the platform to be multi-tenant, highly scalable,
 285 performant and fault-tolerant. It provides a streaming alarm engine, a
 286 notification engine, and a northbound REST API users can use to interact with
 287 Monasca. Hundreds of thousands of metrics per second can be processed
 288 [MONA]_.
 289
 290 Anomaly detection
 291 _________________
 292
 293
 294 * Type: 'missing (lack of functionality)'
 295 * Description
 296
 297   + To-be
 298
 299     - Detect the failure and perform a root cause analysis to filter out other
 300       alarms that may be triggered due to their cascading relation.
 301
 302   + As-is
 303
 304     - A mechanism to detect root causes of failures is not available.
 305
 306   + Gap
 307
 308     - Certain failures can trigger many alarms due to their dependency on the
 309       underlying root cause of failure. Knowing the root cause can help filter
 310       out unnecessary and overwhelming alarms.
 311
 312 * Related blueprints / workarounds
 313
 314   + Monasca as of now lacks this feature, although the community is aware and
 315     working toward supporting it.
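
The filtering described in the To-be item can be sketched as follows. This is
a minimal illustration under stated assumptions, not a Monasca feature: it
assumes a known dependency map between components (hypothetical names below)
and treats an alarm as a symptom whenever any component it depends on,
directly or transitively, is also in alarm.

```python
from typing import Dict, Set


def has_alarmed_dependency(component: str, alarms: Set[str],
                           depends_on: Dict[str, Set[str]]) -> bool:
    """Return True if any direct or transitive dependency is in alarm."""
    seen: Set[str] = set()
    stack = list(depends_on.get(component, ()))
    while stack:
        dep = stack.pop()
        if dep in seen:
            continue  # already visited; also guards against cycles
        seen.add(dep)
        if dep in alarms:
            return True
        stack.extend(depends_on.get(dep, ()))
    return False


def root_cause_alarms(alarms: Set[str],
                      depends_on: Dict[str, Set[str]]) -> Set[str]:
    """Keep only alarms that no alarmed dependency can explain."""
    return {c for c in alarms
            if not has_alarmed_dependency(c, alarms, depends_on)}


if __name__ == "__main__":
    # Hypothetical topology: two VMs run on one host behind one switch.
    topology = {"vm-1": {"host-1"}, "vm-2": {"host-1"}, "host-1": {"switch-1"}}
    print(root_cause_alarms({"vm-1", "vm-2", "host-1"}, topology))
```

In the example topology, alarms on both VMs and their host reduce to the single
host alarm, which is the kind of reduction the gap description asks for.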
 316
 317 Sensor monitoring
 318 _________________
 319
 320 * Type: 'missing (lack of functionality)'
 321 * Description
 322
 323   + To-be
 324
 325     - It should support monitoring sensor data retrieval, for instance, from
 326       IPMI.
 327
 328   + As-is
 329
 330     - Monasca does not monitor sensor data.
 331
 332   + Gap
 333
 334     - Sensor monitoring is very important. It gives operators the status of
 335       the physical infrastructure (e.g. temperature, fans).
 336
 337 * Related blueprints / workarounds
 338
 339   + Monasca can be configured to use third-party monitoring solutions (e.g.
 340     Nagios, Cacti) for retrieving additional data.
 341
 342 Hardware monitoring tools
 343 -------------------------
 344
 345 Zabbix
 346 ^^^^^^
 347
 348 Zabbix is an open-source solution for monitoring availability and performance of
 349 infrastructure components (i.e. servers and network devices), as well as
 350 applications [ZABB]_. It can be customized for use with OpenStack. It is a
 351 mature tool and has been proven to be able to scale to large systems with
 352 hundreds of thousands of devices.
 353
 354 Delay in execution of actions
 355 _____________________________
 356
 357
 358 * Type: 'deficiency in performance'
 359 * Description
 360
 361   + To-be
 362
 363     - After detecting a fault, the monitoring tool should immediately execute
 364       the appropriate action, e.g. inform the manager through the NB I/F.
 365
 366   + As-is
 367
 368     - A delay of around 10 seconds was measured in two independent testbed
 369       deployments.
 370
 371   + Gap
 372
 373     - The cause of the delay needs to be identified and fixed.
 374
 375 ..
 376  vim: set tabstop=4 expandtab textwidth=80: