Gap analysis in upstream projects
=================================

This section presents gaps identified in existing VIM platforms. The analysis
compared each platform against the features and requirements specified in
Section 3.3.

VIM Northbound Interface
------------------------

Immediate Notification
^^^^^^^^^^^^^^^^^^^^^^

* Type: 'deficiency in performance'
* Description

  + To-be

    - VIM has to notify the VIM user immediately when a virtual resource
      becomes unavailable (fault).
    - The notification should be delivered within '1 second' after the fault
      is detected by, or notified to, the VIM.
    - In addition, the following requirement has to be met:

      - Only the owning user can receive notifications of faults related to
        the virtual resource(s) they own.

  + As-is

    - OpenStack Metering 'Ceilometer' can notify the owner of a virtual
      resource of its unavailability (fault), based on the alarm
      configuration set by the user.

      - Ceilometer Alarm API:
        http://docs.openstack.org/developer/ceilometer/webapi/v2.html#alarms

    - Alarm notifications are triggered by the alarm evaluator rather than by
      the notification agents that might receive faults.

      - Ceilometer Architecture:
        http://docs.openstack.org/developer/ceilometer/architecture.html#id1

    - The evaluation interval should be equal to or larger than the configured
      pipeline interval for collection of the underlying metrics.

      - https://github.com/openstack/ceilometer/blob/stable/juno/ceilometer/alarm/service.py#L38-42

    - The collection interval has to be set large enough for the size of the
      deployment and the number of metrics to be collected.
    - Even in small deployments, the interval may not be less than one second.
      The default value is 60 seconds.
    - Alternative: OpenStack has a message bus to publish system events.
      The operator can allow the user to connect to it, but there are no
      functions to filter out events that should not be passed to the user
      or that were not requested by the user.

  + Gap

    - Fault notifications cannot be received immediately via Ceilometer.

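As a rough illustration of why a polling-based alarm path misses the 1 second
target, the sketch below estimates a worst-case notification delay. It assumes
a simplified model that is not taken from the Ceilometer code: a fault that
occurs just after a sample was collected is only seen at the next collection,
and the resulting sample then waits for the next alarm evaluation. The function
name and the model are hypothetical.

```python
def worst_case_delay(pipeline_interval_s, evaluation_interval_s):
    """Upper bound on fault-to-notification delay under the assumed model.

    Assumption: the delay is at most one full collection interval plus one
    full evaluation interval. Real deployments add processing time on top.
    """
    if evaluation_interval_s < pipeline_interval_s:
        # The text above states the evaluation interval should not be
        # smaller than the pipeline interval.
        raise ValueError("evaluation interval must be >= pipeline interval")
    return pipeline_interval_s + evaluation_interval_s


TARGET_S = 1  # "To-be" requirement: notify within 1 second

# Both intervals set to 60 seconds, the default value mentioned above.
delay = worst_case_delay(60, 60)
print(f"worst-case delay: {delay} s, target: {TARGET_S} s")
```

Under these assumptions the bound is two orders of magnitude above the target,
which is consistent with the gap stated above.
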
Maintenance Notification
^^^^^^^^^^^^^^^^^^^^^^^^

* Type: 'missing'
* Description

  + To-be

    - VIM has to notify the VIM user when a virtual resource becomes
      unavailable because of NFVI maintenance.
    - In addition, the following requirements have to be met:

      - VIM should accept a maintenance message from the administrator and
        mark the target physical resource as "in maintenance".
      - Only the owner of a virtual resource hosted by the target physical
        resource can receive the notification, which can trigger some process
        for the applications running on that virtual resource (e.g. cut off
        VM).

  + As-is

    - OpenStack: None
    - AWS (just for study)

      - AWS provides an API and CLI to view the status of a resource (VM) and
        to create instance status and system status alarms that notify the
        user when an instance has a failed status check.
        http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/monitoring-system-instance-status-check.html
      - AWS provides an API and CLI to view scheduled events, such as a reboot
        or retirement, for the user's instances. Those events are also
        notified via e-mail.
        http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/monitoring-instances-status-check_sched.html

  + Gap

    - The VIM user cannot receive maintenance notifications.

* Related blueprints

  + https://blueprints.launchpad.net/nova/+spec/service-status-notification

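The owner-only condition in the "To-be" list can be sketched as follows. This
is a minimal in-memory illustration, not an OpenStack API: the
``VirtualResource`` type, its fields and the function name are all
hypothetical. Given a host marked "in maintenance", it groups the affected
virtual resources by owner so that each owner is only told about their own
resources.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class VirtualResource:
    # Hypothetical inventory record, for illustration only.
    resource_id: str
    owner: str
    host: str


def maintenance_notifications(resources, host_in_maintenance):
    """Return {owner: [resource ids]} for resources on the given host."""
    affected = defaultdict(list)
    for res in resources:
        if res.host == host_in_maintenance:
            affected[res.owner].append(res.resource_id)
    # Sort ids so the result is deterministic.
    return {owner: sorted(ids) for owner, ids in affected.items()}


inventory = [
    VirtualResource("vm-1", "alice", "compute-1"),
    VirtualResource("vm-2", "bob", "compute-1"),
    VirtualResource("vm-3", "alice", "compute-2"),
]
notifications = maintenance_notifications(inventory, "compute-1")
print(notifications)
```

Owners without resources on the host in maintenance do not appear in the
result, so they receive nothing.
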
VIM Southbound Interface
------------------------

Normalization of data collection models
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

* Type: 'missing'
* Description

  + To-be

    - A normalized data format needs to be created to cope with the many data
      models from different monitoring solutions.

  + As-is

    - Data can be collected from many places (e.g. Zabbix, Nagios, Cacti,
      Zenoss). Although each solution establishes its own data models, no common
      data abstraction models exist in OpenStack.

  + Gap

    - A normalized data format does not exist.

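To make the idea of a normalized format concrete, the sketch below maps events
from two differently shaped sources onto one record type. Everything here is
an assumption for illustration: the ``NormalizedEvent`` fields, the input
dictionary keys and the simplified severity tables are hypothetical and do not
reproduce the actual data models of any monitoring tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NormalizedEvent:
    source: str    # monitoring tool that produced the event
    host: str      # affected physical or virtual resource
    metric: str    # what was measured or checked
    severity: str  # "ok", "warning", "critical" or "unknown"


# Simplified, illustrative severity tables (not the tools' full scales).
TOOL_A_SEVERITY = {0: "ok", 2: "warning", 5: "critical"}
TOOL_B_STATE = {0: "ok", 1: "warning", 2: "critical"}


def from_tool_a(raw):
    """Adapter for a hypothetical source using 'hostname'/'item'/'priority'."""
    return NormalizedEvent(
        source="tool-a",
        host=raw["hostname"],
        metric=raw["item"],
        severity=TOOL_A_SEVERITY.get(raw["priority"], "unknown"),
    )


def from_tool_b(raw):
    """Adapter for a hypothetical source using 'host_name'/'service'/'state'."""
    return NormalizedEvent(
        source="tool-b",
        host=raw["host_name"],
        metric=raw["service"],
        severity=TOOL_B_STATE.get(raw["state"], "unknown"),
    )


events = [
    from_tool_a({"hostname": "compute-1", "item": "cpu.temp", "priority": 5}),
    from_tool_b({"host_name": "compute-2", "service": "disk", "state": 1}),
]
for event in events:
    print(event)
```

Consumers such as a fault aggregator would then only need to understand
``NormalizedEvent``, with one adapter per monitoring solution.
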
OpenStack
---------

Ceilometer
^^^^^^^^^^

OpenStack offers a telemetry service, Ceilometer, for collecting measurements of
the utilization of physical and virtual resources [CEIL]_. Ceilometer can
collect a number of metrics across multiple OpenStack components, watch for
variations, and trigger alarms based upon the collected data.

Scalability of fault aggregation
________________________________

* Type: 'scalability issue'
* Description

  + To-be

    - Be able to scale to a large deployment, where thousands of monitoring
      events per second need to be analyzed.

  + As-is

    - Performance issues occur when scaling to medium-sized deployments.

  + Gap

    - Ceilometer seems to be unsuitable for monitoring medium and large scale
      NFVI deployments.

* Related blueprints

  + Usage of Zabbix for fault aggregation [ZABB]_. Zabbix can support a much
    higher number of fault events (up to 15 thousand events per second), but
    obviously also has some upper bound:
    http://blog.zabbix.com/scalable-zabbix-lessons-on-hitting-9400-nvps/2615/

  + Decentralized/hierarchical deployment with multiple instances, where each
    instance is only responsible for a small NFVI.

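The decentralized option can be sketched with simple arithmetic: estimate how
many monitoring instances are needed for a given event rate, then split the
hosts among them. This is only an illustration under stated assumptions. The
function names are hypothetical, the per-instance capacity is an input rather
than a measured value, and round-robin partitioning is one possible policy
among many.

```python
import math


def instances_needed(total_events_per_s, capacity_per_instance):
    """Number of monitoring instances needed for the given event rate."""
    if capacity_per_instance <= 0:
        raise ValueError("capacity must be positive")
    return max(1, math.ceil(total_events_per_s / capacity_per_instance))


def partition_hosts(hosts, n_instances):
    """Assign hosts to instances round-robin, in sorted order."""
    shards = [[] for _ in range(n_instances)]
    for index, host in enumerate(sorted(hosts)):
        shards[index % n_instances].append(host)
    return shards


# Example figures only: 40,000 events/s overall, 15,000 events/s per instance.
count = instances_needed(40_000, 15_000)
shards = partition_hosts(["h1", "h2", "h3", "h4", "h5"], count)
print(count, shards)
```

A real hierarchy would also need an aggregation layer above the instances,
which this sketch leaves out.
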
Monitoring of hardware and software
___________________________________

* Type: 'missing (lack of functionality)'
* Description

  + To-be

    - OpenStack (as VIM) should monitor the various hardware and software in
      the NFVI, so that faults on them can be handled by Ceilometer.
    - OpenStack may provide monitoring functionality itself and can be
      integrated with third-party monitoring tools.
    - OpenStack needs to be able to detect the faults listed in Section 3.5.

  + As-is

    - For each deployment of OpenStack, the operator is responsible for
      configuring monitoring tools with relevant scripts or plugins in order to
      monitor hardware and software.
    - OpenStack Ceilometer does not monitor hardware and software to capture
      faults.

  + Gap

    - Ceilometer is not able to detect and handle all faults listed in Section
      3.5.

* Related blueprints / workarounds

  + Use other dedicated monitoring tools like Zabbix or Monasca.

Nova
^^^^

OpenStack Nova [NOVA]_ is a mature, widely known and widely used component in
OpenStack cloud deployments. It is the main part of an
"infrastructure-as-a-service" system, providing a cloud computing fabric
controller that supports a wide diversity of virtualization and container
technologies.

Nova has proven throughout these past years to be highly available and
fault-tolerant. In addition to its own API, it provides a compatibility API
for Amazon EC2.

Correct states when compute host is down
________________________________________

* Type: 'missing (lack of functionality)'
* Description

  + To-be

    - There needs to be an API to change the VM power_state when a host has
      failed.
    - There needs to be an API to change the nova-compute state.
    - There could be a single API to change the different VM states for all
      VMs belonging to a specific host.
    - If an external system monitoring the infrastructure calls these APIs,
      the state change can be fast and reliable.
    - Correlation actions can be faster and automated because the states are
      reliable.
    - The user will be able to read states from OpenStack and trust that they
      are correct.

  + As-is

    - When a VM goes down due to a host HW, host OS or hypervisor failure,
      nothing happens in OpenStack. The VMs of a crashed host/hypervisor are
      reported to be live and OK through the OpenStack API.
    - The nova-compute state might change too slowly, or the state is not
      reliable if the VMs are also expected to be down. As a result, VMs can
      be scheduled to a failed host, and the slowness blocks evacuation.

  + Gap

    - OpenStack does not change its states fast and reliably enough.
    - An API is missing that lets an external system change the states and
      trust that they are then reliable (the external system has fenced the
      failed host).
    - The user cannot read all the states from OpenStack, nor trust that they
      are right.

* Related blueprints

  + https://blueprints.launchpad.net/nova/+spec/mark-host-down
  + https://blueprints.launchpad.net/python-novaclient/+spec/support-force-down-service

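The intended behaviour of the "To-be" items can be sketched with a small
in-memory model. It does not call Nova. The ``Vm`` and ``Cloud`` classes, the
state strings and the method names are hypothetical and only show the state
transitions an external monitor would trigger after fencing a failed host.

```python
from dataclasses import dataclass, field


@dataclass
class Vm:
    name: str
    host: str
    power_state: str = "running"


@dataclass
class Cloud:
    vms: list
    down_hosts: set = field(default_factory=set)

    def mark_host_down(self, host):
        """Called by an external monitor once it has fenced `host`."""
        self.down_hosts.add(host)
        # Single call that corrects the state of every VM on the host.
        for vm in self.vms:
            if vm.host == host:
                vm.power_state = "shutdown"

    def schedulable_hosts(self, all_hosts):
        """Hosts the scheduler may still use for new or evacuated VMs."""
        return [h for h in all_hosts if h not in self.down_hosts]


cloud = Cloud(vms=[Vm("vm1", "compute-1"), Vm("vm2", "compute-2")])
cloud.mark_host_down("compute-1")
print([(vm.name, vm.power_state) for vm in cloud.vms])
print(cloud.schedulable_hosts(["compute-1", "compute-2"]))
```

Because the monitor sets the states explicitly, readers of the states do not
have to wait for a timeout-based detection of the failed compute service.
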
Evacuate VMs in Maintenance mode
________________________________

* Type: 'missing'
* Description

  + To-be

    - When maintenance mode is set for a compute host, trigger VM evacuation to
      available compute nodes before bringing the host down for maintenance.

  + As-is

    - When a compute node is set to maintenance mode, OpenStack only schedules
      evacuation of all VMs to available compute nodes if the in-maintenance
      compute node runs the XenAPI or VMware ESX hypervisor. Other hypervisors
      (e.g. KVM) are not supported and, hence, guest VMs will likely stop
      running due to the maintenance actions the administrator may perform
      (e.g. hardware upgrades, OS updates).

  + Gap

    - The Nova libvirt hypervisor driver does not implement automatic
      evacuation of guest VMs when compute nodes are set to maintenance mode
      (``$ nova host-update --maintenance enable <hostname>``).

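The "To-be" behaviour amounts to planning a destination for every guest VM
before the host goes down. The sketch below is a minimal planner under strong
assumptions: capacity is counted in abstract VM slots, the placement policy
(most free slots first) is arbitrary, and the function name is hypothetical.
It is not how the Nova scheduler works.

```python
def plan_evacuation(vms_on_host, free_slots):
    """Map each VM to a destination host, or fail if capacity runs out.

    vms_on_host: list of VM names on the host entering maintenance.
    free_slots:  {host: number of free VM slots} for the other hosts.
    """
    remaining = dict(free_slots)
    plan = {}
    for vm in vms_on_host:
        # Pick the host with the most free slots; ties go to the
        # alphabetically first host so the result is deterministic.
        dest = max(sorted(remaining), key=remaining.get, default=None)
        if dest is None or remaining[dest] <= 0:
            raise RuntimeError(f"no capacity left to evacuate {vm}")
        plan[vm] = dest
        remaining[dest] -= 1
    return plan


plan = plan_evacuation(["vm1", "vm2", "vm3"], {"compute-2": 2, "compute-3": 1})
print(plan)
```

Failing early when capacity is insufficient lets the administrator postpone
the maintenance instead of stopping guest VMs.
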
Monasca
^^^^^^^

Monasca is an open-source monitoring-as-a-service (MONaaS) solution that
integrates with OpenStack. Even though it is still in its early days, the
community intends the platform to be multi-tenant, highly scalable,
performant and fault-tolerant. It provides a streaming alarm engine, a
notification engine, and a northbound REST API that users can use to interact
with Monasca. Hundreds of thousands of metrics per second can be processed
[MONA]_.

Anomaly detection
_________________

* Type: 'missing (lack of functionality)'
* Description

  + To-be

    - Detect the failure and perform a root cause analysis to filter out other
      alarms that may be triggered due to their cascading relation.

  + As-is

    - A mechanism to detect root causes of failures is not available.

  + Gap

    - Certain failures can trigger many alarms due to their dependency on the
      underlying root cause of failure. Knowing the root cause can help filter
      out unnecessary and overwhelming alarms.

* Related blueprints / workarounds

  + Monasca as of now lacks this feature, although the community is aware and
    working toward supporting it.

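One simple way to filter cascading alarms is to use a dependency graph and
keep only the alarms whose dependencies are all healthy. The sketch below
illustrates that idea. It is not a Monasca feature, and the function name, the
graph representation and the component names are hypothetical.

```python
def root_cause_alarms(alarming, depends_on):
    """Return alarming components with no alarming (transitive) dependency.

    alarming:   set of component names currently in alarm.
    depends_on: {component: set of components it directly depends on}.
    """
    def has_alarming_dependency(component):
        stack = list(depends_on.get(component, ()))
        seen = set()
        while stack:
            dep = stack.pop()
            if dep in seen:
                continue  # guards against cycles in the graph
            seen.add(dep)
            if dep in alarming:
                return True
            stack.extend(depends_on.get(dep, ()))
        return False

    return {c for c in alarming if not has_alarming_dependency(c)}


depends_on = {
    "vm1": {"host1"},
    "vm2": {"host1"},
    "vm3": {"host2"},
    "host1": {"switch1"},
}
alarming = {"vm1", "vm2", "vm3", "host1"}
roots = root_cause_alarms(alarming, depends_on)
print(sorted(roots))
```

Here the alarms on ``vm1`` and ``vm2`` are suppressed because ``host1`` is also
in alarm, while ``vm3`` is kept because nothing it depends on is alarming.
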
Sensor monitoring
_________________

* Type: 'missing (lack of functionality)'
* Description

  + To-be

    - Monasca should support retrieval of monitoring sensor data, for
      instance from IPMI.

  + As-is

    - Monasca does not monitor sensor data.

  + Gap

    - Sensor monitoring is very important. It provides operators with status
      information on the state of the physical infrastructure (e.g.
      temperature, fans).

* Related blueprints / workarounds

  + Monasca can be configured to use third-party monitoring solutions (e.g.
    Nagios, Cacti) for retrieving additional data.

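As an illustration of what sensor data retrieval could feed into a monitoring
pipeline, the sketch below parses pipe-separated sensor rows and lists the
sensors that are not healthy. The row layout (name, value, unit, status) is an
assumption loosely modelled on the text output of IPMI command-line tools, the
sample rows are made up, and the function names are hypothetical.

```python
def parse_sensor_rows(text):
    """Parse 'name | value | unit | status' rows into dictionaries."""
    readings = []
    for line in text.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) < 4 or not fields[0]:
            continue  # skip blank or malformed rows
        name, value, unit, status = fields[:4]
        try:
            value = float(value)
        except ValueError:
            value = None  # non-numeric reading, e.g. "na"
        readings.append(
            {"name": name, "value": value, "unit": unit, "status": status}
        )
    return readings


def faulty_sensors(readings):
    """Names of sensors whose status is anything other than 'ok'."""
    return [r["name"] for r in readings if r["status"] != "ok"]


# Made-up sample rows, for illustration only.
SAMPLE = """\
CPU Temp | 41.000 | degrees C | ok
FAN1     | 0.000  | RPM       | cr
PSU2     | na     | discrete  | ok
"""
readings = parse_sensor_rows(SAMPLE)
print(faulty_sensors(readings))
```

The resulting dictionaries could then be forwarded as metrics or events to
whichever monitoring tool is in use.
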
Hardware monitoring tools
-------------------------

Zabbix
^^^^^^

Zabbix is an open-source solution for monitoring availability and performance of
infrastructure components (i.e. servers and network devices), as well as
applications [ZABB]_. It can be customized for use with OpenStack. It is a
mature tool and has been proven to be able to scale to large systems with
hundreds of thousands of devices.

Delay in execution of actions
_____________________________

* Type: 'deficiency in performance'
* Description

  + To-be

    - After detecting a fault, the monitoring tool should immediately execute
      the appropriate action, e.g. inform the manager through the NB I/F.

  + As-is

    - A delay of around 10 seconds was measured in two independent testbed
      deployments.

  + Gap

    - The cause of the delay needs to be identified and fixed.

..
 vim: set tabstop=4 expandtab textwidth=80: