Gap analysis in upstream projects
=================================

This section presents the gaps found in existing VIM platforms. The analysis
focused on identifying gaps relative to the features and requirements specified
in Section 3.3.

VIM Northbound Interface
------------------------

Immediate Notification
^^^^^^^^^^^^^^^^^^^^^^

* Type: 'deficiency in performance'

- VIM has to notify the VIM user of the unavailability of a virtual resource
  (fault).

- The notification should be delivered within 1 second after the fault is
  detected or reported.

- Also, the following conditions/requirements have to be met:

- Only the owning user can receive notification of fault related to owned

- OpenStack Metering 'Ceilometer' can notify the owner of a virtual resource
  of its unavailability (fault), based on the alarm configuration set by the
  user.

- Ceilometer Alarm API:
  http://docs.openstack.org/developer/ceilometer/webapi/v2.html#alarms

- Alarm notifications are triggered by the alarm evaluator instead of the
  notification agents that might receive faults.

- Ceilometer Architecture:
  http://docs.openstack.org/developer/ceilometer/architecture.html#id1

- The evaluation interval should be equal to or larger than the configured
  pipeline interval for collection of the underlying metrics.

- https://github.com/openstack/ceilometer/blob/stable/juno/ceilometer/alarm/service.py#L38-42

- The collection interval has to be set large enough; the appropriate value
  depends on the size of the deployment and the number of metrics to be
  collected.
- The interval may not be less than one second, even in small deployments.
  The default value is 60 seconds.
- Alternative: OpenStack has a message bus to publish system events.
  The operator can allow the user to connect to it, but there are no
  functions to filter out events that should not be passed to the user
  or that were not requested by the user.

- Fault notifications cannot be received immediately by Ceilometer.
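
The kind of per-user filtering that the message bus alternative lacks can be
sketched as follows. This is only an illustration of the requirement, not an
existing OpenStack function: the event fields (``resource_id``,
``event_type``), the ownership map and the name ``filter_events_for_user`` are
all assumptions made for the example.

```python
from typing import Dict, Iterable, List, Set

# Hypothetical event shape: {"resource_id": ..., "event_type": ...}
Event = Dict[str, str]


def filter_events_for_user(
    events: Iterable[Event],
    user: str,
    owner_of: Dict[str, str],
    subscribed_types: Set[str],
) -> List[Event]:
    """Keep only events about resources owned by `user` whose type the
    user has asked to receive."""
    return [
        event
        for event in events
        if owner_of.get(event["resource_id"]) == user
        and event["event_type"] in subscribed_types
    ]


# Made-up sample data: two owners, two event types.
events = [
    {"resource_id": "vm-1", "event_type": "fault"},
    {"resource_id": "vm-2", "event_type": "fault"},
    {"resource_id": "vm-1", "event_type": "resize"},
]
owner_of = {"vm-1": "alice", "vm-2": "bob"}

visible = filter_events_for_user(events, "alice", owner_of, {"fault"})
print(visible)
```

Both conditions from the text are applied: events about resources owned by
other users are dropped, and so are events the user did not request.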

Maintenance Notification
^^^^^^^^^^^^^^^^^^^^^^^^

- VIM has to notify the VIM user of the unavailability of a virtual resource
  triggered by NFVI maintenance.
- Also, the following conditions/requirements have to be met:

- VIM should accept a maintenance message from the administrator and mark the
  target physical resource as "in maintenance".
- Only the owner of a virtual resource hosted by the target physical resource
  can receive the notification, which can trigger some process for
  applications running on the virtual resource (e.g. cut off

- AWS (just for study)

- AWS provides an API and CLI to view the status of a resource (VM) and to
  create instance status and system status alarms that notify the user when
  an instance has a failed status check.
  http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/monitoring-instances-status-check_sched.html
- AWS provides an API and CLI to view scheduled events, such as a reboot or
  retirement, for instances. Also, those events will be notified
  http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/monitoring-system-instance-status-check.html

- VIM user cannot receive maintenance notifications.

+ https://blueprints.launchpad.net/nova/+spec/service-status-notification

VIM Southbound interface
------------------------

Normalization of data collection models
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

- A normalized data format needs to be created to cope with the many data
  models from different monitoring solutions.

- Data can be collected from many places (e.g. Zabbix, Nagios, Cacti,
  Zenoss). Each solution establishes its own data model, and no common
  data abstraction model exists in OpenStack.

- A normalized data format does not exist.
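
To make the idea of a normalized data format concrete, the sketch below maps
records from two monitoring tools onto one common structure. Everything in it
is hypothetical: the ``NormalizedEvent`` fields, the two raw record layouts
("tool A" and "tool B") and the severity names are invented for illustration
and do not reflect the actual payloads of Zabbix, Nagios, Cacti or Zenoss.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass(frozen=True)
class NormalizedEvent:
    """Common record that every tool-specific adapter produces."""
    source: str    # monitoring tool that produced the record
    host: str      # monitored host
    metric: str    # what was measured or checked
    severity: str  # one of "ok", "warning", "critical"


def from_tool_a(raw: Dict[str, Any]) -> NormalizedEvent:
    # Hypothetical layout: numeric severity level, keys hostname/item/level.
    levels = {0: "ok", 1: "warning", 2: "critical"}
    return NormalizedEvent("tool_a", raw["hostname"], raw["item"],
                           levels[raw["level"]])


def from_tool_b(raw: Dict[str, Any]) -> NormalizedEvent:
    # Hypothetical layout: upper-case state string, keys host/service/state.
    return NormalizedEvent("tool_b", raw["host"], raw["service"],
                           raw["state"].lower())


a = from_tool_a({"hostname": "compute-1", "item": "cpu.load", "level": 2})
b = from_tool_b({"host": "compute-1", "service": "disk", "state": "WARNING"})
print(a)
print(b)
```

With such adapters, downstream fault handling only needs to understand one
record type regardless of which monitoring solution detected the fault.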

OpenStack offers a telemetry service, Ceilometer, for collecting measurements of
the utilization of physical and virtual resources [CEIL]_. Ceilometer can
collect a number of metrics across multiple OpenStack components, watch for
variations, and trigger alarms based on the collected data.

Scalability of fault aggregation
________________________________

* Type: 'scalability issue'

- Be able to scale to a large deployment, where thousands of monitoring
  events per second need to be analyzed.

- Performance issue when scaling to medium-sized deployments.

- Ceilometer seems to be unsuitable for monitoring medium and large scale

+ Usage of Zabbix for fault aggregation [ZABB]_. Zabbix can support a much
  higher number of fault events (up to 15 thousand events per second), but
  it also has an upper bound:
  http://blog.zabbix.com/scalable-zabbix-lessons-on-hitting-9400-nvps/2615/

+ Decentralized/hierarchical deployment with multiple instances, where each
  instance is responsible only for a small NFVI.

Monitoring of hardware and software
___________________________________

* Type: 'missing (lack of functionality)'

- OpenStack (as VIM) should monitor various hardware and software in the NFVI
  and handle faults on them through Ceilometer.
- OpenStack may have monitoring functionality in itself and can be
  integrated with third-party monitoring tools.
- OpenStack needs to be able to detect the faults listed in the Annex.

- For each deployment of OpenStack, the operator is responsible for
  configuring monitoring tools with relevant scripts or plugins in order to
  monitor hardware and software.
- OpenStack Ceilometer does not monitor hardware and software to capture

- Ceilometer is not able to detect and handle all faults listed in the Annex.

* Related blueprints / workarounds

- Use other dedicated monitoring tools like Zabbix or Monasca

OpenStack Nova [NOVA]_ is a mature and widely known and used component in
OpenStack cloud deployments. It is the main part of an
"infrastructure-as-a-service" system providing a cloud computing fabric
controller, supporting a wide diversity of virtualization and container

Nova has proven throughout these past years to be highly available and
fault-tolerant. Featuring its own API, it also provides a compatibility API with

Correct states when compute host is down
________________________________________

* Type: 'missing (lack of functionality)'

- An API is needed to change the VM power_state when a host has failed.
- An API is needed to change the nova-compute state.
- There could be a single API to change the states of all VMs
  belonging to a specific host.
- As an external system monitoring the infrastructure calls these APIs, the
  change can be
- Correlation actions can be faster and automated as states are reliable.
- User will be able to read states from OpenStack and trust they are

- When a VM goes down due to a host HW, host OS or hypervisor failure,
  nothing happens in OpenStack. The VMs of a crashed host/hypervisor are
  reported to be live and OK through the OpenStack API.
- The nova-compute state might change too slowly, or the state is not
  reliable if the VMs are also expected to be down. As a result, VMs can be
  scheduled to a failed host, and the slowness blocks evacuation.

- OpenStack does not change its states fast and reliably enough.
- An API is missing that lets an external system change the states and
  trust that the states are then reliable (external system has fenced failed
- User cannot read all the states from OpenStack nor trust they are right.

+ https://blueprints.launchpad.net/nova/+spec/mark-host-down
+ https://blueprints.launchpad.net/python-novaclient/+spec/support-force-down-service
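
The behaviour asked for above, a single call that an external monitor can use
to correct the service state and the state of every VM on a failed host, can
be modelled in a few lines. This is a toy in-memory model, not Nova code: the
``Host`` class, the ``mark_host_down`` function and the state strings are
assumptions made for the example, and the caller is assumed to have already
fenced the failed host.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Host:
    """Toy model of a compute host and the VMs it runs."""
    name: str
    service_up: bool = True
    vms: Dict[str, str] = field(default_factory=dict)  # VM id -> power state


def mark_host_down(host: Host) -> List[str]:
    """Mark the compute service as down and set every VM on the host to
    'shutdown'. Returns the ids of the VMs whose state was changed.

    Assumption: the external system has fenced the host before calling.
    """
    host.service_up = False
    changed = [vm for vm, state in host.vms.items() if state != "shutdown"]
    for vm in changed:
        host.vms[vm] = "shutdown"
    return changed


host = Host("compute-1", vms={"vm-1": "running", "vm-2": "shutdown",
                              "vm-3": "running"})
changed = mark_host_down(host)
print(changed, host.service_up, host.vms)
```

After the call, a user reading the states sees the service as down and all
VMs on the host as shut down, so a scheduler consulting the same states would
no longer treat the host as a valid target.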

Evacuate VMs in Maintenance mode
________________________________

- When maintenance mode for a compute host is set, trigger VM evacuation to
  available compute nodes before bringing the host down for maintenance.

- When a compute node is set to maintenance mode, OpenStack schedules
  evacuation of all VMs to available compute nodes only if the in-maintenance
  compute node runs the XenAPI or VMware ESX hypervisor. Other hypervisors
  (e.g. KVM) are not supported and, hence, guest VMs will likely stop running
  due to maintenance actions the administrator may perform (e.g. hardware
  upgrades,

- The Nova libvirt hypervisor driver does not implement automatic evacuation
  of guest VMs when compute nodes are set to maintenance mode (``$ nova
  host-update --maintenance enable <hostname>``).
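
The desired behaviour, moving all guest VMs to available compute nodes before
the host goes down, amounts to computing a placement plan first. The sketch
below shows one simple way to do that. It is a hypothetical planner, not part
of Nova: the function name ``plan_evacuation``, the "free slots" capacity
model and the most-free-slots-first policy are assumptions for illustration.

```python
from typing import Dict, List


def plan_evacuation(vms: List[str], free_slots: Dict[str, int]) -> Dict[str, str]:
    """Assign each VM of the in-maintenance host to the available host that
    currently has the most free slots. Raise if capacity runs out."""
    remaining = dict(free_slots)  # do not modify the caller's dict
    plan: Dict[str, str] = {}
    for vm in vms:
        target = max(remaining, key=remaining.get, default=None)
        if target is None or remaining[target] == 0:
            raise RuntimeError(f"no capacity left for {vm}")
        plan[vm] = target
        remaining[target] -= 1
    return plan


plan = plan_evacuation(["vm-1", "vm-2", "vm-3"], {"host-b": 2, "host-c": 1})
print(plan)
```

Computing the whole plan up front lets the maintenance request be rejected
when capacity is insufficient, instead of leaving some VMs behind on a host
that is about to be shut down.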

Monasca is an open-source monitoring-as-a-service (MONaaS) solution that
integrates with OpenStack. Even though it is still in its early days, the
community intends the platform to be multi-tenant, highly scalable,
performant and fault-tolerant. It provides a streaming alarm engine, a
notification engine, and a northbound REST API that users can use to interact
with Monasca. Hundreds of thousands of metrics per second can be processed

* Type: 'missing (lack of functionality)'

- Detect the failure and perform a root cause analysis to filter out other
  alarms that may be triggered due to their cascading relation.

- A mechanism to detect root causes of failures is not available.

- Certain failures can trigger many alarms due to their dependency on the
  underlying root cause of failure. Knowing the root cause can help filter
  out unnecessary and overwhelming alarms.

* Related blueprints / workarounds

+ Monasca currently lacks this feature, although the community is aware of
  the gap and is working toward supporting it.
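
The filtering described above can be illustrated with a small dependency
graph: an alarm is treated as a symptom if anything the component depends on,
directly or transitively, is also alarming. This is a minimal sketch of the
general idea, not a Monasca feature or API. The function name, the component
names and the ``depends_on`` map are hypothetical.

```python
from typing import Dict, List, Set


def root_cause_alarms(alarming: Set[str],
                      depends_on: Dict[str, List[str]]) -> Set[str]:
    """Return the alarming components that have no alarming dependency,
    directly or transitively. The rest are treated as cascading alarms."""

    def has_alarming_dependency(component: str, seen: Set[str]) -> bool:
        for dep in depends_on.get(component, []):
            if dep in seen:  # already visited; also guards against cycles
                continue
            seen.add(dep)
            if dep in alarming or has_alarming_dependency(dep, seen):
                return True
        return False

    return {c for c in alarming if not has_alarming_dependency(c, {c})}


# Made-up topology: VMs depend on hosts, one host depends on a switch.
depends_on = {
    "vm-1": ["host-1"],
    "vm-2": ["host-1"],
    "vm-3": ["host-2"],
    "host-1": ["switch-1"],
}
alarming = {"vm-1", "vm-2", "vm-3", "host-1"}

roots = root_cause_alarms(alarming, depends_on)
print(sorted(roots))
```

In this example the alarms for ``vm-1`` and ``vm-2`` are suppressed because
their host is alarming, while ``vm-3`` is kept because nothing it depends on
has raised an alarm.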

* Type: 'missing (lack of functionality)'

- It should support monitoring sensor data retrieval, for instance, from

- Monasca does not monitor sensor data.

- Sensor monitoring is very important. It gives operators the status of the
  physical infrastructure (e.g. temperature, fans).

* Related blueprints / workarounds

+ Monasca can be configured to use third-party monitoring solutions (e.g.
  Nagios, Cacti) for retrieving additional data.

Hardware monitoring tools
-------------------------

Zabbix is an open-source solution for monitoring availability and performance of
infrastructure components (i.e. servers and network devices), as well as
applications [ZABB]_. It can be customized for use with OpenStack. It is a
mature tool and has been proven to be able to scale to large systems with

Delay in execution of actions
_____________________________

* Type: 'deficiency in performance'

- After detecting a fault, the monitoring tool should immediately execute
  the appropriate action, e.g. inform the manager through the NB I/F

- A delay of around 10 seconds was measured in two independent testbed

- The cause of the delay needs to be identified and fixed.

..
 vim: set tabstop=4 expandtab textwidth=80: