JIRA: DOCTOR-81
Change-Id: I3a9c0d020bcbdbb261df8209556dbdf488f3c3db
Signed-off-by: Gerald Kunzmann <kunzmann@docomolab-euro.com>
After receiving the MaintenanceRequest, the VIM decides on the actions to be taken
based on maintenance policies predefined by the affected Consumer(s).
-.. [#timeout] Timeout is set by the Administrator and corresponds to the maximum time to empty the physical resources.
+.. [#timeout] Timeout is set by the Administrator and corresponds to the maximum time
+ to empty the physical resources.
.. figure:: images/figure5a.png
:name: figure5a
It consists of the following steps:
5. The Consumer C3 switches to standby configuration (STBY).
-6. Instructions from Consumers C2/C3 are shared to VIM requesting certain actions to be performed (steps 6a, 6b).
- The VIM executes the requested actions and sends back a NACK to consumer C2 (step 6d) as the
- migration of the virtual resource(s) is not completed by the given timeout.
+6. Instructions from Consumers C2/C3 are sent to the VIM, requesting certain actions to be performed
+ (steps 6a, 6b). The VIM executes the requested actions and sends back a NACK to Consumer C2
+ (step 6d), as the migration of the virtual resource(s) is not completed by the given timeout.
7. The VIM switches the physical resources to "enabled" state.
-8. MaintenanceResponse is sent from VIM to inform the Administrator that the maintenance action cannot start.
+8. MaintenanceNotification is sent from VIM to inform the Administrator that the maintenance action
+ cannot start.
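
The timeout branch described in steps 6d to 8 can be sketched as follows. This is a minimal illustration, not the VIM implementation: the function name, the tuple layout, and the ``ACK`` reply on the success path are assumptions that do not appear in the steps above.

```python
from typing import List, Tuple

Message = Tuple[str, str, str]  # (sender, receiver, payload)


def maintenance_timeout_flow(migration_seconds: float,
                             timeout_seconds: float) -> List[Message]:
    """Return the messages the VIM emits once the requested actions ran."""
    if migration_seconds <= timeout_seconds:
        # Assumption: the success path is not part of the steps above,
        # so a plain ACK stands in for it here.
        return [("VIM", "C2", "ACK")]
    return [
        ("VIM", "C2", "NACK"),                                # step 6d
        ("VIM", "physical resources", "enabled"),             # step 7
        ("VIM", "Administrator", "MaintenanceNotification"),  # step 8
    ]


if __name__ == "__main__":
    for message in maintenance_timeout_flow(migration_seconds=120,
                                            timeout_seconds=60):
        print(message)
```

The timeout itself is the value set by the Administrator, so in this sketch it is passed in rather than hard-coded.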
- Fault notifications cannot be received immediately by Ceilometer.
+* Solved by
+
+ + Event Alarm Evaluator:
+ https://specs.openstack.org/openstack/ceilometer-specs/specs/liberty/event-alarm-evaluator.html
+ + New OpenStack alarms and notifications project AODH:
+ http://docs.openstack.org/developer/aodh/
+
Maintenance Notification
^^^^^^^^^^^^^^^^^^^^^^^^
- VIM user cannot receive maintenance notifications.
+ https://blueprints.launchpad.net/nova/+spec/service-status-notification
- Normalized data format does not exist.
+* Solved by
+
+ + Specification in Section :ref:`southbound`.
+
- Ceilometer seems to be unsuitable for monitoring medium and large scale
NFVI deployments.
+ Usage of Zabbix for fault aggregation [ZABB]_. Zabbix can support a much
higher number of fault events (up to 15 thousand events per second, but
- OpenStack Ceilometer does not monitor hardware and software to capture
faults.
- - Ceilometer is not able to detect and handle all faults listed in the Annex.
+ - Ceilometer is not able to detect and handle all faults listed in the Annex.
-* Related blueprints / workarounds
- - Use other dedicated monitoring tools like Zabbix or Monasca
+ + Use of dedicated monitoring tools like Zabbix or Monasca.
+ See :ref:`nfvi_faults`.
- - There needs to be API to change VM power_State in case host has failed.
- - There needs to be API to change nova-compute state.
+ - The API shall support changing the VM power state in case the host has failed.
+ - The API shall support changing the nova-compute state.
- There could be a single API to change different VM states for all VMs
- belonging to specific host.
- - As external system monitoring the infra calls these APIs change can be
- fast and reliable.
- - Correlation actions can be faster and automated as states are reliable.
- - User will be able to read states from OpenStack and trust they are
- correct.
+ belonging to a specific host.
+ - Support external systems that monitor the infrastructure and resources
+ and that are able to call the API quickly and reliably.
+ - Resource states are reliable such that correlation actions can be fast and automated.
+ - User shall be able to read states from OpenStack and trust they are correct.
+ Gap
- OpenStack does not change its states fast and reliably enough.
- - There is API missing to have external system to change states and to
- trust the states are then reliable (external system has fenced failed
- host).
+ - The API does not allow an external system to change states and to
+ trust that the states are reliable (the external system has fenced the failed host).
- User cannot read all the states from OpenStack nor trust they are right.
+ https://blueprints.launchpad.net/nova/+spec/mark-host-down
+ https://blueprints.launchpad.net/python-novaclient/+spec/support-force-down-service
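
As an illustration of what an external monitoring system might send once it has fenced a failed host, the sketch below only builds the request and performs no network call. It assumes the request shape that these blueprints led to in the Nova API (microversion 2.11); the path, header, and body fields should be checked against the Nova API reference, and the helper name and host name are hypothetical.

```python
import json
from typing import Dict, Tuple


def build_force_down_request(host: str,
                             forced_down: bool = True
                             ) -> Tuple[str, str, Dict[str, str], str]:
    """Build (method, path, headers, body) for marking nova-compute down."""
    # Assumption: endpoint and microversion as introduced by the
    # mark-host-down blueprint; verify before use.
    path = "/os-services/force-down"
    headers = {
        "Content-Type": "application/json",
        "X-OpenStack-Nova-API-Version": "2.11",
    }
    body = {"host": host, "binary": "nova-compute", "forced_down": forced_down}
    return "PUT", path, headers, json.dumps(body)


if __name__ == "__main__":
    # "compute-1" is a placeholder host name.
    method, path, headers, body = build_force_down_request("compute-1")
    print(method, path, headers, body)
```

Keeping the request builder separate from the transport makes it easy to verify the payload before wiring it to an authenticated HTTP client.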
underlying root cause of failure. Knowing the root cause can help filter
out unnecessary and overwhelming alarms.
-* Related blueprints / workarounds
+ Monasca as of now lacks this feature, although the community is aware and
working toward supporting it.
- Sensor monitoring is very important. It provides operators status
on the state of the physical infrastructure (e.g. temperature, fans).
-* Related blueprints / workarounds
+ Monasca can be configured to use third-party monitoring solutions (e.g.
Nagios, Cacti) for retrieving additional data.
- - Cause of the delay needs to be identified and fixed
+ - The delay is caused by periodic evaluation and notification. The period defaults
+ to 30s and can be reduced to 5s, but not below.
+ https://github.com/zabbix/zabbix/blob/trunk/conf/zabbix_server.conf#L329
+
..
vim: set tabstop=4 expandtab textwidth=80:
however consider what can be attributes of the notification vs. what should be a
property of the alarm instance. This will be analyzed later.
Detailed southbound interface specification
-------------------------------------------
.. This work is licensed under a Creative Commons Attribution 4.0 International License.
.. http://creativecommons.org/licenses/by/4.0
Annex: NFVI Faults
=================================================