+.. This work is licensed under a Creative Commons Attribution 4.0 International License.
+.. http://creativecommons.org/licenses/by/4.0
+
Detailed architecture and interface specification
=================================================
Simple information elements:
-* SubscriptionID: identifies a subscription to receive fault or maintenance
+* SubscriptionID (Identifier): identifies a subscription to receive fault or maintenance
notifications.
-* NotificationID: identifies a fault or maintenance notification.
+* NotificationID (Identifier): identifies a fault or maintenance notification.
* VirtualResourceID (Identifier): identifies a virtual resource affected by a
fault or a maintenance action of the underlying physical resource.
* PhysicalResourceID (Identifier): identifies a physical resource affected by a
* EventTime (Datetime): Time when the fault was observed.
* EventStartTime and EventEndTime (Datetime): Datetime range that can be used in
a FaultQueryFilter to narrow down the faults to be queried.
-* ProbableCause: information about the probable cause of the fault.
+* ProbableCause (String): information about the probable cause of the fault.
* CorrelatedFaultID (Integer): list of other faults correlated to this fault.
* isRootCause (Boolean): Parameter indicating if this fault is the root for
other correlated faults. If TRUE, then the faults listed in the parameter
* ZoneID (Identifier): Identifier of the resource zone. A resource zone is the
logical separation of physical and software resources in an NFVI deployment
for physical isolation, redundancy, or administrative designation.
-* Metadata (Key-Value-Pairs): provides additional information of a physical
+* Metadata (Key-value pair): provides additional information of a physical
resource in maintenance/error state.
Complex information elements (see also UML diagrams in :numref:`figure13`
particular describing the information elements used for alarm notifications.
- FaultID [1] (Identifier)
- - FaultType [1]
+ - FaultType [1] (String)
- Severity [1] (Integer)
- EventTime [1] (Datetime)
- - ProbableCause [1]
+ - ProbableCause [1] (String)
- CorrelatedFaultID [0..*] (Identifier)
- FaultDetails [0..*] (Key-value pair)
- PhysicalResourceID [1] (Identifier)
- PhysicalResourceState [1] (String): mandates the new state of the physical
resource.
+ - Metadata [0..*] (Key-value pair)
* PhysicalResourceInfoClass:
- FirmwareVersion [0..1] (String)
- HypervisorVersion [0..1] (String)
- ZoneID [0..1] (Identifier)
+ - Metadata [0..*] (Key-value pair)
* StateQueryFilterClass: narrows down a StateQueryRequest, for example it limits
the query to certain physical resources, a certain zone, or a given resource
resources. For each resource, information about the current state, the
firmware version, etc. is provided.
+NFV IFA, OPNFV Doctor and AODH alarms
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+This section compares the alarm interfaces of ETSI NFV IFA with the specifications
+of this document and the alarm class of AODH.
+
+ETSI NFV specifies an interface for alarms from virtualised resources in ETSI GS
+NFV-IFA 005 [ENFV]_. The interface specifies an Alarm class and two notifications plus
+operations to query alarm instances and to subscribe to the alarm notifications.
+
+The specification in this document has a structure that is very similar to the
+ETSI NFV specifications. The notifications differ in that an alarm notification
+in the NFV interface defines a single fault for a single resource while the
+notification specified in this document can contain multiple faults for
+multiple resources. The Doctor specification lacks the detailed time stamps
+of the NFV specification, which are essential for synchronizing the alarm list
+using the query operation. The detailed time stamps are also of value in the event
+and alarm history DBs.
+
+AODH defines a base class for alarms, not for the notifications. This means that
+some of the dynamic attributes of the ETSI NFV alarm type, like alarmRaisedTime,
+are not applicable to the AODH alarm class but are attributes of the actual
+notifications. (Description of these attributes will be added later.) The AODH alarm
+class lacks some attributes present in the NFV specification, namely fault details
+and correlated alarms. Instead, the AODH alarm class has attributes for actions,
+rules, and user and project IDs.
+
+
++------------------------+------------------------+---------------------+---------------------------------------------+---------------------------------------+
+| ETSI NFV Alarm Type    | OPNFV Doctor           | AODH Event Alarm    | Description / Comment                       | Recommendations                       |
+|                        | Requirement Specs      | Notification        |                                             |                                       |
++========================+========================+=====================+=============================================+=======================================+
+| alarmId                | FaultId                | alarm_id            | Identifier of an alarm.                     | \-                                    |
++------------------------+------------------------+---------------------+---------------------------------------------+---------------------------------------+
+| \-                     | \-                     | alarm_name          | Human readable alarm name.                  | May be added in ETSI NFV Stage 3.     |
++------------------------+------------------------+---------------------+---------------------------------------------+---------------------------------------+
+| managedObjectId        | VirtualResourceId      | (reason)            | Identifier of the affected virtual resource | \-                                    |
+|                        |                        |                     | is part of the AODH reason parameter.       |                                       |
++------------------------+------------------------+---------------------+---------------------------------------------+---------------------------------------+
+| \-                     | \-                     | user_id, project_id | User and project identifiers.               | May be added in ETSI NFV Stage 3.     |
++------------------------+------------------------+---------------------+---------------------------------------------+---------------------------------------+
+| alarmRaisedTime        | \-                     | \-                  | Timestamp when alarm was raised.            | To be added to Doctor and AODH. May   |
+|                        |                        |                     |                                             | be derived (e.g. in a shimlayer) from |
+|                        |                        |                     |                                             | the AODH alarm history.               |
++------------------------+------------------------+---------------------+---------------------------------------------+---------------------------------------+
+| alarmChangedTime       | \-                     | \-                  | Timestamp when alarm was changed/updated.   | see above                             |
++------------------------+------------------------+---------------------+---------------------------------------------+---------------------------------------+
+| alarmClearedTime       | \-                     | \-                  | Timestamp when alarm was cleared.           | see above                             |
++------------------------+------------------------+---------------------+---------------------------------------------+---------------------------------------+
+| eventTime              | \-                     | \-                  | Timestamp when alarm was first observed by  | see above                             |
+|                        |                        |                     | the Monitor.                                |                                       |
++------------------------+------------------------+---------------------+---------------------------------------------+---------------------------------------+
+| \-                     | EventTime              | generated           | Timestamp of the Notification.              | Update parameter name in Doctor spec. |
+|                        |                        |                     |                                             | May be added in ETSI NFV Stage 3.     |
++------------------------+------------------------+---------------------+---------------------------------------------+---------------------------------------+
+| state:                 | VirtualResourceState:  | current: ok, alarm, | ETSI NFV IFA 005/006 lists example alarm    | Maintenance state is missing in AODH. |
+| E.g. Fired, Updated,   | E.g. normal, down,     | insufficient_data   | states.                                     | List of alarm states will be          |
+| Cleared                | maintenance, error     |                     |                                             | specified in ETSI NFV Stage 3.        |
++------------------------+------------------------+---------------------+---------------------------------------------+---------------------------------------+
+| perceivedSeverity:     | Severity (Integer)     | Severity:           | ETSI NFV IFA 005/006 lists example          | List of alarm states will be          |
+| E.g. Critical, Major,  |                        | low (default),      | perceived severity values.                  | specified in ETSI NFV Stage 3.        |
+| Minor, Warning,        |                        | moderate, critical  |                                             |                                       |
+| Indeterminate, Cleared |                        |                     |                                             | **OPNFV: Severity (Integer)**:        |
+|                        |                        |                     |                                             |  * update OPNFV Doctor specification  |
+|                        |                        |                     |                                             |    to *Enum*                          |
+|                        |                        |                     |                                             |                                       |
+|                        |                        |                     |                                             | **perceivedSeverity=Indeterminate**:  |
+|                        |                        |                     |                                             |  * remove value *Indeterminate* in    |
+|                        |                        |                     |                                             |    IFA and map undefined values to    |
+|                        |                        |                     |                                             |    “minor” severity, or               |
+|                        |                        |                     |                                             |  * add value *indeterminate* in AODH  |
+|                        |                        |                     |                                             |    and make it the default value.     |
+|                        |                        |                     |                                             |                                       |
+|                        |                        |                     |                                             | **perceivedSeverity=Cleared**:        |
+|                        |                        |                     |                                             |  * remove value *Cleared* in IFA as   |
+|                        |                        |                     |                                             |    the information about a cleared    |
+|                        |                        |                     |                                             |    alarm can be derived from          |
+|                        |                        |                     |                                             |    the alarm state parameter, or      |
+|                        |                        |                     |                                             |  * add value *cleared* in AODH and    |
+|                        |                        |                     |                                             |    set a rule that the severity is    |
+|                        |                        |                     |                                             |    “cleared” when the state is *ok*.  |
++------------------------+------------------------+---------------------+---------------------------------------------+---------------------------------------+
+| faultType              | FaultType              | event_type in       | Type of the fault, e.g. “CPU failure” of a  | OpenStack Alarming (Aodh) can use a   |
+|                        |                        | reason_data         | compute resource, in machine interpretable  | fuzzy matching with wildcard string,  |
+|                        |                        |                     | format.                                     | "compute.cpu.failure".                |
++------------------------+------------------------+---------------------+---------------------------------------------+---------------------------------------+
+| N/A                    | N/A                    | type = "event"      | Type of the notification. For fault         | \-                                    |
+|                        |                        |                     | notifications the type in AODH is “event”.  |                                       |
++------------------------+------------------------+---------------------+---------------------------------------------+---------------------------------------+
+| probableCause          | ProbableCause          | \-                  | Probable cause of the alarm.                | May be provided (e.g. in a shimlayer) |
+|                        |                        |                     |                                             | based on Vitrage topology awareness / |
+|                        |                        |                     |                                             | root-cause-analysis.                  |
++------------------------+------------------------+---------------------+---------------------------------------------+---------------------------------------+
+| isRootCause            | IsRootCause            | \-                  | Boolean indicating whether the fault is the | see above                             |
+|                        |                        |                     | root cause of other faults.                 |                                       |
++------------------------+------------------------+---------------------+---------------------------------------------+---------------------------------------+
+| correlatedAlarmId      | CorrelatedFaultId      | \-                  | List of IDs of correlated faults.           | see above                             |
++------------------------+------------------------+---------------------+---------------------------------------------+---------------------------------------+
+| faultDetails           | FaultDetails           | \-                  | Additional details about the fault/alarm.   | FaultDetails information element will |
+|                        |                        |                     |                                             | be specified in ETSI NFV Stage 3.     |
++------------------------+------------------------+---------------------+---------------------------------------------+---------------------------------------+
+| \-                     | \-                     | action, previous    | Additional AODH alarm related parameters.   | \-                                    |
++------------------------+------------------------+---------------------+---------------------------------------------+---------------------------------------+
+
+Table: Comparison of alarm attributes
+
+The primary area of improvement should be the alignment of the perceived severity.
+This is important for a quick and accurate evaluation of the alarm. AODH should
+thus also support the X.733 values Critical, Major, Minor, Warning and Indeterminate.
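+
+As a rough illustration of why this alignment matters, the following Python sketch
+maps the X.733 values onto the three severity values that AODH offers according to
+the table above. The mapping itself and the helper names ``to_aodh_severity`` and
+``is_lossy`` are assumptions made for this example only; they are not defined by
+ETSI NFV, Doctor, or AODH.
+
+.. code-block:: python
+
+    # X.733 values named in this section and AODH values from the table above.
+    X733_SEVERITIES = ("Critical", "Major", "Minor", "Warning", "Indeterminate")
+    AODH_SEVERITIES = ("low", "moderate", "critical")
+
+    # Hypothetical mapping, chosen only for illustration.
+    X733_TO_AODH = {
+        "Critical": "critical",
+        "Major": "moderate",
+        "Minor": "low",
+        "Warning": "low",
+        "Indeterminate": "low",  # falls back to the AODH default value
+    }
+
+
+    def to_aodh_severity(perceived_severity):
+        """Translate an X.733 perceived severity into an AODH severity."""
+        return X733_TO_AODH[perceived_severity]
+
+
+    def is_lossy(mapping):
+        """Return True if at least two source values share one target value."""
+        return len(set(mapping.values())) < len(mapping)
+
+
+    print(to_aodh_severity("Major"))
+    print(is_lossy(X733_TO_AODH))
+
+Any mapping of five values onto three is lossy, so a consumer of the AODH alarm
+cannot recover the original perceived severity without further information.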
+
+The detailed time stamps (raised, changed, cleared) which are essential for
+synchronizing the alarm list using a query operation should be added to the
+Doctor specification.
+
+Another area that needs alignment is the so-called alarm state in NFV. Here,
+however, we must consider what can be an attribute of the notification and what
+should be a property of the alarm instance. This will be analyzed later.
+
+
+Detailed southbound interface specification
+-------------------------------------------
+
+This section specifies the southbound interfaces for fault management
+between the Monitors and the Inspector.
+Although southbound interfaces should be flexible enough to handle various events
+from different types of Monitors, we define a unified event API in order to improve
+interoperability between the Monitors and the Inspector.
+This does not limit the implementation of the Monitor and the Inspector, as these
+can be extended to support failures from intelligent inspection such as prediction.
+
+Note: The interface definition will be aligned with current work in the ETSI NFV
+IFA working group.
+
+Fault event interface
+^^^^^^^^^^^^^^^^^^^^^
+
+This interface allows the Monitors to notify the Inspector about an event which
+was captured by the Monitor and may affect resources managed in the VIM.
+
+EventNotification
+_________________
+
+
+Event notification including a fault description.
+The subject of this notification is an event, not specifically a fault or an error.
+This allows us to use a generic event format or a framework built outside of the
+Doctor project.
+The parameters below shall be mandatory, but keys in 'Details' can be optional.
+
+Parameters:
+
+* Time [1]: Datetime when the fault was observed in the Monitor.
+* Type [1]: Type of the event, which will be used for correlation in the Inspector.
+* Details [0..1]: Additional information in the form of key-value pairs.
+  Keys shall be defined depending on the Type of the event.
+
+E.g.:
+
+.. code-block:: json
+
+    {
+        "event": {
+            "time": "2016-04-12T08:00:00",
+            "type": "compute.host.down",
+            "details": {
+                "hostname": "compute-1",
+                "source": "sample_monitor",
+                "cause": "link-down",
+                "severity": "critical",
+                "status": "down",
+                "monitor_id": "monitor-1",
+                "monitor_event_id": "123"
+            }
+        }
+    }
+
+Optional parameters in 'Details':
+
+* Hostname: the hostname on which the event occurred.
+* Source: the display name of the reporter of this event. This is not limited to
+  monitors; other entities such as 'KVM' can be specified.
+* Cause: description of the cause of this event, which may differ from the type
+  of this event.
+* Severity: the severity of this event as set by the monitor.
+* Status: the status of the target object in which the error occurred.
+* MonitorID: the ID of the monitor sending this event.
+* MonitorEventID: the ID of the event in the monitor. Operators can use it to
+  track the event in the monitor log.
+* RelatedTo: an array of IDs that are related to this event.
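+
+A Monitor could assemble and check such a notification roughly as in the following
+Python sketch. The helper names ``build_event`` and ``validate_event`` are
+hypothetical, and the sketch assumes that 'time' and 'type' must be present while
+'details' may be omitted, following the cardinalities listed above.
+
+.. code-block:: python
+
+    import json
+    from datetime import datetime
+
+    TIME_FORMAT = "%Y-%m-%dT%H:%M:%S"  # format used in the example above
+
+
+    def build_event(time, event_type, details=None):
+        """Return an EventNotification body as a plain dict."""
+        event = {"time": time.strftime(TIME_FORMAT), "type": event_type}
+        if details:
+            event["details"] = dict(details)
+        return {"event": event}
+
+
+    def validate_event(body):
+        """Raise ValueError unless 'time' and 'type' are present and well-formed."""
+        event = body.get("event")
+        if not isinstance(event, dict):
+            raise ValueError("missing 'event' object")
+        for key in ("time", "type"):
+            if key not in event:
+                raise ValueError("missing mandatory parameter: " + key)
+        # strptime raises ValueError if the timestamp has another format.
+        datetime.strptime(event["time"], TIME_FORMAT)
+        if not isinstance(event.get("details", {}), dict):
+            raise ValueError("'details' must hold key-value pairs")
+        return body
+
+
+    notification = validate_event(build_event(
+        datetime(2016, 4, 12, 8, 0, 0),
+        "compute.host.down",
+        {"hostname": "compute-1", "severity": "critical", "status": "down"},
+    ))
+    print(json.dumps(notification, indent=4))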
+
+In addition, a bulk API can receive multiple events in a single HTTP POST
+message by using the 'events' wrapper as follows:
+
+.. code-block:: json
+
+    {
+        "events": [
+            {
+                "event": {
+                    "time": "2016-04-12T08:00:00",
+                    "type": "compute.host.down",
+                    "details": {}
+                }
+            },
+            {
+                "event": {
+                    "time": "2016-04-12T08:00:00",
+                    "type": "compute.host.nic.error",
+                    "details": {}
+                }
+            }
+        ]
+    }
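+
+The following Python sketch shows one way to produce and consume the 'events'
+wrapper. It assumes that each list entry is an object with a single 'event' key;
+the helper names ``wrap_events`` and ``unwrap_events`` are hypothetical and not
+part of this specification.
+
+.. code-block:: python
+
+    import json
+
+
+    def wrap_events(events):
+        """Wrap individual event dicts into one bulk message body."""
+        return {"events": [{"event": dict(event)} for event in events]}
+
+
+    def unwrap_events(body):
+        """Return the list of event dicts from a single or a bulk body."""
+        if "events" in body:
+            return [item["event"] for item in body["events"]]
+        return [body["event"]]
+
+
+    bulk = wrap_events([
+        {"time": "2016-04-12T08:00:00", "type": "compute.host.down",
+         "details": {}},
+        {"time": "2016-04-12T08:00:00", "type": "compute.host.nic.error",
+         "details": {}},
+    ])
+    payload = json.dumps(bulk)  # body of a single HTTP POST message
+    print([event["type"] for event in unwrap_events(json.loads(payload))])
+
+Accepting both forms in ``unwrap_events`` lets an Inspector process single and
+bulk notifications with the same code path.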
+
+
+
+
Blueprints
----------
and service states correctly.
.. [*] https://blueprints.launchpad.net/nova/+spec/pacemaker-servicegroup-driver
-
-..
- vim: set tabstop=4 expandtab textwidth=80: