resources. For each resource, information about the current state, the
firmware version, etc. is provided.
+
+Detailed southbound interface specification
+-------------------------------------------
+
+This section specifies the southbound interfaces for fault management
+between the Monitors and the Inspector.
+Although southbound interfaces should be flexible enough to handle various
+events from different types of Monitors, we define a unified event API in
+order to improve interoperability between the Monitors and the Inspector.
+This does not limit implementations of the Monitor and the Inspector, as both
+could be extended in order to support additional sources of failures, such as
+prediction by intelligent inspection.
+
+Note: The interface definition will be aligned with the current work in the
+ETSI NFV IFA working group.
+
+Fault event interface
+^^^^^^^^^^^^^^^^^^^^^
+
+This interface allows the Monitors to notify the Inspector about an event
+which was captured by a Monitor and may affect resources managed in the VIM.
+
+EventNotification
+_________________
+
+
+Event notification including a fault description.
+The entity of this notification is an event, not specifically a fault or an
+error. This allows us to reuse a generic event format or framework developed
+outside of the Doctor project.
+The parameters below shall be mandatory, but keys in 'Details' can be
+optional.
+
+Parameters:
+
+* Time [1]: Datetime when the fault was observed in the Monitor.
+* Type [1]: Type of the event; used by the Inspector to process correlation.
+* Details [0..1]: Additional information given as key-value pairs.
+  Keys shall be defined depending on the Type of the event.
+
+E.g.:
+
+.. code-block:: json
+
+    {
+        "event": {
+            "time": "2016-04-12T08:00:00",
+            "type": "compute.host.down",
+            "details": {
+                "hostname": "compute-1",
+                "source": "sample_monitor",
+                "cause": "link-down",
+                "severity": "critical",
+                "status": "down",
+                "monitor_id": "monitor-1",
+                "monitor_event_id": "123"
+            }
+        }
+    }
+
+Optional parameters in 'Details':
+
+* Hostname: the hostname on which the event occurred.
+* Source: the display name of the reporter of this event. This is not limited
+  to a Monitor; other entities such as 'KVM' can be specified.
+* Cause: description of the cause of this event, which may differ from the
+  type of this event.
+* Severity: the severity of this event, set by the Monitor.
+* Status: the status of the target object in which the error occurred.
+* MonitorID: the ID of the Monitor sending this event.
+* MonitorEventID: the ID of the event in the Monitor. This can be used by the
+  operator when tracking the Monitor log.
+* RelatedTo: an array of IDs of events related to this event.
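The event format above can be checked programmatically. As a minimal sketch (the function and constant names here are illustrative, not part of the interface), an Inspector could validate an incoming notification as follows:

```python
# Minimal sketch of Inspector-side validation of an EventNotification.
# 'time' and 'type' are the mandatory parameters; 'details' is optional
# and its keys depend on the event type.

MANDATORY_KEYS = {'time', 'type'}

def validate_event(payload):
    """Return True if 'payload' is a well-formed event notification."""
    event = payload.get('event')
    if not isinstance(event, dict):
        return False
    if not MANDATORY_KEYS <= set(event):
        return False
    # 'details' may be omitted, but if present it must be a mapping.
    return isinstance(event.get('details', {}), dict)

sample = {
    'event': {
        'time': '2016-04-12T08:00:00',
        'type': 'compute.host.down',
        'details': {'hostname': 'compute-1', 'severity': 'critical'},
    }
}
print(validate_event(sample))                                      # True
print(validate_event({'event': {'time': '2016-04-12T08:00:00'}}))  # False
```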
+
+Also, a bulk API can be provided to receive multiple events in a single HTTP
+POST message by using the 'events' wrapper as follows:
+
+.. code-block:: json
+
+    {
+        "events": [
+            {
+                "time": "2016-04-12T08:00:00",
+                "type": "compute.host.down",
+                "details": {}
+            },
+            {
+                "time": "2016-04-12T08:00:00",
+                "type": "compute.host.nic.error",
+                "details": {}
+            }
+        ]
+    }
+
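A receiver of the bulk API can reuse the single-event handling path by unwrapping the message first. A minimal sketch (the helper name is an assumption, not part of the interface):

```python
# Sketch: unwrap a bulk notification ('events' wrapper) or a single
# notification ('event' wrapper) into individual event dicts, so both
# message forms can feed the same event handler.

def unwrap(payload):
    """Yield each event dict contained in 'payload'."""
    if 'events' in payload:
        for event in payload['events']:
            yield event
    elif 'event' in payload:
        yield payload['event']

bulk = {
    'events': [
        {'time': '2016-04-12T08:00:00', 'type': 'compute.host.down',
         'details': {}},
        {'time': '2016-04-12T08:00:00', 'type': 'compute.host.nic.error',
         'details': {}},
    ]
}
print([e['type'] for e in unwrap(bulk)])
# ['compute.host.down', 'compute.host.nic.error']
```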
+
Blueprints
----------
and service states correctly.
.. [*] https://blueprints.launchpad.net/nova/+spec/pacemaker-servicegroup-driver
-
-..
- vim: set tabstop=4 expandtab textwidth=80:
level faults that are considered within the scope of the Doctor project
requiring immediate action by the Consumer.
-**Compute Hardware**
+**Compute/Storage**
-+-------------------+----------+------------+-----------------+----------------+
-| Fault | Severity | How to | Comment | Action to |
-| | | detect? | | recover |
-+===================+==========+============+=================+================+
-| Processor/CPU | Critical | Zabbix | | Switch to |
-| failure, CPU | | | | hot standby |
-| condition not ok | | | | |
-+-------------------+----------+------------+-----------------+----------------+
-| Memory failure/ | Critical | Zabbix | | Switch to |
-| Memory condition | | (IPMI) | | hot standby |
-| not ok | | | | |
-+-------------------+----------+------------+-----------------+----------------+
-| Network card | Critical | Zabbix/ | | Switch to |
-| failure, e.g. | | Ceilometer | | hot standby |
-| network adapter | | | | |
-| connectivity lost | | | | |
-+-------------------+----------+------------+-----------------+----------------+
-| Disk crash | Info | RAID | Network storage | Inform OAM |
-| | | monitoring | is very | |
-| | | | redundant (e.g. | |
-| | | | RAID system) | |
-| | | | and can | |
-| | | | guarantee high | |
-| | | | availability | |
-+-------------------+----------+------------+-----------------+----------------+
-| Storage | Critical | Zabbix | | Live migration |
-| controller | | (IPMI) | | if storage |
-| | | | | is still |
-| | | | | accessible; |
-| | | | | otherwise hot |
-| | | | | standby |
-+-------------------+----------+------------+-----------------+----------------+
-| PDU/power | Critical | Zabbix/ | | Switch to |
-| failure, power | | Ceilometer | | hot standby |
-| off, server reset | | | | |
-+-------------------+----------+------------+-----------------+----------------+
-| Power | Warning | SNMP | | Live migration |
-| degration, power | | | | |
-| redundancy lost, | | | | |
-| power threshold | | | | |
-| exceeded | | | | |
-+-------------------+----------+------------+-----------------+----------------+
-| Chassis problem | Warning | SNMP | | Live migration |
-| (e.g. fan | | | | |
-| degraded/failed, | | | | |
-| chassis power | | | | |
-| degraded), CPU | | | | |
-| fan problem, | | | | |
-| temperature/ | | | | |
-| thermal condition | | | | |
-| not ok | | | | |
-+-------------------+----------+------------+-----------------+----------------+
-| Mainboard failure | Critical | Zabbix | | Switch to |
-| | | (IPMI) | | hot standby |
-+-------------------+----------+------------+-----------------+----------------+
-| OS crash (e.g. | Critical | Zabbix | | Switch to |
-| kernel panic) | | | | hot standby |
-+-------------------+----------+------------+-----------------+----------------+
++-------------------+----------+------------+-----------------+------------------+
+| Fault | Severity | How to | Comment | Immediate action |
+| | | detect? | | to recover |
++===================+==========+============+=================+==================+
+| Processor/CPU | Critical | Zabbix | | Switch to hot |
+| failure, CPU | | | | standby |
+| condition not ok | | | | |
++-------------------+----------+------------+-----------------+------------------+
+| Memory failure/ | Critical | Zabbix | | Switch to |
+| Memory condition | | (IPMI) | | hot standby |
+| not ok | | | | |
++-------------------+----------+------------+-----------------+------------------+
+| Network card | Critical | Zabbix/ | | Switch to |
+| failure, e.g. | | Ceilometer | | hot standby |
+| network adapter | | | | |
+| connectivity lost | | | | |
++-------------------+----------+------------+-----------------+------------------+
+| Disk crash | Info | RAID | Network storage | Inform OAM |
+| | | monitoring | is very | |
+| | | | redundant (e.g. | |
+| | | | RAID system) | |
+| | | | and can | |
+| | | | guarantee high | |
+| | | | availability | |
++-------------------+----------+------------+-----------------+------------------+
+| Storage | Critical | Zabbix | | Live migration |
+| controller | | (IPMI) | | if storage |
+| | | | | is still |
+| | | | | accessible; |
+| | | | | otherwise hot |
+| | | | | standby |
++-------------------+----------+------------+-----------------+------------------+
+| PDU/power | Critical | Zabbix/ | | Switch to |
+| failure, power | | Ceilometer | | hot standby |
+| off, server reset | | | | |
++-------------------+----------+------------+-----------------+------------------+
+| Power             | Warning  | SNMP       |                 | Live migration   |
+| degradation,      |          |            |                 |                  |
+| power redundancy  |          |            |                 |                  |
+| lost, power       |          |            |                 |                  |
+| threshold         |          |            |                 |                  |
+| exceeded          |          |            |                 |                  |
++-------------------+----------+------------+-----------------+------------------+
+| Chassis problem | Warning | SNMP | | Live migration |
+| (e.g. fan | | | | |
+| degraded/failed, | | | | |
+| chassis power | | | | |
+| degraded), CPU | | | | |
+| fan problem, | | | | |
+| temperature/ | | | | |
+| thermal condition | | | | |
+| not ok | | | | |
++-------------------+----------+------------+-----------------+------------------+
+| Mainboard failure | Critical | Zabbix | e.g. PCIe, SAS | Switch to |
+| | | (IPMI) | link failure | hot standby |
++-------------------+----------+------------+-----------------+------------------+
+| OS crash (e.g. | Critical | Zabbix | | Switch to |
+| kernel panic) | | | | hot standby |
++-------------------+----------+------------+-----------------+------------------+
**Hypervisor**
-+----------------+----------+------------+---------+-------------------+
-| Fault | Severity | How to | Comment | Action to |
-| | | detect? | | recover |
-+================+==========+============+=========+===================+
-| System has | Critical | Zabbix | | Switch to |
-| restarted | | | | hot standby |
-+----------------+----------+------------+---------+-------------------+
-| Hypervisor | Warning/ | Zabbix/ | | Evacuation/switch |
-| failure | Critical | Ceilometer | | to hot standby |
-+----------------+----------+------------+---------+-------------------+
-| Zabbix/ | Warning | ? | | Live migration |
-| Ceilometer | | | | |
-| is unreachable | | | | |
-+----------------+----------+------------+---------+-------------------+
++----------------+----------+------------+-------------+-------------------+
+| Fault | Severity | How to | Comment | Immediate action |
+| | | detect? | | to recover |
++================+==========+============+=============+===================+
+| System has | Critical | Zabbix | | Switch to |
+| restarted | | | | hot standby |
++----------------+----------+------------+-------------+-------------------+
+| Hypervisor | Warning/ | Zabbix/ | | Evacuation/switch |
+| failure | Critical | Ceilometer | | to hot standby |
++----------------+----------+------------+-------------+-------------------+
+| Hypervisor | Warning | Alarming | Zabbix/ | Rebuild VM |
+| status not | | service | Ceilometer | |
+| retrievable | | | unreachable | |
+| after certain | | | | |
+| period | | | | |
++----------------+----------+------------+-------------+-------------------+
**Network**
-
+------------------+----------+---------+----------------+---------------------+
-| Fault | Severity | How to | Comment | Action to |
+| Fault | Severity | How to | Comment | Immediate action to |
| | | detect? | | recover |
+==================+==========+=========+================+=====================+
-| SDN/OpenFlow | Critical | ? | | Switch to |
-| switch, | | | | hot standby |
+| SDN/OpenFlow | Critical | Ceilo- | | Switch to |
+| switch, | | meter | | hot standby |
| controller | | | | or reconfigure |
| degraded/failed | | | | virtual network |
| | | | | topology |
+------------------+----------+---------+----------------+---------------------+
| Hardware failure | Warning | SNMP | Redundancy of | Live migration if |
-| of physical | | | physical | possible otherwise |
+| of physical | | | physical | possible otherwise |
| switch/router | | | infrastructure | evacuation |
| | | | is reduced or | |
| | | | no longer | |
CONSUMER_PORT=12346
TEST_USER=demo
TEST_PW=demo
-TEST_TENANT=demo
+TEST_PROJECT=demo
TEST_ROLE=_member_
SUPPORTED_INSTALLER_TYPES="apex local"
}
create_test_user() {
- keystone user-list | grep -q "$TEST_USER" || {
- keystone user-create --name "$TEST_USER" --pass "$TEST_PW"
+ openstack user list | grep -q "$TEST_USER" || {
+ openstack user create "$TEST_USER" --password "$TEST_PW"
}
- keystone tenant-list | grep -q "$TEST_TENANT" || {
- keystone tenant-create --name "$TEST_TENANT"
+ openstack project list | grep -q "$TEST_PROJECT" || {
+ openstack project create "$TEST_PROJECT"
}
- keystone user-role-list --user "$TEST_USER" --tenant "$TEST_TENANT" \
+ openstack user role list "$TEST_USER" --project "$TEST_PROJECT" \
| grep -q "$TEST_ROLE" || {
- keystone user-role-add --user "$TEST_USER" --role "$TEST_ROLE" \
- --tenant "$TEST_TENANT"
+ openstack role add "$TEST_ROLE" --user "$TEST_USER" \
+ --project "$TEST_PROJECT"
}
}
# test VM done with test user, so can test non-admin
export OS_USERNAME="$TEST_USER"
export OS_PASSWORD="$TEST_PW"
- export OS_TENANT_NAME="$TEST_TENANT"
+ export OS_TENANT_NAME="$TEST_PROJECT"
nova boot --flavor "$VM_FLAVOR" \
--image "$IMAGE_NAME" \
"$VM_NAME"
wait_for_vm_launch() {
echo "waiting for vm launch..."
- while true
+ count=0
+ while [[ ${count} -lt 60 ]]
do
state=$(nova list | grep " $VM_NAME " | awk '{print $6}')
[[ "$state" == "ACTIVE" ]] && return 0
+ [[ "$state" == "ERROR" ]] && echo "vm state is ERROR" && exit 1
+ count=$(($count+1))
sleep 1
done
+ echo "ERROR: time out while waiting for vm launch"
+ exit 1
}
inject_failure() {
# Switching to test user
export OS_USERNAME="$TEST_USER"
export OS_PASSWORD="$TEST_PW"
- export OS_TENANT_NAME="$TEST_TENANT"
+ export OS_TENANT_NAME="$TEST_PROJECT"
host_status_line=$(nova show $VM_NAME | grep "host_status")
[[ $? -ne 0 ]] && {
python ./nova_force_down.py "$COMPUTE_HOST" --unset
sleep 1
- nova delete "$VM_NAME"
+ nova list | grep -q " $VM_NAME " && nova delete "$VM_NAME"
sleep 1
alarm_id=$(ceilometer alarm-list | grep " $ALARM_NAME " | awk '{print $2}')
sleep 1
image_id=$(glance image-list | grep " $IMAGE_NAME " | awk '{print $2}')
sleep 1
[ -n "$image_id" ] && glance image-delete "$image_id"
- keystone user-role-remove --user "$TEST_USER" --role "$TEST_ROLE" \
- --tenant "$TEST_TENANT"
- keystone tenant-remove --name "$TEST_TENANT"
- keystone user-delete "$TEST_USER"
+ openstack role remove "$TEST_ROLE" --user "$TEST_USER" \
+ --project "$TEST_PROJECT"
+ openstack project delete "$TEST_PROJECT"
+ openstack user delete "$TEST_USER"
#TODO: add host status check via nova admin api
echo "waiting disabled compute host back to be enabled..."