resources. For each resource, information about the current state, the
firmware version, etc. is provided.
+
+Detailed southbound interface specification
+-------------------------------------------
+
+This section specifies the southbound interfaces for fault management
+between the Monitors and the Inspector.
+Although southbound interfaces should be flexible enough to handle various
+events from different types of Monitors, we define a unified event API in
+order to improve interoperability between the Monitors and the Inspector.
+This does not limit implementations of the Monitor and the Inspector, as both
+could be extended in order to support additional sources of failures, such as
+prediction by intelligent inspection.
+
+Note: The interface definition will be aligned with the current work in the
+ETSI NFV IFA working group.
+
+Fault event interface
+^^^^^^^^^^^^^^^^^^^^^
+
+This interface allows the Monitors to notify the Inspector about an event
+which was captured by a Monitor and may affect resources managed in the VIM.
+
+EventNotification
+_________________
+
+
+Event notification including a fault description.
+The entity of this notification is an event, not specifically a fault or an
+error. This allows us to reuse a generic event format or framework developed
+outside of the Doctor project.
+The parameters below shall be mandatory, but keys in 'Details' can be
+optional.
+
+Parameters:
+
+* Time [1]: Datetime when the fault was observed in the Monitor.
+* Type [1]: Type of the event; used by the Inspector to process correlation.
+* Details [0..1]: Additional information given as key-value pairs.
+  Keys shall be defined depending on the Type of the event.
+
+E.g.:
+
+.. code-block:: json
+
+    {
+        "event": {
+            "time": "2016-04-12T08:00:00",
+            "type": "compute.host.down",
+            "details": {
+                "hostname": "compute-1",
+                "source": "sample_monitor",
+                "cause": "link-down",
+                "severity": "critical",
+                "status": "down",
+                "monitor_id": "monitor-1",
+                "monitor_event_id": "123"
+            }
+        }
+    }
+
+Optional parameters in 'Details':
+
+* Hostname: the hostname on which the event occurred.
+* Source: the display name of the reporter of this event. This is not limited
+  to a Monitor; other entities such as 'KVM' can be specified.
+* Cause: description of the cause of this event, which may differ from the
+  type of this event.
+* Severity: the severity of this event, set by the Monitor.
+* Status: the status of the target object in which the error occurred.
+* MonitorID: the ID of the Monitor sending this event.
+* MonitorEventID: the ID of the event in the Monitor. This can be used by the
+  operator when tracking the Monitor log.
+* RelatedTo: an array of IDs of events related to this event.
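The event format above can be checked programmatically. As a minimal sketch (the function and constant names here are illustrative, not part of the interface), an Inspector could validate an incoming notification as follows:

```python
# Minimal sketch of Inspector-side validation of an EventNotification.
# 'time' and 'type' are the mandatory parameters; 'details' is optional
# and its keys depend on the event type.

MANDATORY_KEYS = {'time', 'type'}

def validate_event(payload):
    """Return True if 'payload' is a well-formed event notification."""
    event = payload.get('event')
    if not isinstance(event, dict):
        return False
    if not MANDATORY_KEYS <= set(event):
        return False
    # 'details' may be omitted, but if present it must be a mapping.
    return isinstance(event.get('details', {}), dict)

sample = {
    'event': {
        'time': '2016-04-12T08:00:00',
        'type': 'compute.host.down',
        'details': {'hostname': 'compute-1', 'severity': 'critical'},
    }
}
print(validate_event(sample))                                      # True
print(validate_event({'event': {'time': '2016-04-12T08:00:00'}}))  # False
```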
+
+Also, a bulk API can be provided to receive multiple events in a single HTTP
+POST message by using the 'events' wrapper as follows:
+
+.. code-block:: json
+
+    {
+        "events": [
+            {
+                "time": "2016-04-12T08:00:00",
+                "type": "compute.host.down",
+                "details": {}
+            },
+            {
+                "time": "2016-04-12T08:00:00",
+                "type": "compute.host.nic.error",
+                "details": {}
+            }
+        ]
+    }
+
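A receiver of the bulk API can reuse the single-event handling path by unwrapping the message first. A minimal sketch (the helper name is an assumption, not part of the interface):

```python
# Sketch: unwrap a bulk notification ('events' wrapper) or a single
# notification ('event' wrapper) into individual event dicts, so both
# message forms can feed the same event handler.

def unwrap(payload):
    """Yield each event dict contained in 'payload'."""
    if 'events' in payload:
        for event in payload['events']:
            yield event
    elif 'event' in payload:
        yield payload['event']

bulk = {
    'events': [
        {'time': '2016-04-12T08:00:00', 'type': 'compute.host.down',
         'details': {}},
        {'time': '2016-04-12T08:00:00', 'type': 'compute.host.nic.error',
         'details': {}},
    ]
}
print([e['type'] for e in unwrap(bulk)])
# ['compute.host.down', 'compute.host.nic.error']
```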
+
Blueprints
----------
and service states correctly.
.. [*] https://blueprints.launchpad.net/nova/+spec/pacemaker-servicegroup-driver
-
-..
- vim: set tabstop=4 expandtab textwidth=80:
level faults that are considered within the scope of the Doctor project
requiring immediate action by the Consumer.
-**Compute Hardware**
+**Compute/Storage**
-+-------------------+----------+------------+-----------------+----------------+
-| Fault | Severity | How to | Comment | Action to |
-| | | detect? | | recover |
-+===================+==========+============+=================+================+
-| Processor/CPU | Critical | Zabbix | | Switch to |
-| failure, CPU | | | | hot standby |
-| condition not ok | | | | |
-+-------------------+----------+------------+-----------------+----------------+
-| Memory failure/ | Critical | Zabbix | | Switch to |
-| Memory condition | | (IPMI) | | hot standby |
-| not ok | | | | |
-+-------------------+----------+------------+-----------------+----------------+
-| Network card | Critical | Zabbix/ | | Switch to |
-| failure, e.g. | | Ceilometer | | hot standby |
-| network adapter | | | | |
-| connectivity lost | | | | |
-+-------------------+----------+------------+-----------------+----------------+
-| Disk crash | Info | RAID | Network storage | Inform OAM |
-| | | monitoring | is very | |
-| | | | redundant (e.g. | |
-| | | | RAID system) | |
-| | | | and can | |
-| | | | guarantee high | |
-| | | | availability | |
-+-------------------+----------+------------+-----------------+----------------+
-| Storage | Critical | Zabbix | | Live migration |
-| controller | | (IPMI) | | if storage |
-| | | | | is still |
-| | | | | accessible; |
-| | | | | otherwise hot |
-| | | | | standby |
-+-------------------+----------+------------+-----------------+----------------+
-| PDU/power | Critical | Zabbix/ | | Switch to |
-| failure, power | | Ceilometer | | hot standby |
-| off, server reset | | | | |
-+-------------------+----------+------------+-----------------+----------------+
-| Power | Warning | SNMP | | Live migration |
-| degration, power | | | | |
-| redundancy lost, | | | | |
-| power threshold | | | | |
-| exceeded | | | | |
-+-------------------+----------+------------+-----------------+----------------+
-| Chassis problem | Warning | SNMP | | Live migration |
-| (e.g. fan | | | | |
-| degraded/failed, | | | | |
-| chassis power | | | | |
-| degraded), CPU | | | | |
-| fan problem, | | | | |
-| temperature/ | | | | |
-| thermal condition | | | | |
-| not ok | | | | |
-+-------------------+----------+------------+-----------------+----------------+
-| Mainboard failure | Critical | Zabbix | | Switch to |
-| | | (IPMI) | | hot standby |
-+-------------------+----------+------------+-----------------+----------------+
-| OS crash (e.g. | Critical | Zabbix | | Switch to |
-| kernel panic) | | | | hot standby |
-+-------------------+----------+------------+-----------------+----------------+
++-------------------+----------+------------+-----------------+------------------+
+| Fault | Severity | How to | Comment | Immediate action |
+| | | detect? | | to recover |
++===================+==========+============+=================+==================+
+| Processor/CPU | Critical | Zabbix | | Switch to hot |
+| failure, CPU | | | | standby |
+| condition not ok | | | | |
++-------------------+----------+------------+-----------------+------------------+
+| Memory failure/ | Critical | Zabbix | | Switch to |
+| Memory condition | | (IPMI) | | hot standby |
+| not ok | | | | |
++-------------------+----------+------------+-----------------+------------------+
+| Network card | Critical | Zabbix/ | | Switch to |
+| failure, e.g. | | Ceilometer | | hot standby |
+| network adapter | | | | |
+| connectivity lost | | | | |
++-------------------+----------+------------+-----------------+------------------+
+| Disk crash | Info | RAID | Network storage | Inform OAM |
+| | | monitoring | is very | |
+| | | | redundant (e.g. | |
+| | | | RAID system) | |
+| | | | and can | |
+| | | | guarantee high | |
+| | | | availability | |
++-------------------+----------+------------+-----------------+------------------+
+| Storage | Critical | Zabbix | | Live migration |
+| controller | | (IPMI) | | if storage |
+| | | | | is still |
+| | | | | accessible; |
+| | | | | otherwise hot |
+| | | | | standby |
++-------------------+----------+------------+-----------------+------------------+
+| PDU/power | Critical | Zabbix/ | | Switch to |
+| failure, power | | Ceilometer | | hot standby |
+| off, server reset | | | | |
++-------------------+----------+------------+-----------------+------------------+
+| Power             | Warning  | SNMP       |                 | Live migration   |
+| degradation,      |          |            |                 |                  |
+| power redundancy  |          |            |                 |                  |
+| lost, power       |          |            |                 |                  |
+| threshold         |          |            |                 |                  |
+| exceeded          |          |            |                 |                  |
++-------------------+----------+------------+-----------------+------------------+
+| Chassis problem | Warning | SNMP | | Live migration |
+| (e.g. fan | | | | |
+| degraded/failed, | | | | |
+| chassis power | | | | |
+| degraded), CPU | | | | |
+| fan problem, | | | | |
+| temperature/ | | | | |
+| thermal condition | | | | |
+| not ok | | | | |
++-------------------+----------+------------+-----------------+------------------+
+| Mainboard failure | Critical | Zabbix | e.g. PCIe, SAS | Switch to |
+| | | (IPMI) | link failure | hot standby |
++-------------------+----------+------------+-----------------+------------------+
+| OS crash (e.g. | Critical | Zabbix | | Switch to |
+| kernel panic) | | | | hot standby |
++-------------------+----------+------------+-----------------+------------------+
**Hypervisor**
-+----------------+----------+------------+---------+-------------------+
-| Fault | Severity | How to | Comment | Action to |
-| | | detect? | | recover |
-+================+==========+============+=========+===================+
-| System has | Critical | Zabbix | | Switch to |
-| restarted | | | | hot standby |
-+----------------+----------+------------+---------+-------------------+
-| Hypervisor | Warning/ | Zabbix/ | | Evacuation/switch |
-| failure | Critical | Ceilometer | | to hot standby |
-+----------------+----------+------------+---------+-------------------+
-| Zabbix/ | Warning | ? | | Live migration |
-| Ceilometer | | | | |
-| is unreachable | | | | |
-+----------------+----------+------------+---------+-------------------+
++----------------+----------+------------+-------------+-------------------+
+| Fault | Severity | How to | Comment | Immediate action |
+| | | detect? | | to recover |
++================+==========+============+=============+===================+
+| System has | Critical | Zabbix | | Switch to |
+| restarted | | | | hot standby |
++----------------+----------+------------+-------------+-------------------+
+| Hypervisor | Warning/ | Zabbix/ | | Evacuation/switch |
+| failure | Critical | Ceilometer | | to hot standby |
++----------------+----------+------------+-------------+-------------------+
+| Hypervisor | Warning | Alarming | Zabbix/ | Rebuild VM |
+| status not | | service | Ceilometer | |
+| retrievable | | | unreachable | |
+| after certain | | | | |
+| period | | | | |
++----------------+----------+------------+-------------+-------------------+
**Network**
-
+------------------+----------+---------+----------------+---------------------+
-| Fault | Severity | How to | Comment | Action to |
+| Fault | Severity | How to | Comment | Immediate action to |
| | | detect? | | recover |
+==================+==========+=========+================+=====================+
-| SDN/OpenFlow | Critical | ? | | Switch to |
-| switch, | | | | hot standby |
+| SDN/OpenFlow | Critical | Ceilo- | | Switch to |
+| switch, | | meter | | hot standby |
| controller | | | | or reconfigure |
| degraded/failed | | | | virtual network |
| | | | | topology |
+------------------+----------+---------+----------------+---------------------+
| Hardware failure | Warning | SNMP | Redundancy of | Live migration if |
-| of physical | | | physical | possible otherwise |
+| of physical | | | physical | possible otherwise |
| switch/router | | | infrastructure | evacuation |
| | | | is reduced or | |
| | | | no longer | |
CONSUMER_PORT=12346
TEST_USER=demo
TEST_PW=demo
-TEST_TENANT=demo
+TEST_PROJECT=demo
TEST_ROLE=_member_
SUPPORTED_INSTALLER_TYPES="apex local"
}
create_test_user() {
- keystone user-list | grep -q "$TEST_USER" || {
- keystone user-create --name "$TEST_USER" --pass "$TEST_PW"
+ openstack user list | grep -q "$TEST_USER" || {
+ openstack user create "$TEST_USER" --password "$TEST_PW"
}
- keystone tenant-list | grep -q "$TEST_TENANT" || {
- keystone tenant-create --name "$TEST_TENANT"
+ openstack project list | grep -q "$TEST_PROJECT" || {
+ openstack project create "$TEST_PROJECT"
}
- keystone user-role-list --user "$TEST_USER" --tenant "$TEST_TENANT" \
+ openstack user role list "$TEST_USER" --project "$TEST_PROJECT" \
| grep -q "$TEST_ROLE" || {
- keystone user-role-add --user "$TEST_USER" --role "$TEST_ROLE" \
- --tenant "$TEST_TENANT"
+ openstack role add "$TEST_ROLE" --user "$TEST_USER" \
+ --project "$TEST_PROJECT"
}
}
# test VM done with test user, so can test non-admin
export OS_USERNAME="$TEST_USER"
export OS_PASSWORD="$TEST_PW"
- export OS_TENANT_NAME="$TEST_TENANT"
+ export OS_TENANT_NAME="$TEST_PROJECT"
nova boot --flavor "$VM_FLAVOR" \
--image "$IMAGE_NAME" \
"$VM_NAME"
wait_for_vm_launch() {
echo "waiting for vm launch..."
- while true
+ count=0
+ while [[ ${count} -lt 60 ]]
do
state=$(nova list | grep " $VM_NAME " | awk '{print $6}')
[[ "$state" == "ACTIVE" ]] && return 0
+ [[ "$state" == "ERROR" ]] && echo "vm state is ERROR" && exit 1
+ count=$(($count+1))
sleep 1
done
+ echo "ERROR: time out while waiting for vm launch"
+ exit 1
}
inject_failure() {
# Switching to test user
export OS_USERNAME="$TEST_USER"
export OS_PASSWORD="$TEST_PW"
- export OS_TENANT_NAME="$TEST_TENANT"
+ export OS_TENANT_NAME="$TEST_PROJECT"
host_status_line=$(nova show $VM_NAME | grep "host_status")
[[ $? -ne 0 ]] && {
python ./nova_force_down.py "$COMPUTE_HOST" --unset
sleep 1
- nova delete "$VM_NAME"
+ nova list | grep -q " $VM_NAME " && nova delete "$VM_NAME"
sleep 1
alarm_id=$(ceilometer alarm-list | grep " $ALARM_NAME " | awk '{print $2}')
sleep 1
image_id=$(glance image-list | grep " $IMAGE_NAME " | awk '{print $2}')
sleep 1
[ -n "$image_id" ] && glance image-delete "$image_id"
- keystone user-role-remove --user "$TEST_USER" --role "$TEST_ROLE" \
- --tenant "$TEST_TENANT"
- keystone tenant-remove --name "$TEST_TENANT"
- keystone user-delete "$TEST_USER"
+ openstack role remove "$TEST_ROLE" --user "$TEST_USER" \
+ --project "$TEST_PROJECT"
+ openstack project delete "$TEST_PROJECT"
+ openstack user delete "$TEST_USER"
#TODO: add host status check via nova admin api
echo "waiting disabled compute host back to be enabled..."