blueprint performance-profiler 91/25591/1
author Yujun Zhang <zhang.yujunz@zte.com.cn>
Wed, 7 Dec 2016 06:21:33 +0000 (14:21 +0800)
committer Yujun Zhang <zhang.yujunz@zte.com.cn>
Wed, 7 Dec 2016 06:22:57 +0000 (14:22 +0800)
JIRA: DOCTOR-72

Change-Id: Iced857d8194f15c660ae506119797f85e067a6f4
Signed-off-by: Yujun Zhang <zhang.yujunz@zte.com.cn>
docs/design/performance-profiler.rst [new file with mode: 0644]

diff --git a/docs/design/performance-profiler.rst b/docs/design/performance-profiler.rst
new file mode 100644
index 0000000..f834a91
--- /dev/null
@@ -0,0 +1,118 @@
+.. This work is licensed under a Creative Commons Attribution 4.0 International License.
+.. http://creativecommons.org/licenses/by/4.0
+
+
+====================
+Performance Profiler
+====================
+
+https://goo.gl/98Osig
+
+This blueprint proposes a performance profiler for Doctor scenarios.
+
+Problem Description
+===================
+
+While running the verification job for notification time, we encountered some
+performance issues, such as:
+
+1. An environment deployed by APEX meets the criteria, while one deployed by
+   Fuel performs much worse.
+2. Significant performance degradation was observed when the total number of
+   VMs was increased.
+
+Digging through the logs to analyse the cause takes time. People have to collect
+timestamps at each checkpoint manually to find the bottleneck. A performance
+profiler will automate this process.
+
+Proposed Change
+===============
+
+The current Doctor scenario covers the inspector and notifier in the whole fault
+management cycle::
+
+  start                                          end
+    +       +         +        +       +          +
+    |       |         |        |       |          |
+    |monitor|inspector|notifier|manager|controller|
+    +------>+         |        |       |          |
+  occurred  +-------->+        |       |          |
+    |     detected    +------->+       |          |
+    |       |     identified   +-------+          |
+    |       |               notified   +--------->+
+    |       |                  |    processed  resolved
+    |       |                  |                  |
+    |       +<-----doctor----->+                  |
+    |                                             |
+    |                                             |
+    +<---------------fault management------------>+
+
+The notification time can be split into several parts and visualized as a
+timeline::
+
+  start                                         end
+    0----5---10---15---20---25---30---35---40---45--> (x 10ms)
+    +    +   +   +   +    +      +   +   +   +   +
+  0-hostdown |   |   |    |      |   |   |   |   |
+    +--->+   |   |   |    |      |   |   |   |   |
+    |  1-raw failure |    |      |   |   |   |   |
+    |    +-->+   |   |    |      |   |   |   |   |
+    |    | 2-found affected      |   |   |   |   |
+    |    |   +-->+   |    |      |   |   |   |   |
+    |    |     3-marked host down|   |   |   |   |
+    |    |       +-->+    |      |   |   |   |   |
+    |    |         4-set VM error|   |   |   |   |
+    |    |           +--->+      |   |   |   |   |
+    |    |           |  5-notified VM error  |   |
+    |    |           |    +----->|   |   |   |   |
+    |    |           |    |    6-transformed event
+    |    |           |    |      +-->+   |   |   |
+    |    |           |    |      | 7-evaluated event
+    |    |           |    |      |   +-->+   |   |
+    |    |           |    |      |     8-fired alarm
+    |    |           |    |      |       +-->+   |
+    |    |           |    |      |         9-received alarm
+    |    |           |    |      |           +-->+
+  sample | sample    |    |      |           |10-handled alarm
+  monitor| inspector |nova| c/m  |    aodh   |
+    |                                        |
+    +<-----------------doctor--------------->+
+
+Note: c/m = ceilometer
+
+A table lists the components sorted by time cost, from highest to lowest:
+
++----------+---------+----------+
+|Component |Time Cost|Percentage|
++==========+=========+==========+
+|inspector |160ms    | 40%      |
++----------+---------+----------+
+|aodh      |110ms    | 30%      |
++----------+---------+----------+
+|monitor   |50ms     | 14%      |
++----------+---------+----------+
+|...       |         |          |
++----------+---------+----------+
+|...       |         |          |
++----------+---------+----------+
+
+Note: the data in the table is for demonstration only, not an actual measurement.
+
+Timestamps can be collected from various sources:
+
+1. log files
+2. trace points in code
+
+The performance profiler will be integrated into the verification job to provide
+detailed test results. It can also be deployed independently to diagnose
+performance issues in a specified environment.
+
+Work Items
+==========
+
+1. PoC with limited checkpoints
+2. Integration with the verification job
+3. Collect timestamps at all checkpoints
+4. Display the profiling result in the console
+5. Report the profiling result to the test database
+6. Independent package which can be installed in a specified environment
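
The timeline and the component table proposed in the blueprint could be derived from collected checkpoint timestamps roughly as follows. This is only a minimal sketch and not part of the change: the `CHECKPOINTS` list, the `profile` function, the timestamp values, the `consumer` component name, and the rule that each interval is charged to the component that reaches the later checkpoint are all assumptions made for illustration. The checkpoint labels are taken from the timeline in the blueprint.

```python
from collections import defaultdict

# Hypothetical checkpoint records: (label, component, timestamp in ms).
# Labels follow the blueprint's timeline; the timestamps are made-up demo
# values, and the component attribution is one reading of the diagram.
CHECKPOINTS = [
    ("0-hostdown", "fault", 0),
    ("1-raw failure", "monitor", 50),
    ("2-found affected", "inspector", 90),
    ("3-marked host down", "inspector", 130),
    ("4-set VM error", "inspector", 180),
    ("5-notified VM error", "nova", 220),
    ("6-transformed event", "ceilometer", 290),
    ("7-evaluated event", "aodh", 330),
    ("8-fired alarm", "aodh", 370),
    ("9-received alarm", "aodh", 410),
    ("10-handled alarm", "consumer", 450),
]


def profile(checkpoints):
    """Return (component, cost_ms, percentage) rows, highest cost first."""
    ordered = sorted(checkpoints, key=lambda c: c[2])
    costs = defaultdict(int)
    # Charge each interval to the component that reached the later checkpoint
    # (an assumed attribution rule, not one defined by the blueprint).
    for (_, _, start), (_, component, end) in zip(ordered, ordered[1:]):
        costs[component] += end - start
    total = ordered[-1][2] - ordered[0][2]
    rows = [(name, cost, 100.0 * cost / total) for name, cost in costs.items()]
    return sorted(rows, key=lambda r: r[1], reverse=True)


# Print a plain-text version of the component table.
for name, cost, pct in profile(CHECKPOINTS):
    print(f"{name:<12}{cost:>5}ms{pct:>7.1f}%")
```

Keeping the checkpoints as flat records means the same function could be fed from either source named in the blueprint, log files or trace points, once a collector turns them into `(label, component, timestamp)` tuples.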