.. This work is licensed under a Creative Commons Attribution 4.0 International License.
.. http://creativecommons.org/licenses/by/4.0


====================
Performance Profiler
====================

https://goo.gl/98Osig

This blueprint proposes to create a performance profiler for doctor scenarios.

Problem Description
===================

In the verification job for notification time, we have encountered some
performance issues, for example:

1. An environment deployed by APEX meets the criteria, while one deployed by
   Fuel performs much worse.
2. Significant performance degradation was observed when the total number of
   VMs was increased.

Digging through the logs to analyse the cause takes time: people have to
collect timestamps at each checkpoint manually to find the bottleneck. A
performance profiler will make this process automatic.

Proposed Change
===============

The current Doctor scenario covers the inspector and notifier in the whole
fault management cycle::

  start                                          end
    +       +         +        +       +          +
    |       |         |        |       |          |
    |monitor|inspector|notifier|manager|controller|
    +------>+         |        |       |          |
  occurred  +-------->+        |       |          |
    |     detected    +------->+       |          |
    |       |     identified   +-------+          |
    |       |               notified   +--------->+
    |       |                  |    processed  resolved
    |       |                  |                  |
    |       +<-----doctor----->+                  |
    |                                             |
    |                                             |
    +<---------------fault management------------>+

The notification time can be split into several parts and visualized as a
timeline::

  start                                         end
    0----5---10---15---20---25---30---35---40---45--> (x 10ms)
    +    +   +   +   +    +      +   +   +   +   +
  0-hostdown |   |   |    |      |   |   |   |   |
    +--->+   |   |   |    |      |   |   |   |   |
    |  1-raw failure |    |      |   |   |   |   |
    |    +-->+   |   |    |      |   |   |   |   |
    |    | 2-found affected      |   |   |   |   |
    |    |   +-->+   |    |      |   |   |   |   |
    |    |     3-marked host down|   |   |   |   |
    |    |       +-->+    |      |   |   |   |   |
    |    |         4-set VM error|   |   |   |   |
    |    |           +--->+      |   |   |   |   |
    |    |           |  5-notified VM error  |   |
    |    |           |    +----->|   |   |   |   |
    |    |           |    |    6-transformed event
    |    |           |    |      +-->+   |   |   |
    |    |           |    |      | 7-evaluated event
    |    |           |    |      |   +-->+   |   |
    |    |           |    |      |     8-fired alarm
    |    |           |    |      |       +-->+   |
    |    |           |    |      |         9-received alarm
    |    |           |    |      |           +-->+
  sample | sample    |    |      |           |10-handled alarm
  monitor| inspector |nova| c/m  |    aodh   |
    |                                        |
    +<-----------------doctor--------------->+

Note: c/m = ceilometer
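
The numbered checkpoints above can be modelled as an ordered list of
(name, timestamp) pairs; the profiler then only has to subtract consecutive
timestamps to obtain the cost of each hop. A minimal sketch — the checkpoint
names follow the timeline, but the timestamps and helper are illustrative
assumptions, not the actual profiler:

```python
# Ordered checkpoints from the timeline above; timestamps are in seconds
# and purely illustrative, not actual measurements.
CHECKPOINTS = [
    ("hostdown", 0.00),
    ("raw failure", 0.05),
    ("found affected", 0.10),
    ("marked host down", 0.15),
    ("set VM error", 0.20),
    ("notified VM error", 0.25),
    ("transformed event", 0.31),
    ("evaluated event", 0.35),
    ("fired alarm", 0.39),
    ("received alarm", 0.43),
    ("handled alarm", 0.45),
]


def deltas(checkpoints):
    """Pair consecutive checkpoints and compute the elapsed time of each hop."""
    return [
        ("%s -> %s" % (a[0], b[0]), round(b[1] - a[1], 3))
        for a, b in zip(checkpoints, checkpoints[1:])
    ]


# Overall notification time is simply last minus first checkpoint.
total = CHECKPOINTS[-1][1] - CHECKPOINTS[0][1]
```

The hop with the largest delta is the bottleneck the profiler should surface.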

And a table of components sorted by time cost, from most to least:

+----------+---------+----------+
|Component |Time Cost|Percentage|
+==========+=========+==========+
|inspector |160ms    | 40%      |
+----------+---------+----------+
|aodh      |110ms    | 30%      |
+----------+---------+----------+
|monitor   |50ms     | 14%      |
+----------+---------+----------+
|...       |         |          |
+----------+---------+----------+
|...       |         |          |
+----------+---------+----------+

Note: data in the table is for demonstration only, not from actual measurements.
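
Such a table could be generated automatically once per-component costs are
measured. A sketch, assuming costs arrive as a name-to-milliseconds mapping;
the numbers are demonstration values, and the ``nova`` entry stands in for the
elided rows and is entirely made up:

```python
def cost_table(costs_ms):
    """Sort components by time cost (descending) and attach a percentage."""
    total = sum(costs_ms.values())
    return [
        (name, ms, round(100.0 * ms / total))
        for name, ms in sorted(costs_ms.items(), key=lambda kv: -kv[1])
    ]


# Demonstration values only, mirroring the table above; "nova" is hypothetical.
rows = cost_table({"inspector": 160, "aodh": 110, "monitor": 50, "nova": 45})
for name, ms, pct in rows:
    print("%-10s %5dms %4d%%" % (name, ms, pct))
```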

Timestamps can be collected from various sources:

1. log files
2. trace points in the code
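
For the log-file source, a collector could scan service logs for known
checkpoint patterns and record the first matching timestamp. A minimal sketch;
the log format and the checkpoint patterns here are assumptions for
illustration, not the actual formats emitted by the services:

```python
import re
from datetime import datetime

# Hypothetical checkpoint patterns; each real service has its own log format.
PATTERNS = {
    "notified VM error": re.compile(r"instance .* set to ERROR"),
    "fired alarm": re.compile(r"alarm .* transitioned to state alarm"),
}

# Assumed "YYYY-MM-DD HH:MM:SS.mmm" timestamp at the start of each log line.
TS = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3})")


def collect(lines):
    """Return {checkpoint name: datetime} for the first matching log line."""
    found = {}
    for line in lines:
        ts = TS.match(line)
        if not ts:
            continue
        stamp = datetime.strptime(ts.group(1), "%Y-%m-%d %H:%M:%S.%f")
        for name, pattern in PATTERNS.items():
            if name not in found and pattern.search(line):
                found[name] = stamp
    return found
```

Subtracting any two collected datetimes yields the elapsed time between those
checkpoints.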

The performance profiler will be integrated into the verification job to
provide detailed results for the test. It can also be deployed independently to
diagnose performance issues in a specific environment.

Working Items
=============

1. PoC with limited checkpoints
2. Integration with the verification job
3. Collect timestamps at all checkpoints
4. Display the profiling result in the console
5. Report the profiling result to the test database
6. Independent package which can be installed in a specified environment