.. This work is licensed under a Creative Commons Attribution 4.0 International License.
.. http://creativecommons.org/licenses/by/4.0


====================
Performance Profiler
====================

https://goo.gl/98Osig

This blueprint proposes to create a performance profiler for doctor scenarios.

Problem Description
===================

In the verification job for notification time, we have encountered some
performance issues, for example:

1. An environment deployed by APEX meets the criteria, while one deployed by
   Fuel performs much worse.
2. Significant performance degradation was observed when the total number of
   VMs was increased.

Digging through the logs to analyse the cause takes time: people have to
collect timestamps at each checkpoint manually to find the bottleneck. A
performance profiler will make this process automatic.

Proposed Change
===============

The current Doctor scenario covers the inspector and notifier in the whole
fault management cycle::

  start                                          end
    +       +         +        +       +          +
    |       |         |        |       |          |
    |monitor|inspector|notifier|manager|controller|
    +------>+         |        |       |          |
  occurred  +-------->+        |       |          |
    |     detected    +------->+       |          |
    |       |     identified   +-------+          |
    |       |               notified   +--------->+
    |       |                  |    processed  resolved
    |       |                  |                  |
    |       +<-----doctor----->+                  |
    |                                             |
    |                                             |
    +<---------------fault management------------>+

The notification time can be split into several parts and visualized as a
timeline::

  start                                         end
    0----5---10---15---20---25---30---35---40---45--> (x 10ms)
    +    +   +   +   +    +      +   +   +   +   +
  0-hostdown |   |   |    |      |   |   |   |   |
    +--->+   |   |   |    |      |   |   |   |   |
    |  1-raw failure |    |      |   |   |   |   |
    |    +-->+   |   |    |      |   |   |   |   |
    |    | 2-found affected      |   |   |   |   |
    |    |   +-->+   |    |      |   |   |   |   |
    |    |     3-marked host down|   |   |   |   |
    |    |       +-->+    |      |   |   |   |   |
    |    |         4-set VM error|   |   |   |   |
    |    |           +--->+      |   |   |   |   |
    |    |           |  5-notified VM error  |   |
    |    |           |    +----->|   |   |   |   |
    |    |           |    |    6-transformed event
    |    |           |    |      +-->+   |   |   |
    |    |           |    |      | 7-evaluated event
    |    |           |    |      |   +-->+   |   |
    |    |           |    |      |     8-fired alarm
    |    |           |    |      |       +-->+   |
    |    |           |    |      |         9-received alarm
    |    |           |    |      |           +-->+
  sample | sample    |    |      |           |10-handled alarm
  monitor| inspector |nova| c/m  |    aodh   |
    |                                        |
    +<-----------------doctor--------------->+

Note: c/m = ceilometer
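
The numbered checkpoints above can be modelled as an ordered list of
(name, timestamp) pairs; the profiler then only has to subtract consecutive
timestamps to obtain the cost of each hop. A minimal sketch — the checkpoint
names follow the timeline, but the timestamps and helper are illustrative
assumptions, not the actual profiler:

```python
# Ordered checkpoints from the timeline above; timestamps are in seconds
# and purely illustrative, not actual measurements.
CHECKPOINTS = [
    ("hostdown", 0.00),
    ("raw failure", 0.05),
    ("found affected", 0.10),
    ("marked host down", 0.15),
    ("set VM error", 0.20),
    ("notified VM error", 0.25),
    ("transformed event", 0.31),
    ("evaluated event", 0.35),
    ("fired alarm", 0.39),
    ("received alarm", 0.43),
    ("handled alarm", 0.45),
]


def deltas(checkpoints):
    """Pair consecutive checkpoints and compute the elapsed time of each hop."""
    return [
        ("%s -> %s" % (a[0], b[0]), round(b[1] - a[1], 3))
        for a, b in zip(checkpoints, checkpoints[1:])
    ]


# Overall notification time is simply last minus first checkpoint.
total = CHECKPOINTS[-1][1] - CHECKPOINTS[0][1]
```

The hop with the largest delta is the bottleneck the profiler should surface.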

And a table of components sorted by time cost, from most to least:

+----------+---------+----------+
|Component |Time Cost|Percentage|
+==========+=========+==========+
|inspector |160ms    | 40%      |
+----------+---------+----------+
|aodh      |110ms    | 30%      |
+----------+---------+----------+
|monitor   |50ms     | 14%      |
+----------+---------+----------+
|...       |         |          |
+----------+---------+----------+
|...       |         |          |
+----------+---------+----------+

Note: data in the table is for demonstration only, not from actual measurements.
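
Such a table could be generated automatically once per-component costs are
measured. A sketch, assuming costs arrive as a name-to-milliseconds mapping;
the numbers are demonstration values, and the ``nova`` entry stands in for the
elided rows and is entirely made up:

```python
def cost_table(costs_ms):
    """Sort components by time cost (descending) and attach a percentage."""
    total = sum(costs_ms.values())
    return [
        (name, ms, round(100.0 * ms / total))
        for name, ms in sorted(costs_ms.items(), key=lambda kv: -kv[1])
    ]


# Demonstration values only, mirroring the table above; "nova" is hypothetical.
rows = cost_table({"inspector": 160, "aodh": 110, "monitor": 50, "nova": 45})
for name, ms, pct in rows:
    print("%-10s %5dms %4d%%" % (name, ms, pct))
```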

Timestamps can be collected from various sources:

1. log files
2. trace points in the code
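
For the log-file source, a collector could scan service logs for known
checkpoint patterns and record the first matching timestamp. A minimal sketch;
the log format and the checkpoint patterns here are assumptions for
illustration, not the actual formats emitted by the services:

```python
import re
from datetime import datetime

# Hypothetical checkpoint patterns; each real service has its own log format.
PATTERNS = {
    "notified VM error": re.compile(r"instance .* set to ERROR"),
    "fired alarm": re.compile(r"alarm .* transitioned to state alarm"),
}

# Assumed "YYYY-MM-DD HH:MM:SS.mmm" timestamp at the start of each log line.
TS = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3})")


def collect(lines):
    """Return {checkpoint name: datetime} for the first matching log line."""
    found = {}
    for line in lines:
        ts = TS.match(line)
        if not ts:
            continue
        stamp = datetime.strptime(ts.group(1), "%Y-%m-%d %H:%M:%S.%f")
        for name, pattern in PATTERNS.items():
            if name not in found and pattern.search(line):
                found[name] = stamp
    return found
```

Subtracting any two collected datetimes yields the elapsed time between those
checkpoints.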

The performance profiler will be integrated into the verification job to
provide detailed results for the test. It can also be deployed independently to
diagnose performance issues in a specific environment.

Working Items
=============

1. PoC with limited checkpoints
2. Integration with the verification job
3. Collect timestamps at all checkpoints
4. Display the profiling result in the console
5. Report the profiling result to the test database
6. Independent package which can be installed in a specified environment