.. This work is licensed under a Creative Commons Attribution 4.0 International License.
.. http://creativecommons.org/licenses/by/4.0

..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

============================
Notification Alarm Evaluator
============================

This is a draft spec of a blueprint for OpenStack Ceilometer Liberty.
To see the current version: https://review.openstack.org/172893
To track development activity:
https://blueprints.launchpad.net/ceilometer/+spec/notification-alarm-evaluator

This blueprint proposes adding a new alarm evaluator that handles alarms on
events passed from other OpenStack services. It provides event-driven alarm
evaluation, which introduces a new sequence in Ceilometer rather than relying
on the polling-based approach of the existing Alarm Evaluator, and delivers
alarm notifications to end users immediately.

As an end user, I need to receive an alarm notification immediately once
Ceilometer captures an event that would cause an alarm to fire, so that I can
perform recovery actions promptly and shorten the downtime of my service.
The typical use case is that an end user sets an alarm on
"compute.instance.update" in order to trigger recovery actions once the
instance status has changed to 'shutdown' or 'error'. Ideally, an end user
would receive the notification within 1 second after the fault is observed,
as other health-check mechanisms can do in some cases.

The existing Alarm Evaluator periodically queries (polls) the databases
in order to check all alarms independently of other processes. This is a good
approach for evaluating an alarm on samples stored over a certain period.
However, it is not efficient for evaluating an alarm on events that are
emitted by other OpenStack services only once in a while.

Periodic evaluation delays the sending of alarm notifications to users.
The default period of the evaluation cycle is 60 seconds. It is recommended
that an operator set an interval longer than the configured pipeline interval
for the underlying metrics, and long enough to evaluate all defined alarms
within a certain period while taking into account the number of resources,
users and

The proposal is to add a new event-driven alarm evaluator which receives
messages from the Notification Agent, finds related alarms, then evaluates each

* The new alarm evaluator can receive event notifications from the
  Notification Agent by adding a dedicated notifier as a publisher in
  pipeline.yaml (e.g. notifier://?topic=event_eval).

* When the new alarm evaluator receives an event notification, it queries the
  alarm database by the Project ID and Resource ID written in the event
  notification.

* The alarms found are evaluated against the event notification.

* Depending on the result of the evaluation, those alarms are fired through
  the Alarm Notifier, in the same way as the existing Alarm Evaluator does.
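
The four steps above can be sketched in a few lines of Python. This is only an
illustrative sketch, not the proposed implementation: ``NotificationAlarm``,
``required_traits``, ``find_alarms``, ``rule_matches`` and ``on_event`` are
hypothetical names, the alarm database is replaced by an in-memory list, and
the rule is assumed to be a simple equality check on event traits.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class NotificationAlarm:
    """Hypothetical in-memory stand-in for a stored "notification" alarm."""
    name: str
    project_id: str
    event_type: str
    resource_id: Optional[str] = None  # optional, as in the proposal
    required_traits: Dict[str, str] = field(default_factory=dict)  # assumed rule format
    state: str = "ok"


def find_alarms(alarms, project_id, resource_id):
    """Select alarms by the Project ID and Resource ID named in the event."""
    return [
        a for a in alarms
        if a.project_id == project_id
        and a.resource_id in (None, resource_id)
    ]


def rule_matches(alarm, event):
    """Check the event type and every required trait value (equality only)."""
    if alarm.event_type != event["event_type"]:
        return False
    traits = event.get("traits", {})
    return all(traits.get(k) == v for k, v in alarm.required_traits.items())


def on_event(event, alarms, notify):
    """Handle one event notification; return the names of the fired alarms."""
    fired = []
    for alarm in find_alarms(alarms, event["project_id"], event["resource_id"]):
        if rule_matches(alarm, event):
            alarm.state = "alarm"
            notify(alarm, event)  # stand-in for handing off to the Alarm Notifier
            fired.append(alarm.name)
    return fired
```

In this sketch ``notify`` is any callable, so the evaluator stays independent
of how the Alarm Notifier is reached; an alarm without a ``resource_id`` is
treated as a candidate for every event in its project.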

This proposal also adds a new alarm type "notification" and a new
"notification_rule". This enables users to create alarms on events. The
separation from other alarm types (such as the "threshold" type) is intended
to reflect the different timing of evaluation and the different format of the
condition: the new evaluator checks each event notification as soon as it is
received, whereas a "threshold" alarm can evaluate the average of values over
a certain period, calculated from multiple samples.

The new alarm evaluator handles "notification" type alarms, so we have to
change the existing alarm evaluator to exclude "notification" type alarms
from evaluation

There was a similar blueprint proposal, "Alarm type based on notification",
but the approach is different. The old proposal was to add a new step (alarm
evaluation) in the Notification Agent every time it received an event from
other OpenStack services, whereas this proposal executes alarm evaluation in
a separate component, which minimizes the impact on existing pipeline
processing.

Another approach is to enhance the existing alarm evaluator by adding a
notification listener. However, there are two issues: 1) this approach could
stall periodic evaluations when the evaluator receives a burst of
notifications, and 2) it could break alarm partitioning, i.e. when an alarm
evaluator receives a notification, it might have to evaluate alarms that are
not assigned to it.

Resource ID will be added to the Alarm model as an optional attribute.
This helps the new alarm evaluator filter out unrelated alarms while querying
alarms; otherwise it has to evaluate all alarms in the project.
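
The effect of the optional attribute on the alarm query can be illustrated
with a small sketch. The function name ``alarms_to_evaluate`` and the
dictionary-based alarm records are assumptions made for illustration; the
real query would run against the alarm database.

```python
def alarms_to_evaluate(alarms, project_id, resource_id):
    """Return names of alarms that must be evaluated for one event.

    An alarm with no "resource_id" stays a candidate for every event in
    its project; an alarm with one is kept only when the IDs match.
    """
    return [
        alarm["name"]
        for alarm in alarms
        if alarm["project_id"] == project_id
        and alarm.get("resource_id") in (None, resource_id)
    ]


# Hypothetical stored alarms; IDs are shortened for readability.
stored = [
    {"name": "vm1-status", "project_id": "p1", "resource_id": "vm-1"},
    {"name": "vm2-status", "project_id": "p1", "resource_id": "vm-2"},
    {"name": "any-instance", "project_id": "p1"},
]

print(alarms_to_evaluate(stored, "p1", "vm-1"))
```

For an event on ``vm-1`` this prints ``['vm1-status', 'any-instance']``: the
alarm bound to ``vm-2`` is skipped, while the alarm without a ``resource_id``
is still evaluated.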

The Alarm API will be extended as follows:

* Add the "notification" type to the alarm type list
* Add "resource_id" to "alarm"
* Add "notification_rule" to "alarm"

Sample data of Notification-type alarm::

        "http://site:8000/alarm"
    "description": "An alarm",
    "insufficient_data_actions": [
        "http://site:8000/nodata"
    "name": "InstanceStatusAlarm",
    "notification_rule": {
        "event_type": "compute.instance.update",
                "field" : "traits.state",
    "project_id": "c96c887c216949acbdfbd8b494863567",
    "repeat_actions": false,
    "resource_id": "153462d0-a9b8-4b5b-8175-9e4b05e9b856",
    "severity": "moderate",
    "state_timestamp": "2015-04-03T17:49:38.406845",
    "timestamp": "2015-04-03T17:49:38.406839",
    "type": "notification",
    "user_id": "c96c887c216949acbdfbd8b494863567"

"resource_id" will be referred to when querying alarms; it will not be checked
for permission or for project ownership.

Other end user impact
---------------------

Performance/Scalability Impacts
-------------------------------

When Ceilometer receives a large number of events from other OpenStack
services in a short period, this alarm evaluator can keep working since events
are queued in a messaging queue system, but this can delay alarm notifications
to users and increase the number of read and write accesses to the alarm
database.

"resource_id" can be optional, but making it mandatory could reduce the
performance impact. If a user creates a "notification" alarm without
"resource_id", that alarm will be evaluated every time an event occurs in the
project. That may put a heavy load on the new evaluator.

Other deployer impact
---------------------

A new service process has to be run.

Developers should be aware that events could be notified to end users, and
should avoid passing raw infrastructure information to end users when defining
events and traits.

* New event-driven alarm evaluator

* Add new alarm type "notification" as well as AlarmNotificationRule

* Add "resource_id" to Alarm model

* Modify existing alarm evaluator to filter out "notification" alarms

* Add a new config parameter that controls whether alarm requests without
  "resource_id" are accepted
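
Two of the work items above, the filter in the existing evaluator and the
request check driven by the new config parameter, could look roughly like the
following. This is a sketch under assumptions: the option name
``require_resource_id`` and both function names are hypothetical and do not
come from Ceilometer.

```python
def alarms_for_polling_evaluator(alarms):
    """Drop "notification" alarms so the polling evaluator skips them."""
    return [a for a in alarms if a.get("type") != "notification"]


def accepts_alarm_request(alarm, require_resource_id):
    """Decide whether an alarm request passes the resource_id check.

    ``require_resource_id`` stands for the proposed config parameter
    (hypothetical name). Only "notification" alarms are affected.
    """
    if alarm.get("type") != "notification":
        return True
    if require_resource_id and not alarm.get("resource_id"):
        return False
    return True
```

Keeping the check limited to "notification" alarms means requests for
existing alarm types behave the same whichever way the operator sets the
parameter.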

This proposal is a key feature for providing information about cloud resources
to end users in real time, which enables efficient integration with a
user-side manager or Orchestrator, whereas currently this information is
considered to be consumed by admin-side tools or services.
Based on this change, we will explore orchestration scenarios, including fault
recovery, and add useful event definitions as well as additional traits.

New unit/scenario tests are required for this change.

* The proposed evaluator will be described in the developer documentation.

* The new alarm type and how to use it will be explained in the user guide.

* OPNFV Doctor project: https://wiki.opnfv.org/doctor

* Blueprint "Alarm type based on notification":
  https://blueprints.launchpad.net/ceilometer/+spec/alarm-on-notification