This work is licensed under a Creative Commons Attribution 3.0 Unported
License.

http://creativecommons.org/licenses/by/3.0/legalcode
============================
Notification Alarm Evaluator
============================
This is a spec draft of a blueprint for OpenStack Ceilometer Liberty.
To see the current version: https://review.openstack.org/172893
To track development activity:
https://blueprints.launchpad.net/ceilometer/+spec/notification-alarm-evaluator
This blueprint proposes a new alarm evaluator that handles alarms on events
passed from other OpenStack services. It provides event-driven alarm
evaluation, which introduces a new sequence in Ceilometer rather than relying
on the polling-based approach of the existing Alarm Evaluator, and delivers
immediate alarm notification to end users.
As an end user, I need to receive an alarm notification immediately once
Ceilometer captures an event that would fire an alarm, so that I can
perform recovery actions promptly and shorten the downtime of my service.
The typical use case is that an end user sets an alarm on
"compute.instance.update" in order to trigger recovery actions once the
instance status has changed to 'shutdown' or 'error'. Ideally, an end user
would receive the notification within 1 second after the fault is observed,
as other health-check mechanisms can do in some cases.

The existing Alarm Evaluator periodically queries (polls) the databases
to check all alarms, independently of other processes. This is a good
approach for evaluating an alarm on samples stored over a certain period.
However, it is not an efficient way to evaluate an alarm on events, which are
emitted by other OpenStack services only once in a while.

The periodic evaluation delays alarm notifications to users.
The default period of the evaluation cycle is 60 seconds. It is recommended
that an operator set an interval longer than the configured pipeline interval
for the underlying metrics, and long enough to evaluate all defined alarms
within a certain period while taking into account the number of resources, users and
The proposal is to add a new event-driven alarm evaluator which receives
messages from the Notification Agent and finds related alarms, then evaluates each

* The new alarm evaluator receives event notifications from the Notification
  Agent through a dedicated notifier added as a publisher in pipeline.yaml
  (e.g. notifier://?topic=event_eval).

* When the new alarm evaluator receives an event notification, it queries the
  alarm database by the Project ID and Resource ID written in the event
  notification.

* The alarms found are evaluated against the event notification.

* Depending on the result of the evaluation, those alarms are fired through
  the Alarm Notifier, in the same way as the existing Alarm Evaluator does.
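
The flow in the bullets above can be sketched roughly as follows. This is a
minimal in-memory illustration, not Ceilometer code: the names "Alarm",
"find_alarms", "handle_event" and the "notify" callback are hypothetical, the
alarm database is replaced by a plain list, and the rule check is reduced to
an event-type comparison.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Alarm:
    # Hypothetical, simplified stand-in for the alarm model.
    name: str
    project_id: str
    event_type: str
    resource_id: Optional[str] = None  # optional, as proposed in this spec
    state: str = "ok"


def find_alarms(alarms: List[Alarm], project_id: str,
                resource_id: str) -> List[Alarm]:
    """Return alarms of the project that target this resource or no resource."""
    return [
        a for a in alarms
        if a.project_id == project_id
        and a.resource_id in (None, resource_id)
    ]


def handle_event(event: Dict[str, str], alarms: List[Alarm],
                 notify: Callable[[Alarm, Dict[str, str]], None]) -> None:
    """Evaluate the related alarms as soon as one event notification arrives."""
    related = find_alarms(alarms, event["project_id"], event["resource_id"])
    for alarm in related:
        if alarm.event_type == event["event_type"]:
            alarm.state = "alarm"
            notify(alarm, event)  # stands in for the Alarm Notifier


alarms = [
    Alarm("InstanceStatusAlarm", "p1", "compute.instance.update", "r1"),
    Alarm("OtherResource", "p1", "compute.instance.update", "r2"),
    Alarm("OtherProject", "p2", "compute.instance.update"),
]
fired: List[str] = []
handle_event(
    {"project_id": "p1", "resource_id": "r1",
     "event_type": "compute.instance.update"},
    alarms,
    lambda alarm, event: fired.append(alarm.name),
)
print(fired)
```

Only the alarm whose project and resource match the event is evaluated; no
periodic polling is involved, which is the point of the event-driven design.
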
This proposal also adds a new alarm type "notification" and a new
"notification_rule". This enables users to create alarms on events. The
separation from other alarm types (such as the "threshold" type) is intended
to reflect the different timing of evaluation and the different format of the
condition: the new evaluator checks each event notification as soon as it is
received, whereas a "threshold" alarm can evaluate the average of values over
a certain period, calculated from multiple samples.

The new alarm evaluator handles "notification" type alarms, so we have to
change the existing alarm evaluator to exclude "notification" type alarms from evaluation
There was a similar blueprint proposal, "Alarm type based on notification",
but the approach is different. The old proposal was to add a new step (alarm
evaluation) in the Notification Agent, executed every time it receives an event
from other OpenStack services, whereas this proposal executes alarm evaluation
in a separate component, which minimizes the impact on existing pipeline
processing.

Another approach is to enhance the existing alarm evaluator by adding a
notification listener. However, there are two issues: 1) this approach could
stall periodic evaluations when the evaluator receives a burst of
notifications, and 2) it could break alarm partitioning, i.e. when an alarm
evaluator receives a notification, it might have to evaluate alarms which are
not assigned to it.
Resource ID will be added to the Alarm model as an optional attribute.
This helps the new alarm evaluator filter out unrelated alarms while querying
alarms; otherwise it has to evaluate all alarms in the project.
The Alarm API will be extended as follows:

* Add the "notification" type to the alarm type list
* Add "resource_id" to "alarm"
* Add "notification_rule" to "alarm"
Sample data of a Notification-type alarm::

        "http://site:8000/alarm"
    "description": "An alarm",
    "insufficient_data_actions": [
        "http://site:8000/nodata"
    "name": "InstanceStatusAlarm",
    "notification_rule": {
        "event_type": "compute.instance.update",
                "field" : "traits.state",
    "project_id": "c96c887c216949acbdfbd8b494863567",
    "repeat_actions": false,
    "resource_id": "153462d0-a9b8-4b5b-8175-9e4b05e9b856",
    "severity": "moderate",
    "state_timestamp": "2015-04-03T17:49:38.406845",
    "timestamp": "2015-04-03T17:49:38.406839",
    "type": "notification",
    "user_id": "c96c887c216949acbdfbd8b494863567"

"resource_id" is used only to query alarms; it is not checked for permission
or for whether the resource belongs to the project.
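
To illustrate how a "notification_rule" might be checked against an incoming
event, here is a minimal sketch. The "query" list layout, the comparison
operator "eq", the value "error", the nested event layout and the helper names
"get_field" and "rule_matches" are all assumptions made for illustration; only
"event_type" and the field "traits.state" come from the sample above.

```python
import operator
from typing import Any, Dict

# Assumed operator set; the spec does not list the supported operators.
OPS = {
    "eq": operator.eq,
    "ne": operator.ne,
}


def get_field(event: Dict[str, Any], path: str) -> Any:
    """Resolve a dotted path such as 'traits.state' in a nested dict."""
    value: Any = event
    for key in path.split("."):
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    return value


def rule_matches(rule: Dict[str, Any], event: Dict[str, Any]) -> bool:
    """True when the event type matches and every query condition holds."""
    if rule["event_type"] != event.get("event_type"):
        return False
    return all(
        OPS[cond["op"]](get_field(event, cond["field"]), cond["value"])
        for cond in rule.get("query", [])
    )


rule = {
    "event_type": "compute.instance.update",
    # Hypothetical condition: only "field" appears in the sample above.
    "query": [{"field": "traits.state", "op": "eq", "value": "error"}],
}
event = {
    "event_type": "compute.instance.update",
    "traits": {"state": "error"},
}
print(rule_matches(rule, event))
```

Because each event is checked on its own, the rule needs no aggregation
window, unlike the period-based condition of a "threshold" alarm.
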
Other end user impact
---------------------

Performance/Scalability Impacts
-------------------------------

When Ceilometer receives a large number of events from other OpenStack
services in a short period, this alarm evaluator can keep working since events
are queued in a messaging queue system, but it can delay alarm notifications to
users and increase the number of read and write accesses to the alarm database.

"resource_id" can be optional, but making it mandatory could reduce the
performance impact. If a user creates "notification" alarms without
"resource_id", those alarms will be evaluated every time an event occurs in the
project. That may put a heavy load on the new evaluator.

Other deployer impact
---------------------

A new service process has to be run.

Developers should be aware that events could be notified to end users, and
should avoid passing raw infrastructure information to end users when defining
events and traits.

* New event-driven alarm evaluator

* Add new alarm type "notification" as well as AlarmNotificationRule

* Add "resource_id" to Alarm model

* Modify existing alarm evaluator to filter out "notification" alarms

* Add a new config parameter for the alarm request check that controls whether
  alarms without "resource_id" are accepted
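
The request check in the last work item could look like the sketch below. The
option name "require_resource_id", the exception "InvalidAlarm" and the
function "validate_alarm_request" are hypothetical; the spec does not name the
config parameter or define how a rejection is reported.

```python
from typing import Any, Dict


class InvalidAlarm(ValueError):
    """Raised when an alarm request is rejected."""


def validate_alarm_request(alarm: Dict[str, Any],
                           require_resource_id: bool) -> None:
    """Reject "notification" alarms lacking resource_id when the option is on."""
    if alarm.get("type") != "notification":
        return  # other alarm types are not affected by this check
    if require_resource_id and not alarm.get("resource_id"):
        raise InvalidAlarm('"notification" alarms must specify "resource_id"')


# Accepted: resource_id is present.
validate_alarm_request({"type": "notification", "resource_id": "r1"}, True)
# Accepted: the (hypothetical) option is disabled.
validate_alarm_request({"type": "notification"}, False)
# Accepted: not a "notification" alarm.
validate_alarm_request({"type": "threshold"}, True)
```

With the option enabled, a "notification" alarm without "resource_id" is
refused at creation time, which avoids evaluating it on every event in the
project.
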

This proposal is a key feature for providing information about cloud resources
to end users in real time, which enables efficient integration with a user-side
manager or Orchestrator, whereas currently this information is considered to
be consumed by admin-side tools or services.
Based on this change, we will explore orchestration scenarios including fault
recovery, and add useful event definitions as well as additional traits.

New unit/scenario tests are required for this change.

* The proposed evaluator will be described in the developer documentation.

* The new alarm type and how to use it will be explained in the user guide.

* OPNFV Doctor project: https://wiki.opnfv.org/doctor

* Blueprint "Alarm type based on notification":
  https://blueprints.launchpad.net/ceilometer/+spec/alarm-on-notification