.. This work is licensed under a Creative Commons Attribution 4.0 International License.
.. http://creativecommons.org/licenses/by/4.0

============================
Notification Alarm Evaluator
============================

This is a draft spec of a blueprint for OpenStack Ceilometer Liberty.

To see the current version: https://review.openstack.org/172893

To track development activity:
https://blueprints.launchpad.net/ceilometer/+spec/notification-alarm-evaluator

This blueprint proposes a new alarm evaluator for handling alarms on events
passed from other OpenStack services. It provides event-driven alarm
evaluation, which introduces a new sequence in Ceilometer rather than relying
on the polling-based approach of the existing Alarm Evaluator, and delivers
immediate alarm notification to end users.

As an end user, I need to receive an alarm notification immediately once
Ceilometer captures an event that would fire an alarm, so that I can perform
recovery actions promptly and shorten the downtime of my service.
The typical use case is that an end user sets an alarm on
"compute.instance.update" in order to trigger recovery actions once the
instance status has changed to 'shutdown' or 'error'. Ideally, an end user
would receive the notification within 1 second of the fault being observed,
as other health-check mechanisms can do in some cases.

The existing Alarm Evaluator periodically queries (polls) the databases in
order to check all alarms independently of other processes. This is a good
approach for evaluating an alarm on samples stored over a certain period.
However, it is not an efficient way to evaluate an alarm on events, which are
emitted by other OpenStack services only once in a while.

Periodic evaluation delays the sending of alarm notifications to users.
The default period of the evaluation cycle is 60 seconds. It is recommended
that an operator set an interval longer than the configured pipeline interval
for the underlying metrics, and also long enough to evaluate all defined
alarms within a certain period while taking into account the number of
resources, users and

The proposal is to add a new event-driven alarm evaluator which receives
messages from the Notification Agent and finds related alarms, then evaluates each

* The new alarm evaluator can receive event notifications from the
  Notification Agent by adding a dedicated notifier as a publisher in
  pipeline.yaml (e.g. notifier://?topic=event_eval).

* When the new alarm evaluator receives an event notification, it queries the
  alarm database by the Project ID and Resource ID written in the event
  notification.

* The alarms found are evaluated against the event notification.

* Depending on the result of the evaluation, those alarms are fired through
  the Alarm Notifier, in the same way as the existing Alarm Evaluator does.
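
The steps above can be sketched as a single handler that is called once per
event notification. This is only an illustrative outline, not the proposed
implementation: ``handle_event`` and its three callables (``find_alarms``,
``matches``, ``notify``) are hypothetical names, and the event is assumed to be
a plain dict carrying ``project_id`` and ``resource_id`` keys.

```python
def handle_event(event, find_alarms, matches, notify):
    """Run one event notification through the steps listed above.

    All three callables are hypothetical hooks, not Ceilometer APIs:
      find_alarms(project_id, resource_id) -> iterable of alarms
      matches(alarm, event) -> bool
      notify(alarm, event) -> None
    """
    # Look up candidate alarms by the IDs carried in the event.
    candidates = find_alarms(event["project_id"], event["resource_id"])
    fired = []
    for alarm in candidates:
        # Evaluate each candidate against this single event.
        if matches(alarm, event):
            # Hand the alarm to the notifier, as the existing evaluator does.
            notify(alarm, event)
            fired.append(alarm)
    return fired
```

Passing the lookup, the rule check and the notifier in as callables keeps the
handler free of database and messaging details, which mirrors the intent of
running evaluation in a component separate from the pipeline.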

This proposal also adds a new alarm type, "notification", together with
"notification_rule". This enables users to create alarms on events. The
separation from other alarm types (such as the "threshold" type) is intended
to reflect the different timing of evaluation and the different format of the
condition: the new evaluator checks each event notification as soon as it is
received, whereas a "threshold" alarm can evaluate the average of values over
a certain period, calculated from multiple samples.
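
To illustrate how a single event could be checked against a
"notification_rule", here is a minimal sketch. Several details are assumptions
made for the example rather than part of this spec: the ``query`` list with
``op`` and ``value`` keys, the set of supported operators, and the
representation of the event as a nested dict so that a dotted field such as
``traits.state`` can be resolved.

```python
import operator

# Comparison operators assumed for illustration only.
_OPS = {"eq": operator.eq, "ne": operator.ne}


def get_field(event, path):
    """Resolve a dotted path such as "traits.state" in a nested dict.

    Returns None when any component of the path is missing.
    """
    value = event
    for key in path.split("."):
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    return value


def rule_matches(rule, event):
    """Return True when the event satisfies the notification rule."""
    # The event type must match before any field condition is checked.
    if rule["event_type"] != event.get("event_type"):
        return False
    # Every query entry must hold for this one event.
    return all(
        _OPS[q["op"]](get_field(event, q["field"]), q["value"])
        for q in rule.get("query", [])
    )
```

Unlike a "threshold" evaluation, nothing here aggregates over a period: the
decision is made from one event alone.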

The new alarm evaluator handles notification-type alarms, so we have to change
the existing alarm evaluator to exclude "notification" type alarms from evaluation

There was a similar blueprint proposal, "Alarm type based on notification",
but the approach is different. The old proposal was to add a new step (alarm
evaluation) in the Notification Agent every time it received an event from
other OpenStack services, whereas this proposal executes alarm evaluation in
another component, which can minimize the impact on existing pipeline
processing.

Another approach is to enhance the existing alarm evaluator by adding a
notification listener. However, there are two issues: 1) this approach could
stall periodic evaluations when it receives a large number of notifications,
and 2) it could break alarm partitioning, i.e. when an alarm evaluator
receives a notification, it might have to evaluate alarms that are not
assigned to it.

Resource ID will be added to the Alarm model as an optional attribute.
This helps the new alarm evaluator filter out unrelated alarms while querying
alarms; otherwise it has to evaluate all alarms in the project.
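
The effect of the optional attribute on the alarm query can be sketched as
follows. This is an in-memory stand-in for the database query, with alarms
represented as plain dicts; ``find_candidate_alarms`` is a hypothetical helper
name, and treating a missing "resource_id" as "matches every resource in the
project" is the interpretation assumed here.

```python
def find_candidate_alarms(alarms, project_id, resource_id):
    """Select the alarms that an event on the given resource could trigger.

    An alarm without a resource_id is kept for every event in its project,
    which is why such alarms are more expensive to evaluate.
    """
    return [
        alarm for alarm in alarms
        if alarm["project_id"] == project_id
        and alarm.get("resource_id") in (None, resource_id)
    ]
```

With "resource_id" set, an alarm is only a candidate for events on that one
resource; without it, the alarm is a candidate for every event in the project.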

The Alarm API will be extended as follows:

* Add "notification" type to the alarm type list
* Add "resource_id" to "alarm"
* Add "notification_rule" to "alarm"

Sample data of a notification-type alarm::

        "http://site:8000/alarm"
    "description": "An alarm",
    "insufficient_data_actions": [
        "http://site:8000/nodata"
    "name": "InstanceStatusAlarm",
    "notification_rule": {
        "event_type": "compute.instance.update",
                "field" : "traits.state",
    "project_id": "c96c887c216949acbdfbd8b494863567",
    "repeat_actions": false,
    "resource_id": "153462d0-a9b8-4b5b-8175-9e4b05e9b856",
    "severity": "moderate",
    "state_timestamp": "2015-04-03T17:49:38.406845",
    "timestamp": "2015-04-03T17:49:38.406839",
    "type": "notification",
    "user_id": "c96c887c216949acbdfbd8b494863567"


"resource_id" is used only to query alarms; it is not checked for permission
or for project membership.

Other end user impact
---------------------

Performance/Scalability Impacts
-------------------------------

When Ceilometer receives a large number of events from other OpenStack
services in a short period, this alarm evaluator can keep working since events
are queued in a messaging queue system, but this can delay alarm notification
to users and increase the number of read and write accesses to the alarm
database.

"resource_id" can be optional, but making it mandatory could reduce the
performance impact. If a user creates a "notification" alarm without
"resource_id", that alarm will be evaluated every time an event occurs in the
project. That may put a heavy load on the new evaluator.

Other deployer impact
---------------------

A new service process has to be run.

Developers should be aware that events could be notified to end users, and
should avoid passing raw infrastructure information to end users when defining
events and traits.

* New event-driven alarm evaluator

* Add new alarm type "notification" as well as AlarmNotificationRule

* Add "resource_id" to Alarm model

* Modify existing alarm evaluator to filter out "notification" alarms

* Add a new config parameter for the alarm request check that controls whether
  alarms without "resource_id" are accepted
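
As a rough illustration of the last work item, the request check could look
like the sketch below. The option name ``require_resource_id``, the exception
class and the function name are all hypothetical; the spec does not name the
config parameter, and a real implementation would read it from the service
configuration rather than take it as an argument.

```python
class InvalidAlarm(ValueError):
    """Raised when an alarm request is rejected (hypothetical class)."""


def check_alarm_request(alarm, require_resource_id=False):
    """Reject "notification" alarms lacking resource_id when required.

    `require_resource_id` stands in for the proposed config parameter.
    Alarms of other types are not affected by this check.
    """
    if alarm.get("type") != "notification":
        return
    if require_resource_id and not alarm.get("resource_id"):
        raise InvalidAlarm(
            '"notification" alarms must specify "resource_id"')
```

A deployer concerned about evaluation load could enable the check, while the
default shown here keeps "resource_id" optional as described in the data model
change.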

This proposal is a key feature for providing information about cloud resources
to end users in real time, which enables efficient integration with a
user-side manager or Orchestrator, whereas currently that information is
considered to be consumed by admin-side tools or services.
Based on this change, we will explore orchestration scenarios including fault
recovery, and add useful event definitions as well as additional traits.

New unit/scenario tests are required for this change.

* The proposed evaluator will be described in the developer documentation.

* The new alarm type and how to use it will be explained in the user guide.

* OPNFV Doctor project: https://wiki.opnfv.org/doctor

* Blueprint "Alarm type based on notification":
  https://blueprints.launchpad.net/ceilometer/+spec/alarm-on-notification