
   1 Detailed implementation plan
   2 ============================
   3
   4 This section describes a detailed implementation plan, which is based on the
   5 high level architecture introduced in Section 3. Section 5.1 describes the
   6 functional blocks of the Doctor architecture, which is followed by a high level
   7 message flow in Section 5.2. Section 5.3 provides a mapping of selected existing
   8 open source components to the building blocks of the Doctor architecture.
The selection of components is based on their maturity and the gap
analysis in Section 4. Sections 5.4 and 5.5 specify the related information
elements and the related northbound interface, respectively. Finally,
Section 5.6 provides a first set of blueprints addressing selected gaps that
must be closed to realize the functionalities of the Doctor project.
  14
  15 Functional Blocks
  16 -----------------
  17
This section introduces the functional blocks that form the VIM. OpenStack was
selected as the candidate for implementation. Inside the VIM, four
building blocks are defined (see :num:`Figure #figure6`).
  21
  22 .. _figure6:
  23
  24 .. figure:: images/figure6.png
  25    :width: 100%
  26
  27    Functional blocks
  28
  29 Monitor
  30 ^^^^^^^
  31
  32 The Monitor module has the responsibility for monitoring the virtualized
  33 infrastructure. There are already many existing tools and services (e.g. Zabbix)
  34 to monitor different aspects of hardware and software resources which can be
  35 used for this purpose.
  36
  37 Inspector
  38 ^^^^^^^^^
  39
  40 The Inspector module has the ability a) to receive various failure notifications
  41 regarding physical resource(s) from Monitor module(s), b) to find the affected
  42 virtual resource(s) by querying the resource map in the Controller, and c) to
  43 update the state of the virtual resource (and physical resource).
  44
  45 The Inspector has drivers for different types of events and resources to
  46 integrate any type of Monitor and Controller modules. It also uses a failure
  47 policy database to decide on the failure selection and aggregation from raw
  48 events. This failure policy database is configured by the Administrator.
  49
  50 The reason for separation of the Inspector and Controller modules is to make the
  51 Controller focus on simple operations by avoiding a tight integration of various
  52 health check mechanisms into the Controller.
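
The three Inspector duties listed above can be illustrated with a minimal
sketch. It is an illustration under stated assumptions rather than the Doctor
implementation: ``ResourceMap``, its methods, the resource names, and the state
strings are hypothetical stand-ins for the resource map kept by the Controller.

.. code-block:: python

   from dataclasses import dataclass, field
   from typing import Dict, List


   @dataclass
   class ResourceMap:
       """Hypothetical stand-in for the Controller's resource map."""
       hosts: Dict[str, List[str]] = field(default_factory=dict)  # physical -> virtual
       state: Dict[str, str] = field(default_factory=dict)        # resource -> state

       def virtual_resources_on(self, physical_id: str) -> List[str]:
           return list(self.hosts.get(physical_id, []))

       def set_state(self, resource_id: str, new_state: str) -> None:
           self.state[resource_id] = new_state


   def inspect_failure(resource_map: ResourceMap, physical_id: str) -> List[str]:
       """a) Receive a failure notification for one physical resource."""
       # b) Find the affected virtual resources in the resource map.
       affected = resource_map.virtual_resources_on(physical_id)
       # c) Update the state of the physical and the virtual resources.
       resource_map.set_state(physical_id, "down")
       for virtual_id in affected:
           resource_map.set_state(virtual_id, "error")
       return affected


   resource_map = ResourceMap(hosts={"host-1": ["vm-a", "vm-b"], "host-2": ["vm-c"]})
   affected = inspect_failure(resource_map, "host-1")
   print(affected, resource_map.state)

Resources on ``host-2`` are left untouched, which is the point of querying the
resource map instead of broadcasting the raw fault.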
  53
  54 Controller
  55 ^^^^^^^^^^
  56
The Controller is responsible for maintaining the resource map (i.e. the mapping
from physical resources to virtual resources), accepting update requests for the
resource state(s) (exposed as a provider API), and sending all failure events
regarding virtual resources to the Notifier. Optionally, the Controller has the
ability to poison the state of virtual resources mapping to physical resources
for which it has received failure notifications from the Inspector. The
Controller also re-calculates the capacity of the NFVI when receiving a failure
notification for a physical resource.
  65
  66 In a real-world deployment, the VIM may have several controllers, one for each
  67 resource type, such as Nova, Neutron and Cinder in OpenStack. Each controller
  68 maintains a database of virtual and physical resources which shall be the master
  69 source for resource information inside the VIM.
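
As a rough illustration of the capacity re-calculation mentioned above, the
following sketch sums the vCPUs of all hosts that are still usable. The
function name, the host names, and the choice of vCPUs as the capacity metric
are assumptions made for this example only.

.. code-block:: python

   from typing import Dict


   def usable_vcpus(host_vcpus: Dict[str, int], host_state: Dict[str, str]) -> int:
       """Sum the vCPUs of all hosts that are in "normal" state."""
       return sum(
           vcpus
           for host, vcpus in host_vcpus.items()
           if host_state.get(host, "normal") == "normal"
       )


   host_vcpus = {"host-1": 16, "host-2": 32, "host-3": 16}
   host_state = {"host-1": "normal", "host-2": "normal", "host-3": "normal"}

   before = usable_vcpus(host_vcpus, host_state)
   host_state["host-2"] = "down"  # failure notification for host-2
   after = usable_vcpus(host_vcpus, host_state)
   print(before, after)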
  70
  71 Notifier
  72 ^^^^^^^^
  73
The focus of the Notifier is on selecting and aggregating failure events
received from the Controller based on policies mandated by the Consumer.
Therefore, it allows the Consumer to subscribe to alarms regarding virtual
resources using a method such as an API endpoint. After receiving a fault
event from a Controller, the Notifier notifies the Consumer of the fault by
referring to the alarm configuration which was defined by the Consumer earlier on.
  80
  81 To reduce complexity of the Controller, it is a good approach for the
  82 Controllers to emit all notifications without any filtering mechanism and have
  83 another service (i.e. Notifier) handle those notifications properly. This is the
  84 general philosophy of notifications in OpenStack. Note that a fault message
  85 consumed by the Notifier is different from the fault message received by the
  86 Inspector; the former message is related to virtual resources which are visible
  87 to users with relevant ownership, whereas the latter is related to raw devices
  88 or small entities which should be handled with an administrator privilege.
  89
  90 The northbound interface between the Notifier and the Consumer/Administrator is
  91 specified in Section 5.5.
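
The subscribe-then-notify behaviour of the Notifier can be sketched as
follows. This is a simplified model, not the interface specified in Section
5.5: the ``Notifier`` class, its method names, and the callback-based delivery
are hypothetical, and the only filter shown is a set of virtual resource IDs.

.. code-block:: python

   from typing import Callable, Dict, List, Optional, Set, Tuple

   Callback = Callable[[str, str], None]


   class Notifier:
       """Hypothetical Notifier keeping one alarm configuration per subscription."""

       def __init__(self) -> None:
           self._subscriptions: Dict[int, Tuple[Optional[Set[str]], Callback]] = {}
           self._next_id = 1

       def subscribe(self, callback: Callback,
                     resource_ids: Optional[Set[str]] = None) -> int:
           """Register a Consumer; resource_ids=None means "all resources"."""
           subscription_id = self._next_id
           self._next_id += 1
           self._subscriptions[subscription_id] = (resource_ids, callback)
           return subscription_id

       def handle_event(self, resource_id: str, state: str) -> int:
           """Forward one Controller event to every matching subscription."""
           delivered = 0
           for resource_ids, callback in self._subscriptions.values():
               if resource_ids is None or resource_id in resource_ids:
                   callback(resource_id, state)
                   delivered += 1
           return delivered


   received: List[Tuple[str, str]] = []
   notifier = Notifier()
   subscription_id = notifier.subscribe(
       lambda rid, state: received.append((rid, state)), resource_ids={"vm-a"})
   notifier.handle_event("vm-a", "error")  # matches the subscription
   notifier.handle_event("vm-b", "error")  # filtered out by the Notifier
   print(received)

The Controller side stays trivial in this model: it calls ``handle_event`` for
every state change and leaves all filtering to the Notifier.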
  92
  93 Sequence
  94 --------
  95
  96 Fault Management
  97 ^^^^^^^^^^^^^^^^
  98
  99 The detailed work flow for fault management is as follows (see also :num:`Figure
 100 #figure7`):
 101
 102 1. Request to subscribe to monitor specific virtual resources. A query filter
 103    can be used to narrow down the alarms the Consumer wants to be informed
 104    about.
 105 2. Each subscription request is acknowledged with a subscribe response message.
 106    The response message contains information about the subscribed virtual
 107    resources, in particular if a subscribed virtual resource is in "alarm"
 108    state.
 109 3. The NFVI sends monitoring events for resources the VIM has been subscribed
 110    to. Note: this subscription message exchange between the VIM and NFVI is not
 111    shown in this message flow.
 112 4. Event correlation, fault detection and aggregation in VIM.
 113 5. Database lookup to find the virtual resources affected by the detected fault.
 114 6. Fault notification to Consumer.
7. The Consumer switches to standby configuration (STBY).
 116 8. Instructions to VIM requesting certain actions to be performed on the
 117    affected resources, for example migrate/update/terminate specific
 118    resource(s). After reception of such instructions, the VIM is executing the
 119    requested action, e.g. it will migrate or terminate a virtual resource.
 120
 121    a. Query request from Consumer to VIM to get information about the current
 122    status of a resource.
 123    b. Response to the query request with information about the current status of
 124    the queried resource. In case the resource is in "fault" state, information
 125    about the related fault(s) is returned.
 126
In order to allow for quick reaction to failures, the time interval between
fault detection in step 3 and the corresponding recovery actions in steps 7 and 8
shall be less than 1 second.
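
The 1 second requirement can be checked from two timestamps, for example in a
test harness. The sketch below assumes that both timestamps come from the same
clock; the function and constant names are hypothetical.

.. code-block:: python

   from datetime import datetime, timedelta

   MAX_REACTION_TIME = timedelta(seconds=1)


   def meets_reaction_budget(detected_at: datetime,
                             recovery_started_at: datetime) -> bool:
       """Return True if recovery started less than 1 second after detection."""
       elapsed = recovery_started_at - detected_at
       return timedelta(0) <= elapsed < MAX_REACTION_TIME


   detected = datetime(2015, 1, 1, 12, 0, 0)
   print(meets_reaction_budget(detected, detected + timedelta(milliseconds=350)))
   print(meets_reaction_budget(detected, detected + timedelta(seconds=2)))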
 130
 131 .. _figure7:
 132
 133 .. figure:: images/figure7.png
 134    :width: 100%
 135
 136    Fault management work flow
 137
 138
 139 .. _figure8:
 140
 141 .. figure:: images/figure8.png
 142    :width: 100%
 143
 144    Fault management scenario
 145
 146 :num:`Figure #figure8` shows a more detailed message flow (Steps 4 to 6) between
 147 the 4 building blocks introduced in Section 5.1.
 148
4. The Monitor observes a fault in the NFVI and reports the raw fault to the
   Inspector.
   The Inspector filters and aggregates the faults using pre-configured
   failure policies.
 153
 154 5.
 155    a) The Inspector queries the Resource Map to find the virtual resources
 156    affected by the raw fault in the NFVI.
 157    b) The Inspector updates the state of the affected virtual resources in the
 158    Resource Map.
 159    c) The Controller observes a change of the virtual resource state and informs
 160    the Notifier about the state change and the related alarm(s).
 161    Alternatively, the Inspector may directly inform the Notifier about it.
 162
6. The Notifier performs another round of filtering and aggregation of the
   changes and alarms based on the pre-configured alarm configuration. Finally,
   a fault notification is sent northbound to the Consumer.
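
The filtering and aggregation performed by the Inspector in step 4 could look
like the following sketch. The policy shown (drop raw faults below a severity
threshold, then keep the highest severity per physical resource and fault type)
is an assumption chosen for illustration; the actual failure policies are
configured by the Administrator.

.. code-block:: python

   from typing import Dict, List, Tuple

   RawFault = Tuple[str, str, int]  # (physical resource ID, fault type, severity)


   def apply_failure_policy(raw_faults: List[RawFault],
                            drop_below: int) -> Dict[Tuple[str, str], int]:
       """Filter raw faults and aggregate them per (resource, fault type)."""
       aggregated: Dict[Tuple[str, str], int] = {}
       for resource_id, fault_type, severity in raw_faults:
           if severity < drop_below:
               continue  # filtered out by the (hypothetical) policy
           key = (resource_id, fault_type)
           aggregated[key] = max(severity, aggregated.get(key, severity))
       return aggregated


   raw_faults = [
       ("host-1", "CPU failure", 3),
       ("host-1", "CPU failure", 5),     # same fault reported again, more severe
       ("host-2", "memory failure", 1),  # below the threshold
   ]
   faults = apply_failure_policy(raw_faults, drop_below=2)
   print(faults)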
 166
 167 NFVI Maintenance
 168 ^^^^^^^^^^^^^^^^
 169
 170 The detailed work flow for NFVI maintenance is shown in :num:`Figure #figure9`
 171 and has the following steps. Note that steps 1, 2, and 5 to 8a in the NFVI
 172 maintenance work flow are very similar to the steps in the fault management work
 173 flow and share a similar implementation plan in Release 1.
 174
 175 1. Subscribe to fault/maintenance notifications.
 176 2. Response to subscribe request.
 177 3. Maintenance trigger received from administrator.
4. VIM switches NFVI resources to "maintenance" state. This means, for example,
   that they should not be used for further allocation/migration requests.
 180 5. Database lookup to find the virtual resources affected by the detected
 181    maintenance operation.
 182 6. Maintenance notification to Consumer.
7. The Consumer switches to standby configuration (STBY).
 184 8. Instructions from Consumer to VIM requesting certain recovery actions to be
 185    performed (step 7a). After reception of such instructions, the VIM is
 186    executing the requested action in order to empty the physical resources (step
 187    7b).
 188 9. Maintenance response from VIM to inform the Administrator that the physical
 189    machines have been emptied (or the operation resulted in an error state).
 190 10. Administrator is coordinating and executing the maintenance operation/work
 191     on the NFVI.
 192
 193     A) Query request from Administrator to VIM to get information about the
 194     current state of a resource.
 195     B) Response to the query request with information about the current state of
 196     the queried resource(s). In case the resource is in "maintenance" state,
 197     information about the related maintenance operation is returned.
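
Steps 4 and 5 of this work flow can be sketched with two small helpers. The
data layout (one dictionary for host states, one for the placement of virtual
machines) and all names are assumptions made for this example; they do not
describe how the Controller stores its resource map.

.. code-block:: python

   from typing import Dict, List


   def enter_maintenance(host: str, host_state: Dict[str, str],
                         placement: Dict[str, str]) -> List[str]:
       """Mark the host (step 4) and look up the virtual resources on it (step 5)."""
       host_state[host] = "maintenance"
       return sorted(vm for vm, vm_host in placement.items() if vm_host == host)


   def allocation_candidates(host_state: Dict[str, str]) -> List[str]:
       """Hosts that may still receive new allocations or migrations."""
       return sorted(h for h, state in host_state.items() if state == "normal")


   host_state = {"host-1": "normal", "host-2": "normal"}
   placement = {"vm-a": "host-1", "vm-b": "host-1", "vm-c": "host-2"}

   to_notify = enter_maintenance("host-1", host_state, placement)
   print(to_notify, allocation_candidates(host_state))

The list returned by ``enter_maintenance`` corresponds to the virtual resources
named in the maintenance notification of step 6.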
 198
 199 .. _figure9:
 200
 201 .. figure:: images/figure9.png
 202    :width: 100%
 203
 204    NFVI maintenance work flow
 205
 206
 207 .. _figure10:
 208
 209 .. figure:: images/figure10.png
 210    :width: 100%
 211
 212    NFVI Maintenance implementation plan
 213
:num:`Figure #figure10` shows a more detailed message flow (Steps 4 to 6)
between the 4 building blocks introduced in Section 5.1.
 216
 217 3. The Administrator is sending a StateChange request to the Controller residing
 218    in the VIM.
 219 4. The Controller queries the Resource Map to find the virtual resources
 220    affected by the planned maintenance operation.
 221 5.
 222
 223   a) The Controller updates the state of the affected virtual resources in the
 224   Resource Map database.
 225
 226   b) The Controller informs the Notifier about the virtual resources that will
 227   be affected by the maintenance operation.
 228
6. A maintenance notification is sent northbound to the Consumer.
 230
 231 ...
 232
 233 9. The Controller informs the Administrator after the physical resources have
 234    been freed.
 235
 236
 237
 238 Implementation plan for OPNFV Release 1
 239 ---------------------------------------
 240
 241 Fault management
 242 ^^^^^^^^^^^^^^^^
 243
 244 :num:`Figure #figure11` shows the implementation plan based on OpenStack and
 245 related components as planned for Release 1. Hereby, the Monitor can be realized
 246 by Zabbix. The Controller is realized by OpenStack Nova [NOVA]_, Neutron
 247 [NEUT]_, and Cinder [CIND]_ for compute, network, and storage,
 248 respectively. The Inspector can be realized by Monasca [MONA]_ or a simple
 249 script querying Nova in order to map between physical and virtual resources. The
 250 Notifier will be realized by Ceilometer [CEIL]_ receiving failure events
 251 on its notification bus.
 252
:num:`Figure #figure12` shows the inner workings of Ceilometer. After receiving
an "event" on its notification bus, first a notification agent will grab the
event and send a "notification" to the Collector. The Collector writes the
notifications received to the Ceilometer databases.
 257
In the existing Ceilometer implementation, an alarm evaluator periodically
polls those databases through the APIs provided. If it finds new alarms, it
evaluates them based on the pre-defined alarm configuration and, depending
on the configuration, hands a message to the Alarm Notifier, which in
turn sends the alarm message northbound to the Consumer.

:num:`Figure #figure12` also shows an optimized work flow for Ceilometer with
the goal of reducing the delay for fault notifications to the Consumer. The
approach is to implement a new notification agent (called "publisher" in
Ceilometer terminology) which sends the alarm directly through the
"Notification Bus" to a new "Notification-driven Alarm Evaluator (NAE)" (see
Sections 5.6.2 and 5.6.3), thereby bypassing the Collector and avoiding the
additional delay of the existing polling-based alarm evaluator. The NAE is
similar to the OpenStack "Alarm Evaluator", but is triggered by incoming
notifications instead of periodically polling the OpenStack "Alarms" database
for new alarms.

The Ceilometer "Alarms" database can hold three states: "normal", "insufficient
data", and "fired". It represents a persistent alarm database. In order to
realize the Doctor requirements, we need to define new "meters" in the database
(see Section 5.6.1).
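
The delay that the polling-based evaluator adds can be estimated with a small
model. The sketch assumes that polls happen at exact multiples of a fixed
interval and ignores processing time; the 60 second interval is an arbitrary
example value, not a documented Ceilometer default.

.. code-block:: python

   import math


   def polling_delay(event_time: float, poll_interval: float) -> float:
       """Seconds an event waits in the database until the next poll."""
       next_poll = math.ceil(event_time / poll_interval) * poll_interval
       return next_poll - event_time


   events = [1.0, 30.0, 59.0]  # event arrival times in seconds
   delays = [polling_delay(t, poll_interval=60.0) for t in events]
   print(delays, max(delays))

In this model the waiting time approaches one full polling interval in the
worst case. A notification-driven evaluator has no such term, because it is
triggered by the notification itself.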
 276
 277 .. _figure11:
 278
 279 .. figure:: images/figure11.png
 280    :width: 100%
 281
   Implementation plan in OpenStack (OPNFV Release 1 "Arno")
 283
 284
 285 .. _figure12:
 286
 287 .. figure:: images/figure12.png
 288    :width: 100%
 289
 290    Implementation plan in Ceilometer architecture
 291
 292
 293 NFVI Maintenance
 294 ^^^^^^^^^^^^^^^^
 295
 296 For NFVI Maintenance, a quite similar implementation plan exists. Instead of a
 297 raw fault being observed by the Monitor, the Administrator is sending a
 298 Maintenance Request through the northbound interface towards the Controller
 299 residing in the VIM. Similar to the Fault Management use case, the Controller
 300 (in our case OpenStack Nova) will send a maintenance event to the Notifier (i.e.
 301 Ceilometer in our implementation). Within Ceilometer, the same workflow as
 302 described in the previous section applies. In addition, the Controller(s) will
 303 take appropriate actions to evacuate the physical machines in order to prepare
 304 them for the planned maintenance operation. After the physical machines are
 305 emptied, the Controller will inform the Administrator that it can initiate the
 306 maintenance.
 307
 308 Information elements
 309 --------------------
 310
This section introduces all attributes and information elements used in the
messages exchanged on the northbound interfaces between the VIM and the NFVO and
VNFM.
 314
 315 Note: The information elements will be aligned with current work in ETSI NFV IFA
 316 working group.
 317
 318
 319 Simple information elements:
 320
 321 * SubscriptionID: identifies a subscription to receive fault or maintenance
 322   notifications.
 323 * NotificationID: identifies a fault or maintenance notification.
 324 * VirtualResourceID (Identifier): identifies a virtual resource affected by a
 325   fault or a maintenance action of the underlying physical resource.
 326 * PhysicalResourceID (Identifier): identifies a physical resource affected by a
 327   fault or maintenance action.
 328 * VirtualResourceState (String): state of a virtual resource, e.g. "normal",
 329   "maintenance", "down", "error".
 330 * PhysicalResourceState (String): state of a physical resource, e.g. "normal",
 331   "maintenance", "down", "error".
 332 * VirtualResourceType (String): type of the virtual resource, e.g. "virtual
 333   machine", "virtual memory", "virtual storage", "virtual CPU", or "virtual
 334   NIC".
 335 * FaultID (Identifier): identifies the related fault in the underlying physical
 336   resource. This can be used to correlate different fault notifications caused
 337   by the same fault in the physical resource.
 338 * FaultType (String): Type of the fault. The allowed values for this parameter
 339   depend on the type of the related physical resource. For example, a resource
 340   of type "compute hardware" may have faults of type "CPU failure", "memory
 341   failure", "network card failure", etc.
 342 * Severity (Integer): value expressing the severity of the fault. The higher the
 343   value, the more severe the fault.
 344 * MinSeverity (Integer): value used in filter information elements. Only faults
 345   with a severity higher than the MinSeverity value will be notified to the
 346   Consumer.
 347 * EventTime (Datetime): Time when the fault was observed.
 348 * EventStartTime and EventEndTime (Datetime): Datetime range that can be used in
 349   a FaultQueryFilter to narrow down the faults to be queried.
 350 * ProbableCause: information about the probable cause of the fault.
* CorrelatedFaultID (Identifier): list of other faults correlated to this fault.
 352 * isRootCause (Boolean): Parameter indicating if this fault is the root for
 353   other correlated faults. If TRUE, then the faults listed in the parameter
 354   CorrelatedFaultID are caused by this fault.
 355 * FaultDetails (Key-value pair): provides additional information about the
 356   fault, e.g. information about the threshold, monitored attributes, indication
 357   of the trend of the monitored parameter.
 358 * FirmwareVersion (String): current version of the firmware of a physical
 359   resource.
 360 * HypervisorVersion (String): current version of a hypervisor.
 361 * ZoneID (Identifier): Identifier of the resource zone. A resource zone is the
 362   logical separation of physical and software resources in an NFVI deployment
 363   for physical isolation, redundancy, or administrative designation.
 364 * Metadata (Key-Value-Pairs): provides additional information of a physical
 365   resource in maintenance/error state.
 366
 367 Complex information elements (see also UML diagrams in :num:`Figure #figure13`
 368 and :num:`Figure #figure14`):
 369
 370 * VirtualResourceInfoClass:
 371
 372   + VirtualResourceID [1] (Identifier)
 373   + VirtualResourceState [1] (String)
 374   + Faults [0..*] (FaultClass): For each resource, all faults
 375     including detailed information about the faults are provided.
 376
 377 * FaultClass: The parameters of the FaultClass are partially based on ETSI TS
 378   132 111-2 (V12.1.0) [*]_, which is specifying fault management in 3GPP, in
 379   particular describing the information elements used for alarm notifications.
 380
 381   - FaultID [1] (Identifier)
 382   - FaultType [1]
 383   - Severity [1] (Integer)
 384   - EventTime [1] (Datetime)
 385   - ProbableCause [1]
 386   - CorrelatedFaultID [0..*] (Identifier)
 387   - FaultDetails [0..*] (Key-value pair)
 388
 389 .. [*] http://www.etsi.org/deliver/etsi_ts/132100_132199/13211102/12.01.00_60/ts_13211102v120100p.pdf
 390
 391 * SubscribeFilterClass
 392
 393   - VirtualResourceType [0..*] (String)
 394   - VirtualResourceID [0..*] (Identifier)
 395   - FaultType [0..*] (String)
 396   - MinSeverity [0..1] (Integer)
 397
 398 * FaultQueryFilterClass: narrows down the FaultQueryRequest, for example it
 399   limits the query to certain physical resources, a certain zone, a given fault
 400   type/severity/cause, or a specific FaultID.
 401
 402   - VirtualResourceType [0..*] (String)
 403   - VirtualResourceID [0..*] (Identifier)
 404   - FaultType [0..*] (String)
 405   - MinSeverity [0..1] (Integer)
 406   - EventStartTime [0..1] (Datetime)
 407   - EventEndTime [0..1] (Datetime)
 408
 409 * PhysicalResourceStateClass:
 410
 411   - PhysicalResourceID [1] (Identifier)
 412   - PhysicalResourceState [1] (String): mandates the new state of the physical
 413     resource.
 414
 415 * PhysicalResourceInfoClass:
 416
 417   - PhysicalResourceID [1] (Identifier)
 418   - PhysicalResourceState [1] (String)
 419   - FirmwareVersion [0..1] (String)
 420   - HypervisorVersion [0..1] (String)
 421   - ZoneID [0..1] (Identifier)
 422
 423 * StateQueryFilterClass: narrows down a StateQueryRequest, for example it limits
 424   the query to certain physical resources, a certain zone, or a given resource
 425   state (e.g., only resources in "maintenance" state).
 426
 427   - PhysicalResourceID [1] (Identifier)
 428   - PhysicalResourceState [1] (String)
 429   - ZoneID [0..1] (Identifier)
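
To make the multiplicities concrete, three of the classes above can be written
as Python dataclasses. This is a sketch under assumptions, not a normative
encoding: identifiers are mapped to ``str``, an empty list or ``None`` in a
filter is taken to mean "no restriction", and the ``matches`` helper is
hypothetical.

.. code-block:: python

   from dataclasses import dataclass, field
   from datetime import datetime
   from typing import Dict, List, Optional


   @dataclass
   class Fault:
       """FaultClass."""
       fault_id: str                       # FaultID [1]
       fault_type: str                     # FaultType [1]
       severity: int                       # Severity [1]
       event_time: datetime                # EventTime [1]
       probable_cause: str                 # ProbableCause [1]
       correlated_fault_ids: List[str] = field(default_factory=list)  # [0..*]
       fault_details: Dict[str, str] = field(default_factory=dict)    # [0..*]


   @dataclass
   class VirtualResourceInfo:
       """VirtualResourceInfoClass."""
       virtual_resource_id: str            # VirtualResourceID [1]
       virtual_resource_state: str         # VirtualResourceState [1]
       faults: List[Fault] = field(default_factory=list)              # [0..*]


   @dataclass
   class SubscribeFilter:
       """SubscribeFilterClass."""
       virtual_resource_types: List[str] = field(default_factory=list)
       virtual_resource_ids: List[str] = field(default_factory=list)
       fault_types: List[str] = field(default_factory=list)
       min_severity: Optional[int] = None

       def matches(self, resource: VirtualResourceInfo, fault: Fault) -> bool:
           # VirtualResourceType is not part of VirtualResourceInfoClass,
           # so this sketch does not evaluate virtual_resource_types.
           if (self.virtual_resource_ids
                   and resource.virtual_resource_id not in self.virtual_resource_ids):
               return False
           if self.fault_types and fault.fault_type not in self.fault_types:
               return False
           # Only faults with a severity higher than MinSeverity are notified.
           if self.min_severity is not None and fault.severity <= self.min_severity:
               return False
           return True


   fault = Fault("f-1", "CPU failure", 5, datetime(2015, 1, 1, 12, 0), "overheating")
   vm = VirtualResourceInfo("vm-a", "error", [fault])
   subscribe_filter = SubscribeFilter(virtual_resource_ids=["vm-a"], min_severity=3)
   print(subscribe_filter.matches(vm, fault))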
 430
 431 Detailed northbound interface specification
 432 -------------------------------------------
 433
This section specifies the northbound interfaces for fault management and
NFVI maintenance between the VIM on one end and the Consumer and the
Administrator on the other end. For each interface, all messages and related
information elements are provided.

Note: The interface definition will be aligned with current work in ETSI NFV IFA
working group.
 441
 442 All of the interfaces described below are produced by the VIM and consumed by
 443 the Consumer or Administrator.
 444
 445 Fault management interface
 446 ^^^^^^^^^^^^^^^^^^^^^^^^^^
 447
 448 This interface allows the VIM to notify the Consumer about a virtual resource
 449 that is affected by a fault, either within the virtual resource itself or by the
 450 underlying virtualization infrastructure. The messages on this interface are
 451 shown in :num:`Figure #figure13` and explained in detail in the following
 452 subsections.
 453
 454 Note: The information elements used in this section are described in detail in
 455 Section 5.4.
 456
 457 .. _figure13:
 458
 459 .. figure:: images/figure13.png
 460    :width: 100%
 461
 462    Fault management NB I/F messages
 463
 464
 465 SubscribeRequest (Consumer -> VIM)
 466 __________________________________
 467
 468 Subscription from Consumer to VIM to be notified about faults of specific
 469 resources. The faults to be notified about can be narrowed down using a
 470 subscribe filter.
 471
 472 Parameters:
 473
 474 - SubscribeFilter [1] (SubscribeFilterClass): Optional information to narrow
 475   down the faults that shall be notified to the Consumer, for example limit to
 476   specific VirtualResourceID(s), severity, or cause of the alarm.
 477
 478 SubscribeResponse (VIM -> Consumer)
 479 ___________________________________
 480
 481 Response to a subscribe request message including information about the
 482 subscribed resources, in particular if they are in "fault/error" state.
 483
 484 Parameters:
 485
 486 * SubscriptionID [1] (Identifier): Unique identifier for the subscription. It
 487   can be used to delete or update the subscription.
 488 * VirtualResourceInfo [0..*] (VirtualResourceInfoClass): Provides additional
 489   information about the subscribed resources, i.e., a list of the related
 490   resources, the current state of the resources, etc.
 491
 492 FaultNotification (VIM -> Consumer)
 493 ___________________________________
 494
Notification about a virtual resource that is affected by a fault, either within
the virtual resource itself or by the underlying virtualization infrastructure.
After reception of this notification, the Consumer will decide on the optimal
action to resolve the fault. This includes actions like switching to a hot
standby virtual resource, migration of the faulty virtual resource to another
physical machine, termination of the faulty virtual resource and instantiation
of a new virtual resource in order to provide a new hot standby resource.
Existing resource management interfaces and messages between the Consumer and
the VIM can be used for those actions, and there is no need to define additional
actions on the Fault Management Interface.
 505
 506 Parameters:
 507
 508 * NotificationID [1] (Identifier): Unique identifier for the notification.
 509 * VirtualResourceInfo [1..*] (VirtualResourceInfoClass): List of faulty
 510   resources with detailed information about the faults.
 511
 512 FaultQueryRequest (Consumer -> VIM)
 513 ___________________________________
 514
 515 Request to find out about active alarms at the VIM. A FaultQueryFilter can be
 516 used to narrow down the alarms returned in the response message.
 517
 518 Parameters:
 519
 520 * FaultQueryFilter [1] (FaultQueryFilterClass): narrows down the
 521   FaultQueryRequest, for example it limits the query to certain physical
 522   resources, a certain zone, a given fault type/severity/cause, or a specific
 523   FaultID.
 524
 525 FaultQueryResponse (VIM -> Consumer)
 526 ____________________________________
 527
 528 List of active alarms at the VIM matching the FaultQueryFilter specified in the
 529 FaultQueryRequest.
 530
 531 Parameters:
 532
 533 * VirtualResourceInfo [0..*] (VirtualResourceInfoClass): List of faulty
 534   resources. For each resource all faults including detailed information about
 535   the faults are provided.
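
The filter semantics of a fault query can be sketched as follows. Alarms are
modelled as plain dictionaries keyed by the information element names from
Section 5.4; the ``fault_query`` function, the sample data, and the inclusive
treatment of the time range are assumptions made for this example.

.. code-block:: python

   from datetime import datetime
   from typing import Dict, List, Optional


   def fault_query(alarms: List[Dict],
                   fault_type: Optional[str] = None,
                   min_severity: Optional[int] = None,
                   event_start_time: Optional[datetime] = None,
                   event_end_time: Optional[datetime] = None) -> List[Dict]:
       """Return the active alarms that pass every filter that is set."""
       result = []
       for alarm in alarms:
           if fault_type is not None and alarm["FaultType"] != fault_type:
               continue
           # Severity must be higher than MinSeverity (see Section 5.4).
           if min_severity is not None and alarm["Severity"] <= min_severity:
               continue
           if event_start_time is not None and alarm["EventTime"] < event_start_time:
               continue
           if event_end_time is not None and alarm["EventTime"] > event_end_time:
               continue
           result.append(alarm)
       return result


   alarms = [
       {"VirtualResourceID": "vm-a", "FaultID": "f-1", "FaultType": "CPU failure",
        "Severity": 5, "EventTime": datetime(2015, 1, 1, 12, 0)},
       {"VirtualResourceID": "vm-b", "FaultID": "f-2", "FaultType": "memory failure",
        "Severity": 2, "EventTime": datetime(2015, 1, 1, 13, 0)},
   ]
   recent = fault_query(alarms, min_severity=1,
                        event_start_time=datetime(2015, 1, 1, 12, 30))
   print([alarm["FaultID"] for alarm in recent])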
 536
 537 NFVI maintenance
 538 ^^^^^^^^^^^^^^^^
 539
The NFVI maintenance interface Consumer-VIM allows the Consumer to subscribe to
maintenance notifications provided by the VIM. The related maintenance interface
Administrator-VIM allows the Administrator to issue maintenance requests to the
VIM, i.e. requesting the VIM to take appropriate actions to empty physical
machine(s) in order to execute maintenance operations on them. The interface
also allows the Administrator to query the state of physical machines, e.g., in
order to get details on the current status of a maintenance operation such as a
firmware update.
 548
 549 The messages defined in these northbound interfaces are shown in :num:`Figure
 550 #figure14` and described in detail in the following subsections.
 551
 552 .. _figure14:
 553
 554 .. figure:: images/figure14.png
 555    :width: 100%
 556
 557    NFVI maintenance NB I/F messages
 558
 559 SubscribeRequest (Consumer -> VIM)
 560 __________________________________
 561
 562 Subscription from Consumer to VIM to be notified about maintenance operations
 563 for specific virtual resources. The resources to be informed about can be
 564 narrowed down using a subscribe filter.
 565
 566 Parameters:
 567
* SubscribeFilter [1] (SubscribeFilterClass): Information to narrow down the
  maintenance notifications that shall be sent to the Consumer, for example
  limit to specific virtual resource type(s).
 571
 572 SubscribeResponse (VIM -> Consumer)
 573 ___________________________________
 574
 575 Response to a subscribe request message, including information about the
 576 subscribed virtual resources, in particular if they are in "maintenance" state.
 577
 578 Parameters:
 579
 580 * SubscriptionID [1] (Identifier): Unique identifier for the subscription. It
 581   can be used to delete or update the subscription.
* VirtualResourceInfo [0..*] (VirtualResourceInfoClass): Provides additional
  information about the subscribed virtual resource(s), e.g., the ID, type and
  current state of the resource(s).
 585
 586 MaintenanceNotification (VIM -> Consumer)
 587 _________________________________________
 588
Notification about a physical resource switched to "maintenance" state. After
reception of this notification, the Consumer will decide on the optimal action
to address it, e.g., to switch to the standby (STBY) configuration.
 592
 593 Parameters:
 594
 595 * VirtualResourceInfo [1..*] (VirtualResourceInfoClass): List of virtual
 596   resources where the state has been changed to maintenance.
 597
 598 StateChangeRequest (Administrator -> VIM)
 599 _________________________________________
 600
 601 Request to change the state of a list of physical resources, e.g. to
 602 "maintenance" state, in order to prepare them for a planned maintenance
 603 operation.
 604
 605 Parameters:
 606
 607 * PhysicalResourceState [1..*] (PhysicalResourceStateClass)
 608
 609 StateChangeResponse (VIM -> Administrator)
 610 __________________________________________
 611
 612 Response message to inform the Administrator that the requested resources are
 613 now in maintenance state (or the operation resulted in an error) and the
 614 maintenance operation(s) can be executed.
 615
 616 Parameters:
 617
 618 * PhysicalResourceInfo [1..*] (PhysicalResourceInfoClass)
 619
 620 StateQueryRequest (Administrator -> VIM)
 621 ________________________________________
 622
This request allows the Administrator to retrieve information about
physical machine(s), e.g. their state ("normal", "maintenance"), firmware
version, hypervisor version, update status of firmware and hypervisor, etc. It
can be used to check the progress during a firmware update and to confirm the
result after the update. A filter can be used to narrow down the resources
returned in the response message.
 629
 630 Parameters:
 631
 632 * StateQueryFilter [1] (StateQueryFilterClass): narrows down the
 633   StateQueryRequest, for example it limits the query to certain physical
 634   resources, a certain zone, or a given resource state.
 635
 636 StateQueryResponse (VIM -> Administrator)
 637 _________________________________________
 638
 639 List of physical resources matching the filter specified in the
 640 StateQueryRequest.
 641
 642 Parameters:
 643
 644 * PhysicalResourceInfo [0..*] (PhysicalResourceInfoClass): List of physical
 645   resources. For each resource, information about the current state, the
 646   firmware version, etc. is provided.
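
The Administrator-facing messages can be sketched on top of an in-memory
inventory. Everything below is hypothetical: the function names, the inventory
layout, and the choice to report an unknown resource with state "error". The
sketch also answers a state change immediately, whereas the VIM described
above responds only after the physical machines have been emptied.

.. code-block:: python

   from typing import Dict, List, Optional

   Inventory = Dict[str, Dict[str, str]]  # PhysicalResourceID -> attributes


   def state_change(inventory: Inventory,
                    requests: List[Dict[str, str]]) -> List[Dict[str, str]]:
       """Apply a StateChangeRequest and build the StateChangeResponse."""
       response = []
       for request in requests:
           resource_id = request["PhysicalResourceID"]
           info = inventory.get(resource_id)
           if info is None:
               response.append({"PhysicalResourceID": resource_id,
                                "PhysicalResourceState": "error"})
               continue
           info["PhysicalResourceState"] = request["PhysicalResourceState"]
           response.append({"PhysicalResourceID": resource_id, **info})
       return response


   def state_query(inventory: Inventory, state: Optional[str] = None,
                   zone_id: Optional[str] = None) -> List[Dict[str, str]]:
       """Return the physical resources matching the optional filters."""
       return [
           {"PhysicalResourceID": resource_id, **info}
           for resource_id, info in sorted(inventory.items())
           if (state is None or info["PhysicalResourceState"] == state)
           and (zone_id is None or info.get("ZoneID") == zone_id)
       ]


   inventory = {
       "host-1": {"PhysicalResourceState": "normal", "ZoneID": "zone-a"},
       "host-2": {"PhysicalResourceState": "normal", "ZoneID": "zone-b"},
   }
   response = state_change(inventory, [
       {"PhysicalResourceID": "host-1", "PhysicalResourceState": "maintenance"},
       {"PhysicalResourceID": "host-9", "PhysicalResourceState": "maintenance"},
   ])
   in_maintenance = state_query(inventory, state="maintenance")
   print(response, in_maintenance)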
 647
 648 Blueprints
 649 ----------
 650
This section lists a first set of blueprints that have been proposed by the
Doctor project to the open source community. Further blueprints addressing other
gaps identified in Section 4 will be submitted at a later stage of OPNFV. In
this section the following definitions are used:
 655
 656 * "Event" is a message emitted by other OpenStack services such as Nova and
 657   Neutron and is consumed by the "Notification Agents" in Ceilometer.
 658 * "Notification" is a message generated by a "Notification Agent" in Ceilometer
 659   based on an "event" and is delivered to the "Collectors" in Ceilometer that
 660   store those notifications (as "sample") to the Ceilometer "Databases".
 661
 662 Instance State Notification  (Ceilometer) [*]_
 663 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 664
The Doctor project is planning to handle "events" and "notifications" regarding
resource status, such as Instance State, Port State, and Host State. Currently,
Ceilometer already receives "events" that identify the state of those resources,
but it does not yet handle or store them. This is why we also need a new event
definition to capture those resource states from "events" created by other
services.
 671
This BP proposes to add a new compute notification state to handle events for
an instance (server) from Nova. It also creates a new meter "instance.state" in
OpenStack.
 675
 676 .. [*] https://etherpad.opnfv.org/p/doctor_bps
 677
 678 Event Publisher for Alarm  (Ceilometer) [*]_
 679 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 680
 681 **Problem statement:**
 682
 683   The existing "Alarm Evaluator" in OpenStack Ceilometer is periodically
 684   querying/polling the databases in order to check all alarms independently from
 685   other processes. This is adding additional delay to the fault notification
 686   send to the Consumer, whereas one requirement of Doctor is to react on faults
 687   as fast as possible.
 688
  The existing message flow is shown in :num:`Figure #figure12`: after receiving
  an "event", a "notification agent" (i.e. "event publisher") sends a
  "notification" to a "Collector". The "collector" collects the notifications
  and updates the Ceilometer "Meter" database, which stores information about
  the "sample" captured from the original "event". The "Alarm Evaluator"
  periodically polls the databases, querying the "Meter" database according to
  each alarm configuration.
 696
  In the current Ceilometer implementation, it is not possible to trigger the
  "Alarm Evaluator" directly when a new "event" is received. Instead, the "Alarm
  Evaluator" only finds out that it needs to fire a new notification to the
  Consumer when it polls the database.
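
The delay added by polling can be made concrete with a small calculation. The
sketch below assumes that the evaluator polls at fixed multiples of an interval;
the 60-second value and the helper name ``delay_until_next_poll`` are
illustrative assumptions, not Ceilometer settings.

```python
import math


def delay_until_next_poll(event_time, interval):
    """Seconds an event waits until the evaluator's next polling cycle.

    Polls are assumed to happen at t = 0, interval, 2 * interval, ...
    """
    next_poll = math.ceil(event_time / interval) * interval
    return next_poll - event_time


INTERVAL = 60  # illustrative polling interval in seconds (assumption)

# An event arriving just after a poll waits almost a full interval.
print(delay_until_next_poll(61, INTERVAL))
# An event arriving just before a poll is picked up quickly.
print(delay_until_next_poll(119, INTERVAL))
```

Under this model an event arriving at a uniformly random time waits half the
interval on average and up to a full interval in the worst case.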
 701
 702 **Change/feature request:**
 703
  This BP proposes to add a new "event publisher for alarm", which bypasses
  several steps in Ceilometer in order to avoid the polling-based approach of
  the existing Alarm Evaluator, which makes notifications to users slow.

  After receiving an "(alarm) event" by listening on the Ceilometer message
  queue ("notification bus"), the new "event publisher for alarm" immediately
  hands a "notification" about this event to a new Ceilometer component, the
  "Notification-driven alarm evaluator" proposed in the other BP (see Section
  5.6.3).
 713
 714   Note, the term "publisher" refers to an entity in the Ceilometer architecture
 715   (it is a "notification agent"). It offers the capability to provide
 716   notifications to other services outside of Ceilometer, but it is also used to
 717   deliver notifications to other Ceilometer components (e.g. the "Collectors")
 718   via the Ceilometer "notification bus".
 719
**Implementation detail:**

  * "Event publisher for alarm" is part of Ceilometer.
  * The standard AMQP message queue is used with a new topic string.
  * No new interfaces have to be added to Ceilometer.
  * "Event publisher for alarm" can be configured by the Administrator of
    Ceilometer to be used as a "Notification Agent" in addition to the existing
    "Notifier".
  * Existing alarm mechanisms of Ceilometer can be used, allowing users to
    configure how to distribute the "notifications" transformed from "events",
    e.g. there is an option whether an ongoing alarm is re-issued or not
    ("repeat_actions").
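
The forwarding behaviour described above could look roughly like the following
sketch. It uses in-memory ``queue.Queue`` objects as stand-ins for AMQP topics;
the names ``alarm_topic``, ``ALARM_EVENT_TYPES`` and ``publish_alarm_events``
are hypothetical and do not exist in Ceilometer.

```python
import queue

# In-memory stand-ins for the message bus; a real deployment would use AMQP
# topics rather than queue.Queue objects (assumption for illustration).
notification_bus = queue.Queue()  # events emitted by other services
alarm_topic = queue.Queue()       # hypothetical new topic for alarm events

# Hypothetical set of event types considered relevant for alarms.
ALARM_EVENT_TYPES = {"compute.instance.update"}


def publish_alarm_events(bus, topic, alarm_event_types):
    """Forward alarm-relevant events from the bus to the alarm topic.

    Drains everything currently waiting on the bus and returns the number
    of notifications handed over.
    """
    forwarded = 0
    while True:
        try:
            event = bus.get_nowait()
        except queue.Empty:
            return forwarded
        if event["event_type"] in alarm_event_types:
            topic.put({"event_type": event["event_type"],
                       "payload": event["payload"]})
            forwarded += 1


notification_bus.put({"event_type": "compute.instance.update",
                      "payload": {"instance_id": "vm-1", "state": "error"}})
notification_bus.put({"event_type": "image.upload",
                      "payload": {"id": "img-1"}})

forwarded = publish_alarm_events(notification_bus, alarm_topic,
                                 ALARM_EVENT_TYPES)
print(forwarded, alarm_topic.qsize())
```

The point of the sketch is that the notification is handed over as soon as the
event is seen, without first being stored in and read back from a database.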
 732
 733 .. [*] https://etherpad.opnfv.org/p/doctor_bps
 734
 735 Notification-driven alarm evaluator (Ceilometer) [*]_
 736 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 737
 738 **Problem statement:**
 739
The existing "Alarm Evaluator" in OpenStack Ceilometer periodically polls the
databases in order to check all alarms independently of other processes. This
adds delay to the fault notification sent to the Consumer, whereas one
requirement of Doctor is to react to faults as fast as possible.
 745
 746 **Change/feature request:**
 747
This BP proposes to add an alternative "Notification-driven Alarm Evaluator"
for Ceilometer that receives "notifications" sent by the "Event Publisher for
Alarm" described in the other BP. Once this new "Notification-driven Alarm
Evaluator" receives a "notification", it finds the "alarm" configurations that
may relate to the "notification" by querying the "alarm" database with keys
such as the resource ID, and then evaluates each alarm with the information in
that "notification".
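
A minimal sketch of this lookup-and-evaluate step is given below. The alarm
records, their ``field``/``equals`` rule format and the function names are
assumptions chosen for illustration; they do not reflect the actual Ceilometer
alarm schema.

```python
# Hypothetical in-memory "alarm" database.
ALARMS = [
    {"name": "vm-1-error", "resource_id": "vm-1",
     "field": "state", "equals": "error"},
    {"name": "vm-2-error", "resource_id": "vm-2",
     "field": "state", "equals": "error"},
]


def find_alarms(alarms, resource_id):
    """Return the alarm configurations registered for one resource."""
    return [alarm for alarm in alarms if alarm["resource_id"] == resource_id]


def evaluate(notification, alarms):
    """Return the names of the alarms fired by this notification."""
    fired = []
    for alarm in find_alarms(alarms, notification["resource_id"]):
        # Evaluate using only the data carried by the notification itself.
        if notification.get(alarm["field"]) == alarm["equals"]:
            fired.append(alarm["name"])
    return fired


print(evaluate({"resource_id": "vm-1", "state": "error"}, ALARMS))
print(evaluate({"resource_id": "vm-1", "state": "active"}, ALARMS))
```

Only the alarms matching the resource in the notification are evaluated, so no
periodic scan over all alarms is needed.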
 755
After the alarm evaluation, it fires the alarm notification to the Consumer in
the same way as the existing "alarm evaluator". Similar to the existing Alarm
Evaluator, this new "Notification-driven Alarm Evaluator" aggregates and
correlates different alarms, which are then provided northbound to the Consumer
via the OpenStack "Alarm Notifier". The user/administrator can register the
alarm configuration via the existing Ceilometer API [*]_ and can thereby
configure whether to set an alarm and where to send the alarms.
 763
**Implementation detail:**

* The new "Notification-driven Alarm Evaluator" is part of Ceilometer.
* Most of the existing source code of the "Alarm Evaluator" can be re-used to
  implement this BP.
* No additional application logic is needed.
* It will access the Ceilometer Databases just like the existing "Alarm
  evaluator".
* Only the polling-based approach will be replaced by a listener for
  "notifications" provided by the "Event Publisher for Alarm" on the Ceilometer
  "notification bus".
* No new interfaces have to be added to Ceilometer.
 776
 777
 778 .. [*] https://etherpad.opnfv.org/p/doctor_bps
 779 .. [*] https://wiki.openstack.org/wiki/Ceilometer/Alerting
 780
 781 Report host fault to update server state immediately (Nova) [*]_
 782 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 783
 784 **Problem statement:**
 785
* The Nova state change for a failed or unreachable host is slow and does not
  reliably indicate whether the host is down. This might cause the same server
  instance to run twice if action is taken to evacuate the instance to another
  host.
* The Nova state for server(s) on a failed host does not change, but remains
  active and running. This gives the user false information about the server
  state.
* The VIM northbound interface notification of host faults towards the VNFM and
  NFVO should be in line with the OpenStack state. This fault notification is a
  Telco requirement defined in ETSI and will be implemented by the OPNFV Doctor
  project.
* An OpenStack user cannot perform HA actions quickly and reliably by trusting
  the server state and host state.
 796
 797 **Proposed change:**
 798
A new API is needed that allows the Admin to state that a host is down. This
API is used to mark the services running on that host as down, to reflect the
real situation.
 801
An example for a compute node:

* When the compute node is up and running::

    vm_state: active power_state: running
    nova-compute state: up status: enabled

* When the compute node goes down and the new API is called to state that the
  host is down::

    vm_state: stopped power_state: shutdown
    nova-compute state: down status: enabled
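
The state transition in this example can be sketched as follows. This is not
Nova code: the record layout and the ``mark_host_down`` helper are hypothetical
and only mirror the state values listed above.

```python
def mark_host_down(host, services, servers):
    """Mark services and servers on a failed host as down.

    Mutates the given records in place and returns the IDs of the servers
    whose state was changed.
    """
    for service in services:
        if service["host"] == host:
            service["state"] = "down"  # "status" (enabled) is left untouched
    affected = []
    for server in servers:
        if server["host"] == host:
            server["vm_state"] = "stopped"
            server["power_state"] = "shutdown"
            affected.append(server["id"])
    return affected


services = [{"host": "compute-1", "binary": "nova-compute",
             "state": "up", "status": "enabled"}]
servers = [
    {"id": "vm-1", "host": "compute-1",
     "vm_state": "active", "power_state": "running"},
    {"id": "vm-2", "host": "compute-2",
     "vm_state": "active", "power_state": "running"},
]

affected = mark_host_down("compute-1", services, servers)
print(affected, services[0]["state"], servers[0]["vm_state"])
```

Servers on other hosts are left unchanged, which matches the intent that only
the resources on the reported host reflect the fault.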
 813
 814 **Alternatives:**
 815
There is no attractive alternative to an external tool for detecting all the
different host faults. For such a tool to exist, Nova needs a new API for
reporting faults. Currently, some kind of workaround must be implemented,
because the states cannot be trusted or obtained from OpenStack fast enough.
 821
 822 .. [*] https://blueprints.launchpad.net/nova/+spec/update-server-state-immediately
 823
 824 Other related BPs
 825 ^^^^^^^^^^^^^^^^^
 826
 827 This section lists some BPs related to Doctor, but proposed by drafters outside
 828 the OPNFV community.
 829
 830 pacemaker-servicegroup-driver [*]_
 831 __________________________________
 832
This BP will detect and report a host down to OpenStack quite fast. However,
this might not work properly, for example, when the management network has a
problem and the host is reported as faulty while the VM is still running there.
This might lead to launching the same VM instance twice, causing problems.
Also, the northbound interface (NB IF) message needs a fault reason, and for
that the source needs to be a tool that detects different kinds of faults, as
Doctor will be doing. This BP might also need enhancement to change server and
service states correctly.
 840
 841 .. [*] https://blueprints.launchpad.net/nova/+spec/pacemaker-servicegroup-driver
 842
 843 ..
 844  vim: set tabstop=4 expandtab textwidth=80: