requirements/05-implementation.rst

   1 Detailed architecture and interface specification
   2 =================================================
   3
   4 This section describes a detailed implementation plan, which is based on the
   5 high-level architecture introduced in Section 3. Section 5.1 describes the
   6 functional blocks of the Doctor architecture, followed by a high-level
   7 message flow in Section 5.2. Section 5.3 provides a mapping of selected existing
   8 open source components to the building blocks of the Doctor architecture.
   9 The selection of components is based on their maturity and the gap
  10 analysis executed in Section 4. Sections 5.4 and 5.5 detail the specification of
  11 the related northbound interface and the related information elements. Finally,
  12 Section 5.6 provides a first set of blueprints to address selected gaps required
  13 for the realization of the functionalities of the Doctor project.
  14
  15 .. _impl_fb:
  16
  17 Functional Blocks
  18 -----------------
  19
  20 This section introduces the functional blocks that form the VIM. OpenStack was
  21 selected as the candidate for implementation. Inside the VIM, four different
  22 building blocks are defined (see :num:`Figure #figure6`).
  23
  24 .. _figure6:
  25
  26 .. figure:: images/figure6.png
  27    :width: 100%
  28
  29    Functional blocks
  30
  31 Monitor
  32 ^^^^^^^
  33
  34 The Monitor module has the responsibility for monitoring the virtualized
  35 infrastructure. There are already many existing tools and services (e.g. Zabbix)
  36 to monitor different aspects of hardware and software resources which can be
  37 used for this purpose.
  38
  39 Inspector
  40 ^^^^^^^^^
  41
  42 The Inspector module has the ability a) to receive various failure notifications
  43 regarding physical resource(s) from Monitor module(s), b) to find the affected
  44 virtual resource(s) by querying the resource map in the Controller, and c) to
  45 update the state of the virtual resource (and physical resource).
  46
  47 The Inspector has drivers for different types of events and resources to
  48 integrate any type of Monitor and Controller modules. It also uses a failure
  49 policy database to decide on the failure selection and aggregation from raw
  50 events. This failure policy database is configured by the Administrator.
  51
  52 The reason for separation of the Inspector and Controller modules is to make the
  53 Controller focus on simple operations by avoiding a tight integration of various
  54 health check mechanisms into the Controller.
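
The three Inspector tasks a) to c) can be illustrated with a minimal Python
sketch. It is not taken from any OpenStack component: ``RawEvent``,
``ResourceMap``, ``Inspector`` and the ``min_severity`` threshold (standing in
for one entry of the failure policy database) are hypothetical names chosen
for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class RawEvent:
    """Raw failure notification from a Monitor (hypothetical structure)."""
    physical_resource_id: str
    fault_type: str
    severity: int


@dataclass
class ResourceMap:
    """Stand-in for the Controller's resource map (physical -> virtual)."""
    hosted: dict = field(default_factory=dict)  # host id -> list of VM ids
    state: dict = field(default_factory=dict)   # resource id -> state string

    def virtual_resources_on(self, physical_id):
        return list(self.hosted.get(physical_id, []))

    def set_state(self, resource_id, new_state):
        self.state[resource_id] = new_state


class Inspector:
    def __init__(self, resource_map, min_severity=3):
        # min_severity models a single, assumed failure policy rule.
        self.resource_map = resource_map
        self.min_severity = min_severity

    def handle(self, event):
        """Return the virtual resources marked as affected by the event."""
        if event.severity < self.min_severity:
            return []  # a) event received, but dropped by the failure policy
        # b) find the affected virtual resources via the resource map
        affected = self.resource_map.virtual_resources_on(
            event.physical_resource_id)
        # c) update the state of the physical and virtual resources
        self.resource_map.set_state(event.physical_resource_id, "error")
        for vm in affected:
            self.resource_map.set_state(vm, "error")
        return affected


rmap = ResourceMap(hosted={"host-1": ["vm-a", "vm-b"], "host-2": ["vm-c"]})
inspector = Inspector(rmap)
affected = inspector.handle(RawEvent("host-1", "CPU failure", severity=5))
ignored = inspector.handle(RawEvent("host-2", "fan warning", severity=1))
```

In this sketch the low-severity event on ``host-2`` is filtered out, so the
state of ``vm-c`` is left untouched.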
  55
  56 Controller
  57 ^^^^^^^^^^
  58
  59 The Controller is responsible for maintaining the resource map (i.e. the mapping
  60 from physical resources to virtual resources), accepting update requests for the
  61 resource state(s) (exposed as a provider API), and sending all failure events
  62 regarding virtual resources to the Notifier. Optionally, the Controller has the
  63 ability to poison the state of virtual resources mapping to physical resources
  64 for which it has received failure notifications from the Inspector. The
  65 Controller also re-calculates the capacity of the NFVI when receiving a failure
  66 notification for a physical resource.
  67
  68 In a real-world deployment, the VIM may have several controllers, one for each
  69 resource type, such as Nova, Neutron and Cinder in OpenStack. Each controller
  70 maintains a database of virtual and physical resources which shall be the master
  71 source for resource information inside the VIM.
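
A minimal sketch of these Controller responsibilities, assuming a toy
in-memory resource map, could look as follows. The ``Host`` and ``Controller``
classes, the vCPU-based capacity metric and the ``sent_to_notifier`` list are
assumptions made for illustration; they do not correspond to Nova, Neutron or
Cinder interfaces.

```python
from dataclasses import dataclass


@dataclass
class Host:
    """One physical resource in a hypothetical resource map."""
    host_id: str
    vcpus: int
    vms: tuple = ()
    state: str = "normal"


class Controller:
    def __init__(self, hosts):
        self.hosts = {h.host_id: h for h in hosts}
        self.sent_to_notifier = []  # stands in for the link to the Notifier

    def update_state(self, host_id, new_state):
        """Provider API: accept a state update for a physical resource."""
        host = self.hosts[host_id]
        host.state = new_state
        # Forward one failure event per virtual resource on that host.
        for vm in host.vms:
            self.sent_to_notifier.append((vm, new_state))

    def available_vcpus(self):
        """NFVI capacity, counting only hosts in "normal" state."""
        return sum(h.vcpus for h in self.hosts.values()
                   if h.state == "normal")


controller = Controller([
    Host("host-1", vcpus=16, vms=("vm-a", "vm-b")),
    Host("host-2", vcpus=32, vms=("vm-c",)),
])
capacity_before = controller.available_vcpus()
controller.update_state("host-1", "error")
capacity_after = controller.available_vcpus()
```

After the state update, the failed host no longer contributes to the
capacity, and one event per hosted virtual resource is queued for the
Notifier.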
  72
  73 Notifier
  74 ^^^^^^^^
  75
  76 The focus of the Notifier is on selecting and aggregating failure events
  77 received from the Controller based on policies mandated by the Consumer.
  78 Therefore, it allows the Consumer to subscribe for alarms regarding virtual
  79 resources using a method such as an API endpoint. After receiving a fault
  80 event from a Controller, it will notify the Consumer of the fault by referring
  81 to the alarm configuration which was defined by the Consumer earlier on.
  82
  83 To reduce complexity of the Controller, it is a good approach for the
  84 Controllers to emit all notifications without any filtering mechanism and have
  85 another service (i.e. Notifier) handle those notifications properly. This is the
  86 general philosophy of notifications in OpenStack. Note that a fault message
  87 consumed by the Notifier is different from the fault message received by the
  88 Inspector; the former message is related to virtual resources which are visible
  89 to users with relevant ownership, whereas the latter is related to raw devices
  90 or small entities which should be handled with an administrator privilege.
  91
  92 The northbound interface between the Notifier and the Consumer/Administrator is
  93 specified in :ref:`impl_nbi`.
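
The split between unfiltered Controller events and Consumer-specific
filtering in the Notifier can be sketched as below. The ``Notifier`` class,
its method names and the callback-based delivery are assumptions made for
this example; the actual northbound interface is the one specified in
:ref:`impl_nbi`.

```python
class Notifier:
    """Hypothetical Notifier: keeps subscriptions and filters fault events."""

    def __init__(self):
        self._subscriptions = {}  # subscription id -> (ids, threshold, cb)
        self._next_id = 1

    def subscribe(self, callback, resource_ids=None, min_severity=None):
        """Register an alarm configuration and return its subscription id."""
        subscription_id = self._next_id
        self._next_id += 1
        self._subscriptions[subscription_id] = (
            resource_ids, min_severity, callback)
        return subscription_id

    def on_fault_event(self, resource_id, severity):
        """Called for every event a Controller emits (no pre-filtering)."""
        delivered = 0
        for resource_ids, min_severity, callback in \
                self._subscriptions.values():
            if resource_ids is not None and resource_id not in resource_ids:
                continue
            # Only severities higher than the threshold are notified.
            if min_severity is not None and severity <= min_severity:
                continue
            callback(resource_id, severity)
            delivered += 1
        return delivered


received = []
notifier = Notifier()
sub_id = notifier.subscribe(
    lambda rid, sev: received.append((rid, sev)),
    resource_ids={"vm-a"}, min_severity=3)
notifier.on_fault_event("vm-a", 5)  # matches the subscription
notifier.on_fault_event("vm-b", 5)  # other resource, not delivered
notifier.on_fault_event("vm-a", 2)  # below the threshold, not delivered
```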
  94
  95 Sequence
  96 --------
  97
  98 Fault Management
  99 ^^^^^^^^^^^^^^^^
 100
 101 The detailed work flow for fault management is as follows (see also :num:`Figure
 102 #figure7`):
 103
 104 1. Request to subscribe to monitor specific virtual resources. A query filter
 105    can be used to narrow down the alarms the Consumer wants to be informed
 106    about.
 107 2. Each subscription request is acknowledged with a subscribe response message.
 108    The response message contains information about the subscribed virtual
 109    resources, in particular if a subscribed virtual resource is in "alarm"
 110    state.
 111 3. The NFVI sends monitoring events for resources the VIM has been subscribed
 112    to. Note: this subscription message exchange between the VIM and NFVI is not
 113    shown in this message flow.
 114 4. Event correlation, fault detection and aggregation in VIM.
 115 5. Database lookup to find the virtual resources affected by the detected fault.
 116 6. Fault notification to Consumer.
 117 7. The Consumer switches to standby configuration (STBY).
 118 8. Instructions to VIM requesting certain actions to be performed on the
 119    affected resources, for example migrate/update/terminate specific
 120    resource(s). After reception of such instructions, the VIM executes the
 121    requested action, e.g. it will migrate or terminate a virtual resource.
 122
 123    a. Query request from Consumer to VIM to get information about the current
 124    status of a resource.
 125    b. Response to the query request with information about the current status of
 126    the queried resource. In case the resource is in "fault" state, information
 127    about the related fault(s) is returned.
 128
 129 In order to allow for quick reaction to failures, the time interval between
 130 fault detection in step 3 and the corresponding recovery actions in steps 7 and 8
 131 shall be less than 1 second.
 132
 133 .. _figure7:
 134
 135 .. figure:: images/figure7.png
 136    :width: 100%
 137
 138    Fault management work flow
 139
 140
 141 .. _figure8:
 142
 143 .. figure:: images/figure8.png
 144    :width: 100%
 145
 146    Fault management scenario
 147
 148 :num:`Figure #figure8` shows a more detailed message flow (Steps 4 to 6) between
 149 the 4 building blocks introduced in :ref:`impl_fb`.
 150
 151 4. The Monitor observes a fault in the NFVI and reports the raw fault to the
 152    Inspector.
 153    The Inspector filters and aggregates the faults using pre-configured
 154    failure policies.
 155
 156 5.
 157    a) The Inspector queries the Resource Map to find the virtual resources
 158    affected by the raw fault in the NFVI.
 159    b) The Inspector updates the state of the affected virtual resources in the
 160    Resource Map.
 161    c) The Controller observes a change of the virtual resource state and informs
 162    the Notifier about the state change and the related alarm(s).
 163    Alternatively, the Inspector may directly inform the Notifier about it.
 164
 165 6. The Notifier performs another round of filtering and aggregation of the
 166    changes and alarms based on the pre-configured alarm configuration. Finally,
 167    a fault notification is sent northbound to the Consumer.
 168
 169 NFVI Maintenance
 170 ^^^^^^^^^^^^^^^^
 171
 172 The detailed work flow for NFVI maintenance is shown in :num:`Figure #figure9`
 173 and has the following steps. Note that steps 1, 2, and 5 to 8a in the NFVI
 174 maintenance work flow are very similar to the steps in the fault management work
 175 flow and share a similar implementation plan in Release 1.
 176
 177 1. Subscribe to fault/maintenance notifications.
 178 2. Response to subscribe request.
 179 3. Maintenance trigger received from administrator.
 180 4. VIM switches NFVI resources to "maintenance" state. This means, for example,
 181    that they should not be used for further allocation/migration requests.
 182 5. Database lookup to find the virtual resources affected by the detected
 183    maintenance operation.
 184 6. Maintenance notification to Consumer.
 185 7. The Consumer switches to standby configuration (STBY).
 186 8. Instructions from Consumer to VIM requesting certain recovery actions to be
 187    performed (step 8a). After reception of such instructions, the VIM
 188    executes the requested action in order to empty the physical resources (step
 189    8b).
 190 9. Maintenance response from VIM to inform the Administrator that the physical
 191    machines have been emptied (or the operation resulted in an error state).
 192 10. Administrator is coordinating and executing the maintenance operation/work
 193     on the NFVI.
 194
 195     A) Query request from Administrator to VIM to get information about the
 196     current state of a resource.
 197     B) Response to the query request with information about the current state of
 198     the queried resource(s). In case the resource is in "maintenance" state,
 199     information about the related maintenance operation is returned.
 200
 201 .. _figure9:
 202
 203 .. figure:: images/figure9.png
 204    :width: 100%
 205
 206    NFVI maintenance work flow
 207
 208
 209 .. _figure10:
 210
 211 .. figure:: images/figure10.png
 212    :width: 100%
 213
 214    NFVI Maintenance implementation plan
 215
 216 :num:`Figure #figure10` shows a more detailed message flow (Steps 4 to 6)
 217 between the 4 building blocks introduced in Section 5.1.
 218
 219 3. The Administrator sends a StateChange request to the Controller residing
 220    in the VIM.
 221 4. The Controller queries the Resource Map to find the virtual resources
 222    affected by the planned maintenance operation.
 223 5.
 224
 225   a) The Controller updates the state of the affected virtual resources in the
 226   Resource Map database.
 227
 228   b) The Controller informs the Notifier about the virtual resources that will
 229   be affected by the maintenance operation.
 230
 231 6. A maintenance notification is sent northbound to the Consumer.
 232
 233 ...
 234
 235 9. The Controller informs the Administrator after the physical resources have
 236    been freed.
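
Steps 4 to 6 and the condition for step 9 can be sketched as follows. The
helper names ``handle_state_change`` and ``host_is_empty`` and the
dictionary-based resource map are hypothetical; the recovery actions between
steps 6 and 9, which the list above elides, are only simulated by clearing
the host's entry.

```python
def handle_state_change(resource_map, vm_state, host_id, notify):
    """Steps 4 to 6 for one physical resource (hypothetical helper)."""
    affected = list(resource_map.get(host_id, []))  # step 4: lookup
    for vm in affected:                             # step 5a: update state
        vm_state[vm] = "maintenance"
    if affected:                                    # steps 5b and 6: notify
        notify(affected)
    return affected


def host_is_empty(resource_map, host_id):
    """Condition for step 9: no virtual resource is left on the host."""
    return not resource_map.get(host_id)


resource_map = {"host-1": ["vm-a", "vm-b"]}
vm_state = {"vm-a": "normal", "vm-b": "normal"}
notifications = []

affected = handle_state_change(
    resource_map, vm_state, "host-1", notifications.append)
empty_before = host_is_empty(resource_map, "host-1")
# Simulated outcome of the Consumer's recovery actions: VMs moved away.
resource_map["host-1"].clear()
empty_after = host_is_empty(resource_map, "host-1")
```

Only once ``host_is_empty`` holds would the Controller inform the
Administrator that the maintenance work can start.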
 237
 238
 239
 240 Implementation plan for OPNFV Release 1
 241 ---------------------------------------
 242
 243 Fault management
 244 ^^^^^^^^^^^^^^^^
 245
 246 :num:`Figure #figure11` shows the implementation plan based on OpenStack and
 247 related components as planned for Release 1. Hereby, the Monitor can be realized
 248 by Zabbix. The Controller is realized by OpenStack Nova [NOVA]_, Neutron
 249 [NEUT]_, and Cinder [CIND]_ for compute, network, and storage,
 250 respectively. The Inspector can be realized by Monasca [MONA]_ or a simple
 251 script querying Nova in order to map between physical and virtual resources. The
 252 Notifier will be realized by Ceilometer [CEIL]_ receiving failure events
 253 on its notification bus.
 254
 255 :num:`Figure #figure12` shows the inner workings of Ceilometer. After receiving
 256 an "event" on its notification bus, first a notification agent will grab the
 257 event and send a "notification" to the Collector. The Collector writes the
 258 notifications received to the Ceilometer databases.
 259
 260 In the existing Ceilometer implementation, an alarm evaluator is periodically
 261 polling those databases through the APIs provided. If it finds new alarms, it
 262 will evaluate them based on the pre-defined alarm configuration, and depending
 263 on the configuration, it will hand a message to the Alarm Notifier, which in
 264 turn will send the alarm message northbound to the Consumer. :num:`Figure
 265 #figure12` also shows an optimized work flow for Ceilometer with the goal to
 266 reduce the delay for fault notifications to the Consumer. The approach is to
 267 implement a new notification agent (called "publisher" in Ceilometer
 268 terminology) which directly sends the alarm through the "Notification Bus"
 269 to a new "Notification-driven Alarm Evaluator (NAE)" (see Sections 5.6.2 and
 270 5.6.3), thereby bypassing the Collector and avoiding the additional delay of the
 271 existing polling-based alarm evaluator. The NAE is similar to the OpenStack
 272 "Alarm Evaluator", but is triggered by incoming notifications instead of
 273 periodically polling the OpenStack "Alarms" database for new alarms. The
 274 Ceilometer "Alarms" database can hold three states: "normal", "insufficient
 275 data", and "fired". It represents a persistent alarm database. In order to
 276 realize the Doctor requirements, we need to define new "meters" in the database
 277 (see Section 5.6.1).
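
The difference between the polling-based evaluator and the proposed
notification-driven evaluator can be illustrated with a toy model. This is
not Ceilometer code: the ``NotificationBus`` class, the topic names
``"collector"`` and ``"alarm"``, and the message layout are assumptions made
purely for illustration.

```python
class NotificationBus:
    """Toy publish/subscribe bus standing in for the notification bus."""

    def __init__(self):
        self._listeners = {}

    def listen(self, topic, callback):
        self._listeners.setdefault(topic, []).append(callback)

    def publish(self, topic, message):
        for callback in self._listeners.get(topic, []):
            callback(message)


database = []          # stands in for the Ceilometer databases
fired_by_polling = []
fired_by_nae = []


def nae_on_notification(message):
    """Notification-driven evaluation: runs as soon as a message arrives."""
    if message["state"] == "error":
        fired_by_nae.append(message["resource_id"])


def polling_evaluator_run(cursor):
    """Polling-based evaluation: sees new entries only when it is run."""
    for message in database[cursor:]:
        if message["state"] == "error":
            fired_by_polling.append(message["resource_id"])
    return len(database)


bus = NotificationBus()
bus.listen("collector", database.append)  # existing path via the Collector
bus.listen("alarm", nae_on_notification)  # proposed path (assumed topic)

message = {"resource_id": "vm-a", "state": "error"}
bus.publish("collector", message)
bus.publish("alarm", message)
nae_before_poll = list(fired_by_nae)
polling_before_poll = list(fired_by_polling)
cursor = polling_evaluator_run(0)  # the next polling cycle
```

In this model the notification-driven path has already raised the alarm
before the polling cycle runs, which is the delay the optimized work flow
aims to remove.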
 278
 279 .. _figure11:
 280
 281 .. figure:: images/figure11.png
 282    :width: 100%
 283
 284    Implementation plan in OpenStack (OPNFV Release 1 "Arno")
 285
 286
 287 .. _figure12:
 288
 289 .. figure:: images/figure12.png
 290    :width: 100%
 291
 292    Implementation plan in Ceilometer architecture
 293
 294
 295 NFVI Maintenance
 296 ^^^^^^^^^^^^^^^^
 297
 298 For NFVI Maintenance, a quite similar implementation plan exists. Instead of a
 299 raw fault being observed by the Monitor, the Administrator is sending a
 300 Maintenance Request through the northbound interface towards the Controller
 301 residing in the VIM. Similar to the Fault Management use case, the Controller
 302 (in our case OpenStack Nova) will send a maintenance event to the Notifier (i.e.
 303 Ceilometer in our implementation). Within Ceilometer, the same workflow as
 304 described in the previous section applies. In addition, the Controller(s) will
 305 take appropriate actions to evacuate the physical machines in order to prepare
 306 them for the planned maintenance operation. After the physical machines are
 307 emptied, the Controller will inform the Administrator that it can initiate the
 308 maintenance.
 309
 310 Information elements
 311 --------------------
 312
 313 This section introduces all attributes and information elements used in the
 314 messages exchanged on the northbound interfaces between the VIM and the NFVO and
 315 VNFM.
 316
 317 Note: The information elements will be aligned with current work in ETSI NFV IFA
 318 working group.
 319
 320
 321 Simple information elements:
 322
 323 * SubscriptionID: identifies a subscription to receive fault or maintenance
 324   notifications.
 325 * NotificationID: identifies a fault or maintenance notification.
 326 * VirtualResourceID (Identifier): identifies a virtual resource affected by a
 327   fault or a maintenance action of the underlying physical resource.
 328 * PhysicalResourceID (Identifier): identifies a physical resource affected by a
 329   fault or maintenance action.
 330 * VirtualResourceState (String): state of a virtual resource, e.g. "normal",
 331   "maintenance", "down", "error".
 332 * PhysicalResourceState (String): state of a physical resource, e.g. "normal",
 333   "maintenance", "down", "error".
 334 * VirtualResourceType (String): type of the virtual resource, e.g. "virtual
 335   machine", "virtual memory", "virtual storage", "virtual CPU", or "virtual
 336   NIC".
 337 * FaultID (Identifier): identifies the related fault in the underlying physical
 338   resource. This can be used to correlate different fault notifications caused
 339   by the same fault in the physical resource.
 340 * FaultType (String): Type of the fault. The allowed values for this parameter
 341   depend on the type of the related physical resource. For example, a resource
 342   of type "compute hardware" may have faults of type "CPU failure", "memory
 343   failure", "network card failure", etc.
 344 * Severity (Integer): value expressing the severity of the fault. The higher the
 345   value, the more severe the fault.
 346 * MinSeverity (Integer): value used in filter information elements. Only faults
 347   with a severity higher than the MinSeverity value will be notified to the
 348   Consumer.
 349 * EventTime (Datetime): Time when the fault was observed.
 350 * EventStartTime and EventEndTime (Datetime): Datetime range that can be used in
 351   a FaultQueryFilter to narrow down the faults to be queried.
 352 * ProbableCause: information about the probable cause of the fault.
 353 * CorrelatedFaultID (Identifier): list of other faults correlated to this fault.
 354 * isRootCause (Boolean): Parameter indicating if this fault is the root for
 355   other correlated faults. If TRUE, then the faults listed in the parameter
 356   CorrelatedFaultID are caused by this fault.
 357 * FaultDetails (Key-value pair): provides additional information about the
 358   fault, e.g. information about the threshold, monitored attributes, indication
 359   of the trend of the monitored parameter.
 360 * FirmwareVersion (String): current version of the firmware of a physical
 361   resource.
 362 * HypervisorVersion (String): current version of a hypervisor.
 363 * ZoneID (Identifier): Identifier of the resource zone. A resource zone is the
 364   logical separation of physical and software resources in an NFVI deployment
 365   for physical isolation, redundancy, or administrative designation.
 366 * Metadata (Key-Value-Pairs): provides additional information of a physical
 367   resource in maintenance/error state.
 368
 369 Complex information elements (see also UML diagrams in :num:`Figure #figure13`
 370 and :num:`Figure #figure14`):
 371
 372 * VirtualResourceInfoClass:
 373
 374   + VirtualResourceID [1] (Identifier)
 375   + VirtualResourceState [1] (String)
 376   + Faults [0..*] (FaultClass): For each resource, all faults
 377     including detailed information about the faults are provided.
 378
 379 * FaultClass: The parameters of the FaultClass are partially based on ETSI TS
 380   132 111-2 (V12.1.0) [*]_, which is specifying fault management in 3GPP, in
 381   particular describing the information elements used for alarm notifications.
 382
 383   - FaultID [1] (Identifier)
 384   - FaultType [1]
 385   - Severity [1] (Integer)
 386   - EventTime [1] (Datetime)
 387   - ProbableCause [1]
 388   - CorrelatedFaultID [0..*] (Identifier)
 389   - FaultDetails [0..*] (Key-value pair)
 390
 391 .. [*] http://www.etsi.org/deliver/etsi_ts/132100_132199/13211102/12.01.00_60/ts_13211102v120100p.pdf
 392
 393 * SubscribeFilterClass
 394
 395   - VirtualResourceType [0..*] (String)
 396   - VirtualResourceID [0..*] (Identifier)
 397   - FaultType [0..*] (String)
 398   - MinSeverity [0..1] (Integer)
 399
 400 * FaultQueryFilterClass: narrows down the FaultQueryRequest, for example it
 401   limits the query to certain physical resources, a certain zone, a given fault
 402   type/severity/cause, or a specific FaultID.
 403
 404   - VirtualResourceType [0..*] (String)
 405   - VirtualResourceID [0..*] (Identifier)
 406   - FaultType [0..*] (String)
 407   - MinSeverity [0..1] (Integer)
 408   - EventStartTime [0..1] (Datetime)
 409   - EventEndTime [0..1] (Datetime)
 410
 411 * PhysicalResourceStateClass:
 412
 413   - PhysicalResourceID [1] (Identifier)
 414   - PhysicalResourceState [1] (String): mandates the new state of the physical
 415     resource.
 416
 417 * PhysicalResourceInfoClass:
 418
 419   - PhysicalResourceID [1] (Identifier)
 420   - PhysicalResourceState [1] (String)
 421   - FirmwareVersion [0..1] (String)
 422   - HypervisorVersion [0..1] (String)
 423   - ZoneID [0..1] (Identifier)
 424
 425 * StateQueryFilterClass: narrows down a StateQueryRequest, for example it limits
 426   the query to certain physical resources, a certain zone, or a given resource
 427   state (e.g., only resources in "maintenance" state).
 428
 429   - PhysicalResourceID [1] (Identifier)
 430   - PhysicalResourceState [1] (String)
 431   - ZoneID [0..1] (Identifier)
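
As a rough illustration, three of the complex information elements above can
be written as Python data classes. The attribute names are snake_case
renderings of the element names; the field types for FaultType and
ProbableCause, and the rule that an empty list or ``None`` means "no
restriction", are assumptions of this sketch and are not stated in the
specification.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, List, Optional


@dataclass
class Fault:
    """Mirrors FaultClass; FaultType/ProbableCause are assumed strings."""
    fault_id: str
    fault_type: str
    severity: int
    event_time: datetime
    probable_cause: str
    correlated_fault_ids: List[str] = field(default_factory=list)
    fault_details: Dict[str, str] = field(default_factory=dict)


@dataclass
class VirtualResourceInfo:
    """Mirrors VirtualResourceInfoClass."""
    virtual_resource_id: str
    virtual_resource_state: str
    faults: List[Fault] = field(default_factory=list)


@dataclass
class SubscribeFilter:
    """Mirrors SubscribeFilterClass; empty/None means no restriction."""
    virtual_resource_types: List[str] = field(default_factory=list)
    virtual_resource_ids: List[str] = field(default_factory=list)
    fault_types: List[str] = field(default_factory=list)
    min_severity: Optional[int] = None

    def matches(self, resource_type, resource_id, fault):
        if (self.virtual_resource_types
                and resource_type not in self.virtual_resource_types):
            return False
        if (self.virtual_resource_ids
                and resource_id not in self.virtual_resource_ids):
            return False
        if self.fault_types and fault.fault_type not in self.fault_types:
            return False
        # Only faults with a severity higher than MinSeverity are notified.
        if (self.min_severity is not None
                and fault.severity <= self.min_severity):
            return False
        return True


fault = Fault("fault-1", "CPU failure", 4,
              datetime(2015, 1, 1, 12, 0), "overheating")
info = VirtualResourceInfo("vm-a", "error", [fault])
flt = SubscribeFilter(virtual_resource_ids=["vm-a"], min_severity=3)
```

With the example values, ``flt.matches("virtual machine", "vm-a", fault)``
accepts the fault, whereas a filter with ``min_severity=4`` rejects it
because the severity is not strictly higher than the threshold.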
 432
 433 .. _impl_nbi:
 434
 435 Detailed northbound interface specification
 436 -------------------------------------------
 437
 438 This section specifies the northbound interfaces for fault management and
 439 NFVI maintenance between the VIM on the one end and the Consumer and the
 440 Administrator on the other end. For each interface, all messages and related
 441 information elements are provided.
 442
 443 Note: The interface definition will be aligned with current work in ETSI NFV IFA
 444 working group.
 445
 446 All of the interfaces described below are produced by the VIM and consumed by
 447 the Consumer or Administrator.
 448
 449 Fault management interface
 450 ^^^^^^^^^^^^^^^^^^^^^^^^^^
 451
 452 This interface allows the VIM to notify the Consumer about a virtual resource
 453 that is affected by a fault, either within the virtual resource itself or by the
 454 underlying virtualization infrastructure. The messages on this interface are
 455 shown in :num:`Figure #figure13` and explained in detail in the following
 456 subsections.
 457
 458 Note: The information elements used in this section are described in detail in
 459 Section 5.4.
 460
 461 .. _figure13:
 462
 463 .. figure:: images/figure13.png
 464    :width: 100%
 465
 466    Fault management NB I/F messages
 467
 468
 469 SubscribeRequest (Consumer -> VIM)
 470 __________________________________
 471
 472 Subscription from Consumer to VIM to be notified about faults of specific
 473 resources. The faults to be notified about can be narrowed down using a
 474 subscribe filter.
 475
 476 Parameters:
 477
 478 - SubscribeFilter [1] (SubscribeFilterClass): Optional information to narrow
 479   down the faults that shall be notified to the Consumer, for example limit to
 480   specific VirtualResourceID(s), severity, or cause of the alarm.
 481
 482 SubscribeResponse (VIM -> Consumer)
 483 ___________________________________
 484
 485 Response to a subscribe request message including information about the
 486 subscribed resources, in particular if they are in "fault/error" state.
 487
 488 Parameters:
 489
 490 * SubscriptionID [1] (Identifier): Unique identifier for the subscription. It
 491   can be used to delete or update the subscription.
 492 * VirtualResourceInfo [0..*] (VirtualResourceInfoClass): Provides additional
 493   information about the subscribed resources, i.e., a list of the related
 494   resources, the current state of the resources, etc.
 495
 496 FaultNotification (VIM -> Consumer)
 497 ___________________________________
 498
 499 Notification about a virtual resource that is affected by a fault, either within
 500 the virtual resource itself or by the underlying virtualization infrastructure.
 501 After reception of this notification, the Consumer will decide on the optimal
 502 action to resolve the fault. This includes actions like switching to a hot
 503 standby virtual resource, migration of the faulty virtual resource to another
 504 physical machine, termination of the faulty virtual resource and instantiation
 505 of a new virtual resource in order to provide a new hot standby resource.
 506 Existing resource management interfaces and messages between the Consumer and
 507 the VIM can be used for those actions, and there is no need to define additional
 508 actions on the Fault Management Interface.
 509
 510 Parameters:
 511
 512 * NotificationID [1] (Identifier): Unique identifier for the notification.
 513 * VirtualResourceInfo [1..*] (VirtualResourceInfoClass): List of faulty
 514   resources with detailed information about the faults.
 515
 516 FaultQueryRequest (Consumer -> VIM)
 517 ___________________________________
 518
 519 Request to find out about active alarms at the VIM. A FaultQueryFilter can be
 520 used to narrow down the alarms returned in the response message.
 521
 522 Parameters:
 523
 524 * FaultQueryFilter [1] (FaultQueryFilterClass): narrows down the
 525   FaultQueryRequest, for example it limits the query to certain physical
 526   resources, a certain zone, a given fault type/severity/cause, or a specific
 527   FaultID.
 528
 529 FaultQueryResponse (VIM -> Consumer)
 530 ____________________________________
 531
 532 List of active alarms at the VIM matching the FaultQueryFilter specified in the
 533 FaultQueryRequest.
 534
 535 Parameters:
 536
 537 * VirtualResourceInfo [0..*] (VirtualResourceInfoClass): List of faulty
 538   resources. For each resource all faults including detailed information about
 539   the faults are provided.
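
The FaultQueryRequest/FaultQueryResponse exchange can be sketched as a filter
over the active alarms held by the VIM. The function ``fault_query`` and the
dictionary layout are hypothetical; the keys reuse the information element
names, and a filter argument left as ``None`` is simply not applied, which is
an assumption of this sketch.

```python
from datetime import datetime


def fault_query(active_faults, fault_types=None, min_severity=None,
                event_start=None, event_end=None):
    """Return {VirtualResourceID: [faults]} for faults passing the filter."""

    def passes(fault):
        if fault_types is not None and fault["FaultType"] not in fault_types:
            return False
        # Only severities higher than MinSeverity are returned.
        if min_severity is not None and fault["Severity"] <= min_severity:
            return False
        if event_start is not None and fault["EventTime"] < event_start:
            return False
        if event_end is not None and fault["EventTime"] > event_end:
            return False
        return True

    response = {}
    for resource_id, faults in active_faults.items():
        matching = [f for f in faults if passes(f)]
        if matching:
            response[resource_id] = matching
    return response


active_faults = {
    "vm-a": [{"FaultID": "f1", "FaultType": "CPU failure", "Severity": 5,
              "EventTime": datetime(2015, 1, 1, 12, 0)}],
    "vm-b": [{"FaultID": "f2", "FaultType": "memory failure", "Severity": 2,
              "EventTime": datetime(2015, 1, 2, 12, 0)}],
}
severe = fault_query(active_faults, min_severity=3)
recent = fault_query(active_faults, event_start=datetime(2015, 1, 2))
```

An empty result corresponds to a FaultQueryResponse with zero
VirtualResourceInfo entries, which the ``[0..*]`` cardinality permits.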
 540
 541 NFVI maintenance
 542 ^^^^^^^^^^^^^^^^
 543
 544 The NFVI maintenance interface Consumer-VIM allows the Consumer to subscribe to
 545 maintenance notifications provided by the VIM. The related maintenance interface
 546 Administrator-VIM allows the Administrator to issue maintenance requests to the
 547 VIM, i.e. requesting the VIM to take appropriate actions to empty physical
 548 machine(s) in order to execute maintenance operations on them. The interface
 549 also allows the Administrator to query the state of physical machines, e.g., in
 550 order to get details on the current status of a maintenance operation like a
 551 firmware update.
 552
 553 The messages defined in these northbound interfaces are shown in :num:`Figure
 554 #figure14` and described in detail in the following subsections.
 555
 556 .. _figure14:
 557
 558 .. figure:: images/figure14.png
 559    :width: 100%
 560
 561    NFVI maintenance NB I/F messages
 562
 563 SubscribeRequest (Consumer -> VIM)
 564 __________________________________
 565
 566 Subscription from Consumer to VIM to be notified about maintenance operations
 567 for specific virtual resources. The resources to be informed about can be
 568 narrowed down using a subscribe filter.
 569
 570 Parameters:
 571
 572 * SubscribeFilter [1] (SubscribeFilterClass): Information to narrow down the
 573   maintenance notifications that shall be sent to the Consumer, for example
 574   limit to specific virtual resource type(s).
 575
 576 SubscribeResponse (VIM -> Consumer)
 577 ___________________________________
 578
 579 Response to a subscribe request message, including information about the
 580 subscribed virtual resources, in particular if they are in "maintenance" state.
 581
 582 Parameters:
 583
 584 * SubscriptionID [1] (Identifier): Unique identifier for the subscription. It
 585   can be used to delete or update the subscription.
 586 * VirtualResourceInfo [0..*] (VirtualResourceInfoClass): Provides additional
 587   information about the subscribed virtual resource(s), e.g., the ID, type and
 588   current state of the resource(s).
 589
 590 MaintenanceNotification (VIM -> Consumer)
 591 _________________________________________
 592
 593 Notification about a physical resource switched to "maintenance" state. After
 594 reception of this notification, the Consumer will decide on the optimal action
 595 to address it, e.g., to switch to the standby (STBY) configuration.
 596
 597 Parameters:
 598
 599 * VirtualResourceInfo [1..*] (VirtualResourceInfoClass): List of virtual
 600   resources where the state has been changed to maintenance.
 601
 602 StateChangeRequest (Administrator -> VIM)
 603 _________________________________________
 604
 605 Request to change the state of a list of physical resources, e.g. to
 606 "maintenance" state, in order to prepare them for a planned maintenance
 607 operation.
 608
 609 Parameters:
 610
 611 * PhysicalResourceState [1..*] (PhysicalResourceStateClass)
 612
 613 StateChangeResponse (VIM -> Administrator)
 614 __________________________________________
 615
 616 Response message to inform the Administrator that the requested resources are
 617 now in maintenance state (or the operation resulted in an error) and the
 618 maintenance operation(s) can be executed.
 619
 620 Parameters:
 621
 622 * PhysicalResourceInfo [1..*] (PhysicalResourceInfoClass)
 623
 624 StateQueryRequest (Administrator -> VIM)
 625 ________________________________________
 626
 627 In this procedure, the Administrator would like to get the information about
 628 physical machine(s), e.g. their state ("normal", "maintenance"), firmware
 629 version, hypervisor version, update status of firmware and hypervisor, etc. It
 630 can be used to check the progress during firmware update and the confirmation
 631 after update. A filter can be used to narrow down the resources returned in the
 632 response message.
 633
 634 Parameters:
 635
 636 * StateQueryFilter [1] (StateQueryFilterClass): narrows down the
 637   StateQueryRequest, for example it limits the query to certain physical
 638   resources, a certain zone, or a given resource state.
 639
 640 StateQueryResponse (VIM -> Administrator)
 641 _________________________________________
 642
 643 List of physical resources matching the filter specified in the
 644 StateQueryRequest.
 645
 646 Parameters:
 647
 648 * PhysicalResourceInfo [0..*] (PhysicalResourceInfoClass): List of physical
 649   resources. For each resource, information about the current state, the
 650   firmware version, etc. is provided.
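
A StateQueryRequest can be pictured as a simple filter over the physical
resource records. In the sketch below, ``state_query`` is a hypothetical
helper, and every filter field is treated as optional, which is an assumption
that is looser than the ``[1]`` cardinalities given for
StateQueryFilterClass.

```python
def state_query(resources, resource_ids=None, state=None, zone_id=None):
    """Return the PhysicalResourceInfo entries that pass the filter."""
    return [
        r for r in resources
        if (resource_ids is None or r["PhysicalResourceID"] in resource_ids)
        and (state is None or r["PhysicalResourceState"] == state)
        and (zone_id is None or r.get("ZoneID") == zone_id)
    ]


resources = [
    {"PhysicalResourceID": "host-1", "PhysicalResourceState": "maintenance",
     "FirmwareVersion": "1.2", "ZoneID": "zone-a"},
    {"PhysicalResourceID": "host-2", "PhysicalResourceState": "normal",
     "ZoneID": "zone-a"},
    {"PhysicalResourceID": "host-3", "PhysicalResourceState": "maintenance",
     "ZoneID": "zone-b"},
]
in_maintenance = state_query(resources, state="maintenance")
zone_a_maintenance = state_query(
    resources, state="maintenance", zone_id="zone-a")
```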
 651
 652 Blueprints
 653 ----------
 654
 655 This section lists a first set of blueprints that have been proposed by the
 656 Doctor project to the open source community. Further blueprints addressing other
 657 gaps identified in Section 4 will be submitted at a later stage of OPNFV. In
 658 this section the following definitions are used:
 659
 660 * "Event" is a message emitted by other OpenStack services such as Nova and
 661   Neutron and is consumed by the "Notification Agents" in Ceilometer.
 662 * "Notification" is a message generated by a "Notification Agent" in Ceilometer
 663   based on an "event" and is delivered to the "Collectors" in Ceilometer that
 664   store those notifications (as "sample") to the Ceilometer "Databases".
 665
 666 Instance State Notification  (Ceilometer) [*]_
 667 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 668
 669 The Doctor project is planning to handle "events" and "notifications" regarding
 670 Resource Status: Instance State, Port State, Host State, etc. Currently,
 671 Ceilometer already receives "events" to identify the state of those resources,
 672 but it does not handle and store them yet. This is why we also need a new event
 673 definition to capture those resource states from "events" created by other
 674 services.
 675
 676 This blueprint (BP) proposes to add a new compute notification state to handle
 677 events from an instance (server) from Nova. It also creates a new meter
 678 "instance.state" in OpenStack.
 679
 680 .. [*] https://etherpad.opnfv.org/p/doctor_bps
 681
 682 Event Publisher for Alarm  (Ceilometer) [*]_
 683 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 684
 685 **Problem statement:**
 686
 687   The existing "Alarm Evaluator" in OpenStack Ceilometer is periodically
 688   querying/polling the databases in order to check all alarms independently from
 689   other processes. This adds delay to the fault notification
 690   sent to the Consumer, whereas one requirement of Doctor is to react to faults
 691   as fast as possible.
 692
  The existing message flow is shown in :num:`Figure #figure12`: after receiving
  an "event", a "notification agent" (i.e. "event publisher") sends a
  "notification" to a "Collector". The "Collector" collects the notifications
  and updates the Ceilometer "Meter" database, which stores information about
  the "sample" captured from the original "event". The "Alarm Evaluator"
  periodically polls this database, querying the "Meter" database based on each
  alarm configuration.
 700
  In the current Ceilometer implementation, there is no way to directly trigger
  the "Alarm Evaluator" when a new "event" is received. Instead, the "Alarm
  Evaluator" only finds out that it needs to fire a new notification to the
  Consumer when it polls the database.
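
  The delay introduced by polling alone can be estimated with a short sketch.
  It assumes that polls happen at fixed multiples of the evaluation interval and
  it ignores the time needed to store the sample and to evaluate the alarm. The
  function name and the 60 second interval are illustrative assumptions.

  .. code-block:: python

      def polling_delay(event_time, interval):
          """Return how long an event waits until the next poll.

          Polls are assumed to happen at 0, interval, 2 * interval, ...
          """
          return (-event_time) % interval


      # Assumed evaluation interval in seconds (illustrative value).
      EVALUATION_INTERVAL = 60

      # Events arriving 1 s, 30 s and 60 s after a poll.
      delays = [polling_delay(t, EVALUATION_INTERVAL) for t in (1, 30, 60)]
      # delays == [59, 30, 0]

  Under these assumptions, an event that arrives just after a poll waits for
  almost a full interval before the "Alarm Evaluator" can notice it.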
 705
 706 **Change/feature request:**
 707
  This BP proposes to add a new "event publisher for alarm", which bypasses
  several steps in Ceilometer in order to avoid the polling-based approach of
  the existing Alarm Evaluator, which makes notifications to users slow.

  After receiving an "(alarm) event" by listening on the Ceilometer message
  queue ("notification bus"), the new "event publisher for alarm" immediately
  hands a "notification" about this event to a new Ceilometer component, the
  "Notification-driven alarm evaluator" proposed in the other BP (see Section
  5.6.3).
 717
 718   Note, the term "publisher" refers to an entity in the Ceilometer architecture
 719   (it is a "notification agent"). It offers the capability to provide
 720   notifications to other services outside of Ceilometer, but it is also used to
 721   deliver notifications to other Ceilometer components (e.g. the "Collectors")
 722   via the Ceilometer "notification bus".
 723
**Implementation detail:**

  * "Event publisher for alarm" is part of Ceilometer.
  * The standard AMQP message queue is used with a new topic string.
  * No new interfaces have to be added to Ceilometer.
  * "Event publisher for alarm" can be configured by the Administrator of
    Ceilometer to be used as a "Notification Agent" in addition to the existing
    "Notifier".
  * Existing alarm mechanisms of Ceilometer can be used, allowing users to
    configure how to distribute the "notifications" transformed from "events",
    e.g. there is an option whether an ongoing alarm is re-issued or not
    ("repeat_actions").
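
  The effect of the "repeat_actions" option mentioned in the list above can be
  sketched as follows. This is a simplified illustration, not the Ceilometer
  implementation. The function name and the state strings are assumptions.

  .. code-block:: python

      def should_notify(previous_state, new_state, repeat_actions):
          """Decide whether a notification is sent after an evaluation."""
          if new_state != previous_state:
              # A state transition is always reported.
              return True
          # Unchanged state: re-issue only if the option is enabled.
          return repeat_actions


      transition = should_notify("ok", "alarm", repeat_actions=False)
      ongoing_once = should_notify("alarm", "alarm", repeat_actions=False)
      ongoing_repeat = should_notify("alarm", "alarm", repeat_actions=True)

  Here only ``ongoing_once`` is ``False``: an ongoing alarm is re-issued only
  when "repeat_actions" is set.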
 736
 737 .. [*] https://etherpad.opnfv.org/p/doctor_bps
 738
 739 Notification-driven alarm evaluator (Ceilometer) [*]_
 740 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 741
 742 **Problem statement:**
 743
The existing "Alarm Evaluator" in OpenStack Ceilometer periodically polls the
databases in order to check all alarms independently of other processes. This
adds delay to the fault notification sent to the Consumer, whereas one
requirement of Doctor is to react to faults as fast as possible.
 749
 750 **Change/feature request:**
 751
This BP proposes to add an alternative "Notification-driven Alarm Evaluator"
for Ceilometer that receives "notifications" sent by the "Event Publisher for
Alarm" described in the other BP. Once this new "Notification-driven Alarm
Evaluator" receives a "notification", it finds the "alarm" configurations which
may relate to the "notification" by querying the "alarm" database with keys
such as the resource ID. It then evaluates each alarm with the information in
that "notification".
 759
After the alarm evaluation, it behaves in the same way as the existing "Alarm
Evaluator" when firing alarm notifications to the Consumer. Similar to the
existing Alarm Evaluator, this new "Notification-driven Alarm Evaluator"
aggregates and correlates different alarms, which are then provided northbound
to the Consumer via the OpenStack "Alarm Notifier". The user/administrator can
register the alarm configuration via the existing Ceilometer API [*]_ and
thereby configure whether to set an alarm or not and where to send the alarms.
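
The lookup and evaluation steps described above can be sketched as follows.
This is a simplified model under stated assumptions, not the proposed
Ceilometer code: the class name, the alarm fields (``trigger_state``,
``alarm_action``) and the example URL are hypothetical, and an in-memory index
stands in for the "alarm" database.

.. code-block:: python

    from collections import defaultdict


    class NotificationDrivenEvaluator:
        """Evaluate alarms when a notification arrives, without polling."""

        def __init__(self, alarms):
            # Index the alarm configurations by resource ID for fast lookup.
            self._by_resource = defaultdict(list)
            for alarm in alarms:
                self._by_resource[alarm["resource_id"]].append(alarm)

        def on_notification(self, notification):
            """Return (alarm name, action) pairs fired by this notification."""
            fired = []
            related = self._by_resource.get(notification["resource_id"], [])
            for alarm in related:
                if notification["state"] == alarm["trigger_state"]:
                    fired.append((alarm["name"], alarm["alarm_action"]))
            return fired


    alarms = [
        {"name": "vm-1-error", "resource_id": "vm-1",
         "trigger_state": "error",
         "alarm_action": "http://consumer.example/alarm"},
        {"name": "vm-2-error", "resource_id": "vm-2",
         "trigger_state": "error",
         "alarm_action": "http://consumer.example/alarm"},
    ]
    evaluator = NotificationDrivenEvaluator(alarms)
    fired = evaluator.on_notification({"resource_id": "vm-1",
                                       "state": "error"})

Only the alarm registered for ``vm-1`` is evaluated and fired in this example.
Alarms for other resources are not touched.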
 767
**Implementation detail:**

* The new "Notification-driven Alarm Evaluator" is part of Ceilometer.
* Most of the existing source code of the "Alarm Evaluator" can be re-used to
  implement this BP.
* No additional application logic is needed.
* It will access the Ceilometer Databases just like the existing "Alarm
  Evaluator".
* Only the polling-based approach will be replaced by a listener for
  "notifications" provided by the "Event Publisher for Alarm" on the Ceilometer
  "notification bus".
* No new interfaces have to be added to Ceilometer.
 780
 781
 782 .. [*] https://etherpad.opnfv.org/p/doctor_bps
 783 .. [*] https://wiki.openstack.org/wiki/Ceilometer/Alerting
 784
 785 Report host fault to update server state immediately (Nova) [*]_
 786 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 787
 788 **Problem statement:**
 789
* The Nova state change for a failed or unreachable host is slow and does not
  reliably indicate whether the host is down. This might cause the same server
  instance to run twice if action is taken to evacuate the instance to another
  host.
* The Nova state for server(s) on a failed host does not change, but remains
  active and running. This gives the user false information about the server
  state.
* The VIM northbound interface notification of host faults towards the VNFM and
  NFVO should be in line with the OpenStack state. This fault notification is a
  Telco requirement defined in ETSI and will be implemented by the OPNFV Doctor
  project.
* An OpenStack user cannot perform HA actions quickly and reliably by trusting
  the server state and host state.
 800
 801 **Proposed change:**
 802
There needs to be a new API that lets the Admin state that a host is down. This
API is used to mark the services running on that host as down, to reflect the
real situation.
 805
An example for a compute node:

* When the compute node is up and running::

    vm_state: active power_state: running
    nova-compute state: up status: enabled

* When the compute node goes down and the new API is called to state that the
  host is down::

    vm_state: stopped power_state: shutdown
    nova-compute state: down status: enabled
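
The state changes in the example above can be modelled with a short sketch. It
is not the proposed Nova API: the function name and the dictionary-based
records are hypothetical and only mirror the fields shown in the example.

.. code-block:: python

    def mark_host_down(host, services, servers):
        """Apply the state changes from the example to a failed host."""
        for service in services:
            if service["host"] == host and service["binary"] == "nova-compute":
                service["state"] = "down"  # "status" stays "enabled"
        for server in servers:
            if server["host"] == host:
                server["vm_state"] = "stopped"
                server["power_state"] = "shutdown"


    services = [
        {"host": "compute-1", "binary": "nova-compute",
         "state": "up", "status": "enabled"},
    ]
    servers = [
        {"id": "vm-1", "host": "compute-1",
         "vm_state": "active", "power_state": "running"},
        {"id": "vm-2", "host": "compute-2",
         "vm_state": "active", "power_state": "running"},
    ]
    mark_host_down("compute-1", services, servers)

After the call, the service and the server on ``compute-1`` carry the "down"
and "stopped"/"shutdown" states, while the server on ``compute-2`` is
unchanged.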
 817
 818 **Alternatives:**
 819
There is no attractive alternative for detecting all the different host faults
other than having an external tool detect them. For this kind of tool to exist,
there needs to be a new API in Nova to report faults. Currently, some kind of
workaround must be implemented, as the states cannot be trusted or obtained
from OpenStack fast enough.
 825
 826 .. [*] https://blueprints.launchpad.net/nova/+spec/update-server-state-immediately
 827
 828 Other related BPs
 829 ^^^^^^^^^^^^^^^^^
 830
 831 This section lists some BPs related to Doctor, but proposed by drafters outside
 832 the OPNFV community.
 833
 834 pacemaker-servicegroup-driver [*]_
 835 __________________________________
 836
This BP will detect a host that is down and report it to OpenStack quite fast.
However, this might not work properly, for example, when the management network
has a problem and the host is reported faulty while VMs are still running
there. This might lead to launching the same VM instance twice, causing
problems. Also, the NB IF message needs a fault reason, and for that the source
needs to be a tool that detects different kinds of faults, as Doctor will be
doing. This BP might also need enhancement to change server and service states
correctly.
 844
 845 .. [*] https://blueprints.launchpad.net/nova/+spec/pacemaker-servicegroup-driver
 846
 847 ..
 848  vim: set tabstop=4 expandtab textwidth=80: