requirements/05-implementation.rst

   1 Detailed architecture and interface specification
   2 =================================================
   3
   4 This section describes a detailed implementation plan, which is based on the
   5 high level architecture introduced in Section 3. Section 5.1 describes the
   6 functional blocks of the Doctor architecture, which is followed by a high level
   7 message flow in Section 5.2. Section 5.3 provides a mapping of selected existing
   8 open source components to the building blocks of the Doctor architecture.
   9 Thereby, the selection of components is based on their maturity and the gap
  10 analysis executed in Section 4. Sections 5.4 and 5.5 detail the specification of
  11 the related northbound interface and the related information elements. Finally,
  12 Section 5.6 provides a first set of blueprints to address selected gaps required
  13 for the realization of the functionalities of the Doctor project.
  14
  15 .. _impl_fb:
  16
  17 Functional Blocks
  18 -----------------
  19
  20 This section introduces the functional blocks that form the VIM. OpenStack was
  21 selected as the candidate for implementation. Inside the VIM, 4 different
  22 building blocks are defined (see :num:`Figure #figure6`).
  23
  24 .. _figure6:
  25
  26 .. figure:: images/figure6.png
  27    :width: 100%
  28
  29    Functional blocks
  30
  31 Monitor
  32 ^^^^^^^
  33
  34 The Monitor module has the responsibility for monitoring the virtualized
  35 infrastructure. There are already many existing tools and services (e.g. Zabbix)
  36 to monitor different aspects of hardware and software resources which can be
  37 used for this purpose.
  38
  39 Inspector
  40 ^^^^^^^^^
  41
  42 The Inspector module has the ability a) to receive various failure notifications
  43 regarding physical resource(s) from Monitor module(s), b) to find the affected
  44 virtual resource(s) by querying the resource map in the Controller, and c) to
  45 update the state of the virtual resource (and physical resource).
  46
  47 The Inspector has drivers for different types of events and resources to
  48 integrate any type of Monitor and Controller modules. It also uses a failure
  49 policy database to decide on the failure selection and aggregation from raw
  50 events. This failure policy database is configured by the Administrator.
  51
  52 The reason for separation of the Inspector and Controller modules is to make the
  53 Controller focus on simple operations by avoiding a tight integration of various
  54 health check mechanisms into the Controller.
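
The three Inspector responsibilities listed above can be sketched in a few lines. This is a minimal illustration, not the Doctor implementation: ``ResourceMap``, ``Inspector`` and ``on_failure`` are hypothetical names, and the resource map is assumed to be a plain in-memory mapping rather than a Controller API.

```python
from dataclasses import dataclass, field


@dataclass
class ResourceMap:
    """Hypothetical stand-in for the Controller's resource map."""
    # physical resource ID -> IDs of the virtual resources hosted on it
    hosted: dict[str, list[str]] = field(default_factory=dict)
    # resource ID (physical or virtual) -> current state
    state: dict[str, str] = field(default_factory=dict)


class Inspector:
    """Illustrative Inspector; names and states are assumptions."""

    def __init__(self, resource_map: ResourceMap) -> None:
        self.resource_map = resource_map

    def on_failure(self, physical_id: str) -> list[str]:
        """a) Receive a failure notification for one physical resource."""
        # b) find the affected virtual resources via the resource map
        affected = list(self.resource_map.hosted.get(physical_id, []))
        # c) update the state of the physical and the virtual resources
        self.resource_map.state[physical_id] = "down"
        for virtual_id in affected:
            self.resource_map.state[virtual_id] = "error"
        return affected
```

Keeping the lookup and the state update behind the resource map reflects the separation described above: the Inspector holds the health-check logic, while the Controller side stays a simple store.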
  55
  56 Controller
  57 ^^^^^^^^^^
  58
  59 The Controller is responsible for maintaining the resource map (i.e. the mapping
  60 from physical resources to virtual resources), accepting update requests for the
  61 resource state(s) (exposed as a provider API), and sending all failure events
  62 regarding virtual resources to the Notifier. Optionally, the Controller has the
  63 ability to force the state of a given physical resource to down in the resource
  64 mapping when it receives failure notifications from the Inspector for that
  65 given physical resource.
  66 The Controller also re-calculates the capacity of the NFVI when receiving a
  67 failure notification for a physical resource.
  68
  69 In a real-world deployment, the VIM may have several controllers, one for each
  70 resource type, such as Nova, Neutron and Cinder in OpenStack. Each controller
  71 maintains a database of virtual and physical resources which shall be the master
  72 source for resource information inside the VIM.
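
As a rough sketch of the optional "force down" behaviour and the capacity re-calculation, the following assumes a single controller that measures capacity in vCPUs. The class, its method names and the vCPU unit are illustrative assumptions and do not correspond to a Nova, Neutron or Cinder API.

```python
class Controller:
    """Hypothetical controller tracking hosts and the capacity they offer."""

    def __init__(self, host_vcpus: dict[str, int]) -> None:
        # host ID -> number of vCPUs the host offers (assumed capacity unit)
        self.host_vcpus = dict(host_vcpus)
        # host ID -> current state; every host starts out as "normal"
        self.host_state = {host: "normal" for host in host_vcpus}

    def force_down(self, host_id: str) -> None:
        """Mark a physical resource as down after a failure notification."""
        if host_id not in self.host_state:
            raise KeyError(f"unknown physical resource: {host_id}")
        self.host_state[host_id] = "down"

    def available_capacity(self) -> int:
        """Re-calculate capacity, counting only hosts in the "normal" state."""
        return sum(
            vcpus
            for host, vcpus in self.host_vcpus.items()
            if self.host_state[host] == "normal"
        )
```

With two hosts offering 16 and 32 vCPUs, forcing the first one down reduces the reported capacity from 48 to 32.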
  73
  74 Notifier
  75 ^^^^^^^^
  76
  77 The focus of the Notifier is on selecting and aggregating failure events
  78 received from the controller based on policies mandated by the Consumer.
  79 Therefore, it allows the Consumer to subscribe for alarms regarding virtual
  80 resources using a method such as an API endpoint. After receiving a fault
  81 event from a Controller, it will notify the fault to the Consumer by referring
  82 to the alarm configuration which was defined by the Consumer earlier on.
  83
  84 To reduce complexity of the Controller, it is a good approach for the
  85 Controllers to emit all notifications without any filtering mechanism and have
  86 another service (i.e. Notifier) handle those notifications properly. This is the
  87 general philosophy of notifications in OpenStack. Note that a fault message
  88 consumed by the Notifier is different from the fault message received by the
  89 Inspector; the former message is related to virtual resources which are visible
  90 to users with relevant ownership, whereas the latter is related to raw devices
  91 or small entities which should be handled with an administrator privilege.
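
The division of labour described above (Controllers emit everything, the Notifier filters per Consumer) might look as follows. This is a simplified sketch: the ``Notifier`` class, the event dictionary layout and the callback-based delivery are assumptions made for illustration, and the alarm configuration is reduced to a set of watched resource IDs.

```python
from typing import Callable

Event = dict  # assumed layout, e.g. {"resource_id": "vm-a", "state": "error"}


class Notifier:
    """Illustrative Notifier keeping one alarm configuration per subscription."""

    def __init__(self) -> None:
        self._next_id = 1
        # subscription ID -> (watched resource IDs, Consumer callback)
        self._subs: dict[int, tuple[set[str], Callable[[Event], None]]] = {}

    def subscribe(self, resource_ids: set[str],
                  callback: Callable[[Event], None]) -> int:
        """Register a Consumer's alarm configuration and return its ID."""
        sub_id = self._next_id
        self._next_id += 1
        self._subs[sub_id] = (set(resource_ids), callback)
        return sub_id

    def on_event(self, event: Event) -> int:
        """Handle an unfiltered Controller event; return deliveries made."""
        delivered = 0
        for watched, callback in self._subs.values():
            if event["resource_id"] in watched:
                callback(event)
                delivered += 1
        return delivered
```

A Consumer that subscribed to ``{"vm-a"}`` receives events for ``vm-a`` only; events for other resources are dropped by the Notifier, not by the Controller.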
  92
  93 The northbound interface between the Notifier and the Consumer/Administrator is
  94 specified in :ref:`impl_nbi`.
  95
  96 Sequence
  97 --------
  98
  99 Fault Management
 100 ^^^^^^^^^^^^^^^^
 101
 102 The detailed work flow for fault management is as follows (see also :num:`Figure
 103 #figure7`):
 104
 105 1. Request to subscribe to monitor specific virtual resources. A query filter
 106    can be used to narrow down the alarms the Consumer wants to be informed
 107    about.
 108 2. Each subscription request is acknowledged with a subscribe response message.
 109    The response message contains information about the subscribed virtual
 110    resources, in particular if a subscribed virtual resource is in "alarm"
 111    state.
 112 3. The NFVI sends monitoring events for resources the VIM has been subscribed
 113    to. Note: this subscription message exchange between the VIM and NFVI is not
 114    shown in this message flow.
 115 4. Event correlation, fault detection and aggregation in VIM.
 116 5. Database lookup to find the virtual resources affected by the detected fault.
 117 6. Fault notification to Consumer.
 118 7. The Consumer switches to standby configuration (STBY).
 119 8. Instructions to VIM requesting certain actions to be performed on the
 120    affected resources, for example migrate/update/terminate specific
 121    resource(s). After reception of such instructions, the VIM executes the
 122    requested action, e.g. it will migrate or terminate a virtual resource.
 123
 124    a. Query request from Consumer to VIM to get information about the current
 125    status of a resource.
 126    b. Response to the query request with information about the current status of
 127    the queried resource. In case the resource is in "fault" state, information
 128    about the related fault(s) is returned.
 129
 130 In order to allow for quick reaction to failures, the time interval between
 131 fault detection in step 3 and the corresponding recovery actions in steps 7 and 8
 132 shall be less than 1 second.
 133
 134 .. _figure7:
 135
 136 .. figure:: images/figure7.png
 137    :width: 100%
 138
 139    Fault management work flow
 140
 141
 142 .. _figure8:
 143
 144 .. figure:: images/figure8.png
 145    :width: 100%
 146
 147    Fault management scenario
 148
 149 :num:`Figure #figure8` shows a more detailed message flow (Steps 4 to 6) between
 150 the 4 building blocks introduced in :ref:`impl_fb`.
 151
 152 4. The Monitor observes a fault in the NFVI and reports the raw fault to the
 153    Inspector.
 154    The Inspector filters and aggregates the faults using pre-configured
 155    failure policies.
 156
 157 5.
 158    a) The Inspector queries the Resource Map to find the virtual resources
 159    affected by the raw fault in the NFVI.
 160    b) The Inspector updates the state of the affected virtual resources in the
 161    Resource Map.
 162    c) The Controller observes a change of the virtual resource state and informs
 163    the Notifier about the state change and the related alarm(s).
 164    Alternatively, the Inspector may directly inform the Notifier about it.
 165
 166 6. The Notifier performs further filtering and aggregation of the changes
 167    and alarms based on the pre-configured alarm configuration. Finally, a fault
 168    notification is sent northbound to the Consumer.
 169
 170 NFVI Maintenance
 171 ^^^^^^^^^^^^^^^^
 172
 173 The detailed work flow for NFVI maintenance is shown in :num:`Figure #figure9`
 174 and has the following steps. Note that steps 1, 2, and 5 to 8a in the NFVI
 175 maintenance work flow are very similar to the steps in the fault management work
 176 flow and share a similar implementation plan in Release 1.
 177
 178 1. Subscribe to fault/maintenance notifications.
 179 2. Response to subscribe request.
 180 3. Maintenance trigger received from administrator.
 181 4. VIM switches NFVI resources to "maintenance" state. This means, e.g., that
 182    they should not be used for further allocation/migration requests.
 183 5. Database lookup to find the virtual resources affected by the detected
 184    maintenance operation.
 185 6. Maintenance notification to Consumer.
 186 7. The Consumer switches to standby configuration (STBY).
 187 8. Instructions from Consumer to VIM requesting certain recovery actions to be
 188    performed (step 7a). After reception of such instructions, the VIM
 189    executes the requested action in order to empty the physical resources (step
 190    7b).
 191 9. Maintenance response from VIM to inform the Administrator that the physical
 192    machines have been emptied (or the operation resulted in an error state).
 193 10. The Administrator coordinates and executes the maintenance operation/work
 194     on the NFVI.
 195
 196     A) Query request from Administrator to VIM to get information about the
 197     current state of a resource.
 198     B) Response to the query request with information about the current state of
 199     the queried resource(s). In case the resource is in "maintenance" state,
 200     information about the related maintenance operation is returned.
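
Steps 4, 5 and 9 of the work flow above can be illustrated with three small helpers. The function names and the dictionary-based bookkeeping are assumptions made for this sketch; a real VIM would perform these operations through its controllers.

```python
def start_maintenance(host_state: dict[str, str],
                      hosted: dict[str, list[str]],
                      host_id: str) -> list[str]:
    """Steps 4 and 5: flag the host and look up affected virtual resources."""
    host_state[host_id] = "maintenance"
    return list(hosted.get(host_id, []))


def allocatable_hosts(host_state: dict[str, str]) -> list[str]:
    """Hosts that may still receive allocation or migration requests."""
    return sorted(h for h, state in host_state.items() if state == "normal")


def is_emptied(hosted: dict[str, list[str]], host_id: str) -> bool:
    """Step 9 precondition: no virtual resource remains on the host."""
    return not hosted.get(host_id)
```

Once ``is_emptied`` returns ``True`` for every host under maintenance, the VIM could send the maintenance response of step 9 to the Administrator.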
 201
 202 .. _figure9:
 203
 204 .. figure:: images/figure9.png
 205    :width: 100%
 206
 207    NFVI maintenance work flow
 208
 209
 210 .. _figure10:
 211
 212 .. figure:: images/figure10.png
 213    :width: 100%
 214
 215    NFVI Maintenance implementation plan
 216
 217 :num:`Figure #figure10` shows a more detailed message flow (Steps 4 to 6)
 218 between the 4 building blocks introduced in Section 5.1.
 219
 220 3. The Administrator sends a StateChange request to the Controller residing
 221    in the VIM.
 222 4. The Controller queries the Resource Map to find the virtual resources
 223    affected by the planned maintenance operation.
 224 5.
 225
 226   a) The Controller updates the state of the affected virtual resources in the
 227   Resource Map database.
 228
 229   b) The Controller informs the Notifier about the virtual resources that will
 230   be affected by the maintenance operation.
 231
 232 6. A maintenance notification is sent northbound to the Consumer.
 233
 234 ...
 235
 236 9. The Controller informs the Administrator after the physical resources have
 237    been freed.
 238
 239
 240
 241 Implementation plan for OPNFV Release 1
 242 ---------------------------------------
 243
 244 Fault management
 245 ^^^^^^^^^^^^^^^^
 246
 247 :num:`Figure #figure11` shows the implementation plan based on OpenStack and
 248 related components as planned for Release 1. Hereby, the Monitor can be realized
 249 by Zabbix. The Controller is realized by OpenStack Nova [NOVA]_, Neutron
 250 [NEUT]_, and Cinder [CIND]_ for compute, network, and storage,
 251 respectively. The Inspector can be realized by Monasca [MONA]_ or a simple
 252 script querying Nova in order to map between physical and virtual resources. The
 253 Notifier will be realized by Ceilometer [CEIL]_ receiving failure events
 254 on its notification bus.
 255
 256 :num:`Figure #figure12` shows the inner workings of Ceilometer. After receiving
 257 an "event" on its notification bus, first a notification agent will grab the
 258 event and send a "notification" to the Collector. The collector writes the
 259 notifications received to the Ceilometer databases.
 260
 261 In the existing Ceilometer implementation, an alarm evaluator is periodically
 262 polling those databases through the APIs provided. If it finds new alarms, it
 263 will evaluate them based on the pre-defined alarm configuration, and depending
 264 on the configuration, it will hand a message to the Alarm Notifier, which in
 265 turn will send the alarm message northbound to the Consumer. :num:`Figure
 266 #figure12` also shows an optimized work flow for Ceilometer with the goal to
 267 reduce the delay for fault notifications to the Consumer. The approach is to
 268 implement a new notification agent (called "publisher" in Ceilometer
 269 terminology) which sends the alarm directly through the "Notification Bus"
 270 to a new "Notification-driven Alarm Evaluator (NAE)" (see Sections 5.6.2 and
 271 5.6.3), thereby bypassing the Collector and avoiding the additional delay of the
 272 existing polling-based alarm evaluator. The NAE is similar to the OpenStack
 273 "Alarm Evaluator", but is triggered by incoming notifications instead of
 274 periodically polling the OpenStack "Alarms" database for new alarms. The
 275 Ceilometer "Alarms" database can hold three states: "normal", "insufficient
 276 data", and "fired". It represents a persistent alarm database. In order to
 277 realize the Doctor requirements, we need to define new "meters" in the database
 278 (see Section 5.6.1).
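
The difference between the polling-based evaluator and the proposed NAE can be made concrete with a toy model. The classes below are hypothetical and do not use any Ceilometer code; the "Alarms" database is replaced by an in-memory list and the northbound delivery by a callback.

```python
from typing import Callable


class AlarmStore:
    """Hypothetical in-memory stand-in for the Ceilometer "Alarms" database."""

    def __init__(self) -> None:
        self.pending: list[str] = []


class PollingEvaluator:
    """Existing approach: alarms are only seen at the next polling tick."""

    def __init__(self, store: AlarmStore, notify: Callable[[str], None]) -> None:
        self.store = store
        self.notify = notify

    def tick(self) -> None:
        # Take everything that accumulated since the previous tick.
        alarms, self.store.pending = self.store.pending, []
        for alarm in alarms:
            self.notify(alarm)


class NotificationDrivenEvaluator:
    """NAE sketch: evaluation is triggered by each incoming notification."""

    def __init__(self, notify: Callable[[str], None]) -> None:
        self.notify = notify

    def on_notification(self, alarm: str) -> None:
        # No database round trip: the Consumer is notified immediately.
        self.notify(alarm)
```

In the polling variant, the worst-case delay is one full polling interval; in the notification-driven variant, the delay is bounded only by message delivery and evaluation time.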
 279
 280 .. _figure11:
 281
 282 .. figure:: images/figure11.png
 283    :width: 100%
 284
 285    Implementation plan in OpenStack (OPNFV Release 1 "Arno")
 286
 287
 288 .. _figure12:
 289
 290 .. figure:: images/figure12.png
 291    :width: 100%
 292
 293    Implementation plan in Ceilometer architecture
 294
 295
 296 NFVI Maintenance
 297 ^^^^^^^^^^^^^^^^
 298
 299 For NFVI Maintenance, a quite similar implementation plan exists. Instead of a
 300 raw fault being observed by the Monitor, the Administrator is sending a
 301 Maintenance Request through the northbound interface towards the Controller
 302 residing in the VIM. Similar to the Fault Management use case, the Controller
 303 (in our case OpenStack Nova) will send a maintenance event to the Notifier (i.e.
 304 Ceilometer in our implementation). Within Ceilometer, the same workflow as
 305 described in the previous section applies. In addition, the Controller(s) will
 306 take appropriate actions to evacuate the physical machines in order to prepare
 307 them for the planned maintenance operation. After the physical machines are
 308 emptied, the Controller will inform the Administrator that it can initiate the
 309 maintenance.
 310
 311 Information elements
 312 --------------------
 313
 314 This section introduces all attributes and information elements used in the
 315 messages exchanged on the northbound interfaces between the VIM and the NFVO and
 316 VNFM.
 317
 318 Note: The information elements will be aligned with current work in ETSI NFV IFA
 319 working group.
 320
 321
 322 Simple information elements:
 323
 324 * SubscriptionID: identifies a subscription to receive fault or maintenance
 325   notifications.
 326 * NotificationID: identifies a fault or maintenance notification.
 327 * VirtualResourceID (Identifier): identifies a virtual resource affected by a
 328   fault or a maintenance action of the underlying physical resource.
 329 * PhysicalResourceID (Identifier): identifies a physical resource affected by a
 330   fault or maintenance action.
 331 * VirtualResourceState (String): state of a virtual resource, e.g. "normal",
 332   "maintenance", "down", "error".
 333 * PhysicalResourceState (String): state of a physical resource, e.g. "normal",
 334   "maintenance", "down", "error".
 335 * VirtualResourceType (String): type of the virtual resource, e.g. "virtual
 336   machine", "virtual memory", "virtual storage", "virtual CPU", or "virtual
 337   NIC".
 338 * FaultID (Identifier): identifies the related fault in the underlying physical
 339   resource. This can be used to correlate different fault notifications caused
 340   by the same fault in the physical resource.
 341 * FaultType (String): Type of the fault. The allowed values for this parameter
 342   depend on the type of the related physical resource. For example, a resource
 343   of type "compute hardware" may have faults of type "CPU failure", "memory
 344   failure", "network card failure", etc.
 345 * Severity (Integer): value expressing the severity of the fault. The higher the
 346   value, the more severe the fault.
 347 * MinSeverity (Integer): value used in filter information elements. Only faults
 348   with a severity higher than the MinSeverity value will be notified to the
 349   Consumer.
 350 * EventTime (Datetime): Time when the fault was observed.
 351 * EventStartTime and EventEndTime (Datetime): Datetime range that can be used in
 352   a FaultQueryFilter to narrow down the faults to be queried.
 353 * ProbableCause: information about the probable cause of the fault.
 354 * CorrelatedFaultID (Identifier): list of other faults correlated to this fault.
 355 * isRootCause (Boolean): Parameter indicating if this fault is the root cause of
 356   other correlated faults. If TRUE, then the faults listed in the parameter
 357   CorrelatedFaultID are caused by this fault.
 358 * FaultDetails (Key-value pair): provides additional information about the
 359   fault, e.g. information about the threshold, monitored attributes, indication
 360   of the trend of the monitored parameter.
 361 * FirmwareVersion (String): current version of the firmware of a physical
 362   resource.
 363 * HypervisorVersion (String): current version of a hypervisor.
 364 * ZoneID (Identifier): Identifier of the resource zone. A resource zone is the
 365   logical separation of physical and software resources in an NFVI deployment
 366   for physical isolation, redundancy, or administrative designation.
 367 * Metadata (Key-value pair): provides additional information about a physical
 368   resource in maintenance/error state.
 369
 370 Complex information elements (see also UML diagrams in :num:`Figure #figure13`
 371 and :num:`Figure #figure14`):
 372
 373 * VirtualResourceInfoClass:
 374
 375   - VirtualResourceID [1] (Identifier)
 376   - VirtualResourceState [1] (String)
 377   - Faults [0..*] (FaultClass): For each resource, all faults
 378     including detailed information about the faults are provided.
 379
 380 * FaultClass: The parameters of the FaultClass are partially based on ETSI TS
 381   132 111-2 (V12.1.0) [*]_, which is specifying fault management in 3GPP, in
 382   particular describing the information elements used for alarm notifications.
 383
 384   - FaultID [1] (Identifier)
 385   - FaultType [1]
 386   - Severity [1] (Integer)
 387   - EventTime [1] (Datetime)
 388   - ProbableCause [1]
 389   - CorrelatedFaultID [0..*] (Identifier)
 390   - FaultDetails [0..*] (Key-value pair)
 391
 392 .. [*] http://www.etsi.org/deliver/etsi_ts/132100_132199/13211102/12.01.00_60/ts_13211102v120100p.pdf
 393
 394 * SubscribeFilterClass
 395
 396   - VirtualResourceType [0..*] (String)
 397   - VirtualResourceID [0..*] (Identifier)
 398   - FaultType [0..*] (String)
 399   - MinSeverity [0..1] (Integer)
 400
 401 * FaultQueryFilterClass: narrows down the FaultQueryRequest, for example it
 402   limits the query to certain physical resources, a certain zone, a given fault
 403   type/severity/cause, or a specific FaultID.
 404
 405   - VirtualResourceType [0..*] (String)
 406   - VirtualResourceID [0..*] (Identifier)
 407   - FaultType [0..*] (String)
 408   - MinSeverity [0..1] (Integer)
 409   - EventStartTime [0..1] (Datetime)
 410   - EventEndTime [0..1] (Datetime)
 411
 412 * PhysicalResourceStateClass:
 413
 414   - PhysicalResourceID [1] (Identifier)
 415   - PhysicalResourceState [1] (String): mandates the new state of the physical
 416     resource.
 417
 418 * PhysicalResourceInfoClass:
 419
 420   - PhysicalResourceID [1] (Identifier)
 421   - PhysicalResourceState [1] (String)
 422   - FirmwareVersion [0..1] (String)
 423   - HypervisorVersion [0..1] (String)
 424   - ZoneID [0..1] (Identifier)
 425
 426 * StateQueryFilterClass: narrows down a StateQueryRequest, for example it limits
 427   the query to certain physical resources, a certain zone, or a given resource
 428   state (e.g., only resources in "maintenance" state).
 429
 430   - PhysicalResourceID [1] (Identifier)
 431   - PhysicalResourceState [1] (String)
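
To make the cardinalities above tangible, some of the complex information elements can be written down as Python dataclasses. This is only one possible mapping: ``[1]`` becomes a required field, ``[0..1]`` an ``Optional`` field, and ``[0..*]`` a list (or a dictionary for key-value pairs). The snake_case attribute names and the ``matches`` helper are assumptions; in particular, treating an empty list as "no restriction" is an interpretation, not part of the specification.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional


@dataclass
class Fault:  # FaultClass
    fault_id: str
    fault_type: str
    severity: int
    event_time: datetime
    probable_cause: str
    correlated_fault_ids: list[str] = field(default_factory=list)
    fault_details: dict[str, str] = field(default_factory=dict)


@dataclass
class VirtualResourceInfo:  # VirtualResourceInfoClass
    virtual_resource_id: str
    virtual_resource_state: str
    faults: list[Fault] = field(default_factory=list)


@dataclass
class SubscribeFilter:  # SubscribeFilterClass
    virtual_resource_types: list[str] = field(default_factory=list)
    virtual_resource_ids: list[str] = field(default_factory=list)
    fault_types: list[str] = field(default_factory=list)
    min_severity: Optional[int] = None

    def matches(self, resource_id: str, fault: Fault) -> bool:
        """Check resource ID, fault type and severity (resource type is not
        checked here because the fault itself does not carry it)."""
        if self.virtual_resource_ids and resource_id not in self.virtual_resource_ids:
            return False
        if self.fault_types and fault.fault_type not in self.fault_types:
            return False
        # Only faults with a severity higher than MinSeverity are notified.
        if self.min_severity is not None and fault.severity <= self.min_severity:
            return False
        return True
```

Note that the MinSeverity comparison is strict, following the definition of MinSeverity given in the list of simple information elements.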
 432   - ZoneID [0..1] (Identifier)
 433
 434 .. _impl_nbi:
 435
 436 Detailed northbound interface specification
 437 -------------------------------------------
 438
 439 This section specifies the northbound interfaces for fault management and
 440 NFVI maintenance between the VIM on the one end and the Consumer and the
 441 Administrator on the other end. For each interface, all messages and related
 442 information elements are provided.
 443
 444 Note: The interface definition will be aligned with current work in ETSI NFV IFA
 445 working group.
 446
 447 All of the interfaces described below are produced by the VIM and consumed by
 448 the Consumer or Administrator.
 449
 450 Fault management interface
 451 ^^^^^^^^^^^^^^^^^^^^^^^^^^
 452
 453 This interface allows the VIM to notify the Consumer about a virtual resource
 454 that is affected by a fault, either within the virtual resource itself or by the
 455 underlying virtualization infrastructure. The messages on this interface are
 456 shown in :num:`Figure #figure13` and explained in detail in the following
 457 subsections.
 458
 459 Note: The information elements used in this section are described in detail in
 460 Section 5.4.
 461
 462 .. _figure13:
 463
 464 .. figure:: images/figure13.png
 465    :width: 100%
 466
 467    Fault management NB I/F messages
 468
 469
 470 SubscribeRequest (Consumer -> VIM)
 471 __________________________________
 472
 473 Subscription from Consumer to VIM to be notified about faults of specific
 474 resources. The faults to be notified about can be narrowed down using a
 475 subscribe filter.
 476
 477 Parameters:
 478
 479 - SubscribeFilter [1] (SubscribeFilterClass): Optional information to narrow
 480   down the faults that shall be notified to the Consumer, for example limit to
 481   specific VirtualResourceID(s), severity, or cause of the alarm.
 482
 483 SubscribeResponse (VIM -> Consumer)
 484 ___________________________________
 485
 486 Response to a subscribe request message including information about the
 487 subscribed resources, in particular if they are in "fault/error" state.
 488
 489 Parameters:
 490
 491 * SubscriptionID [1] (Identifier): Unique identifier for the subscription. It
 492   can be used to delete or update the subscription.
 493 * VirtualResourceInfo [0..*] (VirtualResourceInfoClass): Provides additional
 494   information about the subscribed resources, i.e., a list of the related
 495   resources, the current state of the resources, etc.
 496
 497 FaultNotification (VIM -> Consumer)
 498 ___________________________________
 499
 500 Notification about a virtual resource that is affected by a fault, either within
 501 the virtual resource itself or by the underlying virtualization infrastructure.
 502 After reception of this notification, the Consumer will decide on the optimal
 503 action to resolve the fault. This includes actions like switching to a hot
 504 standby virtual resource, migration of the faulty virtual resource to another
 505 physical machine, termination of the faulty virtual resource and instantiation
 506 of a new virtual resource in order to provide a new hot standby resource.
 507 Existing resource management interfaces and messages between the Consumer and
 508 the VIM can be used for those actions, and there is no need to define additional
 509 actions on the Fault Management Interface.
 510
 511 Parameters:
 512
 513 * NotificationID [1] (Identifier): Unique identifier for the notification.
 514 * VirtualResourceInfo [1..*] (VirtualResourceInfoClass): List of faulty
 515   resources with detailed information about the faults.
 516
 517 FaultQueryRequest (Consumer -> VIM)
 518 ___________________________________
 519
 520 Request to find out about active alarms at the VIM. A FaultQueryFilter can be
 521 used to narrow down the alarms returned in the response message.
 522
 523 Parameters:
 524
 525 * FaultQueryFilter [1] (FaultQueryFilterClass): narrows down the
 526   FaultQueryRequest, for example it limits the query to certain physical
 527   resources, a certain zone, a given fault type/severity/cause, or a specific
 528   FaultID.
 529
 530 FaultQueryResponse (VIM -> Consumer)
 531 ____________________________________
 532
 533 List of active alarms at the VIM matching the FaultQueryFilter specified in the
 534 FaultQueryRequest.
 535
 536 Parameters:
 537
 538 * VirtualResourceInfo [0..*] (VirtualResourceInfoClass): List of faulty
 539   resources. For each resource all faults including detailed information about
 540   the faults are provided.
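
The FaultQueryRequest/FaultQueryResponse exchange amounts to applying the optional fields of a FaultQueryFilter to the set of active faults. The function below is a hedged sketch: the name ``query_faults``, the dictionary representation of a fault, and the inclusive treatment of EventStartTime and EventEndTime are assumptions that the specification does not prescribe.

```python
from datetime import datetime
from typing import Optional


def query_faults(faults: list[dict],
                 event_start_time: Optional[datetime] = None,
                 event_end_time: Optional[datetime] = None,
                 fault_types: Optional[list[str]] = None) -> list[dict]:
    """Return the faults that satisfy every filter field that was supplied."""
    result = []
    for fault in faults:
        # Time bounds are treated as inclusive (an assumption of this sketch).
        if event_start_time is not None and fault["EventTime"] < event_start_time:
            continue
        if event_end_time is not None and fault["EventTime"] > event_end_time:
            continue
        if fault_types and fault["FaultType"] not in fault_types:
            continue
        result.append(fault)
    return result
```

Calling the function without any filter argument returns all active faults, which corresponds to a FaultQueryFilter with all optional fields left out.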
 541
 542 NFVI maintenance
 543 ^^^^^^^^^^^^^^^^
 544
 545 The NFVI maintenance interface Consumer-VIM allows the Consumer to subscribe to
 546 maintenance notifications provided by the VIM. The related maintenance interface
 547 Administrator-VIM allows the Administrator to issue maintenance requests to the
 548 VIM, i.e. requesting the VIM to take appropriate actions to empty physical
 549 machine(s) in order to execute maintenance operations on them. The interface
 550 also allows the Administrator to query the state of physical machines, e.g., in
 551 order to get details on the current status of a maintenance operation such as a
 552 firmware update.
 553
 554 The messages defined in these northbound interfaces are shown in :num:`Figure
 555 #figure14` and described in detail in the following subsections.
 556
 557 .. _figure14:
 558
 559 .. figure:: images/figure14.png
 560    :width: 100%
 561
 562    NFVI maintenance NB I/F messages
 563
 564 SubscribeRequest (Consumer -> VIM)
 565 __________________________________
 566
 567 Subscription from Consumer to VIM to be notified about maintenance operations
 568 for specific virtual resources. The resources to be informed about can be
 569 narrowed down using a subscribe filter.
 570
 571 Parameters:
 572
 573 * SubscribeFilter [1] (SubscribeFilterClass): Information to narrow down the
 574   faults that shall be notified to the Consumer, for example limit to specific
 575   virtual resource type(s).
 576
 577 SubscribeResponse (VIM -> Consumer)
 578 ___________________________________
 579
 580 Response to a subscribe request message, including information about the
 581 subscribed virtual resources, in particular if they are in "maintenance" state.
 582
 583 Parameters:
 584
 585 * SubscriptionID [1] (Identifier): Unique identifier for the subscription. It
 586   can be used to delete or update the subscription.
 587 * VirtualResourceInfo [0..*] (VirtualResourceInfoClass): Provides additional
 588   information about the subscribed virtual resource(s), e.g., the ID, type and
 589   current state of the resource(s).
 590
 591 MaintenanceNotification (VIM -> Consumer)
 592 _________________________________________
 593
 594 Notification about a physical resource switched to "maintenance" state. After
 595 reception of this notification, the Consumer will decide on the optimal action
 596 to address it, e.g., to switch to the standby (STBY) configuration.
 597
 598 Parameters:
 599
 600 * VirtualResourceInfo [1..*] (VirtualResourceInfoClass): List of virtual
 601   resources where the state has been changed to maintenance.
 602
 603 StateChangeRequest (Administrator -> VIM)
 604 _________________________________________
 605
 606 Request to change the state of a list of physical resources, e.g. to
 607 "maintenance" state, in order to prepare them for a planned maintenance
 608 operation.
 609
 610 Parameters:
 611
 612 * PhysicalResourceState [1..*] (PhysicalResourceStateClass)
 613
 614 StateChangeResponse (VIM -> Administrator)
 615 __________________________________________
 616
 617 Response message to inform the Administrator that the requested resources are
 618 now in maintenance state (or the operation resulted in an error) and the
 619 maintenance operation(s) can be executed.
 620
 621 Parameters:
 622
 623 * PhysicalResourceInfo [1..*] (PhysicalResourceInfoClass)
 624
 625 StateQueryRequest (Administrator -> VIM)
 626 ________________________________________
 627
 628 In this procedure, the Administrator would like to get the information about
 629 physical machine(s), e.g. their state ("normal", "maintenance"), firmware
 630 version, hypervisor version, update status of firmware and hypervisor, etc. It
 631 can be used to check the progress during firmware update and the confirmation
 632 after update. A filter can be used to narrow down the resources returned in the
 633 response message.
 634
 635 Parameters:
 636
 637 * StateQueryFilter [1] (StateQueryFilterClass): narrows down the
 638   StateQueryRequest, for example it limits the query to certain physical
 639   resources, a certain zone, or a given resource state.
 640
 641 StateQueryResponse (VIM -> Administrator)
 642 _________________________________________
 643
 644 List of physical resources matching the filter specified in the
 645 StateQueryRequest.
 646
 647 Parameters:
 648
 649 * PhysicalResourceInfo [0..*] (PhysicalResourceInfoClass): List of physical
 650   resources. For each resource, information about the current state, the
 651   firmware version, etc. is provided.
 652
 653 Blueprints
 654 ----------
 655
 656 This section lists a first set of blueprints that have been proposed by the
 657 Doctor project to the open source community. Further blueprints addressing other
 658 gaps identified in Section 4 will be submitted at a later stage of the OPNFV
 659 project. In this section, the following definitions are used:
 660
 661 * "Event" is a message emitted by other OpenStack services such as Nova and
 662   Neutron and is consumed by the "Notification Agents" in Ceilometer.
 663 * "Notification" is a message generated by a "Notification Agent" in Ceilometer
 664   based on an "event" and is delivered to the "Collectors" in Ceilometer that
 665   store those notifications (as "samples") in the Ceilometer "Databases".
 666
 667 Instance State Notification  (Ceilometer) [*]_
 668 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 669
 670 The Doctor project is planning to handle "events" and "notifications" regarding
 671 Resource Status: Instance State, Port State, Host State, etc. Currently,
 672 Ceilometer already receives "events" to identify the state of those resources,
 673 but it does not handle and store them yet. This is why we also need a new event
 674 definition to capture those resource states from "events" created by other
 675 services.
 676
 677 This BP proposes to add a new compute notification state to handle events from
 678 an instance (server) from Nova. It also creates a new meter "instance.state" in
 679 OpenStack.
 680
 681 .. [*] https://etherpad.opnfv.org/p/doctor_bps
 682
 683 Event Publisher for Alarm  (Ceilometer) [*]_
 684 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 685
 686 **Problem statement:**
 687
 688   The existing "Alarm Evaluator" in OpenStack Ceilometer is periodically
 689   querying/polling the databases in order to check all alarms independently from
 690   other processes. This is adding additional delay to the fault notification
 691   send to the Consumer, whereas one requirement of Doctor is to react on faults
 692   as fast as possible.
 693
 694   The existing message flow is shown in :num:`Figure #figure12`: after receiving
 695   an "event", a "notification agent" (i.e. "event publisher") sends a
 696   "notification" to a "Collector". The "Collector" collects the notifications
 697   and updates the Ceilometer "Meter" database, which stores information about
 698   the "sample" that is captured from the original "event". The "Alarm
 699   Evaluator" periodically polls the databases, querying the "Meter" database
 700   based on each alarm configuration.
 701
 702   In the current Ceilometer implementation, there is no way to directly
 703   trigger the "Alarm Evaluator" when a new "event" is received. The "Alarm
 704   Evaluator" only finds out that it needs to fire a new notification to the
 705   Consumer when it polls the database.
 706
 707 **Change/feature request:**
 708
 709   This BP proposes to add a new "event publisher for alarm", which bypasses
 710   several steps in Ceilometer in order to avoid the polling-based approach of
 711   the existing Alarm Evaluator, which delays notifications to users.
 712
 713   After receiving an "(alarm) event" by listening on the Ceilometer message
 714   queue ("notification bus"), the new "event publisher for alarm" immediately
 715   hands a "notification" about this event to a new Ceilometer component
 716   "Notification-driven alarm evaluator" proposed in the other BP (see Section
 717   5.6.3).
 718
 719   Note, the term "publisher" refers to an entity in the Ceilometer architecture
 720   (it is a "notification agent"). It offers the capability to provide
 721   notifications to other services outside of Ceilometer, but it is also used to
 722   deliver notifications to other Ceilometer components (e.g. the "Collectors")
 723   via the Ceilometer "notification bus".
 724
 725 **Implementation detail:**
 726
 727   * "Event publisher for alarm" is part of Ceilometer.
 728   * The standard AMQP message queue is used with a new topic string.
 729   * No new interfaces have to be added to Ceilometer.
 730   * "Event publisher for alarm" can be configured by the Administrator of
 731     Ceilometer to be used as "Notification Agent" in addition to the existing
 732     "Notifier".
 733   * Existing alarm mechanisms of Ceilometer can be used, allowing users to
 734     configure how to distribute the "notifications" transformed from "events",
 735     e.g. there is an option whether an ongoing alarm is re-issued or not
 736     ("repeat_actions").
 737
 738 .. [*] https://etherpad.opnfv.org/p/doctor_bps
 739
 740 Notification-driven alarm evaluator (Ceilometer) [*]_
 741 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 742
 743 **Problem statement:**
 744
 745 The existing "Alarm Evaluator" in OpenStack Ceilometer periodically
 746 queries/polls the databases in order to check all alarms independently of
 747 other processes. This adds delay to the fault notification sent to the
 748 Consumer, whereas one requirement of Doctor is to react to faults as fast as
 749 possible.
 750
 751 **Change/feature request:**
 752
 753 This BP proposes to add an alternative "Notification-driven Alarm Evaluator"
 754 for Ceilometer that receives "notifications" sent by the "Event Publisher for
 755 Alarm" described in the other BP. Once this new "Notification-driven Alarm
 756 Evaluator" receives a "notification", it finds the "alarm" configurations that
 757 may relate to the "notification" by querying the "alarm" database with keys
 758 such as the resource ID. It then evaluates each alarm with the information in
 759 that "notification".
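
The lookup-and-evaluate step described above can be sketched as follows. This
is a minimal illustration, not Ceilometer code: the ``Alarm`` fields, the
in-memory dictionary standing in for the "alarm" database, and the
``NotificationDrivenEvaluator`` class with its ``on_notification`` method are
all hypothetical names chosen for this example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Alarm:
    # Hypothetical alarm configuration, not the Ceilometer data model.
    name: str
    resource_id: str
    field: str      # notification field to inspect, e.g. "state"
    expected: str   # value of that field that triggers the alarm


class NotificationDrivenEvaluator:
    """Evaluates alarms when a notification arrives instead of polling."""

    def __init__(self):
        # Stand-in for the "alarm" database, keyed by resource ID.
        self._alarms_by_resource = defaultdict(list)

    def register(self, alarm):
        self._alarms_by_resource[alarm.resource_id].append(alarm)

    def on_notification(self, notification):
        """Return the names of the alarms fired by this notification."""
        # Step 1: find alarm configurations related to the notification.
        candidates = self._alarms_by_resource.get(notification["resource_id"], [])
        # Step 2: evaluate each candidate with the notification content.
        return [
            alarm.name
            for alarm in candidates
            if notification.get(alarm.field) == alarm.expected
        ]


evaluator = NotificationDrivenEvaluator()
evaluator.register(Alarm("vm1-error", "vm-1", "state", "error"))
evaluator.register(Alarm("vm2-error", "vm-2", "state", "error"))

fired = evaluator.on_notification({"resource_id": "vm-1", "state": "error"})
print(fired)  # ['vm1-error']
```

Only the alarms registered for the affected resource are evaluated, so the
reaction time in this sketch does not depend on a polling interval.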
 760
 761 After the alarm evaluation, it behaves in the same way as the existing "alarm
 762 evaluator" when firing alarm notifications to the Consumer. Similar to the
 763 existing Alarm Evaluator, this new "Notification-driven Alarm Evaluator"
 764 aggregates and correlates different alarms, which are then provided northbound
 765 to the Consumer via the OpenStack "Alarm Notifier". The user/administrator can
 766 register the alarm configuration via the existing Ceilometer API [*]_ and can
 767 thereby configure whether to set an alarm and where to send the alarms.
 768
 769 **Implementation detail:**
 770
 771 * The new "Notification-driven Alarm Evaluator" is part of Ceilometer.
 772 * Most of the existing source code of the "Alarm Evaluator" can be re-used to
 773   implement this BP.
 774 * No additional application logic is needed.
 775 * It will access the Ceilometer Databases just like the existing "Alarm
 776   Evaluator".
 777 * Only the polling-based approach will be replaced by a listener for
 778   "notifications" provided by the "Event Publisher for Alarm" on the Ceilometer
 779   "notification bus".
 780 * No new interfaces have to be added to Ceilometer.
 781
 782
 783 .. [*] https://etherpad.opnfv.org/p/doctor_bps
 784 .. [*] https://wiki.openstack.org/wiki/Ceilometer/Alerting
 785
 786 Report host fault to update server state immediately (Nova) [*]_
 787 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 788
 789 **Problem statement:**
 790
 791 * Nova state change for a failed or unreachable host is slow and does not
 792   reliably state whether the host is down. This might cause the same server
 793   instance to run twice if action is taken to evacuate it to another host.
 794 * Nova state for server(s) on a failed host will not change, but remains active
 795   and running. This gives the user false information about the server state.
 796 * VIM northbound interface notification of host faults towards VNFM and NFVO
 797   should be in line with OpenStack state. This fault notification is a Telco
 798   requirement defined in ETSI that the OPNFV Doctor project will implement.
 799 * An OpenStack user cannot take HA actions fast and reliably by trusting server
 800   state and host state.
 801
 802 **Proposed change:**
 803
 804 A new API is needed for the Admin to state that a host is down. This API is
 805 used to mark services running on the host as down to reflect the real situation.
 806
 807 An example for a compute node:
 808
 809 * When the compute node is up and running::
 810
 811     vm_state: active power_state: running
 812     nova-compute state: up status: enabled
 813
 814 * When the compute node goes down and the new API is called to state this::
 815
 816     vm_state: stopped power_state: shutdown
 817     nova-compute state: down status: enabled
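
The state change in this example can be expressed as a small sketch. It is only
an illustration of the intended transition, not the proposed Nova API: the
``mark_host_down`` function and the dictionary layout are assumptions made for
this example, while the state values are taken from the listing above.

```python
def mark_host_down(host):
    """Return a copy of `host` with the states expected after a host-down report."""
    return {
        # The service is marked down; its enabled/disabled status is unchanged.
        "service": {**host["service"], "state": "down"},
        # Every server on the host is reported as stopped and shut down.
        "servers": [
            {**server, "vm_state": "stopped", "power_state": "shutdown"}
            for server in host["servers"]
        ],
    }


host = {
    "service": {"binary": "nova-compute", "state": "up", "status": "enabled"},
    "servers": [{"name": "vm-1", "vm_state": "active", "power_state": "running"}],
}

after = mark_host_down(host)
print(after["service"]["state"], after["service"]["status"])  # down enabled
print(after["servers"][0]["vm_state"])                         # stopped
```

The function returns a new structure instead of mutating its input, which keeps
the "before" and "after" states available for comparison.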
 818
 819 **Alternatives:**
 820
 821 There is no attractive alternative for detecting all the different host faults
 822 other than an external tool. For this kind of tool to exist, there needs to be
 823 a new API in Nova to report faults. Currently, workarounds have to be
 824 implemented because the states cannot be trusted or obtained from OpenStack
 825 fast enough.
 826
 827 .. [*] https://blueprints.launchpad.net/nova/+spec/update-server-state-immediately
 828
 829 Other related BPs
 830 ^^^^^^^^^^^^^^^^^
 831
 832 This section lists some BPs related to Doctor, but proposed by drafters outside
 833 the OPNFV community.
 834
 835 pacemaker-servicegroup-driver [*]_
 836 __________________________________
 837
 838 This BP will detect a down host and report it to OpenStack quite fast. However,
 839 this might not work properly, for example when the management network has a
 840 problem and the host is reported faulty while a VM is still running there. This
 841 might lead to launching the same VM instance twice, causing problems. Also, the
 842 NB IF message needs a fault reason, and for that the source needs to be a tool
 843 that detects different kinds of faults, as Doctor will do. This BP might also
 844 need enhancement to change server and service states correctly.
 845
 846 .. [*] https://blueprints.launchpad.net/nova/+spec/pacemaker-servicegroup-driver
 847
 848 ..
 849  vim: set tabstop=4 expandtab textwidth=80: