docs/requirements/05-implementation.rst

   1 Detailed architecture and interface specification
   2 =================================================
   3
   4 This section describes a detailed implementation plan, which is based on the
   5 high level architecture introduced in Section 3. Section 5.1 describes the
   6 functional blocks of the Doctor architecture, which is followed by a high level
   7 message flow in Section 5.2. Section 5.3 provides a mapping of selected existing
   8 open source components to the building blocks of the Doctor architecture.
   9 Thereby, the selection of components is based on their maturity and the gap
  10 analysis executed in Section 4. Sections 5.4 and 5.5 detail the specification of
  11 the related northbound interface and the related information elements. Finally,
  12 Section 5.6 provides a first set of blueprints to address selected gaps required
  13 for the realization of the functionalities of the Doctor project.
  14
  15 .. _impl_fb:
  16
  17 Functional Blocks
  18 -----------------
  19
  20 This section introduces the functional blocks that form the VIM. OpenStack was
  21 selected as the candidate for implementation. Inside the VIM, four different
  22 building blocks are defined (see :numref:`figure6`).
  23
  24 .. figure:: images/figure6.png
  25    :name: figure6
  26    :width: 100%
  27
  28    Functional blocks
  29
  30 Monitor
  31 ^^^^^^^
  32
  33 The Monitor module has the responsibility for monitoring the virtualized
  34 infrastructure. There are already many existing tools and services (e.g. Zabbix)
  35 to monitor different aspects of hardware and software resources which can be
  36 used for this purpose.
  37
  38 Inspector
  39 ^^^^^^^^^
  40
  41 The Inspector module has the ability a) to receive various failure notifications
  42 regarding physical resource(s) from Monitor module(s), b) to find the affected
  43 virtual resource(s) by querying the resource map in the Controller, and c) to
  44 update the state of the virtual resource (and physical resource).
  45
  46 The Inspector has drivers for different types of events and resources to
  47 integrate any type of Monitor and Controller modules. It also uses a failure
  48 policy database to decide on the failure selection and aggregation from raw
  49 events. This failure policy database is configured by the Administrator.
  50
  51 The reason for separation of the Inspector and Controller modules is to make the
  52 Controller focus on simple operations by avoiding a tight integration of various
  53 health check mechanisms into the Controller.
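
The Inspector behaviour described above can be sketched as follows. This is a
minimal illustration only, not the Doctor implementation: the ``RawEvent``
shape, the ``StubController`` stand-in, and the two policy fields are
hypothetical names introduced for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class RawEvent:
    """Raw failure notification from a Monitor (hypothetical shape)."""
    physical_resource_id: str
    fault_type: str
    severity: int


class StubController:
    """Stand-in for the Controller, which owns the resource map and states."""

    def __init__(self, resource_map: Dict[str, List[str]]):
        # physical resource ID -> virtual resource IDs hosted on it
        self.resource_map = resource_map
        self.states: Dict[str, str] = {}

    def find_virtual_resources(self, physical_resource_id: str) -> List[str]:
        return list(self.resource_map.get(physical_resource_id, []))

    def update_state(self, resource_id: str, state: str) -> None:
        self.states[resource_id] = state


@dataclass
class Inspector:
    controller: StubController
    # Hypothetical failure policy: which fault types count as failures, and
    # the lowest severity that is acted upon.
    policy_fault_types: Set[str] = field(default_factory=set)
    policy_min_severity: int = 0

    def handle(self, event: RawEvent) -> List[str]:
        """Return the virtual resources marked as affected by this event."""
        # Failure selection based on the configured policy.
        if event.fault_type not in self.policy_fault_types:
            return []
        if event.severity < self.policy_min_severity:
            return []
        # b) Find the affected virtual resources via the resource map.
        affected = self.controller.find_virtual_resources(
            event.physical_resource_id)
        # c) Update the state of the physical and virtual resources.
        self.controller.update_state(event.physical_resource_id, "down")
        for virtual_resource_id in affected:
            self.controller.update_state(virtual_resource_id, "error")
        return affected
```

Keeping the policy check inside the Inspector means the Controller only has to
answer lookups and accept state updates, which matches the separation of
concerns argued for above.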
  54
  55 Controller
  56 ^^^^^^^^^^
  57
  58 The Controller is responsible for maintaining the resource map (i.e. the mapping
  59 from physical resources to virtual resources), accepting update requests for the
  60 resource state(s) (exposed as a provider API), and sending all failure events
  61 regarding virtual resources to the Notifier. Optionally, the Controller has the
  62 ability to force the state of a given physical resource to down in the resource
  63 mapping when it receives failure notifications from the Inspector for that
  64 given physical resource.
  65 The Controller also re-calculates the capacity of the NFVI when receiving a
  66 failure notification for a physical resource.
  67
  68 In a real-world deployment, the VIM may have several controllers, one for each
  69 resource type, such as Nova, Neutron and Cinder in OpenStack. Each controller
  70 maintains a database of virtual and physical resources which shall be the master
  71 source for resource information inside the VIM.
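
As a rough sketch of the Controller responsibilities listed above, the
following example keeps a resource map, forces a physical resource to "down",
emits one event per affected virtual resource, and re-calculates the remaining
capacity. The class, the ``vcpus`` capacity measure, and the event dictionary
layout are assumptions made for illustration; they are not Nova, Neutron, or
Cinder APIs.

```python
from typing import Dict, List


class Controller:
    """Toy Controller holding a resource map (hypothetical structure)."""

    def __init__(self, hosts: Dict[str, dict]):
        # hosts: physical resource ID -> {"vcpus": int, "vms": [virtual IDs]}
        self.hosts = hosts
        self.host_state = {host_id: "normal" for host_id in hosts}
        # A plain list stands in for the channel towards the Notifier.
        self.events_for_notifier: List[dict] = []

    def capacity(self) -> int:
        """Sum the vCPUs of all physical resources still in "normal" state."""
        return sum(host["vcpus"] for host_id, host in self.hosts.items()
                   if self.host_state[host_id] == "normal")

    def force_down(self, host_id: str) -> int:
        """Mark a physical resource as down and report the new capacity."""
        self.host_state[host_id] = "down"
        # Emit every failure event without filtering; the Notifier decides
        # what is forwarded to the Consumer.
        for vm_id in self.hosts[host_id]["vms"]:
            self.events_for_notifier.append({
                "VirtualResourceID": vm_id,
                "VirtualResourceState": "error",
                "PhysicalResourceID": host_id,
            })
        return self.capacity()
```

In a deployment with several controllers, each one would hold such a map for
its own resource type.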
  72
  73 Notifier
  74 ^^^^^^^^
  75
  76 The focus of the Notifier is on selecting and aggregating failure events
  77 received from the Controller based on policies mandated by the Consumer.
  78 Therefore, it allows the Consumer to subscribe to alarms regarding virtual
  79 resources using a method such as an API endpoint. After receiving a fault
  80 event from a Controller, it notifies the Consumer of the fault according to
  81 the alarm configuration that the Consumer defined earlier.
  82
  83 To reduce complexity of the Controller, it is a good approach for the
  84 Controllers to emit all notifications without any filtering mechanism and have
  85 another service (i.e. Notifier) handle those notifications properly. This is the
  86 general philosophy of notifications in OpenStack. Note that a fault message
  87 consumed by the Notifier is different from the fault message received by the
  88 Inspector; the former message is related to virtual resources which are visible
  89 to users with relevant ownership, whereas the latter is related to raw devices
  90 or small entities which should be handled with an administrator privilege.
  91
  92 The northbound interface between the Notifier and the Consumer/Administrator is
  93 specified in :ref:`impl_nbi`.
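
The subscribe-then-notify behaviour of the Notifier can be illustrated with a
small in-process sketch. The class and method names are hypothetical, and a
Python callback stands in for whatever delivery method (such as an API
endpoint) a real Notifier would use.

```python
import itertools
from typing import Callable, Dict, Iterable, Optional, Set, Tuple


class Notifier:
    """Toy Notifier: stores alarm configurations and dispatches events."""

    def __init__(self) -> None:
        self._subscriptions: Dict[
            str, Tuple[Callable[[dict], None], Optional[Set[str]]]] = {}
        self._ids = itertools.count(1)

    def subscribe(self, callback: Callable[[dict], None],
                  resource_ids: Optional[Iterable[str]] = None) -> str:
        """Register a Consumer callback and return a subscription ID."""
        subscription_id = f"sub-{next(self._ids)}"
        wanted = set(resource_ids) if resource_ids else None
        self._subscriptions[subscription_id] = (callback, wanted)
        return subscription_id

    def unsubscribe(self, subscription_id: str) -> None:
        self._subscriptions.pop(subscription_id, None)

    def on_event(self, event: dict) -> int:
        """Forward a Controller event to matching Consumers; return count."""
        delivered = 0
        for callback, wanted in self._subscriptions.values():
            # No resource list means the Consumer wants every event.
            if wanted is None or event["VirtualResourceID"] in wanted:
                callback(event)
                delivered += 1
        return delivered
```

Because the Controller emits everything unfiltered, all selection logic lives
in ``on_event``, in line with the notification philosophy described above.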
  94
  95 Sequence
  96 --------
  97
  98 Fault Management
  99 ^^^^^^^^^^^^^^^^
 100
 101 The detailed work flow for fault management is as follows (see also :numref:`figure7`):
 102
 103 1. Request to subscribe to monitor specific virtual resources. A query filter
 104    can be used to narrow down the alarms the Consumer wants to be informed
 105    about.
 106 2. Each subscription request is acknowledged with a subscribe response message.
 107    The response message contains information about the subscribed virtual
 108    resources, in particular if a subscribed virtual resource is in "alarm"
 109    state.
 110 3. The NFVI sends monitoring events for resources the VIM has been subscribed
 111    to. Note: this subscription message exchange between the VIM and NFVI is not
 112    shown in this message flow.
 113 4. Event correlation, fault detection and aggregation in VIM.
 114 5. Database lookup to find the virtual resources affected by the detected fault.
 115 6. Fault notification to Consumer.
 116 7. The Consumer switches to standby configuration (STBY).
 117 8. Instructions to VIM requesting certain actions to be performed on the
 118    affected resources, for example migrate/update/terminate specific
 119    resource(s). After reception of such instructions, the VIM executes the
 120    requested action, e.g. it migrates or terminates a virtual resource.
 121
 122    a. Query request from Consumer to VIM to get information about the current
 123       status of a resource.
 124    b. Response to the query request with information about the current status of
 125       the queried resource. In case the resource is in "fault" state, information
 126       about the related fault(s) is returned.
 127
 128 In order to allow for quick reaction to failures, the time interval between
 129 fault detection in step 3 and the corresponding recovery actions in steps 7 and 8
 130 shall be less than 1 second.
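
One way to check the timing requirement in a test environment is to compare
timestamps taken at fault detection and at the start of recovery. The sketch
below is an assumption about how such a check could look; the function name
and the idea of collecting the two timestamps are not part of the
specification.

```python
from datetime import datetime, timedelta

# Requirement from the text: less than 1 second between fault detection
# and the corresponding recovery action.
RECOVERY_BUDGET = timedelta(seconds=1)


def within_budget(detected_at: datetime, recovered_at: datetime,
                  budget: timedelta = RECOVERY_BUDGET) -> bool:
    """Return True if recovery started strictly within the budget."""
    elapsed = recovered_at - detected_at
    if elapsed < timedelta(0):
        raise ValueError("recovery timestamp precedes detection timestamp")
    return elapsed < budget
```

Both timestamps need to come from the same or synchronized clocks for the
comparison to be meaningful.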
 131
 132 .. figure:: images/figure7.png
 133    :name: figure7
 134    :width: 100%
 135
 136    Fault management work flow
 137
 138 .. figure:: images/figure8.png
 139    :name: figure8
 140    :width: 100%
 141
 142    Fault management scenario
 143
 144 :numref:`figure8` shows a more detailed message flow (Steps 4 to 6) between
 145 the four building blocks introduced in :ref:`impl_fb`.
 146
 147 4. The Monitor observes a fault in the NFVI and reports the raw fault to the
 148    Inspector.
 149    The Inspector filters and aggregates the faults using pre-configured
 150    failure policies.
 151
 152 5.
 153    a) The Inspector queries the Resource Map to find the virtual resources
 154       affected by the raw fault in the NFVI.
 155    b) The Inspector updates the state of the affected virtual resources in the
 156       Resource Map.
 157    c) The Controller observes a change of the virtual resource state and informs
 158       the Notifier about the state change and the related alarm(s).
 159       Alternatively, the Inspector may directly inform the Notifier about it.
 160
 161 6. The Notifier performs another filtering and aggregation of the changes and
 162    alarms based on the pre-configured alarm configuration. Finally, a fault
 163    notification is sent northbound to the Consumer.
 164
 165 NFVI Maintenance
 166 ^^^^^^^^^^^^^^^^
 167
 168 The detailed work flow for NFVI maintenance is shown in :numref:`figure9`
 169 and has the following steps. Note that steps 1, 2, and 5 to 8a in the NFVI
 170 maintenance work flow are very similar to the steps in the fault management work
 171 flow and share a similar implementation plan in Release 1.
 172
 173 1. Subscribe to fault/maintenance notifications.
 174 2. Response to subscribe request.
 175 3. Maintenance trigger received from administrator.
 176 4. VIM switches NFVI resources to "maintenance" state. This means, for example,
 177    that they should not be used for further allocation/migration requests.
 178 5. Database lookup to find the virtual resources affected by the detected
 179    maintenance operation.
 180 6. Maintenance notification to Consumer.
 181 7. The Consumer switches to standby configuration (STBY).
 182 8. Instructions from Consumer to VIM requesting certain recovery actions to be
 183    performed (step 7a). After reception of such instructions, the VIM
 184    executes the requested action in order to empty the physical resources (step
 185    7b).
 186 9. Maintenance response from VIM to inform the Administrator that the physical
 187    machines have been emptied (or the operation resulted in an error state).
 188 10. The Administrator coordinates and executes the maintenance operation/work
 189     on the NFVI.
 190
 191     A) Query request from Administrator to VIM to get information about the
 192        current state of a resource.
 193     B) Response to the query request with information about the current state of
 194        the queried resource(s). In case the resource is in "maintenance" state,
 195        information about the related maintenance operation is returned.
 196
 197 .. figure:: images/figure9.png
 198    :name: figure9
 199    :width: 100%
 200
 201    NFVI maintenance work flow
 202
 203 .. figure:: images/figure10.png
 204    :name: figure10
 205    :width: 100%
 206
 207    NFVI Maintenance implementation plan
 208
 209 :numref:`figure10` shows a more detailed message flow (Steps 4 to 6)
 210 between the four building blocks introduced in Section 5.1.
 211
 212 3. The Administrator sends a StateChange request to the Controller residing
 213    in the VIM.
 214 4. The Controller queries the Resource Map to find the virtual resources
 215    affected by the planned maintenance operation.
 216 5.
 217
 218   a) The Controller updates the state of the affected virtual resources in the
 219   Resource Map database.
 220
 221   b) The Controller informs the Notifier about the virtual resources that will
 222   be affected by the maintenance operation.
 223
 224 6. A maintenance notification is sent northbound to the Consumer.
 225
 226 ...
 227
 228 9. The Controller informs the Administrator after the physical resources have
 229    been freed.
 230
 231
 232
 233 Implementation plan for OPNFV Release 1
 234 ---------------------------------------
 235
 236 Fault management
 237 ^^^^^^^^^^^^^^^^
 238
 239 :numref:`figure11` shows the implementation plan based on OpenStack and
 240 related components as planned for Release 1. Hereby, the Monitor can be realized
 241 by Zabbix. The Controller is realized by OpenStack Nova [NOVA]_, Neutron
 242 [NEUT]_, and Cinder [CIND]_ for compute, network, and storage,
 243 respectively. The Inspector can be realized by Monasca [MONA]_ or a simple
 244 script querying Nova in order to map between physical and virtual resources. The
 245 Notifier will be realized by Ceilometer [CEIL]_ receiving failure events
 246 on its notification bus.
 247
 248 :numref:`figure12` shows the inner workings of Ceilometer. After receiving
 249 an "event" on its notification bus, first a notification agent will grab the
 250 event and send a "notification" to the Collector. The Collector writes the
 251 notifications received to the Ceilometer databases.
 252
 253 In the existing Ceilometer implementation, an alarm evaluator is periodically
 254 polling those databases through the APIs provided. If it finds new alarms, it
 255 will evaluate them based on the pre-defined alarm configuration, and depending
 256 on the configuration, it will hand a message to the Alarm Notifier, which in
 257 turn will send the alarm message northbound to the Consumer. :numref:`figure12`
 258 also shows an optimized work flow for Ceilometer with the goal to
 259 reduce the delay for fault notifications to the Consumer. The approach is to
 260 implement a new notification agent (called "publisher" in Ceilometer
 261 terminology) which is directly sending the alarm through the "Notification Bus"
 262 to a new "Notification-driven Alarm Evaluator (NAE)" (see Sections 5.6.2 and
 263 5.6.3), thereby bypassing the Collector and avoiding the additional delay of the
 264 existing polling-based alarm evaluator. The NAE is similar to the OpenStack
 265 "Alarm Evaluator", but is triggered by incoming notifications instead of
 266 periodically polling the OpenStack "Alarms" database for new alarms. The
 267 Ceilometer "Alarms" database can hold three states: "normal", "insufficient
 268 data", and "fired". It represents a persistent alarm database. In order to
 269 realize the Doctor requirements, we need to define new "meters" in the database
 270 (see Section 5.6.1).
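
The benefit of a notification-driven evaluator over a polling-based one can be
made concrete with a small calculation. The sketch below is not Ceilometer
code; it assumes, for illustration, that the polling evaluator wakes up at
fixed multiples of a poll interval and that the notification-driven evaluator
adds only its own handling time.

```python
import math


def polling_delay(event_time: int, poll_interval: int) -> int:
    """Seconds an event waits for the next poll tick.

    Poll ticks are assumed at 0, poll_interval, 2 * poll_interval, ...
    """
    next_tick = math.ceil(event_time / poll_interval) * poll_interval
    return next_tick - event_time


def notification_driven_delay(handling_time: float = 0.0) -> float:
    """Delay when the evaluator is triggered by the notification itself."""
    return handling_time
```

With a hypothetical 60 second poll interval, an event arriving one second
after a tick waits 59 seconds before it is even evaluated, which by itself
exceeds the sub-second reaction time that Doctor targets.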
 271
 272 .. figure:: images/figure11.png
 273    :name: figure11
 274    :width: 100%
 275
 276    Implementation plan in OpenStack (OPNFV Release 1 "Arno")
 277
 278
 279 .. figure:: images/figure12.png
 280    :name: figure12
 281    :width: 100%
 282
 283    Implementation plan in Ceilometer architecture
 284
 285
 286 NFVI Maintenance
 287 ^^^^^^^^^^^^^^^^
 288
 289 For NFVI Maintenance, a very similar implementation plan exists. Instead of a
 290 raw fault being observed by the Monitor, the Administrator sends a
 291 Maintenance Request through the northbound interface towards the Controller
 292 residing in the VIM. Similar to the Fault Management use case, the Controller
 293 (in our case OpenStack Nova) will send a maintenance event to the Notifier (i.e.
 294 Ceilometer in our implementation). Within Ceilometer, the same workflow as
 295 described in the previous section applies. In addition, the Controller(s) will
 296 take appropriate actions to evacuate the physical machines in order to prepare
 297 them for the planned maintenance operation. After the physical machines are
 298 emptied, the Controller will inform the Administrator that it can initiate the
 299 maintenance.
 300
 301 Information elements
 302 --------------------
 303
 304 This section introduces all attributes and information elements used in the
 305 messages exchanged on the northbound interfaces between the VIM and the NFVO and
 306 VNFM.
 307
 308 Note: The information elements will be aligned with current work in ETSI NFV IFA
 309 working group.
 310
 311
 312 Simple information elements:
 313
 314 * SubscriptionID: identifies a subscription to receive fault or maintenance
 315   notifications.
 316 * NotificationID: identifies a fault or maintenance notification.
 317 * VirtualResourceID (Identifier): identifies a virtual resource affected by a
 318   fault or a maintenance action of the underlying physical resource.
 319 * PhysicalResourceID (Identifier): identifies a physical resource affected by a
 320   fault or maintenance action.
 321 * VirtualResourceState (String): state of a virtual resource, e.g. "normal",
 322   "maintenance", "down", "error".
 323 * PhysicalResourceState (String): state of a physical resource, e.g. "normal",
 324   "maintenance", "down", "error".
 325 * VirtualResourceType (String): type of the virtual resource, e.g. "virtual
 326   machine", "virtual memory", "virtual storage", "virtual CPU", or "virtual
 327   NIC".
 328 * FaultID (Identifier): identifies the related fault in the underlying physical
 329   resource. This can be used to correlate different fault notifications caused
 330   by the same fault in the physical resource.
 331 * FaultType (String): Type of the fault. The allowed values for this parameter
 332   depend on the type of the related physical resource. For example, a resource
 333   of type "compute hardware" may have faults of type "CPU failure", "memory
 334   failure", "network card failure", etc.
 335 * Severity (Integer): value expressing the severity of the fault. The higher the
 336   value, the more severe the fault.
 337 * MinSeverity (Integer): value used in filter information elements. Only faults
 338   with a severity higher than the MinSeverity value will be notified to the
 339   Consumer.
 340 * EventTime (Datetime): Time when the fault was observed.
 341 * EventStartTime and EventEndTime (Datetime): Datetime range that can be used in
 342   a FaultQueryFilter to narrow down the faults to be queried.
 343 * ProbableCause: information about the probable cause of the fault.
 344 * CorrelatedFaultID (Identifier): list of other faults correlated to this fault.
 345 * isRootCause (Boolean): Parameter indicating if this fault is the root for
 346   other correlated faults. If TRUE, then the faults listed in the parameter
 347   CorrelatedFaultID are caused by this fault.
 348 * FaultDetails (Key-value pair): provides additional information about the
 349   fault, e.g. information about the threshold, monitored attributes, indication
 350   of the trend of the monitored parameter.
 351 * FirmwareVersion (String): current version of the firmware of a physical
 352   resource.
 353 * HypervisorVersion (String): current version of a hypervisor.
 354 * ZoneID (Identifier): Identifier of the resource zone. A resource zone is the
 355   logical separation of physical and software resources in an NFVI deployment
 356   for physical isolation, redundancy, or administrative designation.
 357 * Metadata (Key-Value-Pairs): provides additional information of a physical
 358   resource in maintenance/error state.
 359
 360 Complex information elements (see also UML diagrams in :numref:`figure13`
 361 and :numref:`figure14`):
 362
 363 * VirtualResourceInfoClass:
 364
 365   + VirtualResourceID [1] (Identifier)
 366   + VirtualResourceState [1] (String)
 367   + Faults [0..*] (FaultClass): For each resource, all faults
 368     including detailed information about the faults are provided.
 369
 370 * FaultClass: The parameters of the FaultClass are partially based on ETSI TS
 371   132 111-2 (V12.1.0) [*]_, which is specifying fault management in 3GPP, in
 372   particular describing the information elements used for alarm notifications.
 373
 374   - FaultID [1] (Identifier)
 375   - FaultType [1]
 376   - Severity [1] (Integer)
 377   - EventTime [1] (Datetime)
 378   - ProbableCause [1]
 379   - CorrelatedFaultID [0..*] (Identifier)
 380   - FaultDetails [0..*] (Key-value pair)
 381
 382 .. [*] http://www.etsi.org/deliver/etsi_ts/132100_132199/13211102/12.01.00_60/ts_13211102v120100p.pdf
 383
 384 * SubscribeFilterClass
 385
 386   - VirtualResourceType [0..*] (String)
 387   - VirtualResourceID [0..*] (Identifier)
 388   - FaultType [0..*] (String)
 389   - MinSeverity [0..1] (Integer)
 390
 391 * FaultQueryFilterClass: narrows down the FaultQueryRequest, for example it
 392   limits the query to certain physical resources, a certain zone, a given fault
 393   type/severity/cause, or a specific FaultID.
 394
 395   - VirtualResourceType [0..*] (String)
 396   - VirtualResourceID [0..*] (Identifier)
 397   - FaultType [0..*] (String)
 398   - MinSeverity [0..1] (Integer)
 399   - EventStartTime [0..1] (Datetime)
 400   - EventEndTime [0..1] (Datetime)
 401
 402 * PhysicalResourceStateClass:
 403
 404   - PhysicalResourceID [1] (Identifier)
 405   - PhysicalResourceState [1] (String): mandates the new state of the physical
 406     resource.
 407
 408 * PhysicalResourceInfoClass:
 409
 410   - PhysicalResourceID [1] (Identifier)
 411   - PhysicalResourceState [1] (String)
 412   - FirmwareVersion [0..1] (String)
 413   - HypervisorVersion [0..1] (String)
 414   - ZoneID [0..1] (Identifier)
 415
 416 * StateQueryFilterClass: narrows down a StateQueryRequest, for example it limits
 417   the query to certain physical resources, a certain zone, or a given resource
 418   state (e.g., only resources in "maintenance" state).
 419
 420   - PhysicalResourceID [1] (Identifier)
 421   - PhysicalResourceState [1] (String)
 422   - ZoneID [0..1] (Identifier)
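
For readers who prefer code to UML, some of the complex information elements
above could be modelled as plain data classes. This is only one possible
mapping: the Python class and field names, and the choice of ``str`` for
identifiers, are assumptions of this sketch and are not mandated by the
specification.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, List, Optional


@dataclass
class Fault:
    """FaultClass: multiplicity [1] fields are required arguments."""
    fault_id: str
    fault_type: str
    severity: int
    event_time: datetime
    probable_cause: str
    # Multiplicity [0..*] maps to lists/dicts that default to empty.
    correlated_fault_ids: List[str] = field(default_factory=list)
    fault_details: Dict[str, str] = field(default_factory=dict)


@dataclass
class VirtualResourceInfo:
    """VirtualResourceInfoClass."""
    virtual_resource_id: str
    virtual_resource_state: str
    faults: List[Fault] = field(default_factory=list)


@dataclass
class PhysicalResourceInfo:
    """PhysicalResourceInfoClass: [0..1] fields map to Optional values."""
    physical_resource_id: str
    physical_resource_state: str
    firmware_version: Optional[str] = None
    hypervisor_version: Optional[str] = None
    zone_id: Optional[str] = None
```

Using ``default_factory`` gives every instance its own empty list, so faults
appended to one resource do not leak into another.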
 423
 424 .. _impl_nbi:
 425
 426 Detailed northbound interface specification
 427 -------------------------------------------
 428
 429 This section is specifying the northbound interfaces for fault management and
 430 NFVI maintenance between the VIM on the one end and the Consumer and the
 431 Administrator on the other ends. For each interface all messages and related
 432 information elements are provided.
 433
 434 Note: The interface definition will be aligned with current work in ETSI NFV IFA
 435 working group.
 436
 437 All of the interfaces described below are produced by the VIM and consumed by
 438 the Consumer or Administrator.
 439
 440 Fault management interface
 441 ^^^^^^^^^^^^^^^^^^^^^^^^^^
 442
 443 This interface allows the VIM to notify the Consumer about a virtual resource
 444 that is affected by a fault, either within the virtual resource itself or by the
 445 underlying virtualization infrastructure. The messages on this interface are
 446 shown in :numref:`figure13` and explained in detail in the following
 447 subsections.
 448
 449 Note: The information elements used in this section are described in detail in
 450 Section 5.4.
 451
 452 .. figure:: images/figure13.png
 453    :name: figure13
 454    :width: 100%
 455
 456    Fault management NB I/F messages
 457
 458
 459 SubscribeRequest (Consumer -> VIM)
 460 __________________________________
 461
 462 Subscription from Consumer to VIM to be notified about faults of specific
 463 resources. The faults to be notified about can be narrowed down using a
 464 subscribe filter.
 465
 466 Parameters:
 467
 468 - SubscribeFilter [1] (SubscribeFilterClass): Optional information to narrow
 469   down the faults that shall be notified to the Consumer, for example limit to
 470   specific VirtualResourceID(s), severity, or cause of the alarm.
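
How a SubscribeFilter might be evaluated against an incoming fault is sketched
below. Two interpretations are assumed here and are not stated in the
specification: an empty [0..*] list places no restriction on that attribute,
and all given restrictions must hold at the same time. The strict comparison
on severity follows the MinSeverity definition in Section 5.4.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SubscribeFilter:
    """SubscribeFilterClass with Python-style field names (hypothetical)."""
    virtual_resource_types: List[str] = field(default_factory=list)
    virtual_resource_ids: List[str] = field(default_factory=list)
    fault_types: List[str] = field(default_factory=list)
    min_severity: Optional[int] = None


def matches(flt: SubscribeFilter, resource_type: str, resource_id: str,
            fault_type: str, severity: int) -> bool:
    """Return True if a fault should be notified under this filter."""
    if (flt.virtual_resource_types
            and resource_type not in flt.virtual_resource_types):
        return False
    if flt.virtual_resource_ids and resource_id not in flt.virtual_resource_ids:
        return False
    if flt.fault_types and fault_type not in flt.fault_types:
        return False
    # Only faults with a severity higher than MinSeverity are notified.
    if flt.min_severity is not None and severity <= flt.min_severity:
        return False
    return True
```

For example, a filter with only ``min_severity=3`` set would pass a fault of
severity 4 on any resource and reject one of severity 3.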
 471
 472 SubscribeResponse (VIM -> Consumer)
 473 ___________________________________
 474
 475 Response to a subscribe request message including information about the
 476 subscribed resources, in particular if they are in "fault/error" state.
 477
 478 Parameters:
 479
 480 * SubscriptionID [1] (Identifier): Unique identifier for the subscription. It
 481   can be used to delete or update the subscription.
 482 * VirtualResourceInfo [0..*] (VirtualResourceInfoClass): Provides additional
 483   information about the subscribed resources, i.e., a list of the related
 484   resources, the current state of the resources, etc.
 485
 486 FaultNotification (VIM -> Consumer)
 487 ___________________________________
 488
 489 Notification about a virtual resource that is affected by a fault, either within
 490 the virtual resource itself or by the underlying virtualization infrastructure.
 491 After reception of this notification, the Consumer will decide on the optimal
 492 action to resolve the fault. This includes actions like switching to a hot
 493 standby virtual resource, migration of the faulty virtual resource to another
 494 physical machine, termination of the faulty virtual resource and instantiation
 495 of a new virtual resource in order to provide a new hot standby resource.
 496 Existing resource management interfaces and messages between the Consumer and
 497 the VIM can be used for those actions, and there is no need to define additional
 498 actions on the Fault Management Interface.
 499
 500 Parameters:
 501
 502 * NotificationID [1] (Identifier): Unique identifier for the notification.
 503 * VirtualResourceInfo [1..*] (VirtualResourceInfoClass): List of faulty
 504   resources with detailed information about the faults.
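
A FaultNotification could, for instance, be assembled as a dictionary and
serialized for transport. The specification does not define an encoding, so
the use of JSON, the UUID-based default for NotificationID, and the helper
name are assumptions of this sketch.

```python
import json
import uuid
from typing import List, Optional


def build_fault_notification(virtual_resource_info: List[dict],
                             notification_id: Optional[str] = None) -> dict:
    """Assemble a FaultNotification with its two parameters."""
    # VirtualResourceInfo has multiplicity [1..*]: at least one entry.
    if not virtual_resource_info:
        raise ValueError("VirtualResourceInfo requires at least one entry")
    return {
        "NotificationID": notification_id or str(uuid.uuid4()),
        "VirtualResourceInfo": list(virtual_resource_info),
    }


example = build_fault_notification(
    [{
        "VirtualResourceID": "vm-1",
        "VirtualResourceState": "error",
        "Faults": [{
            "FaultID": "f-1",
            "FaultType": "CPU failure",
            "Severity": 5,
            "EventTime": "2015-06-01T12:00:00",
            "ProbableCause": "overheating",
        }],
    }],
    notification_id="n-1",
)
encoded = json.dumps(example)
```

Rejecting an empty list at construction time keeps a notification without
affected resources from ever reaching the Consumer.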
 505
 506 FaultQueryRequest (Consumer -> VIM)
 507 ___________________________________
 508
 509 Request to find out about active alarms at the VIM. A FaultQueryFilter can be
 510 used to narrow down the alarms returned in the response message.
 511
 512 Parameters:
 513
 514 * FaultQueryFilter [1] (FaultQueryFilterClass): narrows down the
 515   FaultQueryRequest, for example it limits the query to certain physical
 516   resources, a certain zone, a given fault type/severity/cause, or a specific
 517   FaultID.
 518
 519 FaultQueryResponse (VIM -> Consumer)
 520 ____________________________________
 521
 522 List of active alarms at the VIM matching the FaultQueryFilter specified in the
 523 FaultQueryRequest.
 524
 525 Parameters:
 526
 527 * VirtualResourceInfo [0..*] (VirtualResourceInfoClass): List of faulty
 528   resources. For each resource all faults including detailed information about
 529   the faults are provided.
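
The effect of a FaultQueryFilter on the list of active alarms can be sketched
as a simple selection function. Faults are represented as dictionaries here,
and the inclusive handling of EventStartTime and EventEndTime is an assumption
of this example rather than something the specification states.

```python
from datetime import datetime
from typing import Iterable, List, Optional, Sequence


def query_faults(faults: Iterable[dict],
                 fault_types: Sequence[str] = (),
                 min_severity: Optional[int] = None,
                 event_start_time: Optional[datetime] = None,
                 event_end_time: Optional[datetime] = None) -> List[dict]:
    """Return the faults that satisfy every restriction that is set."""
    result = []
    for fault in faults:
        if fault_types and fault["FaultType"] not in fault_types:
            continue
        # Severity must be higher than MinSeverity (see Section 5.4).
        if min_severity is not None and fault["Severity"] <= min_severity:
            continue
        if (event_start_time is not None
                and fault["EventTime"] < event_start_time):
            continue
        if event_end_time is not None and fault["EventTime"] > event_end_time:
            continue
        result.append(fault)
    return result
```

Calling the function without any restriction returns all active faults, which
corresponds to a query with an empty filter.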
 530
 531 NFVI maintenance
 532 ^^^^^^^^^^^^^^^^
 533
 534 The NFVI maintenance interface Consumer-VIM allows the Consumer to subscribe to
 535 maintenance notifications provided by the VIM. The related maintenance interface
 536 Administrator-VIM allows the Administrator to issue maintenance requests to the
 537 VIM, i.e. requesting the VIM to take appropriate actions to empty physical
 538 machine(s) in order to execute maintenance operations on them. The interface
 539 also allows the Administrator to query the state of physical machines, e.g., in
 540 order to get details on the current status of a maintenance operation such as a
 541 firmware update.
 542
 543 The messages defined in these northbound interfaces are shown in :numref:`figure14`
 544 and described in detail in the following subsections.
 545
 546 .. figure:: images/figure14.png
 547    :name: figure14
 548    :width: 100%
 549
 550    NFVI maintenance NB I/F messages
 551
 552 SubscribeRequest (Consumer -> VIM)
 553 __________________________________
 554
 555 Subscription from Consumer to VIM to be notified about maintenance operations
 556 for specific virtual resources. The resources to be informed about can be
 557 narrowed down using a subscribe filter.
 558
 559 Parameters:
 560
 561 * SubscribeFilter [1] (SubscribeFilterClass): Information to narrow down the
 562   maintenance notifications sent to the Consumer, for example limit to specific
 563   virtual resource type(s).
 564
 565 SubscribeResponse (VIM -> Consumer)
 566 ___________________________________
 567
 568 Response to a subscribe request message, including information about the
 569 subscribed virtual resources, in particular if they are in "maintenance" state.
 570
 571 Parameters:
 572
 573 * SubscriptionID [1] (Identifier): Unique identifier for the subscription. It
 574   can be used to delete or update the subscription.
 575 * VirtualResourceInfo [0..*] (VirtualResourceInfoClass): Provides additional
 576   information about the subscribed virtual resource(s), e.g., the ID, type and
 577   current state of the resource(s).
 578
 579 MaintenanceNotification (VIM -> Consumer)
 580 _________________________________________
 581
 582 Notification about a physical resource switched to "maintenance" state. After
 583 reception of this notification, the Consumer will decide on the optimal action to
 584 address it, e.g., to switch to the standby (STBY) configuration.
 585
 586 Parameters:
 587
 588 * VirtualResourceInfo [1..*] (VirtualResourceInfoClass): List of virtual
 589   resources where the state has been changed to maintenance.
 590
 591 StateChangeRequest (Administrator -> VIM)
 592 _________________________________________
 593
 594 Request to change the state of a list of physical resources, e.g. to
 595 "maintenance" state, in order to prepare them for a planned maintenance
 596 operation.
 597
 598 Parameters:
 599
 600 * PhysicalResourceState [1..*] (PhysicalResourceStateClass)
 601
 602 StateChangeResponse (VIM -> Administrator)
 603 __________________________________________
 604
 605 Response message to inform the Administrator that the requested resources are
 606 now in maintenance state (or the operation resulted in an error) and the
 607 maintenance operation(s) can be executed.
 608
 609 Parameters:
 610
 611 * PhysicalResourceInfo [1..*] (PhysicalResourceInfoClass)
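
The StateChangeRequest and StateChangeResponse pair can be pictured as a
handler that applies each requested state and answers with one
PhysicalResourceInfo entry per request. The in-memory inventory, the
dictionary layout, and the reporting of unknown resources with an "error"
state are assumptions made for this sketch.

```python
from typing import Dict, List


def handle_state_change(inventory: Dict[str, dict],
                        requests: List[dict]) -> List[dict]:
    """Apply PhysicalResourceState entries and build the response list.

    inventory maps a PhysicalResourceID to its remaining info attributes.
    """
    # PhysicalResourceState has multiplicity [1..*] in the request.
    if not requests:
        raise ValueError("at least one PhysicalResourceState is required")
    response = []
    for request in requests:
        resource_id = request["PhysicalResourceID"]
        info = inventory.get(resource_id)
        if info is None:
            # Hypothetical error reporting for resources the VIM does not know.
            response.append({"PhysicalResourceID": resource_id,
                             "PhysicalResourceState": "error"})
            continue
        info["PhysicalResourceState"] = request["PhysicalResourceState"]
        response.append(dict(info, PhysicalResourceID=resource_id))
    return response
```

A real VIM would send the response only after the physical machines have been
emptied; that asynchronous part is left out of this example.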
 612
 613 StateQueryRequest (Administrator -> VIM)
 614 ________________________________________
 615
 616 In this procedure, the Administrator would like to get the information about
 617 physical machine(s), e.g. their state ("normal", "maintenance"), firmware
 618 version, hypervisor version, update status of firmware and hypervisor, etc. It
 619 can be used to check the progress during firmware update and the confirmation
 620 after update. A filter can be used to narrow down the resources returned in the
 621 response message.
 622
 623 Parameters:
 624
 625 * StateQueryFilter [1] (StateQueryFilterClass): narrows down the
 626   StateQueryRequest, for example it limits the query to certain physical
 627   resources, a certain zone, or a given resource state.
 628
 629 StateQueryResponse (VIM -> Administrator)
 630 _________________________________________
 631
 632 List of physical resources matching the filter specified in the
 633 StateQueryRequest.
 634
 635 Parameters:
 636
 637 * PhysicalResourceInfo [0..*] (PhysicalResourceInfoClass): List of physical
 638   resources. For each resource, information about the current state, the
 639   firmware version, etc. is provided.
 640
 641 Blueprints
 642 ----------
 643
 644 This section lists a first set of blueprints that have been proposed by the
 645 Doctor project to the open source community. Further blueprints addressing other
 646 gaps identified in Section 4 will be submitted at a later stage of OPNFV. In
 647 this section the following definitions are used:
 648
 649 * "Event" is a message emitted by other OpenStack services such as Nova and
 650   Neutron and is consumed by the "Notification Agents" in Ceilometer.
 651 * "Notification" is a message generated by a "Notification Agent" in Ceilometer
 652   based on an "event" and is delivered to the "Collectors" in Ceilometer that
 653   store those notifications (as "sample") to the Ceilometer "Databases".
 654
 655 Instance State Notification  (Ceilometer) [*]_
 656 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 657
 658 The Doctor project is planning to handle "events" and "notifications" regarding
 659 Resource Status: Instance State, Port State, Host State, etc. Currently,
 660 Ceilometer already receives "events" to identify the state of those resources,
 661 but it does not handle and store them yet. This is why we also need a new event
 662 definition to capture those resource states from "events" created by other
 663 services.
 664
 665 This BP proposes to add a new compute notification state to handle events from
 666 an instance (server) from Nova. It also creates a new meter "instance.state" in
 667 OpenStack.
 668
 669 .. [*] https://etherpad.opnfv.org/p/doctor_bps
 670
 671 Event Publisher for Alarm  (Ceilometer) [*]_
 672 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 673
 674 **Problem statement:**
 675
 676   The existing "Alarm Evaluator" in OpenStack Ceilometer is periodically
 677   querying/polling the databases in order to check all alarms independently from
 678   other processes. This adds delay to the fault notification sent to the
 679   Consumer, whereas one requirement of Doctor is to react to faults as fast as
 680   possible.
 681
 682   The existing message flow is shown in :numref:`figure12`: after receiving
 683   an "event", a "notification agent" (i.e. "event publisher") sends a
 684   "notification" to a "Collector". The "Collector" collects the
 685   notifications and updates the Ceilometer "Meter" database, which stores
 686   information about the "sample" captured from the original "event". The
 687   "Alarm Evaluator" periodically polls this database, querying the "Meter"
 688   database based on each alarm configuration.
 689
 690   In the current Ceilometer implementation, there is no way to directly
 691   trigger the "Alarm Evaluator" when a new "event" is received. Instead, the
 692   "Alarm Evaluator" only finds out that it needs to fire a new notification to
 693   the Consumer when it polls the database.
 694
 695 **Change/feature request:**
 696
 697   This BP proposes to add a new "event publisher for alarm", which bypasses
 698   several steps in Ceilometer in order to avoid the polling-based approach of
 699   the existing Alarm Evaluator, which makes notifications to users slow.
 700
 701   After receiving an "(alarm) event" by listening on the Ceilometer message
 702   queue ("notification bus"), the new "event publisher for alarm" immediately
 703   hands a "notification" about this event to a new Ceilometer component
 704   "Notification-driven alarm evaluator" proposed in the other BP (see Section
 705   5.6.3).
 706
 707   Note, the term "publisher" refers to an entity in the Ceilometer architecture
 708   (it is a "notification agent"). It offers the capability to provide
 709   notifications to other services outside of Ceilometer, but it is also used to
 710   deliver notifications to other Ceilometer components (e.g. the "Collectors")
 711   via the Ceilometer "notification bus".
 712
 713 **Implementation detail:**
 714
 715   * "Event publisher for alarm" is part of Ceilometer.
 716   * The standard AMQP message queue is used with a new topic string.
 717   * No new interfaces have to be added to Ceilometer.
 718   * "Event publisher for alarm" can be configured by the Administrator of
 719     Ceilometer to be used as a "Notification Agent" in addition to the
 720     existing "Notifier".
 721   * Existing alarm mechanisms of Ceilometer can be used, allowing users to
 722     configure how to distribute the "notifications" transformed from "events",
 723     e.g. there is an option whether an ongoing alarm is re-issued or not
 724     ("repeat_actions").
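
The forwarding behaviour described in this BP could be sketched as follows.
This is a minimal in-memory model, not Ceilometer code: the ``NotificationBus``
class stands in for the AMQP "notification bus", and the topic names, the event
fields and the function ``event_publisher_for_alarm`` are hypothetical names
chosen for the example.

.. code-block:: python

    from collections import defaultdict


    class NotificationBus:
        """In-memory stand-in for the AMQP "notification bus" (hypothetical)."""

        def __init__(self):
            self._subscribers = defaultdict(list)

        def subscribe(self, topic, callback):
            self._subscribers[topic].append(callback)

        def publish(self, topic, message):
            # Deliver synchronously to every subscriber of the topic.
            for callback in self._subscribers[topic]:
                callback(message)


    # Hypothetical topic strings; the BP only states that a new topic is used.
    EVENT_TOPIC = "events"
    ALARM_TOPIC = "alarm.events"


    def event_publisher_for_alarm(bus):
        """Forward each incoming event to the alarm topic without polling."""

        def on_event(event):
            notification = {
                "resource_id": event["resource_id"],
                "state": event["state"],
            }
            bus.publish(ALARM_TOPIC, notification)

        bus.subscribe(EVENT_TOPIC, on_event)


    bus = NotificationBus()
    received = []
    bus.subscribe(ALARM_TOPIC, received.append)  # e.g. an alarm evaluator
    event_publisher_for_alarm(bus)
    bus.publish(EVENT_TOPIC, {"resource_id": "host-1", "state": "down"})

In this model the subscriber on the alarm topic sees the notification as soon as
the event is published, with no database write or polling interval in between.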
 725
 726 .. [*] https://etherpad.opnfv.org/p/doctor_bps
 727
 728 Notification-driven alarm evaluator (Ceilometer) [*]_
 729 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 730
 731 **Problem statement:**
 732
 733 The existing "Alarm Evaluator" in OpenStack Ceilometer periodically
 734 queries/polls the databases in order to check all alarms independently from
 735 other processes. This adds delay to the fault notification sent to the
 736 Consumer, whereas one requirement of Doctor is to react to faults as fast as
 737 possible.
 738
 739 **Change/feature request:**
 740
 741 This BP proposes to add an alternative "Notification-driven Alarm Evaluator"
 742 for Ceilometer that receives "notifications" sent by the "Event Publisher
 743 for Alarm" described in the other BP. Once this new "Notification-driven Alarm
 744 Evaluator" receives a "notification", it finds the "alarm" configurations that
 745 may relate to the "notification" by querying the "alarm" database with some
 746 keys, such as the resource ID. It then evaluates each alarm with the information
 747 in that "notification".
 748
 749 After the alarm evaluation, it behaves in the same way as the existing "Alarm
 750 Evaluator" when firing an alarm notification to the Consumer. Similar to the
 751 existing Alarm Evaluator, this new "Notification-driven Alarm Evaluator"
 752 aggregates and correlates different alarms, which are then provided northbound
 753 to the Consumer via the OpenStack "Alarm Notifier". The user/administrator can
 754 register the alarm configuration via the existing Ceilometer API [*]_. Thereby,
 755 they can configure whether to set an alarm or not and where to send the alarms.
 756
 757 **Implementation detail:**
 758
 759 * The new "Notification-driven Alarm Evaluator" is part of Ceilometer.
 760 * Most of the existing source code of the "Alarm Evaluator" can be re-used to
 761   implement this BP.
 762 * No additional application logic is needed.
 763 * It will access the Ceilometer Databases just like the existing "Alarm
 764   Evaluator".
 765 * Only the polling-based approach will be replaced by a listener for
 766   "notifications" provided by the "Event Publisher for Alarm" on the Ceilometer
 767   "notification bus".
 768 * No new interfaces have to be added to Ceilometer.
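
The evaluation flow described above (look up the alarms for the resource,
evaluate each one against the notification, then fire) might look roughly like
the sketch below. The class name, the alarm fields ``alarm_id``,
``resource_id`` and ``trigger_state``, and the use of a plain list as the
"alarm" database are assumptions for illustration and are not taken from
Ceilometer.

.. code-block:: python

    class NotificationDrivenAlarmEvaluator:
        """Evaluate alarms when a notification arrives instead of polling."""

        def __init__(self, alarm_db, notifier):
            self.alarm_db = alarm_db  # list of alarm configuration dicts
            self.notifier = notifier  # callable standing in for "Alarm Notifier"

        def find_alarms(self, resource_id):
            # Query the "alarm" database using the resource ID as the key.
            return [a for a in self.alarm_db if a["resource_id"] == resource_id]

        def on_notification(self, notification):
            for alarm in self.find_alarms(notification["resource_id"]):
                # Evaluate with the data carried in the notification itself.
                if notification["state"] == alarm["trigger_state"]:
                    self.notifier(alarm["alarm_id"], notification)


    fired = []
    alarms = [
        {"alarm_id": "a1", "resource_id": "vm-1", "trigger_state": "error"},
        {"alarm_id": "a2", "resource_id": "vm-2", "trigger_state": "error"},
    ]
    evaluator = NotificationDrivenAlarmEvaluator(
        alarms, lambda alarm_id, notification: fired.append(alarm_id)
    )
    evaluator.on_notification({"resource_id": "vm-1", "state": "error"})
    evaluator.on_notification({"resource_id": "vm-2", "state": "active"})

Only the alarm registered for ``vm-1`` fires here, because the second
notification does not match the trigger state of its alarm.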
 769
 770
 771 .. [*] https://etherpad.opnfv.org/p/doctor_bps
 772 .. [*] https://wiki.openstack.org/wiki/Ceilometer/Alerting
 773
 774 Report host fault to update server state immediately (Nova) [*]_
 775 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 776
 777 **Problem statement:**
 778
 779 * The Nova state change for a failed or unreachable host is slow and does not
 780   reliably indicate whether the host is down. This might cause the same server
 781   instance to run twice if action is taken to evacuate it to another host.
 782 * The Nova state of server(s) on a failed host does not change, but remains
 783   active and running. This gives the user false information about server state.
 784 * VIM northbound interface notification of host faults towards VNFM and NFVO
 785   should be in line with OpenStack state. This fault notification is a Telco
 786   requirement defined by ETSI that the OPNFV Doctor project will implement.
 787 * An OpenStack user cannot take HA actions quickly and reliably by trusting
 788   server state and host state.
 789
 790 **Proposed change:**
 791
 792 A new API is needed for the Admin to state that a host is down. This API is
 793 used to mark services running on the host as down to reflect the real situation.
 794
 795 An example for a compute node:
 796
 797 * When the compute node is up and running::
 798
 799     vm_state: active and power_state: running
 800     nova-compute state: up status: enabled
 801
 802 * When the compute node goes down and the new API marks the host as down::
 803
 804     vm_state: stopped power_state: shutdown
 805     nova-compute state: down status: enabled
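
The state transition in the example above can be modelled with a short sketch.
The function ``mark_host_down`` and the dict-based records are hypothetical and
only illustrate the intended effect of the proposed API; they are not the Nova
implementation.

.. code-block:: python

    def mark_host_down(host, services, servers):
        """Return updated copies of services and servers once host is down."""
        new_services = [
            dict(s, state="down") if s["host"] == host else dict(s)
            for s in services
        ]
        new_servers = [
            dict(v, vm_state="stopped", power_state="shutdown")
            if v["host"] == host else dict(v)
            for v in servers
        ]
        return new_services, new_servers


    services = [
        {"host": "compute-1", "binary": "nova-compute",
         "state": "up", "status": "enabled"},
    ]
    servers = [
        {"host": "compute-1", "name": "vm-1",
         "vm_state": "active", "power_state": "running"},
        {"host": "compute-2", "name": "vm-2",
         "vm_state": "active", "power_state": "running"},
    ]
    new_services, new_servers = mark_host_down("compute-1", services, servers)

As in the example, the service ``status`` stays ``enabled`` while its ``state``
becomes ``down``, and servers on other hosts are left untouched.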
 806
 807 **Alternatives:**
 808
 809 There is no attractive alternative for detecting all the different host faults
 810 other than an external tool. For this kind of tool to exist, Nova needs a new
 811 API for reporting faults. Currently, some kind of workaround must be
 812 implemented, because the states cannot be trusted or obtained from OpenStack
 813 fast enough.
 814
 815 .. [*] https://blueprints.launchpad.net/nova/+spec/update-server-state-immediately
 816
 817 Other related BPs
 818 ^^^^^^^^^^^^^^^^^
 819
 820 This section lists some BPs related to Doctor, but proposed by drafters outside
 821 the OPNFV community.
 822
 823 pacemaker-servicegroup-driver [*]_
 824 __________________________________
 825
 826 This BP will detect and report a host being down to OpenStack quite quickly.
 827 However, this might not work properly, for example, when the management network
 828 has a problem and the host is reported faulty while a VM is still running on it.
 829 This might lead to launching the same VM instance twice, causing problems. Also,
 830 the NB IF message needs a fault reason, and for that the source needs to be a
 831 tool that detects different kinds of faults, as Doctor will be doing. This BP
 832 might also need enhancement to change server and service states correctly.
 833
 834 .. [*] https://blueprints.launchpad.net/nova/+spec/pacemaker-servicegroup-driver
 835
 836 ..
 837  vim: set tabstop=4 expandtab textwidth=80: