requirements/05-implementation.rst

Detailed architecture and interface specification
=================================================

This section describes a detailed implementation plan, which is based on the
high-level architecture introduced in Section 3. Section 5.1 describes the
functional blocks of the Doctor architecture, which is followed by a high-level
message flow in Section 5.2. Section 5.3 provides a mapping of selected existing
open source components to the building blocks of the Doctor architecture.
The selection of components is based on their maturity and the gap
analysis executed in Section 4. Sections 5.4 and 5.5 detail the related
information elements and the specification of the northbound interface,
respectively. Finally, Section 5.6 provides a first set of blueprints to address
selected gaps that must be closed to realize the functionalities of the Doctor
project.

.. _impl_fb:

Functional Blocks
-----------------

This section introduces the functional blocks that form the VIM. OpenStack was
selected as the candidate for implementation. Inside the VIM, four different
building blocks are defined (see :numref:`figure6`).

.. figure:: images/figure6.png
   :name: figure6
   :width: 100%

   Functional blocks

Monitor
^^^^^^^

The Monitor module is responsible for monitoring the virtualized
infrastructure. Many existing tools and services (e.g. Zabbix) already
monitor different aspects of hardware and software resources and can be
used for this purpose.

Inspector
^^^^^^^^^

The Inspector module has the ability a) to receive various failure notifications
regarding physical resource(s) from Monitor module(s), b) to find the affected
virtual resource(s) by querying the resource map in the Controller, and c) to
update the state of the virtual resource (and physical resource).

The Inspector has drivers for different types of events and resources to
integrate any type of Monitor and Controller modules. It also uses a failure
policy database to decide on the failure selection and aggregation from raw
events. This failure policy database is configured by the Administrator.

The reason for separation of the Inspector and Controller modules is to make the
Controller focus on simple operations by avoiding a tight integration of various
health check mechanisms into the Controller.
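
The three Inspector duties a) to c) can be sketched as follows. This is a
minimal illustration and not the Doctor implementation: the ``ResourceMap``
class, the ``FAILURE_POLICY`` table, the event type strings, and the event
field names are all hypothetical and chosen only for this example.

```python
from dataclasses import dataclass, field

# Hypothetical failure policy: raw event types the Administrator treats as failures.
FAILURE_POLICY = {"compute.host.down", "nic.link.down"}


@dataclass
class ResourceMap:
    """Stand-in for the Controller's physical-to-virtual resource map."""
    hosts: dict = field(default_factory=dict)   # host id -> list of VM ids
    state: dict = field(default_factory=dict)   # resource id -> state string

    def virtual_resources_on(self, host_id):
        return list(self.hosts.get(host_id, []))

    def set_state(self, resource_id, new_state):
        self.state[resource_id] = new_state


def inspect(raw_event, resource_map):
    """Handle one raw event from a Monitor; return the affected virtual resources."""
    # a) Receive the notification and apply the failure policy.
    if raw_event["type"] not in FAILURE_POLICY:
        return []
    host = raw_event["physical_resource_id"]
    # b) Find the affected virtual resources through the resource map.
    affected = resource_map.virtual_resources_on(host)
    # c) Update the state of the physical and the virtual resources.
    resource_map.set_state(host, "down")
    for vm in affected:
        resource_map.set_state(vm, "error")
    return affected


rmap = ResourceMap(hosts={"host-1": ["vm-a", "vm-b"], "host-2": ["vm-c"]})
affected = inspect({"type": "compute.host.down", "physical_resource_id": "host-1"}, rmap)
ignored = inspect({"type": "cpu.temperature.warning", "physical_resource_id": "host-2"}, rmap)
print(affected, ignored)
```

Keeping the policy check and the lookup in the Inspector means the resource map
itself only needs the two simple operations shown, which matches the stated
reason for separating the Inspector from the Controller.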

Controller
^^^^^^^^^^

The Controller is responsible for maintaining the resource map (i.e. the mapping
from physical resources to virtual resources), accepting update requests for the
resource state(s) (exposed as a provider API), and sending all failure events
regarding virtual resources to the Notifier. Optionally, the Controller has the
ability to poison the state of virtual resources mapping to physical resources
for which it has received failure notifications from the Inspector. The
Controller also re-calculates the capacity of the NFVI when receiving a failure
notification for a physical resource.

In a real-world deployment, the VIM may have several controllers, one for each
resource type, such as Nova, Neutron and Cinder in OpenStack. Each controller
maintains a database of virtual and physical resources which shall be the master
source for resource information inside the VIM.

Notifier
^^^^^^^^

The focus of the Notifier is on selecting and aggregating failure events
received from the Controller based on policies mandated by the Consumer.
Therefore, it allows the Consumer to subscribe to alarms regarding virtual
resources using a method such as an API endpoint. After receiving a fault
event from a Controller, it will notify the Consumer of the fault by referring
to the alarm configuration which was defined by the Consumer earlier on.

To reduce the complexity of the Controller, it is a good approach for the
Controllers to emit all notifications without any filtering mechanism and have
another service (i.e. the Notifier) handle those notifications properly. This is
the general philosophy of notifications in OpenStack. Note that a fault message
consumed by the Notifier is different from the fault message received by the
Inspector; the former message is related to virtual resources which are visible
to users with relevant ownership, whereas the latter is related to raw devices
or small entities which should be handled with an administrator privilege.

The northbound interface between the Notifier and the Consumer/Administrator is
specified in :ref:`impl_nbi`.
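
The division of labour described above (Controllers emit every event, the
Notifier filters per Consumer subscription) might look like the sketch below.
The ``Notifier`` class, its method names, and the event dictionary layout are
assumptions made for illustration; only the rule that a fault must have a
severity higher than MinSeverity is taken from the information elements in this
document.

```python
import itertools


class Notifier:
    """Toy Notifier: stores Consumer subscriptions and filters Controller events."""

    def __init__(self):
        self._ids = itertools.count(1)
        self._subscriptions = {}  # subscription id -> (filter dict, callback)

    def subscribe(self, callback, resource_ids=None, min_severity=0):
        """Register a Consumer callback together with its alarm configuration."""
        sub_id = next(self._ids)
        alarm_filter = {"resource_ids": resource_ids, "min_severity": min_severity}
        self._subscriptions[sub_id] = (alarm_filter, callback)
        return sub_id

    def on_controller_event(self, event):
        """Called for every unfiltered event emitted by a Controller."""
        for alarm_filter, callback in self._subscriptions.values():
            ids = alarm_filter["resource_ids"]
            if ids is not None and event["virtual_resource_id"] not in ids:
                continue
            # Only faults with a severity higher than MinSeverity are notified.
            if event["severity"] <= alarm_filter["min_severity"]:
                continue
            callback(event)


received = []
notifier = Notifier()
sub_id = notifier.subscribe(received.append, resource_ids={"vm-a"}, min_severity=2)
notifier.on_controller_event({"virtual_resource_id": "vm-a", "severity": 5})
notifier.on_controller_event({"virtual_resource_id": "vm-a", "severity": 1})
notifier.on_controller_event({"virtual_resource_id": "vm-b", "severity": 9})
print(received)
```

Only the first event reaches the Consumer: the second is below the severity
threshold and the third concerns a resource the Consumer did not subscribe to.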

Sequence
--------

Fault Management
^^^^^^^^^^^^^^^^

The detailed work flow for fault management is as follows (see also :numref:`figure7`):

1. Request to subscribe to monitor specific virtual resources. A query filter
   can be used to narrow down the alarms the Consumer wants to be informed
   about.
2. Each subscription request is acknowledged with a subscribe response message.
   The response message contains information about the subscribed virtual
   resources, in particular if a subscribed virtual resource is in "alarm"
   state.
3. The NFVI sends monitoring events for resources the VIM has been subscribed
   to. Note: this subscription message exchange between the VIM and NFVI is not
   shown in this message flow.
4. Event correlation, fault detection and aggregation in VIM.
5. Database lookup to find the virtual resources affected by the detected fault.
6. Fault notification to Consumer.
7. The Consumer switches to standby configuration (STBY).
8. Instructions to VIM requesting certain actions to be performed on the
   affected resources, for example migrate/update/terminate specific
   resource(s). After reception of such instructions, the VIM executes the
   requested action, e.g. it will migrate or terminate a virtual resource.

   a. Query request from Consumer to VIM to get information about the current
      status of a resource.
   b. Response to the query request with information about the current status of
      the queried resource. In case the resource is in "fault" state, information
      about the related fault(s) is returned.

In order to allow for quick reaction to failures, the time interval between
fault detection in step 3 and the corresponding recovery actions in steps 7 and 8
shall be less than 1 second.

.. figure:: images/figure7.png
   :name: figure7
   :width: 100%

   Fault management work flow

.. figure:: images/figure8.png
   :name: figure8
   :width: 100%

   Fault management scenario

:numref:`figure8` shows a more detailed message flow (Steps 4 to 6) between
the four building blocks introduced in :ref:`impl_fb`.

4. The Monitor observes a fault in the NFVI and reports the raw fault to the
   Inspector.
   The Inspector filters and aggregates the faults using pre-configured
   failure policies.

5.
   a) The Inspector queries the Resource Map to find the virtual resources
      affected by the raw fault in the NFVI.
   b) The Inspector updates the state of the affected virtual resources in the
      Resource Map.
   c) The Controller observes a change of the virtual resource state and informs
      the Notifier about the state change and the related alarm(s).
      Alternatively, the Inspector may directly inform the Notifier about it.

6. The Notifier performs another filtering and aggregation of the changes
   and alarms based on the pre-configured alarm configuration. Finally, a fault
   notification is sent northbound to the Consumer.

NFVI Maintenance
^^^^^^^^^^^^^^^^

The detailed work flow for NFVI maintenance is shown in :numref:`figure9`
and has the following steps. Note that steps 1, 2, and 5 to 8a in the NFVI
maintenance work flow are very similar to the steps in the fault management work
flow and share a similar implementation plan in Release 1.

1. Subscribe to fault/maintenance notifications.
2. Response to subscribe request.
3. Maintenance trigger received from the Administrator.
4. VIM switches NFVI resources to "maintenance" state. This means, for example,
   that they should not be used for further allocation/migration requests.
5. Database lookup to find the virtual resources affected by the detected
   maintenance operation.
6. Maintenance notification to Consumer.
7. The Consumer switches to standby configuration (STBY).
8. Instructions from Consumer to VIM requesting certain recovery actions to be
   performed (step 7a). After reception of such instructions, the VIM
   executes the requested action in order to empty the physical resources (step
   7b).
9. Maintenance response from VIM to inform the Administrator that the physical
   machines have been emptied (or the operation resulted in an error state).
10. The Administrator coordinates and executes the maintenance operation/work
    on the NFVI.

    A) Query request from Administrator to VIM to get information about the
       current state of a resource.
    B) Response to the query request with information about the current state of
       the queried resource(s). In case the resource is in "maintenance" state,
       information about the related maintenance operation is returned.

.. figure:: images/figure9.png
   :name: figure9
   :width: 100%

   NFVI maintenance work flow

.. figure:: images/figure10.png
   :name: figure10
   :width: 100%

   NFVI Maintenance implementation plan

:numref:`figure10` shows a more detailed message flow (Steps 4 to 6)
between the four building blocks introduced in Section 5.1.

3. The Administrator sends a StateChange request to the Controller residing
   in the VIM.
4. The Controller queries the Resource Map to find the virtual resources
   affected by the planned maintenance operation.
5.

   a) The Controller updates the state of the affected virtual resources in the
      Resource Map database.

   b) The Controller informs the Notifier about the virtual resources that will
      be affected by the maintenance operation.

6. A maintenance notification is sent northbound to the Consumer.

...

9. The Controller informs the Administrator after the physical resources have
   been freed.
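
The Controller-side part of this flow (steps 3 to 6, plus the check that
precedes step 9) can be sketched as below. The function names, the plain
dictionaries standing in for the Resource Map, and the ``notify_consumer``
callback are hypothetical; the sketch only mirrors the order of operations
listed above.

```python
def handle_state_change(host_id, resource_map, states, notify_consumer):
    """Steps 3 to 6: mark a host for maintenance and notify about its virtual resources."""
    states[host_id] = "maintenance"                 # step 3: StateChange request accepted
    affected = list(resource_map.get(host_id, []))  # step 4: look up virtual resources
    for vm in affected:
        states[vm] = "maintenance"                  # step 5a: update the Resource Map
    notify_consumer(affected)                       # steps 5b and 6: notify via the Notifier
    return affected


def host_is_empty(host_id, resource_map):
    """Precondition for step 9: all virtual resources have been moved away."""
    return not resource_map.get(host_id)


resource_map = {"host-1": ["vm-a", "vm-b"], "host-2": []}
states = {}
notified = []
affected = handle_state_change("host-1", resource_map, states, notified.extend)

# The Consumer reacts (steps 7 and 8), e.g. by asking the VIM to migrate its VMs.
resource_map["host-2"].extend(resource_map["host-1"])
resource_map["host-1"].clear()
ready = host_is_empty("host-1", resource_map)
print(affected, ready)
```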

Implementation plan for OPNFV Release 1
---------------------------------------

Fault management
^^^^^^^^^^^^^^^^

:numref:`figure11` shows the implementation plan based on OpenStack and
related components as planned for Release 1. Here, the Monitor can be realized
by Zabbix. The Controller is realized by OpenStack Nova [NOVA]_, Neutron
[NEUT]_, and Cinder [CIND]_ for compute, network, and storage,
respectively. The Inspector can be realized by Monasca [MONA]_ or a simple
script querying Nova in order to map between physical and virtual resources. The
Notifier will be realized by Ceilometer [CEIL]_ receiving failure events
on its notification bus.

:numref:`figure12` shows the inner workings of Ceilometer. After receiving
an "event" on its notification bus, first a notification agent will grab the
event and send a "notification" to the Collector. The Collector writes the
notifications received to the Ceilometer databases.

In the existing Ceilometer implementation, an alarm evaluator periodically
polls those databases through the APIs provided. If it finds new alarms, it
will evaluate them based on the pre-defined alarm configuration, and depending
on the configuration, it will hand a message to the Alarm Notifier, which in
turn will send the alarm message northbound to the Consumer. :numref:`figure12`
also shows an optimized work flow for Ceilometer with the goal of
reducing the delay for fault notifications to the Consumer. The approach is to
implement a new notification agent (called "publisher" in Ceilometer
terminology) which sends the alarm directly through the "Notification Bus"
to a new "Notification-driven Alarm Evaluator (NAE)" (see Sections 5.6.2 and
5.6.3), thereby bypassing the Collector and avoiding the additional delay of the
existing polling-based alarm evaluator. The NAE is similar to the OpenStack
"Alarm Evaluator", but is triggered by incoming notifications instead of
periodically polling the OpenStack "Alarms" database for new alarms. The
Ceilometer "Alarms" database can hold three states: "normal", "insufficient
data", and "fired". It represents a persistent alarm database. In order to
realize the Doctor requirements, we need to define new "meters" in the database
(see Section 5.6.1).
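
The difference between the two evaluation styles can be illustrated with a
small model. This is not Ceilometer code: ``polling_delay``, the
``NotificationDrivenEvaluator`` class, and the 60-unit polling interval are
assumptions used only to show why a polling evaluator adds delay while a
notification-driven one reacts as soon as the notification arrives.

```python
import math


def polling_delay(event_time, poll_interval):
    """Delay until a polling evaluator that runs at t = 0, T, 2T, ... sees the event."""
    next_poll = math.ceil(event_time / poll_interval) * poll_interval
    return next_poll - event_time


class NotificationDrivenEvaluator:
    """Evaluates an alarm as soon as a notification arrives on the bus."""

    def __init__(self, alarm_notifier):
        self.alarm_notifier = alarm_notifier

    def on_notification(self, notification):
        # No database round trip and no waiting for the next polling cycle.
        if notification["state"] == "error":
            self.alarm_notifier(notification)


fired = []
nae = NotificationDrivenEvaluator(fired.append)
nae.on_notification({"resource": "vm-a", "state": "error"})
nae.on_notification({"resource": "vm-b", "state": "normal"})

# An event arriving just after a poll waits almost a full interval.
delay = polling_delay(event_time=61, poll_interval=60)
print(delay, fired)
```

In this model the worst-case delay of the polling evaluator approaches one full
polling interval, which is the delay the NAE is meant to avoid.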

.. figure:: images/figure11.png
   :name: figure11
   :width: 100%

   Implementation plan in OpenStack (OPNFV Release 1 "Arno")


.. figure:: images/figure12.png
   :name: figure12
   :width: 100%

   Implementation plan in Ceilometer architecture


NFVI Maintenance
^^^^^^^^^^^^^^^^

For NFVI Maintenance, a quite similar implementation plan exists. Instead of a
raw fault being observed by the Monitor, the Administrator sends a
Maintenance Request through the northbound interface towards the Controller
residing in the VIM. Similar to the Fault Management use case, the Controller
(in our case OpenStack Nova) will send a maintenance event to the Notifier (i.e.
Ceilometer in our implementation). Within Ceilometer, the same workflow as
described in the previous section applies. In addition, the Controller(s) will
take appropriate actions to evacuate the physical machines in order to prepare
them for the planned maintenance operation. After the physical machines are
emptied, the Controller will inform the Administrator that it can initiate the
maintenance.

Information elements
--------------------

This section introduces all attributes and information elements used in the
messages exchanged on the northbound interfaces between the VIM and the NFVO and
VNFM.

Note: The information elements will be aligned with current work in the ETSI NFV
IFA working group.


Simple information elements:

* SubscriptionID: identifies a subscription to receive fault or maintenance
  notifications.
* NotificationID: identifies a fault or maintenance notification.
* VirtualResourceID (Identifier): identifies a virtual resource affected by a
  fault or a maintenance action of the underlying physical resource.
* PhysicalResourceID (Identifier): identifies a physical resource affected by a
  fault or maintenance action.
* VirtualResourceState (String): state of a virtual resource, e.g. "normal",
  "maintenance", "down", "error".
* PhysicalResourceState (String): state of a physical resource, e.g. "normal",
  "maintenance", "down", "error".
* VirtualResourceType (String): type of the virtual resource, e.g. "virtual
  machine", "virtual memory", "virtual storage", "virtual CPU", or "virtual
  NIC".
* FaultID (Identifier): identifies the related fault in the underlying physical
  resource. This can be used to correlate different fault notifications caused
  by the same fault in the physical resource.
* FaultType (String): Type of the fault. The allowed values for this parameter
  depend on the type of the related physical resource. For example, a resource
  of type "compute hardware" may have faults of type "CPU failure", "memory
  failure", "network card failure", etc.
* Severity (Integer): value expressing the severity of the fault. The higher the
  value, the more severe the fault.
* MinSeverity (Integer): value used in filter information elements. Only faults
  with a severity higher than the MinSeverity value will be notified to the
  Consumer.
* EventTime (Datetime): Time when the fault was observed.
* EventStartTime and EventEndTime (Datetime): Datetime range that can be used in
  a FaultQueryFilter to narrow down the faults to be queried.
* ProbableCause: information about the probable cause of the fault.
* CorrelatedFaultID (Identifier): list of other faults correlated to this fault.
* isRootCause (Boolean): Parameter indicating if this fault is the root for
  other correlated faults. If TRUE, then the faults listed in the parameter
  CorrelatedFaultID are caused by this fault.
* FaultDetails (Key-value pair): provides additional information about the
  fault, e.g. information about the threshold, monitored attributes, indication
  of the trend of the monitored parameter.
* FirmwareVersion (String): current version of the firmware of a physical
  resource.
* HypervisorVersion (String): current version of a hypervisor.
* ZoneID (Identifier): Identifier of the resource zone. A resource zone is the
  logical separation of physical and software resources in an NFVI deployment
  for physical isolation, redundancy, or administrative designation.
* Metadata (Key-Value-Pairs): provides additional information of a physical
  resource in maintenance/error state.

Complex information elements (see also UML diagrams in :numref:`figure13`
and :numref:`figure14`):

* VirtualResourceInfoClass:

  + VirtualResourceID [1] (Identifier)
  + VirtualResourceState [1] (String)
  + Faults [0..*] (FaultClass): For each resource, all faults
    including detailed information about the faults are provided.

* FaultClass: The parameters of the FaultClass are partially based on ETSI TS
  132 111-2 (V12.1.0) [*]_, which specifies fault management in 3GPP, in
  particular describing the information elements used for alarm notifications.

  - FaultID [1] (Identifier)
  - FaultType [1]
  - Severity [1] (Integer)
  - EventTime [1] (Datetime)
  - ProbableCause [1]
  - CorrelatedFaultID [0..*] (Identifier)
  - FaultDetails [0..*] (Key-value pair)

.. [*] http://www.etsi.org/deliver/etsi_ts/132100_132199/13211102/12.01.00_60/ts_13211102v120100p.pdf

* SubscribeFilterClass

  - VirtualResourceType [0..*] (String)
  - VirtualResourceID [0..*] (Identifier)
  - FaultType [0..*] (String)
  - MinSeverity [0..1] (Integer)

* FaultQueryFilterClass: narrows down the FaultQueryRequest, for example it
  limits the query to certain physical resources, a certain zone, a given fault
  type/severity/cause, or a specific FaultID.

  - VirtualResourceType [0..*] (String)
  - VirtualResourceID [0..*] (Identifier)
  - FaultType [0..*] (String)
  - MinSeverity [0..1] (Integer)
  - EventStartTime [0..1] (Datetime)
  - EventEndTime [0..1] (Datetime)

* PhysicalResourceStateClass:

  - PhysicalResourceID [1] (Identifier)
  - PhysicalResourceState [1] (String): mandates the new state of the physical
    resource.

* PhysicalResourceInfoClass:

  - PhysicalResourceID [1] (Identifier)
  - PhysicalResourceState [1] (String)
  - FirmwareVersion [0..1] (String)
  - HypervisorVersion [0..1] (String)
  - ZoneID [0..1] (Identifier)

* StateQueryFilterClass: narrows down a StateQueryRequest, for example it limits
  the query to certain physical resources, a certain zone, or a given resource
  state (e.g., only resources in "maintenance" state).

  - PhysicalResourceID [1] (Identifier)
  - PhysicalResourceState [1] (String)
  - ZoneID [0..1] (Identifier)
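
One possible in-memory representation of some of the complex information
elements above is sketched here with Python dataclasses. The class and field
names are a snake_case rendering chosen for this example, identifiers are
assumed to be strings, and the sample values are invented; the multiplicities
in the comments follow the lists above.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, List, Optional


@dataclass
class Fault:
    """FaultClass."""
    fault_id: str                                                    # [1]
    fault_type: str                                                  # [1]
    severity: int                                                    # [1]
    event_time: datetime                                             # [1]
    probable_cause: str                                              # [1]
    correlated_fault_ids: List[str] = field(default_factory=list)    # [0..*]
    fault_details: Dict[str, str] = field(default_factory=dict)      # [0..*]


@dataclass
class VirtualResourceInfo:
    """VirtualResourceInfoClass."""
    virtual_resource_id: str                                         # [1]
    virtual_resource_state: str                                      # [1]
    faults: List[Fault] = field(default_factory=list)                # [0..*]


@dataclass
class PhysicalResourceInfo:
    """PhysicalResourceInfoClass."""
    physical_resource_id: str                                        # [1]
    physical_resource_state: str                                     # [1]
    firmware_version: Optional[str] = None                           # [0..1]
    hypervisor_version: Optional[str] = None                         # [0..1]
    zone_id: Optional[str] = None                                    # [0..1]


# Invented sample values, for illustration only.
fault = Fault("fault-1", "CPU failure", 5, datetime(2015, 1, 1, 12, 0), "overheating")
vm = VirtualResourceInfo("vm-a", "error", faults=[fault])
host = PhysicalResourceInfo("host-1", "down", zone_id="zone-1")
print(vm, host)
```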

.. _impl_nbi:

Detailed northbound interface specification
-------------------------------------------

This section specifies the northbound interfaces for fault management and
NFVI maintenance between the VIM on the one end and the Consumer and the
Administrator on the other end. For each interface, all messages and related
information elements are provided.

Note: The interface definition will be aligned with current work in the ETSI NFV
IFA working group.

All of the interfaces described below are produced by the VIM and consumed by
the Consumer or Administrator.

Fault management interface
^^^^^^^^^^^^^^^^^^^^^^^^^^

This interface allows the VIM to notify the Consumer about a virtual resource
that is affected by a fault, either within the virtual resource itself or in the
underlying virtualization infrastructure. The messages on this interface are
shown in :numref:`figure13` and explained in detail in the following
subsections.

Note: The information elements used in this section are described in detail in
Section 5.4.

.. figure:: images/figure13.png
   :name: figure13
   :width: 100%

   Fault management NB I/F messages


SubscribeRequest (Consumer -> VIM)
__________________________________

Subscription from Consumer to VIM to be notified about faults of specific
resources. The faults to be notified about can be narrowed down using a
subscribe filter.

Parameters:

- SubscribeFilter [1] (SubscribeFilterClass): Optional information to narrow
  down the faults that shall be notified to the Consumer, for example limit to
  specific VirtualResourceID(s), severity, or cause of the alarm.

SubscribeResponse (VIM -> Consumer)
___________________________________

Response to a subscribe request message including information about the
subscribed resources, in particular if they are in "fault/error" state.

Parameters:

* SubscriptionID [1] (Identifier): Unique identifier for the subscription. It
  can be used to delete or update the subscription.
* VirtualResourceInfo [0..*] (VirtualResourceInfoClass): Provides additional
  information about the subscribed resources, i.e., a list of the related
  resources, the current state of the resources, etc.

FaultNotification (VIM -> Consumer)
___________________________________

Notification about a virtual resource that is affected by a fault, either within
the virtual resource itself or in the underlying virtualization infrastructure.
After receiving this notification, the Consumer will decide on the optimal
action to resolve the fault. This includes actions like switching to a hot
standby virtual resource, migration of the faulty virtual resource to another
physical machine, termination of the faulty virtual resource and instantiation
of a new virtual resource in order to provide a new hot standby resource.
Existing resource management interfaces and messages between the Consumer and
the VIM can be used for those actions, and there is no need to define additional
actions on the Fault Management Interface.

Parameters:

* NotificationID [1] (Identifier): Unique identifier for the notification.
* VirtualResourceInfo [1..*] (VirtualResourceInfoClass): List of faulty
  resources with detailed information about the faults.

FaultQueryRequest (Consumer -> VIM)
___________________________________

Request to find out about active alarms at the VIM. A FaultQueryFilter can be
used to narrow down the alarms returned in the response message.

Parameters:

* FaultQueryFilter [1] (FaultQueryFilterClass): narrows down the
  FaultQueryRequest, for example it limits the query to certain physical
  resources, a certain zone, a given fault type/severity/cause, or a specific
  FaultID.

FaultQueryResponse (VIM -> Consumer)
____________________________________

List of active alarms at the VIM matching the FaultQueryFilter specified in the
FaultQueryRequest.

Parameters:

* VirtualResourceInfo [0..*] (VirtualResourceInfoClass): List of faulty
  resources. For each resource all faults including detailed information about
  the faults are provided.
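
How a FaultQueryRequest could be evaluated on the VIM side is sketched below.
The ``matches`` and ``fault_query`` functions, the dictionary keys, and the
sample faults are hypothetical; the sketch assumes that every filter attribute
that is set must match, that MinSeverity is an exclusive lower bound as defined
for the simple information elements, and that the time range is inclusive.

```python
from datetime import datetime


def matches(fault, query_filter):
    """Return True if a fault record passes a FaultQueryFilter-like dict."""
    types = query_filter.get("fault_types")
    if types and fault["fault_type"] not in types:
        return False
    min_severity = query_filter.get("min_severity")
    if min_severity is not None and fault["severity"] <= min_severity:
        return False
    start = query_filter.get("event_start_time")
    if start is not None and fault["event_time"] < start:
        return False
    end = query_filter.get("event_end_time")
    if end is not None and fault["event_time"] > end:
        return False
    return True


def fault_query(active_faults, query_filter):
    """Sketch of a FaultQueryResponse: fault IDs grouped by virtual resource ID."""
    response = {}
    for fault in active_faults:
        if matches(fault, query_filter):
            response.setdefault(fault["virtual_resource_id"], []).append(fault["fault_id"])
    return response


# Invented sample data, for illustration only.
active = [
    {"fault_id": "f1", "virtual_resource_id": "vm-a", "fault_type": "CPU failure",
     "severity": 5, "event_time": datetime(2015, 1, 1, 12, 0)},
    {"fault_id": "f2", "virtual_resource_id": "vm-a", "fault_type": "memory failure",
     "severity": 2, "event_time": datetime(2015, 1, 1, 12, 5)},
    {"fault_id": "f3", "virtual_resource_id": "vm-b", "fault_type": "CPU failure",
     "severity": 7, "event_time": datetime(2015, 1, 2, 9, 0)},
]
result = fault_query(active, {"min_severity": 3,
                              "event_end_time": datetime(2015, 1, 1, 23, 59)})
print(result)
```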

NFVI maintenance
^^^^^^^^^^^^^^^^

The NFVI maintenance interface Consumer-VIM allows the Consumer to subscribe to
maintenance notifications provided by the VIM. The related maintenance interface
Administrator-VIM allows the Administrator to issue maintenance requests to the
VIM, i.e. requesting the VIM to take appropriate actions to empty physical
machine(s) in order to execute maintenance operations on them. The interface
also allows the Administrator to query the state of physical machines, e.g., in
order to get details on the current status of a maintenance operation like a
firmware update.

The messages defined in these northbound interfaces are shown in :numref:`figure14`
and described in detail in the following subsections.

.. figure:: images/figure14.png
   :name: figure14
   :width: 100%

   NFVI maintenance NB I/F messages

SubscribeRequest (Consumer -> VIM)
__________________________________

Subscription from Consumer to VIM to be notified about maintenance operations
for specific virtual resources. The resources to be informed about can be
narrowed down using a subscribe filter.

Parameters:

* SubscribeFilter [1] (SubscribeFilterClass): Information to narrow down the
  faults that shall be notified to the Consumer, for example limit to specific
  virtual resource type(s).

SubscribeResponse (VIM -> Consumer)
___________________________________

Response to a subscribe request message, including information about the
subscribed virtual resources, in particular if they are in "maintenance" state.

Parameters:

* SubscriptionID [1] (Identifier): Unique identifier for the subscription. It
  can be used to delete or update the subscription.
* VirtualResourceInfo [0..*] (VirtualResourceInfoClass): Provides additional
  information about the subscribed virtual resource(s), e.g., the ID, type and
  current state of the resource(s).

MaintenanceNotification (VIM -> Consumer)
_________________________________________

Notification about a physical resource switched to "maintenance" state. After
receiving this notification, the Consumer will decide on the optimal action to
address it, e.g., to switch to the standby (STBY) configuration.

Parameters:

* VirtualResourceInfo [1..*] (VirtualResourceInfoClass): List of virtual
  resources whose state has been changed to maintenance.

StateChangeRequest (Administrator -> VIM)
_________________________________________

Request to change the state of a list of physical resources, e.g. to
"maintenance" state, in order to prepare them for a planned maintenance
operation.

Parameters:

* PhysicalResourceState [1..*] (PhysicalResourceStateClass)

StateChangeResponse (VIM -> Administrator)
__________________________________________

Response message to inform the Administrator that the requested resources are
now in maintenance state (or the operation resulted in an error) and the
maintenance operation(s) can be executed.

Parameters:

* PhysicalResourceInfo [1..*] (PhysicalResourceInfoClass)

StateQueryRequest (Administrator -> VIM)
________________________________________

In this procedure, the Administrator would like to get the information about
physical machine(s), e.g. their state ("normal", "maintenance"), firmware
version, hypervisor version, update status of firmware and hypervisor, etc. It
can be used to check the progress during a firmware update and to confirm the
result after the update. A filter can be used to narrow down the resources
returned in the response message.

Parameters:

* StateQueryFilter [1] (StateQueryFilterClass): narrows down the
  StateQueryRequest, for example it limits the query to certain physical
  resources, a certain zone, or a given resource state.

StateQueryResponse (VIM -> Administrator)
_________________________________________

List of physical resources matching the filter specified in the
StateQueryRequest.

Parameters:

* PhysicalResourceInfo [0..*] (PhysicalResourceInfoClass): List of physical
  resources. For each resource, information about the current state, the
  firmware version, etc. is provided.
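
A StateQueryRequest with its filter might be answered as in the following
sketch. The ``state_query`` function, its keyword arguments, and the inventory
entries are assumptions for illustration; every filter attribute is treated as
optional here so that an empty filter returns all physical resources.

```python
def state_query(inventory, resource_ids=None, state=None, zone_id=None):
    """Return the inventory entries that match every filter attribute that is set."""
    result = []
    for info in inventory:
        if resource_ids is not None and info["physical_resource_id"] not in resource_ids:
            continue
        if state is not None and info["physical_resource_state"] != state:
            continue
        if zone_id is not None and info.get("zone_id") != zone_id:
            continue
        result.append(info)
    return result


# Invented inventory, for illustration only.
inventory = [
    {"physical_resource_id": "host-1", "physical_resource_state": "maintenance",
     "firmware_version": "1.2", "zone_id": "zone-1"},
    {"physical_resource_id": "host-2", "physical_resource_state": "normal",
     "firmware_version": "1.1", "zone_id": "zone-1"},
    {"physical_resource_id": "host-3", "physical_resource_state": "maintenance",
     "firmware_version": "1.1", "zone_id": "zone-2"},
]
in_maintenance = state_query(inventory, state="maintenance", zone_id="zone-1")
print([info["physical_resource_id"] for info in in_maintenance])
```

An Administrator could repeat such a query during a firmware update and compare
the returned firmware versions to follow the progress.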

Blueprints
----------

This section lists a first set of blueprints that have been proposed by the
Doctor project to the open source community. Further blueprints addressing other
gaps identified in Section 4 will be submitted at a later stage of OPNFV. In
this section the following definitions are used:

* "Event" is a message emitted by other OpenStack services such as Nova and
  Neutron and is consumed by the "Notification Agents" in Ceilometer.
* "Notification" is a message generated by a "Notification Agent" in Ceilometer
  based on an "event" and is delivered to the "Collectors" in Ceilometer that
  store those notifications (as "sample") to the Ceilometer "Databases".

Instance State Notification (Ceilometer) [*]_
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The Doctor project is planning to handle "events" and "notifications" regarding
Resource Status: Instance State, Port State, Host State, etc. Currently,
Ceilometer already receives "events" to identify the state of those resources,
but it does not handle and store them yet. This is why we also need a new event
definition to capture those resource states from "events" created by other
services.

This BP proposes to add a new compute notification state to handle events from
an instance (server) from Nova. It also creates a new meter "instance.state" in
OpenStack.

.. [*] https://etherpad.opnfv.org/p/doctor_bps
 669
 670 Event Publisher for Alarm  (Ceilometer) [*]_
 671 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 672
 673 **Problem statement:**
 674
 675   The existing "Alarm Evaluator" in OpenStack Ceilometer is periodically
 676   querying/polling the databases in order to check all alarms independently from
 677   other processes. This is adding additional delay to the fault notification
 678   send to the Consumer, whereas one requirement of Doctor is to react on faults
 679   as fast as possible.
 680
  The existing message flow is shown in :numref:`figure12`: after receiving
  an "event", a "notification agent" (i.e. "event publisher") sends a
  "notification" to a "Collector". The "Collector" collects the notifications
  and updates the Ceilometer "Meter" database, which stores the "samples"
  captured from the original "events". The "Alarm Evaluator" periodically polls
  the databases and then queries the "Meter" database according to each alarm
  configuration.
 688
  In the current Ceilometer implementation, there is no way to directly
  trigger the "Alarm Evaluator" when a new "event" is received. The "Alarm
  Evaluator" only finds out that it needs to fire a new notification to the
  Consumer when it polls the database.
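
  The delay added by polling can be estimated with a minimal sketch. The
  function name and the 60-second interval are illustrative assumptions for
  this example only; they are not Ceilometer APIs or defaults:

  .. code-block:: python

     import math


     def polling_delay(event_time: float, interval: float) -> float:
         """Return how long an event waits for the next evaluation cycle.

         Assumes (hypothetically) that evaluation cycles run at
         t = 0, interval, 2 * interval, ... and that an event arriving
         exactly on a cycle boundary is picked up by that cycle.
         """
         if interval <= 0:
             raise ValueError("interval must be positive")
         next_poll = math.ceil(event_time / interval) * interval
         return next_poll - event_time


     # An event arriving 12 s after a cycle waits 48 s with a 60 s interval.
     print(polling_delay(12.0, 60.0))

  Under these assumptions the worst-case delay approaches one full polling
  interval, regardless of how quickly the "event" itself was delivered.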
 693
 694 **Change/feature request:**
 695
  This BP proposes to add a new "event publisher for alarm", which bypasses
  several steps in Ceilometer in order to avoid the polling-based approach of
  the existing Alarm Evaluator, which makes notifications to users slow.
 699
 700   After receiving an "(alarm) event" by listening on the Ceilometer message
 701   queue ("notification bus"), the new "event publisher for alarm" immediately
 702   hands a "notification" about this event to a new Ceilometer component
 703   "Notification-driven alarm evaluator" proposed in the other BP (see Section
 704   5.6.3).
 705
 706   Note, the term "publisher" refers to an entity in the Ceilometer architecture
 707   (it is a "notification agent"). It offers the capability to provide
 708   notifications to other services outside of Ceilometer, but it is also used to
 709   deliver notifications to other Ceilometer components (e.g. the "Collectors")
 710   via the Ceilometer "notification bus".
 711
 712 **Implementation detail**
 713
  * "Event publisher for alarm" is part of Ceilometer.
  * The standard AMQP message queue is used with a new topic string.
  * No new interfaces have to be added to Ceilometer.
  * "Event publisher for Alarm" can be configured by the Administrator of
    Ceilometer to be used as "Notification Agent" in addition to the existing
    "Notifier".
  * Existing alarm mechanisms of Ceilometer can be used, allowing users to
    configure how to distribute the "notifications" transformed from "events",
    e.g. there is an option whether an ongoing alarm is re-issued or not
    ("repeat_actions").
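
  The effect of the "repeat_actions" option can be illustrated with a small
  sketch. The function ``should_notify`` is a hypothetical helper written for
  this example, and its rule is a simplified assumption about the alarm
  behaviour, not the Ceilometer implementation:

  .. code-block:: python

     def should_notify(previous_state: str, new_state: str,
                       repeat_actions: bool) -> bool:
         """Decide whether an alarm action is sent after an evaluation.

         Simplified assumption: an action is sent on every state
         transition, and also on every evaluation in which the state is
         unchanged if repeat_actions is enabled.
         """
         return previous_state != new_state or repeat_actions


     # Ongoing alarm, repeat_actions disabled: nothing is re-issued.
     print(should_notify("alarm", "alarm", repeat_actions=False))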
 724
 725 .. [*] https://etherpad.opnfv.org/p/doctor_bps
 726
 727 Notification-driven alarm evaluator (Ceilometer) [*]_
 728 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 729
 730 **Problem statement:**
 731
The existing "Alarm Evaluator" in OpenStack Ceilometer periodically
queries/polls the databases in order to check all alarms independently from
other processes. This adds delay to the fault notification sent to the
Consumer, whereas one requirement of Doctor is to react to faults as fast as
possible.
 737
 738 **Change/feature request:**
 739
This BP proposes to add an alternative "Notification-driven Alarm Evaluator"
for Ceilometer that receives "notifications" sent by the "Event Publisher for
Alarm" described in the other BP. Once this new "Notification-driven Alarm
Evaluator" receives a "notification", it finds the "alarm" configurations that
may relate to the "notification" by querying the "alarm" database with some
keys, such as the resource ID, and then evaluates each alarm with the
information in that "notification".
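
The lookup-and-evaluate step described above can be sketched as follows. This
is a minimal illustration, not Ceilometer code: the ``Alarm`` fields, the
``NotificationDrivenEvaluator`` class, and the dictionary layout of a
notification are assumptions made for this example, and an in-memory
dictionary stands in for the "alarm" database.

.. code-block:: python

   from dataclasses import dataclass


   @dataclass
   class Alarm:
       """Hypothetical, simplified alarm configuration."""
       alarm_id: str
       resource_id: str
       trait: str    # notification field to inspect, e.g. "state"
       fire_on: str  # value of that field which triggers the alarm


   class NotificationDrivenEvaluator:
       def __init__(self, alarms):
           # Stand-in for the "alarm" database, indexed by resource ID.
           self._by_resource = {}
           for alarm in alarms:
               self._by_resource.setdefault(alarm.resource_id, []).append(alarm)

       def on_notification(self, notification: dict) -> dict:
           """Evaluate only the alarms related to this notification.

           Returns a mapping of alarm ID to the resulting alarm state.
           """
           resource_id = notification.get("resource_id")
           results = {}
           for alarm in self._by_resource.get(resource_id, []):
               fired = notification.get(alarm.trait) == alarm.fire_on
               results[alarm.alarm_id] = "alarm" if fired else "ok"
           return results


   evaluator = NotificationDrivenEvaluator([
       Alarm("a1", "vm-1", "state", "error"),
       Alarm("a2", "vm-2", "state", "error"),
   ])
   # Only the alarm registered for "vm-1" is evaluated.
   print(evaluator.on_notification({"resource_id": "vm-1", "state": "error"}))

Because the evaluation is triggered by the arrival of the "notification" and
touches only the alarms registered for the affected resource, no polling cycle
is involved in this sketch.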
 747
After the alarm evaluation, it fires the alarm notification to the Consumer in
the same way as the existing "alarm evaluator" does. Similar to the existing
Alarm Evaluator, this new "Notification-driven Alarm Evaluator" aggregates and
correlates different alarms, which are then provided northbound to the Consumer
via the OpenStack "Alarm Notifier". The user/administrator can register the
alarm configuration via the existing Ceilometer API [*]_ and can thereby
configure whether to set an alarm and where to send the alarms.
 755
 756 **Implementation detail**
 757
 758 * The new "Notification-driven Alarm Evaluator" is part of Ceilometer.
* Most of the existing source code of the "Alarm Evaluator" can be re-used to
  implement this BP.
* No additional application logic is needed.
* It will access the Ceilometer Databases just like the existing "Alarm
  evaluator".
 764 * Only the polling-based approach will be replaced by a listener for
 765   "notifications" provided by the "Event Publisher for Alarm" on the Ceilometer
 766   "notification bus".
 767 * No new interfaces have to be added to Ceilometer.
 768
 769
 770 .. [*] https://etherpad.opnfv.org/p/doctor_bps
 771 .. [*] https://wiki.openstack.org/wiki/Ceilometer/Alerting
 772
 773 Report host fault to update server state immediately (Nova) [*]_
 774 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 775
 776 **Problem statement:**
 777
* Nova's state change for a failed or unreachable host is slow and does not
  reliably indicate whether the host is down. This might cause the same server
  instance to run twice if action is taken to evacuate the instance to another
  host.
* The Nova state for server(s) on a failed host does not change, but remains
  active and running. This gives the user false information about the server
  state.
* VIM northbound interface notification of host faults towards VNFM and NFVO
  should be in line with the OpenStack state. This fault notification is a
  Telco requirement defined in ETSI and will be implemented by the OPNFV Doctor
  project.
* An OpenStack user cannot take HA actions quickly and reliably by trusting the
  server state and host state.
 788
 789 **Proposed change:**
 790
There needs to be a new API that lets the Admin state that a host is down. This
API is used to mark the services running on that host as down, to reflect the
real situation.
 793
An example for a compute node:

* When compute node is up and running::

    vm_state: active and power_state: running
    nova-compute state: up status: enabled
 799     nova-compute state: up status: enabled
 800
* When compute node goes down and new API is called to state host is down::
 802
 803     vm_state: stopped power_state: shutdown
 804     nova-compute state: down status: enabled
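
The state transition in this example can be modelled with a short sketch. The
classes and the ``mark_host_down`` function are hypothetical names introduced
for illustration; they are not the Nova data model or the proposed API itself.

.. code-block:: python

   from dataclasses import dataclass, field


   @dataclass
   class Server:
       name: str
       vm_state: str = "active"
       power_state: str = "running"


   @dataclass
   class ComputeHost:
       name: str
       service_state: str = "up"        # nova-compute state
       service_status: str = "enabled"  # nova-compute status
       servers: list = field(default_factory=list)


   def mark_host_down(host: ComputeHost) -> None:
       """Hypothetical handler for the proposed 'host is down' API call."""
       host.service_state = "down"  # status stays "enabled", as in the example
       for server in host.servers:
           server.vm_state = "stopped"
           server.power_state = "shutdown"


   host = ComputeHost("compute-1", servers=[Server("vm-1")])
   mark_host_down(host)
   print(host.service_state, host.servers[0].vm_state,
         host.servers[0].power_state)

In this sketch a single call updates both the service state and the state of
every server on the host, so that the reported states match the real situation
without waiting for Nova to detect the failure on its own.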
 805
 806 **Alternatives:**
 807
There is no attractive alternative for detecting all the different host faults
other than an external tool. For such a tool to exist, Nova needs a new API for
reporting faults. Currently, some kind of workaround must be implemented,
because the states cannot be trusted or obtained from OpenStack fast enough.
 813
 814 .. [*] https://blueprints.launchpad.net/nova/+spec/update-server-state-immediately
 815
 816 Other related BPs
 817 ^^^^^^^^^^^^^^^^^
 818
 819 This section lists some BPs related to Doctor, but proposed by drafters outside
 820 the OPNFV community.
 821
 822 pacemaker-servicegroup-driver [*]_
 823 __________________________________
 824
This BP will detect and report a host that is down to OpenStack quite quickly.
However, this might not work properly, for example, when the management network
has a problem and the host is reported as faulty while a VM is still running
there. This might lead to launching the same VM instance twice, causing
problems. In addition, the NB IF message needs a fault reason, and for that the
source needs to be a tool that detects different kinds of faults, as Doctor
will be doing. This BP might also need enhancement to change server and service
states correctly.
 832
 833 .. [*] https://blueprints.launchpad.net/nova/+spec/pacemaker-servicegroup-driver
 834
 835 ..
 836  vim: set tabstop=4 expandtab textwidth=80: