.. This work is licensed under a Creative Commons Attribution 4.0 International License.
.. http://creativecommons.org/licenses/by/4.0

==========================
Inspector Design Guideline
==========================

.. NOTE::
   This is a draft specification of the design guideline for the inspector component.
   JIRA ticket to track updates and collect comments: `DOCTOR-73`_.

This document summarizes best practices for designing a high-performance
inspector that meets the requirements of the `OPNFV Doctor project`_.

Problem Description
===================

Several pitfalls were discovered during the development of the sample inspector, e.g.
we suffered a significant `performance degrading in listing VMs in a host`_.

A `patch set for caching the list`_ has been committed to solve this issue. When a
new inspector is integrated, it would be useful to evaluate the existing
design and give recommendations for improvement.

This document can be treated as a source of related blueprints in inspector
projects.

Guidelines
==========

Host specific VMs list
----------------------

Although the Doctor project requires that a fault alarm reach the consumer within one second, this is only the limit
set in the requirements. For fault management in Telco, the implementation should be as fast as possible, and one
second is still far from traditional Telco requirements.

One thing to optimize in the inspector is to eliminate the need to read the list of host specific VMs from the Nova
API when a host specific failure event arrives. The optimal implementation would initialize this list from the Nova
API when the inspector starts, and afterwards keep it up to date using the ``instance.update`` notifications received
from Nova. Polling the Nova API can be used as a complementary channel to take a snapshot of the hosts and VMs list in
order to keep the data consistent with reality.

This is an enhancement and perhaps not needed to stay under one second in a small system. However, it would be
needed for production use.

This guideline can be summarized as follows:

- cache the host VMs mapping instead of reading it on request
- subscribe to and handle update notifications to keep the list up to date
- take a snapshot periodically to ensure data consistency
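
The three points above can be sketched as follows. This is a minimal illustration under stated assumptions, not the
code of the Doctor sample inspector: ``HostVMCache``, ``list_servers`` and ``handle_instance_update`` are hypothetical
names, ``list_servers`` stands in for whatever call reads the servers from the Nova API, and each notification is
assumed to be already parsed into a VM ID and a host name.

.. code-block:: python

   import threading
   from collections import defaultdict


   class HostVMCache:
       """In-memory host -> VM IDs mapping (illustrative sketch)."""

       def __init__(self, list_servers):
           # list_servers is a hypothetical callable returning (vm_id, host)
           # pairs, e.g. a thin wrapper around a Nova "list servers" call.
           self._list_servers = list_servers
           self._lock = threading.Lock()
           self._vms_by_host = defaultdict(set)
           self._host_by_vm = {}
           self.snapshot()

       def snapshot(self):
           """Rebuild the whole mapping; call at start-up and then periodically."""
           vms_by_host = defaultdict(set)
           host_by_vm = {}
           for vm_id, host in self._list_servers():
               vms_by_host[host].add(vm_id)
               host_by_vm[vm_id] = host
           with self._lock:
               self._vms_by_host = vms_by_host
               self._host_by_vm = host_by_vm

       def handle_instance_update(self, vm_id, host, deleted=False):
           """Apply one update notification (fields assumed already parsed)."""
           with self._lock:
               # Drop the VM from its previous host, if it was known.
               old_host = self._host_by_vm.pop(vm_id, None)
               if old_host is not None:
                   self._vms_by_host[old_host].discard(vm_id)
               # Record the new placement unless the VM is gone.
               if not deleted and host is not None:
                   self._vms_by_host[host].add(vm_id)
                   self._host_by_vm[vm_id] = host

       def vms_on_host(self, host):
           """Return the cached VM IDs for a host without calling the API."""
           with self._lock:
               return sorted(self._vms_by_host.get(host, ()))

With such a cache, handling a host failure event only needs ``cache.vms_on_host(host)``, which is a local lookup. How
often ``snapshot()`` is called is a deployment choice that this sketch leaves open.
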

Parallel execution
------------------

In Doctor's architecture, the inspector is responsible for setting the error state on the affected VMs in order to
notify the consumers of the failure. This is done by calling the Nova `reset-state`_ API. However, this action is a
synchronous request with many underlying steps and typically costs hundreds of milliseconds. According to the
`discussion in mailing list`_, the total time grows linearly if the requests are sent one by one. This becomes
a critical issue in a large-scale system.

It is recommended to introduce **parallel execution** for actions like ``reset-state`` that take a list of targets.
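
One possible way to do this is a thread pool, since each request spends most of its time waiting on the API. The
sketch below is an illustration only: ``reset_state_parallel`` is a hypothetical helper, ``reset_state`` is an assumed
callable that wraps the synchronous request for a single server, and the default of 10 workers is an arbitrary choice.

.. code-block:: python

   from concurrent.futures import ThreadPoolExecutor


   def reset_state_parallel(reset_state, vm_ids, max_workers=10):
       """Call reset_state(vm_id) for every VM concurrently.

       reset_state is a hypothetical callable wrapping the synchronous
       reset-state request for one server. Returns a dict that maps each
       vm_id to None on success or to the exception raised by its call.
       """
       vm_ids = list(vm_ids)
       if not vm_ids:
           return {}
       results = {}
       workers = min(max_workers, len(vm_ids))
       with ThreadPoolExecutor(max_workers=workers) as pool:
           futures = {vm_id: pool.submit(reset_state, vm_id) for vm_id in vm_ids}
           for vm_id, future in futures.items():
               # exception() blocks until the call finishes; None means success.
               results[vm_id] = future.exception()
       return results

With *n* targets, a per-request cost of *t* and *w* workers, the total time drops from roughly *n* x *t* to roughly
ceil(*n* / *w*) x *t*, provided the API side can serve the requests concurrently.
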

Shortcut notification
---------------------

An alternative way to improve notification performance is to take a shortcut from the inspector to the notifier instead
of triggering the notification from the controller. The difference between the two workflows is shown below:

.. figure:: images/conservative-notification.png
   :alt: conservative notification

   Conservative Notification

.. figure:: images/shortcut-notification.png
   :alt: shortcut notification

   Shortcut Notification

It is worth noting that the shortcut notification has a side effect: cloud resource states could still be out of sync
by the time the consumer processes the alarm notification. This is out of scope for the inspector design but needs to
be taken into consideration at the system level.

Also, the "reset servers state to error" call is not necessary in the alternative notification case, where "host
forced down" is still called. "get-valid-server-state" was implemented to provide a valid server state, which earlier
could not be obtained without calling "reset servers state to error". Without "reset servers state to error", states
are less likely to be out of sync, while the notification and the host force-down can run in parallel.
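
The parallel variant described in the last paragraph could look roughly like the sketch below. It is an illustration
under assumptions rather than the actual workflow from the figures: ``handle_host_failure``, ``force_down_host`` and
``notify_consumer`` are hypothetical names for the inspector's handler, the "host forced down" call and the direct
notification to the consumer.

.. code-block:: python

   from concurrent.futures import ThreadPoolExecutor


   def handle_host_failure(host, vm_ids, force_down_host, notify_consumer):
       """Shortcut workflow: notify the consumer while the host is forced down.

       force_down_host(host) and notify_consumer(host, vm_ids) are
       hypothetical callables; no "reset servers state to error" call is made.
       """
       with ThreadPoolExecutor(max_workers=2) as pool:
           notification = pool.submit(notify_consumer, host, vm_ids)
           force_down = pool.submit(force_down_host, host)
           # result() waits for completion and re-raises any exception.
           notification.result()
           force_down.result()

Because the two calls are independent, neither waits for the other, so the consumer can be notified before the
controller has finished updating its own state.
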

Appendix
========

A study evaluating the effect of parallel execution and shortcut notification was presented at the OPNFV Beijing
Summit 2017.

.. figure:: images/notification-time.png
   :alt: notification time

   Notification Time

The `full presentation slides`_ are available for download.

.. _DOCTOR-73: https://jira.opnfv.org/browse/DOCTOR-73
.. _OPNFV Doctor project: https://wiki.opnfv.org/doctor
.. _performance degrading in listing VMs in a host: https://lists.opnfv.org/pipermail/opnfv-tech-discuss/2016-September/012591.html
.. _patch set for caching the list: https://gerrit.opnfv.org/gerrit/#/c/20877/
.. _DOCTOR-76: https://jira.opnfv.org/browse/DOCTOR-76
.. _discussion in mailing list: https://lists.opnfv.org/pipermail/opnfv-tech-discuss/2016-October/013036.html
.. _reset-state: https://developer.openstack.org/api-ref/compute/#reset-server-state-os-resetstate-action
.. _full presentation slides: https://wiki.opnfv.org/download/attachments/5046291/doctor_qtip_faster_higher_stronger.pdf