doc/Escalator_Requirement.rst

   1 Draft Escalator Requirement v0.4
   2 ================================
   3
   4 Authors:
   5 --------
   6
   7 | Jie Hu (ZTE, hu.jie@zte.com.cn)
   8 | Qiao Fu (China Mobile, fuqiao@chinamobile.com)
   9 | Ulrich Kleber (Huawei, Ulrich.Kleber@huawei.com)
  10 | Maria Toeroe (Ericsson, maria.toeroe@ericsson.com)
  11 | Sama, Malla Reddy (DOCOMO, sama@docomolab-euro.com)
  12 | Zhong Chao (ZTE, chao.zhong@zte.com.cn)
  13 | Julien Zhang (ZTE, zhang.jun3g@zte.com.cn)
  14 | Yuri Yuan (ZTE, yuan.yue@zte.com.cn)
  15 | Zhipeng Huang (Huawei, huangzhipeng@huawei.com)
  16 | Jia Meng (ZTE, meng.jia@zte.com.cn)
  17 | Liyi Meng (Ericsson, liyi.meng@ericsson.com)
  18 | Pasi Vaananen (Stratus, pasi.vaananen@stratus.com)
  19
  20 1. Scope
  21 --------
  22
  23 | This document describes the user requirements on the smooth upgrade
  24   function of the NFVI and VIM with respect to the upgrades of the OPNFV
  25   platform from one version to another. Smooth upgrade means that the
  26   upgrade results in no service outage for the end-users. This requires
  27   that the process of the upgrade is automatically carried out by a tool
  28   (code name: Escalator) with pre-configured data. The upgrade process
  29   includes preparation, validation, execution, monitoring and
  30   conclusion.
  31 | ==[MT] While it is good to have a tool for the entire upgrade process,
  32   it is a challenging task, so maybe we shouldn't require automation
  33   for the entire process right away. Automation is essential at
  34   execution.==
  35 | ==[hujie] Maybe we can analyze the information flows of the upgrade tool,
  36   abstract the basic / essential actions from the tool (or tools), and
  37   map them to a command set of NFVI / VIM's interfaces.==
  38
  39 The requirements are defined in a stepwise approach, i.e. in the first
  40 phase focusing on the upgrade of the VIM then widening the scope to the
  41 NFVI.
  42
  43 The requirements may apply to different NFV functions (NFVI, or VIM, or
  44 both of them). They will be classified in the Appendix of this
  45 document.
  46
  47 2. General Requirements Background and terminology
  48 --------------------------------------------------
  49
  50 ==[MT] At the moment 2.1-2.3 seem to be more background sections than
  51 requirements. Should we rename this part?==
  52
  53 2.1 Terminologies and definitions
  54 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  55
  56 -  **NFVI** is the abbreviation for Network Function Virtualization
  57    Infrastructure; it is sometimes also referred to as the data plane
  58    in this document.
  59 -  **VIM** is the abbreviation for Virtual Infrastructure Management;
  60    it is sometimes also referred to as the control plane in this document.
  61 -  **Operators** are network service providers and Virtual Network
  62    Function (VNF) providers.
  63 -  **End-Users** are subscribers of the Operator's services.
  64 -  **Network Service** is a service provided by an Operator to its
  65    End-users using a set of (virtualized) Network Functions.
  66 -  **Infrastructure Services** are those provided by the NFV
  67    Infrastructure and the Management & Orchestration functions to the
  68    VNFs. I.e. these are the virtual resources as perceived by the VNFs.
  69 -  **Smooth Upgrade** means that the upgrade results in no service
  70    outage for the end-users.
  71 -  **Rolling Upgrade** is an upgrade strategy that upgrades each node or
  72    a subset of nodes in a wave rolling style through the data centre. It
  73    is a popular upgrade strategy for maintaining service availability.
  74 -  **Parallel Universe** is an upgrade strategy that creates and deploys
  75    a new universe - a system with the new configuration - while the old
  76    system continues running. The state of the old system is transferred
  77    to the new system after sufficient testing of the latter.
  78 -  **Infrastructure Resource Model** ==(suggested by MT)== is identified
  79    as: physical resources, virtualization facility resources and virtual
  80    resources.
  81 -  **Physical Resources** are the hardware of the infrastructure and may
  82    also include the firmware that enables the hardware.
  83 -  **Virtual Resources** are resources provided as services built on top
  84    of the physical resources via the virtualization facilities; in our
  85    case, they are the components that VNF entities are built on, e.g.
  86    the VMs, virtual switches, virtual routers, virtual disks etc.
  87    ==[MT] I don't think the VNF is the virtual resource. Virtual
  88    resources are the VMs, virtual switches, virtual routers, virtual
  89    disks etc. The VNF uses them, but I don't think they are equal. The
  90    VIM doesn't manage the VNF, but it does manage virtual resources.==
  91 -  **Virtualization Facilities** are resources that enable the creation
  92    of virtual environments on top of the physical resources, e.g.
  93    hypervisor, OpenStack, etc.
  94
  95 2.2 Upgrade Objects
  96 ~~~~~~~~~~~~~~~~~~~
  97
  98 2.2.1 Physical Resource
  99 ^^^^^^^^^^^^^^^^^^^^^^^
 100
 101 | Most of the cloud infrastructures support dynamic addition/removal of
 102   hardware. A hardware upgrade could be done by removing the old
 103   hardware node and adding the new one. This will not be in the scope of
 104   this project.
 105 | ==[MT] Does this mean that we are excluding firmware upgrades too?==
 106
 107 2.2.2 Virtual Resources
 108 ^^^^^^^^^^^^^^^^^^^^^^^
 109
 110 | Virtual resource upgrades are mainly done by users. OPNFV may
 111   facilitate the activity, but it is suggested to put this on the
 112   long-term roadmap instead of the initial release.
 113 | ==[MT] same comment here: I don't think the VNF is the virtual
 114   resource. Virtual resources are the VMs, virtual switches, virtual
 115   routers, virtual disks etc. The VNF uses them, but I don't think they
 116   are equal. For example if by some reason the hypervisor is changed and
 117   the current VMs cannot be migrated to the new hypervisor, they are
 118   incompatible, then the VMs need to be upgraded too. This is not
 119   something the NFVI user (i.e. VNFs ) would even know about.==
 120
 121 2.2.3 Virtualization Facility Resources
 122 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 123
 124 | Based on the functionality they provide, virtualization facility
 125   resources could be divided into computing node, networking node,
 126   storage node and management node.
 127 | The possible upgrade objects in these nodes are addressed below.
 128   (Note: hardware-based virtualization may be considered a virtualization
 129   facility resource, but from the Escalator perspective it is better
 130   considered part of the hardware upgrade.)
 131
 132 **Computing node**
 133
 134 #. OS Kernel
 135 #. Hypervisor and virtual switch
 136 #. Other kernel modules, like drivers
 137 #. User space software packages, like nova-compute agents and other
 138    control plane programs
 139
 140 | Updating 1 and 2 will cause the loss of virtualization functionality of
 141   the compute node, which may lead to data plane service interruption
 142   if the virtual resource is not redundant.
 143 | Updating 3 might result in the same.
 144 | Updating 4 might lead to control plane service interruption if it is
 145   not an HA deployment.
 146
 147 **Networking node**
 148
 149 #. OS kernel, optional; not all switches/routers allow their OS to be
 150    upgraded, since it is more like firmware than a generic OS.
 151 #. User space software packages, like neutron agents and other control
 152    plane programs
 153
 154 | Updating 1, if allowed, will cause a node reboot and therefore lead to
 155   data plane service interruption if the virtual resource is not
 156   redundant.
 157 | Updating 2 might lead to control plane service interruption if it is
 158   not an HA deployment.
 159
 160 **Storage node**
 161
 162 #. OS kernel, optional; not all storage nodes allow their OS to be
 163    upgraded, since it is more like firmware than a generic OS.
 164 #. Kernel modules
 165 #. User space software packages, control plane programs
 166
 167 | Updating 1, if allowed, will cause a node reboot and therefore lead to
 168   data plane service interruption if the virtual resource is not
 169   redundant.
 170 | Updating 2 might result in the same.
 171 | Updating 3 might lead to control plane service interruption if it is
 172   not an HA deployment.
 173
 174 **Management node**
 175
 176 #. OS Kernel
 177 #. Kernel modules, like drivers
 178 #. User space software packages, like database, message queue and
 179    control plane programs.
 180
 181 | Updating 1 will cause a node reboot and therefore lead to control
 182   plane service interruption if it is not an HA deployment. Updating 2
 183   might result in the same.
 184 | Updating 3 might lead to control plane service interruption if it is
 185   not an HA deployment.
 186
 187 2.3 Upgrade Span
 188 ~~~~~~~~~~~~~~~~
 189
 190 | **Major Upgrade**
 191 | Upgrades between major releases may introduce significant changes in
 192   function, configuration and data, such as the upgrade of OPNFV from
 193   Arno to Brahmaputra.
 194
 195 | **Minor Upgrade**
 196 | Upgrades within one major release, which would not change the
 197   structure of the platform and may not affect the schema of the
 198   system data.
 199
 200 2.4 Upgrade Granularity
 201 ~~~~~~~~~~~~~~~~~~~~~~~
 202
 203 2.4.1 Physical/Hardware Dimension
 204 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 205
 206 Full and partial upgrades of a data centre, cluster or zone should be
 207 supported. The upgrade of a data centre or a zone may be divided into
 208 several batches. The upgrade of a cloud environment (cluster) may also
 209 be partial. For example, in one cloud environment running a number of
 210 VNFs, we may just try one of them to check the stability and
 211 performance, before we upgrade all of them.
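
The batching idea above can be sketched as follows. This is only an illustration: the function name ``plan_batches``, the node names and the size of the trial batch are assumptions made for the example and are not part of any Escalator interface.

```python
from typing import List


def plan_batches(nodes: List[str], batch_size: int, canary: int = 1) -> List[List[str]]:
    """Split nodes into a small trial batch followed by fixed-size batches."""
    if batch_size < 1 or canary < 0:
        raise ValueError("batch_size must be >= 1 and canary must be >= 0")
    batches = []
    if canary:
        batches.append(nodes[:canary])      # trial batch to check stability first
    rest = nodes[canary:]
    for i in range(0, len(rest), batch_size):
        batches.append(rest[i:i + batch_size])
    return [b for b in batches if b]        # drop empty batches


# Hypothetical zone with five nodes, upgraded two at a time after one trial node.
zone = ["node-1", "node-2", "node-3", "node-4", "node-5"]
batches = plan_batches(zone, batch_size=2)
```

The first batch is kept deliberately small so that stability and performance can be checked before the remaining batches are upgraded.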
 212
 213 2.4.2 Software Dimension
 214 ^^^^^^^^^^^^^^^^^^^^^^^^
 215
 216 -  The upgrade of the host OS or kernel may need a 'hot migration'
 217 -  The upgrade of OpenStack’s components
 218
 219    i. the one-shot upgrade of all components
 220    ii. the partial upgrade (or bugfix patch) which only affects some components
 221        (e.g., computing, storage, network, database, message queue, etc.)
 222
 223 | ==[MT] this section seems to overlap with 2.1.==
 224 | I can see the following dimensions for the software
 225
 226 -  different software packages
 227 -  different functions - Considering that the target versions of all
 228    software are compatible the upgrade needs to ensure that any
 229    dependencies between SW and therefore packages are taken into account
 230    in the upgrade plan, i.e. no version mismatch occurs during the
 231    upgrade therefore dependencies are not broken
 232 -  same function - This is an upgrade specific question if different
 233    versions can coexist in the system when a SW is being upgraded from
 234    one version to another. This is particularly important for stateful
 235    functions e.g. storage, networking, control services. The upgrade
 236    method must consider the compatibility of the redundant entities.
 237
 238 -  different versions of the same software package
 239 -  major version changes - they may introduce incompatibilities. Even
 240    when there are backward compatibility requirements changes may cause
 241    issues at graceful rollback
 242 -  minor version changes - they must not introduce incompatibility
 243    between versions, these should be primarily bug fixes, so live
 244    patches should be possible
 245
 246 -  different installations of the same software package
 247 -  using different installation options - they may reflect different
 248    users with different needs so redundancy issues are less likely
 249    between installations of different options; but they could be the
 250    reflection of the heterogeneous system in which case they may provide
 251    redundancy for higher availability, i.e. deeper inspection is needed
 252 -  using the same installation options - they often reflect that they are
 253    used by redundant entities across space
 254
 255 -  different distribution possibilities in space - same or different
 256    availability zones, multi-site, geo-redundancy
 257
 258 -  different entities running from the same installation of a software
 259    package
 260 -  using different startup options - they may reflect different users so
 261    redundancy may not be an issue between them
 262 -  using same startup options - they often reflect redundant
 263    entities====
 264
 265 2.5 Upgrade duration
 266 ~~~~~~~~~~~~~~~~~~~~
 267
 268 As the OPNFV end-users are primarily Telco operators, the network
 269 services provided by the VNFs deployed on the NFVI should meet the
 270 requirement of 'Carrier Grade'.
 271
 272 In telecommunication, "carrier grade" or "carrier class" refers to a
 273 system, or a hardware or software component that is extremely reliable,
 274 well tested and proven in its capabilities. Carrier grade systems are
 275 tested and engineered to meet or exceed "five nines" high availability
 276 standards, and provide very fast fault recovery through redundancy
 277 (normally less than 50 milliseconds). [from wikipedia.org]
 278
 279 "Five nines" means 99.999% availability, i.e. at most 5' 15" of downtime in one year.
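
The 5' 15" figure follows directly from the availability target. A minimal sketch of the arithmetic, assuming a non-leap year of 365 days (the function name is chosen for this example only):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def yearly_downtime_seconds(availability: float) -> float:
    """Allowed downtime per year, in seconds, for a given availability."""
    return (1.0 - availability) * MINUTES_PER_YEAR * 60


budget = yearly_downtime_seconds(0.99999)       # "five nines"
minutes, seconds = divmod(round(budget), 60)
print(f"{minutes} min {seconds} s")              # prints "5 min 15 s"
```

Note that this budget covers both planned and unplanned downtime, so an upgrade can only consume a fraction of it.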
 280
 281 We have learnt that a well prepared upgrade of OpenStack needs 10
 282 minutes. Most of the outage time is spent on
 283 synchronizing the database. [from 'Ten minutes OpenStack Upgrade? Done!'
 284 by Symantec]
 285
 286 This 10 minutes of downtime of OpenStack however did not impact the
 287 users, i.e. the VMs running on the compute nodes. This was the outage of
 288 the control plane only. On the other hand with respect to the
 289 preparations this was a manually tailored upgrade specific to the
 290 particular deployment and the versions of each OpenStack service.
 291
 292 The project aims to achieve a more generic methodology, which however
 293 requires that the upgrade objects fulfill certain requirements. Since
 294 this is only possible in the long run, we first target upgrades from
 295 version to version for the different VIM services.
 296
 297 **Questions:**
 298
 299 #. | Can we manage to upgrade OPNFV in only 5 minutes?
 300    | ==[MT] The first question is whether we have the same carrier grade
 301      requirement on the control plane as on the user plane. I.e. how
 302      much control plane outage can we / are we willing to tolerate?
 303    | In the above case probably if the database is only half of the size
 304      we can do the upgrade in 5 minutes, but is that good? It also means
 305      that if the database is twice as large then the outage is 20
 306      minutes.
 307    | For the user plane we should go for less as with two releases yearly
 308      that means 10 minutes outage per year.==
 309    | ==[Malla] 10 minutes outage per year to the users? Plus, if we take
 310      control plane into the consideration, then total outage will be
 311      more than 10 minute in whole network, right?==
 312    | ==[MT] The control plane outage does not have to cause outage to
 313      the users, but it may of course depending on the size of the system
 314      as it's more likely that there's a failure that needs to be handled
 315      by the control plane.==
 316
 317 #. | Is it acceptable for end users? For example, a planned service
 318      interruption lasting more than ten minutes for a software
 319      upgrade.
 320    | ==[MT] For user plane, no it's not acceptable in case of
 321      carrier-grade. The 5' 15" downtime should include unplanned and
 322      planned downtimes.==
 323    | ==[Malla] I do agree with Maria, it is not acceptable.==
 324
 325 #. | Will the VNFs still work well when the VIM is down?
 326    | ==[MT] In case of OpenStack it seems yes. .:)==
 327
 328 2.5.1 The maximum duration of an upgrade
 329 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 330
 331 | The duration of an upgrade is related to and proportional with the
 332   scale and the complexity of the OPNFV platform as well as the
 333   granularity (in function and in space) of the upgrade.
 334 | [Malla] Also, if it is a partial upgrade like a module upgrade, it also
 335   depends on the OPNFV modules and their tightly connected entities.
 336
 337 2.5.2 The maximum duration of a rollback when an upgrade is failed - this should be about rollback duration
 338 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 339
 340 | The duration of a rollback is shorter than that of the corresponding
 341   upgrade. It depends on the time needed to restore the software and
 342   configuration data from the pre-upgrade backup / snapshot.
 343 | ==[MT] During the upgrade process two types of failure may happen:
 344 |  In case we can recover from the failure by undoing the upgrade
 345   actions it is possible to roll back the already executed part of the
 346   upgrade in graceful manner introducing no more service outage than
 347   what was introduced during the upgrade. Such a graceful rollback
 348   requires typically the same amount of time as the executed portion of
 349   the upgrade and impose minimal state/data loss.==
 350 | ==[MT] Requirement: It should be possible to roll back gracefully the
 351   failed upgrade of stateful services of the control plane.
 352 |  In case we cannot recover from the failure by just undoing the
 353   upgrade actions, we have to restore the upgraded entities from their
 354   backed up state. In other terms the system falls back to an earlier
 355   state, which is typically a faster recovery procedure than graceful
 356   rollback and depending on the statefulness of the entities involved it
 357   may result in significant state/data loss.==
 358 | **Two possible types of failures can happen during an upgrade**
 359
 360 #. We can recover from the failure that occurred in the upgrade process:
 361    In this case, a graceful rolling back of the executed part of the
 362    upgrade may be possible which would "undo" the executed part in a
 363    similar fashion. Thus, such a roll back introduces no more service
 364    outage during an upgrade than the executed part introduced. This
 365    process typically requires the same amount of time as the executed
 366    portion of the upgrade and imposes minimal state/data loss.
 367 #. We cannot recover from the failure that occurred in the upgrade
 368    process: In this case, the system needs to fall back to an earlier
 369    consistent state by reloading this backed-up state. This is typically
 370    a faster recovery procedure than the graceful rollback, but can cause
 371    state/data loss. The state/data loss usually depends on the
 372    statefulness of the entities whose state is restored from the backup.
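
The two recovery paths can be sketched as below. This is a simplified illustration under stated assumptions: each upgrade step is modelled as a hypothetical ``(apply, undo)`` pair of callables, ``restore_backup`` stands in for whatever backup mechanism is used, and the step that failed is assumed to have left nothing behind that needs undoing.

```python
from typing import Callable, List, Tuple

Step = Tuple[Callable[[], None], Callable[[], None]]  # (apply, undo)


def run_upgrade(steps: List[Step], restore_backup: Callable[[], None]) -> str:
    """Apply steps in order; on failure undo them in reverse, else restore."""
    done: List[Step] = []
    try:
        for step in steps:
            step[0]()                        # apply the upgrade action
            done.append(step)
        return "upgraded"
    except Exception:
        try:
            for _, undo in reversed(done):   # case 1: graceful rollback
                undo()
            return "rolled_back"
        except Exception:
            restore_backup()                 # case 2: fall back to the backup
            return "restored"


log: List[str] = []


def fail() -> None:
    raise RuntimeError("step failed")


steps = [
    (lambda: log.append("apply a"), lambda: log.append("undo a")),
    (lambda: log.append("apply b"), lambda: log.append("undo b")),
    (fail, lambda: log.append("undo c")),
]
result = run_upgrade(steps, restore_backup=lambda: log.append("restore"))
```

In this run the third step fails, so the two executed steps are undone in reverse order and the backup is never touched. The backup is only used when an undo action itself fails.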
 373
 374 2.5.3 The maximum duration of a VNF interruption
 375 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 376
 377 | Since not the entire process of a smooth upgrade will affect the VNFs,
 378   the duration of the VNF interruption may be shorter than the duration
 379   of the upgrade. In some cases, running the VNFs without control
 380   from the VIM is acceptable.
 381 | ==[MT] Should require explicitly that the NFVI should be able to
 382   provide its services to the VNFs independent of the control plane?==
 383 | ==[MT] Requirement: The upgrade of the control plane must not cause
 384   interruption of the NFVI services provided to the VNFs.==
 385 | ==[MT] With respect to carrier-grade the yearly service outage of the
 386   VNF should not exceed 5' 15" regardless whether it is planned or
 387   unplanned outage. Considering the HA requirements TL-9000 requires an
 388   end-to-end service recovery time of 15 seconds based on which the ETSI
 389   GS NFV-REL 001 V1.1.1 (2015-01) document defines three service
 390   availability levels (SAL). The proposed example service recovery times
 391   for these levels are:
 392 | SAL1: 5-6 seconds
 393 | SAL2: 10-15 seconds
 394 | SAL3: 20-25 seconds==
 395 | ==[Pva] my comment was actually that the downtime metrics of the
 396   underlying elements, components and services are small fraction of the
 397   total E2E service availability time. No-one on the E2E service path
 398   will get the whole downtime allocation (in this context it includes
 399   upgrade process related outages for the services provided by VIM etc.
 400   elements that are subject to upgrade process).==
 401 | ==[MT] So what you are saying is that the upgrade of any entity
 402   (component, service) shouldn't cause even this much service
 403   interruption. This was the reason I brought these figures here as well
 404   that they are posing some kind of upper-upper boundary. Ideally the
 405   interruption is in the millisecond range i.e. no more than a
 406   switchover or a live migration.==
 407 | ==[MT] Requirement: Any interruption caused to the VNF by the upgrade
 408   of the NFVI should be in the sub-second range.==
 409
 410 ==[MT] In the future we also need to consider the upgrade of the NFVI,
 411 i.e. HW, firmware, hypervisors, host OS etc.==
 412
 413 3. Functional Considerations
 414 ----------------------------
 415
 416 3.1 Requirement of Escalator's Basic Actions
 417 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 418
 419 This section describes the basic functions may required by Escalator.
 420
 421 3.1.1 Preparation (offline)
 422 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
 423
 424 This is the design phase when the upgrade plan (or upgrade campaign) is
 425 being designed so that it can be executed automatically with minimal
 426 service outage. It may include the following work:
 427
 428 #. Check the dependencies of the software modules and their impact,
 429    backward compatibilities to figure out the appropriate upgrade method
 430    and ordering.
 431 #. Find out if a rolling upgrade could be planned with several rolling
 432    steps to avoid any service outage due to upgrading some
 433    parts/services at the same time.
 434 #. Collect the proper version files and check the integration for
 435    upgrading.
 436 #. The preparation step should produce an output (i.e. upgrade
 437    campaign/plan), which is executable automatically in an NFV Framework
 438    and which can be validated before execution.
 439
 440    -  The upgrade campaign should not be referring to scalable entities
 441       directly, but allow for adaptation to the system configuration and
 442       state at any given moment.
 443    -  The upgrade campaign should describe the ordering of the upgrade
 444       of different entities so that dependencies, redundancies can be
 445       maintained during the upgrade execution
 446    -  The upgrade campaign should provide information about the
 447       applicable recovery procedures and their ordering.
 448    -  The upgrade campaign should consider information about the
 449       verification/testing procedures to be performed during the upgrade
 450       so that upgrade failures can be detected as soon as possible and
 451       the appropriate recovery procedure can be identified and applied.
 452    -  The upgrade campaign should provide information on the expected
 453       execution time so that hanging execution can be identified
 454    -  The upgrade campaign should indicate any point in the upgrade when
 455       coordination with the users (VNFs) is required.
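
One way to derive the ordering that an upgrade campaign must describe is a topological sort over the dependencies between entities. The sketch below uses Python's standard ``graphlib`` module (Python 3.9 or later); the entity names and their dependencies are invented for the example and do not reflect a real deployment.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical dependency map: each entity lists what must be upgraded before it.
depends_on = {
    "database": set(),
    "message-queue": set(),
    "identity": {"database"},
    "compute-api": {"database", "message-queue", "identity"},
    "compute-agent": {"compute-api"},
}

sorter = TopologicalSorter(depends_on)
sorter.prepare()                      # raises CycleError on circular dependencies
waves = []
while sorter.is_active():
    ready = sorted(sorter.get_ready())  # entities whose dependencies are done
    waves.append(ready)
    sorter.done(*ready)
```

Each entry in ``waves`` is a group of entities with no unmet dependencies, so the entities within one wave could be upgraded in parallel while the waves themselves run in sequence.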
 456
 457 ==[hujie] Depending on the attributes of the object being upgraded, the
 458 upgrade plan may be split into step(s) and/or sub-plan(s), and even
 459 smaller sub-plans in the design phase. The plan(s) or sub-plan(s) may
 460 include step(s) or sub-plan(s).==
 461
 462 3.1.2 Validation the upgrade plan / Checking the pre-requisites of System( offline / online)
 463 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 464
 465 | The upgrade plan should be validated before the execution by testing
 466   it in a test environment which is similar to the production environment.
 467 | ==[MT]However it could also mean that we can identify some properties
 468   that it should satisfy e.g. what operations can or cannot be executed
 469   simultaneously like never take out two VMs of the same VNF.
 470 | Another question is if it requires that the system is in a particular
 471   state when the upgrade is applied. I.e. if there's certain amount of
 472   redundancy in the system, migration is enabled for VMs, when the NFVI
 473   is upgraded the VIM is healthy, when the VIM is upgraded the NFVI is
 474   healthy, etc.
 475 | I'm not sure what online validation means: Is it the validation of the
 476   upgrade plan/campaign or the validation of the system that it is in a
 477   state that the upgrade can be performed without too much risk?==
 478
 479 | Before the upgrade plan is executed, the system health of the
 480   online production environment should be checked and confirmed to
 481   satisfy the requirements described in the upgrade plan. The
 482   sysinfo, e.g. system alarms, performance statistics and
 483   diagnostic logs, will be collected and analyzed. It is required to
 484   resolve all of the system faults or exclude the unhealthy parts before
 485   executing the upgrade plan.
 486 | ==[hujie] Text merged.==
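
A pre-upgrade health check of this kind might look like the sketch below. The ``NodeInfo`` fields, the load threshold and the node names are assumptions made for illustration; a real check would evaluate whatever conditions the upgrade plan states.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class NodeInfo:
    name: str
    critical_alarms: int
    cpu_load: float          # fraction, 0.0 .. 1.0


def precheck(nodes: List[NodeInfo], max_load: float = 0.8) -> Dict[str, List[str]]:
    """Sort nodes into those ready for upgrade and those to fix or exclude."""
    report: Dict[str, List[str]] = {"ready": [], "excluded": []}
    for node in nodes:
        healthy = node.critical_alarms == 0 and node.cpu_load <= max_load
        report["ready" if healthy else "excluded"].append(node.name)
    return report


# Hypothetical sysinfo collected from three compute nodes.
nodes = [
    NodeInfo("compute-1", critical_alarms=0, cpu_load=0.35),
    NodeInfo("compute-2", critical_alarms=2, cpu_load=0.40),
    NodeInfo("compute-3", critical_alarms=0, cpu_load=0.95),
]
report = precheck(nodes)
```

Nodes in the ``excluded`` list would either have their faults resolved first or be left out of the upgrade.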
 487
 488 3.1.3 Backup/Snapshot (online)
 489 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 490
 491 To avoid loss of data when an upgrade is unsuccessful, the
 492 data should be backed up and a snapshot of the system state should be
 493 taken before the execution of the upgrade plan. This should be
 494 considered in the upgrade plan.
 495
 496 Several backups/snapshots may be generated and stored before the single
 497 steps of changes. The following data/files are required to be
 498 considered:
 499
 500 #. running version files for each node.
 501 #. system components' configuration file and database.
 502 #. image and storage, if it is necessary.
 503    ==[MT] Does 3 imply VNF image and storage? I.e. VNF state and data?==
 504
 505 | ==[hujie] The following text is derived from previous "4. Negotiate
 506   with the VNF if it's ready for the upgrade"==
 507 | Although the upper layer, which includes VNFs and VNFMs, is out of the
 508   scope of Escalator, it is still recommended to prepare it for a
 509   smooth system upgrade. Escalator cannot guarantee the safety of the
 510   VNFs. The upper layer should have some safeguard mechanism in its
 511   design, and be ready to avoid failure during the system upgrade.
 512
 513 3.1.4 Execution (online)
 514 ^^^^^^^^^^^^^^^^^^^^^^^^
 515
 516 | The execution of the upgrade plan should be a dynamic procedure which is
 517   controlled by Escalator.
 518 | ==[hujie] Revised text to be general.==
 519
 520 #. It is required to support execution either in sequence or in
 521    parallel.
 522 #. It is required to check the result of the execution and take
 523    action according to the situation and the policies in the upgrade plan.
 524 #. It is required to execute properly on various configurations of
 525    system objects, e.g. stand-alone, HA, etc.
 526 #. It is required to execute on the designated different parts of the
 527    system, e.g. physical server, virtualized server, rack, chassis,
 528    cluster, even different geographical places.
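
The first two requirements can be illustrated with a small sketch. The ``execute`` function, the thread pool size and the "stop at first failure" policy for sequential runs are assumptions chosen for the example; the actual policy would come from the upgrade plan.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def execute(targets: List[str], action: Callable[[str], bool],
            parallel: bool = False) -> Dict[str, bool]:
    """Run action on the targets, in sequence or in parallel, and collect results."""
    if parallel:
        with ThreadPoolExecutor(max_workers=4) as pool:
            results = list(pool.map(action, targets))
    else:
        results = []
        for target in targets:
            ok = action(target)
            results.append(ok)
            if not ok:          # assumed policy: stop at the first failure
                break
    # zip() stops at the shorter list, so skipped targets are not reported
    return dict(zip(targets, results))


def upgrade_node(name: str) -> bool:
    """Stand-in for a real upgrade action; node-2 fails for the example."""
    return name != "node-2"


sequential = execute(["node-1", "node-2", "node-3"], upgrade_node)
concurrent = execute(["node-1", "node-2", "node-3"], upgrade_node, parallel=True)
```

In the sequential run ``node-3`` is never touched because ``node-2`` failed, whereas the parallel run reports a result for every node.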
 529
 530 3.1.5 Testing (online)
 531 ^^^^^^^^^^^^^^^^^^^^^^
 532
 533 | Testing is done after upgrading the whole system or parts of it to make
 534   sure the upgraded system (object) is working normally.
 535 | ==[hujie] Revised text to be general.==
 536
 537 #. It is recommended to run the prepared test cases to see if the
 538    functionalities are available without any problem.
 539 #. It is recommended to check the sysinfo, e.g. system alarms,
 540    performance statistics and diagnostic logs to see if there is
 541    anything abnormal.
 542
 543 3.1.6 Restore/Rollback (online)
 544 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 545
 546 | When an upgrade fails, a quick system restore or system
 547   rollback should be performed to recover the system and the services.
 548 | ==[hujie] Revised text to be general.==
 549
 550 #. It is recommended to support system restore from backup when an
 551    upgrade has failed.
 552 #. It is recommended to support graceful rollback with reverse order
 553    steps if possible.
 554
 555 3.1.7 Monitoring (online)
 556 ^^^^^^^^^^^^^^^^^^^^^^^^^
 557
 558 | Escalator should continually monitor the process of the upgrade. It
 559   keeps the status of each module, each node and each cluster up to date
 560   in a status table during the upgrade.
 561 | ==[hujie] Revised text to be general.==
 562
 563 #. It is required to collect the status of every object being upgraded
 564    and to send alarms on abnormal conditions during the upgrade.
 565 #. It is recommended to reuse the existing monitoring system, like alarm.
 566 #. It is recommended to support proactive querying.
 567 #. It is recommended to support passively waiting for notifications.
 568
 569 | **Two possible ways for monitoring:**
 570 | **Pro-Actively Query** requires the NFVI/VIM to provide a proper API or
 571   CLI. If Escalator serves as a service, it should pass on these
 572   interfaces.
 573 | **Passively Wait for Notification** requires Escalator to provide a
 574   callback interface, which could be used by NFVI/VIM systems or an upgrade
 575   agent to send back notifications.
 576 | [hujie] I am not sure why not to subscribe the notification.
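
Both monitoring styles can feed the same status table, as the sketch below shows. The ``StatusTable`` class, its method names and the state strings are hypothetical; ``query`` stands in for whatever API or CLI the NFVI/VIM exposes.

```python
from typing import Callable, Dict, List


class StatusTable:
    """Keeps the latest upgrade status per object (module, node or cluster)."""

    def __init__(self) -> None:
        self.status: Dict[str, str] = {}
        self.alarms: List[str] = []

    def notify(self, obj: str, state: str) -> None:
        """Callback for the passive mode: the upgraded system reports in."""
        self.status[obj] = state
        if state == "failed":
            self.alarms.append(obj)      # raise an alarm on abnormal state

    def poll(self, objects: List[str], query: Callable[[str], str]) -> None:
        """Active mode: ask for the state of each object via query()."""
        for obj in objects:
            self.notify(obj, query(obj))


table = StatusTable()
table.notify("node-1", "upgrading")                       # pushed notification
table.poll(["node-1", "node-2"],
           lambda obj: "done" if obj == "node-1" else "failed")  # active query
```

After these calls the table holds the latest state of both nodes and one alarm for ``node-2``.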
 577
 578 3.1.8 Logging (online)
 579 ^^^^^^^^^^^^^^^^^^^^^^
 580
 581 Record the information generated by Escalator into log files. The log
 582 files are used for manual diagnosis of exceptions.
 583
 584 #. It is required to support logging.
 585 #. It is recommended to include time stamp, object id, action name,
 586    error code, etc.
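
A log record carrying the recommended fields could be produced with Python's standard ``logging`` module as sketched below. The logger name and the field names ``object_id``, ``action`` and ``error_code`` are assumptions for this example, not a defined Escalator log format.

```python
import logging

logger = logging.getLogger("escalator")
handler = logging.StreamHandler()
# asctime supplies the time stamp; the other fields are passed via "extra".
handler.setFormatter(logging.Formatter(
    "%(asctime)s %(levelname)s object=%(object_id)s "
    "action=%(action)s code=%(error_code)s %(message)s"))
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("package installed",
            extra={"object_id": "node-1", "action": "install", "error_code": 0})
```

With this formatter every log call has to pass all three extra fields, otherwise formatting the record fails.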
 587
 588 3.1.9 Administrative Control (online)
 589 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 590
 591 Administrative Control is used to control the privilege to start any of
 592 Escalator's actions, to avoid unauthorized operations.
 593
 594 #. It is required to support an administrative control mechanism.
 595 #. It is recommended to reuse the system's own security system.
 596 #. It is required to avoid conflicts when the system's own security
 597    system is being upgraded.
 598
 599 3.2 Requirements on system object being upgraded
 600 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 601
 602 | ==We can develop BPs in future from req of this section and GA for
 603   upper stream projects==
 604 | Escalator focuses on smooth upgrade. In a practical implementation, it
 605   might be combined with the installer/deployer, or act as an independent
 606   tool/service. Either way, it requires that the target systems (NFVI and
 607   VIM) are developed/deployed in a way that lets Escalator perform
 608   upgrades on them.
 609
 610 On the NFVI system, live-migration is likely used to maintain availability
 611 because OPNFV would like to make HA transparent to the end user. This
 612 requires the VIM system to be able to put a compute node into maintenance
 613 mode and then isolate it from normal service. Otherwise, new NFVI instances
 614 risk being scheduled onto the node being upgraded.
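
The effect of maintenance mode on scheduling can be shown with a toy model. The ``Scheduler`` class and its methods are invented for this sketch and do not correspond to any real VIM API.

```python
from typing import Dict, List, Optional


class Scheduler:
    """Toy placement logic that skips nodes in maintenance mode."""

    def __init__(self, nodes: List[str]) -> None:
        self.maintenance: Dict[str, bool] = {n: False for n in nodes}

    def set_maintenance(self, node: str, on: bool) -> None:
        self.maintenance[node] = on

    def pick_node(self) -> Optional[str]:
        """Return the first node accepting new instances, or None."""
        for node, in_maintenance in self.maintenance.items():
            if not in_maintenance:
                return node
        return None


scheduler = Scheduler(["compute-1", "compute-2"])
scheduler.set_maintenance("compute-1", True)   # compute-1 is about to be upgraded
target = scheduler.pick_node()                 # new instances go to compute-2
```

Once the upgrade of ``compute-1`` is finished, maintenance mode is switched off again and the node returns to normal service.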
 615
 616 | On the VIM system, availability is likely achieved by redundancy. This
 617   imposes fewer requirements on the system/services being upgraded (see PVA
 618   comments in early version). However, there should be a way to put the
 619   target system into standby mode, because starting an upgrade on the
 620   master node in a cluster is likely a bad idea.
 621 | ==[hujie] Revised text to be general.==
 622
#. It is required for NFVI/VIM to support a **service handover** mechanism
   that limits interruption to 0.001% (i.e. 99.999% service
   availability). Possible implementations are live migration, redundant
   deployment, etc. (Note: for VIM, the interruption limit could be less
   restrictive.)
#. It is required for NFVI/VIM to restore the earlier version in an
   efficient way, for example from a **snapshot**.
#. It is required for NFVI/VIM to **migrate data** efficiently between
   the base and the upgraded system.
   ==[hujie] What is the exact meaning of "base" here?==
#. It is recommended for the NFVI/VIM interfaces to support upgrade
   orchestration, e.g. reading/setting the system state.
   ==[hujie] I am not sure if it reflects the previous text.==
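
To give a feeling for the 99.999% figure, the short calculation below turns an availability target into a downtime budget. It is only a sketch: measuring over one non-leap year is an assumption, since the requirement does not state the measurement period, and the function names are illustrative.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # assumption: a non-leap year


def downtime_budget_seconds(availability, period_seconds=SECONDS_PER_YEAR):
    """Return the maximum tolerated interruption within the period."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * period_seconds


def within_budget(interruptions_s, availability, period_seconds=SECONDS_PER_YEAR):
    """Check whether the summed interruptions stay inside the budget."""
    return sum(interruptions_s) <= downtime_budget_seconds(availability, period_seconds)


budget = downtime_budget_seconds(0.99999)
print(f"{budget:.1f} s per year")  # about 315.4 s, i.e. roughly 5.3 minutes
```

Under this assumption, all service handovers caused by upgrades in a year, together with any other outages, would have to fit into roughly five minutes.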
 636
 637 4. Use Cases
 638 ------------
 639
 640 This section describes the use cases to verify the requirements of
 641 Escalator.
 642
 643 4.1 Upgrade a system with minimal configuration
 644 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 645
A minimal configuration system is normally deployed for experimental or
development usage, such as an OPNFV test bed. Although it does not carry
a large workload, it is a typical system to be upgraded frequently.
 649
 650 4.2 Upgrade a system with HA configuration
 651 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 652
An HA configuration system is very common in operators' data centres and
is a typical production environment. It runs 24 hours a day, 7 days a
week, with VNFs running on it to provide services to the end users.
 656
 657 4.3 Upgrade a system with Multi-Site configuration
 658 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 659
An upgrade in one site may cause service interruption in another site if
the two sites depend on each other and share the same modules or database
(e.g. one Keystone for both sites).

If a site fails during an upgrade, even a minimal loss of state or data
during the rollback can affect, or cause a failure in, the dependent site.
 666
==Consider a single site of the ARNO release first, then multi-site in
the future.==
 669
 670 5. RA of Escalator
 671 ------------------
 672
This section describes the reference architecture, the function blocks
and the function entities of Escalator, so that the reader can
understand how the basic functions are organized.
 676
 677 6. Information Flows
 678 --------------------
 679
| This section describes the information flows among the function
  entities when Escalator is in action.
| We should consider a generic procedure/framework for upgrading, and
  may provide a plugin interface for specialized tasks.
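
A generic procedure with a plugin interface could look roughly like the sketch below. The phase names are taken from the scope section (preparation, validation, execution, monitoring, conclusion); the classes ``UpgradePlugin`` and ``RecordingPlugin`` and the function ``run_upgrade`` are hypothetical and do not describe an agreed design.

```python
from abc import ABC, abstractmethod

# Phases of the upgrade process as listed in the scope of this document.
PHASES = ("preparation", "validation", "execution", "monitoring", "conclusion")


class UpgradePlugin(ABC):
    """Hypothetical plugin interface for specialized upgrade tasks."""

    @abstractmethod
    def run_phase(self, phase, context):
        """Perform the plugin's work for one phase; return True on success."""


class RecordingPlugin(UpgradePlugin):
    """Toy plugin that only records which phases it was asked to run."""

    def __init__(self, name):
        self.name = name

    def run_phase(self, phase, context):
        context.setdefault("trace", []).append((self.name, phase))
        return True


def run_upgrade(plugins, context=None):
    """Drive every plugin through the phases in order, stopping on failure."""
    context = {} if context is None else context
    for phase in PHASES:
        for plugin in plugins:
            if not plugin.run_phase(phase, context):
                context["failed_phase"] = phase
                return False, context
    return True, context


ok, ctx = run_upgrade([RecordingPlugin("vim")])
print(ok, ctx["trace"])
```

Keeping the phase loop in the framework and the specialized work in plugins lets the same procedure serve different targets; rollback handling is deliberately left out of this sketch.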
 684
 685 7. Interfaces
 686 -------------
 687
 688 This section describes the required interfaces of Escalator.
 689
 690 7.1 Manual Interface (CLI / GUI)
 691 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 692
 693 7.2 RESTful API
 694 ~~~~~~~~~~~~~~~
 695
To support 3.3, i.e. negotiating with the VNF on whether it is ready for
the upgrade.
 697
 698 7.3 Configuration File
 699 ~~~~~~~~~~~~~~~~~~~~~~
 700
This section will suggest a format for the configuration files and how to
deal with them.
 703
7.4 Log File
~~~~~~~~~~~~
 706
This section will suggest a format for the log files and how to deal with
them.
 709
 710 8. Requirements from other OPNFV projects
 711 -----------------------------------------
 712
| We have created a questionnaire to collect the requirements of other
  projects
  (https://docs.google.com/forms/d/11o1mt15zcq0WBtXYK0n6lKF8XuIzQTwvv8ePTjmcoF0/viewform?usp=send_form),
  please advertise it.
| ==[hujie] Can we force other OPNFV projects to complete the survey by
  using JIRA dependencies?==
 719
 720 8.1 Doctor Project
 721 ~~~~~~~~~~~~~~~~~~
 722
| ==Note: This scenario could be out of scope for the Escalator project,
  but having the option to support it would align better with the Doctor
  requirements.==
| The scope of the Doctor project also covers a maintenance scenario in
  which 1) the VIM administrator requests host maintenance from the VIM,
  2) the VIM notifies a consumer such as the VNFM so that it can trigger
  application-level migration or switch active-standby nodes, and 3) the
  VIM waits a short while for the response from the consumer.
 731
-  The VIM should send a notification of the VM migration to the consumer
   (VNFM) as an abstracted message such as "maintenance".
-  The VIM could hold off the VM migration until it receives a "VM ready to
   maintenance" message from the owner (VNFM).
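
The notify-and-wait exchange described above can be sketched as follows. This is an in-process illustration only: the ``Consumer`` class stands in for a VNFM, the reply channel is a plain queue, and the function name, return values and timeout are assumptions rather than any Doctor or VIM API.

```python
import queue

READY = "VM ready to maintenance"  # message text quoted from the list above


class Consumer:
    """Stand-in for a VNFM that answers maintenance notifications."""

    def __init__(self, replies, responsive=True):
        self.replies = replies
        self.responsive = responsive

    def notify(self, event, host):
        # A real VNFM would trigger application-level migration or an
        # active-standby switch here before answering.
        if self.responsive and event == "maintenance":
            self.replies.put((host, READY))


def request_host_maintenance(host, consumer, replies, timeout_s=1.0):
    """Notify the consumer, then wait a short while for its answer."""
    consumer.notify("maintenance", host)
    try:
        reply_host, message = replies.get(timeout=timeout_s)
    except queue.Empty:
        return "timeout"
    if reply_host == host and message == READY:
        return "proceed"
    return "unexpected-reply"


replies = queue.Queue()
print(request_host_maintenance("compute-1", Consumer(replies), replies))
```

What the VIM should do on a timeout (proceed anyway, retry, or abort the maintenance) is a policy decision that this sketch leaves open.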
 736
 737 8.2 HA Project
 738 ~~~~~~~~~~~~~~
 739
 740 8.3 Multi-site Project
 741 ~~~~~~~~~~~~~~~~~~~~~~
 742
-  Upgrading one site with Escalator should at least not cause API token
   validation to fail in the other site.
 745
 746 9. Reference
 747 ------------
 748
| [1] ETSI GS NFV 002 (V1.1.1): "Architectural Framework"
| [2] ETSI GS NFV 003 (V1.1.1): "Terminology for Main Concepts in NFV"
| [3] ETSI GS NFV-SWA001: "Virtual Network Function Architecture"
| [4] ETSI GS NFV-MAN001: "Management and Orchestration"
| [5] ETSI GS NFV-REL001: "Resiliency Requirements"
| [6] QuEST Forum TL-9000: "Quality Management System Requirement
  Handbook"
| [7] Service Availability Forum AIS: "Software Management Framework"
 757
 758 10. Useful Working Drafts of ETSI NFV
 759 -------------------------------------
 760
| Access them with your own ETSI account; please DO NOT disclose the
  content.
 763 | [1] Migrate Virtualised Compute Resource operation @ 7.3.1.8
 764 | ftp://docbox.etsi.org/ISG/NFV/Open/Drafts/IFA005_Or-Vi_ref_point_Spec/NFV-IFA005v070.zip
 765 | [2] Reliability issues during NFV Software upgrade and improvement
 766   mechanisms @ 8
 767 | ftp://@docbox.etsi.org/ISG/NFV/Open/Drafts/REL003_E2E_reliability_models/NFV-REL003v030.zip
 768
 769 Appendix
 770 --------
 771
 772 A.1 Impact Analysis
 773 ~~~~~~~~~~~~~~~~~~~
 774
Upgrading the different software modules may have different impacts on
the availability of the infrastructure resources and even on the service
continuity of the VNFs.
 778
 779 **Software modules in the computing nodes**
 780
#. Host OS patch
   ==[MT] As SW module, we should list the host OS and maybe its
   drivers as well. From an upgrade perspective, do we limit host OS
   upgrades to patches only?==
#. Hypervisor, such as KVM, QEMU, XEN, libvirt
#. OpenStack agents in computing nodes (like Nova agent, Ceilometer
   agent...)
 788
 789 **Software modules in network nodes**
 790
 791 #. Neutron L2/L3 agent
 792 #. OVS, SR-IOV Driver
 793
**Software modules in storage nodes**
 795
 796 #. Ceph
 797
 798 The table below analyses such an impact - considering a single instance
 799 of each software module - from the following aspects:
 800
 801 -  the function which will be lost during upgrade,
 802 -  the duration of the loss of this specific function,
 803 -  if this causes the loss of the vNF function,
 804 -  if it causes incompatibility in the different parts of the software,
 805 -  what should be backed up before the upgrade,
 806 -  the duration of restoration time if the upgrade fails
 807
| The values provided come from internal testing and are based on some
  assumptions; they may vary depending on the deployment techniques.
  Please feel free to add values if you find more efficient ones during
  your testing.
| https://wiki.opnfv.org/_media/upgrade_analysis_v0.5.xlsx
| Note that no redundancy of the software modules is considered in the
  table.