1 Draft Escalator Requirement v0.4
2 ================================
7 | Jie Hu (ZTE, hu.jie@zte.com.cn)
8 | Qiao Fu (China Mobile, fuqiao@chinamobile.com)
9 | Ulrich Kleber (Huawei, Ulrich.Kleber@huawei.com)
10 | Maria Toeroe (Ericsson, maria.toeroe@ericsson.com)
11 | Sama, Malla Reddy (DOCOMO, sama@docomolab-euro.com)
12 | Zhong Chao (ZTE, chao.zhong@zte.com.cn)
13 | Julien Zhang (ZTE, zhang.jun3g@zte.com.cn)
14 | Yuri Yuan (ZTE, yuan.yue@zte.com.cn)
15 | Zhipeng Huang (Huawei, huangzhipeng@huawei.com)
16 | Jia Meng (ZTE, meng.jia@zte.com.cn)
17 | Liyi Meng (Ericsson, liyi.meng@ericsson.com)
18 | Pasi Vaananen (Stratus, pasi.vaananen@stratus.com)
23 | This document describes the user requirements on the smooth upgrade
24 function of the NFVI and VIM with respect to the upgrades of the OPNFV
25 platform from one version to another. Smooth upgrade means that the
26 upgrade results in no service outage for the end-users. This requires
27 that the process of the upgrade is automatically carried out by a tool
28 (code name: Escalator) with pre-configured data. The upgrade process
29 includes preparation, validation, execution, monitoring and
31 | ==[MT] It is good to have a tool for the entire upgrade process,
32 but it is a challenging task, so maybe we shouldn't require automation
33 for the entire process right away. Automation is essential at
35 | ==[hujie] Maybe we can analyze the information flows of the upgrade tool,
36 abstract the basic / essential actions from the tool (or tools), and
37 map them to a command set of NFVI / VIM's interfaces.==
39 The requirements are defined in a stepwise approach, i.e. in the first
40 phase focusing on the upgrade of the VIM then widening the scope to the
43 The requirements may apply to different NFV functions (NFVI, or VIM, or
44 both of them). They will be classified in the Appendix of this
47 2. General Requirements Background and terminology
48 --------------------------------------------------
50 ==[MT] At the moment 2.1-2.3 seem to be more background sections than
51 requirements. Should we rename this part?==
53 2.1 Terminologies and definitions
54 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
56 - **NFVI** is the abbreviation for Network Function Virtualization
57 Infrastructure; sometimes it is also referred to as the data plane in this
59 - **VIM** is the abbreviation for Virtualized Infrastructure Manager;
60 sometimes it is also referred to as the control plane in this document.
61 - **Operators** are network service providers and Virtual Network
62 Function (VNF) providers.
63 - **End-Users** are subscribers of the Operators' services.
64 - **Network Service** is a service provided by an Operator to its
65 End-users using a set of (virtualized) Network Functions.
66 - **Infrastructure Services** are those provided by the NFV
67 Infrastructure and the Management & Orchestration functions to the
68 VNFs. I.e. these are the virtual resources as perceived by the VNFs.
69 - **Smooth Upgrade** means that the upgrade results in no service
70 outage for the end-users.
71 - **Rolling Upgrade** is an upgrade strategy that upgrades each node or
72 a subset of nodes in a wave rolling style through the data centre. It
73 is a popular upgrade strategy to maintain service availability.
74 - **Parallel Universe** is an upgrade strategy that creates and deploys
75 a new universe - a system with the new configuration - while the old
76 system continues running. The state of the old system is transferred
77 to the new system after sufficient testing of the latter.
78 - **Infrastructure Resource Model** ==(suggested by MT)== is identified
79 as: physical resources, virtualization facility resources and virtual
81 - **Physical Resources** are the hardware of the infrastructure; they may
82 also include the firmware that enables the hardware.
83 - **Virtual Resources** are resources provided as services built on top
84 of the physical resources via the virtualization facilities; in our
85 case, they are the components that VNF entities are built on, e.g.
86 the VMs, virtual switches, virtual routers, virtual disks etc.
87 ==[MT] I don't think the VNF is the virtual resource. Virtual
88 resources are the VMs, virtual switches, virtual routers, virtual
89 disks etc. The VNF uses them, but I don't think they are equal. The
90 VIM doesn't manage the VNF, but it does manage virtual resources.==
91 - **Virtualization Facilities** are resources that enable the creation
92 of virtual environments on top of the physical resources, e.g.
93 hypervisor, OpenStack, etc.
98 2.2.1 Physical Resource
99 ^^^^^^^^^^^^^^^^^^^^^^^
101 | Most of the cloud infrastructures support dynamic addition/removal of
102 hardware. A hardware upgrade could be done by removing the old
103 hardware node and adding the new one. This will not be in the scope of
105 | ==[MT] Does this mean that we are excluding firmware upgrades too?==
107 2.2.2 Virtual Resources
108 ^^^^^^^^^^^^^^^^^^^^^^^
110 | Virtual resource upgrade is mainly done by users. OPNFV may facilitate
111 the activity, but we suggest having it in the long-term roadmap instead of
113 | ==[MT] same comment here: I don't think the VNF is the virtual
114 resource. Virtual resources are the VMs, virtual switches, virtual
115 routers, virtual disks etc. The VNF uses them, but I don't think they
116 are equal. For example, if for some reason the hypervisor is changed and
117 the current VMs cannot be migrated to the new hypervisor because they are
118 incompatible, then the VMs need to be upgraded too. This is not
119 something the NFVI user (i.e. VNFs) would even know about.==
121 2.2.3 Virtualization Facility Resources
122 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
124 | Based on the functionality they provide, virtualization facility
125 resources could be divided into computing node, networking node,
126 storage node and management node.
127 | The possible upgrade objects in these nodes are addressed below:
128 (Note: hardware-based virtualization may be considered a virtualization
129 facility resource, but from Escalator's perspective it is better
130 considered part of the hardware upgrade.)
135 #. Hypervisor and virtual switch
136 #. Other kernel modules, like drivers
137 #. User space software packages, like nova-compute agents and other
138 control plane programs
140 | Updating 1 and 2 will cause the loss of virtualization functionality of
141 the compute node, which may lead to data plane service interruption
142 if the virtual resource is not redundant.
143 | Updating 3 might result in the same.
144 | Updating 4 might lead to control plane services interruption if not an
149 #. OS kernel (optional): not all switches/routers allow you to upgrade their
150 OS, since it is more like firmware than a generic OS.
151 #. User space software packages, like neutron agents and other control
154 | Updating 1, if allowed, will cause a node reboot and therefore lead to
155 data plane service interruption if the virtual resource is not
157 | Updating 2 might lead to control plane services interruption if not an
162 #. OS kernel (optional): not all storage nodes allow you to upgrade their OS,
163 since it is more like firmware than a generic OS.
165 #. User space software packages, control plane programs
167 | Updating 1, if allowed, will cause a node reboot and therefore lead to
168 data plane service interruption if the virtual resource is not
170 | Updating 2 might result in the same.
171 | Updating 3 might lead to control plane services interruption if not an
177 #. Kernel modules, like drivers
178 #. User space software packages, like database, message queue and
179 control plane programs.
181 | Updating 1 will cause a node reboot and therefore lead to control
182 plane service interruption if not an HA deployment. Updating 2 might
184 | Updating 3 might lead to control plane services interruption if not an
191 | Upgrades between major releases may introduce significant changes in
192 function, configuration and data, such as the upgrade of OPNFV from
196 | Upgrades inside one major release, which would not lead to changing
197 the structure of the platform and may not affect the schema of the
200 2.4 Upgrade Granularity
201 ~~~~~~~~~~~~~~~~~~~~~~~
203 2.4.1 Physical/Hardware Dimension
204 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
206 Support full / partial upgrade for data centre, cluster, zone. The
207 upgrade of a data centre or a zone may be divided into
208 several batches. The upgrade of a cloud environment (cluster) may also
209 be partial. For example, in one cloud environment running a number of
210 VNFs, we may just try one of them to check the stability and
211 performance, before we upgrade all of them.
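
The batching described above can be sketched as follows. This is only an illustration under stated assumptions: the function name ``plan_batches``, the node names and the idea of a small trial batch ahead of the regular batches are hypothetical and not part of Escalator.

```python
from typing import List, Sequence


def plan_batches(nodes: Sequence[str], batch_size: int, trial: int = 1) -> List[List[str]]:
    """Split nodes into a small trial batch followed by fixed-size batches."""
    if batch_size < 1 or trial < 0:
        raise ValueError("batch_size must be >= 1 and trial must be >= 0")
    nodes = list(nodes)
    # The trial batch is upgraded first to check stability and performance.
    plan = [nodes[:trial]] if trial and nodes else []
    rest = nodes[trial:]
    # The remaining nodes are upgraded batch by batch.
    plan += [rest[i:i + batch_size] for i in range(0, len(rest), batch_size)]
    return plan


if __name__ == "__main__":
    print(plan_batches(["node1", "node2", "node3", "node4", "node5"], batch_size=2))
```

Keeping the trial batch small limits the impact if the new version turns out to be unstable.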
213 2.4.2 Software Dimension
214 ^^^^^^^^^^^^^^^^^^^^^^^^
216 - The upgrade of host OS or kernel may need a 'hot migration'
217 - The upgrade of OpenStack’s components
218 i. the one-shot upgrade of all components
219 ii. the partial upgrade (or bugfix patch) which only affects some
220 components (e.g., computing, storage, network, database, message
223 | ==[MT] this section seems to overlap with 2.1.==
224 | I can see the following dimensions for the software
226 - different software packages
227 - different functions - Considering that the target versions of all
228 software are compatible, the upgrade needs to ensure that any
229 dependencies between SW and therefore packages are taken into account
230 in the upgrade plan, i.e. no version mismatch occurs during the
231 upgrade and therefore dependencies are not broken
232 - same function - This is an upgrade-specific question: whether different
233 versions can coexist in the system when a SW is being upgraded from
234 one version to another. This is particularly important for stateful
235 functions e.g. storage, networking, control services. The upgrade
236 method must consider the compatibility of the redundant entities.
238 - different versions of the same software package
239 - major version changes - they may introduce incompatibilities. Even
240 when there are backward compatibility requirements, changes may cause
241 issues at graceful rollback
242 - minor version changes - they must not introduce incompatibility
243 between versions, these should be primarily bug fixes, so live
244 patches should be possible
246 - different installations of the same software package
247 - using different installation options - they may reflect different
248 users with different needs so redundancy issues are less likely
249 between installations of different options; but they could be the
250 reflection of the heterogeneous system in which case they may provide
251 redundancy for higher availability, i.e. deeper inspection is needed
252 - using the same installation options - they often reflect that they are
253 used by redundant entities across space
255 - different distribution possibilities in space - same or different
256 availability zones, multi-site, geo-redundancy
258 - different entities running from the same installation of a software
260 - using different startup options - they may reflect different users so
261 redundancy may not be an issue between them
262 - using same startup options - they often reflect redundant
268 As the OPNFV end-users are primarily Telco operators, the network
269 services provided by the VNFs deployed on the NFVI should meet the
270 requirement of 'Carrier Grade'.
272 In telecommunication, "carrier grade" or "carrier class" refers to a
273 system, or a hardware or software component that is extremely reliable,
274 well tested and proven in its capabilities. Carrier grade systems are
275 tested and engineered to meet or exceed "five nines" high availability
276 standards, and provide very fast fault recovery through redundancy
277 (normally less than 50 milliseconds). [from wikipedia.org]
279 "Five nines" (99.999% availability) means working all the time in one year except for 5'15" (5 minutes 15 seconds).
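
As a quick check of this figure, the downtime budget follows directly from the availability ratio. The helper name ``downtime_budget_seconds`` and the 365-day year are assumptions made for this sketch.

```python
def downtime_budget_seconds(availability: float, days: float = 365.0) -> float:
    """Return the allowed downtime in seconds for the given availability."""
    return (1.0 - availability) * days * 24 * 60 * 60


budget = downtime_budget_seconds(0.99999)
minutes, seconds = divmod(budget, 60)
# For "five nines" this is roughly 5 minutes and 15 seconds per year.
print(f"{int(minutes)} min {seconds:.1f} s")
```

The budget covers planned and unplanned outages together, so an upgrade can only consume a part of it.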
281 We have learnt that a well-prepared upgrade of OpenStack needs 10
282 minutes. The major part of the outage time is spent on
283 synchronizing the database. [from ' Ten minutes OpenStack Upgrade? Done!
286 These 10 minutes of OpenStack downtime, however, did not impact the
287 users, i.e. the VMs running on the compute nodes. This was an outage of
288 the control plane only. On the other hand, with respect to the
289 preparations, this was a manually tailored upgrade specific to the
290 particular deployment and the versions of each OpenStack service.
292 The project aims to achieve a more generic methodology, which however
293 requires that the upgrade objects fulfill certain requirements. Since
294 this is only possible in the long run, we first target upgrades from
295 version to version for the different VIM services.
299 #. | Can we manage to upgrade OPNFV in only 5 minutes?
300 | ==[MT] The first question is whether we have the same carrier grade
301 requirement on the control plane as on the user plane. I.e. how
302 much control plane outage we can/are willing to tolerate?
303 | In the above case probably if the database is only half of the size
304 we can do the upgrade in 5 minutes, but is that good? It also means
305 that if the database is twice as large then the outage is 20
307 | For the user plane we should go for less, as with two releases yearly
308 that means 10 minutes outage per year.==
309 | ==[Malla] 10 minutes outage per year to the users? Plus, if we take
310 the control plane into consideration, then the total outage will be
311 more than 10 minutes in the whole network, right?==
312 | ==[MT] The control plane outage does not have to cause outage to
313 the users, but it may of course depending on the size of the system
314 as it's more likely that there's a failure that needs to be handled
315 by the control plane.==
317 #. | Is it acceptable for end users? For example, a planned service
318 interruption lasting more than ten minutes for software
320 | ==[MT] For user plane, no it's not acceptable in case of
321 carrier-grade. The 5' 15" downtime should include unplanned and
323 | ==[Malla] I agree with Maria, it is not acceptable.==
325 #. | Will any VNFs still work well when the VIM is down?
326 | ==[MT] In case of OpenStack it seems yes. :)==
328 2.5.1 The maximum duration of an upgrade
329 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
331 | The duration of an upgrade is related and proportional to the
332 scale and the complexity of the OPNFV platform as well as the
333 granularity (in function and in space) of the upgrade.
334 | [Malla] Also, if it is a partial upgrade such as a module upgrade, it also
335 depends on the OPNFV modules and their tightly connected entities.
337 2.5.2 The maximum duration of a rollback when an upgrade fails - this should be about rollback duration
338 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
340 | The duration of a rollback is shorter than the corresponding upgrade. It
341 depends on the time needed to restore the software and configuration data from
342 the pre-upgrade backup / snapshot.
343 | ==[MT] During the upgrade process two types of failure may happen:
344 | In case we can recover from the failure by undoing the upgrade
345 actions it is possible to roll back the already executed part of the
346 upgrade in a graceful manner introducing no more service outage than
347 what was introduced during the upgrade. Such a graceful rollback
348 requires typically the same amount of time as the executed portion of
349 the upgrade and imposes minimal state/data loss.==
350 | ==[MT] Requirement: It should be possible to roll back gracefully the
351 failed upgrade of stateful services of the control plane.
352 | In case we cannot recover from the failure by just undoing the
353 upgrade actions, we have to restore the upgraded entities from their
354 backed up state. In other terms the system falls back to an earlier
355 state, which is typically a faster recovery procedure than graceful
356 rollback and depending on the statefulness of the entities involved it
357 may result in significant state/data loss.==
358 | **Two possible types of failures can happen during an upgrade**
360 #. We can recover from the failure that occurred in the upgrade process:
361 In this case, a graceful rolling back of the executed part of the
362 upgrade may be possible which would "undo" the executed part in a
363 similar fashion. Thus, such a roll back introduces no more service
364 outage during an upgrade than the executed part introduced. This
365 process typically requires the same amount of time as the executed
366 portion of the upgrade and imposes minimal state/data loss.
367 #. We cannot recover from the failure that occurred in the upgrade
368 process: In this case, the system needs to fall back to an earlier
369 consistent state by reloading this backed-up state. This is typically
370 a faster recovery procedure than the graceful rollback, but can cause
371 state/data loss. The state/data loss usually depends on the
372 statefulness of the entities whose state is restored from the backup.
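
The two recovery paths can be sketched as follows. This is a minimal illustration, not Escalator's implementation: the ``Step`` structure, the function names and the returned strings are hypothetical, and a real tool would also need to handle a step that failed halfway through.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    name: str
    apply: Callable[[], None]  # performs the upgrade action
    undo: Callable[[], None]   # reverses the upgrade action


def upgrade_with_recovery(steps: List[Step], restore_backup: Callable[[], None]) -> str:
    """Run steps in order; on failure try a graceful rollback, else restore."""
    done: List[Step] = []
    try:
        for step in steps:
            step.apply()
            done.append(step)
        return "upgraded"
    except Exception:
        pass  # fall through to recovery
    try:
        # Failure type 1: undo the executed part in reverse order.
        for step in reversed(done):
            step.undo()
        return "rolled back"
    except Exception:
        # Failure type 2: undo is not possible, reload the backed-up state.
        restore_backup()
        return "restored from backup"
```

The graceful path revisits every executed step, which matches the observation that it takes about as long as the executed portion of the upgrade.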
374 2.5.3 The maximum duration of a VNF interruption
375 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
377 | Since not the entire process of a smooth upgrade will affect the VNFs,
378 the duration of the VNF interruption may be shorter than the duration
379 of the upgrade. In some cases, the VNF running without control
380 from the VIM is acceptable.
381 | ==[MT] Should we require explicitly that the NFVI should be able to
382 provide its services to the VNFs independent of the control plane?==
383 | ==[MT] Requirement: The upgrade of the control plane must not cause
384 interruption of the NFVI services provided to the VNFs.==
385 | ==[MT] With respect to carrier-grade, the yearly service outage of the
386 VNF should not exceed 5' 15" regardless of whether it is a planned or
387 unplanned outage. Considering the HA requirements, TL-9000 requires an
388 end-to-end service recovery time of 15 seconds based on which the ETSI
389 GS NFV-REL 001 V1.1.1 (2015-01) document defines three service
390 availability levels (SAL). The proposed example service recovery times
391 for these levels are:
393 | SAL2: 10-15 seconds
394 | SAL3: 20-25 seconds==
395 | ==[Pva] my comment was actually that the downtime metrics of the
396 underlying elements, components and services are a small fraction of the
397 total E2E service availability time. No-one on the E2E service path
398 will get the whole downtime allocation (in this context it includes
399 upgrade process related outages for the services provided by VIM etc.
400 elements that are subject to upgrade process).==
401 | ==[MT] So what you are saying is that the upgrade of any entity
402 (component, service) shouldn't cause even this much service
403 interruption. This was the reason I brought these figures here as well
404 that they are posing some kind of upper-upper boundary. Ideally the
405 interruption is in the millisecond range i.e. no more than a
406 switchover or a live migration.==
407 | ==[MT] Requirement: Any interruption caused to the VNF by the upgrade
408 of the NFVI should be in the sub-second range.==
410 ==[MT] In the future we also need to consider the upgrade of the NFVI,
411 i.e. HW, firmware, hypervisors, host OS etc.==
413 3. Functional Considerations
414 ----------------------------
416 3.1 Requirement of Escalator's Basic Actions
417 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
419 This section describes the basic functions that may be required by Escalator.
421 3.1.1 Preparation (offline)
422 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
424 This is the design phase when the upgrade plan (or upgrade campaign) is
425 being designed so that it can be executed automatically with minimal
426 service outage. It may include the following work:
428 #. Check the dependencies of the software modules and their impact,
429 backward compatibilities to figure out the appropriate upgrade method
431 #. Find out if a rolling upgrade could be planned with several rolling
432 steps to avoid any service outage due to upgrading some
433 parts/services at the same time.
434 #. Collect the proper version files and check the integration for
436 #. The preparation step should produce an output (i.e. upgrade
437 campaign/plan), which is executable automatically in an NFV Framework
438 and which can be validated before execution.
440 - The upgrade campaign should not be referring to scalable entities
441 directly, but allow for adaptation to the system configuration and
442 state at any given moment.
443 - The upgrade campaign should describe the ordering of the upgrade
444 of different entities so that dependencies and redundancies can be
445 maintained during the upgrade execution
446 - The upgrade campaign should provide information about the
447 applicable recovery procedures and their ordering.
448 - The upgrade campaign should consider information about the
449 verification/testing procedures to be performed during the upgrade
450 so that upgrade failures can be detected as soon as possible and
451 the appropriate recovery procedure can be identified and applied.
452 - The upgrade campaign should provide information on the expected
453 execution time so that hanging execution can be identified
454 - The upgrade campaign should indicate any point in the upgrade when
455 coordination with the users (VNFs) is required.
457 ==[hujie] Depending on the attributes of the object being upgraded, the
458 upgrade plan may be split into step(s) and/or sub-plan(s), and even
459 smaller sub-plans in the design phase. The plan(s) or sub-plan(s) may
460 include step(s) or sub-plan(s).==
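
One way to derive the ordering that an upgrade campaign must describe is a topological sort over the dependencies. The sketch below uses Python's standard ``graphlib`` module (Python 3.9+). The entity names and the rule that prerequisites are upgraded first are assumptions for illustration only; the actual ordering policy depends on the software being upgraded.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical entities; each maps to the entities that must be upgraded first.
upgrade_after = {
    "database": set(),
    "message-queue": set(),
    "controller-services": {"database", "message-queue"},
    "compute-agents": {"controller-services"},
}


def upgrade_order(dependencies):
    """Return one upgrade order in which prerequisites always come first."""
    return list(TopologicalSorter(dependencies).static_order())


order = upgrade_order(upgrade_after)
print(order)
```

``static_order()`` raises ``graphlib.CycleError`` if the dependencies are cyclic, which is one property that can be checked when the plan is validated.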
462 3.1.2 Validating the upgrade plan / Checking the pre-requisites of System (offline / online)
463 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
465 | The upgrade plan should be validated before the execution by testing
466 it in a test environment which is similar to the production environment.
467 | ==[MT] However it could also mean that we can identify some properties
468 that it should satisfy e.g. what operations can or cannot be executed
469 simultaneously like never take out two VMs of the same VNF.
470 | Another question is if it requires that the system is in a particular
471 state when the upgrade is applied. I.e. if there's certain amount of
472 redundancy in the system, migration is enabled for VMs, when the NFVI
473 is upgraded the VIM is healthy, when the VIM is upgraded the NFVI is
475 | I'm not sure what online validation means: Is it the validation of the
476 upgrade plan/campaign or the validation of the system that it is in a
477 state that the upgrade can be performed without too much risk?==
479 | Before the upgrade plan is executed, the system health of the
480 online production environment should be checked and confirmed to satisfy
481 the requirements which were described in the upgrade plan. The
482 sysinfo, which includes e.g. system alarms, performance statistics and
483 diagnostic logs, will be collected and analyzed. It is required to
484 resolve all of the system faults or exclude the unhealthy part before
485 executing the upgrade plan.
486 | ==[hujie] Text merged.==
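
A very small sketch of such a pre-check is shown below. It assumes, purely for illustration, that the collected sysinfo has already been reduced to a list of active alarms per node; the function name and the data layout are hypothetical.

```python
from typing import Dict, List, Tuple


def split_by_health(active_alarms: Dict[str, List[str]]) -> Tuple[List[str], List[str]]:
    """Return (ready, excluded) node lists based on active alarms per node."""
    ready = [node for node, alarms in active_alarms.items() if not alarms]
    excluded = [node for node, alarms in active_alarms.items() if alarms]
    return ready, excluded


ready, excluded = split_by_health({"node1": [], "node2": ["disk failure"], "node3": []})
print("upgrade:", ready, "resolve or exclude first:", excluded)
```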
488 3.1.3 Backup/Snapshot (online)
489 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
491 To avoid loss of data when an upgrade is unsuccessful, the
492 data should be backed up and a snapshot of the system state should be taken
493 before the execution of the upgrade plan. This would be considered in the
496 Several backups/snapshots may be generated and stored before the individual
497 steps of changes. The following data/files are required to be
500 #. Running version files for each node.
501 #. System components' configuration files and databases.
502 #. Images and storage, if necessary.
503 ==[MT] Does 3 imply VNF image and storage? I.e. VNF state and data?==
505 | ==[hujie] The following text is derived from previous "4. Negotiate
506 with the VNF if it's ready for the upgrade"==
507 | Although the upper layer, which includes VNFs and VNFMs, is out of the
508 scope of Escalator, it is still recommended to prepare it for a
509 smooth system upgrade. Escalator cannot guarantee the safety of the
510 VNFs. The upper layer should have some safeguard mechanism in its design
511 and be ready to avoid failures during a system upgrade.
513 3.1.4 Execution (online)
514 ^^^^^^^^^^^^^^^^^^^^^^^^
516 | The execution of the upgrade plan should be a dynamic procedure which is
517 controlled by Escalator.
518 | ==[hujie] Revised text to be general.==
520 #. It is required to support execution either in sequence or in
522 #. It is required to check the result of the execution and take
523 action according to the situation and the policies in the upgrade plan.
524 #. It is required to execute properly on various configurations of
525 system object, e.g. stand-alone, HA, etc.
526 #. It is required to execute on the designated different parts of the
527 system, e.g. physical server, virtualized server, rack, chassis,
528 cluster, even different geographical places.
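
The first two execution requirements can be illustrated with a short sketch. The function name ``execute``, the ``abort_on_failure`` policy flag and the thread-based parallelism are assumptions chosen for brevity; they are not taken from an Escalator design.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple

Action = Tuple[str, Callable[[], None]]  # (name, callable that raises on failure)


def execute(actions: List[Action], parallel: bool = False,
            abort_on_failure: bool = True) -> Dict[str, bool]:
    """Run actions in sequence or in parallel and report success per action."""
    results: Dict[str, bool] = {}
    if parallel:
        with ThreadPoolExecutor() as pool:
            futures = [(name, pool.submit(fn)) for name, fn in actions]
            for name, future in futures:
                # exception() waits for completion and returns None on success.
                results[name] = future.exception() is None
        return results
    for name, fn in actions:
        try:
            fn()
            results[name] = True
        except Exception:
            results[name] = False
            if abort_on_failure:
                break  # policy: stop here so that recovery can start
    return results
```

In sequential mode with ``abort_on_failure=True``, actions after the first failure are not attempted and do not appear in the result.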
530 3.1.5 Testing (online)
531 ^^^^^^^^^^^^^^^^^^^^^^
533 | Testing after upgrading the whole system or parts of it makes
534 sure that the upgraded system (object) is working normally.
535 | ==[hujie] Revised text to be general.==
537 #. It is recommended to run the prepared test cases to see if the
538 functionalities are available without any problem.
539 #. It is recommended to check the sysinfo, e.g. system alarms,
540 performance statistics and diagnostic logs to see if there are any
543 3.1.6 Restore/Rollback (online)
544 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
546 | When an upgrade fails, a quick system restore or system
547 rollback should be performed to recover the system and the services.
548 | ==[hujie] Revised text to be general.==
550 #. It is recommended to support system restore from backup when upgrade
552 #. It is recommended to support graceful rollback with reverse order
555 3.1.7 Monitoring (online)
556 ^^^^^^^^^^^^^^^^^^^^^^^^^
558 | Escalator should continually monitor the upgrade process. It
559 keeps the status of each module, each node and each cluster updated in a
560 status table during the upgrade.
561 | ==[hujie] Revised text to be general.==
563 #. It is required to collect the status of every object being upgraded
564 and to send alarms on abnormal conditions during the upgrade.
565 #. It is recommended to reuse the existing monitoring system, like alarm.
566 #. It is recommended to support proactive queries.
567 #. It is recommended to support passively waiting for notifications.
569 | **Two possible ways for monitoring:**
570 | **Pro-Actively Query** requires that the NFVI/VIM provide a proper API or CLI
571 interface. If Escalator serves as a service, it should pass on these
573 | **Passively Wait for Notification** requires that Escalator provide a
574 callback interface, which could be used by NFVI/VIM systems or an upgrade
575 agent to send back notifications.
576 | [hujie] I am not sure why we would not subscribe to the notifications.
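
Both monitoring styles can feed the same status table, as the sketch below suggests. The class and method names, and the ``"failed"`` status value, are hypothetical; the ``query`` callable stands in for whatever API or CLI the NFVI/VIM actually offers.

```python
from typing import Callable, Dict, Iterable


class StatusTable:
    """Keeps the latest known upgrade status per object."""

    def __init__(self) -> None:
        self.status: Dict[str, str] = {}

    def poll(self, objects: Iterable[str], query: Callable[[str], str]) -> None:
        """Proactive query: ask the NFVI/VIM side for each object's status."""
        for obj in objects:
            self.status[obj] = query(obj)

    def on_notification(self, obj: str, status: str) -> None:
        """Passive wait: callback invoked by the NFVI/VIM or an upgrade agent."""
        self.status[obj] = status

    def abnormal(self) -> Dict[str, str]:
        """Objects whose status should raise an alarm."""
        return {obj: s for obj, s in self.status.items() if s == "failed"}
```

Polling gives Escalator control over when the status is refreshed, while the callback path reports changes as soon as the notifying side sends them.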
578 3.1.8 Logging (online)
579 ^^^^^^^^^^^^^^^^^^^^^^
581 Record the information generated by Escalator into log files. The log
582 file is used for manual diagnosis of exceptions.
584 #. It is required to support logging.
585 #. It is recommended to include time stamp, object id, action name,
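
A log record carrying these fields could be produced with Python's standard ``logging`` module, as sketched below. The logger name ``escalator`` and the field names ``object_id`` and ``action`` are assumptions for the example, not a defined log format.

```python
import logging

logger = logging.getLogger("escalator")  # logger name is an assumption
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter(
    # asctime is the time stamp; object_id and action come from "extra".
    "%(asctime)s %(levelname)s object=%(object_id)s action=%(action)s %(message)s"
))
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("step finished", extra={"object_id": "compute-01", "action": "upgrade"})
```

With this formatter every call has to supply both extra fields, so a small wrapper function may be convenient in practice.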
588 3.1.9 Administrative Control (online)
589 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
591 Administrative Control is used to control the privilege to start any of
592 Escalator's actions, to avoid unauthorized operations.
594 #. It is required to support an administrative control mechanism.
595 #. It is recommended to reuse the system's own security system.
596 #. It is required to avoid conflicts when the system's own security system
599 3.2 Requirements on system object being upgraded
600 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
602 | ==We can develop BPs in the future from the requirements of this section and GA for
603 upstream projects==
604 | Escalator focuses on smooth upgrade. In a practical implementation, it
605 might be combined with installer/deployer, or act as an independent
606 tool/service. Either way, it requires that the target systems (NFVI and
607 VIM) are developed/deployed in a way that Escalator could perform
610 On the NFVI system, live migration is likely used to maintain availability
611 because OPNFV would like to make HA transparent to the end user. This
612 requires the VIM system to be able to put a compute node into maintenance mode
613 and thereby isolate it from normal service. Otherwise, new NFVI instances
614 might risk being scheduled onto the node being upgraded.
616 | On the VIM system, availability is likely achieved by redundancy. This
617 imposes fewer requirements on the system/services being upgraded (see PVA
618 comments in an earlier version). However, there should be a way to put the
619 target system into standby mode, because starting an upgrade on the
620 master node in a cluster is likely a bad idea.
621 | ==[hujie] Revised text to be general.==
623 #. It is required for NFVI/VIM to support **service handover** mechanism
624 that minimizes interruption to 0.001% (i.e. 99.999% service
625 availability). Possible implementations are live-migration, redundant
626 deployment, etc. (Note: for VIM, interruption could be less
628 #. It is required for NFVI/VIM to restore the earlier version in an efficient
629 way, such as **snapshot**.
630 #. It is required for NFVI/VIM to **migrate data** efficiently between
631 base and upgraded system.
632 ==[hujie] What is exact meaning of "base" here?==
633 #. It is recommended for NFVI/VIM's interface to support upgrade
634 orchestration, e.g. reading/setting system state
635 ==[hujie] I am not sure if it reflects the previous text.==
640 This section describes the use cases to verify the requirements of
643 4.1 Upgrade a system with minimal configuration
644 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
646 A minimal configuration system is normally deployed for experimental or
647 development usage, such as an OPNFV test bed. Although it does not have
648 a large workload, it is a typical system to be upgraded frequently.
650 4.2 Upgrade a system with HA configuration
651 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
653 An HA configuration system is very popular in the operator's data centre,
654 and it is a typical production environment. It is always running 7 \* 24
655 with VNFs running on it to provide services to the end users.
657 4.3 Upgrade a system with Multi-Site configuration
658 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
660 An upgrade in one site may cause service interruption in another site if
661 both sites are dependent and share the same modules/database (e.g. a
662 keystone for both sites).
664 If a site fails during an upgrade, a rollback that involves even minimal
665 state/data loss can affect or cause a failure at the dependent site.
667 ==Consider one site of ARNO release first. Then, multi-site in the
673 This section describes the reference architecture, the function blocks,
674 the function entities of Escalator so that the reader can understand how
675 the basic functions are organized.
680 | This section describes the information flows among the function
681 entities when Escalator is in action.
682 | We should consider a generic procedure / framework for upgrading, and
683 may provide a plugin interface for specialized tasks
688 This section describes the required interfaces of Escalator.
690 7.1 Manual Interface (CLI / GUI)
691 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
696 To support 3.3 Negotiate with the VNF if it's ready for the upgrade
698 7.3 Configuration File
699 ~~~~~~~~~~~~~~~~~~~~~~
701 This section will suggest a format of the configuration files and how to
707 This section will suggest a format of the log files and how to deal with
710 8. Requirements from other OPNFV projects
711 -----------------------------------------
713 | We have created a questionnaire for collecting other projects
715 (https://docs.google.com/forms/d/11o1mt15zcq0WBtXYK0n6lKF8XuIzQTwvv8ePTjmcoF0/viewform?usp=send_form),
717 | ==[hujie] Can we force other OPNFV projects to complete the survey by
718 using JIRA dependencies?==
723 | ==Note: This scenario could be out of scope in Escalator project, but
724 having the option to support this should be better to align with
725 Doctor requirements.==
726 | The scope of Doctor project also covers maintenance scenario in which
727 1) the VIM administrator requests host maintenance from the VIM, 2) the VIM
728 will notify the consumer, such as the VNFM, to trigger application level
729 migration or switching active-standby nodes, and 3) the VIM waits for a response
730 from the consumer for a short while.
732 - VIM should send out notification of VM migration to consumer (VNFM)
733 as an abstracted message like "maintenance".
734 - VIM could delay the VM migration until it receives a "VM ready to
735 maintenance" message from the owner (VNFM)
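
The notify-then-wait exchange can be sketched with a simple event, as below. The function name, the use of ``threading.Event`` for the readiness signal and the timeout value are assumptions for illustration. What the VIM does when the consumer does not answer in time is not specified here and is left to the caller.

```python
import threading
from typing import Callable


def request_host_maintenance(notify_consumer: Callable[[str], None],
                             ready: threading.Event,
                             wait_seconds: float) -> bool:
    """Notify the consumer, then wait a short while for its readiness signal.

    Returns True if the consumer signalled readiness before the timeout.
    """
    notify_consumer("maintenance")           # abstracted message to the VNFM
    return ready.wait(timeout=wait_seconds)  # True once the event is set
```

A consumer such as a VNFM would set the event after it has switched over or migrated its application, which corresponds to the "VM ready to maintenance" message.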
740 8.3 Multi-site Project
741 ~~~~~~~~~~~~~~~~~~~~~~
743 - Upgrading one site with Escalator should at least not cause API token
744 validation to fail at the other site.
749 | [1] ETSI GS NFV 002 (V1.1.1): "Architectural Framework"
750 | [2] ETSI GS NFV 003 (V1.1.1): "Terminology for Main Concepts in NFV".
751 | [3] ETSI GS NFV-SWA 001: "Virtual Network Function Architecture"
752 | [4] ETSI GS NFV-MAN 001: "Management and Orchestration"
753 | [5] ETSI GS NFV-REL 001: "Resiliency Requirements"
754 | [6] QuEST Forum TL-9000: "Quality Management System Requirement
756 | [7] Service Availability Forum AIS: "Software Management Framework"
758 10. Useful Working Drafts of ETSI NFV
759 -------------------------------------
761 | Access them with your own ETSI account, please DO NOT disclose the
763 | [1] Migrate Virtualised Compute Resource operation @ 7.3.1.8
764 | ftp://docbox.etsi.org/ISG/NFV/Open/Drafts/IFA005_Or-Vi_ref_point_Spec/NFV-IFA005v070.zip
765 | [2] Reliability issues during NFV Software upgrade and improvement
767 | ftp://@docbox.etsi.org/ISG/NFV/Open/Drafts/REL003_E2E_reliability_models/NFV-REL003v030.zip
775 Upgrading the different software modules may cause different impact on
776 the availability of the infrastructure resources and even on the service
777 continuity of the VNFs.
779 **Software modules in the computing nodes**
782 ==[MT] As SW module, we should list the host OS and maybe its
783 drivers as well. From an upgrade perspective, do we limit host OS
784 upgrades to patches only?==
785 #. Hypervisor, such as KVM, QEMU, XEN, libvirt
786 #. OpenStack agent in computing nodes (like Nova agent, Ceilometer
789 **Software modules in network nodes**
791 #. Neutron L2/L3 agent
792 #. OVS, SR-IOV Driver
794 **Software modules in storage nodes**
798 The table below analyses such an impact - considering a single instance
799 of each software module - from the following aspects:
801 - the function which will be lost during upgrade,
802 - the duration of the loss of this specific function,
803 - if this causes the loss of the VNF function,
804 - if it causes incompatibility in the different parts of the software,
805 - what should be backed up before the upgrade,
806 - the duration of restoration time if the upgrade fails
808 | The values provided come from internal testing and are based on some
809 assumptions; they may vary depending on the deployment techniques.
810 Please feel free to add if you find more efficient values during your
812 | https://wiki.opnfv.org/_media/upgrade_analysis_v0.5.xlsx
813 | Note that no redundancy of the software modules is considered in the