General Requirements Background and Terminology
-----------------------------------------------

Terminology and definitions
~~~~~~~~~~~~~~~~~~~~~~~~~~~
\r
NFVI
  The term is an abbreviation for Network Function Virtualization
  Infrastructure; sometimes it is also referred to as the data plane in
  this document.

VIM
  The term is an abbreviation for Virtualised Infrastructure Manager;
  sometimes it is also referred to as the control plane in this document.
\r
Operator
  The term refers to network service providers and Virtual Network
  Function (VNF) providers.

End-User
  The term refers to a subscriber of the Operator's services.

Network Service
  The term refers to a service provided by an Operator to its
  end-users using a set of (virtualized) Network Functions.
\r
Infrastructure Services
  The term refers to services provided by the NFV Infrastructure and
  the Management & Orchestration functions to the VNFs, i.e. the
  virtual resources as perceived by the VNFs.
\r
Smooth Upgrade
  The term refers to an upgrade that results in no service outage
  for the end-users.
\r
Rolling Upgrade
  The term refers to an upgrade strategy that upgrades each node or
  a subset of nodes in a wave rolling through the data centre. It
  is a popular upgrade strategy that maintains service availability.
\r
Parallel Universe Upgrade
  The term refers to an upgrade strategy that creates and deploys
  a new universe - a system with the new configuration - while the old
  system continues running. The state of the old system is transferred
  to the new system after sufficient testing of the new system.
\r
Infrastructure Resource Model
  The term refers to the representation of infrastructure resources,
  namely: the physical resources, the virtualization facility
  resources and the virtual resources.
\r
Physical Resource
  The term refers to a piece of hardware of the NFV infrastructure, which
  may also include the firmware that enables the hardware.
\r
Virtual Resource
  The term refers to a resource that is provided as a service built on
  top of the physical resources via the virtualization facilities; in
  particular, these are the resources on which VNF entities are deployed,
  e.g. the VMs, virtual switches, virtual routers, virtual disks, etc.
\r
.. <MT> I don't think the VNF is the virtual resource. Virtual
   resources are the VMs, virtual switches, virtual routers, virtual
   disks etc. The VNF uses them, but I don't think they are equal. The
   VIM doesn't manage the VNF, but it does manage virtual resources.
\r
Virtualization Facility
  The term refers to a resource that enables the creation
  of virtual environments on top of the physical resources, e.g.
  hypervisor, OpenStack, etc.
\r
Upgrade Plan (or Campaign?)
  The term refers to a choreography that describes how the upgrade should
  be performed in terms of its targets (i.e. upgrade objects), the
  steps/actions required to upgrade each of them, and the coordination of
  these steps so that service availability can be maintained. It is an
  input to an upgrade tool (Escalator) to carry out the upgrade.
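
An upgrade plan can be viewed as a small dependency graph of upgrade steps that must be executed in an order that never breaks dependencies. The sketch below (Python; the step names and the plan structure are illustrative assumptions, not Escalator's actual input format) shows one way to derive such an order:

```python
def order_steps(steps, deps):
    """Return an execution order for upgrade steps such that every step
    runs only after all steps it depends on, so that no version mismatch
    occurs mid-upgrade. `deps` maps a step to the set of steps that must
    precede it."""
    ordered, done = [], set()
    pending = list(steps)
    while pending:
        progressed = False
        for step in list(pending):
            # A step is ready once all of its prerequisites are done.
            if deps.get(step, set()) <= done:
                ordered.append(step)
                done.add(step)
                pending.remove(step)
                progressed = True
        if not progressed:
            raise ValueError("circular dependency in upgrade plan")
    return ordered

# Hypothetical example: upgrade the database before the API service,
# and the API service before the agents.
plan = order_steps(["db", "api", "agent"],
                   {"api": {"db"}, "agent": {"api"}})
```

An actual upgrade plan would also attach per-step actions and coordination constraints; the point here is only the dependency-respecting ordering.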
\r
Upgrade Objects
~~~~~~~~~~~~~~~

Physical Resources
^^^^^^^^^^^^^^^^^^

Most cloud infrastructures support dynamic addition/removal of
hardware. A hardware upgrade could be done by adding the new
hardware node and removing the old one. From the perspective of smooth
upgrade, the orchestration/scheduling of these actions is the primary
concern. Upgrading a physical resource,
like upgrading its firmware and/or modifying its configuration data, may
also be considered in the future.
\r
Virtual Resources
^^^^^^^^^^^^^^^^^

Virtual resource upgrades are mainly done by users. OPNFV may facilitate
the activity, but it is suggested to keep it on the long-term roadmap
rather than in the current scope.
\r
.. <MT> Same comment here: I don't think the VNF is the virtual
   resource. Virtual resources are the VMs, virtual switches, virtual
   routers, virtual disks etc. The VNF uses them, but I don't think they
   are equal. For example, if for some reason the hypervisor is changed and
   the current VMs cannot be migrated to the new hypervisor - they are
   incompatible - then the VMs need to be upgraded too. This is not
   something the NFVI user (i.e. the VNFs) would even know about.
\r
Virtualization Facility Resources
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
\r
Based on the functionality they provide, virtualization facility
resources can be divided into computing nodes, networking nodes,
storage nodes and management nodes.

The possible upgrade objects in these nodes are addressed below.
(Note: hardware-based virtualization may be considered a virtualization
facility resource, but from the escalator perspective it is better to
consider it as part of the hardware upgrade.)
\r
**Computing node**

1. OS kernel

2. Hypervisor and virtual switch

3. Other kernel modules, like drivers

4. User space software packages, like nova-compute agents and other
   control plane programs.

Updating 1 and 2 will cause the loss of virtualization functionality of
the compute node, which may lead to data plane service interruption
if the virtual resource is not redundant.

Updating 3 might result in the same.

Updating 4 might lead to control plane service interruption if it is not
an HA deployment.
\r
**Networking node**

1. OS kernel (optional): not all switches/routers allow upgrading their
   OS, since it is more like firmware than a generic OS.

2. User space software packages, like neutron agents and other control
   plane programs.

Updating 1, if allowed, will cause a node reboot and therefore leads to
data plane service interruption if the virtual resource is not
redundant.

Updating 2 might lead to control plane service interruption if it is not
an HA deployment.
\r
**Storage node**

1. OS kernel (optional): not all storage nodes allow upgrading their OS,
   since it is more like firmware than a generic OS.

2. Kernel modules, like drivers

3. User space software packages, control plane programs

Updating 1, if allowed, will cause a node reboot and therefore leads to
data plane service interruption if the virtual resource is not
redundant.

Updating 2 might result in the same.

Updating 3 might lead to control plane service interruption if it is not
an HA deployment.
\r
**Management node**

1. OS kernel

2. Kernel modules, like drivers

3. User space software packages, like databases, message queues and
   control plane programs.

Updating 1 will cause a node reboot and therefore leads to control
plane service interruption if it is not an HA deployment. Updating 2
might result in the same.

Updating 3 might lead to control plane service interruption if it is not
an HA deployment.
\r
Upgrade Span
~~~~~~~~~~~~

**Major Upgrade**

Upgrades between major releases may introduce significant changes in
function, configuration and data, such as the upgrade of OPNFV from
Arno to Brahmaputra.

**Minor Upgrade**

Upgrades within one major release do not lead to changes in the
structure of the platform and may not affect the schema of the
system data.
\r
Upgrade Granularity
~~~~~~~~~~~~~~~~~~~

Physical/Hardware Dimension
^^^^^^^^^^^^^^^^^^^^^^^^^^^
\r
Support full / partial upgrade for data centre, cluster, zone. Because
of its size, the upgrade of a data centre or a zone may be divided into
several batches. The upgrade of a cloud environment (cluster) may also
be partial. For example, in one cloud environment running a number of
VNFs, we may first upgrade only one of them to check stability and
performance before we upgrade all of them.
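
The batching idea can be sketched in a few lines (Python; node names and batch size are illustrative): a zone or data-centre upgrade is divided into consecutive batches so that only a bounded number of nodes is out of service at any time.

```python
def upgrade_in_batches(nodes, batch_size):
    """Divide a zone/data-centre upgrade into consecutive batches so that
    at most `batch_size` nodes are out of service at any one time."""
    return [nodes[i:i + batch_size] for i in range(0, len(nodes), batch_size)]

# e.g. a five-node zone upgraded two nodes at a time:
# upgrade_in_batches(["n1", "n2", "n3", "n4", "n5"], 2)
# -> [["n1", "n2"], ["n3", "n4"], ["n5"]]
```

Choosing the batch size is the trade-off between upgrade duration and the spare capacity the remaining nodes must provide while a batch is down.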
\r
Software Dimension
^^^^^^^^^^^^^^^^^^

- The upgrade of the host OS or kernel may need a 'hot migration'

- The upgrade of OpenStack's components:

  i. the one-shot upgrade of all components

  ii. the partial upgrade (or bugfix patch) which only affects some
      components (e.g. computing, storage, network, database, message
      queue, etc.)
\r
.. <MT> This section seems to overlap with 2.1.
   I can see the following dimensions for the software:

   - different software packages

     - different functions - Considering that the target versions of all
       software are compatible, the upgrade needs to ensure that any
       dependencies between SW and therefore packages are taken into account
       in the upgrade plan, i.e. no version mismatch occurs during the
       upgrade and therefore dependencies are not broken.

     - same function - This is an upgrade-specific question: whether
       different versions can coexist in the system while a SW is being
       upgraded from one version to another. This is particularly important
       for stateful functions, e.g. storage, networking, control services.
       The upgrade method must consider the compatibility of the redundant
       entities.

   - different versions of the same software package

     - major version changes - they may introduce incompatibilities. Even
       when there are backward compatibility requirements, changes may cause
       issues at graceful roll-back.

     - minor version changes - they must not introduce incompatibility
       between versions; these should be primarily bug fixes, so live
       patches should be possible.

   - different installations of the same software package

     - using different installation options - they may reflect different
       users with different needs, so redundancy issues are less likely
       between installations of different options; but they could be the
       reflection of a heterogeneous system, in which case they may provide
       redundancy for higher availability, i.e. deeper inspection is needed.

     - using the same installation options - they often reflect that they
       are used by redundant entities across space.

     - different distribution possibilities in space - same or different
       availability zones, multi-site, geo-redundancy.

   - different entities running from the same installation of a software
     package

     - using different start-up options - they may reflect different users,
       so redundancy may not be an issue between them.

     - using the same start-up options - they often reflect redundant
       entities.
\r
As the OPNFV end-users are primarily Telecom operators, the network
services provided by the VNFs deployed on the NFVI should meet the
requirement of 'Carrier Grade'.::

  In telecommunication, a "carrier grade" or "carrier class" refers to a
  system, or a hardware or software component, that is extremely reliable,
  well tested and proven in its capabilities. Carrier grade systems are
  tested and engineered to meet or exceed "five nines" high availability
  standards, and provide very fast fault recovery through redundancy
  (normally less than 50 milliseconds). [from wikipedia.org]

"Five nines" means an availability of 99.999%, i.e. at most about
5 minutes and 15 seconds of downtime in ONE YEAR.
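
The 5' 15" figure follows directly from the availability arithmetic; a quick check in Python:

```python
SECONDS_PER_YEAR = 365 * 24 * 3600          # non-leap year

def max_downtime_seconds(availability):
    """Maximum yearly downtime (in seconds) allowed by a given
    availability level, e.g. 0.99999 for 'five nines'."""
    return (1.0 - availability) * SECONDS_PER_YEAR

five_nines = max_downtime_seconds(0.99999)
minutes, seconds = divmod(five_nines, 60)
# -> about 5 minutes 15 seconds of allowed downtime per year
```

The same formula shows why "five nines" is so demanding: each extra nine divides the yearly budget by ten (e.g. "four nines" allows roughly 52 minutes).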
\r
We have learnt that a well prepared upgrade of OpenStack needs 10
minutes. The major part of the outage time is spent on
synchronizing the database. [from 'Ten minutes OpenStack Upgrade? Done!
\r
This 10 minutes of downtime of the OpenStack services, however, did not
impact the users, i.e. the VMs running on the compute nodes. It was an
outage of the control plane only. On the other hand, with respect to the
preparations, this was a manually tailored upgrade specific to the
particular deployment and the versions of each OpenStack service.
\r
The project targets a more generic methodology, which however
requires that the upgrade objects fulfil certain requirements. Since
this is only possible in the long run, we first target the upgrade
of the different VIM services from version to version.
\r
1. Can we manage to upgrade OPNFV in only 5 minutes?

.. <MT> The first question is whether we have the same carrier grade
   requirement on the control plane as on the user plane, i.e. how
   much control plane outage we can/are willing to tolerate.
   In the above case, if the database were only half the size, we could
   probably do the upgrade in 5 minutes, but is that good? It also means
   that if the database is twice as big, then the outage is 20
   minutes.
   For the user plane we should go for less, as with two releases yearly
   that means 10 minutes outage per year.
\r
.. <Malla> 10 minutes outage per year to the users? Plus, if we take the
   control plane into consideration, then the total outage will be
   more than 10 minutes in the whole network, right?

.. <MT> The control plane outage does not have to cause outage to
   the users, but it may, of course, depending on the size of the system,
   as it's more likely that there's a failure that needs to be handled
   by the control plane.
\r
2. Is it acceptable for end users? E.g. would a planned service
   interruption lasting more than ten minutes for a software
   upgrade be acceptable?

.. <MT> For the user plane, no, it's not acceptable in case of
   carrier-grade. The 5' 15" downtime should include unplanned and
   planned downtime.

.. <Malla> I agree with Maria, it is not acceptable.
\r
3. Will VNFs still work well when the VIM is down?

.. <MT> In case of OpenStack it seems yes. :)
\r
The maximum duration of an upgrade
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
\r
The duration of an upgrade is related to, and proportional to, the
scale and the complexity of the OPNFV platform, as well as the
granularity (in function and in space) of the upgrade.
\r
.. <Malla> Also, if it is a partial upgrade, like a module upgrade, it
   also depends on the OPNFV modules and their tightly coupled entities.

.. <MT> Since the maintenance window is shrinking and becoming
   non-existent, the duration of the upgrade is secondary to the
   requirement of smooth upgrade. But probably we want to be able to put a
   time constraint on each upgrade during which it must complete, otherwise
   it is considered failed and the system should be rolled back. I.e. in
   case of automatic execution it might not be clear if an upgrade is long
   or just hanging. The time constraints may be a function of the size of
   the system in terms of the upgrade object(s).
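
The time-constraint idea in the comment above can be sketched as follows (Python; the `(do, undo)` step pairs are a hypothetical interface, not an existing Escalator API): each step is executed against a time budget, and if the budget is exceeded the already-executed steps are undone in reverse order.

```python
import time

def run_with_time_constraint(steps, budget_seconds):
    """Execute a list of (do, undo) callables. If the time budget is
    exceeded, gracefully roll back the executed part in reverse order
    and report failure instead of hanging indefinitely."""
    executed = []
    start = time.monotonic()
    for do, undo in steps:
        if time.monotonic() - start > budget_seconds:
            for _, u in reversed(executed):
                u()                      # undo in reverse order
            return "rolled-back"
        do()
        executed.append((do, undo))
    return "upgraded"
```

In a real system the budget would be derived from the size of the upgrade object(s), as the comment suggests, rather than being a fixed constant.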
\r
The maximum duration of a roll back when an upgrade fails
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
\r
The duration of a roll back is shorter than that of the corresponding
upgrade. It depends on the duration of restoring the software and
configuration data from the pre-upgrade backup / snapshot.
\r
.. <MT> Requirement: It should be possible to roll back gracefully a
   failed upgrade of stateful services of the control plane.
\r
.. <MT> Two possible types of failures can happen during an upgrade:

   1. We can recover from the failure that occurred in the upgrade process:
      In this case, a graceful roll back of the executed part of the
      upgrade may be possible, which would "undo" the executed part in a
      similar fashion. Thus, such a roll back introduces no more service
      outage during an upgrade than the executed part introduced. This
      process typically requires the same amount of time as the executed
      portion of the upgrade and imposes minimal state/data loss.

   2. We cannot recover from the failure that occurred in the upgrade
      process: In this case, the system needs to fall back to an earlier
      consistent state by reloading this backed-up state. This is typically
      a faster recovery procedure than the graceful roll back, but can
      cause state/data loss. The state/data loss usually depends on the
      statefulness of the entities whose state is restored from the backup.
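
The two recovery paths described above can be contrasted in a small sketch (Python; the action and state representations are hypothetical, for illustration only): graceful roll back undoes executed actions in reverse order, while fall back restores a pre-upgrade snapshot and loses whatever changed since it was taken.

```python
def graceful_roll_back(executed_actions):
    """Undo the executed part of the upgrade in reverse order; takes
    roughly as long as the executed part, with minimal state loss."""
    return [f"undo:{a}" for a in reversed(executed_actions)]

def fall_back(snapshot, current_state):
    """Reload the pre-upgrade snapshot; faster than a graceful roll
    back, but every state change made since the snapshot is lost.
    Returns the restored state and the lost changes."""
    lost = {k: v for k, v in current_state.items() if snapshot.get(k) != v}
    return dict(snapshot), lost
```

The trade-off mirrors the text: path 1 preserves state at the cost of time, path 2 trades state/data loss for a faster recovery.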
\r
The maximum duration of a VNF interruption (service outage)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
\r
Since not the entire process of a smooth upgrade will affect the VNFs,
the duration of the VNF interruption may be shorter than the duration
of the upgrade. In some cases, the VNFs running without the control
of the VIM is acceptable.
\r
.. <MT> Should we require explicitly that the NFVI should be able to
   provide its services to the VNFs independently of the control plane?

.. <MT> Requirement: The upgrade of the control plane must not cause
   interruption of the NFVI services provided to the VNFs.
\r
.. <MT> With respect to carrier-grade, the yearly service outage of the
   VNF should not exceed 5' 15", regardless of whether it is planned or
   unplanned outage. Considering the HA requirements, TL 9000 requires an
   end-to-end service recovery time of 15 seconds, based on which the ETSI
   GS NFV-REL 001 V1.1.1 (2015-01) document defines three service
   availability levels (SAL). The proposed example service recovery times
   for these levels are:

   - SAL1: 5-6 seconds
   - SAL2: 10-15 seconds
   - SAL3: 20-25 seconds
\r
.. <Pva> My comment was actually that the downtime metrics of the
   underlying elements, components and services are a small fraction of
   the total E2E service availability time. No-one on the E2E service path
   will get the whole downtime allocation (in this context it includes
   upgrade process related outages for the services provided by VIM etc.
   elements that are subject to the upgrade process).
\r
.. <MT> So what you are saying is that the upgrade of any entity
   (component, service) shouldn't cause even this much service
   interruption. This was the reason I brought these figures here, as
   they pose some kind of upper boundary. Ideally the interruption is in
   the millisecond range, i.e. no more than a switch-over or a live
   migration.
\r
.. <MT> Requirement: Any interruption caused to the VNF by the upgrade
   of the NFVI should be in the sub-second range.

.. <MT> In the future we also need to consider the upgrade of the NFVI,
   i.e. HW, firmware, hypervisors, host OS etc.