General Requirements Background and Terminology
-----------------------------------------------

Terminologies and definitions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- **NFVI** is an abbreviation for Network Function Virtualization
  Infrastructure; it is sometimes also referred to as the data plane in
  this document.
- **VIM** is an abbreviation for Virtual Infrastructure Management;
  it is sometimes also referred to as the control plane in this document.
- **Operators** are network service providers and Virtual Network
  Function (VNF) providers.
- **End-Users** are subscribers of the Operators' services.
- **Network Service** is a service provided by an Operator to its
  End-Users using a set of (virtualized) Network Functions.
- **Infrastructure Services** are those provided by the NFV
  Infrastructure and the Management & Orchestration functions to the
  VNFs, i.e. these are the virtual resources as perceived by the VNFs.
- **Smooth Upgrade** means that the upgrade results in no service
  outage for the End-Users.
- **Rolling Upgrade** is an upgrade strategy that upgrades each node or
  a subset of nodes in waves rolling through the data centre. It
  is a popular upgrade strategy for maintaining service availability.
- **Parallel Universe** is an upgrade strategy that creates and deploys
  a new universe - a system with the new configuration - while the old
  system continues running. The state of the old system is transferred
  to the new system after sufficient testing of the latter.
- **Infrastructure Resource Model** ==(suggested by MT)== is identified
  as: physical resources, virtualization facility resources and virtual
  resources.
- **Physical Resources** are the hardware of the infrastructure, and may
  also include the firmware that enables the hardware.
- **Virtual Resources** are resources provided as services built on top
  of the physical resources via the virtualization facilities; in our
  case, they are the components that VNF entities are built on, e.g.
  the VMs, virtual switches, virtual routers, virtual disks, etc.
  ==[MT] I don't think the VNF is the virtual resource. Virtual
  resources are the VMs, virtual switches, virtual routers, virtual
  disks etc. The VNF uses them, but I don't think they are equal. The
  VIM doesn't manage the VNF, but it does manage virtual resources.==
- **Virtualization Facilities** are resources that enable the creation
  of virtual environments on top of the physical resources, e.g.
  hypervisor, OpenStack, etc.

| Most cloud infrastructures support dynamic addition and removal of
  hardware. A hardware upgrade can be done by removing the old
  hardware node and adding the new one. Upgrading a physical resource
  in place, such as upgrading the firmware or modifying the
  configuration data, may be considered in the future.

| Virtual resource upgrades are mainly done by users. OPNFV may facilitate
  the activity, but we suggest placing it on the long-term roadmap instead of
| ==[MT] same comment here: I don't think the VNF is the virtual
  resource. Virtual resources are the VMs, virtual switches, virtual
  routers, virtual disks etc. The VNF uses them, but I don't think they
  are equal. For example, if for some reason the hypervisor is changed and
  the current VMs cannot be migrated to the new hypervisor because they are
  incompatible, then the VMs need to be upgraded too. This is not
  something the NFVI user (i.e. the VNFs) would even know about.==

Virtualization Facility Resources
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

| Based on the functionality they provide, virtualization facility
  resources can be divided into computing nodes, networking nodes,
  storage nodes and management nodes.
| The possible upgrade objects in these nodes are addressed below.
  (Note: hardware-based virtualization may be considered a virtualization
  facility resource, but from the Escalator perspective it is better
  considered part of the hardware upgrade.)

Compute node
''''''''''''

1. OS kernel
2. Hypervisor and virtual switch
3. Other kernel modules, like drivers
4. User space software packages, like nova-compute agents and other
   control plane programs

| Updating 1 and 2 will cause the loss of virtualization functionality of
  the compute node, which may lead to data plane service interruption
  if the virtual resource is not redundant.
| Updating 3 might have the same result.
| Updating 4 might lead to control plane service interruption if not an
  HA deployment.

Network node
''''''''''''

1. OS kernel (optional): not all switches/routers allow you to upgrade
   the OS, since it is more like firmware than a generic OS.
2. User space software packages, like neutron agents and other control
   plane programs

| Updating 1, if allowed, will cause a node reboot and therefore lead to
  data plane service interruption if the virtual resource is not
  redundant.
| Updating 2 might lead to control plane service interruption if not an
  HA deployment.

Storage node
''''''''''''

1. OS kernel (optional): not all storage nodes allow you to upgrade the
   OS, since it is more like firmware than a generic OS.
2. Kernel modules
3. User space software packages, control plane programs

| Updating 1, if allowed, will cause a node reboot and therefore lead to
  data plane service interruption if the virtual resource is not
  redundant.
| Updating 2 might have the same result.
| Updating 3 might lead to control plane service interruption if not an
  HA deployment.

Management node
'''''''''''''''

1. OS kernel
2. Kernel modules, like drivers
3. User space software packages, like database, message queue and
   control plane programs

| Updating 1 will cause a node reboot and therefore lead to control
  plane service interruption if not an HA deployment. Updating 2 might
  have the same result.
| Updating 3 might lead to control plane service interruption if not an
  HA deployment.
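
The per-node notes above can be condensed into a small lookup. The following is only a sketch under stated assumptions: the node and layer labels (``"kernel-level"``, ``"user-space"``) and the function name are illustrative and are not taken from any OPNFV or OpenStack tool. Kernel-level here groups the OS kernel, hypervisor, virtual switch and kernel modules.

```python
from enum import Enum
from typing import Optional


class Plane(Enum):
    DATA = "data plane"
    CONTROL = "control plane"


# Plane that an upgrade may interrupt, per node type and software layer.
# The keys are illustrative labels, not identifiers from a real tool.
AFFECTED_PLANE = {
    ("compute", "kernel-level"): Plane.DATA,
    ("compute", "user-space"): Plane.CONTROL,
    ("network", "kernel-level"): Plane.DATA,
    ("network", "user-space"): Plane.CONTROL,
    ("storage", "kernel-level"): Plane.DATA,
    ("storage", "user-space"): Plane.CONTROL,
    ("management", "kernel-level"): Plane.CONTROL,
    ("management", "user-space"): Plane.CONTROL,
}


def possible_interruption(
    node: str, layer: str, redundant_resources: bool, ha_deployment: bool
) -> Optional[Plane]:
    """Return the plane that may be interrupted, or None if it is protected.

    Data plane services are protected by redundant virtual resources;
    control plane services are protected by an HA deployment.
    """
    plane = AFFECTED_PLANE[(node, layer)]
    protected = redundant_resources if plane is Plane.DATA else ha_deployment
    return None if protected else plane
```

For example, a kernel upgrade on a compute node without redundant virtual resources is reported as a possible data plane interruption, while the same upgrade on a management node depends only on whether the deployment is HA.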

| Upgrades between major releases may introduce significant changes in
  function, configuration and data, such as the upgrade of OPNFV from
| Upgrades within one major release would not change
  the structure of the platform and may not affect the schema of the

Physical/Hardware Dimension
^^^^^^^^^^^^^^^^^^^^^^^^^^^

Full and partial upgrades should be supported for a data centre, cluster
or zone. The upgrade of a data centre or a zone may be divided into
several batches. The upgrade of a cloud environment (cluster) may also
be partial. For example, in a cloud environment running a number of
VNFs, we may first upgrade just one of them to check stability and
performance before upgrading all of them.
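
A minimal sketch of such a batched upgrade, assuming a single trial ("canary") node is upgraded first and the remaining nodes follow in fixed-size batches. The function names, the ``upgrade`` and ``healthy`` callbacks and the node names are hypothetical placeholders for whatever the deployment actually uses.

```python
from typing import Callable, List, Sequence


def upgrade_plan(nodes: Sequence[str], batch_size: int) -> List[List[str]]:
    """Split nodes into a one-node trial batch followed by regular batches."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    if not nodes:
        return []
    plan = [[nodes[0]]]  # trial batch: a single node
    rest = nodes[1:]
    for start in range(0, len(rest), batch_size):
        plan.append(list(rest[start:start + batch_size]))
    return plan


def run_plan(
    plan: List[List[str]],
    upgrade: Callable[[str], None],
    healthy: Callable[[str], bool],
) -> List[str]:
    """Upgrade batch by batch and stop after the first unhealthy batch.

    Returns the nodes that were upgraded, including those of the batch
    that failed the health check.
    """
    upgraded: List[str] = []
    for batch in plan:
        for node in batch:
            upgrade(node)
        upgraded.extend(batch)
        if not all(healthy(node) for node in batch):
            break
    return upgraded
```

Stopping at the first unhealthy batch limits the number of nodes that would need a roll back; how health is judged (stability, performance) is left to the ``healthy`` callback.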

- The upgrade of the host OS or kernel may need a 'hot migration'
- The upgrade of OpenStack’s components

  i. the one-shot upgrade of all components
  ii. the partial upgrade (or bugfix patch) which only affects some
      components (e.g., computing, storage, network, database, message
      queue, etc.)

| ==[MT] this section seems to overlap with 2.1.==
| I can see the following dimensions for the software

- different software packages

  - different functions - Considering that the target versions of all
    software are compatible, the upgrade needs to ensure that any
    dependencies between SW, and therefore between packages, are taken
    into account in the upgrade plan, i.e. no version mismatch occurs
    during the upgrade and dependencies are not broken
  - same function - Whether different versions can coexist in the system
    while a SW is being upgraded from one version to another is an
    upgrade-specific question. This is particularly important for
    stateful functions, e.g. storage, networking, control services. The
    upgrade method must consider the compatibility of the redundant
    entities.

- different versions of the same software package

  - major version changes - they may introduce incompatibilities. Even
    when there are backward compatibility requirements, changes may cause
    issues at graceful rollback
  - minor version changes - they must not introduce incompatibility
    between versions; these should primarily be bug fixes, so live
    patches should be possible

- different installations of the same software package

  - using different installation options - they may reflect different
    users with different needs, so redundancy issues are less likely
    between installations with different options; but they could also
    reflect a heterogeneous system, in which case they may provide
    redundancy for higher availability, i.e. deeper inspection is needed
  - using the same installation options - they often reflect that they
    are used by redundant entities across space

- different distribution possibilities in space - same or different
  availability zones, multi-site, geo-redundancy

- different entities running from the same installation of a software
  package

  - using different startup options - they may reflect different users, so
    redundancy may not be an issue between them
  - using the same startup options - they often reflect redundant
    entities
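
One way to take package dependencies into account in an upgrade plan is to order the packages topologically. The sketch below assumes that a dependency should be upgraded before the packages that depend on it; that ordering policy, the function name and the package names are assumptions made for illustration, not requirements stated in this document. It uses ``graphlib`` from the Python standard library (Python 3.9 or later).

```python
from graphlib import CycleError, TopologicalSorter
from typing import Dict, List, Set


def upgrade_order(dependencies: Dict[str, Set[str]]) -> List[str]:
    """Return packages ordered so that dependencies come first.

    `dependencies` maps each package to the set of packages it depends on.
    Raises ValueError if the dependencies are circular, because no
    consistent upgrade order exists in that case.
    """
    try:
        return list(TopologicalSorter(dependencies).static_order())
    except CycleError as exc:
        raise ValueError(f"circular dependency: {exc.args[1]}") from exc


# Hypothetical package names, used only to show the call.
example = {
    "control-service": {"client-library"},
    "client-library": {"kernel-module"},
    "kernel-module": set(),
}
print(upgrade_order(example))
```

A topological order only addresses the sequence of upgrades; whether two versions can coexist while the upgrade is in progress still has to be checked per package.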

As the OPNFV end-users are primarily Telco operators, the network
services provided by the VNFs deployed on the NFVI should meet the
requirement of 'Carrier Grade'.

In telecommunication, "carrier grade" or "carrier class" refers to a
system, or a hardware or software component, that is extremely reliable,
well tested and proven in its capabilities. Carrier grade systems are
tested and engineered to meet or exceed "five nines" high availability
standards, and provide very fast fault recovery through redundancy
(normally less than 50 milliseconds). [from wikipedia.org]

"Five nines" means that the system is working all the time in ONE YEAR
except for 5'15".
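
The 5'15" figure follows directly from the availability target. A short worked example, assuming a 365-day year (the function names are illustrative):

```python
def downtime_budget_seconds(nines: int, days_per_year: float = 365.0) -> float:
    """Allowed downtime per year for an availability of `nines` nines."""
    unavailability = 1 / 10 ** nines  # five nines -> 0.00001
    return unavailability * days_per_year * 24 * 3600


def as_minutes_seconds(seconds: float) -> str:
    """Format a duration in the 5'15" notation used in this document."""
    minutes, secs = divmod(round(seconds), 60)
    return f"{minutes}'{secs:02d}\""


print(as_minutes_seconds(downtime_budget_seconds(5)))
```

With five nines the budget is about 315 seconds per year, i.e. 5'15"; each additional nine divides the budget by ten.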

We have learnt that a well prepared upgrade of OpenStack needs 10
minutes. The major part of the outage time is spent on
synchronizing the database. [from ' Ten minutes OpenStack Upgrade? Done!

This 10 minutes of downtime of OpenStack, however, did not impact the
users, i.e. the VMs running on the compute nodes. This was an outage of
the control plane only. On the other hand, with respect to the
preparations, this was a manually tailored upgrade specific to the
particular deployment and the versions of each OpenStack service.

The project aims to achieve a more generic methodology, which however
requires that the upgrade objects fulfill certain requirements. Since
this is only possible in the long run, we first target upgrades from
version to version for the different VIM services.

#. | Can we manage to upgrade OPNFV in only 5 minutes?
   | ==[MT] The first question is whether we have the same carrier grade
     requirement on the control plane as on the user plane, i.e. how
     much control plane outage can we / are we willing to tolerate?
   | In the above case, if the database were only half the size, we
     could probably do the upgrade in 5 minutes, but is that good? It
     also means that if the database is twice as large then the outage
     is 20 minutes.
   | For the user plane we should go for less, as with two releases yearly
     that means 10 minutes of outage per year.==
   | ==[Malla] 10 minutes of outage per year to the users? Plus, if we take
     the control plane into consideration, then the total outage will be
     more than 10 minutes in the whole network, right?==
   | ==[MT] The control plane outage does not have to cause an outage to
     the users, but it may, of course, depending on the size of the
     system, as it's more likely that there's a failure that needs to be
     handled by the control plane.==

#. | Is it acceptable for end users? For example, a planned service
     interruption lasting more than ten minutes for a software
     upgrade?
   | ==[MT] For the user plane, no, it's not acceptable in the case of
     carrier grade. The 5' 15" downtime should include unplanned and
     planned downtime.==
   | ==[Malla] I agree with Maria, it is not acceptable.==

#. | Will any VNFs still work well when the VIM is down?
   | ==[MT] In the case of OpenStack it seems yes. :)==
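
The arithmetic behind this discussion can be written down as a rough model. It assumes, as the comments above do, that the outage scales linearly with the database size relative to the 10-minute reference upgrade, and that every release costs one such outage; the function names and the linear model itself are simplifying assumptions.

```python
FIVE_NINES_BUDGET_MINUTES = 5.25  # 5'15" per year


def estimated_outage_minutes(db_size_ratio: float,
                             reference_minutes: float = 10.0) -> float:
    """Outage estimate, assuming it is proportional to the database size."""
    return reference_minutes * db_size_ratio


def yearly_upgrade_outage_minutes(outage_per_upgrade: float,
                                  releases_per_year: int = 2) -> float:
    """Total planned outage per year caused by upgrades alone."""
    return outage_per_upgrade * releases_per_year


def fits_five_nines(yearly_outage_minutes: float) -> bool:
    """True if the yearly outage stays within the five-nines budget."""
    return yearly_outage_minutes <= FIVE_NINES_BUDGET_MINUTES


half_size = estimated_outage_minutes(0.5)            # 5 minutes
per_year = yearly_upgrade_outage_minutes(half_size)  # 10 minutes
print(per_year, fits_five_nines(per_year))
```

Even a 5-minute upgrade, performed twice a year, exceeds the yearly five-nines budget on its own, before any unplanned outage is counted.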

The maximum duration of an upgrade
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

| The duration of an upgrade is related to and proportional to the
  scale and the complexity of the OPNFV platform as well as the
  granularity (in function and in space) of the upgrade.
| [Malla] Also, if it is a partial upgrade like a module upgrade, it
  also depends on the OPNFV modules and the entities tightly connected
  to them.

The maximum duration of a roll back when an upgrade fails
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

| The duration of a roll back is shorter than that of the corresponding
  upgrade. It depends on the time needed to restore the software and
  configuration data from the pre-upgrade backup / snapshot.
| ==[MT] During the upgrade process two types of failure may happen:
| In case we can recover from the failure by undoing the upgrade
  actions, it is possible to roll back the already executed part of the
  upgrade in a graceful manner, introducing no more service outage than
  what was introduced during the upgrade. Such a graceful roll back
  typically requires the same amount of time as the executed portion of
  the upgrade and imposes minimal state/data loss.==
| ==[MT] Requirement: It should be possible to gracefully roll back the
  failed upgrade of stateful services of the control plane.
| In case we cannot recover from the failure by just undoing the
  upgrade actions, we have to restore the upgraded entities from their
  backed-up state. In other terms, the system falls back to an earlier
  state, which is typically a faster recovery procedure than a graceful
  roll back and, depending on the statefulness of the entities involved,
  may result in significant state/data loss.==
| **Two possible types of failures can happen during an upgrade**

#. We can recover from the failure that occurred in the upgrade process:
   In this case, a graceful roll back of the executed part of the
   upgrade may be possible, which would "undo" the executed part in a
   similar fashion. Thus, such a roll back introduces no more service
   outage than the executed part of the upgrade introduced. This
   process typically requires the same amount of time as the executed
   portion of the upgrade and imposes minimal state/data loss.
#. We cannot recover from the failure that occurred in the upgrade
   process: In this case, the system needs to fall back to an earlier
   consistent state by reloading this backed-up state. This is typically
   a faster recovery procedure than the graceful roll back, but can cause
   state/data loss. The state/data loss usually depends on the
   statefulness of the entities whose state is restored from the backup.
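
The two recovery paths can be sketched as follows. This is a simplified illustration, not the Escalator design: each upgrade step is assumed to come with an ``undo`` action, a failing step is assumed to leave nothing behind that needs undoing, and ``restore_backup`` stands for whatever mechanism reloads the backed-up state. All names are hypothetical.

```python
from typing import Callable, List, Tuple

Step = Tuple[Callable[[], None], Callable[[], None]]  # (do, undo)


def run_upgrade(steps: List[Step], restore_backup: Callable[[], None]) -> str:
    """Run the steps in order and recover if one of them fails.

    Returns "upgraded", "rolled_back" (graceful undo of the executed
    steps, in reverse order) or "restored" (fall back to the backup
    because the graceful roll back itself failed).
    """
    undo_stack: List[Callable[[], None]] = []
    try:
        for do, undo in steps:
            do()
            undo_stack.append(undo)  # only completed steps are undone
    except Exception:
        try:
            while undo_stack:
                undo_stack.pop()()  # undo the most recent step first
        except Exception:
            restore_backup()  # undo not possible: reload the backed-up state
            return "restored"
        return "rolled_back"
    return "upgraded"
```

Because the graceful path replays the executed steps in reverse, its duration grows with the executed portion of the upgrade, whereas the backup path costs a single restore but may lose state written since the backup was taken.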

The maximum duration of a VNF interruption (Service outage)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

| Since not the entire process of a smooth upgrade will affect the VNFs,
  the duration of the VNF interruption may be shorter than the duration
  of the upgrade. In some cases, it is acceptable for the VNF to run
  without the control of the VIM.
| ==[MT] Should we require explicitly that the NFVI should be able to
  provide its services to the VNFs independently of the control plane?==
| ==[MT] Requirement: The upgrade of the control plane must not cause
  interruption of the NFVI services provided to the VNFs.==
| ==[MT] With respect to carrier grade, the yearly service outage of the
  VNF should not exceed 5' 15" regardless of whether it is a planned or
  unplanned outage. Considering the HA requirements, TL-9000 requires an
  end-to-end service recovery time of 15 seconds, based on which the ETSI
  GS NFV-REL 001 V1.1.1 (2015-01) document defines three service
  availability levels (SAL). The proposed example service recovery times
  for these levels are:
| SAL2: 10-15 seconds
| SAL3: 20-25 seconds==
| ==[Pva] my comment was actually that the downtime metrics of the
  underlying elements, components and services are a small fraction of
  the total E2E service availability time. No one on the E2E service path
  will get the whole downtime allocation (in this context it includes
  upgrade process related outages for the services provided by the VIM
  etc. elements that are subject to the upgrade process).==
| ==[MT] So what you are saying is that the upgrade of any entity
  (component, service) shouldn't cause even this much service
  interruption. This was the reason I brought these figures here as well:
  they pose some kind of upper-upper boundary. Ideally the
  interruption is in the millisecond range, i.e. no more than a
  switchover or a live migration.==
| ==[MT] Requirement: Any interruption caused to the VNF by the upgrade
  of the NFVI should be in the sub-second range.==

==[MT] In the future we also need to consider the upgrade of the NFVI,
i.e. HW, firmware, hypervisors, host OS etc.==