Most cloud infrastructures support the dynamic addition and removal of
hardware. Accordingly, a hardware upgrade could be done by adding the new
piece of hardware and removing the old one. From the perspective of smooth
upgrade, the orchestration/scheduling of these actions is the primary concern.

Upgrading a physical resource may also involve upgrading its firmware
and/or modifying its configuration data. This may require the restart of the
Addition and removal of virtual resources may be initiated by the users or be
a result of an elasticity action. Users may also request the upgrade of their
virtual resources using a new VM image.

.. Needs to be moved to requirement section: Escalator should facilitate such an
   option and allow for a smooth upgrade.
On the other hand, changes in the infrastructure, namely in the hardware and/or
the virtualization facility resources, may result in the upgrade of the virtual
resources. For example, if for some reason the hypervisor is changed and
the current VMs cannot be migrated to the new hypervisor because they are
incompatible, then the VMs need to be upgraded too. This is not
something the NFVI user (i.e. the VNFs) would know about.

Virtualization Facility Resources
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Based on the functionality they provide, virtualization facility
resources could be divided into computing node, networking node,
storage node and management node.

The possible upgrade objects in these nodes are considered below.
(Note: hardware-based virtualization may be considered a virtualization
facility resource, but from the Escalator perspective it is better to
consider it as part of the hardware upgrade.)
2. Hypervisor and virtual switch

3. Other kernel modules, like drivers

4. User space software packages, like nova-compute agents and other
   control plane programs.

Updating 1 and 2 will cause the loss of virtualization functionality of
the compute node, which may lead to the interruption of data plane services
if the virtual resource is not redundant.

Updating 3 might have the same result.

Updating 4 might lead to control plane services interruption if not an
HA deployment.

.. <MT> I'm not sure why would 4 cause control plane interruption on a
   compute node. My understanding is that simply the node cannot be managed.
   Redundancy won't help in that either.
1. OS kernel, optional; not all switches/routers allow upgrading their
   OS, since it is more like firmware than a generic OS.

2. User space software package, like neutron agents and other control
   plane programs.

Updating 1, if allowed, will cause a node reboot and therefore lead to
data plane service interruption if the virtual resource is not
redundant.

Updating 2 might lead to control plane services interruption if not an
HA deployment.
1. OS kernel, optional; not all storage nodes allow upgrading their OS,
   since it is more like firmware than a generic OS.

3. User space software packages, control plane programs

Updating 1, if allowed, will cause a node reboot and therefore lead to
data plane services interruption if the virtual resource is not
redundant.

Updating 2 might result in the same.

Updating 3 might lead to control plane services interruption if not an
HA deployment.
2. Kernel modules, like drivers

3. User space software packages, like database, message queue and
   control plane programs.

Updating 1 will cause a node reboot and therefore lead to control
plane services interruption if not an HA deployment. Updating 2 might

Updating 3 might lead to control plane services interruption if not an
HA deployment.
The granularity of an upgrade can be characterized from two perspectives:

- the physical dimension and
- the software dimension
The physical dimension characterizes the number of similar upgrade objects
targeted by the upgrade, i.e. whether it is a full or partial upgrade of a
data centre, cluster or zone.
Because the upgrade of a data centre or a zone can be large, it may be
divided into several batches. Thus there is a need for efficiency in the
execution of upgrades of a potentially huge number of upgrade objects while
still maintaining the availability needed to fulfill the requirement of
smooth upgrade.
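The batching idea can be illustrated with a minimal sketch. It assumes that availability is expressed as the maximum percentage of similar upgrade objects that may be out of service at once. The function name ``plan_batches`` and the parameter ``max_unavailable_percent`` are hypothetical and are not part of Escalator.

```python
def plan_batches(upgrade_objects, max_unavailable_percent):
    """Split upgrade objects into batches that are upgraded one after another.

    Hypothetical helper: max_unavailable_percent is an assumed policy value
    (1..100) limiting how many objects may be offline at the same time.
    """
    if not 1 <= max_unavailable_percent <= 100:
        raise ValueError("max_unavailable_percent must be between 1 and 100")
    # Integer arithmetic avoids floating point rounding; always upgrade at
    # least one object per batch so the campaign makes progress.
    batch_size = max(1, len(upgrade_objects) * max_unavailable_percent // 100)
    return [
        upgrade_objects[i:i + batch_size]
        for i in range(0, len(upgrade_objects), batch_size)
    ]


if __name__ == "__main__":
    nodes = [f"compute-{i}" for i in range(10)]
    for number, batch in enumerate(plan_batches(nodes, 30), start=1):
        print(number, batch)
```

A larger percentage gives fewer, bigger batches and a faster campaign. A smaller one keeps more capacity in service at any time.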
The upgrade of a cloud environment (cluster) may also
be partial. For example, in one cloud environment running a number of
VNFs, we may just try to upgrade one of them to check the stability and
performance before we upgrade all of them.
Thus there is a need for proper organization of the artifacts associated with
the different upgrade objects. Also, the different versions should be able
to coexist beyond the upgrade period.
From this perspective, special attention may be needed when upgrading
objects that are collaborating in a redundancy scheme, as in this case
different versions not only need to coexist but also collaborate. This
primarily puts requirements on the upgrade objects. If this is not possible,
the upgrade campaign should be designed in such a way that the proper
isolation is ensured.
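As a rough illustration of this choice, the sketch below plans the upgrade steps for one redundancy group. It assumes a flag that says whether old and new versions can collaborate. It also models isolation simply as upgrading all members in a single step, which is only one possible interpretation. ``RedundancyGroup`` and ``plan_group`` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RedundancyGroup:
    """Hypothetical description of objects collaborating for redundancy."""
    name: str
    members: tuple
    mixed_versions_ok: bool  # assumption: known from compatibility data


def plan_group(group):
    """Return the upgrade steps; each step lists members upgraded together."""
    if group.mixed_versions_ok:
        # Old and new versions can collaborate: upgrade one member at a
        # time so the rest of the group keeps providing the service.
        return [[member] for member in group.members]
    # Versions cannot collaborate: never run a mixed group, so treat the
    # whole group as one isolated step of the campaign.
    return [list(group.members)]


if __name__ == "__main__":
    print(plan_group(RedundancyGroup("db", ("db-1", "db-2"), True)))
    print(plan_group(RedundancyGroup("mq", ("mq-1", "mq-2"), False)))
```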
The software dimension of the upgrade characterizes the upgrade object
type targeted and the combination in which they are upgraded together.

Even though the upgrade may
initially target only one type of upgrade object, e.g. the hypervisor,
the dependency of other upgrade objects on this initial target object may
require their upgrade as well, i.e. the upgrades need to be combined. From this
perspective the main concern is the compatibility of the dependent and
sponsor objects. To take these dependencies into consideration,
they need to be described together with the version compatibility information.
Breaking dependencies is the major cause of outages during upgrades.
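A minimal sketch of such a description follows. It assumes that compatibility is recorded as explicit pairs of dependent and sponsor versions. The ``Dependency`` type, the ``broken_dependencies`` function and the version strings are hypothetical and only illustrate how a planned upgrade could be checked before it is executed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dependency:
    """Hypothetical record: `dependent` relies on `sponsor`."""
    dependent: str
    sponsor: str
    # Allowed (dependent_version, sponsor_version) combinations.
    compatible: frozenset


def broken_dependencies(versions, dependencies):
    """Return the dependencies violated by the given object versions."""
    return [
        dep for dep in dependencies
        if (versions[dep.dependent], versions[dep.sponsor]) not in dep.compatible
    ]


if __name__ == "__main__":
    deps = [Dependency("nova-compute", "hypervisor",
                       frozenset({("A", "1"), ("B", "2")}))]
    current = {"nova-compute": "A", "hypervisor": "1"}

    # Upgrading only the hypervisor breaks the dependency ...
    hypervisor_only = {**current, "hypervisor": "2"}
    print(broken_dependencies(hypervisor_only, deps))

    # ... so the two upgrades need to be combined.
    combined = {"nova-compute": "B", "hypervisor": "2"}
    print(broken_dependencies(combined, deps))
```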
In other cases it is more efficient to upgrade a combination of upgrade
objects than to do it one by one. One aspect of the combination is how
the upgrade packages can be combined: whether a new image can be created for
them beforehand, or the different packages can be installed during the upgrade
independently but activated together.

The combination of upgrade objects may span across
layers (e.g. software stack in the host and the VM of the VNF).
Thus, it may require additional coordination between the management layers.
With respect to each upgrade object type and even stacks we can
distinguish major and minor upgrades:

Upgrades between major releases may introduce significant changes in
function, configuration and data, such as the upgrade of OPNFV from

Upgrades within one major release, which would not lead to changes in
the structure of the platform and may not affect the schema of the
Considering availability and therefore smooth upgrade, one of the major
concerns is the predictability and control of the outcome of the different
upgrade operations. Ideally an upgrade can be performed without impacting any
entity in the system, which means none of the operations change or potentially
change the behaviour of any entity in the system in an uncontrolled manner.
Accordingly, the operations of such an upgrade can be performed at any time
while the system is running and all the entities are online. No entity needs
to be taken offline to avoid such adverse effects. Hence such upgrade operations
are referred to as online operations. The effects of the upgrade might be
activated the next time the entity is used, or may require a special activation
action such as a restart. Note that the activation action provides more control
and predictability.
If an entity's behaviour in the system may change due to the upgrade, it may
be better to take it offline for the time of the relevant upgrade operations.
The main question, however, is which hosted entities are impacted, considering
the hosting relation of an upgrade object. Accordingly, we can identify a scope
which is impacted by taking the given upgrade object offline. The entities
that are in the scope of impact may need to be taken offline or moved out of
this scope, i.e. migrated.
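The scope of impact can be derived by following the hosting relation downwards from the upgrade object. The sketch below assumes the hosting relation is available as a simple mapping from each entity to the entities it directly hosts. The function ``impact_scope`` and the entity names are hypothetical.

```python
def impact_scope(hosts, upgrade_object):
    """Return all entities hosted, directly or indirectly, by upgrade_object.

    `hosts` is an assumed representation of the hosting relation: a dict
    mapping an entity name to the names of the entities it directly hosts.
    """
    scope = set()
    pending = list(hosts.get(upgrade_object, ()))
    while pending:
        entity = pending.pop()
        if entity in scope:
            continue  # already visited, also guards against cycles
        scope.add(entity)
        pending.extend(hosts.get(entity, ()))
    return scope


if __name__ == "__main__":
    hosting = {
        "compute-1": ["hypervisor-1"],
        "hypervisor-1": ["vm-a", "vm-b"],
        "vm-a": ["vnf-x"],
    }
    # Everything printed here must be taken offline or migrated before
    # hypervisor-1 itself is taken offline.
    print(sorted(impact_scope(hosting, "hypervisor-1")))
```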
If the impacted entity is in a different layer managed by another manager,
this may require coordination, because it means taking out of service, for
the time of their upgrade, infrastructure resources that support virtual
resources used by VNFs that should not experience outages. The hosted VNFs
may or may not allow for the hot migration of their VMs. In case of migration,
the VMs' placement policy should be considered.