doc/02-Background_and_Terminologies.rst

General Requirements Background and Terminology
-----------------------------------------------

Terminologies and definitions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-  **NFVI** is the abbreviation for Network Function Virtualization
   Infrastructure; it is sometimes also referred to as the data plane in
   this document.
-  **VIM** is the abbreviation for Virtual Infrastructure Management;
   it is sometimes also referred to as the control plane in this
   document.
-  **Operators** are network service providers and Virtual Network
   Function (VNF) providers.
-  **End-Users** are subscribers of an Operator's services.
-  **Network Service** is a service provided by an Operator to its
   End-Users using a set of (virtualized) Network Functions.
-  **Infrastructure Services** are those provided by the NFV
   Infrastructure and the Management & Orchestration functions to the
   VNFs, i.e. these are the virtual resources as perceived by the VNFs.
-  **Smooth Upgrade** means that the upgrade results in no service
   outage for the End-Users.
-  **Rolling Upgrade** is an upgrade strategy that upgrades each node or
   a subset of nodes in a wave rolling through the data centre. It is a
   popular upgrade strategy for maintaining service availability.
-  **Parallel Universe** is an upgrade strategy that creates and deploys
   a new universe - a system with the new configuration - while the old
   system continues running. The state of the old system is transferred
   to the new system after sufficient testing of the latter.
-  **Infrastructure Resource Model** ==(suggested by MT)== is identified
   as: physical resources, virtualization facility resources and virtual
   resources.
-  **Physical Resources** are the hardware of the infrastructure; they
   may also include the firmware that enables the hardware.
-  **Virtual Resources** are resources provided as services built on top
   of the physical resources via the virtualization facilities; in our
   case, they are the components that VNF entities are built on, e.g.
   the VMs, virtual switches, virtual routers, virtual disks, etc.
   ==[MT] I don't think the VNF is the virtual resource. Virtual
   resources are the VMs, virtual switches, virtual routers, virtual
   disks etc. The VNF uses them, but I don't think they are equal. The
   VIM doesn't manage the VNF, but it does manage virtual resources.==
-  **Virtualization Facilities** are resources that enable the creation
   of virtual environments on top of the physical resources, e.g.
   hypervisor, OpenStack, etc.

Upgrade Objects
~~~~~~~~~~~~~~~

Physical Resource
^^^^^^^^^^^^^^^^^

| Most cloud infrastructures support dynamic addition/removal of
  hardware. A hardware upgrade could be done by removing the old
  hardware node and adding the new one. Upgrading a physical resource,
  such as upgrading the firmware and modifying the configuration data,
  may be considered in the future.

Virtual Resources
^^^^^^^^^^^^^^^^^

| Virtual resource upgrades are mainly done by users. OPNFV may
  facilitate the activity, but it is suggested to place this on the
  long-term roadmap rather than in the initial release.
| ==[MT] same comment here: I don't think the VNF is the virtual
  resource. Virtual resources are the VMs, virtual switches, virtual
  routers, virtual disks etc. The VNF uses them, but I don't think they
  are equal. For example, if for some reason the hypervisor is changed
  and the current VMs cannot be migrated to the new hypervisor because
  they are incompatible, then the VMs need to be upgraded too. This is
  not something the NFVI user (i.e. the VNFs) would even know about.==

Virtualization Facility Resources
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

| Based on the functionality they provide, virtualization facility
  resources can be divided into computing nodes, networking nodes,
  storage nodes and management nodes.
| The possible upgrade objects in these nodes are addressed below.
  (Note: hardware-based virtualization may be considered a
  virtualization facility resource, but from the Escalator perspective
  it is better to consider it part of the hardware upgrade.)

**Computing node**

1. OS kernel
2. Hypervisor and virtual switch
3. Other kernel modules, like drivers
4. User space software packages, like nova-compute agents and other
   control plane programs

| Updating 1 and 2 will cause the loss of the virtualization
  functionality of the compute node, which may lead to data plane
  service interruption if the virtual resource is not redundant.
| Updating 3 might result in the same.
| Updating 4 might lead to control plane service interruption if it is
  not an HA deployment.

**Networking node**

1. OS kernel (optional): not all switches/routers allow their OS to be
   upgraded, since it is more like firmware than a generic OS.
2. User space software packages, like neutron agents and other control
   plane programs

| Updating 1, if allowed, will cause a node reboot and therefore lead to
  data plane service interruption if the virtual resource is not
  redundant.
| Updating 2 might lead to control plane service interruption if it is
  not an HA deployment.

**Storage node**

1. OS kernel (optional): not all storage nodes allow their OS to be
   upgraded, since it is more like firmware than a generic OS.
2. Kernel modules
3. User space software packages, control plane programs

| Updating 1, if allowed, will cause a node reboot and therefore lead to
  data plane service interruption if the virtual resource is not
  redundant.
| Updating 2 might result in the same.
| Updating 3 might lead to control plane service interruption if it is
  not an HA deployment.

**Management node**

1. OS kernel
2. Kernel modules, like drivers
3. User space software packages, like database, message queue and
   control plane programs

| Updating 1 will cause a node reboot and therefore lead to control
  plane service interruption if it is not an HA deployment. Updating 2
  might result in the same.
| Updating 3 might lead to control plane service interruption if it is
  not an HA deployment.
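
The per-node lists above can be read as a lookup from a node type and an
upgrade object to the plane whose services may be interrupted. The
following is only a minimal sketch of that reading; the table name, the
key strings and the function ``possible_interruption`` are hypothetical
and do not come from OPNFV or OpenStack.

```python
# Hypothetical lookup table: (node type, upgrade object) -> plane at risk.
# The keys are illustrative names, not identifiers from OPNFV or OpenStack.
PLANE_AT_RISK = {
    ("computing", "os_kernel"): "data",
    ("computing", "hypervisor_and_vswitch"): "data",
    ("computing", "kernel_modules"): "data",
    ("computing", "user_space"): "control",
    ("networking", "os_kernel"): "data",
    ("networking", "user_space"): "control",
    ("storage", "os_kernel"): "data",
    ("storage", "kernel_modules"): "data",
    ("storage", "user_space"): "control",
    ("management", "os_kernel"): "control",
    ("management", "kernel_modules"): "control",
    ("management", "user_space"): "control",
}


def possible_interruption(node, upgrade_object, redundant, ha):
    """Return the plane that may be interrupted, or None if it is protected.

    redundant: the virtual resources on the node are redundant.
    ha: the control plane is deployed in high-availability mode.
    """
    plane = PLANE_AT_RISK[(node, upgrade_object)]
    # Data plane services are protected by redundancy,
    # control plane services by an HA deployment.
    protected = redundant if plane == "data" else ha
    return None if protected else plane
```

For example, ``possible_interruption("computing", "os_kernel",
redundant=False, ha=True)`` reports the data plane, while any management
node object is reported as protected once ``ha`` is true. The sketch does
not distinguish "will" from "might"; it only records which plane is
exposed.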

Upgrade Span
~~~~~~~~~~~~

| **Major Upgrade**
| Upgrades between major releases may introduce significant changes in
  function, configuration and data, such as the upgrade of OPNFV from
  Arno to Brahmaputra.

| **Minor Upgrade**
| Upgrades within one major release, which do not change the structure
  of the platform and may not affect the schema of the system data.

Upgrade Granularity
~~~~~~~~~~~~~~~~~~~

Physical/Hardware Dimension
^^^^^^^^^^^^^^^^^^^^^^^^^^^

Full and partial upgrades should be supported for a data centre, a
cluster or a zone. The upgrade of a data centre or a zone may be
divided into several batches. The upgrade of a cloud environment
(cluster) may also be partial. For example, in a cloud environment
running a number of VNFs, we may first try the upgrade on just one of
them to check the stability and performance, before we upgrade all of
them.
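
Dividing an upgrade into batches and checking each batch before moving
on could be sketched as follows. This is an illustration only: the
functions ``batches`` and ``rolling_upgrade`` and the ``upgrade`` and
``healthy`` callbacks are hypothetical names, and the rule "stop at the
first unhealthy batch" is an assumption, not a stated requirement.

```python
def batches(nodes, batch_size):
    """Split an ordered list of nodes into consecutive upgrade batches."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [nodes[i:i + batch_size] for i in range(0, len(nodes), batch_size)]


def rolling_upgrade(nodes, batch_size, upgrade, healthy):
    """Upgrade batch by batch and stop at the first unhealthy batch.

    upgrade(node) performs the upgrade of one node (caller-supplied).
    healthy(node) returns True if the node passes its post-upgrade check.
    Returns the nodes of all batches that were upgraded and found healthy.
    """
    done = []
    for batch in batches(nodes, batch_size):
        for node in batch:
            upgrade(node)
        if not all(healthy(node) for node in batch):
            break  # leave the remaining batches on the old version
        done.extend(batch)
    return done
```

With ``batch_size=1`` the first batch acts as the single trial mentioned
above: only if it proves stable does the wave continue through the rest
of the nodes.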

Software Dimension
^^^^^^^^^^^^^^^^^^

-  The upgrade of the host OS or kernel may need a 'hot migration'.
-  The upgrade of OpenStack’s components:

   i. the one-shot upgrade of all components
   ii. the partial upgrade (or bugfix patch), which only affects some
       components (e.g., computing, storage, network, database, message
       queue, etc.)

| ==[MT] this section seems to overlap with 2.1.
| I can see the following dimensions for the software:

-  different software packages

   -  different functions - Considering that the target versions of all
      software are compatible, the upgrade needs to ensure that any
      dependencies between SW, and therefore packages, are taken into
      account in the upgrade plan, i.e. no version mismatch occurs
      during the upgrade and therefore dependencies are not broken.
   -  same function - This is an upgrade-specific question: whether
      different versions can coexist in the system while a SW is being
      upgraded from one version to another. This is particularly
      important for stateful functions, e.g. storage, networking,
      control services. The upgrade method must consider the
      compatibility of the redundant entities.

-  different versions of the same software package

   -  major version changes - they may introduce incompatibilities. Even
      when there are backward compatibility requirements, changes may
      cause issues at graceful rollback.
   -  minor version changes - they must not introduce incompatibility
      between versions; these should be primarily bug fixes, so live
      patches should be possible.

-  different installations of the same software package

   -  using different installation options - they may reflect different
      users with different needs, so redundancy issues are less likely
      between installations of different options; but they could be the
      reflection of a heterogeneous system, in which case they may
      provide redundancy for higher availability, i.e. deeper
      inspection is needed.
   -  using the same installation options - they often reflect that they
      are used by redundant entities across space.

-  different distribution possibilities in space - same or different
   availability zones, multi-site, geo-redundancy

-  different entities running from the same installation of a software
   package

   -  using different startup options - they may reflect different
      users, so redundancy may not be an issue between them.
   -  using same startup options - they often reflect redundant
      entities.==
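
The point about taking dependencies between packages into account in the
upgrade plan can be illustrated with a topological ordering. The sketch
below uses the standard library module ``graphlib`` (Python 3.9 or
later). The package names are made up, and upgrading a dependency before
the packages that require it is an assumption made for the example; a
real plan may need the opposite order or finer rules.

```python
from graphlib import CycleError, TopologicalSorter


def upgrade_order(depends_on):
    """Order packages so that each one follows the packages it depends on.

    depends_on maps a package name to the set of packages it requires.
    Raises graphlib.CycleError if the dependencies are circular.
    """
    return list(TopologicalSorter(depends_on).static_order())


# Hypothetical packages, not taken from a real deployment.
deps = {
    "control-service": {"database", "message-queue"},
    "compute-agent": {"message-queue"},
    "database": set(),
    "message-queue": set(),
}
order = upgrade_order(deps)
```

In the resulting ``order``, ``database`` and ``message-queue`` always
appear before ``control-service``. A circular dependency is reported as
a ``CycleError`` instead of producing a plan that would break a
dependency during the upgrade.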

Upgrade duration
~~~~~~~~~~~~~~~~

As the OPNFV end-users are primarily Telco operators, the network
services provided by the VNFs deployed on the NFVI should meet the
requirement of 'Carrier Grade'.

In telecommunication, a "carrier grade" or "carrier class" refers to a
system, or a hardware or software component that is extremely reliable,
well tested and proven in its capabilities. Carrier grade systems are
tested and engineered to meet or exceed "five nines" high availability
standards, and provide very fast fault recovery through redundancy
(normally less than 50 milliseconds). [from wikipedia.org]

"Five nines" (99.999% availability) means that the system works all
year round except for about 5 minutes 15 seconds (5'15").
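
The 5'15" figure follows directly from the availability percentage. As
a small worked check (the function name ``downtime_budget_seconds`` is
only illustrative, and a 365-day year is assumed):

```python
def downtime_budget_seconds(nines, days_per_year=365):
    """Allowed downtime per year, in seconds, for an availability of N nines."""
    unavailability = 10 ** -nines  # five nines -> 0.00001
    return days_per_year * 24 * 60 * 60 * unavailability


minutes, seconds = divmod(downtime_budget_seconds(5), 60)
print(f"{int(minutes)}'{seconds:.0f}\"")  # prints 5'15"
```

A year of 365 days has 31,536,000 seconds, so an unavailability of
0.001% leaves about 315 seconds, i.e. 5 minutes and 15 seconds.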

We have learnt that a well prepared upgrade of OpenStack needs 10
minutes. Most of the outage time is spent on synchronizing the
database. [from 'Ten minutes OpenStack Upgrade? Done!' by Symantec]

These 10 minutes of OpenStack downtime, however, did not impact the
users, i.e. the VMs running on the compute nodes. This was an outage of
the control plane only. On the other hand, with respect to the
preparations, this was a manually tailored upgrade specific to the
particular deployment and the versions of each OpenStack service.

The project aims to achieve a more generic methodology, which however
requires that the upgrade objects fulfill certain requirements. Since
this is only possible in the long run, we first target upgrades from
version to version for the different VIM services.

**Questions:**

#. | Can we manage to upgrade OPNFV in only 5 minutes?
   | ==[MT] The first question is whether we have the same carrier grade
     requirement on the control plane as on the user plane, i.e. how
     much control plane outage we can/are willing to tolerate.
   | In the above case, if the database is only half the size, we can
     probably do the upgrade in 5 minutes, but is that good? It also
     means that if the database is twice as large, then the outage is
     20 minutes.
   | For the user plane we should go for less, as with two releases
     yearly that means 10 minutes of outage per year.==
   | ==[Malla] 10 minutes of outage per year to the users? Plus, if we
     take the control plane into consideration, then the total outage
     will be more than 10 minutes in the whole network, right?==
   | ==[MT] The control plane outage does not have to cause an outage to
     the users, but it may of course, depending on the size of the
     system, as it's more likely that there's a failure that needs to be
     handled by the control plane.==

#. | Is it acceptable for end users, for example, a planned service
     interruption lasting more than ten minutes for a software upgrade?
   | ==[MT] For the user plane, no, it's not acceptable in the case of
     carrier-grade. The 5'15" downtime should include unplanned and
     planned downtimes.==
   | ==[Malla] I do agree with Maria, it is not acceptable.==

#. | Will the VNFs still work well when the VIM is down?
   | ==[MT] In case of OpenStack it seems yes. :)==
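
The arithmetic behind the first question can be written out as follows.
This is only a sketch of the rough estimate made in the comments: it
assumes that the outage scales linearly with the database size, starting
from the 10-minute reference upgrade, and all names are illustrative.

```python
# 99.999% availability over a 365-day year, expressed in minutes.
FIVE_NINES_BUDGET_MINUTES = 365 * 24 * 60 * 1e-5  # about 5.256


def outage_per_upgrade_minutes(db_size_ratio, reference_minutes=10.0):
    """Outage of one upgrade, assuming it is proportional to database size.

    db_size_ratio is the database size relative to the reference
    deployment (0.5 = half the size, 2.0 = twice the size).
    """
    return reference_minutes * db_size_ratio


def yearly_outage_minutes(per_upgrade_minutes, releases_per_year=2):
    """Total planned outage per year for the given number of releases."""
    return per_upgrade_minutes * releases_per_year


half = outage_per_upgrade_minutes(0.5)    # 5 minutes
double = outage_per_upgrade_minutes(2.0)  # 20 minutes
per_year = yearly_outage_minutes(half)    # 10 minutes with two releases
exceeds_budget = per_year > FIVE_NINES_BUDGET_MINUTES
```

Under these assumptions even the 5-minute upgrade, applied twice a year,
adds up to 10 minutes and exceeds the five-nines budget on its own,
which is why the comments argue for a much shorter user plane outage.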

The maximum duration of an upgrade
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

| The duration of an upgrade is related to and proportional to the
  scale and the complexity of the OPNFV platform as well as the
  granularity (in function and in space) of the upgrade.
| [Malla] Also, if it is a partial upgrade, like a module upgrade, it
  also depends on the OPNFV modules and the entities tightly connected
  to them.

The maximum duration of a roll back when an upgrade has failed
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

| The duration of a roll back is shorter than that of the corresponding
  upgrade. It depends on the time needed to restore the software and
  configuration data from the pre-upgrade backup / snapshot.
| ==[MT] During the upgrade process two types of failure may happen:
|  In case we can recover from the failure by undoing the upgrade
  actions, it is possible to roll back the already executed part of the
  upgrade in a graceful manner, introducing no more service outage than
  what was introduced during the upgrade. Such a graceful roll back
  typically requires the same amount of time as the executed portion of
  the upgrade and imposes minimal state/data loss.==
| ==[MT] Requirement: It should be possible to roll back gracefully the
  failed upgrade of stateful services of the control plane.
|  In case we cannot recover from the failure by just undoing the
  upgrade actions, we have to restore the upgraded entities from their
  backed-up state. In other terms, the system falls back to an earlier
  state, which is typically a faster recovery procedure than a graceful
  roll back and, depending on the statefulness of the entities involved,
  it may result in significant state/data loss.==
| **Two possible types of failures can happen during an upgrade**

#. We can recover from the failure that occurred in the upgrade process:
   In this case, a graceful roll back of the executed part of the
   upgrade may be possible, which would "undo" the executed part in a
   similar fashion. Thus, such a roll back introduces no more service
   outage than the executed part of the upgrade introduced. This
   process typically requires the same amount of time as the executed
   portion of the upgrade and imposes minimal state/data loss.
#. We cannot recover from the failure that occurred in the upgrade
   process: In this case, the system needs to fall back to an earlier
   consistent state by reloading this backed-up state. This is typically
   a faster recovery procedure than the graceful roll back, but can cause
   state/data loss. The state/data loss usually depends on the
   statefulness of the entities whose state is restored from the backup.
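
The two failure types above can be sketched as two recovery paths. The
code is a simplified illustration, not a description of an existing
tool: the exception class ``UnrecoverableError``, the function
``run_upgrade`` and the representation of a step as an ``(apply, undo)``
pair are all hypothetical, and each step is assumed to be atomic, so a
step that fails leaves nothing behind to undo.

```python
class UnrecoverableError(Exception):
    """A failure that cannot be repaired by undoing the executed steps."""


def run_upgrade(steps, restore_backup):
    """Run (apply, undo) step pairs and report how the run ended.

    Returns "upgraded", "rolled_back" (graceful roll back of the executed
    steps) or "restored" (fall back to the pre-upgrade backup).
    """
    undo_stack = []
    try:
        for apply, undo in steps:
            apply()
            undo_stack.append(undo)  # only completed steps are undone
    except UnrecoverableError:
        restore_backup()  # faster, but may lose state/data
        return "restored"
    except Exception:
        for undo in reversed(undo_stack):
            undo()  # undo the executed part in reverse order
        return "rolled_back"
    return "upgraded"
```

The graceful path replays one undo action per executed step, which
matches the statement that it takes about as long as the executed part
of the upgrade. The restore path is a single action whose cost depends
on the backup rather than on how far the upgrade had progressed.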

The maximum duration of a VNF interruption (Service outage)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

| Since not the entire process of a smooth upgrade will affect the VNFs,
  the duration of the VNF interruption may be shorter than the duration
  of the upgrade. In some cases, a VNF running without control from the
  VIM is acceptable.
| ==[MT] Should we require explicitly that the NFVI should be able to
  provide its services to the VNFs independent of the control plane?==
| ==[MT] Requirement: The upgrade of the control plane must not cause
  interruption of the NFVI services provided to the VNFs.==
| ==[MT] With respect to carrier-grade, the yearly service outage of the
  VNF should not exceed 5'15" regardless of whether it is a planned or
  unplanned outage. Considering the HA requirements, TL-9000 requires an
  end-to-end service recovery time of 15 seconds, based on which the
  ETSI GS NFV-REL 001 V1.1.1 (2015-01) document defines three service
  availability levels (SAL). The proposed example service recovery times
  for these levels are:
| SAL1: 5-6 seconds
| SAL2: 10-15 seconds
| SAL3: 20-25 seconds==
| ==[Pva] My comment was actually that the downtime metrics of the
  underlying elements, components and services are a small fraction of
  the total E2E service availability time. No one on the E2E service
  path will get the whole downtime allocation (in this context it
  includes upgrade process related outages for the services provided by
  the VIM etc. elements that are subject to the upgrade process).==
| ==[MT] So what you are saying is that the upgrade of any entity
  (component, service) shouldn't cause even this much service
  interruption. This was the reason I brought these figures here as
  well: they pose some kind of upper-upper boundary. Ideally the
  interruption is in the millisecond range, i.e. no more than a
  switchover or a live migration.==
| ==[MT] Requirement: Any interruption caused to the VNF by the upgrade
  of the NFVI should be in the sub-second range.==

==[MT] In the future we also need to consider the upgrade of the NFVI,
i.e. HW, firmware, hypervisors, host OS etc.==