-General Requirements Background and Terminology
------------------------------------------------
-
-Terminologies and definitions
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-- **NFVI** is abbreviation for Network Function Virtualization
- Infrastructure; sometimes it is also referred as data plane in this
- document.
-- **VIM** is abbreviation for Virtual Infrastructure Management;
- sometimes it is also referred as control plane in this document.
-- **Operators** are network service providers and Virtual Network
- Function (VNF) providers.
-- **End-Users** are subscribers of Operator's services.
-- **Network Service** is a service provided by an Operator to its
- End-users using a set of (virtualized) Network Functions
-- **Infrastructure Services** are those provided by the NFV
- Infrastructure and the Management & Orchestration functions to the
- VNFs. I.e. these are the virtual resources as perceived by the VNFs.
-- **Smooth Upgrade** means that the upgrade results in no service
- outage for the end-users.
-- **Rolling Upgrade** is an upgrade strategy that upgrades each node or
- a subset of nodes in a wave rolling style through the data centre. It
- is a popular upgrade strategy to maintains service availability.
-- **Parallel Universe** is an upgrade strategy that creates and deploys
- a new universe - a system with the new configuration - while the old
- system continues running. The state of the old system is transferred
- to the new system after sufficient testing of the later.
-- **Infrastructure Resource Model** ==(suggested by MT)== is identified
- as: physical resources, virtualization facility resources and virtual
- resources.
-- **Physical Resources** are the hardware of the infrastructure, may
- also includes the firmware that enable the hardware.
-- **Virtual Resources** are resources provided as services built on top
- of the physical resources via the virtualization facilities; in our
- case, they are the components that VNF entities are built on, e.g.
- the VMs, virtual switches, virtual routers, virtual disks etc
- ==[MT] I don't think the VNF is the virtual resource. Virtual
- resources are the VMs, virtual switches, virtual routers, virtual
- disks etc. The VNF uses them, but I don't think they are equal. The
- VIM doesn't manage the VNF, but it does manage virtual resources.==
-- **Visualization Facilities** are resources that enable the creation
- of virtual environments on top of the physical resources, e.g.
- hypervisor, OpenStack, etc.
-
-Upgrade Objects
-~~~~~~~~~~~~~~~
-
-Physical Resource
-^^^^^^^^^^^^^^^^^
-
-| Most of the cloud infrastructures support dynamic addition/removal of
- hardware. A hardware upgrade could be done by removing the old
- hardware node and adding the new one. Upgrade a physical resource,
- like upgrade the firmware and modify the configuration data, may
- be considered in the future.
-
-Virtual Resources
-^^^^^^^^^^^^^^^^^
-
-| Virtual resource upgrade mainly done by users. OPNFV may facilitate
- the activity, but suggest to have it in long term roadmap instead of
- initiate release.
-| ==[MT] same comment here: I don't think the VNF is the virtual
- resource. Virtual resources are the VMs, virtual switches, virtual
- routers, virtual disks etc. The VNF uses them, but I don't think they
- are equal. For example if by some reason the hypervisor is changed and
- the current VMs cannot be migrated to the new hypervisor, they are
- incompatible, then the VMs need to be upgraded too. This is not
- something the NFVI user (i.e. VNFs ) would even know about.==
-
-Virtualization Facility Resources
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-| Based on the functionality they provide, virtualization facility
- resources could be divided into computing node, networking node,
- storage node and management node.
-| The possible upgrade objects in these nodes are addressed below:
- (Note: hardware based virtualization may considered as virtualization
- facility resource, but from escalator perspective, it is better
- considered it as part of hardware upgrade. )
-
-**Computing node**
-
-1. OS Kernel
-2. Hypvervisor and virtual switch
-3. Other kernel modules, like driver
-4. User space software packages, like nova-compute agents and other
- control plane programs
-
-| Updating 1 and 2 will cause the loss of virtualzation functionality of
- the compute node, which may lead to data plane services interruption
- if the virtual resource is not redudant.
-| Updating 3 might result the same.
-| Updating 4 might lead to control plane services interruption if not an
- HA deployment.
-
-**Networking node**
-
-1. OS kernel, optional, not all switch/router allow you to upgrade its
- OS since it is more like a firmware than a generic OS.
-2. User space software package, like neutron agents and other control
- plane programs
-
-| Updating 1 if allowed will cause a node reboot and therefore leads to
- data plane services interruption if the virtual resource is not
- redudant.
-| Updating 2 might lead to control plane services interruption if not an
- HA deployment.
-
-**Storage node**
-
-1. OS kernel, optional, not all storage node allow you to upgrade its OS
- since it is more like a firmware than a generic OS.
-2. Kernel modules
-3. User space software packages, control plane programs
-
-| Updating 1 if allowed will cause a node reboot and therefore leads to
- data plane services interruption if the virtual resource is not
- redudant.
-| Update 2 might result in the same.
-| Updating 3 might lead to control plane services interruption if not an
- HA deployment.
-
-**Management node**
-
-1. OS Kernel
-2. Kernel modules, like driver
-3. User space software packages, like database, message queue and
- control plane programs.
-
-| Updating 1 will cause a node reboot and therefore leads to control
- plane services interruption if not an HA deployment. Updating 2 might
- result in the same.
-| Updating 3 might lead to control plane services interruption if not an
- HA deployment.
-
-Upgrade Span
-~~~~~~~~~~~~
-
-| **Major Upgrade**
-| Upgrades between major releases may introducing significant changes in
- function, configuration and data, such as the upgrade of OPNFV from
- Arno to Brahmaputra.
-
-| **Minor Upgrade**
-| Upgrades inside one major releases which would not leads to changing
- the structure of the platform and may not infect the schema of the
- system data.
-
-Upgrade Granularity
-~~~~~~~~~~~~~~~~~~~
-
-Physical/Hardware Dimension
-^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-Support full / partial upgrade for data centre, cluster, zone. Because
-of the upgrade of a data centre or a zone, it may be divided into
-several batches. The upgrade of a cloud environment (cluster) may also
-be partial. For example, in one cloud environment running a number of
-VNFs, we may just try one of them to check the stability and
-performance, before we upgrade all of them.
-
-Software Dimension
-^^^^^^^^^^^^^^^^^^
-
-- The upgrade of host OS or kernel may need a 'hot migration'
-- The upgrade of OpenStack’s components
- i.the one-shot upgrade of all components
- ii.the partial upgrade (or bugfix patch) which only affects some
- components (e.g., computing, storage, network, database, message
- queue, etc.)
-
-| ==[MT] this section seems to overlap with 2.1.==
-| I can see the following dimensions for the software
-
-- different software packages
-- different funtions - Considering that the target versions of all
- software are compatible the upgrade needs to ensure that any
- dependencies between SW and therefore packages are taken into account
- in the upgrade plan, i.e. no version mismatch occurs during the
- upgrade therefore dependencies are not broken
-- same function - This is an upgrade specific question if different
- versions can coexist in the system when a SW is being upgraded from
- one version to another. This is particularly important for stateful
- functions e.g. storage, networking, control services. The upgrade
- method must consider the compatibility of the redundant entities.
-
-- different versions of the same software package
-- major version changes - they may introduce incompatibilities. Even
- when there are backward compatibility requirements changes may cause
- issues at graceful rollback
-- minor version changes - they must not introduce incompatibility
- between versions, these should be primarily bug fixes, so live
- patches should be possible
-
-- different installations of the same software package
-- using different installation options - they may reflect different
- users with different needs so redundancy issues are less likely
- between installations of different options; but they could be the
- reflection of the heterogeneous system in which case they may provide
- redundancy for higher availability, i.e. deeper inspection is needed
-- using the same installation options - they often reflect that the are
- used by redundant entities across space
-
-- different distribution possibilities in space - same or different
- availability zones, multi-site, geo-redundancy
-
-- different entities running from the same installation of a software
- package
-- using different startup options - they may reflect different users so
- redundancy may not be an issues between them
-- using same startup options - they often reflect redundant
- entities====
-
-Upgrade duration
-~~~~~~~~~~~~~~~~
-
-As the OPNFV end-users are primarily Telco operators, the network
-services provided by the VNFs deployed on the NFVI should meet the
-requirement of 'Carrier Grade'.
-
-In telecommunication, a "carrier grade" or"carrier class" refers to a
-system, or a hardware or software component that is extremely reliable,
-well tested and proven in its capabilities. Carrier grade systems are
-tested and engineered to meet or exceed "five nines" high availability
-standards, and provide very fast fault recovery through redundancy
-(normally less than 50 milliseconds). [from wikipedia.org]
-
-"five nines" means working all the time in ONE YEAR except 5'15".
-
-We have learnt that a well prepared upgrade of OpenStack needs 10
-minutes. The major time slot in the outage time is used spent on
-synchronizing the database. [from ' Ten minutes OpenStack Upgrade? Done!
-' by Symantec]
-
-This 10 minutes of downtime of OpenStack however did not impact the
-users, i.e. the VMs running on the compute nodes. This was the outage of
-the control plane only. On the other hand with respect to the
-preparations this was a manually tailored upgrade specific to the
-particular deployment and the versions of each OpenStack service.
-
-The project targets to achieve a more generic methodology, which however
-requires that the upgrade objects fulfill ceratin requirements. Since
-this is only possible on the long run we target first upgrades from
-version to version for the different VIM services.
-
-**Questions:**
-
-#. | Can we manage to upgrade OPNFV in only 5 minutes?
- | ==[MT] The first question is whether we have the same carrier grade
- requirement on the control plane as on the user plane. I.e. how
- much control plane outage we can/willing to tolerate?
- | In the above case probably if the database is only half of the size
- we can do the upgrade in 5 minutes, but is that good? It also means
- that if the database is twice as much then the outage is 20
- minutes.
- | For the user plane we should go for less as with two release yearly
- that means 10 minutes outage per year.==
- | ==[Malla] 10 minutes outage per year to the users? Plus, if we take
- control plane into the consideration, then total outage will be
- more than 10 minute in whole network, right?==
- | ==[MT] The control plane outage does not have to cause outage to
- the users, but it may of course depending on the size of the system
- as it's more likely that there's a failure that needs to be handled
- by the control plane.==
-
-#. | Is it acceptable for end users ? Such as a planed service
- interruption will lasting more than ten minutes for software
- upgrade.
- | ==[MT] For user plane, no it's not acceptable in case of
- carrier-grade. The 5' 15" downtime should include unplanned and
- planned downtimes.==
- | ==[Malla] I go agree with Maria, it is not acceptable.==
-
-#. | Will any VNFs still working well when VIM is down?
- | ==[MT] In case of OpenStack it seems yes. .:)==
-
-The maximum duration of an upgrade
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-| The duration of an upgrade is related to and proportional with the
- scale and the complexity of the OPNFV platform as well as the
- granularity (in function and in space) of the upgrade.
-| [Malla] Also, if is a partial upgrade like module upgrade, it depends
- also on the OPNFV modules and their tight connection entities as well.
-
-The maximum duration of a roll back when an upgrade is failed
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-| The duration of a roll back is short than the corresponding upgrade. It
- depends on the duration of restore the software and configure data from
- pre-upgrade backup / snapshot.
-| ==[MT] During the upgrade process two types of failure may happen:
-| In case we can recover from the failure by undoing the upgrade
- actions it is possible to roll back the already executed part of the
- upgrade in graceful manner introducing no more service outage than
- what was introduced during the upgrade. Such a graceful roll back
- requires typically the same amount of time as the executed portion of
- the upgrade and impose minimal state/data loss.==
-| ==[MT] Requirement: It should be possible to roll back gracefully the
- failed upgrade of stateful services of the control plane.
-| In case we cannot recover from the failure by just undoing the
- upgrade actions, we have to restore the upgraded entities from their
- backed up state. In other terms the system falls back to an earlier
- state, which is typically a faster recovery procedure than graceful
- roll back and depending on the statefulness of the entities involved it
- may result in significant state/data loss.==
-| **Two possible types of failures can happen during an upgrade**
-
-#. We can recover from the failure that occurred in the upgrade process:
- In this case, a graceful rolling back of the executed part of the
- upgrade may be possible which would "undo" the executed part in a
- similar fashion. Thus, such a roll back introduces no more service
- outage during an upgrade than the executed part introduced. This
- process typically requires the same amount of time as the executed
- portion of the upgrade and impose minimal state/data loss.
-#. We cannot recover from the failure that occurred in the upgrade
- process: In this case, the system needs to fall back to an earlier
- consistent state by reloading this backed-up state. This is typically
- a faster recovery procedure than the graceful roll back, but can cause
- state/data loss. The state/data loss usually depends on the
- statefulness of the entities whose state is restored from the backup.
-
-The maximum duration of a VNF interruption (Service outage)
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-| Since not the entire process of a smooth upgrade will affect the VNFs,
- the duration of the VNF interruption may be shorter than the duration
- of the upgrade. In some cases, the VNF running without the control
- from of the VIM is acceptable.
-| ==[MT] Should require explicitly that the NFVI should be able to
- provide its services to the VNFs independent of the control plane?==
-| ==[MT] Requirement: The upgrade of the control plane must not cause
- interruption of the NFVI services provided to the VNFs.==
-| ==[MT] With respect to carrier-grade the yearly service outage of the
- VNF should not exceed 5' 15" regardless whether it is planned or
- unplanned outage. Considering the HA requirements TL-9000 requires an
- ent-to-end service recovery time of 15 seconds based on which the ETSI
- GS NFV-REL 001 V1.1.1 (2015-01) document defines three service
- availability levels (SAL). The proposed example service recovery times
- for these levels are:
-| SAL1: 5-6 seconds
-| SAL2: 10-15 seconds
-| SAL3: 20-25 seconds==
-| ==[Pva] my comment was actually that the downtime metrics of the
- underlying elements, components and services are small fraction of the
- total E2E service availability time. No-one on the E2E service path
- will get the whole downtime allocation (in this context it includes
- upgrade process related outages for the services provided by VIM etc.
- elements that are subject to upgrade process).==
-| ==[MT] So what you are saying is that the upgrade of any entity
- (component, service) shouldn't cause even this much service
- interruption. This was the reason I brought these figures here as well
- that they are posing some kind of upper-upper boundary. Ideally the
- interruption is in the millisecond range i.e. no more than a
- switchover or a live migration.==
-| ==[MT] Requirement: Any interruption caused to the VNF by the upgrade
- of the NFVI should be in the sub-second range.==
-
-==[MT] In the future we also need to consider the upgrade of the NFVI,
-i.e. HW, firmware, hypervisors, host OS etc.==
\ No newline at end of file
+General Requirements Background and Terminology\r
+-----------------------------------------------\r
+\r
+Terminologies and definitions\r
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\r
+\r
+NFVI\r
+ The term is an abbreviation for Network Function Virtualization\r
+ Infrastructure; sometimes it is also referred to as the data plane in\r
+ this document.\r
+\r
+VIM\r
+ The term is an abbreviation for Virtual Infrastructure Management;\r
+ sometimes it is also referred to as the control plane in this document.\r
+\r
+Operator\r
+ The term refers to network service providers and Virtual Network\r
+ Function (VNF) providers.\r
+\r
+End-User\r
+ The term refers to a subscriber of the Operator's services.\r
+\r
+Network Service\r
+ The term refers to a service provided by an Operator to its\r
+ End-Users using a set of (virtualized) Network Functions.\r
+\r
+Infrastructure Services\r
+ The term refers to services provided by the NFV Infrastructure and\r
+ the Management & Orchestration functions to the VNFs, i.e.\r
+ these are the virtual resources as perceived by the VNFs.\r
+\r
+Smooth Upgrade\r
+ The term refers to an upgrade that results in no service outage\r
+ for the end-users.\r
+\r
+Rolling Upgrade\r
+ The term refers to an upgrade strategy that upgrades each node, or\r
+ a subset of nodes, in a wave rolling through the data centre. It\r
+ is a popular upgrade strategy for maintaining service availability.\r
+\r
+Parallel Universe Upgrade\r
+ The term refers to an upgrade strategy that creates and deploys\r
+ a new universe - a system with the new configuration - while the old\r
+ system continues running. The state of the old system is transferred\r
+ to the new system after sufficient testing of the new system.\r
+\r
+Infrastructure Resource Model\r
+ The term refers to the representation of infrastructure resources,\r
+ namely: the physical resources, the virtualization\r
+ facility resources and the virtual resources.\r
+\r
+Physical Resource\r
+ The term refers to a piece of hardware of the NFV infrastructure; it may\r
+ also include the firmware that enables the hardware.\r
+\r
+Virtual Resource\r
+ The term refers to a resource provided as a service built on top\r
+ of the physical resources via the virtualization facilities; in particular,\r
+ these are the resources on which VNF entities are deployed, e.g.\r
+ VMs, virtual switches, virtual routers, virtual disks, etc.\r
+\r
+Virtualization Facility\r
+ The term refers to a resource that enables the creation\r
+ of virtual environments on top of the physical resources, e.g.\r
+ hypervisor, OpenStack, etc.\r
+\r
+Upgrade Campaign\r
+ The term refers to a choreography that describes how the upgrade should\r
+ be performed in terms of its targets (i.e. the upgrade objects), the\r
+ steps/actions required to upgrade each of them, and the coordination of these\r
+ steps so that service availability can be maintained. It is an input to an\r
+ upgrade tool (Escalator) that carries out the upgrade.\r
+\r
+Upgrade Duration\r
+ The duration of an upgrade is characterized by the time elapsed between its\r
+ initiation and its completion, e.g. from the moment the execution of an\r
+ upgrade campaign has started until it has been committed. Depending on\r
+ the upgrade method and its target, some parts of the system may be in a more\r
+ vulnerable state during this period.\r
+\r
+Outage\r
+ The period of time during which a given service is not provided is referred\r
+ to as the outage of that service. If a subsystem or the entire system\r
+ does not provide any service, it is the outage of that subsystem or of the\r
+ system. Smooth upgrade means an upgrade with no outage for the user plane,\r
+ i.e. no VNF should experience service outage.\r
+\r
+Rollback\r
+ The term refers to a failure handling strategy that reverts the changes\r
+ done by a potentially failed upgrade execution one by one, in reverse order,\r
+ i.e. it is like undoing the changes done by the upgrade.\r
+\r
+Restore\r
+ The term refers to a failure handling strategy that reverts the changes\r
+ done by an upgrade by restoring the system from some backup data. This\r
+ results in the loss of any data persisted since the backup has been taken.\r
+\r
+Rollforward\r
+ The term refers to a failure handling strategy applied after a restore\r
+ (from a backup) operation to recover any data persisted between\r
+ the time the backup was taken and the moment it is restored. Rollforward\r
+ requires that data that needs to survive the restore operation is logged at\r
+ a location not impacted by the restore so that it can be re-applied to the\r
+ system after its restoration from the backup.\r
+\r
+Downgrade\r
+ The term refers to an upgrade in which an earlier version of the software\r
+ is restored through the upgrade procedure. A system can be downgraded to any\r
+ earlier version and the compatibility of the versions will determine the\r
+ applicable upgrade strategies and whether service outage can be avoided.\r
+ In particular any data conversion needs special attention.\r
+\r
+\r
+\r
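To illustrate the Rolling Upgrade strategy defined above, the wave over node batches can be sketched in a few lines of Python; all names here (`rolling_upgrade`, `upgrade_node`, `batch_size`) are hypothetical illustrations, not an Escalator or OpenStack interface:

```python
# Minimal sketch of a rolling upgrade wave, assuming a caller-supplied
# upgrade_node() callable that drains, upgrades and health-checks one node.
# All names are illustrative; this is not an Escalator API.
def rolling_upgrade(nodes, upgrade_node, batch_size=1):
    """Upgrade nodes batch by batch; stop the wave on the first failure."""
    upgraded = []
    for i in range(0, len(nodes), batch_size):
        for node in nodes[i:i + batch_size]:
            if not upgrade_node(node):
                return upgraded  # partial result; the caller may roll back
            upgraded.append(node)
    return upgraded
```

Because only a batch of nodes is out of service at any time, the remaining nodes keep providing service, which is what makes this strategy attractive for maintaining availability.
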
+Upgrade Objects\r
+~~~~~~~~~~~~~~~\r
+\r
+Physical Resource\r
+^^^^^^^^^^^^^^^^^\r
+\r
+Most cloud infrastructures support the dynamic addition/removal of\r
+hardware. Accordingly, a hardware upgrade could be done by adding the new\r
+piece of hardware and removing the old one. From the perspective of smooth\r
+upgrade, the orchestration/scheduling of these actions is the primary concern.\r
+Upgrading a physical resource may also involve the upgrade of its firmware\r
+and/or the modification of its configuration data. This may require a restart\r
+of the hardware.\r
+\r
+\r
+\r
+Virtual Resources\r
+^^^^^^^^^^^^^^^^^\r
+\r
+Addition and removal of virtual resources may be initiated by the users or be\r
+a result of an elasticity action. Users may also request the upgrade of their\r
+virtual resources using a new VM image.\r
+\r
+.. Needs to be moved to requirement section: Escalator should facilitate such\r
+   an option and allow for a smooth upgrade.\r
+\r
+On the other hand changes in the infrastructure, namely, in the hardware and/or\r
+the virtualization facility resources may result in the upgrade of the virtual\r
+resources. For example, if for some reason the hypervisor is changed and\r
+the current VMs cannot be migrated to the new hypervisor - they are\r
+incompatible - then the VMs need to be upgraded too. This is not\r
+something the NFVI user (i.e. the VNFs) would know about. In such cases\r
+smooth upgrade is essential.\r
+\r
+\r
+Virtualization Facility Resources\r
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\r
+\r
+Based on the functionality they provide, virtualization facility\r
+resources could be divided into computing node, networking node,\r
+storage node and management node.\r
+\r
+The possible upgrade objects in these nodes are addressed below:\r
+(Note: hardware-based virtualization may be considered a virtualization\r
+facility resource, but from the Escalator perspective, it is better\r
+considered as part of the hardware upgrade.)\r
+\r
+**Computing node**\r
+\r
+1. OS Kernel\r
+\r
+2. Hypervisor and virtual switch\r
+\r
+3. Other kernel modules, like driver\r
+\r
+4. User space software packages, like nova-compute agents and other\r
+ control plane programs.\r
+\r
+Updating 1 and 2 will cause the loss of the virtualization functionality of\r
+the compute node, which may lead to data plane service interruption\r
+if the virtual resource is not redundant.\r
+\r
+Updating 3 might result in the same.\r
+\r
+Updating 4 might lead to control plane service interruption if not an\r
+HA deployment.\r
+\r
+**Networking node**\r
+\r
+1. OS kernel, optional; not all switches/routers allow upgrading their\r
+   OS since it is more like a firmware than a generic OS.\r
+\r
+2. User space software packages, like neutron agents and other control\r
+ plane programs\r
+\r
+Updating 1, if allowed, will cause a node reboot and therefore lead to\r
+data plane service interruption if the virtual resource is not\r
+redundant.\r
+\r
+Updating 2 might lead to control plane service interruption if not an\r
+HA deployment.\r
+\r
+**Storage node**\r
+\r
+1. OS kernel, optional; not all storage nodes allow upgrading their OS\r
+   since it is more like a firmware than a generic OS.\r
+\r
+2. Kernel modules\r
+\r
+3. User space software packages, control plane programs\r
+\r
+Updating 1, if allowed, will cause a node reboot and therefore lead to\r
+data plane service interruption if the virtual resource is not\r
+redundant.\r
+\r
+Updating 2 might result in the same.\r
+\r
+Updating 3 might lead to control plane service interruption if not an\r
+HA deployment.\r
+\r
+**Management node**\r
+\r
+1. OS Kernel\r
+\r
+2. Kernel modules, like driver\r
+\r
+3. User space software packages, like database, message queue and\r
+ control plane programs.\r
+\r
+Updating 1 will cause a node reboot and therefore lead to control\r
+plane service interruption if not an HA deployment. Updating 2 might\r
+result in the same.\r
+\r
+Updating 3 might lead to control plane service interruption if not an\r
+HA deployment.\r
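
The per-node lists above could be summarized as a lookup from upgrade object to the plane it may interrupt and the mitigation; the table below is a hypothetical illustration of that summary, not an Escalator data model:

```python
# Hypothetical summary of the node lists above: which plane an upgrade of a
# given object may interrupt, and what mitigates the interruption.
IMPACT = {
    ("computing",  "kernel/hypervisor"): ("data plane",    "redundant virtual resources"),
    ("computing",  "user space"):        ("control plane", "HA deployment"),
    ("networking", "kernel"):            ("data plane",    "redundant virtual resources"),
    ("networking", "user space"):        ("control plane", "HA deployment"),
    ("storage",    "kernel"):            ("data plane",    "redundant virtual resources"),
    ("storage",    "user space"):        ("control plane", "HA deployment"),
    ("management", "kernel"):            ("control plane", "HA deployment"),
    ("management", "user space"):        ("control plane", "HA deployment"),
}
```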
+\r
+\r
+\r
+\r
+\r
+Upgrade Granularity\r
+~~~~~~~~~~~~~~~~~~~\r
+\r
+The granularity of an upgrade can be characterized from two perspectives:\r
+\r
+- the physical dimension and\r
+- the software dimension\r
+\r
+\r
+Physical Dimension\r
+^^^^^^^^^^^^^^^^^^\r
+\r
+The physical dimension characterizes the number of similar upgrade objects\r
+targeted by the upgrade, i.e. whether it is a full or partial upgrade of a\r
+data centre, cluster, or zone.\r
+Because of its size, the upgrade of a data centre or a zone may be divided\r
+into several batches. Thus there is a need for efficiency in the execution of\r
+upgrades of a potentially huge number of upgrade objects while still\r
+maintaining availability to fulfill the requirement of smooth upgrade.\r
+\r
+The upgrade of a cloud environment (cluster) may also\r
+be partial. For example, in one cloud environment running a number of\r
+VNFs, we may just try to upgrade one of them to check stability and\r
+performance before we upgrade all of them.\r
+Thus there is a need for proper organization of the artifacts associated with\r
+the different upgrade objects. Also, the different versions should be able\r
+to coexist beyond the upgrade period.\r
+\r
+From this perspective, special attention may be needed when upgrading\r
+objects that are collaborating in a redundancy schema, as in this case\r
+different versions not only need to coexist but also collaborate. This\r
+puts requirements primarily on the upgrade objects. If this is not possible,\r
+the upgrade campaign should be designed in such a way that the proper\r
+isolation is ensured.\r
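
A batch plan that respects such redundancy relations can be sketched as follows; the grouping input is a hypothetical illustration of the redundancy information an upgrade tool would need:

```python
# Sketch: split nodes into upgrade batches so that at most one member of
# each redundancy group is out of service at a time. Input format is assumed.
def plan_batches(nodes_by_group):
    """nodes_by_group: {group_name: [node, ...]} -> list of batches."""
    pools = {g: list(ns) for g, ns in nodes_by_group.items()}
    batches = []
    while any(pools.values()):
        # take at most one node per redundancy group into the current batch
        batches.append([ns.pop() for ns in pools.values() if ns])
    return batches
```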
+\r
+Software Dimension\r
+^^^^^^^^^^^^^^^^^^\r
+\r
+The software dimension of the upgrade characterizes the upgrade object\r
+type targeted and the combination in which they are upgraded together.\r
+\r
+Even though the upgrade may\r
+initially target only one type of upgrade object, e.g. the hypervisor,\r
+the dependency of other upgrade objects on this initial target may\r
+require their upgrade as well, i.e. the upgrades need to be combined. From\r
+this perspective the main concern is the compatibility of the dependent and\r
+sponsor objects. To take these dependencies into consideration,\r
+they need to be described together with the version compatibility information.\r
+Breaking dependencies is the major cause of outages during upgrades.\r
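
Given such a dependency description, a safe upgrade order can be derived by topological sorting; a minimal sketch, assuming the dependency map comes from package or deployment metadata (Python 3.9+ for `graphlib`):

```python
# Sketch: upgrade sponsor objects before the objects that depend on them,
# so that no dependency is broken mid-upgrade. The dependency map is assumed
# input; the object names are illustrative.
from graphlib import TopologicalSorter

def upgrade_order(depends_on):
    """depends_on: {obj: {objects it depends on}} -> a safe upgrade order."""
    return list(TopologicalSorter(depends_on).static_order())
```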
+\r
+In other cases it is more efficient to upgrade a combination of upgrade\r
+objects than to do it one by one. One aspect of the combination is how\r
+the upgrade packages can be combined: whether a new image can be created for\r
+them beforehand, or the different packages can be installed during the upgrade\r
+independently but activated together.\r
+\r
+The combination of upgrade objects may span across\r
+layers (e.g. software stack in the host and the VM of the VNF).\r
+Thus, it may require additional coordination between the management layers.\r
+\r
+With respect to each upgrade object type, and even entire stacks, we can\r
+distinguish major and minor upgrades:\r
+\r
+**Major Upgrade**\r
+\r
+Upgrades between major releases may introduce significant changes in\r
+function, configuration and data, such as the upgrade of OPNFV from\r
+Arno to Brahmaputra.\r
+\r
+**Minor Upgrade**\r
+\r
+Upgrades within one major release, which do not lead to changes in\r
+the structure of the platform and do not affect the schema of the\r
+system data.\r
+\r
+Scope of Impact\r
+~~~~~~~~~~~~~~~\r
+\r
+Considering availability, and therefore smooth upgrade, one of the major\r
+concerns is the predictability and control of the outcome of the different\r
+upgrade operations. Ideally an upgrade can be performed without impacting any\r
+entity in the system, which means none of the operations change or potentially\r
+change the behaviour of any entity in the system in an uncontrolled manner.\r
+Accordingly, the operations of such an upgrade can be performed at any time\r
+while the system is running and all the entities are online. No entity needs\r
+to be taken offline to avoid such adverse effects. Hence such upgrade\r
+operations are referred to as online operations. The effects of the upgrade\r
+might be activated the next time the entity is used, or may require a special\r
+activation action such as a restart. Note that an activation action provides\r
+more control and predictability.\r
+\r
+If an entity's behaviour in the system may change due to the upgrade, it may\r
+be better to take it offline for the duration of the relevant upgrade\r
+operations. The main question, then, is which hosted entities are impacted,\r
+considering the hosting relation of an upgrade object. Accordingly, we can\r
+identify a scope which is impacted by taking the given upgrade object\r
+offline. The entities that are in the scope of impact may need to be taken\r
+offline or moved out of this scope, i.e. migrated.\r
+\r
+If an impacted entity is in a different layer managed by another manager,\r
+coordination may be required: infrastructure resources that support virtual\r
+resources used by VNFs are taken out of service for the time of their\r
+upgrade, while those VNFs should not experience outages. The hosted VNFs\r
+may or may not allow the hot migration of their VMs. In case of migration,\r
+the VM placement policy should be considered.\r
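
The scope of impact along the hosting relation can be computed as a transitive closure over a hosting map; the map below is assumed input for illustration, not an existing interface:

```python
# Sketch: everything transitively hosted on an upgrade object falls into its
# scope of impact when that object is taken offline. The hosting map is an
# assumed input structure.
def impact_scope(hosts, obj):
    """hosts: {object: [entities hosted on it]} -> set of impacted entities."""
    impacted, stack = set(), [obj]
    while stack:
        for hosted in hosts.get(stack.pop(), []):
            if hosted not in impacted:
                impacted.add(hosted)
                stack.append(hosted)
    return impacted
```

Entities in the returned set either need to be taken offline together with the upgrade object or migrated out of the scope first.
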
+\r
+\r
+\r
+Upgrade duration\r
+~~~~~~~~~~~~~~~~\r
+\r
+As the OPNFV end-users are primarily Telecom operators, the network\r
+services provided by the VNFs deployed on the NFVI should meet the\r
+requirement of 'Carrier Grade'::\r
+\r
+ In telecommunication, a "carrier grade" or "carrier class" refers to a\r
+ system, or a hardware or software component that is extremely reliable,\r
+ well tested and proven in its capabilities. Carrier grade systems are\r
+ tested and engineered to meet or exceed "five nines" high availability\r
+ standards, and provide very fast fault recovery through redundancy\r
+ (normally less than 50 milliseconds). [from wikipedia.org]\r
+\r
+"Five nines" means the system works all the time in ONE YEAR except for at
+most 5'15" of downtime.
+\r
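The 5'15" figure follows directly from the availability arithmetic; a quick back-of-the-envelope check, in plain Python for illustration only:

```python
# Back-of-the-envelope check of the "five nines" downtime budget.
SECONDS_PER_YEAR = 365 * 24 * 3600  # non-leap year

def downtime_seconds_per_year(availability):
    """Allowed downtime in seconds per year for a given availability."""
    return SECONDS_PER_YEAR * (1.0 - availability)

# 99.999 % availability leaves about 315 seconds, i.e. roughly 5'15",
# of downtime per year.
print(round(downtime_seconds_per_year(0.99999)))  # -> 315
```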
+::\r
+\r
+   We have learnt that a well prepared upgrade of OpenStack needs 10
+   minutes. Most of the outage time is spent on synchronizing the
+   database. [from 'Ten minutes OpenStack Upgrade? Done!' by Symantec]
+\r
+This 10 minutes of downtime of the OpenStack services, however, did not
+impact the users, i.e. the VMs running on the compute nodes; it was an outage
+of the control plane only. On the other hand, with respect to the
+preparations, this was a manually tailored upgrade specific to the particular
+deployment and the versions of each OpenStack service.
+\r
+The project targets a more generic methodology, which however requires that
+the upgrade objects fulfil certain requirements. Since this is only possible
+in the long run, we first target the upgrade of the different VIM services
+from version to version.
+\r
+**Questions:**\r
+\r
+1. Can we manage to upgrade OPNFV in only 5 minutes?\r
+ \r
+.. <MT> The first question is whether we have the same carrier grade
+   requirement on the control plane as on the user plane, i.e. how much
+   control plane outage we can or are willing to tolerate.
+   In the above case, if the database were only half the size we could
+   probably do the upgrade in 5 minutes, but is that good? It also means
+   that if the database is twice as big then the outage is 20 minutes.
+   For the user plane we should aim for less, as with two releases yearly
+   that means 10 minutes of outage per year.
+\r
+.. <Malla> 10 minutes outage per year to the users? Plus, if we take the
+   control plane into consideration, then the total outage will be more
+   than 10 minutes in the whole network, right?
+\r
+.. <MT> The control plane outage does not have to cause an outage to the
+   users, but it may, of course, depending on the size of the system, as
+   it is more likely that there is a failure that needs to be handled by
+   the control plane.
+\r
+2. Is it acceptable for end-users that a planned service interruption
+   for a software upgrade lasts more than ten minutes?
+\r
+.. <MT> For the user plane, no, it is not acceptable in the carrier-grade
+   case. The 5' 15" downtime budget should include both unplanned and
+   planned downtimes.
+ \r
+.. <Malla> I agree with Maria, it is not acceptable.
+\r
+3. Will the VNFs still work well when the VIM is down?
+\r
+.. <MT> In case of OpenStack it seems yes. :-)
+\r
+The maximum duration of an upgrade\r
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\r
+\r
+The duration of an upgrade is related and proportional to the scale and
+the complexity of the OPNFV platform, as well as to the granularity
+(in function and in space) of the upgrade.
+\r
+.. <Malla> Also, if it is a partial upgrade, such as a module upgrade, the
+   duration also depends on the OPNFV modules and their tightly coupled
+   entities.
+\r
+.. <MT> Since the maintenance window is shrinking and becoming non-existent\r
+ the duration of the upgrade is secondary to the requirement of smooth upgrade.\r
+ But probably we want to be able to put a time constraint on each upgrade\r
+ during which it must complete otherwise it is considered failed and the system\r
+ should be rolled back. I.e. in case of automatic execution it might not be clear\r
+ if an upgrade is long or just hanging. The time constraints may be a function\r
+ of the size of the system in terms of the upgrade object(s).\r
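The time-constraint idea can be sketched as follows. `run_with_deadline` and the step functions are hypothetical helpers, not part of any real OPNFV or OpenStack tooling; they only illustrate how an automatic executor could distinguish a hanging upgrade from a merely long one and report failure so a roll back can be triggered.

```python
# Hypothetical sketch: run one upgrade step under a deadline. If the
# step raises or misses its deadline it is considered failed, and the
# caller should roll the system back.
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as StepTimeout

def run_with_deadline(step, timeout_s):
    """Run an upgrade step; return False if it raises or times out."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(step)
    try:
        future.result(timeout=timeout_s)
        ok = True
    except StepTimeout:
        ok = False          # hanging or too slow: treat as failed
    except Exception:
        ok = False          # the step itself failed
    pool.shutdown(wait=False)
    return ok
```

The deadline itself would, as noted above, be a function of the size of the system in terms of the upgrade object(s).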
+\r
+The maximum duration of a roll back when an upgrade fails
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\r
+\r
+The duration of a roll back is shorter than that of the corresponding
+upgrade. It depends on the time needed to restore the software and
+configuration data from the pre-upgrade backup / snapshot.
+\r
+.. <MT> Requirement: It should be possible to roll back gracefully the
+   failed upgrade of stateful services of the control plane.
+ \r
+.. <MT> Two possible types of failures can happen during an upgrade\r
+\r
+.. <MT> We can recover from the failure that occurred in the upgrade process:\r
+ In this case, a graceful rolling back of the executed part of the\r
+ upgrade may be possible which would "undo" the executed part in a\r
+ similar fashion. Thus, such a roll back introduces no more service\r
+ outage during an upgrade than the executed part introduced. This\r
+ process typically requires the same amount of time as the executed\r
+ portion of the upgrade and impose minimal state/data loss.\r
+\r
+.. <MT> We cannot recover from the failure that occurred in the upgrade\r
+ process: In this case, the system needs to fall back to an earlier\r
+ consistent state by reloading this backed-up state. This is typically\r
+ a faster recovery procedure than the graceful roll back, but can cause\r
+ state/data loss. The state/data loss usually depends on the\r
+ statefulness of the entities whose state is restored from the backup.\r
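The two recovery types can be illustrated with a small sketch; the helper and the (do, undo) action pairs are hypothetical, not a real upgrade engine. Each executed upgrade action records an undo action; on failure, a graceful roll back undoes the executed part in reverse order, and only if undoing itself fails does the system fall back to the pre-upgrade snapshot, accepting state/data loss.

```python
# Hypothetical sketch of the two recovery types described above.
# Each action is a (do, undo) pair operating on a shared state dict.
def upgrade_with_recovery(actions, state, snapshot):
    executed = []
    try:
        for do, undo in actions:
            do(state)
            executed.append(undo)
    except Exception:
        try:
            for undo in reversed(executed):   # graceful roll back: undo
                undo(state)                   # in reverse order
        except Exception:
            state.clear()                     # fall back: reload the
            state.update(snapshot)            # backed-up state (data loss)
    return state
```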
+\r
+The maximum duration of a VNF interruption (Service outage)\r
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\r
+\r
+Since not the entire process of a smooth upgrade affects the VNFs, the
+duration of the VNF interruption may be shorter than the duration of the
+upgrade. In some cases, the VNF running without the control of the VIM
+is acceptable.
+\r
+.. <MT> Should we require explicitly that the NFVI should be able to
+   provide its services to the VNFs independently of the control plane?
+\r
+.. <MT> Requirement: The upgrade of the control plane must not cause\r
+ interruption of the NFVI services provided to the VNFs.\r
+\r
+.. <MT> With respect to carrier-grade the yearly service outage of the\r
+ VNF should not exceed 5' 15" regardless whether it is planned or\r
+ unplanned outage. Considering the HA requirements TL-9000 requires an\r
+ end-to-end service recovery time of 15 seconds based on which the ETSI\r
+ GS NFV-REL 001 V1.1.1 (2015-01) document defines three service\r
+ availability levels (SAL). The proposed example service recovery times\r
+ for these levels are:\r
+\r
+.. <MT> SAL1: 5-6 seconds\r
+\r
+.. <MT> SAL2: 10-15 seconds\r
+\r
+.. <MT> SAL3: 20-25 seconds\r
+\r
+.. <Pva> My comment was actually that the downtime metrics of the
+   underlying elements, components and services are a small fraction of
+   the total E2E service availability time. No-one on the E2E service
+   path will get the whole downtime allocation (in this context it
+   includes upgrade-process-related outages for the services provided by
+   the VIM and other elements that are subject to the upgrade process).
+ \r
+.. <MT> So what you are saying is that the upgrade of any entity
+   (component, service) shouldn't cause even this much service
+   interruption. This was the reason I brought these figures here: they
+   pose some kind of upper boundary. Ideally the interruption is in the
+   millisecond range, i.e. no more than a switch-over or a live
+   migration.
+ \r
+.. <MT> Requirement: Any interruption caused to the VNF by the upgrade\r
+ of the NFVI should be in the sub-second range.\r
+\r
+.. <MT> In the future we also need to consider the upgrade of the NFVI,
+   i.e. HW, firmware, hypervisors, host OS etc.
\ No newline at end of file