+General Requirements Background and Terminology\r
+-----------------------------------------------\r
+\r
+Terminology and definitions
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+\r
+NFVI
+ The term is an abbreviation for Network Function Virtualization
+ Infrastructure; sometimes it is also referred to as the data plane in this
+ document.
+\r
+VIM
+ The term is an abbreviation for Virtual Infrastructure Management;
+ sometimes it is also referred to as the control plane in this document.
+ \r
+Operator\r
+ The term refers to network service providers and Virtual Network\r
+ Function (VNF) providers.\r
+\r
+End-User\r
+ The term refers to a subscriber of the Operator's services.\r
+\r
+Network Service
+ The term refers to a service provided by an Operator to its
+ End-Users using a set of (virtualized) Network Functions.
+\r
+Infrastructure Services
+ The term refers to services provided by the NFV Infrastructure and the
+ Management & Orchestration functions to the VNFs, i.e.
+ these are the virtual resources as perceived by the VNFs.
+\r
+Smooth Upgrade\r
+ The term refers to an upgrade that results in no service outage \r
+ for the end-users.\r
+\r
+Rolling Upgrade
+ The term refers to an upgrade strategy that upgrades each node or
+ a subset of nodes in waves rolling through the data centre. It
+ is a popular upgrade strategy for maintaining service availability.
+\r
+Parallel Universe\r
+ The term refers to an upgrade strategy that creates and deploys\r
+ a new universe - a system with the new configuration - while the old\r
+ system continues running. The state of the old system is transferred\r
+ to the new system after sufficient testing of the new system.\r
+\r
+Infrastructure Resource Model\r
+ The term refers to the representation of infrastructure resources,\r
+ namely: the physical resources, the virtualization\r
+ facility resources and the virtual resources.\r
+\r
+Physical Resource
+ The term refers to a piece of hardware of the NFV infrastructure, which may
+ also include the firmware which enables the hardware.
+\r
+Virtual Resource
+ The term refers to a resource provided as a service built on top
+ of the physical resources via the virtualization facilities; in particular,
+ these are the resources on which VNF entities are deployed, e.g.
+ the VMs, virtual switches, virtual routers, virtual disks etc.
+\r
+.. <MT> I don't think the VNF is the virtual resource. Virtual\r
+ resources are the VMs, virtual switches, virtual routers, virtual\r
+ disks etc. The VNF uses them, but I don't think they are equal. The\r
+ VIM doesn't manage the VNF, but it does manage virtual resources.\r
+ \r
+Virtualization Facility
+ The term refers to a resource that enables the creation
+ of virtual environments on top of the physical resources, e.g.
+ hypervisor, OpenStack, etc.
+\r
+Upgrade Plan (or Campaign?)
+ The term refers to a choreography that describes how the upgrade should
+ be performed in terms of its targets (i.e. upgrade objects), the
+ steps/actions required to upgrade each, and the coordination of these
+ steps so that service availability can be maintained. It is an input to an
+ upgrade tool (Escalator) to carry out the upgrade (an illustrative sketch
+ is given below).
+\r
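+The exact format of an upgrade plan is still to be defined. As a purely
+illustrative sketch (the field names and values below are assumptions, not
+an agreed Escalator input format), a plan could be captured as a simple
+data structure listing the upgrade targets, the strategy and the ordered
+steps::
+
+    # Hypothetical example only: field names and values are illustrative
+    # assumptions, not a defined Escalator interface.
+    upgrade_plan = {
+        "strategy": "rolling",                # or "parallel_universe"
+        "targets": [                          # the upgrade objects
+            {"type": "controller", "nodes": ["ctrl-1", "ctrl-2", "ctrl-3"]},
+            {"type": "compute", "nodes": ["cmp-1", "cmp-2", "cmp-3"]},
+        ],
+        "steps": [
+            {"action": "backup", "scope": "controller"},
+            {"action": "upgrade", "scope": "controller", "batch_size": 1},
+            {"action": "upgrade", "scope": "compute", "batch_size": 2},
+            {"action": "verify", "scope": "all"},
+        ],
+        # constraint used to decide between continuing and rolling back
+        "max_duration_minutes": 60,
+    }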
+\r
+Upgrade Objects\r
+~~~~~~~~~~~~~~~\r
+\r
+Physical Resource\r
+^^^^^^^^^^^^^^^^^\r
+\r
+Most cloud infrastructures support dynamic addition/removal of
+hardware. A hardware upgrade could be done by adding the new
+hardware node and removing the old one. From the perspective of smooth
+upgrade, the orchestration/scheduling of these actions is the primary concern.
+Upgrading a physical resource in place, such as upgrading its firmware
+and/or modifying its configuration data, may also be considered in the
+future.
+\r
+\r
+Virtual Resources\r
+^^^^^^^^^^^^^^^^^\r
+\r
+Virtual resource upgrades are mainly done by users. OPNFV may facilitate
+the activity, but it is suggested to address it in the long-term roadmap
+rather than in the initial release.
+\r
+.. <MT> same comment here: I don't think the VNF is the virtual\r
+ resource. Virtual resources are the VMs, virtual switches, virtual\r
+ routers, virtual disks etc. The VNF uses them, but I don't think they\r
+ are equal. For example if by some reason the hypervisor is changed and\r
+ the current VMs cannot be migrated to the new hypervisor, they are\r
+ incompatible, then the VMs need to be upgraded too. This is not\r
+ something the NFVI user (i.e. VNFs ) would even know about.\r
+\r
+\r
+Virtualization Facility Resources\r
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\r
+\r
+Based on the functionality they provide, virtualization facility
+resources could be divided into computing nodes, networking nodes,
+storage nodes and management nodes.
+
+The possible upgrade objects in these nodes are addressed below.
+(Note: hardware-based virtualization may be considered a virtualization
+facility resource, but from the escalator perspective, it is better to
+consider it as part of the hardware upgrade.)
+\r
+**Computing node**\r
+\r
+1. OS Kernel\r
+\r
+2. Hypervisor and virtual switch
+\r
+3. Other kernel modules, like drivers
+\r
+4. User space software packages, like nova-compute agents and other\r
+ control plane programs.\r
+\r
+Updating 1 and 2 will cause the loss of the virtualization functionality of
+the compute node, which may lead to data plane services interruption
+if the virtual resource is not redundant.
+
+Updating 3 might result in the same.
+
+Updating 4 might lead to control plane services interruption if not an
+HA deployment.
+\r
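+To avoid data plane interruption when items 1-3 above are upgraded, the VMs
+hosted on the node typically have to be moved away before the node is
+touched. The following sketch outlines such a rolling procedure; all helper
+functions are hypothetical placeholders for the corresponding VIM
+operations, not an existing API::
+
+    # Illustrative sketch of a rolling compute node upgrade.
+    # All helpers are hypothetical placeholders for VIM operations.
+    def rolling_compute_upgrade(compute_nodes):
+        for node in compute_nodes:
+            disable_scheduling(node)     # stop placing new VMs on the node
+            live_migrate_all_vms(node)   # evacuate VMs to keep the data plane up
+            upgrade_packages(node)       # kernel, hypervisor, virtual switch, agents
+            reboot_if_needed(node)
+            verify_node_health(node)     # abort and roll back the campaign on failure
+            enable_scheduling(node)      # node rejoins the pool before the next one
+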
+**Networking node**\r
+\r
+1. OS kernel, optional, as not all switches/routers allow upgrading their
+   OS since it is more like firmware than a generic OS.
+\r
+2. User space software package, like neutron agents and other control\r
+ plane programs\r
+\r
+Updating 1, if allowed, will cause a node reboot and therefore lead to
+data plane service interruption if the virtual resource is not
+redundant.
+\r
+Updating 2 might lead to control plane services interruption if not an\r
+HA deployment.\r
+\r
+**Storage node**\r
+\r
+1. OS kernel, optional, as not all storage nodes allow upgrading their OS
+   since it is more like firmware than a generic OS.
+\r
+2. Kernel modules\r
+\r
+3. User space software packages, control plane programs\r
+\r
+Updating 1, if allowed, will cause a node reboot and therefore lead to
+data plane services interruption if the virtual resource is not
+redundant.
+
+Updating 2 might result in the same.
+\r
+Updating 3 might lead to control plane services interruption if not an\r
+HA deployment.\r
+\r
+**Management node**\r
+\r
+1. OS Kernel\r
+\r
+2. Kernel modules, like drivers
+\r
+3. User space software packages, like database, message queue and\r
+ control plane programs.\r
+\r
+Updating 1 will cause a node reboot and therefore lead to control
+plane services interruption if not an HA deployment. Updating 2 might
+result in the same.
+\r
+Updating 3 might lead to control plane services interruption if not an\r
+HA deployment.\r
+\r
+Upgrade Span\r
+~~~~~~~~~~~~\r
+\r
+**Major Upgrade**\r
+\r
+Upgrades between major releases may introduce significant changes in
+function, configuration and data, such as the upgrade of OPNFV from
+Arno to Brahmaputra.
+\r
+**Minor Upgrade**\r
+\r
+Upgrades within one major release, which would not lead to changes in
+the structure of the platform and should not affect the schema of the
+system data.
+\r
+Upgrade Granularity\r
+~~~~~~~~~~~~~~~~~~~\r
+\r
+Physical/Hardware Dimension\r
+^^^^^^^^^^^^^^^^^^^^^^^^^^^\r
+\r
+Full or partial upgrades of a data centre, cluster, or zone should be
+supported. The upgrade of a data centre or a zone may be divided into
+several batches. The upgrade of a cloud environment (cluster) may also
+be partial. For example, in one cloud environment running a number of
+VNFs, we may upgrade just one of them to check the stability and
+performance before we upgrade all of them.
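+
+As an illustration of such a batched, partial execution, the sketch below
+groups the nodes of a zone into batches that are upgraded and verified one
+after another (the batch size, node names and helper functions are
+assumptions chosen for the example)::
+
+    # Illustrative only: split the nodes of a zone into small batches so the
+    # zone is upgraded gradually rather than in one shot.
+    def make_batches(nodes, batch_size=2):
+        return [nodes[i:i + batch_size] for i in range(0, len(nodes), batch_size)]
+
+    zone_nodes = ["node-1", "node-2", "node-3", "node-4", "node-5"]
+    for batch in make_batches(zone_nodes):
+        upgrade_batch(batch)     # hypothetical helper: upgrade one batch of nodes
+        verify_batch(batch)      # check stability and performance before continuing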
+\r
+Software Dimension\r
+^^^^^^^^^^^^^^^^^^\r
+\r
+- The upgrade of host OS or kernel may need a 'hot migration'\r
+- The upgrade of OpenStack’s components\r
+\r
+  i. the one-shot upgrade of all components
+
+  ii. the partial upgrade (or bugfix patch) which only affects some
+      components (e.g., computing, storage, network, database, message
+      queue, etc.)
+\r
+.. <MT> this section seems to overlap with 2.1.\r
+ I can see the following dimensions for the software.\r
+\r
+.. <MT> different software packages\r
+\r
+.. <MT> different functions - Considering that the target versions of all\r
+ software are compatible the upgrade needs to ensure that any\r
+ dependencies between SW and therefore packages are taken into account\r
+ in the upgrade plan, i.e. no version mismatch occurs during the\r
+ upgrade therefore dependencies are not broken\r
+ \r
+.. <MT> same function - This is an upgrade specific question if different\r
+ versions can coexist in the system when a SW is being upgraded from\r
+ one version to another. This is particularly important for stateful\r
+ functions e.g. storage, networking, control services. The upgrade\r
+ method must consider the compatibility of the redundant entities.\r
+\r
+.. <MT> different versions of the same software package\r
+\r
+.. <MT> major version changes - they may introduce incompatibilities. Even\r
+ when there are backward compatibility requirements changes may cause\r
+ issues at graceful roll-back\r
+ \r
+.. <MT> minor version changes - they must not introduce incompatibility\r
+ between versions, these should be primarily bug fixes, so live\r
+ patches should be possible\r
+ \r
+.. <MT> different installations of the same software package\r
+\r
+.. <MT> using different installation options - they may reflect different\r
+ users with different needs so redundancy issues are less likely\r
+ between installations of different options; but they could be the\r
+ reflection of the heterogeneous system in which case they may provide\r
+ redundancy for higher availability, i.e. deeper inspection is needed\r
+ \r
+.. <MT> using the same installation options - they often reflect that they
+   are used by redundant entities across space
+ \r
+.. <MT> different distribution possibilities in space - same or different\r
+ availability zones, multi-site, geo-redundancy\r
+ \r
+.. <MT> different entities running from the same installation of a software\r
+ package\r
+ \r
+.. <MT> using different start-up options - they may reflect different users so
+   redundancy may not be an issue between them
+ \r
+.. <MT> using same start-up options - they often reflect redundant\r
+ entities\r
+\r
+Upgrade duration\r
+~~~~~~~~~~~~~~~~\r
+\r
+As the OPNFV end-users are primarily Telecom operators, the network\r
+services provided by the VNFs deployed on the NFVI should meet the\r
+requirement of 'Carrier Grade'. ::
+\r
+  In telecommunication, a "carrier grade" or "carrier class" refers to a
+  system, or a hardware or software component that is extremely reliable,
+  well tested and proven in its capabilities. Carrier grade systems are
+  tested and engineered to meet or exceed "five nines" high availability
+  standards, and provide very fast fault recovery through redundancy
+  (normally less than 50 milliseconds). [from wikipedia.org]
+\r
+"Five nines" means the system works all the time in one year except for at
+most 5 minutes and 15 seconds of downtime.
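+
+The figure follows directly from the definition of 99.999% availability
+over one year, as the short calculation below shows::
+
+    minutes_per_year = 365 * 24 * 60                     # 525600 minutes
+    allowed_downtime = minutes_per_year * (1 - 0.99999)  # in minutes
+    print(allowed_downtime)                              # ~5.26 minutes, i.e. roughly 5'15"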
+\r
+::\r
+\r
+  We have learnt that a well prepared upgrade of OpenStack needs 10
+  minutes. The major part of the outage time is spent on
+  synchronizing the database. [from 'Ten minutes OpenStack Upgrade? Done!'
+  by Symantec]
+\r
+This 10 minutes of downtime of the OpenStack services, however, did not
+impact the users, i.e. the VMs running on the compute nodes. This was an
+outage of the control plane only. On the other hand, with respect to the
+preparations, this was a manually tailored upgrade specific to the
+particular deployment and the versions of each OpenStack service.
+
+The project targets a more generic methodology, which however
+requires that the upgrade objects fulfil certain requirements. Since
+this is only possible in the long run, we first target the upgrade
+of the different VIM services from version to version.
+\r
+**Questions:**\r
+\r
+1. Can we manage to upgrade OPNFV in only 5 minutes?\r
+ \r
+.. <MT> The first question is whether we have the same carrier grade\r
+ requirement on the control plane as on the user plane. I.e. how\r
+ much control plane outage we can/willing to tolerate?\r
+ In the above case probably if the database is only half of the size\r
+ we can do the upgrade in 5 minutes, but is that good? It also means\r
+ that if the database is twice as much then the outage is 20\r
+ minutes.\r
+ For the user plane we should go for less as with two release yearly\r
+ that means 10 minutes outage per year.\r
+\r
+.. <Malla> 10 minutes outage per year to the users? Plus, if we take\r
+ control plane into the consideration, then total outage will be\r
+ more than 10 minute in whole network, right?\r
+\r
+.. <MT> The control plane outage does not have to cause outage to\r
+ the users, but it may of course depending on the size of the system\r
+ as it's more likely that there's a failure that needs to be handled\r
+ by the control plane.\r
+\r
+2. Is it acceptable for end users? For example, a planned service
+   interruption lasting more than ten minutes for a software
+   upgrade.
+\r
+.. <MT> For user plane, no it's not acceptable in case of\r
+ carrier-grade. The 5' 15" downtime should include unplanned and\r
+ planned downtimes.\r
+ \r
+.. <Malla> I do agree with Maria, it is not acceptable.
+\r
+3. Will the VNFs still work well when the VIM is down?
+\r
+.. <MT> In case of OpenStack it seems yes. :)
+\r
+The maximum duration of an upgrade\r
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\r
+\r
+The duration of an upgrade is related to and proportional with the\r
+scale and the complexity of the OPNFV platform as well as the\r
+granularity (in function and in space) of the upgrade.\r
+\r
+.. <Malla> Also, if it is a partial upgrade, like a module upgrade, it
+   depends also on the OPNFV modules and the entities tightly connected to them.
+\r
+.. <MT> Since the maintenance window is shrinking and becoming non-existent\r
+ the duration of the upgrade is secondary to the requirement of smooth upgrade.\r
+ But probably we want to be able to put a time constraint on each upgrade\r
+ during which it must complete otherwise it is considered failed and the system\r
+ should be rolled back. I.e. in case of automatic execution it might not be clear\r
+ if an upgrade is long or just hanging. The time constraints may be a function\r
+ of the size of the system in terms of the upgrade object(s).\r
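+
+If a time constraint is put on an upgrade, e.g. to decide when an
+automatically executed upgrade should be considered failed (rather than
+merely slow) and rolled back, one way to derive and enforce it is sketched
+below. The per-object budget and the helper functions are assumptions used
+for illustration only::
+
+    import time
+
+    # Illustrative only: the time budget grows with the number of upgrade objects.
+    def run_with_time_budget(upgrade_objects, per_object_minutes=5):
+        budget = len(upgrade_objects) * per_object_minutes * 60   # budget in seconds
+        start = time.time()
+        for obj in upgrade_objects:
+            upgrade_object(obj)             # hypothetical helper: upgrade one object
+            if time.time() - start > budget:
+                roll_back(upgrade_objects)  # hypothetical helper: graceful roll back
+                raise RuntimeError("upgrade exceeded its time budget")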
+\r
+The maximum duration of a roll back when an upgrade fails
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+\r
+The duration of a roll back is shorter than that of the corresponding
+upgrade. It depends on the time needed to restore the software and
+configuration data from the pre-upgrade backup / snapshot.
+\r
+.. <MT> During the upgrade process two types of failure may happen:\r
+ In case we can recover from the failure by undoing the upgrade\r
+ actions it is possible to roll back the already executed part of the\r
+ upgrade in graceful manner introducing no more service outage than\r
+ what was introduced during the upgrade. Such a graceful roll back\r
+ requires typically the same amount of time as the executed portion of\r
+ the upgrade and impose minimal state/data loss.\r
+ \r
+.. <MT> Requirement: It should be possible to roll back gracefully the\r
+ failed upgrade of stateful services of the control plane.\r
+ In case we cannot recover from the failure by just undoing the\r
+ upgrade actions, we have to restore the upgraded entities from their\r
+ backed up state. In other terms the system falls back to an earlier\r
+ state, which is typically a faster recovery procedure than graceful\r
+ roll back and depending on the statefulness of the entities involved it\r
+ may result in significant state/data loss.\r
+ \r
+.. <MT> Two possible types of failures can happen during an upgrade\r
+\r
+.. <MT> We can recover from the failure that occurred in the upgrade process:\r
+ In this case, a graceful rolling back of the executed part of the\r
+ upgrade may be possible which would "undo" the executed part in a\r
+ similar fashion. Thus, such a roll back introduces no more service\r
+ outage during an upgrade than the executed part introduced. This\r
+ process typically requires the same amount of time as the executed\r
+ portion of the upgrade and impose minimal state/data loss.\r
+\r
+.. <MT> We cannot recover from the failure that occurred in the upgrade\r
+ process: In this case, the system needs to fall back to an earlier\r
+ consistent state by reloading this backed-up state. This is typically\r
+ a faster recovery procedure than the graceful roll back, but can cause\r
+ state/data loss. The state/data loss usually depends on the\r
+ statefulness of the entities whose state is restored from the backup.\r
+\r
+The maximum duration of a VNF interruption (Service outage)\r
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\r
+\r
+Since not the entire process of a smooth upgrade will affect the VNFs,\r
+the duration of the VNF interruption may be shorter than the duration\r
+of the upgrade. In some cases, the VNF running without the control
+of the VIM is acceptable.
+\r
+.. <MT> Should we require explicitly that the NFVI should be able to
+   provide its services to the VNFs independent of the control plane?
+\r
+.. <MT> Requirement: The upgrade of the control plane must not cause\r
+ interruption of the NFVI services provided to the VNFs.\r
+\r
+.. <MT> With respect to carrier-grade the yearly service outage of the\r
+ VNF should not exceed 5' 15" regardless whether it is planned or\r
+ unplanned outage. Considering the HA requirements TL-9000 requires an\r
+ end-to-end service recovery time of 15 seconds based on which the ETSI\r
+ GS NFV-REL 001 V1.1.1 (2015-01) document defines three service\r
+ availability levels (SAL). The proposed example service recovery times\r
+ for these levels are:\r
+\r
+.. <MT> SAL1: 5-6 seconds\r
+\r
+.. <MT> SAL2: 10-15 seconds\r
+\r
+.. <MT> SAL3: 20-25 seconds\r
+\r
+.. <Pva> my comment was actually that the downtime metrics of the\r
+   underlying elements, components and services are a small fraction of the
+ total E2E service availability time. No-one on the E2E service path\r
+ will get the whole downtime allocation (in this context it includes\r
+ upgrade process related outages for the services provided by VIM etc.\r
+ elements that are subject to upgrade process).\r
+ \r
+.. <MT> So what you are saying is that the upgrade of any entity\r
+ (component, service) shouldn't cause even this much service\r
+ interruption. This was the reason I brought these figures here as well\r
+ that they are posing some kind of upper-upper boundary. Ideally the\r
+ interruption is in the millisecond range i.e. no more than a\r
+ switch-over or a live migration.\r
+ \r
+.. <MT> Requirement: Any interruption caused to the VNF by the upgrade\r
+ of the NFVI should be in the sub-second range.\r
+\r
+.. <MT> In the future we also need to consider the upgrade of the NFVI,
+   i.e. HW, firmware, hypervisors, host OS etc.
\ No newline at end of file