ESCALATOR-31 Adjusting documentation

diff --git a/docs/02-Background_and_Terminologies.rst b/docs/02-Background_and_Terminologies.rst

deleted file mode 100644

index 488968b..0000000
--- a/docs/02-Background_and_Terminologies.rst
+++ /dev/null
@@ -1,535 +0,0 @@
-General Requirements Background and Terminology\r
------------------------------------------------\r
-\r
-Terminologies and definitions\r
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\r
-\r
-NFVI\r
-  The term is an abbreviation for Network Function Virtualization\r
-  Infrastructure; sometimes it is also referred to as the data plane in this\r
-  document. The NFVI provides the virtual resources to the virtual\r
-  network functions under the control of the VIM.\r
-\r
-VIM\r
-  The term is an abbreviation for Virtual Infrastructure Manager;\r
-  sometimes it is also referred to as the control plane in this document.\r
-  The VIM controls and manages the NFVI compute, network and storage\r
-  resources to provide the required virtual resources to the VNFs.\r
-\r
-Operator\r
-  The term refers to network service providers and Virtual Network\r
-  Function (VNF) providers.\r
-\r
-End-User\r
-  The term refers to a subscriber of the Operator's services.\r
-\r
-Network Service\r
-  The term refers to a service provided by an Operator to its\r
-  end-users using a set of (virtualized) Network Functions.\r
-\r
-Infrastructure Services\r
-  The term refers to services provided by the NFV Infrastructure to the VNFs\r
-  as required by the Management & Orchestration functions and especially the VIM.\r
-  I.e. these are the virtual resources as perceived by the VNFs.\r
-\r
-Smooth Upgrade\r
-  The term refers to an upgrade that results in no service outage\r
-  for the end-users.\r
-\r
-Rolling Upgrade\r
-  The term refers to an upgrade strategy, which upgrades a node or a subset\r
-  of nodes at a time in a wave style rolling through the data centre. It\r
-  is a popular upgrade strategy to maintain service availability.\r
-\r
-Parallel Universe Upgrade\r
-  The term refers to an upgrade strategy, which creates and deploys\r
-  a new universe - a system with the new configuration - while the old\r
-  system continues running. The state of the old system is transferred\r
-  to the new system after sufficient testing of the new system.\r
-\r
-Infrastructure Resource Model\r
-  The term refers to the representation of infrastructure resources,\r
-  namely: the physical resources, the virtualization\r
-  facility resources and the virtual resources.\r
-\r
-Physical Resource\r
-  The term refers to a piece of hardware in the NFV infrastructure that may\r
-  also include firmware enabling this piece of hardware.\r
-\r
-Virtual Resource\r
-  The term refers to a resource, which is provided as services built on top\r
-  of the physical resources via the virtualization facilities; in particular,\r
-  virtual resources are the resources on which VNFs are deployed. Examples of\r
-  virtual resources are: VMs, virtual switches, virtual routers, virtual disks.\r
-\r
-Virtualization Facility\r
-  The term refers to a resource that enables the creation\r
-  of virtual environments on top of the physical resources, e.g.\r
-  hypervisor, OpenStack, etc.\r
-\r
-Upgrade Campaign\r
-  The term refers to a choreography that describes how the upgrade should\r
-  be performed in terms of its targets (i.e. upgrade objects), the\r
-  steps/actions required for upgrading each, and the coordination of these\r
-  steps so that service availability can be maintained. It is an input to an\r
-  upgrade tool (Escalator) to carry out the upgrade.\r
-\r
-Upgrade Duration\r
-  The duration of an upgrade characterized by the time elapsed between its\r
-  initiation and its completion. E.g. from the moment the execution of an\r
-  upgrade campaign has started until it has been committed. Depending on\r
-  the upgrade strategy, the state of the configuration and the upgrade target,\r
-  some parts of the system may be in a more vulnerable state with respect to\r
-  service availability.\r
-\r
-Outage\r
-  The period of time during which a given service is not provided is referred\r
-  to as the outage of that given service. If a subsystem or the entire system\r
-  does not provide any service, it is the outage of the given subsystem or the\r
-  system. Smooth upgrade means upgrade with no outage for the user plane, i.e.\r
-  no VNF should experience service outage.\r
-\r
-Rollback\r
-  The term refers to a failure handling strategy that reverts the changes\r
-  done by a potentially failed upgrade execution one by one in a reverse order.\r
-  I.e. it is like undoing the changes done by the upgrade.\r
-\r
-Backup\r
-  The term refers to data persisted to a storage, so that it can be used to\r
-  restore the system or a given part of it to the state it was in when the\r
-  backup was created assuming a cold restart. Changes made to the system from\r
-  the moment the backup was created till the moment it is used to restore the\r
-  (sub)system are lost in the restoration process.\r
-\r
-Restore\r
-  The term refers to a failure handling strategy that reverts the changes\r
-  done, for example, by an upgrade by restoring the system from some backup\r
-  data. This results in the loss of any change and data persisted after the\r
-  backup was taken. To recover those, additional measures need to be taken\r
-  if necessary (e.g. rollforward).\r
-\r
-Rollforward\r
-  The term refers to a failure handling strategy applied after a restore\r
-  (from a backup) operation to recover any loss of data persisted between\r
-  the time the backup was taken and the moment it is restored. Rollforward\r
-  requires that data that needs to survive the restore operation is logged at\r
-  a location not impacted by the restore so that it can be re-applied to the\r
-  system after its restoration from the backup.\r
-\r
-Downgrade\r
-  The term refers to an upgrade in which an earlier version of the software\r
-  is restored through the upgrade procedure. A system can be downgraded to any\r
-  earlier version and the compatibility of the versions will determine the\r
-  applicable upgrade strategies and whether service outage can be avoided.\r
-  In particular any data conversion needs special attention.\r
-\r
-\r
-\r
-Upgrade Objects\r
-~~~~~~~~~~~~~~~\r
-\r
-Physical Resource\r
-^^^^^^^^^^^^^^^^^\r
-\r
-Most cloud infrastructures support the dynamic addition and removal of\r
-hardware. Accordingly a hardware upgrade could be done by adding the new\r
-piece of hardware and removing the old one. From the perspective of smooth\r
-upgrade the orchestration/scheduling of these actions is the primary concern.\r
-\r
-Upgrading a physical resource may also involve the upgrade of its firmware\r
-and/or modifying its configuration data. This may require the restart of the\r
-hardware.\r
-\r
-\r
-\r
-Virtual Resources\r
-^^^^^^^^^^^^^^^^^\r
-\r
-Addition and removal of virtual resources may be initiated by the users or be\r
-a result of an elasticity action. Users may also request the upgrade of their\r
-virtual resources using a new VM image.\r
-\r
-.. Needs to be moved to requirement section: Escalator should facilitate such an\r
-   option and allow for a smooth upgrade.\r
-\r
-On the other hand changes in the infrastructure, namely, in the hardware and/or\r
-the virtualization facility resources may result in the upgrade of the virtual\r
-resources. For example, if for some reason the hypervisor is changed and\r
-the current VMs cannot be migrated to the new hypervisor - they are\r
-incompatible - then the VMs need to be upgraded too. This is not\r
-something the NFVI user (i.e. the VNFs) would know about.\r
-\r
-\r
-Virtualization Facility Resources\r
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\r
-\r
-Based on the functionality they provide, virtualization facility\r
-resources could be divided into computing node, networking node,\r
-storage node and management node.\r
-\r
-The possible upgrade objects in these nodes are considered below:\r
-(Note: hardware based virtualization may be considered as virtualization\r
-facility resource, but from the Escalator perspective, it is better to\r
-consider it as part of the hardware upgrade.)\r
-\r
-**Computing node**\r
-\r
-1. OS Kernel\r
-\r
-2. Hypervisor and virtual switch\r
-\r
-3. Other kernel modules, like drivers\r
-\r
-4. User space software packages, like nova-compute agents and other\r
-   control plane programs.\r
-\r
-Updating 1 and 2 will cause the loss of virtualization functionality of\r
-the compute node, which may lead to the interruption of data plane services\r
-if the virtual resource is not redundant.\r
-\r
-Updating 3 might have the same result.\r
-\r
-Updating 4 might lead to control plane services interruption if not an\r
-HA deployment.\r
-\r
-.. <MT> I'm not sure why would 4 cause control plane interruption on a\r
-   compute node. My understanding is that simply the node cannot be managed.\r
-   Redundancy won't help in that either.\r
-\r
-\r
-**Networking node**\r
-\r
-1. OS kernel, optional, not all switches/routers allow the upgrade of their\r
-   OS since it is more like a firmware than a generic OS.\r
-\r
-2. User space software package, like neutron agents and other control\r
-   plane programs\r
-\r
-Updating 1 if allowed will cause a node reboot and therefore lead to\r
-data plane service interruption if the virtual resource is not\r
-redundant.\r
-\r
-Updating 2 might lead to control plane services interruption if not an\r
-HA deployment.\r
-\r
-**Storage node**\r
-\r
-1. OS kernel, optional, not all storage nodes allow the upgrade of their OS\r
-   since it is more like a firmware than a generic OS.\r
-\r
-2. Kernel modules\r
-\r
-3. User space software packages, control plane programs\r
-\r
-Updating 1 if allowed will cause a node reboot and therefore lead to\r
-data plane services interruption if the virtual resource is not\r
-redundant.\r
-\r
-Updating 2 might result in the same.\r
-\r
-Updating 3 might lead to control plane services interruption if not an\r
-HA deployment.\r
-\r
-**Management node**\r
-\r
-1. OS Kernel\r
-\r
-2. Kernel modules, like drivers\r
-\r
-3. User space software packages, like database, message queue and\r
-   control plane programs.\r
-\r
-Updating 1 will cause a node reboot and therefore lead to control\r
-plane services interruption if not an HA deployment. Updating 2 might\r
-result in the same.\r
-\r
-Updating 3 might lead to control plane services interruption if not an\r
-HA deployment.\r
-\r
-\r
-\r
-\r
-\r
-Upgrade Granularity\r
-~~~~~~~~~~~~~~~~~~~\r
-\r
-The granularity of an upgrade can be characterized from two perspectives:\r
-\r
-- the physical dimension and\r
-- the software dimension\r
-\r
-Physical Dimension\r
-^^^^^^^^^^^^^^^^^^\r
-\r
-The physical dimension characterizes the number of similar upgrade objects\r
-targeted by the upgrade, i.e. whether it is full / partial upgrade of a\r
-data centre, cluster, zone.\r
-The upgrade of a data centre or a zone may need to be divided into\r
-several batches. Thus there is a need for efficiency in the execution of\r
-upgrades of a potentially huge number of upgrade objects while still maintaining\r
-availability to fulfill the requirement of smooth upgrade.\r
-\r
-The upgrade of a cloud environment (cluster) may also\r
-be partial. For example, in one cloud environment running a number of\r
-VNFs, we may just try to upgrade one of them to check the stability and\r
-performance, before we upgrade all of them.\r
-Thus there is a need for proper organization of the artifacts associated with\r
-the different upgrade objects. Also the different versions should be able\r
-to coexist beyond the upgrade period.\r
-\r
-From this perspective special attention may be needed when upgrading\r
-objects that are collaborating in a redundancy schema as in this case\r
-different versions not only need to coexist but also collaborate. This\r
-primarily puts requirements on the upgrade objects. If this is not possible,\r
-the upgrade campaign should be designed in such a way that the proper\r
-isolation is ensured.\r
-\r
-Software Dimension\r
-^^^^^^^^^^^^^^^^^^\r
-\r
-The software dimension of the upgrade characterizes the upgrade object\r
-type targeted and the combination in which they are upgraded together.\r
-\r
-Even though the upgrade may\r
-initially target only one type of upgrade object, e.g. the hypervisor,\r
-the dependency of other upgrade objects on this initial target object may\r
-require their upgrade as well. I.e. the upgrades need to be combined. From this\r
-perspective the main concern is compatibility of the dependent and\r
-sponsor objects. To take these dependencies into consideration,\r
-they need to be described together with the version compatibility information.\r
-Breaking dependencies is the major cause of outages during upgrades.\r
-\r
-In other cases it is more efficient to upgrade a combination of upgrade\r
-objects than to do it one by one. One aspect of the combination is how\r
-the upgrade packages can be combined, whether a new image can be created for\r
-them beforehand or the different packages can be installed during the upgrade\r
-independently, but activated together.\r
-\r
-The combination of upgrade objects may span across\r
-layers (e.g. software stack in the host and the VM of the VNF).\r
-Thus, it may require additional coordination between the management layers.\r
-\r
-With respect to each upgrade object type and even stacks we can\r
-distinguish major and minor upgrades:\r
-\r
-**Major Upgrade**\r
-\r
-Upgrades between major releases may introduce significant changes in\r
-function, configuration and data, such as the upgrade of OPNFV from\r
-Arno to Brahmaputra.\r
-\r
-**Minor Upgrade**\r
-\r
-Upgrades within one major release, which do not change\r
-the structure of the platform and may not affect the schema of the\r
-system data.\r
-\r
-Scope of Impact\r
-~~~~~~~~~~~~~~~\r
-\r
-Considering availability and therefore smooth upgrade, one of the major\r
-concerns is the predictability and control of the outcome of the different\r
-upgrade operations. Ideally an upgrade can be performed without impacting any\r
-entity in the system, which means none of the operations change or potentially\r
-change the behaviour of any entity in the system in an uncontrolled manner.\r
-Accordingly the operations of such an upgrade can be performed any time while\r
-the system is running, while all the entities are online. No entity needs to be\r
-taken offline to avoid such adverse effects. Hence such upgrade operations\r
-are referred to as online operations. The effects of the upgrade might be\r
-activated the next time the upgraded entity is used, or may require a special activation\r
-action such as a restart. Note that the activation action provides more control and predictability.\r
-\r
-If an entity's behavior in the system may change due to the upgrade it may\r
-be better to take it offline for the time of the relevant upgrade operations.\r
-The main question, however, is which hosted entities are impacted, considering\r
-the hosting relation of an upgrade object. Accordingly we can identify a scope\r
-which is impacted by taking the given upgrade object offline. The entities\r
-that are in the scope of impact may need to be taken offline or moved out of\r
-this scope i.e. migrated.\r
-\r
-If the impacted entity is in a different layer managed by another manager,\r
-this may require coordination, because the infrastructure resources taken\r
-out of service for the time of their upgrade support virtual\r
-resources used by VNFs that should not experience outages. The hosted VNFs\r
-may or may not allow for the hot migration of their VMs. In case of migration\r
-the VMs' placement policy should be considered.\r
-\r
-\r
-\r
-Upgrade duration\r
-~~~~~~~~~~~~~~~~\r
-\r
-As the OPNFV end-users are primarily Telecom operators, the network\r
-services provided by the VNFs deployed on the NFVI should meet the\r
-requirement of 'Carrier Grade'::\r
-\r
-  In telecommunication, a "carrier grade" or "carrier class" refers to a\r
-  system, or a hardware or software component that is extremely reliable,\r
-  well tested and proven in its capabilities. Carrier grade systems are\r
-  tested and engineered to meet or exceed "five nines" high availability\r
-  standards, and provide very fast fault recovery through redundancy\r
-  (normally less than 50 milliseconds). [from wikipedia.org]\r
-\r
-"Five nines" (99.999% availability) means at most about 5'15" of downtime per year.\r
-\r
-::\r
-\r
-  We have learnt that a well prepared upgrade of OpenStack needs 10\r
-  minutes. The major time slot in the outage time is spent on\r
-  synchronizing the database. [from ' Ten minutes OpenStack Upgrade? Done!\r
-  ' by Symantec]\r
-\r
-This 10 minutes of downtime of the OpenStack services however did not impact the\r
-users, i.e. the VMs running on the compute nodes. This was the outage of\r
-the control plane only. On the other hand with respect to the\r
-preparations this was a manually tailored upgrade specific to the\r
-particular deployment and the versions of each OpenStack service.\r
-\r
-The project aims at a more generic methodology, which however\r
-requires that the upgrade objects fulfil certain requirements. Since\r
-this is only possible in the long run, we first target the upgrade\r
-of the different VIM services from version to version.\r
-\r
-**Questions:**\r
-\r
-1. Can we manage to upgrade OPNFV in only 5 minutes?\r
- \r
-.. <MT> The first question is whether we have the same carrier grade\r
-   requirement on the control plane as on the user plane. I.e. how\r
-   much control plane outage we can/are willing to tolerate?\r
-   In the above case probably if the database is only half of the size\r
-   we can do the upgrade in 5 minutes, but is that good? It also means\r
-   that if the database is twice as large then the outage is 20\r
-   minutes.\r
-   For the user plane we should go for less as with two releases yearly\r
-   that means 10 minutes outage per year.\r
-\r
-.. <Malla> 10 minutes outage per year to the users? Plus, if we take\r
-   control plane into the consideration, then total outage will be\r
-   more than 10 minutes in the whole network, right?\r
-\r
-.. <MT> The control plane outage does not have to cause outage to\r
-   the users, but it may of course depending on the size of the system\r
-   as it's more likely that there's a failure that needs to be handled\r
-   by the control plane.\r
-\r
-2. Is it acceptable for end users? For example, a planned service\r
-   interruption lasting more than ten minutes for a software\r
-   upgrade.\r
-\r
-.. <MT> For user plane, no it's not acceptable in case of\r
-   carrier-grade. The 5' 15" downtime should include unplanned and\r
-   planned downtimes.\r
-   \r
-.. <Malla> I agree with Maria, it is not acceptable.\r
-\r
-3. Will any VNFs still work well when the VIM is down?\r
-\r
-.. <MT> In case of OpenStack it seems yes. .:)\r
-\r
-The maximum duration of an upgrade\r
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\r
-\r
-The duration of an upgrade is related and proportional to the\r
-scale and the complexity of the OPNFV platform as well as the\r
-granularity (in function and in space) of the upgrade.\r
-\r
-.. <Malla> Also, if it is a partial upgrade like a module upgrade, it depends\r
-  also on the OPNFV modules and their tightly connected entities as well.\r
-\r
-.. <MT> Since the maintenance window is shrinking and becoming non-existent\r
-  the duration of the upgrade is secondary to the requirement of smooth upgrade.\r
-  But probably we want to be able to put a time constraint on each upgrade\r
-  during which it must complete otherwise it is considered failed and the system\r
-  should be rolled back. I.e. in case of automatic execution it might not be clear\r
-  if an upgrade is long or just hanging. The time constraints may be a function\r
-  of the size of the system in terms of the upgrade object(s).\r
-\r
-The maximum duration of a roll back when an upgrade has failed\r
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\r
-\r
-The duration of a roll back is shorter than that of the corresponding upgrade. It\r
-depends on the time needed to restore the software and configuration data from\r
-the pre-upgrade backup / snapshot.\r
-\r
-.. <MT> During the upgrade process two types of failure may happen:\r
-  In case we can recover from the failure by undoing the upgrade\r
-  actions it is possible to roll back the already executed part of the\r
-  upgrade in a graceful manner introducing no more service outage than\r
-  what was introduced during the upgrade. Such a graceful roll back\r
-  requires typically the same amount of time as the executed portion of\r
-  the upgrade and imposes minimal state/data loss.\r
-  \r
-.. <MT> Requirement: It should be possible to roll back gracefully the\r
-  failed upgrade of stateful services of the control plane.\r
-  In case we cannot recover from the failure by just undoing the\r
-  upgrade actions, we have to restore the upgraded entities from their\r
-  backed up state. In other terms the system falls back to an earlier\r
-  state, which is typically a faster recovery procedure than graceful\r
-  roll back and depending on the statefulness of the entities involved it\r
-  may result in significant state/data loss.\r
-  \r
-.. <MT> Two possible types of failures can happen during an upgrade\r
-\r
-.. <MT> We can recover from the failure that occurred in the upgrade process:\r
-  In this case, a graceful rolling back of the executed part of the\r
-  upgrade may be possible which would "undo" the executed part in a\r
-  similar fashion. Thus, such a roll back introduces no more service\r
-  outage during an upgrade than the executed part introduced. This\r
-  process typically requires the same amount of time as the executed\r
-  portion of the upgrade and imposes minimal state/data loss.\r
-\r
-.. <MT> We cannot recover from the failure that occurred in the upgrade\r
-   process: In this case, the system needs to fall back to an earlier\r
-   consistent state by reloading this backed-up state. This is typically\r
-   a faster recovery procedure than the graceful roll back, but can cause\r
-   state/data loss. The state/data loss usually depends on the\r
-   statefulness of the entities whose state is restored from the backup.\r
-\r
-The maximum duration of a VNF interruption (Service outage)\r
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\r
-\r
-Since not the entire process of a smooth upgrade will affect the VNFs,\r
-the duration of the VNF interruption may be shorter than the duration\r
-of the upgrade. In some cases, a VNF running without control\r
-from the VIM is acceptable.\r
-\r
-.. <MT> Should we require explicitly that the NFVI should be able to\r
-  provide its services to the VNFs independent of the control plane?\r
-\r
-.. <MT> Requirement: The upgrade of the control plane must not cause\r
-  interruption of the NFVI services provided to the VNFs.\r
-\r
-.. <MT> With respect to carrier-grade the yearly service outage of the\r
-  VNF should not exceed 5' 15" regardless of whether it is planned or\r
-  unplanned outage. Considering the HA requirements TL-9000 requires an\r
-  end-to-end service recovery time of 15 seconds based on which the ETSI\r
-  GS NFV-REL 001 V1.1.1 (2015-01) document defines three service\r
-  availability levels (SAL). The proposed example service recovery times\r
-  for these levels are:\r
-\r
-.. <MT> SAL1: 5-6 seconds\r
-\r
-.. <MT> SAL2: 10-15 seconds\r
-\r
-.. <MT> SAL3: 20-25 seconds\r
-\r
-.. <Pva> my comment was actually that the downtime metrics of the\r
-  underlying elements, components and services are small fraction of the\r
-  total E2E service availability time. No-one on the E2E service path\r
-  will get the whole downtime allocation (in this context it includes\r
-  upgrade process related outages for the services provided by VIM etc.\r
-  elements that are subject to upgrade process).\r
-  \r
-.. <MT> So what you are saying is that the upgrade of any entity\r
-  (component, service) shouldn't cause even this much service\r
-  interruption. This was the reason I brought these figures here as well\r
-  that they are posing some kind of upper-upper boundary. Ideally the\r
-  interruption is in the millisecond range i.e. no more than a\r
-  switch-over or a live migration.\r
-  \r
-.. <MT> Requirement: Any interruption caused to the VNF by the upgrade\r
-  of the NFVI should be in the sub-second range.\r
-\r
-.. <MT> In the future we also need to consider the upgrade of the NFVI,\r
-  i.e. HW, firmware, hypervisors, host OS etc.
-\ No newline at end of file