+General Requirements Background and Terminology\r
+-----------------------------------------------\r
+\r
+Terminology and definitions
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+\r
+NFVI
+ The term is an abbreviation for Network Function Virtualization
+ Infrastructure; sometimes it is also referred to as the data plane in this
+ document.
+\r
+VIM
+ The term is an abbreviation for Virtual Infrastructure Management;
+ sometimes it is also referred to as the control plane in this document.
+ \r
+Operator\r
+ The term refers to network service providers and Virtual Network\r
+ Function (VNF) providers.\r
+\r
+End-User\r
+ The term refers to a subscriber of the Operator's services.\r
+\r
+Network Service
+ The term refers to a service provided by an Operator to its
+ End-Users using a set of (virtualized) Network Functions.
+\r
+Infrastructure Services
+ The term refers to services provided by the NFV Infrastructure and the
+ Management & Orchestration functions to the VNFs, i.e.
+ these are the virtual resources as perceived by the VNFs.
+\r
+Smooth Upgrade\r
+ The term refers to an upgrade that results in no service outage \r
+ for the end-users.\r
+\r
+Rolling Upgrade
+ The term refers to an upgrade strategy that upgrades each node or
+ a subset of nodes in waves rolling through the data centre. It
+ is a popular upgrade strategy for maintaining service availability.
+\r
+Parallel Universe\r
+ The term refers to an upgrade strategy that creates and deploys\r
+ a new universe - a system with the new configuration - while the old\r
+ system continues running. The state of the old system is transferred\r
+ to the new system after sufficient testing of the new system.\r
+\r
+Infrastructure Resource Model\r
+ The term refers to the representation of infrastructure resources,\r
+ namely: the physical resources, the virtualization\r
+ facility resources and the virtual resources.\r
+\r
+Physical Resource
+ The term refers to a piece of hardware of the NFV infrastructure, which may
+ also include the firmware which enables the hardware.
+\r
+Virtual Resource
+ The term refers to a resource provided as a service built on top
+ of the physical resources via the virtualization facilities; in particular,
+ these are the resources on which VNF entities are deployed, e.g.
+ the VMs, virtual switches, virtual routers, virtual disks etc.
+\r
+.. <MT> I don't think the VNF is the virtual resource. Virtual\r
+ resources are the VMs, virtual switches, virtual routers, virtual\r
+ disks etc. The VNF uses them, but I don't think they are equal. The\r
+ VIM doesn't manage the VNF, but it does manage virtual resources.\r
+ \r
+Virtualization Facility
+ The term refers to a resource that enables the creation
+ of virtual environments on top of the physical resources, e.g.
+ hypervisor, OpenStack, etc.
+\r
+Upgrade Plan (or Campaign?)
+ The term refers to a choreography that describes how the upgrade should
+ be performed in terms of its targets (i.e. upgrade objects), the
+ steps/actions required to upgrade each, and the coordination of these
+ steps so that service availability can be maintained. It is an input to an
+ upgrade tool (Escalator) to carry out the upgrade (an illustrative sketch
+ is given below).
+\r
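+The exact format of an upgrade plan is still to be defined. As a purely
+illustrative sketch (the field names and values below are assumptions, not
+an agreed Escalator input format), a plan could be captured as a simple
+data structure listing the upgrade targets, the strategy and the ordered
+steps::
+
+    # Hypothetical example only: field names and values are illustrative
+    # assumptions, not a defined Escalator interface.
+    upgrade_plan = {
+        "strategy": "rolling",                # or "parallel_universe"
+        "targets": [                          # the upgrade objects
+            {"type": "controller", "nodes": ["ctrl-1", "ctrl-2", "ctrl-3"]},
+            {"type": "compute", "nodes": ["cmp-1", "cmp-2", "cmp-3"]},
+        ],
+        "steps": [
+            {"action": "backup", "scope": "controller"},
+            {"action": "upgrade", "scope": "controller", "batch_size": 1},
+            {"action": "upgrade", "scope": "compute", "batch_size": 2},
+            {"action": "verify", "scope": "all"},
+        ],
+        # constraint used to decide between continuing and rolling back
+        "max_duration_minutes": 60,
+    }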
+\r
+Upgrade Objects\r
+~~~~~~~~~~~~~~~\r
+\r
+Physical Resource\r
+^^^^^^^^^^^^^^^^^\r
+\r
+Most cloud infrastructures support dynamic addition/removal of
+hardware. A hardware upgrade could be done by adding the new
+hardware node and removing the old one. From the perspective of smooth
+upgrade, the orchestration/scheduling of these actions is the primary concern.
+Upgrading a physical resource in place, such as upgrading its firmware
+and/or modifying its configuration data, may also be considered in the
+future.
+\r
+\r
+Virtual Resources\r
+^^^^^^^^^^^^^^^^^\r
+\r
+Virtual resource upgrades are mainly done by users. OPNFV may facilitate
+the activity, but it is suggested to address it in the long-term roadmap
+rather than in the initial release.
+\r
+.. <MT> same comment here: I don't think the VNF is the virtual\r
+ resource. Virtual resources are the VMs, virtual switches, virtual\r
+ routers, virtual disks etc. The VNF uses them, but I don't think they\r
+ are equal. For example if by some reason the hypervisor is changed and\r
+ the current VMs cannot be migrated to the new hypervisor, they are\r
+ incompatible, then the VMs need to be upgraded too. This is not\r
+ something the NFVI user (i.e. VNFs ) would even know about.\r
+\r
+\r
+Virtualization Facility Resources\r
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\r
+\r
+Based on the functionality they provide, virtualization facility
+resources could be divided into computing nodes, networking nodes,
+storage nodes and management nodes.
+
+The possible upgrade objects in these nodes are addressed below.
+(Note: hardware-based virtualization may be considered a virtualization
+facility resource, but from the escalator perspective, it is better to
+consider it as part of the hardware upgrade.)
+\r
+**Computing node**\r
+\r
+1. OS Kernel\r
+\r
+2. Hypervisor and virtual switch
+\r
+3. Other kernel modules, like drivers
+\r
+4. User space software packages, like nova-compute agents and other\r
+ control plane programs.\r
+\r
+Updating 1 and 2 will cause the loss of the virtualization functionality of
+the compute node, which may lead to data plane services interruption
+if the virtual resource is not redundant.
+
+Updating 3 might result in the same.
+
+Updating 4 might lead to control plane services interruption if not an
+HA deployment.
+\r
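+To avoid data plane interruption when items 1-3 above are upgraded, the VMs
+hosted on the node typically have to be moved away before the node is
+touched. The following sketch outlines such a rolling procedure; all helper
+functions are hypothetical placeholders for the corresponding VIM
+operations, not an existing API::
+
+    # Illustrative sketch of a rolling compute node upgrade.
+    # All helpers are hypothetical placeholders for VIM operations.
+    def rolling_compute_upgrade(compute_nodes):
+        for node in compute_nodes:
+            disable_scheduling(node)     # stop placing new VMs on the node
+            live_migrate_all_vms(node)   # evacuate VMs to keep the data plane up
+            upgrade_packages(node)       # kernel, hypervisor, virtual switch, agents
+            reboot_if_needed(node)
+            verify_node_health(node)     # abort and roll back the campaign on failure
+            enable_scheduling(node)      # node rejoins the pool before the next one
+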
+**Networking node**\r
+\r
+1. OS kernel, optional, as not all switches/routers allow upgrading their
+   OS since it is more like firmware than a generic OS.
+\r
+2. User space software package, like neutron agents and other control\r
+ plane programs\r
+\r
+Updating 1, if allowed, will cause a node reboot and therefore lead to
+data plane service interruption if the virtual resource is not
+redundant.
+\r
+Updating 2 might lead to control plane services interruption if not an\r
+HA deployment.\r
+\r
+**Storage node**\r
+\r
+1. OS kernel, optional, as not all storage nodes allow upgrading their OS
+   since it is more like firmware than a generic OS.
+\r
+2. Kernel modules\r
+\r
+3. User space software packages, control plane programs\r
+\r
+Updating 1, if allowed, will cause a node reboot and therefore lead to
+data plane services interruption if the virtual resource is not
+redundant.
+
+Updating 2 might result in the same.
+\r
+Updating 3 might lead to control plane services interruption if not an\r
+HA deployment.\r
+\r
+**Management node**\r
+\r
+1. OS Kernel\r
+\r
+2. Kernel modules, like drivers
+\r
+3. User space software packages, like database, message queue and\r
+ control plane programs.\r
+\r
+Updating 1 will cause a node reboot and therefore lead to control
+plane services interruption if not an HA deployment. Updating 2 might
+result in the same.
+\r
+Updating 3 might lead to control plane services interruption if not an\r
+HA deployment.\r
+\r
+Upgrade Span\r
+~~~~~~~~~~~~\r
+\r
+**Major Upgrade**\r
+\r
+Upgrades between major releases may introduce significant changes in
+function, configuration and data, such as the upgrade of OPNFV from
+Arno to Brahmaputra.
+\r
+**Minor Upgrade**\r
+\r
+Upgrades within one major release, which would not lead to changes in
+the structure of the platform and should not affect the schema of the
+system data.
+\r
+Upgrade Granularity\r
+~~~~~~~~~~~~~~~~~~~\r
+\r
+Physical/Hardware Dimension\r
+^^^^^^^^^^^^^^^^^^^^^^^^^^^\r
+\r
+Full or partial upgrades of a data centre, cluster, or zone should be
+supported. The upgrade of a data centre or a zone may be divided into
+several batches. The upgrade of a cloud environment (cluster) may also
+be partial. For example, in one cloud environment running a number of
+VNFs, we may upgrade just one of them to check the stability and
+performance before we upgrade all of them.
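+
+As an illustration of such a batched, partial execution, the sketch below
+groups the nodes of a zone into batches that are upgraded and verified one
+after another (the batch size, node names and helper functions are
+assumptions chosen for the example)::
+
+    # Illustrative only: split the nodes of a zone into small batches so the
+    # zone is upgraded gradually rather than in one shot.
+    def make_batches(nodes, batch_size=2):
+        return [nodes[i:i + batch_size] for i in range(0, len(nodes), batch_size)]
+
+    zone_nodes = ["node-1", "node-2", "node-3", "node-4", "node-5"]
+    for batch in make_batches(zone_nodes):
+        upgrade_batch(batch)     # hypothetical helper: upgrade one batch of nodes
+        verify_batch(batch)      # check stability and performance before continuing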
+\r
+Software Dimension\r
+^^^^^^^^^^^^^^^^^^\r
+\r
+- The upgrade of host OS or kernel may need a 'hot migration'\r
+- The upgrade of OpenStack’s components\r
+\r
+  i. the one-shot upgrade of all components
+
+  ii. the partial upgrade (or bugfix patch) which only affects some
+      components (e.g., computing, storage, network, database, message
+      queue, etc.)
+\r
+.. <MT> this section seems to overlap with 2.1.\r
+ I can see the following dimensions for the software.\r
+\r
+.. <MT> different software packages\r
+\r
+.. <MT> different functions - Considering that the target versions of all\r
+ software are compatible the upgrade needs to ensure that any\r
+ dependencies between SW and therefore packages are taken into account\r
+ in the upgrade plan, i.e. no version mismatch occurs during the\r
+ upgrade therefore dependencies are not broken\r
+ \r
+.. <MT> same function - This is an upgrade specific question if different\r
+ versions can coexist in the system when a SW is being upgraded from\r
+ one version to another. This is particularly important for stateful\r
+ functions e.g. storage, networking, control services. The upgrade\r
+ method must consider the compatibility of the redundant entities.\r
+\r
+.. <MT> different versions of the same software package\r
+\r
+.. <MT> major version changes - they may introduce incompatibilities. Even\r
+ when there are backward compatibility requirements changes may cause\r
+ issues at graceful roll-back\r
+ \r
+.. <MT> minor version changes - they must not introduce incompatibility\r
+ between versions, these should be primarily bug fixes, so live\r
+ patches should be possible\r
+ \r
+.. <MT> different installations of the same software package\r
+\r
+.. <MT> using different installation options - they may reflect different\r
+ users with different needs so redundancy issues are less likely\r
+ between installations of different options; but they could be the\r
+ reflection of the heterogeneous system in which case they may provide\r
+ redundancy for higher availability, i.e. deeper inspection is needed\r
+ \r
+.. <MT> using the same installation options - they often reflect that they
+   are used by redundant entities across space
+ \r
+.. <MT> different distribution possibilities in space - same or different\r
+ availability zones, multi-site, geo-redundancy\r
+ \r
+.. <MT> different entities running from the same installation of a software\r
+ package\r
+ \r
+.. <MT> using different start-up options - they may reflect different users so
+   redundancy may not be an issue between them
+ \r
+.. <MT> using same start-up options - they often reflect redundant\r
+ entities\r
+\r
+Upgrade duration\r
+~~~~~~~~~~~~~~~~\r
+\r
+As the OPNFV end-users are primarily Telecom operators, the network\r
+services provided by the VNFs deployed on the NFVI should meet the\r
+requirement of 'Carrier Grade'. ::
+\r
+  In telecommunication, a "carrier grade" or "carrier class" refers to a
+  system, or a hardware or software component that is extremely reliable,
+  well tested and proven in its capabilities. Carrier grade systems are
+  tested and engineered to meet or exceed "five nines" high availability
+  standards, and provide very fast fault recovery through redundancy
+  (normally less than 50 milliseconds). [from wikipedia.org]
+\r
+"Five nines" means the system works all the time in one year except for at
+most 5 minutes and 15 seconds of downtime.
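+
+The figure follows directly from the definition of 99.999% availability
+over one year, as the short calculation below shows::
+
+    minutes_per_year = 365 * 24 * 60                     # 525600 minutes
+    allowed_downtime = minutes_per_year * (1 - 0.99999)  # in minutes
+    print(allowed_downtime)                              # ~5.26 minutes, i.e. roughly 5'15"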
+\r
+::\r
+\r
+  We have learnt that a well prepared upgrade of OpenStack needs 10
+  minutes. The major part of the outage time is spent on
+  synchronizing the database. [from 'Ten minutes OpenStack Upgrade? Done!'
+  by Symantec]
+\r
+This 10 minutes of downtime of the OpenStack services, however, did not
+impact the users, i.e. the VMs running on the compute nodes. This was an
+outage of the control plane only. On the other hand, with respect to the
+preparations, this was a manually tailored upgrade specific to the
+particular deployment and the versions of each OpenStack service.
+
+The project targets a more generic methodology, which however
+requires that the upgrade objects fulfil certain requirements. Since
+this is only possible in the long run, we first target the upgrade
+of the different VIM services from version to version.
+\r
+**Questions:**\r
+\r
+1. Can we manage to upgrade OPNFV in only 5 minutes?\r
+ \r
+.. <MT> The first question is whether we have the same carrier grade\r
+ requirement on the control plane as on the user plane. I.e. how\r
+ much control plane outage we can/willing to tolerate?\r
+ In the above case probably if the database is only half of the size\r
+ we can do the upgrade in 5 minutes, but is that good? It also means\r
+ that if the database is twice as much then the outage is 20\r
+ minutes.\r
+ For the user plane we should go for less as with two release yearly\r
+ that means 10 minutes outage per year.\r
+\r
+.. <Malla> 10 minutes outage per year to the users? Plus, if we take\r
+ control plane into the consideration, then total outage will be\r
+ more than 10 minute in whole network, right?\r
+\r
+.. <MT> The control plane outage does not have to cause outage to\r
+ the users, but it may of course depending on the size of the system\r
+ as it's more likely that there's a failure that needs to be handled\r
+ by the control plane.\r
+\r
+2. Is it acceptable for end users? For example, a planned service
+   interruption lasting more than ten minutes for a software
+   upgrade.
+\r
+.. <MT> For user plane, no it's not acceptable in case of\r
+ carrier-grade. The 5' 15" downtime should include unplanned and\r
+ planned downtimes.\r
+ \r
+.. <Malla> I do agree with Maria, it is not acceptable.
+\r
+3. Will the VNFs still work well when the VIM is down?
+\r
+.. <MT> In case of OpenStack it seems yes. :)
+\r
+The maximum duration of an upgrade\r
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\r
+\r
+The duration of an upgrade is related to and proportional with the\r
+scale and the complexity of the OPNFV platform as well as the\r
+granularity (in function and in space) of the upgrade.\r
+\r
+.. <Malla> Also, if it is a partial upgrade, like a module upgrade, it
+   depends also on the OPNFV modules and the entities tightly connected to them.
+\r
+.. <MT> Since the maintenance window is shrinking and becoming non-existent\r
+ the duration of the upgrade is secondary to the requirement of smooth upgrade.\r
+ But probably we want to be able to put a time constraint on each upgrade\r
+ during which it must complete otherwise it is considered failed and the system\r
+ should be rolled back. I.e. in case of automatic execution it might not be clear\r
+ if an upgrade is long or just hanging. The time constraints may be a function\r
+ of the size of the system in terms of the upgrade object(s).\r
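+
+If a time constraint is put on an upgrade, e.g. to decide when an
+automatically executed upgrade should be considered failed (rather than
+merely slow) and rolled back, one way to derive and enforce it is sketched
+below. The per-object budget and the helper functions are assumptions used
+for illustration only::
+
+    import time
+
+    # Illustrative only: the time budget grows with the number of upgrade objects.
+    def run_with_time_budget(upgrade_objects, per_object_minutes=5):
+        budget = len(upgrade_objects) * per_object_minutes * 60   # budget in seconds
+        start = time.time()
+        for obj in upgrade_objects:
+            upgrade_object(obj)             # hypothetical helper: upgrade one object
+            if time.time() - start > budget:
+                roll_back(upgrade_objects)  # hypothetical helper: graceful roll back
+                raise RuntimeError("upgrade exceeded its time budget")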
+\r
+The maximum duration of a roll back when an upgrade fails
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+\r
+The duration of a roll back is shorter than that of the corresponding
+upgrade. It depends on the time needed to restore the software and
+configuration data from the pre-upgrade backup / snapshot.
+\r
+.. <MT> During the upgrade process two types of failure may happen:\r
+ In case we can recover from the failure by undoing the upgrade\r
+ actions it is possible to roll back the already executed part of the\r
+ upgrade in graceful manner introducing no more service outage than\r
+ what was introduced during the upgrade. Such a graceful roll back\r
+ requires typically the same amount of time as the executed portion of\r
+ the upgrade and impose minimal state/data loss.\r
+ \r
+.. <MT> Requirement: It should be possible to roll back gracefully the\r
+ failed upgrade of stateful services of the control plane.\r
+ In case we cannot recover from the failure by just undoing the\r
+ upgrade actions, we have to restore the upgraded entities from their\r
+ backed up state. In other terms the system falls back to an earlier\r
+ state, which is typically a faster recovery procedure than graceful\r
+ roll back and depending on the statefulness of the entities involved it\r
+ may result in significant state/data loss.\r
+ \r
+.. <MT> Two possible types of failures can happen during an upgrade\r
+\r
+.. <MT> We can recover from the failure that occurred in the upgrade process:\r
+ In this case, a graceful rolling back of the executed part of the\r
+ upgrade may be possible which would "undo" the executed part in a\r
+ similar fashion. Thus, such a roll back introduces no more service\r
+ outage during an upgrade than the executed part introduced. This\r
+ process typically requires the same amount of time as the executed\r
+ portion of the upgrade and impose minimal state/data loss.\r
+\r
+.. <MT> We cannot recover from the failure that occurred in the upgrade\r
+ process: In this case, the system needs to fall back to an earlier\r
+ consistent state by reloading this backed-up state. This is typically\r
+ a faster recovery procedure than the graceful roll back, but can cause\r
+ state/data loss. The state/data loss usually depends on the\r
+ statefulness of the entities whose state is restored from the backup.\r
+\r
+The maximum duration of a VNF interruption (Service outage)\r
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\r
+\r
+Since not the entire process of a smooth upgrade will affect the VNFs,\r
+the duration of the VNF interruption may be shorter than the duration\r
+of the upgrade. In some cases, the VNF running without the control
+of the VIM is acceptable.
+\r
+.. <MT> Should we require explicitly that the NFVI should be able to
+   provide its services to the VNFs independent of the control plane?
+\r
+.. <MT> Requirement: The upgrade of the control plane must not cause\r
+ interruption of the NFVI services provided to the VNFs.\r
+\r
+.. <MT> With respect to carrier-grade the yearly service outage of the\r
+ VNF should not exceed 5' 15" regardless whether it is planned or\r
+ unplanned outage. Considering the HA requirements TL-9000 requires an\r
+ end-to-end service recovery time of 15 seconds based on which the ETSI\r
+ GS NFV-REL 001 V1.1.1 (2015-01) document defines three service\r
+ availability levels (SAL). The proposed example service recovery times\r
+ for these levels are:\r
+\r
+.. <MT> SAL1: 5-6 seconds\r
+\r
+.. <MT> SAL2: 10-15 seconds\r
+\r
+.. <MT> SAL3: 20-25 seconds\r
+\r
+.. <Pva> my comment was actually that the downtime metrics of the\r
+   underlying elements, components and services are a small fraction of the
+ total E2E service availability time. No-one on the E2E service path\r
+ will get the whole downtime allocation (in this context it includes\r
+ upgrade process related outages for the services provided by VIM etc.\r
+ elements that are subject to upgrade process).\r
+ \r
+.. <MT> So what you are saying is that the upgrade of any entity\r
+ (component, service) shouldn't cause even this much service\r
+ interruption. This was the reason I brought these figures here as well\r
+ that they are posing some kind of upper-upper boundary. Ideally the\r
+ interruption is in the millisecond range i.e. no more than a\r
+ switch-over or a live migration.\r
+ \r
+.. <MT> Requirement: Any interruption caused to the VNF by the upgrade\r
+ of the NFVI should be in the sub-second range.\r
+\r
+.. <MT> In the future we also need to consider the upgrade of the NFVI,
+   i.e. HW, firmware, hypervisors, host OS etc.
\ No newline at end of file