============
Requirements
============

Upgrade duration
================

As a telecom service system, OPNFV shall target carrier-grade availability,
which allows only about 5 minutes of outage per year. Based on this input
and on discussions of the current solutions, the following requirements are
defined from the perspective of time constraints:

- The OPNFV platform must be deployed with HA to make live upgrade possible.
  Considering the scale, complexity, and life cycle of an OPNFV system,
  allocating less than 5 minutes per year for upgrades is unrealistic.
  Therefore OPNFV should be deployed with HA, allowing part of the system to
  be upgraded while its redundant parts continue to serve the End-User. This
  should relax the time constraint on the upgrade operation to an achievable
  level.

- VNF service interruption for each switchover should be in the sub-second
  range. In an HA system, switching from an in-service system or component to
  a redundant one normally causes a service interruption. For example,
  live-migrating a virtual machine from one hypervisor to another typically
  takes the virtual machine out of service for about 500 ms. The sum of all
  these interruptions in a year shall be less than 5 minutes in order to
  fulfill five-nines carrier-grade availability. In addition, when an
  interruption lasts longer than a second, the End-User experience is likely
  to be impacted. This document therefore recommends that each service
  switchover take less than a second.

- VIM interruption shall not result in NFVI interruption. The VIM generally
  has more built-in logic and is therefore more complicated and likely less
  reliable than the NFVI. To minimize the impact of the VIM on the NFVI, the
  NFVI shall continue working normally unless the VIM explicitly orders it
  to stop.

- Total upgrade duration should be less than 2 hours. Even though the time
  constraint is relaxed by the HA design, it is recommended to limit the
  total time of the upgrade operation to 2 hours. The reason is that an
  upgrade might interfere with the End-User unexpectedly, and a shorter
  maintenance window carries less risk. In this document, the upgrade
  duration runs from the moment End-User services may first be impacted to
  the moment the upgrade is concluded with either a commit or a rollback.
  Given the scale and complexity of an OPNFV system, this requirement looks
  challenging; however, OPNFV implementations should target it by introducing
  novel designs and solutions.
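
The arithmetic behind the availability figures above can be checked with a
short sketch. The function names are illustrative only, and a non-leap year
is assumed.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # assumption: non-leap year


def downtime_budget_seconds(availability: float) -> float:
    """Return the allowed outage per year for a given availability ratio."""
    return (1.0 - availability) * SECONDS_PER_YEAR


def max_switchovers(budget_s: float, interruption_s: float) -> int:
    """Return how many interruptions of a given length fit into the budget."""
    return int(budget_s // interruption_s)


# Five nines allow roughly 315 seconds (a little over 5 minutes) per year.
five_nines_budget = downtime_budget_seconds(0.99999)

# With about 500 ms per live migration, the yearly budget is used up after
# a few hundred switchovers, before counting any other source of outage.
switchovers = max_switchovers(five_nines_budget, 0.5)
```

This is why the sub-second target per switchover matters: the budget is shared
by every switchover that affects a given service during the year.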

Pre-upgrading Environment
=========================

The system must be running normally. If there are any faults before the
upgrade, it is difficult to distinguish faults introduced by the upgrade from
faults already present in the environment.

The environment should have redundant resources. Because the upgrade process
is based on business (workload) migration, without resource redundancy it is
impossible to migrate the business and therefore to achieve a smooth upgrade.

Resource redundancy exists at two levels:

NFVI level: This level mainly concerns redundancy of compute node resources.
During the upgrade, the virtual machines carrying business can be migrated to
another free compute node.

VNF level: This level depends on the HA mechanism in the VNF, such as
active-standby or load balancing. In this case, as long as the business
running on the VMs of the target node is migrated to other free nodes,
migrating the VMs themselves might not be necessary.

Which kind of redundancy to use depends on the specific environment.
Generally speaking, during the upgrade the VNF's service-level availability
mechanism should take priority over the NFVI's. This helps reduce
the service outage.
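
As an illustration of the NFVI-level redundancy condition, the following
sketch checks whether the VMs of a node to be upgraded fit into the free
capacity of the remaining compute nodes. It assumes a single resource
dimension (for example vCPUs) and uses a first-fit-decreasing heuristic, so a
``False`` result means "no placement found by this heuristic" rather than a
proof that none exists. The function name is hypothetical.

```python
def can_evacuate(vm_sizes, free_capacity):
    """Check whether all VMs fit on the other nodes (first-fit decreasing).

    vm_sizes: resource demand of each VM on the node to be upgraded.
    free_capacity: free resource on each remaining compute node.
    """
    remaining = sorted(free_capacity, reverse=True)
    for size in sorted(vm_sizes, reverse=True):
        for index, free in enumerate(remaining):
            if size <= free:
                remaining[index] = free - size
                break
        else:
            # No remaining node can host this VM.
            return False
    return True


fits = can_evacuate([4, 2], [4, 4])      # both VMs find a host
too_tight = can_evacuate([4, 4], [6])    # second VM has nowhere to go
```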

Release version of software components
======================================

This is primarily a compatibility requirement. You can refer to Linux/Python
Compatible Semantic Versioning 3.0.0:

Given a version number MAJOR.MINOR.PATCH, increment the:

- MAJOR version when you make incompatible API changes,
- MINOR version when you add functionality in a backwards-compatible manner,
- PATCH version when you make backwards-compatible bug fixes.

Some internal interfaces of OpenStack will be used by Escalator indirectly,
such as the VM migration related interfaces between the VIM and the NFVI.
These interfaces are therefore required to be backward compatible. Refer to
the "Interface" chapter for details.
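
A compatibility check based on these versioning rules might look like the
sketch below. It handles only plain ``MAJOR.MINOR.PATCH`` strings and ignores
pre-release or development suffixes. ``parse_version`` and
``is_backward_compatible`` are illustrative names, not part of any existing
library.

```python
from typing import NamedTuple


class Version(NamedTuple):
    major: int
    minor: int
    patch: int


def parse_version(text: str) -> Version:
    """Parse a plain 'MAJOR.MINOR.PATCH' string."""
    major, minor, patch = (int(part) for part in text.split("."))
    return Version(major, minor, patch)


def is_backward_compatible(provided: str, required: str) -> bool:
    """Return True if an interface at 'provided' can serve a client that
    was built against 'required': same MAJOR, and not older."""
    have, need = parse_version(provided), parse_version(required)
    return have.major == need.major and have >= need


same_major_newer = is_backward_compatible("2.3.1", "2.1.0")
major_bump = is_backward_compatible("3.0.0", "2.9.9")
older_minor = is_backward_compatible("2.0.5", "2.1.0")
```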

Work Flows
==========

This section describes the different types of requirements. A table is to be
added to label the source of the requirements, e.g. Doctor, Multi-site, etc.

Basic Actions
=============

This section describes the basic functions that may be required by Escalator.

Preparation (offline)
^^^^^^^^^^^^^^^^^^^^^

This is the design phase, in which the upgrade plan (or upgrade campaign) is
designed so that it can be executed automatically with minimal
service outage. It may include the following work:

1. Check the dependencies of the software modules, their impact, and their
   backward compatibility to determine the appropriate upgrade method
   and ordering.
2. Find out whether a rolling upgrade could be planned in several rolling
   steps, to avoid any service outage caused by upgrading some
   parts/services at the same time.
3. Collect the proper version files and check their integration for the
   upgrade.
4. The preparation step should produce an output (i.e. an upgrade
   campaign/plan) that can be executed automatically in an NFV Framework
   and validated before execution.

   -  The upgrade campaign should not refer to scalable entities
      directly, but should allow for adaptation to the system configuration
      and state at any given moment.
   -  The upgrade campaign should describe the ordering of the upgrade
      of different entities so that dependencies and redundancies can be
      maintained during the upgrade execution.
   -  The upgrade campaign should provide information about the
      applicable recovery procedures and their ordering.
   -  The upgrade campaign should consider information about the
      verification/testing procedures to be performed during the upgrade
      so that upgrade failures can be detected as soon as possible and
      the appropriate recovery procedure can be identified and applied.
   -  The upgrade campaign should provide information on the expected
      execution time so that a hanging execution can be identified.
   -  The upgrade campaign should indicate any point in the upgrade when
      coordination with the users (VNFs) is required.
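
The dependency check and ordering described above can be sketched with a
topological sort. The module names and the dependency map are invented for
illustration, and the sketch assumes that a module is upgraded only after
everything it depends on. Whether the entities grouped into one step may
really be upgraded together still depends on the redundancy available.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical modules: each key depends on the modules in its set.
dependencies = {
    "compute-agent": {"message-bus", "database"},
    "scheduler": {"message-bus", "database"},
    "message-bus": set(),
    "database": set(),
}


def plan_rolling_steps(deps):
    """Group modules into ordered steps that respect the dependencies."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    steps = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted for a stable plan
        steps.append(ready)
        sorter.done(*ready)
    return steps


steps = plan_rolling_steps(dependencies)
```

Each inner list is one rolling step. A circular dependency is reported at
planning time instead of surfacing during execution.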

.. <hujie> Depending on the attributes of the object being upgraded, the
   upgrade plan may be split into step(s) and/or sub-plan(s), and even
   smaller sub-plans in the design phase. The plan(s) or sub-plan(s) may
   include step(s) or sub-plan(s).

Validating the upgrade plan / Checking the prerequisites of the system (offline / online)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The upgrade plan should be validated before execution by testing
it in a test environment similar to the production environment.

.. <MT> However, it could also mean that we can identify some properties
   that it should satisfy, e.g. which operations can or cannot be executed
   simultaneously, such as never taking out two VMs of the same VNF.

.. <MT> Another question is whether it requires the system to be in a
   particular state when the upgrade is applied, i.e. whether there is a
   certain amount of redundancy in the system, migration is enabled for VMs,
   the VIM is healthy when the NFVI is upgraded, the NFVI is healthy when
   the VIM is upgraded, etc.

.. <MT> I'm not sure what online validation means: is it the validation of
   the upgrade plan/campaign, or the validation that the system is in a
   state in which the upgrade can be performed without too much risk?

Before the upgrade plan is executed, the health of the online production
environment should be checked and confirmed to satisfy
the requirements described in the upgrade plan. The
sysinfo, which includes e.g. system alarms, performance statistics and
diagnostic logs, will be collected and analyzed. All system faults
must be resolved, or the unhealthy parts excluded, before
the upgrade plan is executed.
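
A pre-upgrade health check of this kind could be sketched as follows. The
``SysInfo`` fields and the load threshold are assumptions made for the
example; a real check would use whatever alarms and statistics the
environment actually exposes.

```python
from dataclasses import dataclass, field


@dataclass
class SysInfo:
    """Hypothetical health summary collected for one node."""
    node: str
    active_alarms: list = field(default_factory=list)
    cpu_load: float = 0.0  # fraction of capacity, 0.0 to 1.0


def split_by_health(infos, max_cpu_load=0.8):
    """Separate nodes that may be upgraded from nodes to fix or exclude."""
    healthy, unhealthy = [], []
    for info in infos:
        ok = not info.active_alarms and info.cpu_load <= max_cpu_load
        (healthy if ok else unhealthy).append(info.node)
    return healthy, unhealthy


healthy, unhealthy = split_by_health([
    SysInfo("compute-1"),
    SysInfo("compute-2", active_alarms=["disk-failure"]),
    SysInfo("compute-3", cpu_load=0.95),
])
```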

Backup/Snapshot (online)
^^^^^^^^^^^^^^^^^^^^^^^^

To avoid loss of data when an upgrade fails, the
data should be backed up and a snapshot of the system state should be taken
before the upgrade plan is executed. This should be considered in the
upgrade plan.

Several backups/snapshots may be generated and stored before the
individual change steps. The following data/files are required to be
considered:

1. The running version files for each node.
2. The configuration files and databases of the system components.
3. Images and storage, if necessary.

.. <MT> Does 3 imply VNF image and storage? I.e. VNF state and data?

.. <hujie> The following text is derived from the previous "4. Negotiate
   with the VNF if it's ready for the upgrade"

Although the upper layer, which includes VNFs and VNFMs, is out of the
scope of Escalator, it is still recommended to prepare it for a
smooth system upgrade. Escalator cannot guarantee the safety of
VNFs. The upper layer should have safeguard mechanisms by design
and be ready to avoid failures during a system upgrade.

Execution (online)
^^^^^^^^^^^^^^^^^^

The execution of the upgrade plan should be a dynamic procedure
controlled by Escalator.

.. <hujie> Revised text to be general.

1. It is required to support execution either in sequence or in
   parallel.
2. It is required to check the result of the execution and take
   action according to the situation and the policies in the upgrade plan.
3. It is required to execute properly on various configurations of
   the system object, e.g. stand-alone, HA, etc.
4. It is required to execute on the designated parts of the
   system, e.g. physical server, virtualized server, rack, chassis,
   cluster, or even different geographical locations.
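
The first two requirements can be illustrated with a small executor that runs
the steps of a plan in sequence, runs the targets inside one step in parallel,
and checks the results after every step. Stopping at the first failing step is
only one possible policy. ``run_plan`` and ``fake_upgrade`` are hypothetical
names, and the node names are invented.

```python
from concurrent.futures import ThreadPoolExecutor


def run_plan(steps, upgrade_one, max_workers=4):
    """Run steps in sequence; targets inside one step run in parallel.

    Returns (completed_targets, failed_targets) and stops after the first
    step that contains a failure.
    """
    completed = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for targets in steps:
            # pool.map returns results in the same order as the targets.
            results = list(pool.map(upgrade_one, targets))
            failed = [t for t, ok in zip(targets, results) if not ok]
            completed.extend(t for t, ok in zip(targets, results) if ok)
            if failed:
                return completed, failed
    return completed, []


def fake_upgrade(target):
    """Stand-in for a real upgrade action; fails for one node."""
    return target != "node-3"


done, failed = run_plan(
    [["node-1", "node-2"], ["node-3"], ["node-4"]], fake_upgrade
)
```

In this run the third step is never started, which leaves the decision between
retry, restore and roll-back to the policies in the upgrade plan.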

Testing (online)
^^^^^^^^^^^^^^^^

Testing is performed after upgrading the whole system or parts of the system
to make sure the upgraded system (object) is working normally.

.. <hujie> Revised text to be general.

1. It is recommended to run the prepared test cases to verify that the
   functionalities are available without any problems.
2. It is recommended to check the sysinfo, e.g. system alarms,
   performance statistics and diagnostic logs, to see whether there are any
   abnormalities.

Restore/Roll-back (online)
^^^^^^^^^^^^^^^^^^^^^^^^^^

When an upgrade fails, a quick system restore or system
roll-back should be performed to recover the system and the services.

.. <hujie> Revised text to be general.

1. It is recommended to support system restore from backup when an upgrade
   has failed.
2. It is recommended to support graceful roll-back with the steps in reverse
   order, if possible.
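
Roll-back in reverse order can be sketched by pairing every step with an undo
action and unwinding the steps already applied when a later step fails. The
step names and the ``make_step`` helper are purely illustrative.

```python
def execute_with_rollback(steps):
    """Apply (name, apply, undo) steps in order.

    On the first failure, undo the already applied steps in reverse order
    and return False. Return True if every step succeeds.
    """
    applied = []
    for name, apply, undo in steps:
        try:
            apply()
        except Exception:
            for _applied_name, applied_undo in reversed(applied):
                applied_undo()
            return False
        applied.append((name, undo))
    return True


trace = []


def make_step(name, fail=False):
    """Build a hypothetical step that records what it does in 'trace'."""
    def apply():
        if fail:
            raise RuntimeError(f"{name} failed")
        trace.append(f"apply {name}")

    def undo():
        trace.append(f"undo {name}")

    return name, apply, undo


ok = execute_with_rollback(
    [make_step("db"), make_step("api"), make_step("agent", fail=True)]
)
```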

Monitoring (online)
^^^^^^^^^^^^^^^^^^^

Escalator should continually monitor the upgrade process. During the
upgrade it keeps the status of each module, each node, and each cluster
up to date in a status table.

.. <hujie> Revised text to be general.

1. It is required to collect the status of every object being upgraded
   and to send alarms on abnormal conditions during the upgrade.
2. It is recommended to reuse the existing monitoring system, e.g. alarms.
3. It is recommended to support proactive querying.
4. It is recommended to support passively waiting for notifications.

**Two possible ways for monitoring:**

**Pro-Actively Query** requires the NFVI/VIM to provide a proper API or CLI
interface. If Escalator serves as a service, it should pass on these
interfaces.

**Passively Wait for Notification** requires Escalator to provide a
callback interface, which can be used by the NFVI/VIM systems or an upgrade
agent to send back notifications.
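
Both monitoring styles can feed the same status table, as in the sketch below.
The class, its method names and the status strings are assumptions made for
illustration. ``query_status`` stands for whatever API or CLI wrapper the
NFVI/VIM provides.

```python
class StatusTable:
    """Keeps the latest known status of every object being upgraded."""

    def __init__(self):
        self._status = {}

    def poll(self, object_ids, query_status):
        """Proactive query: ask for the status of each object."""
        for object_id in object_ids:
            self._status[object_id] = query_status(object_id)

    def on_notification(self, object_id, status):
        """Callback: invoked by the NFVI/VIM or an upgrade agent."""
        self._status[object_id] = status

    def abnormal(self, ok_states=("upgrading", "upgraded")):
        """Objects whose status should raise an alarm."""
        return sorted(
            obj for obj, state in self._status.items()
            if state not in ok_states
        )


table = StatusTable()
table.poll(["node-1", "node-2"], lambda object_id: "upgrading")
table.on_notification("node-2", "failed")
table.on_notification("node-3", "upgraded")
alarms = table.abnormal()
```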

.. <hujie> I am not sure why we should not subscribe to the notifications.

Logging (online)
^^^^^^^^^^^^^^^^

Record the information generated by Escalator in log files. The log
files are used for manual diagnosis of exceptions.

1. It is required to support logging.
2. It is recommended to include the time stamp, object id, action name,
   error code, etc.
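
With Python's standard ``logging`` module, the recommended fields could be
attached to each record through the ``extra`` argument, as sketched here. The
logger name, field names and values are invented, and the in-memory stream
only keeps the example free of side effects; a real deployment would log to
files.

```python
import io
import logging

stream = io.StringIO()  # in-memory target, used only for this sketch
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter(
    "%(asctime)s %(levelname)s object=%(object_id)s action=%(action)s "
    "error=%(error_code)s %(message)s"
))

logger = logging.getLogger("escalator.sketch")  # hypothetical logger name
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False  # keep the record out of the root logger

# Every call must supply the custom fields used by the format string.
logger.error(
    "upgrade step failed",
    extra={
        "object_id": "compute-07",
        "action": "migrate_vms",
        "error_code": "E1042",
    },
)
line = stream.getvalue()
```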

Administrative Control (online)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Administrative control restricts the privilege to start any of
Escalator's actions, in order to avoid unauthorized operations.

#. It is required to support an administrative control mechanism.
#. It is recommended to reuse the system's own security system.
#. It is required to avoid conflicts when the system's own security system
   is being upgraded.

Requirements on Object being upgraded
=====================================

.. <hujie> We can develop BPs in the future from the requirements of this
   section and gap analysis for upstream projects

Escalator focuses on smooth upgrade. In a practical implementation, it
might be combined with an installer/deployer, or act as an independent
tool/service. Either way, it requires that the target systems (NFVI and
VIM) be developed/deployed in a way that allows Escalator to perform
upgrades on them.

On the NFVI system, live-migration is likely to be used to maintain
availability, because OPNFV would like to make HA transparent to the end
user. This requires the VIM system to be able to put a compute node into
maintenance mode and thereby isolate it from normal service. Otherwise, new
NFVI instances risk being scheduled onto the node being upgraded.

On the VIM system, availability is likely to be achieved by redundancy. This
imposes fewer requirements on the systems/services being upgraded (see PVA
comments in early version). However, there should be a way to put the
target system into standby mode, because starting the upgrade on the
master node of a cluster is likely a bad idea.

.. <hujie> Revised text to be general.

1. It is required for NFVI/VIM to support a **service handover** mechanism
   that limits interruption to 0.001% (i.e. 99.999% service
   availability). Possible implementations are live-migration, redundant
   deployment, etc. (Note: for VIM, the interruption requirement could be
   less restrictive.)

2. It is required for NFVI/VIM to restore the earlier version in an
   efficient way, such as from a **snapshot**.

3. It is required for NFVI/VIM to **migrate data** efficiently between
   the base and the upgraded system.

4. It is recommended for the NFVI/VIM's interface to support upgrade
   orchestration, e.g. reading/setting system state.

Functional Requirements
=======================

Availability mechanism, etc.

Non-functional Requirements
===========================