docs/requirements/104-Requirements.rst

   1 ============
   2 Requirements
   3 ============
   4
   5 Upgrade duration
   6 ================
   7
   8 As the OPNFV end-users are primarily Telecom operators, the network
   9 services provided by the VNFs deployed on the NFVI should meet the
  10 requirement of 'Carrier Grade'::
  11
  12   In telecommunication, a "carrier grade" or "carrier class" refers to a
  13   system, or a hardware or software component that is extremely reliable,
  14   well tested and proven in its capabilities. Carrier grade systems are
  15   tested and engineered to meet or exceed "five nines" high availability
  16   standards, and provide very fast fault recovery through redundancy
  17   (normally less than 50 milliseconds). [from wikipedia.org]
  18
  19 "Five nines" (99.999% availability) allows at most 5' 15" of downtime per year.
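
The 5' 15" figure follows directly from the availability ratio. The sketch below is only an illustration of that arithmetic; the function names are made up for this example and a non-leap year of 365 days is assumed.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumption: non-leap year


def allowed_downtime_seconds(availability: float) -> float:
    """Return the yearly downtime budget, in seconds, for an availability ratio."""
    return (1.0 - availability) * SECONDS_PER_YEAR


def format_minutes_seconds(seconds: float) -> str:
    """Format a duration as M' SS", rounded down to whole seconds."""
    whole = int(seconds)
    return f"{whole // 60}' {whole % 60:02d}" + '"'


budget = allowed_downtime_seconds(0.99999)
print(format_minutes_seconds(budget))
```

The budget comes to roughly 315 seconds per year, so a single 10 minute outage would already consume about twice the yearly allowance.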
  20
  21 ::
  22
  23   We have learnt that a well prepared upgrade of OpenStack needs 10
  24   minutes. The major time slot in the outage time is spent on
  25   synchronizing the database. [from 'Ten minutes OpenStack Upgrade? Done!'
  26   by Symantec]
  27
  28 These 10 minutes of OpenStack service downtime, however, did not impact the
  29 users, i.e. the VMs running on the compute nodes. This was an outage of
  30 the control plane only. On the other hand, with respect to the
  31 preparations, this was a manually tailored upgrade specific to the
  32 particular deployment and the versions of each OpenStack service.
  33
  34 The project aims to achieve a more generic methodology, which however
  35 requires that the upgrade objects fulfil certain requirements. Since
  36 this is only possible in the long run, we first target the upgrade
  37 of the different VIM services from version to version.
  38
  39 **Questions:**
  40
  41 1. Can we manage to upgrade OPNFV in only 5 minutes?
  42
  43 .. <MT> The first question is whether we have the same carrier grade
  44    requirement on the control plane as on the user plane. I.e. how
  45    much control plane outage can we / are we willing to tolerate?
  46    In the above case, if the database is only half the size we can
  47    probably do the upgrade in 5 minutes, but is that good? It also means
  48    that if the database is twice as large then the outage is 20
  49    minutes.
  50    For the user plane we should go for less, as with two releases yearly
  51    that means 10 minutes outage per year.
  52
  53 .. <Malla> 10 minutes outage per year to the users? Plus, if we take
  54    the control plane into consideration, then the total outage will be
  55    more than 10 minutes in the whole network, right?
  56
  57 .. <MT> The control plane outage does not have to cause outage to
  58    the users, but it may of course depending on the size of the system
  59    as it's more likely that there's a failure that needs to be handled
  60    by the control plane.
  61
  62 2. Is it acceptable for end users? For example, a planned service
  63    interruption lasting more than ten minutes for a software
  64    upgrade.
  65
  66 .. <MT> For user plane, no it's not acceptable in case of
  67    carrier-grade. The 5' 15" downtime should include unplanned and
  68    planned downtimes.
  69
  70 .. <Malla> I agree with Maria, it is not acceptable.
  71
  72 3. Will the VNFs still work well when the VIM is down?
  73
  74 .. <MT> In case of OpenStack it seems yes. :)
  75
  76 The maximum duration of an upgrade
  77 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  78
  79 The duration of an upgrade is related and proportional to the
  80 scale and the complexity of the OPNFV platform as well as the
  81 granularity (in function and in space) of the upgrade.
  82
  83 .. <Malla> Also, if it is a partial upgrade such as a module upgrade, it
  84   also depends on the OPNFV modules and their tightly coupled entities.
  85
  86 .. <MT> Since the maintenance window is shrinking and becoming non-existent
  87   the duration of the upgrade is secondary to the requirement of smooth upgrade.
  88   But probably we want to be able to put a time constraint on each upgrade
  89   during which it must complete otherwise it is considered failed and the system
  90   should be rolled back. I.e. in case of automatic execution it might not be clear
  91   if an upgrade is long or just hanging. The time constraints may be a function
  92   of the size of the system in terms of the upgrade object(s).
  93
  94 The maximum duration of a roll back when an upgrade fails
  95 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  96
  97 The duration of a roll back is shorter than that of the corresponding
  98 upgrade. It depends on the time needed to restore the software and
  99 configuration data from the pre-upgrade backup / snapshot.
 100
 101 .. <MT> During the upgrade process two types of failure may happen:
 102   In case we can recover from the failure by undoing the upgrade
 103   actions it is possible to roll back the already executed part of the
 104   upgrade in a graceful manner introducing no more service outage than
 105   what was introduced during the upgrade. Such a graceful roll back
 106   typically requires the same amount of time as the executed portion of
 107   the upgrade and imposes minimal state/data loss.
 108
 109 .. <MT> Requirement: It should be possible to roll back gracefully the
 110   failed upgrade of stateful services of the control plane.
 111   In case we cannot recover from the failure by just undoing the
 112   upgrade actions, we have to restore the upgraded entities from their
 113   backed up state. In other terms the system falls back to an earlier
 114   state, which is typically a faster recovery procedure than graceful
 115   roll back and depending on the statefulness of the entities involved it
 116   may result in significant state/data loss.
 117
 118 .. <MT> Two possible types of failures can happen during an upgrade
 119
 120 .. <MT> We can recover from the failure that occurred in the upgrade process:
 121   In this case, a graceful rolling back of the executed part of the
 122   upgrade may be possible which would "undo" the executed part in a
 123   similar fashion. Thus, such a roll back introduces no more service
 124   outage during an upgrade than the executed part introduced. This
 125   process typically requires the same amount of time as the executed
 126   portion of the upgrade and imposes minimal state/data loss.
 127
 128 .. <MT> We cannot recover from the failure that occurred in the upgrade
 129    process: In this case, the system needs to fall back to an earlier
 130    consistent state by reloading this backed-up state. This is typically
 131    a faster recovery procedure than the graceful roll back, but can cause
 132    state/data loss. The state/data loss usually depends on the
 133    statefulness of the entities whose state is restored from the backup.
 134
 135 The maximum duration of a VNF interruption (Service outage)
 136 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 137
 138 Since not every part of a smooth upgrade affects the VNFs,
 139 the duration of the VNF interruption may be shorter than the duration
 140 of the upgrade. In some cases, it is acceptable for the VNF to run
 141 without control from the VIM.
 142
 143 .. <MT> Should we require explicitly that the NFVI should be able to
 144   provide its services to the VNFs independent of the control plane?
 145
 146 .. <MT> Requirement: The upgrade of the control plane must not cause
 147   interruption of the NFVI services provided to the VNFs.
 148
 149 .. <MT> With respect to carrier-grade the yearly service outage of the
 150   VNF should not exceed 5' 15" regardless of whether it is a planned or
 151   unplanned outage. Considering the HA requirements TL-9000 requires an
 152   end-to-end service recovery time of 15 seconds based on which the ETSI
 153   GS NFV-REL 001 V1.1.1 (2015-01) document defines three service
 154   availability levels (SAL). The proposed example service recovery times
 155   for these levels are:
 156
 157 .. <MT> SAL1: 5-6 seconds
 158
 159 .. <MT> SAL2: 10-15 seconds
 160
 161 .. <MT> SAL3: 20-25 seconds
 162
 163 .. <Pva> My comment was actually that the downtime metrics of the
 164   underlying elements, components and services are a small fraction of the
 165   total E2E service availability time. No-one on the E2E service path
 166   will get the whole downtime allocation (in this context it includes
 167   upgrade process related outages for the services provided by VIM etc.
 168   elements that are subject to upgrade process).
 169
 170 .. <MT> So what you are saying is that the upgrade of any entity
 171   (component, service) shouldn't cause even this much service
 172   interruption. This was the reason I brought these figures here as well
 173   that they are posing some kind of upper-upper boundary. Ideally the
 174   interruption is in the millisecond range i.e. no more than a
 175   switch-over or a live migration.
 176
 177 .. <MT> Requirement: Any interruption caused to the VNF by the upgrade
 178   of the NFVI should be in the sub-second range.
 179
 180 .. <MT> In the future we also need to consider the upgrade of the NFVI,
 181   i.e. HW, firmware, hypervisors, host OS etc.
 182
 183 Pre-upgrading Environment
 184 =========================
 185
 186 The system should be running normally. If there are faults before the
 187 upgrade, it is difficult to tell whether a problem was introduced by the
 188 upgrade or by the environment itself.
 189
 190 The environment should have redundant resources. Because the upgrade
 191 process is based on business migration, in the absence of resource
 192 redundancy it is impossible to migrate the business and therefore to
 193 achieve a smooth upgrade.
 194
 195 Resource redundancy exists at two levels:
 196
 197 NFVI level: This level is mainly compute node resource redundancy.
 198 During the upgrade, the virtual machines carrying the business can be
 199 migrated to another free compute node.
 200
 201 VNF level: This level depends on the HA mechanism of the VNF, such as
 202 active-standby or load balancing. In this case, as long as the business of
 203 the VMs on the target node is moved to other free nodes, the migration of
 204 the VMs might not be necessary.
 205
 206 Which kind of redundancy to use depends on the specific environment.
 207 Generally speaking, during the upgrade the VNF's service level availability
 208 mechanism should be used with higher priority than the NFVI's. This will help
 209 us to reduce the service outage.
 210
 211 Release version of software components
 212 ======================================
 213
 214 This is primarily a compatibility requirement. You can refer to Linux/Python
 215 Compatible Semantic Versioning 3.0.0:
 216
 217 Given a version number MAJOR.MINOR.PATCH, increment the:
 218
 219 MAJOR version when you make incompatible API changes,
 220
 221 MINOR version when you add functionality in a backwards-compatible manner,
 222
 223 PATCH version when you make backwards-compatible bug fixes.
 224
 225 Some internal interfaces of OpenStack will be used by Escalator indirectly,
 226 such as the VM migration related interface between VIM and NFVI. These
 227 interfaces are therefore required to be backward compatible. Refer to the
 228 "Interface" chapter for details.
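
As an illustration of how the MAJOR.MINOR.PATCH rules translate into a compatibility check, the sketch below compares two version strings. It is a simplification under stated assumptions: only plain three-part numeric versions are handled (no pre-release or development suffixes), and the function names are hypothetical rather than part of any Escalator or OpenStack interface.

```python
from typing import Tuple


def parse_version(text: str) -> Tuple[int, int, int]:
    """Parse a plain MAJOR.MINOR.PATCH string into a tuple of integers."""
    major, minor, patch = (int(part) for part in text.split("."))
    return major, minor, patch


def is_backward_compatible(current: str, candidate: str) -> bool:
    """Return True if `candidate` should keep the API of `current` usable.

    Only a MAJOR bump signals incompatible API changes, so the candidate
    must share the MAJOR number and must not be older than `current`.
    """
    cur = parse_version(current)
    new = parse_version(candidate)
    return new[0] == cur[0] and new >= cur


print(is_backward_compatible("2.3.1", "2.4.0"))  # MINOR bump
print(is_backward_compatible("2.3.1", "3.0.0"))  # MAJOR bump
```

A check of this kind could help decide in the preparation phase whether an interface used indirectly by Escalator may have changed incompatibly between the base and target versions.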
 229
 230 Work Flows
 231 ==========
 232
 233 This section describes the different types of requirements. A table is to
 234 be added to label the source of the requirements, e.g. Doctor, Multi-site, etc.
 235
 236 Basic Actions
 237 =============
 238
 239 This section describes the basic functions that may be required by Escalator.
 240
 241 Preparation (offline)
 242 ^^^^^^^^^^^^^^^^^^^^^
 243
 244 This is the design phase when the upgrade plan (or upgrade campaign) is
 245 being designed so that it can be executed automatically with minimal
 246 service outage. It may include the following work:
 247
 248 1. Check the dependencies of the software modules, their impact and
 249    backward compatibility to figure out the appropriate upgrade method
 250    and ordering.
 251 2. Find out if a rolling upgrade could be planned with several rolling
 252    steps to avoid any service outage caused by upgrading some
 253    parts/services at the same time.
 254 3. Collect the proper version files and check the integration for
 255    upgrading.
 256 4. The preparation step should produce an output (i.e. upgrade
 257    campaign/plan), which is executable automatically in an NFV Framework
 258    and which can be validated before execution.
 259
 260    -  The upgrade campaign should not be referring to scalable entities
 261       directly, but allow for adaptation to the system configuration and
 262       state at any given moment.
 263    -  The upgrade campaign should describe the ordering of the upgrade
 264       of different entities so that dependencies, redundancies can be
 265       maintained during the upgrade execution
 266    -  The upgrade campaign should provide information about the
 267       applicable recovery procedures and their ordering.
 268    -  The upgrade campaign should consider information about the
 269       verification/testing procedures to be performed during the upgrade
 270       so that upgrade failures can be detected as soon as possible and
 271       the appropriate recovery procedure can be identified and applied.
 272    -  The upgrade campaign should provide information on the expected
 273       execution time so that hanging execution can be identified
 274    -  The upgrade campaign should indicate any point in the upgrade when
 275       coordination with the users (VNFs) is required.
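
One way to derive such an ordering is a topological sort over the dependencies between entities. The following is a minimal sketch, assuming Python 3.9+ for the standard `graphlib` module; the entity names and the dependency map are invented for illustration, and the direction of the ordering (dependencies first) is an assumption that a real campaign might reverse.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each entity lists the entities that must be
# upgraded before it. The names are illustrative only.
DEPENDENCIES = {
    "database": set(),
    "message-queue": set(),
    "identity": {"database"},
    "compute-api": {"identity", "message-queue"},
    "compute-node": {"compute-api"},
}


def upgrade_waves(dependencies):
    """Group entities into waves; a wave starts only after earlier waves finish."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


for number, wave in enumerate(upgrade_waves(DEPENDENCIES), start=1):
    print(number, wave)
```

Entities in the same wave have no dependencies on each other, so they are candidates for parallel execution, subject to the redundancy constraints of the deployment.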
 276
 277 .. <hujie> Depending on the attributes of the object being upgraded, the
 278   upgrade plan may be split into step(s) and/or sub-plan(s), and even
 279   smaller sub-plans in the design phase. The plan(s) or sub-plan(s) may
 280   include step(s) or sub-plan(s).
 281
 282 Validating the upgrade plan / Checking the system pre-requisites (offline / online)
 283 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 284
 285 The upgrade plan should be validated before the execution by testing
 286 it in a test environment which is similar to the production environment.
 287
 288 .. <MT> However it could also mean that we can identify some properties
 289   that it should satisfy e.g. what operations can or cannot be executed
 290   simultaneously like never take out two VMs of the same VNF.
 291
 292 .. <MT> Another question is if it requires that the system is in a particular
 293   state when the upgrade is applied. I.e. if there's certain amount of
 294   redundancy in the system, migration is enabled for VMs, when the NFVI
 295   is upgraded the VIM is healthy, when the VIM is upgraded the NFVI is
 296   healthy, etc.
 297
 298 .. <MT> I'm not sure what online validation means: Is it the validation of the
 299   upgrade plan/campaign or the validation of the system that it is in a
 300   state that the upgrade can be performed without too much risk?
 301
 302 Before the upgrade plan is executed, the health of the online
 303 production environment should be checked and confirmed to satisfy
 304 the requirements which were described in the upgrade plan. The
 305 sysinfo, e.g. system alarms, performance statistics and
 306 diagnostic logs, will be collected and analyzed. It is required to
 307 resolve all of the system faults or exclude the unhealthy part before
 308 executing the upgrade plan.
 309
 310
 311 Backup/Snapshot (online)
 312 ^^^^^^^^^^^^^^^^^^^^^^^^
 313
 314 To avoid loss of data when an upgrade is unsuccessful, the
 315 data should be backed up and a snapshot of the system state should be taken
 316 before the execution of the upgrade plan. This should be considered in the
 317 upgrade plan.
 318
 319 Several backups/snapshots may be generated and stored before the individual
 320 change steps. The following data/files are required to be
 321 considered:
 322
 323 1. Running version files for each node.
 324 2. System components' configuration files and databases.
 325 3. Images and storage, if necessary.
 326
 327 .. <MT> Does 3 imply VNF image and storage? I.e. VNF state and data?
 328
 329 .. <hujie> The following text is derived from previous "4. Negotiate
 330   with the VNF if it's ready for the upgrade"
 331
 332 Although the upper layer, which includes VNFs and VNFMs, is out of the
 333 scope of Escalator, it is still recommended to make it ready for a
 334 smooth system upgrade. Escalator cannot guarantee the safety of the
 335 VNFs. The upper layer should have some safeguard mechanism in its design,
 336 and be ready to avoid failure during the system upgrade.
 337
 338 Execution (online)
 339 ^^^^^^^^^^^^^^^^^^
 340
 341 The execution of the upgrade plan should be a dynamic procedure which is
 342 controlled by Escalator.
 343
 344 .. <hujie> Revised text to be general.
 345
 346 1. It is required to support execution either in sequence or in
 347    parallel.
 348 2. It is required to check the result of the execution and take
 349    action according to the situation and the policies in the upgrade plan.
 350 3. It is required to execute properly on various configurations of
 351    system object, such as stand-alone or HA.
 352 4. It is required to execute on the designated different parts of the
 353    system, such as physical server, virtualized server, rack, chassis,
 354    cluster, even different geographical places.
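
The first two requirements can be sketched as follows. This is only an illustrative outline, not Escalator's implementation: `upgrade_node` is a placeholder for the real per-node action, the plan format (a list of node lists with a parallel flag) is an assumption, and the only policy shown is "stop at the first failed step".

```python
from concurrent.futures import ThreadPoolExecutor


def upgrade_node(node: str) -> bool:
    """Placeholder for the real per-node upgrade action; returns success."""
    return not node.startswith("bad")


def run_step(nodes, parallel: bool) -> dict:
    """Upgrade the nodes of one step and return {node: succeeded}."""
    if parallel:
        with ThreadPoolExecutor(max_workers=len(nodes) or 1) as pool:
            results = list(pool.map(upgrade_node, nodes))
    else:
        results = [upgrade_node(node) for node in nodes]
    return dict(zip(nodes, results))


def execute_plan(steps) -> bool:
    """Run steps in order; stop at the first step with a failed node."""
    for nodes, parallel in steps:
        outcome = run_step(nodes, parallel)
        if not all(outcome.values()):
            return False  # the caller decides on roll back or restore
    return True


print(execute_plan([(["ctl-1", "ctl-2"], False), (["cmp-1", "cmp-2"], True)]))
```

Keeping the result check separate from the execution mode means the same policy handling applies whether a step ran in sequence or in parallel.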
 355
 356 Testing (online)
 357 ^^^^^^^^^^^^^^^^
 358
 359 Testing is performed after upgrading the whole system or parts of it to make
 360 sure the upgraded system (object) is working normally.
 361
 362 .. <hujie> Revised text to be general.
 363
 364 1. It is recommended to run the prepared test cases to see if the
 365    functionalities are available without any problem.
 366 2. It is recommended to check the sysinfo, e.g. system alarms,
 367    performance statistics and diagnostic logs to see if there are any
 368    anomalies.
 369
 370 Restore/Roll-back (online)
 371 ^^^^^^^^^^^^^^^^^^^^^^^^^^
 372
 373 When an upgrade fails, a quick system restore or system
 374 roll-back should be performed to recover the system and the services.
 375
 376 .. <hujie> Revised text to be general.
 377
 378 1. It is recommended to support system restore from backup when an
 379    upgrade has failed.
 380 2. It is recommended to support graceful roll-back with reverse order
 381    steps if possible.
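
A graceful roll-back with reverse order steps can be pictured as undoing the completed actions from last to first. The sketch below assumes every step offers an `undo` action; the `Step` class and its names are hypothetical, and restore from backup (for failures that cannot be undone) is not covered.

```python
class Step:
    """One reversible upgrade action (names are illustrative)."""

    def __init__(self, name, fail=False):
        self.name = name
        self.fail = fail

    def apply(self, log):
        if self.fail:
            raise RuntimeError(f"{self.name} failed")
        log.append(f"apply {self.name}")

    def undo(self, log):
        log.append(f"undo {self.name}")


def run_with_rollback(steps):
    """Apply steps in order; on failure undo the completed ones in reverse."""
    log, done = [], []
    for step in steps:
        try:
            step.apply(log)
        except RuntimeError:
            for finished in reversed(done):
                finished.undo(log)
            return False, log
        done.append(step)
    return True, log


ok, log = run_with_rollback([Step("db"), Step("api"), Step("compute", fail=True)])
print(ok, log)
```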
 382
 383 Monitoring (online)
 384 ^^^^^^^^^^^^^^^^^^^
 385
 386 Escalator should continually monitor the upgrade process. It
 387 keeps the status of each module, each node and each cluster up to date in a
 388 status table during the upgrade.
 389
 390 .. <hujie> Revised text to be general.
 391
 392 1. It is required to collect the status of every object being upgraded
 393    and to send alarms on abnormal conditions during the upgrade.
 394 2. It is recommended to reuse the existing monitoring system, like alarm.
 395 3. It is recommended to support proactive querying.
 396 4. It is recommended to support passively waiting for notification.
 397
 398 **Two possible ways for monitoring:**
 399
 400 **Pro-Actively Query** requires the NFVI/VIM to provide a proper API or CLI
 401 interface. If Escalator serves as a service, it should pass on these
 402 interfaces.
 403
 404 **Passively Wait for Notification** requires Escalator to provide a
 405 callback interface, which could be used by NFVI/VIM systems or an upgrade
 406 agent to send back notifications.
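
The two modes can feed the same status table. The sketch below is a minimal in-memory illustration; the class, its method names and the state strings are assumptions made for this example, and the `query` argument stands in for whatever API or CLI call the NFVI/VIM actually provides.

```python
class StatusTable:
    """In-memory status table keyed by object id (a stand-in for Escalator's)."""

    def __init__(self):
        self.status = {}

    # Passive mode: the NFVI/VIM or an upgrade agent calls this back.
    def on_notification(self, object_id, state):
        self.status[object_id] = state

    # Proactive mode: Escalator asks each object through a query function.
    def poll(self, object_ids, query):
        for object_id in object_ids:
            self.status[object_id] = query(object_id)

    def abnormal(self):
        """Return the ids of objects whose last known state is 'failed'."""
        return sorted(o for o, s in self.status.items() if s == "failed")


table = StatusTable()
table.poll(["node-1", "node-2"], lambda object_id: "upgrading")
table.on_notification("node-2", "failed")
print(table.abnormal())
```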
 407
 408 .. <hujie> I am not sure why not to subscribe to the notification.
 409
 410 Logging (online)
 411 ^^^^^^^^^^^^^^^^
 412
 413 Record the information generated by Escalator into log files. The log
 414 files are used for manual diagnosis of exceptions.
 415
 416 1. It is required to support logging.
 417 2. It is recommended to include time stamp, object id, action name,
 418    error code, etc.
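
A log record carrying these fields could look like the following sketch, which uses Python's standard `logging` module. The logger name, the field names `object_id`, `action` and `error_code`, and the sample values are assumptions for illustration; an in-memory stream replaces the log file so that the example is self-contained.

```python
import io
import logging


def make_logger(stream):
    """Build a logger whose records carry object id, action and error code."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter(
        "%(asctime)s %(levelname)s object=%(object_id)s "
        "action=%(action)s error=%(error_code)s %(message)s"
    ))
    logger = logging.getLogger("escalator.sketch")
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


buffer = io.StringIO()
log = make_logger(buffer)
log.error(
    "upgrade step failed",
    extra={"object_id": "compute-7", "action": "upgrade", "error_code": 42},
)
print(buffer.getvalue().strip())
```

The time stamp comes from `%(asctime)s`; the remaining fields are passed per record through the `extra` argument.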
 419
 420 Administrative Control (online)
 421 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 422
 423 Administrative control is used to control the privilege to start any
 424 Escalator action in order to avoid unauthorized operations.
 425
 426 #. It is required to support an administrative control mechanism.
 427 #. It is recommended to reuse the system's own security system.
 428 #. It is required to avoid conflicts when the system's own security system
 429    is being upgraded.
 430
 431 Requirements on Object being upgraded
 432 =====================================
 433
 434 .. <hujie> We can develop BPs in future from requirements of this section and
 435   gap analysis for upstream projects
 436
 437 Escalator focuses on smooth upgrade. In a practical implementation, it
 438 might be combined with an installer/deployer, or act as an independent
 439 tool/service. Either way, it requires that the target systems (NFVI and
 440 VIM) are developed/deployed in a way that lets Escalator perform
 441 upgrades on them.
 442
 443 On the NFVI system, live-migration is likely used to maintain availability
 444 because OPNFV would like to make HA transparent to the end user. This
 445 requires the VIM system to be able to put a compute node into maintenance
 446 mode and then isolate it from normal service. Otherwise, new NFVI instances
 447 risk being scheduled onto the node being upgraded.
 448
 449 On the VIM system, availability is likely achieved by redundancy. This
 450 imposes fewer requirements on the system/services being upgraded (see PVA
 451 comments in early version). However, there should be a way to put the
 452 target system into standby mode, because starting an upgrade on the
 453 master node in a cluster is likely a bad idea.
 454
 455 .. <hujie> Revised text to be general.
 456
 457 1. It is required for NFVI/VIM to support a **service handover** mechanism
 458    that limits interruption to 0.001% (i.e. 99.999% service
 459    availability). Possible implementations are live-migration, redundant
 460    deployment, etc. (Note: for VIM, interruption could be less
 461    restrictive.)
 462
 463 2. It is required for NFVI/VIM to restore the earlier version in an
 464    efficient way, such as **snapshot**.
 465
 466 3. It is required for NFVI/VIM to **migrate data** efficiently between
 467    base and upgraded system.
 468
 469 4. It is recommended for NFVI/VIM's interface to support upgrade
 470    orchestration, e.g. reading/setting system state.
 471
 472 Functional Requirements
 473 =======================
 474
 475 Availability mechanism, etc.
 476
 477 Non-functional Requirements
 478 ===========================