src/ceph/doc/dev/logging.rst

Use of the cluster log
======================

(Note: none of this applies to the local "dout" logging.  This is about
the cluster log that we send through the mon daemons.)

Severity
--------

Use ERR for situations where the cluster cannot do its job for some reason.
For example: we tried to do a write, but it returned an error, or we tried
to read something, but it's corrupt so we can't, or we scrubbed a PG but
the data was inconsistent so we can't recover.

Use WRN for incidents that the cluster can handle, but that have some
abnormal/negative aspect, such as a temporary degradation of service, or an
unexpected internal value.  For example, a metadata error that can be
auto-fixed, or a slow operation.

Use INFO for ordinary cluster operations that do not indicate a fault in
Ceph.  It is especially important that INFO level messages are clearly
worded and do not cause confusion or alarm.

Frequency
---------

Messages of any severity must not be excessively frequent.  Consumers may
be using a rotating log buffer that contains messages of all severities,
so even DEBUG messages could interfere with proper display of the latest
INFO messages if the DEBUG messages are too frequent.

Remember that if you have a bad state (as opposed to an event), that is
what health checks are for -- do not spam the cluster log to indicate
a continuing unhealthy state.

Do not emit cluster log messages for events that scale with
the number of clients or level of activity on the system, or for
events that occur regularly in normal operation.  For example, it
would be inappropriate to emit an INFO message about every
new client that connects (scales with #clients), or to emit an INFO
message about every CephFS subtree migration (occurs regularly).

Language and formatting
-----------------------

(Note: these guidelines matter much less for DEBUG-level messages than
for INFO and above.  Concentrate your efforts on making INFO/WRN/ERR
messages as readable as possible.)

Use the passive voice.  For example, use "Object xyz could not be read", rather
than "I could not read the object xyz".

Print long/big identifiers, such as inode numbers, as hex, prefixed
with 0x so that the user can tell it is hex.  We do this because
the 0x makes it unambiguous (no equivalent for decimal), and because
the hex form is more likely to fit on the screen.
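As a rough illustration of the hex rule, here is a minimal Python sketch. The helper name ``format_id`` and the sample inode number are made up for this example and are not part of any Ceph API.

```python
def format_id(value: int) -> str:
    """Render a large identifier (e.g. an inode number) as 0x-prefixed hex."""
    return f"0x{value:x}"


# Hypothetical inode number, used only to show the formatting.
ino = 1099511627776
message = f"Inode {format_id(ino)} could not be read"
```

The ``0x`` prefix is what tells the reader the value is hex; without it, a value such as ``10000000000`` would be ambiguous.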

Print size quantities in human-readable form (MB/GB/etc.), including the unit
at the end of the number.  Exception: if you are specifying an offset,
where precision is essential to the meaning, then you can specify
the value in bytes (but print it as hex).
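One way the size and offset rules might look in practice is sketched below. The function names, the 1024-based scaling, and the one-decimal rounding are assumptions made for this example; the guideline itself only asks for a human-readable number followed by its unit.

```python
# Assumed unit labels; each step is a factor of 1024 in this sketch.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB"]


def format_size(num_bytes: int) -> str:
    """Render a size as a short human-readable string with its unit."""
    value = float(num_bytes)
    for unit in UNITS:
        # Stop at the first unit that keeps the number small,
        # or at the largest unit we know about.
        if value < 1024 or unit == UNITS[-1]:
            if unit == "B":
                return f"{int(value)} B"
            return f"{value:.1f} {unit}"
        value /= 1024


def format_offset(offset: int) -> str:
    """Offsets need exact values, so print the byte count as hex."""
    return f"0x{offset:x}"
```

For instance, a message could read "Object xyz is 1.5 KB" for a size, but "Object xyz is corrupt at offset 0x1000" for an offset.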

Make a good faith effort to fit your message on a single line.  This need
not be guaranteed, but it should usually be the case.  That means,
generally, no printing of lists unless there are only a few items in
the list.
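If a message does need to mention a list, one possible approach (an assumption of this example, not something the guidelines above prescribe) is to print only the first few items and summarize the rest. The helper name ``summarize`` and the default limit of three are hypothetical.

```python
def summarize(items, limit=3):
    """Join at most `limit` items and note how many were left out."""
    items = list(items)
    shown = ", ".join(str(item) for item in items[:limit])
    remaining = len(items) - limit
    if remaining > 0:
        return f"{shown} (and {remaining} more)"
    return shown


short_list = summarize(["osd.1", "osd.2"])
long_list = summarize([f"osd.{i}" for i in range(10)])
```

This keeps the message on one line regardless of how long the underlying list grows.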

Use nouns that are meaningful to the user, and defined in the
documentation.  Common acronyms are OK -- don't waste screen space
typing "Rados Object Gateway" instead of RGW.  Do not use internal
class names like "MDCache" or "Objecter".  It is okay to mention
internal structures if they are the direct subject of the message,
for example in a corruption, but use plain English.

Example: instead of "Objecter requests" say "OSD client requests".

Example: it is okay to mention internal structure in the context
of "Corrupt session table" (but don't say "Corrupt SessionTable").

Where possible, describe the consequence for system availability, rather
than only describing the underlying state.  For example, rather than
saying "MDS myfs.0 is replaying", say that "myfs is degraded, waiting
for myfs.0 to finish starting".

While common acronyms are fine, don't randomly truncate words.  It's not
"dir ino", it's "directory inode".

If you're logging something that "should never happen", i.e. a situation
where it would be an assertion, but we're helpfully not crashing, then
make that clear in the language -- this is probably not a situation
that the user can remediate themselves.

Avoid UNIX/programmer jargon.  Instead of "errno", just say "error" (or
preferably give something more descriptive than the number!)
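As a sketch of giving "something more descriptive than the number", the snippet below turns an error code into text using Python's standard ``os.strerror``. The helper name ``describe_error`` is hypothetical, and accepting negative values is an assumption made here because internal return codes are often negated.

```python
import errno
import os


def describe_error(err: int) -> str:
    """Return the system's text for an error code instead of the raw number."""
    # abs() lets callers pass either ENOENT or -ENOENT (assumed convention).
    return os.strerror(abs(err))


message = f"Object xyz could not be read: {describe_error(-errno.ENOENT)}"
```

The resulting message carries the system's description of the failure rather than a bare numeric code.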

Do not mention cluster map epochs unless they are essential to
the meaning of the message.  For example, "OSDMap epoch 123 is corrupt"
would be okay (the epoch is the point of the message), but saying "OSD
123 is down in OSDMap epoch 456" would not be (the osdmap and epoch
concepts are an implementation detail; the down-ness of the OSD
is the real message).  Feel free to send additional detail to
the daemon's local log (via ``dout``/``derr``).

If you log a problem that may go away in the future, make sure you
also log when it goes away.  Whatever priority you logged the original
message at, log the "going away" message at INFO.
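The "log when it goes away" rule, together with the earlier advice not to spam the log for a continuing state, can be sketched as a small tracker that emits one message when a problem starts and one INFO message when it clears. Everything here is hypothetical: the ``ProblemTracker`` class, the ``emit`` callable standing in for the real cluster log, and the message texts.

```python
class ProblemTracker:
    """Emit a message on problem start and on recovery, not in between."""

    def __init__(self, emit):
        # `emit` is a stand-in callable taking (severity, text).
        self._emit = emit
        self._active = set()

    def report(self, key, text, severity="WRN"):
        """Log the problem only the first time it is seen."""
        if key not in self._active:
            self._active.add(key)
            self._emit(severity, text)

    def clear(self, key, text):
        """Log recovery at INFO, whatever severity the problem had."""
        if key in self._active:
            self._active.remove(key)
            self._emit("INFO", text)


log = []
tracker = ProblemTracker(lambda severity, text: log.append((severity, text)))
tracker.report("slow_ops", "Slow OSD client requests detected")
tracker.report("slow_ops", "Slow OSD client requests detected")  # suppressed
tracker.clear("slow_ops", "OSD client requests are no longer slow")
```

After this sequence, ``log`` holds exactly two entries: the initial WRN and the closing INFO.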