Monday, November 14, 2011

Do Not Believe in Structured Logging

This is based on my first-hand involvement in the audit subsystem's attempt
at structured logging, and on the results of documenting the audit records
that were actually sent by various programs. It is rather long, I'm afraid,
but please read at least the "Issues" section; these lessons were learned the
hard way, and any plan that ignores them will fail.

Outline



  • Issues:

    • Any generic log parsing tool WILL need to handle unstructured text. Always.
    • Any structured logging format WILL be used inconsistently.
    • There WILL NOT be a universal field name namespace.

  • Implications:

    • You can only "trust" data that always comes from the same piece of code.
    • Most log analysis code must specifically know what record it is working with.
    • Structured logging formats cannot deliver what they promise.

  • A potential recommendation:

    • Do not define new structured logging formats.
    • Define a universal field syntax. Make it trivial to use.
    • Treat unstructured text as the primary format, and optimize its use.
    • ... but is it really worth it?



Issues


A. Any generic log parsing tool WILL need to handle unstructured text. Always.


Right now we have thousands of Linux software packages that output unstructured
text. This, realistically, will never change, and incremental moves to
structured logging do not change it (even if only 5% of the important
log messages are unstructured text, the tools still need to handle
unstructured text).

Even if we magically managed to convert every package we have to use a specific
structured logging format within the next 5 years, a few other new logging
infrastructures would appear in the meantime (just as we currently have
Java's log4j, Python's logging module, or the proposed CEE standard); these
infrastructures would certainly not be 100% compatible with the chosen
structured logging format (carrying either more or less information). In
practice, the only format everybody can agree on is unstructured text.

Another aspect is logging in "can't happen" situations, which is one of the
places where logging is most valuable. Programmers just won't use a structured
logging API in that situation; you can probably persuade a person to replace
assert(index >= 0);

with
my_assert(index >= 0, "index file corrupted");

but we can't expect them to write 5 lines of code to log this properly in a
structured format; if anything, they are more likely to revert to the plain
assert(). After all, the situation "cannot happen", can it?
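
For contrast, the structured version of that same check might look something
like the sketch below. The struct_log() API, its field macros and the
index_file_path variable are all invented purely for illustration (no such
library is being proposed here); the point is only how much heavier this is
than a one-line assert:

/* Hypothetical structured-logging equivalent of my_assert(); struct_log(),
 * the LOG_FIELD_* macros and index_file_path are invented for illustration
 * only. */
if (index < 0) {
        struct_log(LOG_ERR,
                   LOG_FIELD_STR("event", "index_file_corrupted"),
                   LOG_FIELD_STR("file", index_file_path),
                   LOG_FIELD_INT("index", index),
                   LOG_FIELD_END);
        abort();
}

Few programmers will type that for a condition they believe cannot occur.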


B. Any structured logging format WILL be used inconsistently.


Bitter experience indicates that even programmers who use a structured logging
API instead of unstructured text WILL use it in ways that break the promises of
structured logging:


  • Using a single "message" string instead of breaking it down into pieces
    (so even a reader of "structured" log information must handle arbitrary text!).
  • Writing UIDs into "user name" fields.
  • Using three different field names for a single conceptual field (even
    though there was a helper function that writes the field correctly on the
    application's behalf).


And all of this happened while the auditing patches were being written by a
relatively small group of people (rather than by various upstreams reading
documentation and doing their best), and many of the patches went to a single
mailing list for review. It is possible that another group, or another time,
would be different, but I don't really believe it.


C. There WILL NOT be a universal field name namespace.


The only point of structured logging is to be able to parse the messages and
extract some useful data (or at least useful statistics) without encoding
specific knowledge about the message into the program. For example, one would
like to list all messages related to a specific host name.

On a purely structural level, there will be records that simply specify
HOSTNAME=foo. Then there will be records that specify CLIENT_HOSTNAME=foo,
SERVER_HOSTNAME=bar. Or perhaps even records that specify SENDER_HOSTNAME=foo,
ROUTER_HOSTNAMES=[foo1,foo2,foo3],
MULTICAST_DESTINATION_HOSTNAMES=[bar1,bar2,bar3] (using some convention or
mechanism to express multi-valued fields). And, looking at "B." above, there
will also be records that specify HOST_NAME=foo and HOST=foo. Obviously, doing
any log aggregation based on field names will fail. (Adding "field types"
doesn't solve this problem: field types will have to be user-extensible, and
thus will also be used inconsistently.)

Looking at the host name example from a semantic level, there will be records
that specify HOSTNAME=foo, and records that specify IP_ADDRESS=address_of_foo;
if DNS is temporarily down, the same program may actually generate both of
these records in response to the same kind of event! Then there's the whole
mess of IP addresses that change over time, or are reused over time by
different hosts.

Implications


D. You can only "trust" data that always comes from the same piece of code.


There are three general categories of data that is usually "trustable":

  • Fields that are always attached to the log record by the logging mechanism
    and recorded "out of band".
  • Fields in records generated within a single application (only if you can
    identify the application that has generated the record).
  • Fields generated by a library helper (but people WILL emulate the helper
    instead of using it, sooner or later).


Any other conventions and de-facto standards will, sooner or later, be violated
in a way that makes the result of parsing the logs incorrect.

E. Most log analysis code must specifically know what record it is working with.


The set of "trustable" data in logs is very limited ("D."), and log records
will be inconsistent ("B.", "C."). Any application that wants to make sense of
the data will have to know the specifics, both on the structural level,
e.g. field names, and on the semantic level, e.g. which fields with the same
"type" are relevant (should ROUTER_HOSTNAMES above be included or excluded?)
and what alternative sources of the data exist.


F. Structured logging formats cannot deliver what they promise.


Because of "E.", log analysis is not so much a matter of code as a matter of
information: what the record formats are, which fields are relevant, what
transformations must be applied. This is a lot of work that isn't automatable,
and the major part is collecting the data (finding all relevant programs and
looking at the records they generate), not parsing; once you find the code that
generates a specific record, writing a parser is usually quite simple.
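
For example, once you know a record is sshd's "Accepted ..." message, a
record-specific parser is a few lines. The sketch below assumes the familiar
OpenSSH wording, which can of course differ between versions and
configurations:

#include <stdio.h>

/* Sketch: parse sshd's "Accepted <method> for <user> from <host> port <port>"
 * message.  "msg" is expected to hold only the message text, without the
 * syslog timestamp/hostname/tag prefix; user and host must each have room for
 * 64 bytes.  Returns 1 on a successful parse. */
static int parse_sshd_accepted(const char *msg, char *user, char *host,
                               int *port)
{
    char method[32];

    return sscanf(msg, "Accepted %31s for %63s from %63s port %d",
                  method, user, host, port) == 4;
}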

Having a simpler or more efficient parser is somewhat valuable (see below), but
that's nowhere close to the idea of logging "directly into a database": the
database will contain nonsensical tuples instead of structured data.

A potential recommendation


Do not define new structured logging formats.


See F. above. Changing the logging format is costly and does not deliver the
desired benefit. Having two parallel logging mechanisms is particularly
damaging.


Define a universal field syntax. Make it trivial to use.


The above argues that "structured logging" in its full extent (new formats,
field semantics) does not make sense. What does make sense, however, is to
define a convention for record formatting (and always admit that it is only a
convention, see "A.") that allows simple parsing into fields (using a
single-pass lexer instead of record-specific, and potentially backtracking,
regexps).

There are two components to this:

The field syntax must be truly universal—i.e. it must be possible to use it
in any existing logging system. In particular, ASCII-only and no embedded
newlines.

The field syntax should be trivial to use—writing it must not be more
difficult than using syslog(), and parsing it must not be more difficult than
indexing a hash value. All we really need is something like:

  • A record contains only name=value pairs.
  • The syntax for "name" is $some_simple_regexp.
  • Some simple syntax for "value" that allows escaping of non-ASCII/binary data,
    and is not corrupted or re-escaped by any existing logging mechanism.


Perhaps the easiest write-side interface would be to extend syslog(), or to
provide a similar alternative, e.g.
UFsyslog("foo=%b bar=%B", string_value, binary_value, binary_value_size);
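
To make this concrete, a minimal sketch of the value escaping behind such an
interface could look like the code below. The escaping convention itself
(backslash plus two hex digits for bytes outside printable ASCII, and for
'\', '=' and space) is invented here purely for illustration; this post does
not define one:

#include <stdio.h>

/* Sketch of a value encoder for the proposed field syntax: printable ASCII is
 * emitted as-is, everything else (plus '\', '=' and the space separator)
 * becomes \xHH, so a record stays a single ASCII line of space-separated
 * name=value pairs.  The escaping rule is invented for illustration only. */
static void emit_field(FILE *out, const char *name,
                       const unsigned char *value, size_t len)
{
    size_t i;

    fprintf(out, "%s=", name);
    for (i = 0; i < len; i++) {
        unsigned char c = value[i];

        if (c >= 0x21 && c <= 0x7e && c != '\\' && c != '=')
            fputc(c, out);
        else
            fprintf(out, "\\x%02x", c);
    }
}

A UFsyslog()-style wrapper would then emit each name=value pair with
emit_field() (using strlen() for "%b" arguments and the explicit length for
"%B" arguments) and hand the finished single-line record to syslog().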


The parser-side interface is even more important; probably the best way to
persuade log writers to use the field syntax is to make it the most convenient
format to consume in parsers, and to let parser developers pressure the
authors of applications that write logs into using it.
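
On that note, parsing such a record really is a single pass; a minimal sketch
(reusing the invented \xHH escaping from the writer sketch above, and skipping
error handling) might look like this:

#include <stddef.h>

typedef void (*field_cb)(const char *name, const char *value, void *arg);

/* Decode one hex digit; returns -1 if the character is not 0-9 or a-f. */
static int unhex(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

/* Sketch of a single-pass lexer for "name=value name=value ..." records: one
 * linear scan, no backtracking, no record-specific regexps.  The callback is
 * invoked once per field. */
static void parse_record(const char *rec, field_cb cb, void *arg)
{
    while (*rec != '\0') {
        char name[64], value[1024];
        size_t n = 0, v = 0;

        while (*rec != '\0' && *rec != '=' && *rec != ' '
               && n + 1 < sizeof(name))
            name[n++] = *rec++;
        name[n] = '\0';
        if (*rec == '=')
            rec++;
        while (*rec != '\0' && *rec != ' ' && v + 1 < sizeof(value)) {
            if (rec[0] == '\\' && rec[1] == 'x'
                && unhex(rec[2]) >= 0 && unhex(rec[3]) >= 0) {
                value[v++] = (char)(unhex(rec[2]) * 16 + unhex(rec[3]));
                rec += 4;
            } else {
                value[v++] = *rec++;
            }
        }
        value[v] = '\0';
        cb(name, value, arg);
        while (*rec == ' ')
            rec++;
    }
}

With a suitable callback, parse_record("HOSTNAME=foo PID=1234", cb, NULL)
reports two fields; a record that does not follow the convention will of
course produce garbage, which is why the result remains advisory (see "A.").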

However, the caveat from E. still applies: parsing is not the difficult part, and
persuading anyone to choose a syntax only to make parsing easier will be a hard
sell.


Treat unstructured text as the primary format, and optimize its use.


As argued above, any log parser will have to handle arbitrary text. The
system's logging mechanism will similarly have to handle arbitrary text, both
on input (compatibility with syslog()) and output (human-readable output,
compatibility with existing parsers, probably compatibility with existing
remote logging protocols). The only reasonable way to store text is as text,
so it is natural to treat the text as the primary information format.

However, nothing prevents the logging mechanism from being smart about its use of
text:

  • It can store out-of-band information about the records (e.g. a reliable
    timestamp, user and process ID).
  • It can compress the text, either with generic data compression algorithms
    or with specific strategies (e.g. if a message follows the "universal
    field syntax", replacing repeating field-value pairs with a backreference).
  • It can build search indexes over the text: full-text indexes, indexes of the
    out-of-band information, and indexes of structured information for records
    that match the "universal field syntax". Note that the structured indexes
    are necessarily advisory—looking up a specific value in the HOSTNAME
    index will not return "all relevant records".


In this approach, we do not come anywhere close to fulfilling the hope of
"structured logging"—to get meaningful data out of the logs it is still
necessary to know exactly what record we are looking for.

Most of the records ever written will remain essentially "write-only":
programs will write them, no program will specifically analyze them, and a
human may look at them from time to time when there is a problem.

For the records that are not "write-only", however, application (log
writer/parser) developers will have an easier life: they will be able to write
data without worrying about escaping, they will have a library that does the log
parsing for them, and they may be able to take advantage of an index that
allows them to find the relevant records quickly.


... but is it really worth it?


So, where are we? The grand idea of "structured logging" deserves to be
abandoned. In its place, we have a suggested convention that may or may not be
universally adopted, and an idea for a library that may be used by a few
applications.

Is implementing it worth it? That really depends on the number and complexity
of applications that want to regularly parse logs. How many are there? I
would argue that there are very few within Linux distributions: most data that
needs to be parsed is actually stored in specialized databases (lastlog, wtmp
and others); I can only think of logwatch and denyhosts, both of which have
already built the necessary regexps to handle existing logs and are unlikely to
grow much over time, so the benefit of the new field formatting convention and
library would be minimal, if any. The logging daemons would perhaps benefit
from some additional metadata (for remote logging, log rotation and so on), but
that metadata can be pretty much invisible to the rest of the system.

Outside of Linux distributions, there is great value in log aggregation
(e.g. splunk); but, as argued above, "structured logging" is not the answer,
and the field formatting convention is only a minor improvement compared to
the amount of work necessary to collect information about record semantics.
Also, any log aggregation tool shipped separately from a Linux distribution
will very likely have to handle Windows and other platforms, which limits the
benefit of any Linux-only log format.

Given all this, is there a good reason to work on the field syntax convention
and libraries? I don't know, but I'm leaning towards "not really."

1 comment:

  1. I have recently looked into JSON logging for our applications at work, partly because the unstructured log parsers are too slow, minor changes in log messages can break parsing, and not even the standard, unchanging parts of every log message are handled universally yet.

    It seems to me that it would be quite valuable to have all the standard fields always captured correctly, like time and date, user agent, user name, log level, etc. In one application I have 11 standard fields that get output automatically with every log message.

    I have also recognized something like half a dozen special log messages that I would like to log in a structured manner (for example some automated timing and status indicators).

    The rest I don't care about that much, and I am fine with leaving the message part unstructured in those cases.

    So I agree that it would be futile to do fully structured logging of every message. But I still think that there is lots of value in structure when it automatically captures some fields, and in some specific cases where you want to go through the trouble of logging more structure explicitly. I don't think this can be done universally even at this level, but currently I am happy I can do it for the applications I care about.
