This is work in progress! It might be wrong, misleading or
even entirely useless. It covers the author's thoughts so far and may still
change completely. Use it with care!
On the Nature of Syslog Data
Abstract
This paper describes the "nature" of syslog data. It looks at how syslog data is
structured and at the syntaxes and semantics of the log data. The
entities making up the log record are identified and defined. Syntaxes and
semantics typically found are also described and defined. The intention of this
paper is to provide a theoretical model describing the structure of real-world
log data. With such a theoretical model, further work can be done to define a
set of well-known log message properties which in turn can be used to build
generic log analysis algorithms and tools. The theoretical model created in
this paper should also enable the creation of log parsers that will parse
individual log messages into a generic format.
What is syslog data?
This paper addresses syslog data only. Most importantly, syslog data is data
that has been transmitted via the syslog protocol. It is NOT SNMP-based data
and also not data from log files. Obviously, there are tools that relay
data from other log sources (like SNMP, log files, serial lines and so on)
to syslog, so such data may eventually end up as syslog data. In this
sense, this paper also describes these data entities.
Data and Information
Unfortunately, IT terms tend to have different definitions depending on who is
asked. So I am providing some definitions of my own, which I will at least use
consistently throughout my papers.
Data, in our context, is the representation of
facts in a machine-processable form. Data itself consists of objective facts
without any interpretation. If meaning is assigned to data, it becomes
information. Information is data that an individual has
assigned meaning to. As such, information is inherently subjective,
which means its meaning depends on the subject that assigns it.
Let's look at a real-world example: My car may run at 100 mph at a given
point in time. This is a fact, thus it is data. The information this data
conveys can be very different: To a police officer measuring the speed
on a public highway, it probably means that I am in for some light to
heavy trouble (depending on legislation and actual limits). If I am the driver
and I am a speed addict, this may mean to me that the car is performing nicely
... - the same fact (data) obviously leads to a different interpretation. It is
the interpretation that transforms facts (data) into actual information.
Please note that there are multiple layers of this transformation. For example,
the core computer entity is the bit and, built upon it, the byte. Let's look
at the byte level for our next example. In the computer's main memory, the
following 4 bytes may be present (in this order): 0x54, 0x65, 0x73, 0x74.
Depending on the interpretation of these bytes, they can carry very different
information, for example:
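-
the string "Test" (read as ASCII characters)
-
the number 1,415,934,836 (read as a 32-bit unsigned integer, most significant
byte first)
-
the number 1,953,719,636 (read as a 32-bit unsigned integer, least significant
byte first)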
This list could easily be extended - it all depends on how the data is defined,
which means it depends on the meaning assigned to it. Data plus meaning will
become information - but obviously only at that level. Let's assume that the
byte sequence was character data and thus the information contained in it is
"Test". If we now look from a higher level, "Test" again is not information,
but only a fact - data. It becomes information if we assign a meaning to it.
For example, if that data entity describes the status of e.g. a source code
module, "Test" may be the indication that the source has been written and is
currently being tested - it has been turned into information at a higher layer.
I think we can probably stack layer on layer for a long time.
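The byte example can be tried out with a minimal Python sketch. The choice of
Python and the function name interpretations are assumptions made for
illustration only. The integer readings assume a 32-bit unsigned value in
either byte order, which the text itself does not specify.

```python
import struct

# The four bytes from the example, in memory order.
raw = bytes([0x54, 0x65, 0x73, 0x74])


def interpretations(data: bytes) -> dict:
    """Return several readings of the same four bytes (illustrative only)."""
    return {
        # Reading 1: four ASCII characters.
        "ascii": data.decode("ascii"),
        # Reading 2: one 32-bit unsigned integer, most significant byte first.
        "uint32_be": struct.unpack(">I", data)[0],
        # Reading 3: one 32-bit unsigned integer, least significant byte first.
        "uint32_le": struct.unpack("<I", data)[0],
    }


views = interpretations(raw)
for name, value in views.items():
    print(f"{name}: {value}")
```

The data never changes between the three readings. Only the meaning assigned
to it does.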
In essence: what is data and what is information
depends on the context. For any given context, "data" means the raw facts
(objective) while "information" means the meaning that has been assigned to the
facts by an individual (subjective). The "individual", of course, may also be
any process, for example an automated one.
For the purpose of decision-making, data is useless. It must first be processed
to become information and then this information can be used for any decision
making.
An interesting fact is that the same set of data can yield different sets of
information, depending on which meaning is assigned to the data set. In
log analysis, this finding can be very helpful. It even enables us to transform
the same data set into multiple information sets, which, at a higher layer, may
again be used as diverse sets of data which in turn are processed into further
(sets of) information. This can ultimately yield better highest-level
information than direct transformations of the data set. But this process is
beyond the scope of this document.
Log Data
Now that we know what data actually is, we can have a look at log data. Some
time ago, in [1], I defined log data to be:
Log data is emitted by devices and applications and is typically in free
text form. We can assume that log data will be stored in a file (or can be
written into one) by collector software. In file format, log data consists of
lines terminated by the OS line terminator (in most cases CRLF or LF).
Typically, one logged message occupies one line, but there may exist some
anomalies where more than a single line represents a single log message (or
there are multiple line terminators). Log lines will typically not contain
non-printable characters or characters with integer values over 254. Integer
values between 128 and 254 should not occur, but are quite common in most
places of the world. Each line consists of both a textual message string as
well as parameter data (e.g. IP addresses, ports, user names and the like).
There is no standard for the message contents (neither an RFC nor any
"mainstream" industry standard). Each log emitter uses its own, nonstandard
format. Even the same emitter, in different versions, may use a different
format. It is up to the emitter's decision (or configuration) which data will
be emitted. There is a great variety in what emitters think is worth logging.
Log data is often transferred via the syslog protocol or plainly written to
text files, but there is also a variety of other means by which it can be
transferred or emitted.
This definition still describes what we observe in the real world. Recent
developments in syslog, like syslog-protocol [2], touch on the point that log
data can potentially contain any character, including control characters. However,
we can (hopefully) safely assume that upcoming implementations of this standard
will use a way to escape control characters when persisting them to a file. So
I think it is still a usable base for this paper.
We can extract some key points from this definition:
-
log data is emitted by configurable devices
-
there is no specific format present in syslog messages
-
it can be assumed that a given device will emit messages in the same format, at
least until it is upgraded or re-configured
-
a single log entry is represented as a character stream
Especially the fact that a log entry is essentially a character stream is
interesting. If we combine this knowledge with the other parts of the
definition, general computer science principles and observed behaviour, we can
assume that the following holds for an individual log entry's character
stream:
-
the character stream is made up of individual tokens
-
each token consists of one or more of the characters from the stream
If we identify tokens, we actually look at the syslog message at a higher level.
At this level, the characters are the data, while the tokens are information.
However, this information still is not what we - the user / analysis script -
are actually looking for. So tokens are still raw data from our higher-level
point of view. In ABNF [3], we can formalize this finding as follows:
LOGMSG = *TOKEN
TOKEN = *CHR
CHR = %d0-255
If we use this ABNF, we can say that a log entry can be seen not only as a
stream of characters, but also as follows:
A log entry is a stream of (log) tokens.
This definition will be helpful in understanding the nature of the syslog
message. Unfortunately, it does not provide any insight into how the individual
tokens are terminated. This is because there is no simple way to tell the
format, which in turn is because:
The token format (tf) is a function of the emitter software (e),
its version (v) and configuration (c).
tf = f(e, v, c)
This essentially means that we cannot provide a generic tokenizer - but we
can provide one for each known e, v and c. This also implies that we have a
finite set of token formats, but, due to the large variety of software and
versions and especially the numerous ways to configure a product, a very large
one. This may sound discouraging at first, but it is not actually that bad.
For any given installation, the size of the token format set is considerably
smaller, simply because only a very limited number of e, v, c
combinations should be in use in a single (well-designed) actual installation.
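One way to picture tf = f(e, v, c) is a lookup table that maps each known
(e, v, c) combination to its own tokenizer. The following Python sketch is only
an illustration. The emitter name "exampled", its versions and the two
tokenizers are hypothetical and not taken from any real product.

```python
from typing import Callable, Dict, List, Tuple

Tokenizer = Callable[[str], List[str]]
FormatKey = Tuple[str, str, str]  # (emitter e, version v, configuration c)

# The finite set of known token formats for one installation.
TOKENIZERS: Dict[FormatKey, Tokenizer] = {}


def register(e: str, v: str, c: str, tokenizer: Tokenizer) -> None:
    """Make a tokenizer known for one (e, v, c) combination."""
    TOKENIZERS[(e, v, c)] = tokenizer


def tokenize(e: str, v: str, c: str, message: str) -> List[str]:
    """Tokenize a message with the tokenizer registered for (e, v, c)."""
    try:
        tokenizer = TOKENIZERS[(e, v, c)]
    except KeyError:
        # Without a known token format there is nothing generic to fall back on.
        raise LookupError(f"no token format known for {(e, v, c)}") from None
    return tokenizer(message)


# Hypothetical emitter whose token separator changed between versions.
register("exampled", "1.0", "default", lambda msg: msg.split(" "))
register("exampled", "2.0", "default", lambda msg: msg.split(";"))

print(tokenize("exampled", "1.0", "default", "login ok"))
print(tokenize("exampled", "2.0", "default", "login;ok"))
```

The table stays small for a single installation because only a few
combinations are actually in use there.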
Generic Tokenizing
While we cannot write a generic tokenizer without knowledge of the token
format, we can specify some generics about token formats themselves. Observed
behaviour is:
-
tokens are usually terminated by terminator characters (e.g. colon, space,
quote, slash)
-
tokens contain values and possibly a name for the value (e.g. "IP=127.0.0.1")
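These two observations can be turned into a naive Python sketch. It assumes
that space and comma are the terminator characters and that "=" separates a
name from its value. Both are assumptions for this example only, as real
emitters differ, and the function names are hypothetical.

```python
import re

# Assumed terminator characters for this example: space and comma.
TERMINATORS = re.compile(r"[ ,]+")


def split_tokens(message: str) -> list:
    """Split a message at runs of terminator characters, dropping empty pieces."""
    return [piece for piece in TERMINATORS.split(message) if piece]


def name_and_value(token: str) -> tuple:
    """Return (name, value); name is None if the token carries no name."""
    name, separator, value = token.partition("=")
    if separator:
        return name, value
    return None, token


tokens = split_tokens("IP=127.0.0.1 port=514, denied")
pairs = [name_and_value(token) for token in tokens]
print(pairs)
```

Such a heuristic breaks as soon as a value itself contains a terminator
character, which is one reason why a truly generic tokenizer is not possible.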
Semantics
A token, as described here, is a semantic object. That is, each
token describes a specific object, e.g. an IP address, a byte count, a retry
count or whatever else the emitter thinks to be noteworthy. Obviously, again
there is a very large set of semantic objects.
However, there are many semantic objects that are used very frequently and/or
are very important and/or must be present by the nature of the message. To
clarify the latter, let's look at an example: The notion of a "sender" object
must be present for any message received - simply by the fact that a
sender is needed to transmit something. Without sender, nothing would be
transmitted and without transmission, nothing could be received. Of course, the
sender may hide itself from the receiver, so that the receiver does not know
who the sender is. Even in this scenario, we would have a sender (we know this
by the nature of the message) and we would know that the sender is hidden (we
know this by the fact that we do not have information on the sender). So we
still have information on the sender - that is the information that we do not
know it (it is hidden).
So we have basically two classes of semantic objects:
a) those that are present in many messages (whatever many means) and
b) those that are present in few messages
If we like, we could further sub-class the a) class into those that aa) must be
present by message nature, ab) are highly likely to be present because a
standard defines them, ac) are likely to be present because they are observed
in most messages, and so on. I think, however, that this sub-classification
does not really help us tackle the logging problem, so I will leave it out
(at least for now). Common to all class a) semantic objects is that they are
present in messages from a large variety of emitter software, versions and
configurations (the e, v, c from above). That is, semantically,
they are present in a large percentage of the token format set described
above.
Class b) semantic objects represent either uncommon ideas or seldom-reported
events.
In any case, we can assume that it is possible to actually define
the set of class a) semantic objects. There have been efforts to create
dictionaries of these class a) tokens in the past. One such effort is Marcus
Ranum's logging data map, which I unfortunately was no longer able to find
online. Another one is the NetIQ "Webtrends Enhanced Log Format" (WELF) [4]. I
am sure there were also other efforts to classify these class a) semantic
objects. This proves the point that it is doable - the set is small enough. As
such, I will refer to these semantic objects as "standardizable semantic
objects" (SSO). Consequently, I call class b) semantic objects "non-standardizable
semantic objects" (NSSO). This is not 100% correct, because there
is no sharp border between the two. An NSSO becomes an SSO once it is
used frequently enough. For the same reason, an SSO may be demoted to an
NSSO (though this is not really advisable once a logging map has been
agreed upon).
For a given emitter software and its version, we have a finite set of
semantic objects (both SSOs and NSSOs). The semantics are very unlikely to
depend on the actual configuration. So the overall set of semantic objects
is:
SO = f(e, v)
This also implies that we can limit the set of semantic objects by
limiting the number of emitter software products and versions. This manifests
itself in real software - it is often easy to find an analysis script for a
specific software (e.g. one that analyses postfix 2.0 logs), while it is hard
to find one that analyses all software (and versions) an installation runs.
Token Syntax
In this document, token syntax means the way a specific token
is formed. The syntax is at least a function of the emitter software, partly of
its version and, unlikely but observed, of its configuration.
Unfortunately, even SSOs do not have a standardized syntax.
Thankfully, we can apply some knowledge from the semantics above. That is, the
token often indicates both which semantic object it represents
(a name) and the actual value. An example of this may be a token formed
like this: "IP=127.0.0.1". For a human reader, it is obvious that the emitter
intends to describe a semantic object (which it calls "IP"). Due to the broad
range of different representations, it is not similarly obvious for a computer
script.
Another way to indicate a semantic object is by the position of
the token in the log data token stream. An example of this is the syslog tag,
which becomes the tag simply by the fact that it is present as the 4th token
inside an (RFC 3164-formatted [5]) syslog message. Obviously, a token identified
by its position in the token stream is somewhat harder to identify than a token
that carries a semantic object indicator in itself.
As a first rule, we can say, we have two classes of tokens:
a) those that include an indication of the semantic object they represent
b) those that do not
Class b) tokens need to be assigned their respective semantic meaning by using
context information. In almost all cases, this context information is the
position in the token stream (at least this is observed behaviour). While other
context-dependent information is imaginable, we will assume that class b)
tokens can always be assigned a semantic object by their position inside the
token stream.
This leads us to the following rule:
Tokens can be assigned to a semantic object either by information
contained in the token or by their position.
As such, a token is a set of
-
optional identifying information (token id, tid)
-
optional filler characters (these increase human readability and may also be
included in a value, e.g. "BYTE=10,000" to denote a value of ten thousand)
-
a value (token value, tval)
A value must always be present, because this is what the token ultimately
intends to provide. In this point of view, we could also describe a token as a
set of data and optional meta-data where the meta-data describes the semantics
(and provides readability information).
Please note that some messages may contain tokens that look like just
identifying information without a value. An example may be this: "IP=" - nothing
else. Actually, the value is not missing here, it is just a nil-value, an
indication that the sender (for whatever reason) did not include the value.
This is information in itself and may be very useful for a process
handling these messages. As an obscure case, this very same example may
actually be a value in itself, thus being a class b) token as described above.
In this case the value would actually be "IP=" and its semantics would be
assigned by its position inside the token stream.
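The token structure described above (tid, filler, tval) and the nil-value can
be sketched in Python as follows. The class and function names are
hypothetical, "=" is assumed as the only filler between name and value, and
None is an assumed representation of the nil-value.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Token:
    tid: Optional[str]   # identifying information, None if absent
    filler: str          # filler characters between tid and value
    tval: Optional[str]  # the value; None stands for the nil-value


def parse_named_token(text: str, filler: str = "=") -> Token:
    """Split a class a) token such as "IP=127.0.0.1" into its parts."""
    tid, separator, rest = text.partition(filler)
    if not separator:
        # No identifying part: the whole text is the value (class b) token).
        return Token(tid=None, filler="", tval=text)
    return Token(tid=tid, filler=separator, tval=rest if rest else None)


def as_integer(token: Token) -> int:
    """Drop filler characters inside a value, e.g. "10,000" becomes 10000."""
    if token.tval is None:
        raise ValueError("nil-value has no integer representation")
    return int(token.tval.replace(",", ""))


print(parse_named_token("IP=127.0.0.1"))
print(parse_named_token("IP="))
print(as_integer(parse_named_token("BYTE=10,000")))
```

The obscure case where "IP=" is itself a positional value is not covered by
this sketch. A positional parser would simply keep the whole text as tval.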
Exceptions from these rules?
Above, I have said that every token must have a token value.
The following is an excerpt from an actual message:
SA msg read: 3 (0%), SA parse: 5 (0%)
I would like to point to a problem that we may run into. To show the
issue, I would like to interpret this message so that it contains three tokens
(other interpretations are possible and may also be more appropriate):
-
"SA msg read: 3 (0%)"
-
", "
-
"SA parse: 5 (0%)"
The main issue is token number 2, ", ". If we apply the rules above, we end up
with a value-only token with value ", ". At first glance, a human reader
immediately notices that this is just a formatting sequence, intended to make
reading easier. So effectively, these are filler characters. And it is this
knowledge that solves our problem: The theory is actually right, we have a
token with value ", " and it is identified by its position in the token stream.
The semantic object it is to be assigned to is "filler",
which in most cases means it can be discarded. The important point, though, is
that we have a semantic object (here named "filler") that we can include in our
map of SSOs. So this is no exception to our rule.
However, if we look at tokens number 1 and 3, we may see exceptions. In both
cases, they contain two values - one absolute value and a
percentage. This does not map to what we have defined so far. One solution is
to rethink the tokenizing process - possibly we have identified the
wrong tokens. Let's concentrate on "SA msg read: 3 (0%)". We could also
express this as four tokens:
-
"SA msg read: 3"
-
" ("
-
"0"
-
"%)"
If we apply this tokenizing, we have one token containing a token id and a
value and three tokens containing only values - two of them being of filler
semantics. All three of these are identified by their position. If we follow
that idea further, we may also tokenize it as follows:
-
"SA msg read: "
-
"3"
-
" ("
-
"0"
-
"%)"
In this case, all tokens are position-identified and we have three tokens with
filler semantics. So it looks like we can always use positional identification
for tokens. If that is actually the case, generic parsing becomes less
complicated.
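The purely positional tokenizing of "SA msg read: 3 (0%)" can be sketched in
Python. The layout table below lists, in order, which semantic object is
expected at each position. The names "sa-msg-read-count" and
"sa-msg-read-percent" are made up for this example, and describing each
position with a regular expression is likewise an assumption.

```python
import re

# Positional layout: (semantic object, pattern expected at that position).
LAYOUT = [
    ("filler", r"SA msg read: "),
    ("sa-msg-read-count", r"\d+"),
    ("filler", r" \("),
    ("sa-msg-read-percent", r"\d+"),
    ("filler", r"%\)"),
]


def tokenize_by_position(message: str, layout: list) -> list:
    """Walk the message left to right, taking one token per layout entry."""
    tokens = []
    position = 0
    for semantic, pattern in layout:
        match = re.compile(pattern).match(message, position)
        if match is None:
            raise ValueError(f"position {position}: expected {pattern!r}")
        tokens.append((semantic, match.group(0)))
        position = match.end()
    return tokens


for semantic, value in tokenize_by_position("SA msg read: 3 (0%)", LAYOUT):
    print(f"{semantic!r}: {value!r}")
```

The three "filler" tokens carry no value for analysis, but a mismatch in any
of them shows that the message does not have the expected format.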
Using only Position-Identified Tokens
If we follow the idea of identifying tokens only by their position, each token
is just reduced to a value. Each token will either have the "filler" semantic
or any other semantic. Tokens of "filler" semantics can be ignored for analysis
but can be used to identify if the message is malformed (if we instruct the
parser to verify that the read value actually is the value
that we expect - more on this later).
As we progress on this route, it may even be more appropriate to say that the
log entry is not actually a token stream but a stream of tokens and intermixed
"filler" characters. This would lead us to this ABNF:
LOGMSG = *(*TOKEN *FILLER)
TOKEN = *CHR
FILLER = *CHR
CHR = %d0-255
While this definition does not look more promising than the initial one, it can
greatly simplify the creation of generic parsers: we now have entities (the
token) that we need to extract and other entities (the fillers) that we can use
as a guideline for the parsing process.
But is it safe to use this understanding of the log entry? I think it is, as we
said that tf = f(e, v, c), which means that the token stream will remain in a
constant format once a device is configured. This matches observed behaviour.
With our new definition, we are just (ab)using the identifying part contained in
some tokens to guide the parser.
So, for the rest of this paper, we say:
A log entry is a stream of tokens and intermixed filler characters. The
filler characters in themselves are meaningless but can be used for verification
of message format correctness.
There is one special case that we need to look at - that of the nil-value.
Above, we had a token "IP=" which had an identifying part of "IP", a filler part
of "=" and a nil-value. With the new definition, does this mean we cannot
detect this nil-value? No, we actually can. It depends on a (proper) parser
algorithm. If the parser finds "IP=", it should be able to match this with the
expected filler of "IP=". As it works purely positionally, it then assumes that
the next (character) position contains the value - as there is no further
position, we still have a nil-value. But what if we have a log entry like
"IP= " (with a space character after the equal sign)? This would indeed cause
some trouble, but this trouble can be avoided by introducing token value
syntaxes.
Token Value Syntaxes
Token value syntaxes describe the syntax that a token value can have. There is a
limited set of syntaxes that almost all values can be built on. This concept is
very similar to ASN.1 or X.500 directory services - all of them have a limited
set of syntaxes that can be applied to (character representations of) values.
For our needs, we should be able to identify a limited set of syntaxes at least
for the SSOs - these will probably also be sufficient for most of the NSSOs.
At least the following syntaxes are needed:
-
integer
-
IP V4 address
-
IP V6 address
-
hostname (pure)
-
hostname with FQDN
-
timestamp(s) [several formats]
-
"n" characters
This list is not intended to be complete. The "n characters" syntax is special.
As we have free-form messages, we may need to pull an arbitrary number of
characters to satisfy a field. For the same reason, it may make sense to
introduce a "regular expression" syntax which allows grabbing all characters
matching a specific pattern. Such a regexp syntax could also work as a
safeguard for all syntaxes that we did not explicitly specify (which probably
would be most helpful in the case of NSSOs).
It is suggested that a potential SSO map includes the required syntax for
each SSO.
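A small Python sketch shows what such syntax modules might look like. Each
module takes the message and a start position and returns the parsed value
plus the position after it, or None if the syntax does not match there. This
interface, the syntax names and the SYNTAXES table are assumptions for
illustration. Only two of the syntaxes listed above are covered.

```python
import ipaddress
import re

INTEGER = re.compile(r"[+-]?\d+")
DOTTED_QUAD = re.compile(r"\d{1,3}(?:\.\d{1,3}){3}")


def parse_integer(text: str, position: int):
    """Return (int value, end position) or None if no integer starts here."""
    match = INTEGER.match(text, position)
    if match is None:
        return None
    return int(match.group(0)), match.end()


def parse_ipv4(text: str, position: int):
    """Return (address string, end position) or None if no IPv4 address starts here."""
    match = DOTTED_QUAD.match(text, position)
    if match is None:
        return None
    try:
        # Let the standard library reject out-of-range octets such as 999.
        ipaddress.IPv4Address(match.group(0))
    except ValueError:
        return None
    return match.group(0), match.end()


# Hypothetical map from syntax name to its parsing module.
SYNTAXES = {"integer": parse_integer, "ipv4": parse_ipv4}

print(SYNTAXES["ipv4"]("IP=172.16.0.1,", 3))
print(SYNTAXES["ipv4"]("IP= ", 3))
```

In the "IP= " case the ipv4 module reports no match at the value position, so
a parser can treat the value as nil instead of mistaking the space for data.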
Finally: Parsing the Log Entry
Now that we have defined the necessary objects, a generic parser can be built
relatively simply. To build it, we need:
-
parsing modules for the defined syntaxes
-
a list of the SSOs and their assigned syntaxes
-
a way to process the filler characters, probably via a regular expression
engine
-
a way to assign an identified value to an application-object
A generic parser can use these components to automatically parse ANY message, as
long as a template for the specific message/token format (tf) is given.
Of course, this is what specialised parsers have done for decades - they are
written for a specific log format and parse it. When a new log format appears,
a new parser is written (probably copied over from a similar one). The
difference in the approach outlined here is that the parser engine is never
rewritten - and does not even need to be recompiled to support a new log format
(at least as long as syntaxes and SSOs do not change). All that needs to be
done is to provide a new template to the parser and, of course, to tell the
parser which template to use for which message. The latter is obviously vital,
but I consider it to be out of the scope of this paper. There are already
enough ways to route messages to specific message parsers so that
this does not impose any notable functional or implementation limit.
The parser does the following: it is passed a parsing template and
the actual log entry (message). The parsing template describes
the filler characters as well as the positional tokens and their syntaxes. There
are numerous ways to represent this template. To build an example, I am using a
simple form where all text is to be taken literally except for text between
percent signs, which represents tokens. This may be a hypothetical template:
"IPFROM=%source-addr:ipv4%, IPTO=%dest-addr:ipv4%"
Inside the percent "sequence", we have %SSO:syntax%. I hope the example is now
self-explanatory.
If the following log entry
"IPFROM=172.16.0.1, IPTO=172.16.0.2"
were parsed by this template, the semantic object "source-addr" would be
assigned the value "172.16.0.1" and the semantic object "dest-addr" would be
assigned the value "172.16.0.2". The rest of the message would be discarded.
If the message
"IPFORM=172.16.0.1, IPTO=172.16.0.2" (note the typo - ipfORm!)
were passed to the parser, it would not assign any value (and possibly
flag an error to its caller), because the message does not match the parsing
template (this also indicates why regular expressions to specify the filler
characters can be helpful).
A positive side-effect of position-dependent parsing is that the parser can be
single-pass (except when regular expressions are used, but even then it is
primarily forward-oriented). The parser only needs to follow the template until
a semantic object is found. While doing this, it must just verify that the
provided string matches the template. Once it detects a semantic object, it
uses the provided syntax definition to parse the object. Once done, the value
can be assigned to the semantic object (or be passed to an upper layer or
whatever is appropriate in regard to the calling framework). Then, the parser
carries on with template processing from where the syntax module left it. This
process continues until either the end of the template or the end of the
log message is reached or an error occurs. It is up to the implementor
whether a premature end of either the message or the template is considered an
error.
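The parsing process described above can be sketched in a few lines of Python.
This is a minimal illustration under stated assumptions, not a reference
implementation: the function names, the SYNTAX_PATTERNS table and the use of
regular expressions as syntax modules are all choices made for this example.
Filler text is matched literally, and leftover message characters after the
end of the template are treated as an error.

```python
import re

# Assumed syntax modules, reduced to one regular expression per syntax name.
SYNTAX_PATTERNS = {
    "ipv4": re.compile(r"\d{1,3}(?:\.\d{1,3}){3}"),
    "integer": re.compile(r"\d+"),
}

# A token placeholder in the template: %SSO:syntax%
FIELD = re.compile(r"%([^%:]+):([^%:]+)%")


def compile_template(template: str) -> list:
    """Split a template into ("filler", text) and ("token", sso, syntax) steps."""
    steps = []
    position = 0
    for match in FIELD.finditer(template):
        if match.start() > position:
            steps.append(("filler", template[position:match.start()]))
        steps.append(("token", match.group(1), match.group(2)))
        position = match.end()
    if position < len(template):
        steps.append(("filler", template[position:]))
    return steps


def parse(template: str, message: str):
    """Return {SSO: value}, or None if the message does not match the template."""
    values = {}
    position = 0
    for step in compile_template(template):
        if step[0] == "filler":
            # Filler is only verified, never stored.
            if not message.startswith(step[1], position):
                return None
            position += len(step[1])
        else:
            _, sso, syntax = step
            match = SYNTAX_PATTERNS[syntax].match(message, position)
            if match is None:
                return None
            values[sso] = match.group(0)
            position = match.end()
    # Design choice of this sketch: trailing characters count as a mismatch.
    return values if position == len(message) else None


TEMPLATE = "IPFROM=%source-addr:ipv4%, IPTO=%dest-addr:ipv4%"
print(parse(TEMPLATE, "IPFROM=172.16.0.1, IPTO=172.16.0.2"))
print(parse(TEMPLATE, "IPFORM=172.16.0.1, IPTO=172.16.0.2"))
```

Supporting a new log format only requires a new template string. The parse
function itself stays untouched as long as no new syntax is needed.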
Conclusion
By using well-defined semantics and syntaxes plus a template-based parser, we
can parse any log message and store the result for an important set of
properties - the standardizable semantic objects (SSOs) - via a
generic parser. The value of this lies not only in the minimal effort of parser
creation (only templates need to be updated/defined), but also in a
standardized view of important objects. These SSOs can be further utilized in
higher-level log analysis scripts that are now able to act on a standardized
token stream and are thus vendor-, version- and configuration-independent.
Even if a total standardization of the semantic objects cannot be achieved, it
is much easier to address differences in configuration settings with a
generic log parser. So we may ultimately be interested in writing a log
analysis script for e.g. Cisco IOS 11.1. If we assume that user configuration
can change the log properties (and their sequence inside the message), we can
now use our generic parser plus templates and have it generate an intermediary
data layer that our analysis script can then utilize. This will unbundle our
analysis script from the actual configuration. Of course, with only a little
more effort we could also unbundle it from a specific software version.
I am not sure if we can ever write a generic router or firewall analysis script
- vendors offer such different logging that this is very challenging. At least I
am not sure if such a generic script would actually be useful. But even without
that, by eliminating the software release and configuration settings
dependency, we are sharply reducing the number of different things an analysis
script needs to look at.
Credits
We would like to thank the following people for their input on this paper or
important thoughts they have published somewhere else:
We have tried to include everyone who made a contribution, but someone might
accidentally not be included. If you feel we have forgotten you on the list,
please accept my apologies and let me know.
References
Revision History
2004-03-09: Initial version begun.
Copyright
This document is copyrighted © 2003 by Adiscon GmbH and Rainer Gerhards. Anybody
is free to distribute it without paying a fee as long as it is distributed
unaltered and there is only a reasonable fee charged for it (e.g. a copying fee
for a printout handed out). Please note that "unaltered" means as either this
web page or a printout of the same on paper. Any other use requires previous
written authorization by Adiscon GmbH and Rainer Gerhards.
If you place the document on a web site or otherwise distribute it to a broader
audience, I would appreciate it if you let me know. This serves two needs:
first, I am able to notify you when there is an update available (that is no
promise!), and second, I am a creature of curiosity and simply interested
in where the paper pops up.
Author's Address
Rainer Gerhards
Adiscon GmbH
rgerhards@adiscon.com
www.adiscon.com
Disclaimer
The information within this paper may change without notice. Use of this
information constitutes acceptance for use in an AS IS condition. There are NO
warranties with regard to this information. In no event shall the author be
liable for any damages whatsoever arising out of or in connection with the use
or spread of this information. Any use of this information is at the user's own
risk.