This is work in progress! It may be wrong, misleading, or even entirely useless. It reflects the author's thoughts so far and may change completely. Use it with care!

On the Nature of Syslog Data

Abstract

This paper describes the "nature" of syslog data. It looks at how syslog data is structured and what the syntaxes and semantics of the log data are. The entities making up the log record are identified and defined. Syntaxes and semantics typically found are also described and defined. The intention of this paper is to provide a theoretical model describing the structure of real-world log data. With such a theoretical model, further work can be done to define a set of well-known log message properties which in turn can be used to build generic log analysis algorithms and tools. The theoretical model created in this paper should also enable the creation of log parsers that parse individual log messages into a generic format.

What is syslog data?

This paper purely addresses syslog data. Most importantly, syslog data is data that has been transmitted via the syslog protocol. It is NOT SNMP-based data and also not data from log files. Obviously, there are tools that allow relaying from other log sources (like SNMP, log files, serial lines and so on) to syslog, so such data may eventually end up as syslog data. In this sense, this paper also describes these data entities.

Data and Information

Unfortunately, IT terms tend to have different definitions depending on who is asked. So I am providing some definitions of my own, which I will at least use consistently throughout my papers.

Data, in our context, is the representation of facts in a machine-processable form. Data by itself is objective facts without any interpretation. If meaning is assigned to data, it becomes information.

Information is data to which an individual has assigned meaning. As such, information is inherently subjective, which means its meaning depends on the subject that assigns it.

Let's look at a real-world sample: my car may run at 100 mph at a given point in time. This is a fact, thus it is data. The information this data conveys can be very different: if a police officer measures that speed on a public highway, it probably means I am in for some light to heavy trouble (depending on legislation and actual limits). If I am the driver and a speed addict, it may mean to me that the car is performing nicely... - the same fact (data) obviously leads to a different interpretation. It is the interpretation that transforms facts (data) into actual information.

Please note that there are multiple layers of this transformation. For example, the most basic computer entity is the bit, and built upon it, the byte. Let's look at the byte level for our next example. In the computer's main memory, the following 4 bytes may be present (in this order): 0x54, 0x65, 0x73, 0x74. Depending on the interpretation of these bytes, they can carry very different information, for example:

  • the string "Test"
  • the number 1,415,934,836 (the bytes read as a big-endian 32-bit integer)
  • the number 1,953,719,636 (the bytes read as a little-endian 32-bit integer)

This list could easily be extended - it all depends on how the data is defined, which means it depends on the meaning assigned to it. Data plus meaning becomes information - but obviously only at that level. Let's assume that the byte sequence was character data and thus the information contained in it is "Test". If we now look from a higher level, "Test" again is not information, but only a fact - data. It becomes information once we assign a meaning to it. For example, if that data entity describes the status of, e.g., a source code module, "Test" may indicate that the source has been written and is currently being tested - it has been turned into information at a higher layer. We can probably stack layer upon layer for a long time.
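
To make this concrete, here is a minimal sketch (in Python, purely illustrative) of how the same four bytes yield different information depending on the interpretation applied:

 import struct

 # The same four bytes, interpreted three different ways.
 data = bytes([0x54, 0x65, 0x73, 0x74])

 print(data.decode("ascii"))          # the string "Test"
 print(struct.unpack(">I", data)[0])  # 1415934836 - big-endian 32-bit integer
 print(struct.unpack("<I", data)[0])  # 1953719636 - little-endian 32-bit integer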

In essence: what is data and what is information depends on the context. For any given context, "data" means the raw facts (objective) while "information" means the meaning that has been assigned to the facts by an individual (subjective). The "individual", of course, may also be any process, for example an automated one.

For the purpose of decision-making, data is useless. It must first be processed to become information, and this information can then be used for decision-making.

An interesting fact is that the same set of data can yield different sets of information, depending on which meaning is assigned to the data set. In log analysis, this finding can be very helpful. It even enables us to transform the same data set into multiple information sets, which, at a higher layer, may again be used as diverse sets of data which in turn are processed into another (set of) information. This can eventually yield better highest-level information than direct transformations of the data set. But this process is beyond the scope of this document.

Log Data

Now that we know what data actually is, we can have a look at log data. Some time ago, in [1], I defined log data to be:

Log data is emitted by devices and applications and is typically of free-text form. We can assume that log data will be stored in a file (or can be written into one) by collector software. In file format, log data consists of lines terminated by the OS line terminator (in most cases CRLF or LF). Typically, one logged message occupies one line, but there may exist some anomalies where more than a single line represents a single log entry (or there are multiple line terminators). Log lines will typically not contain non-printable characters or characters with integer values over 254. Integer values between 128 and 254 should not occur, but are in practice quite common in many places of the world. Each line consists of both a textual message string as well as parameter data (e.g. IP addresses, ports, user names and the like). There is no standard for the message contents (neither an RFC nor any "mainstream" industry standard). Each log emitter uses its own, nonstandard format. Even the same emitter, in different versions, may use a different format. It is up to the emitter's decision (or configuration) which data will be emitted. There is great variety in what emitters think is worth logging. Log data is often transferred via the syslog protocol or plainly written to text files, but there is also a variety of other means by which it can be transferred or emitted.

This definition still describes what we observe in the real world. Recent developments in syslog, like syslog-protocol [2], address the fact that log data can potentially contain any character, including control characters. However, we can (hopefully) safely assume that upcoming implementations of this standard will use a way to escape control characters when persisting them to a file. So I think it is still a usable base for this paper.

We can extract some key points from this definition:

  • log data is emitted by configurable devices
  • there is no specific format present in a syslog message
  • it can be assumed that a given device will emit messages in the same format, at least until it is upgraded or re-configured
  • a single log entry is represented as a character stream

Especially the fact that a log entry is essentially a character stream is interesting. If we combine this knowledge with the other parts of the definition, general computer science principles and observed behaviour, we can assume that the following holds for an individual log entry's character stream:

  • the character stream is made up of individual tokens
  • each token consists of one or more of the characters from the stream

If we identify tokens, we actually look at the syslog message at a higher level. At this level, the characters are the data, while the tokens are information. However, this information still is not what we - the user or analysis script - are actually looking for. So tokens are still raw data from our higher-level point of view. In ABNF [3], we can formalize this finding as follows:

 LOGMSG = *TOKEN
 TOKEN = 1*CHR
 CHR = %d00-255

Using this ABNF, a log entry can not only be seen as a stream of characters - we can also say:

A log entry is a stream of (log) tokens.

This definition will be helpful in understanding the nature of the syslog message. Unfortunately, it does not provide any insight into how the individual tokens are terminated. This is because there is no simple way to tell the format, which in turn is because:

The token format (tf) is a function of the emitter software (e), its version (v) and its configuration (c).
tf = f(e, v, c)

This essentially means that we cannot provide a fully generic tokenizer - but we can provide one for each known e, v and c. This also implies that we have a finite set of token formats - but due to the large variety of software and versions, and especially the numerous ways to configure a product, a very large finite set. This may sound discouraging at first, but it is actually not that bad. For any given installation, the size of the token format set is considerably smaller, simply because only a very limited number of e, v, c combinations should be in use in a single (well-designed) actual installation.
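
As a sketch of this consequence, one could imagine a tokenizer registry keyed by (e, v, c) combinations (Python; the key values and the postfix tokenizer are purely illustrative assumptions, not real product behaviour):

 # One tokenizer per known (emitter, version, configuration) combination.
 def tokenize_postfix_2_0(entry):
     # hypothetical: assume this (e, v, c) separates tokens by spaces
     return entry.split(" ")

 TOKENIZERS = {
     ("postfix", "2.0", "default"): tokenize_postfix_2_0,
 }

 def tokenize(e, v, c, entry):
     # there is no generic fallback - an unknown (e, v, c) means an unknown token format
     return TOKENIZERS[(e, v, c)](entry)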

Generic Tokenizing

While we cannot write a generic tokenizer without knowledge of the token format, we can state some generalities about token formats themselves. Observed behaviour is:

  • tokens are usually terminated by terminator characters (e.g. colon, space, quote, slash)
  • tokens contain values and possibly a name for the value (e.g. "IP=127.0.0.1")
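
A minimal sketch of these two observations (Python; the terminator set and the sample message are illustrative assumptions, not a general solution):

 import re

 TERMINATORS = r'[ :"/]'  # colon, space, quote, slash - per the list above

 def rough_tokenize(entry):
     # split the character stream into tokens at terminator characters
     tokens = [t for t in re.split(TERMINATORS, entry) if t]
     # collect tokens that carry a name for their value, e.g. "IP=127.0.0.1"
     named = {}
     for t in tokens:
         if "=" in t:
             name, _, value = t.partition("=")
             named[name] = value
     return tokens, named

 tokens, named = rough_tokenize("connection from IP=127.0.0.1 port=514")
 print(named)  # {'IP': '127.0.0.1', 'port': '514'}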

Semantics

A token, as described here, is a semantic object. That is, each token describes a specific object, e.g. an IP address, a byte count, a retry count or whatever else the emitter considers noteworthy. Obviously, there is again a very large set of semantic objects.

However, there are many semantic objects that are used very frequently and/or are very important and/or must be present by the nature of the message. To clarify the latter, let's look at an example: the notion of a "sender" object must be present for a received message - simply because a sender is needed to transmit something. Without a sender, nothing would be transmitted, and without transmission, nothing could be received. Of course, the sender may hide itself from the receiver, so that the receiver does not know who the sender is. Even in this scenario, we would have a sender (we know this by the nature of the message) and we would know that the sender is hidden (we know this by the fact that we have no information on the sender). So we still have information on the sender - namely, the information that we do not know it (it is hidden).

So we have basically two classes of semantic objects:

a) those that are present in many messages (whatever many means) and

b) those that are present in few messages

If we like, we could further sub-class class a) into those that aa) must be present by message nature, ab) are highly likely to be present because a standard defines them, ac) are likely to be present because they are observed in most messages, and so on. I think, however, that this sub-classification does not really help us in tackling the logging problem, so I will leave it out (at least for now). Common to all class a) semantic objects is that they are present in messages from a large variety of emitter software, versions and configurations (the e, v, c from above). That is, semantically, they are present in a large percentage of the token format set described above.

Class b) semantic objects represent either uncommon ideas or seldom-reported events.

In any case, we can assume that it is potentially possible to actually define the set of class a) semantic objects. There have been efforts to create dictionaries of these class a) tokens in the past. One such effort is Marcus Ranum's logging data map, which I unfortunately was no longer able to find online. Another one is the NetIQ "WebTrends Enhanced Log Format" (WELF) [4]. I am sure there were also other efforts to classify these class a) semantic objects. This supports the point that it is doable - the set is small enough. As such, I will refer to these semantic objects as "standardizable semantic objects" (SSOs). Consequently, I call class b) semantic objects "non-standardizable semantic objects" (NSSOs). This is not 100% correct, because there is no sharp border between the two. An NSSO will become an SSO if only it is used frequently enough. For the same reason, an SSO may be demoted to an NSSO (though this is not really advisable once a logging map has been agreed upon).

For a given emitter software and its version, we have a finite set of semantic objects (both SSOs and NSSOs). The semantics are very unlikely to depend on the actual configuration. So the overall set of semantic objects is:

SO = f(e, v)

This also implies that we can limit the set of semantic objects by limiting the number of emitter software products and versions. This manifests itself in real software - it is often easy to find analysis scripts for a specific software (e.g. one that analyses postfix 2.0 logs), while it is hard to find one that analyses all software (and versions) an installation runs.

Token Syntax

In this document, token syntax means the way a specific token is formed. The syntax is a function of at least the emitter software, partly of its version, and in rare (but observed) cases of its configuration.

Unfortunately, even SSOs do not have a standardized syntax. Thankfully, we can apply some knowledge from the semantics above. That is, a token often carries both an indication of which semantic object it represents (a name) and the actual value. An example may be a token formed like "IP=127.0.0.1". For a human reader, it is obvious that the emitter intends to describe a semantic object (which it calls "IP"). Due to the broad range of different representations, it is not similarly obvious to a computer script.

Another way to indicate a semantic object is by the position of the token in the log data token stream. An example of this is the syslog tag, which becomes the tag simply by the fact that it is present as the 4th token inside an (RFC 3164-formatted [5]) syslog message. Obviously, a token identified by its position in the token stream is somewhat harder to identify than a token that carries a semantic object indicator in itself.
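
As a sketch of such positional identification, the following extracts the tag from an RFC 3164-formatted message purely by position (Python; the fixed-width timestamp follows RFC 3164, while the helper name is an illustrative assumption):

 import re

 def extract_tag(line):
     # assumed layout per RFC 3164: <PRI>TIMESTAMP HOSTNAME TAG...
     pri = re.match(r"<\d{1,3}>", line)
     rest = line[pri.end():]
     # the RFC 3164 timestamp "Mmm dd hh:mm:ss" is always 15 characters,
     # so hostname and tag are found purely by position
     hostname, msg = rest[16:].split(" ", 1)
     # the tag is whatever occupies the next position, up to ':' or '['
     return re.match(r"[^:\[ ]+", msg).group(0)

 # sample message taken from RFC 3164
 print(extract_tag("<34>Oct 11 22:14:15 mymachine su: 'su root' failed for lonvick on /dev/pts/8"))
 # -> su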

As a first rule, we can say that we have two classes of tokens:

a) those that include an indication of the semantic object they represent

b) those that do not

Class b) tokens need to be assigned their respective semantic meaning by using context information. In almost all cases, this context information is the position in the token stream (at least this is observed behaviour). While other context-dependent information is imaginable, we will assume that class b) tokens can always be assigned a semantic object by their position inside the token stream.

This leads us to the following rule:

Tokens can be assigned to a semantic object either by information contained in the token or by their position.

As such, a token is a set of

  • optional identifying information (token id, tid)
  • optional filler characters (these increase human readability and may also be included in a value, e.g. "BYTE=10,000" to denote a value of ten thousand)
  • a value (token value, tval)

A value must always be present, because this is what the token ultimately intends to provide. From this point of view, we could also describe a token as a set of data and optional meta-data, where the meta-data describes the semantics (and provides readability information).

Please note that some messages may contain tokens that look like just identifying information without a value. A sample may be this: "IP=" - nothing else. Actually, the value is not missing here; it is just a nil-value, an indication that the sender (for whatever reason) did not include the value. This is information in itself and may be very useful for a process consuming these messages. As an obscure case, this very same sample may actually be a value in itself, thus being a class b) token as described above. In this case the value would actually be "IP=" and its semantics would be assigned by its position inside the token stream.
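
A minimal sketch of this token model (Python; the field names follow the tid/tval terminology above, the sample values are illustrative):

 from dataclasses import dataclass
 from typing import Optional

 @dataclass
 class Token:
     tval: Optional[str]        # the token value; None models the nil-value
     tid: Optional[str] = None  # optional identifying information, e.g. "IP"
     filler: str = ""           # optional readability characters, e.g. "="

 t1 = Token(tval="127.0.0.1", tid="IP", filler="=")  # "IP=127.0.0.1"
 t2 = Token(tval=None, tid="IP", filler="=")         # "IP=" - a nil-value
 t3 = Token(tval="mymachine")                        # class b): position-identified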

Exceptions to these rules?

Above, I have said that every token must have a token value. The following is an excerpt from an actual message:

SA msg read: 3 (0%), SA parse: 5 (0%)

I would like to point to a problem that we may run into. To show the issue, I will interpret this message as containing three tokens (other interpretations are possible and may also be more appropriate):

  1. "SA msg read: 3 (0%)"
  2. ", "
  3. "SA parse: 5 (0%)"

The main issue is token number 2, ", ". If we apply the rules above, we end up with a value-only token with value ", ". At first look, a human reader immediately notices that this is just a formatting sequence, intended to make reading easier. So effectively, these are filler characters. And it is this knowledge that solves our problem: the theory is actually right; we have a token with value ", " and it is identified by its position in the token stream. The semantic object it is to be assigned to is "filler", which in most cases means it can be discarded. The important point, though, is that we have a semantic object (here named "filler") that we can include in our map of SSOs. So this is no exception to our rule.

However, if we look at tokens number 1 and 3, we may see exceptions. In both cases, they contain two values - one absolute value and a percentage. This does not map to what we have defined so far. One solution is to re-think the tokenizing process - perhaps we have identified the wrong tokens. Let's concentrate on "SA msg read: 3 (0%)". We could also express this as four tokens:

  1. "SA msg read: 3"
  2. " ("
  3. "0"
  4. "%)"

If we apply this tokenizing, we have one token containing a token id and a value, and three tokens containing only values - two of them being of filler semantic. All of these three are identified by their position. If we take that route further, we may also tokenize it as follows:

  1. "SA msg read: "
  2. "3"
  3. " ("
  4. "0"
  5. "%)"

In this case, all tokens are position-identified and we have three tokens with filler semantics. So it looks like we can always use positional identification for tokens. If that is actually the case, generic parsing becomes less complicated.
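
To illustrate, the fully position-identified interpretation above could be represented like this (Python; the semantic object names are illustrative assumptions):

 # the five tokens of "SA msg read: 3 (0%)" and their semantics, by position
 tokens    = ["SA msg read: ", "3", " (", "0", "%)"]
 semantics = ["filler", "sa-msg-read", "filler", "sa-msg-read-pct", "filler"]

 # fillers are discarded; the remaining values keep their positional semantics
 values = {s: t for t, s in zip(tokens, semantics) if s != "filler"}
 print(values)  # {'sa-msg-read': '3', 'sa-msg-read-pct': '0'}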

Using only Position-Identified Tokens

If we follow the idea of identifying tokens only by their position, each token is reduced to just a value. Each token will either have the "filler" semantic or some other semantic. Tokens of "filler" semantic can be ignored for analysis but can be used to detect whether the message is malformed (if we instruct the parser to verify that the read value actually is the value that we expect - more on this later).

As we progress along this route, it may even be more appropriate to say that a log entry is not actually a token stream but a stream of tokens and intermixed "filler" characters. This leads us to this ABNF:


 LOGMSG = *(TOKEN / FILLER)
 TOKEN = 1*CHR
 FILLER = 1*CHR
 CHR = %d00-255

While this definition does not look more promising than the initial one, it can greatly simplify the creation of generic parsers: we now have entities (the tokens) that we need to extract and other entities (the fillers) that we can use as a guideline for the parsing process.

But is it safe to use this understanding of the log entry? I think it is: as we said, tf = f(e, v, c), which means that the token stream will remain in a constant format once a device is configured. This matches observed behaviour. With our new definition, we are just (ab)using the identifying part contained in some tokens to guide the parser.

So, for the rest of this paper, we say:

A log entry is a stream of tokens and intermixed filler characters. The filler characters are in themselves meaningless but can be used to verify message format correctness.

There is one special case that we need to look at - that of the nil-value. Above, we had a token "IP=" which had an identifying part of "IP", a filler part of "=" and a nil-value. With the new definition, does this mean we cannot detect this nil-value? No, we actually can. It depends on a (proper) parser algorithm. If the parser finds "IP=", it should be able to match this with the expected filler of "IP=". As it works purely positionally, it then assumes that the next (character) position contains the value - as there is no further position, we still have a nil-value. But what if we have a log entry like "IP= " (with a space character after the equal sign)? This would indeed cause some trouble, but this trouble can be avoided by introducing token value syntaxes.

Token Value Syntaxes

Token value syntaxes describe the syntax that a token value can have. There is a limited set of syntaxes that almost all values can be built on. This concept is very similar to ASN.1 or X.500 directory services - all of them have a limited set of syntaxes that can be applied to (character representations of) values. For our needs, we should be able to identify a limited set of syntaxes at least for the SSOs - these will probably also be sufficient for most of the NSSOs.

At least the following syntaxes are needed:

  • integer
  • IP V4 address
  • IP V6 address
  • hostname (pure)
  • hostname with FQDN
  • timestamp(s) [several formats]
  • "n" characters

This list is not intended to be complete. The "n characters" syntax is special. As we have free-form messages, we may need to pull an arbitrary number of characters to satisfy a field. For the same reason, it may make sense to introduce a "regular expression" syntax which allows grabbing all characters matching a specific pattern. Such a regexp syntax could also work as a safeguard for all syntaxes that we did not explicitly specify (which would probably be most helpful in the case of NSSOs).

It is suggested that a potential SSO map includes the required syntax for each SSO.
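
Such a syntax set could be sketched as a table of regular expressions (Python; the patterns are deliberately simplified illustrations - e.g. the ipv4 pattern does not range-check the octets - and not a proposed standard):

 import re

 SYNTAXES = {
     "integer":   r"\d+",
     "ipv4":      r"\d{1,3}(?:\.\d{1,3}){3}",
     "hostname":  r"[A-Za-z0-9-]+",
     "fqdn":      r"[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+",
     "timestamp": r"[A-Z][a-z]{2} [ \d]\d \d{2}:\d{2}:\d{2}",  # RFC 3164 style only
 }

 def match_syntax(syntax, text, pos=0):
     # try to consume a value of the given syntax at position pos
     m = re.compile(SYNTAXES[syntax]).match(text, pos)
     return (m.group(0), m.end()) if m else (None, pos)

 print(match_syntax("ipv4", "IP=127.0.0.1", 3))  # ('127.0.0.1', 12)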

Finally: Parsing the Log Entry

Now that we have defined the necessary objects, a generic parser can be built relatively simply. To build it, we need:

  • parsing modules for the defined syntaxes
  • a list of the SSOs and their assigned syntaxes
  • a way to process the filler characters, probably via a regular expression engine
  • a way to assign an identified value to an application-object

A generic parser can use these components to automatically parse ANY message, as long as a template for the specific message/token format (tf) is given.

Of course, this is what specialised parsers have done for decades - they are written for a specific log format and parse it. When a new log format appears, a new parser is written (probably copied over from a similar one). The difference of the approach outlined here is that the parser engine is never rewritten - and does not even need to be recompiled to support a new log format (at least as long as syntaxes and SSOs do not change). All that needs to be done is to provide a new template to the parser and, of course, to identify to the parser which template to use for which message. The latter is obviously vital, but I consider it to be out of the scope of this paper. There are already enough ways to route messages to specific message parsers, so this does not impose any notable functional or implementation limit.

The parser does the following: it is passed a parsing template and the actual log entry (message). The parsing template describes the filler characters as well as the positional tokens and their syntaxes. There are numerous ways to represent this template. To build a sample, I am using a simple form where all text is to be taken literally except for text between percent signs, which represents tokens. This may be a hypothetical template:

"IPFROM=%source-addr:ipv4%, IPTO=%dest-addr:ipv4%"

Inside the percent sequence, we have %SSO:syntax%. I hope the example is now self-explanatory.

If the following log entry

"IPFROM=172.16.0.1, IPTO=172.16.0.2"

were parsed with this template, the semantic object "source-addr" would be assigned the value "172.16.0.1" and the semantic object "dest-addr" would be assigned the value "172.16.0.2". The rest of the message would be discarded. If the message

"IPFORM=172.16.0.1, IPTO=172.16.0.2" (note the typo - ipfORm!)

were passed to the parser, it would not assign any values (and possibly flag an error to its caller), because the message does not match the parsing template (this also indicates why regular expressions for specifying the filler characters can be helpful).
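
A minimal sketch of such a template-driven parser (Python; the %SSO:syntax% notation follows the sample above, while the function names and the tiny syntax table are my illustrative assumptions):

 import re

 SYNTAXES = {"ipv4": r"\d{1,3}(?:\.\d{1,3}){3}", "integer": r"\d+"}
 TOKEN_REF = re.compile(r"%([^:%]+):([^%]+)%")  # matches %name:syntax%

 def parse(template, entry):
     values, tpos, epos = {}, 0, 0
     for ref in TOKEN_REF.finditer(template):
         filler = template[tpos:ref.start()]
         if not entry.startswith(filler, epos):   # verify the filler characters
             return None                          # message malformed for this template
         epos += len(filler)
         m = re.compile(SYNTAXES[ref.group(2)]).match(entry, epos)
         if m is None:                            # value does not satisfy its syntax
             return None
         values[ref.group(1)] = m.group(0)        # assign value to the semantic object
         epos, tpos = m.end(), ref.end()
     if not entry.startswith(template[tpos:], epos):  # trailing filler must match too
         return None
     return values

 tmpl = "IPFROM=%source-addr:ipv4%, IPTO=%dest-addr:ipv4%"
 print(parse(tmpl, "IPFROM=172.16.0.1, IPTO=172.16.0.2"))
 # -> {'source-addr': '172.16.0.1', 'dest-addr': '172.16.0.2'}
 print(parse(tmpl, "IPFORM=172.16.0.1, IPTO=172.16.0.2"))  # the typo case above
 # -> None

Note how this sketch simply alternates between verifying fillers and consuming values, always moving forward - which is exactly the single-pass behaviour described next.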

A positive side effect of position-dependent parsing is that the parser can be single-pass (except when regular expressions are used, but even then it is primarily forward-oriented). The parser only needs to follow the template until a semantic object is found. While doing this, it must just verify that the provided string matches the template. Once it detects a semantic object, it uses the provided syntax definition to parse the object. Once done, the value can be assigned to the semantic object (or be passed to an upper layer, or whatever is appropriate with regard to the calling framework). Then the parser carries on with template processing from where the syntax module left off. This process continues until the end of the template or the end of the log message is reached, or an error occurs. It is up to the implementor to decide whether a premature end of either the message or the template is considered an error.

Conclusion

By using well-defined semantics and syntaxes plus a template-based parser, we can parse any log message and store the result for an important set of properties - the standardizable semantic objects (SSOs) - via a generic parser. The value of this lies not only in the minimal effort of parser creation (only templates need to be updated/defined), but also in a standardized view of important objects. These SSOs can be further utilized in higher-level log analysis scripts that are now able to act on a standardized token stream and thus are vendor-, version- and configuration-independent.

Even if a total standardization of the semantic objects cannot be achieved, it is much easier to handle differences caused by configuration settings with a generic log parser. We may ultimately be interested in writing a log analysis for, e.g., Cisco IOS 11.1. If we assume that user configuration can change the logged properties (and their sequence inside the message), we can now use our generic parser plus templates and have it generate an intermediary data layer that our analysis script can utilize. This will unbundle our analysis script from the actual configuration. Of course, with only very little more effort we could also unbundle it from a specific software version.

I am not sure if we can ever write a generic router or firewall analysis script - vendors' logging differs so much that this is very challenging. I am not even sure such a generic script would actually be useful. But even without that, by eliminating the dependency on software release and configuration settings, we sharply reduce the number of different things an analysis script needs to look at.

Credits

We would like to thank the following people for their input on this paper or for important thoughts they have published elsewhere:

We have tried to include everyone who made a contribution, but someone might accidentally not be included. If you feel we have forgotten you on the list, please accept our apologies and let us know.

References

Revision History

2004-03-09 Initial version begun.

Copyright

This document is copyrighted 2003 by Adiscon GmbH and Rainer Gerhards. Anybody is free to distribute it without paying a fee as long as it is distributed unaltered and there is only a reasonable fee charged for it (e.g. a copying fee for a printout handed out). Please note that "unaltered" means as either this web page or a printout of the same on paper. Any other use requires previous written authorization by Adiscon GmbH and Rainer Gerhards.

If you place the document on a web site or otherwise distribute it to a broader audience, I would appreciate it if you let me know. This serves two needs: first, I will be able to notify you when an update is available (that is no promise!), and second, I am a creature of curiosity and simply interested in where the paper pops up.

Author's Address

Rainer Gerhards
Adiscon GmbH
rgerhards@adiscon.com
www.adiscon.com

Disclaimer

The information within this paper may change without notice. Use of this information constitutes acceptance for use in an AS IS condition. There are NO warranties with regard to this information. In no event shall the author be liable for any damages whatsoever arising out of or in connection with the use or spread of this information. Any use of this information is at the user's own risk.
