This is work in progress! It might be wrong, misleading or even useless. It covers the author's thoughts so far and can potentially change completely. Use it with care!

Finding the needle in the haystack...

Log analysis can serve multiple purposes: creating evidence, creating statistics and/or detecting something - that "something" being, for example, upcoming operational problems or intrusions. One can think of many more uses for log analysis.

Depending on what you intend to achieve, there are different ways to do it. In this paper we will focus on detecting "something". Common to this need is the creation of (semi-)automatic alerts that point the human administrator to those potentially "interesting" log entries - while creating as few false alarms (aka "false positives") as possible. Obviously, what is interesting largely depends on the task you are doing ;) Fortunately, for many tasks the way we can look at log data - and thus the algorithms to analyse it - is pretty much the same (at least we think so at the current state of this document...). The main goal of our effort is to take some, say, millions of lines of log data and extract those 20 lines that are really worth looking at. I have decided to call this effort "finding the needle in the haystack", which I suppose is a good graphical description of what we intend to do.

We are actively interested in feedback on anything in this document - as well as algorithms currently being employed (or thought about). We will most probably integrate the full feedback (with credits) into this document, which will be available free of charge to everyone who asks (via web request ;-)). Please note, however, that this work will most probably be used to craft advanced algorithms for use in our MonitorWare line of products. MonitorWare products are moderately priced closed source. You may consider this before providing feedback. Any feedback is highly welcome at rgerhards@adiscon.com.

What is normal and what is not?

What is normal and what is not is the key question. Obviously, what is normal largely depends on the individual situation. For example, a high number of changed passwords might be normal for one organization and not for another. In the security world, I have sometimes heard that around 20% of the traffic hitting Microsoft's web servers is malicious. I don't know whether that is true - but I can at least envision that it is possible. Anyhow, if we assume it is correct, we can see that "normal" operations for the administrator of this web farm have a quite different meaning than for most others...

What is normal also depends on a time dimension. For example, user logons might be normal during daytime but highly abnormal during night time (if not working in shifts). Think about it - there are a number of things that will come up in this regard...

Obviously, every administrator can define what he thinks is normal and set the system to ignore it. But this is a lengthy and error-prone process. What we try to accomplish is to find a way to (more or less) automatically identify normal activity. Of course, no such system can be fully automated, but we would like to keep the human involvement as small as possible.

Honestly, I have to say I am not sure whether this is a little too ambitious. But anyhow, we will at least give it a try and see where we arrive...

Log Data

General Structure

Unfortunately, log data comes in different flavours. If we try to generalize things, we end up with a very basic definition:

Log data is emitted by devices and applications and is typically of free-text form. We can assume that log data will be stored in a file (or can be written into one) by collector software. In more detail:

  • In file format, log data consists of lines terminated by the OS line terminator (in most cases CRLF or LF). Typically, one logged message occupies one line, but there may exist some anomalies where more than a single line represents a single log message (or there are multiple line terminators).
  • Log lines will typically not contain non-printable characters or characters with integer values over 254. Integer values between 128 and 254 should not occur, but are quite common in most places of the world.
  • Each line consists of both a textual message string and parameter data (e.g. IP addresses, ports, user names and the like).
  • There is no standard for the message contents (neither an RFC nor any "mainstream" industry standard). Each log emitter uses its own, nonstandard format. Even the same emitter, in different versions, may use a different format.
  • It is up to the emitter's decision (or configuration) which data will be emitted. There is great variety in what emitters think is worth logging.
  • Log data is often transferred via the syslog protocol or plainly written to text files, but there is also a variety of other means by which it can be transferred or emitted.

This definition does not sound very promising. In fact, the diverse nature of log data makes up the initial problem when dealing with analysis. Fortunately, preprocessing of log data can at least resolve some of the ambiguity.
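To make the ambiguity concrete, the following is a minimal sketch of turning one raw BSD-style syslog line into a structured record. The regular expression and field names are illustrative assumptions, not a standard API; real-world lines deviate from this pattern all the time, which is exactly the problem described above.

```python
import re

# Minimal sketch: normalize one BSD-style syslog line into a dict.
# The pattern and field names are illustrative assumptions only.
LINE_RE = re.compile(
    r"^(?:<(?P<pri>\d{1,3})>)?"                     # optional <PRI> value
    r"(?P<timestamp>\w{3} [ \d]\d \d\d:\d\d:\d\d) " # e.g. "Feb 28 14:05:01"
    r"(?P<host>\S+) "                               # originating host
    r"(?P<msg>.*)$"                                 # free-text message part
)

def parse_line(line):
    """Return a dict of header fields plus the free-text message, or None."""
    m = LINE_RE.match(line.rstrip("\r\n"))
    return m.groupdict() if m else None

entry = parse_line("<34>Feb 28 14:05:01 fw01 sshd[815]: Failed password for root")
# entry["host"] is "fw01"; entry["msg"] carries the unstructured remainder
```

Note that even after this step the interesting information still sits in the unstructured `msg` field - header parsing alone does not solve the diversity problem.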

Thin vs. Fat Log Entries

There are two fundamentally different logging philosophies - the thin and the fat log. Eric Fitzgerald of Microsoft once gave a good, to-the-point description, so let's listen to what he said on January 17th, 2003 on the loganalysis mailing list [2]:

Analysts like "fat" audits- where one event contains everything one would conceivably want to know about some occurrence on the system. Unfortunately "fat" events have several drawbacks: they require lots of processing to gather, translate, and format the information, which may require really convoluted code paths as well as modifications of standard APIs to carry information back and forth (and the hope that no intervening functions mess up your data). Also, "fat" audits often require the machine to keep additional copies of state information (process name, etc.) lying around, increasing memory usage and overall bloat of the security system. Lastly, "fat" audits sometimes require delay in reporting the original interesting occurrence while waiting for some of the data that will be logged.

On the other hand, developers like "thin" audits- where each audit is only generated at one point in the code, and only contains information available at that point in the code. This is the easiest to develop and maintain. However, it requires that related audits must be able to be correlated with each other, which is why we have data items like "Logon ID" and "Handle ID" in our audits.

In Windows we've typically followed the "thin" model. However, where practical, we've added extra information that might be redundant with other audits but is still useful. As a side effect there are some inconsistencies- some audits such as process termination events now have image path name, for example, but some other audits that might benefit from it still don't.

To read the full original post, see http://lists.shmoo.com/pipermail/loganalysis/2003-January/001774.html.

Thin vs. Fat is obviously an issue with many log entries. In my experience, most vendors tend to produce thin logs for the reasons Eric has outlined above.

A good example of fat logging are web server logs, as defined in the W3C standard. If you look at a web log, you will see a single line for each request, and that single line contains all information about the web request. For example, it contains the URL being requested as well as the bytes received and sent while serving the request. This implies that logging must be deferred until processing has finished - how else could the total byte count sent be known? The advantage obviously is that we have everything in one place. The big disadvantage is that if someone crafts a malicious URL that succeeds in breaking into the web server, that request will never be seen, because the web server will never reach the logging stage of its processing. This is a general problem (or better: design decision) with W3C logs, so it is a problem that all web servers I know of have, including Apache and Microsoft IIS.
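The self-describing nature of such fat logs makes them easy to consume: the "#Fields" directive names each column. A minimal sketch (the sample line and values are made up):

```python
# Sketch: split one W3C extended-log line using its "#Fields" directive.
# Field names follow the W3C extended log format; the sample data is invented.
fields_directive = "#Fields: date time c-ip cs-method cs-uri-stem sc-status sc-bytes"
field_names = fields_directive.split(":", 1)[1].split()

line = "2003-02-28 14:05:01 10.0.0.5 GET /index.html 200 4312"
record = dict(zip(field_names, line.split()))
# record["cs-uri-stem"] is "/index.html", record["sc-status"] is "200"
```

Every property of the request is available from this one line - no correlation needed, which is precisely what makes fat logs attractive for analysis.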

The good news with thin logs is that we can create fat logs out of the thin ones with the help of a good preprocessor and the necessary correlation logic. The bad news, of course, is that this is not easy and automatic... ;)
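The core of such a thin-to-fat converter is grouping thin events by a shared correlation key and merging their fields into one combined record. The sketch below uses a made-up "logon_id" key and invented event shapes to illustrate the principle; real Windows event correlation is considerably messier.

```python
from collections import defaultdict

# Sketch: merge "thin" events that share a correlation key (here an assumed
# "logon_id" field) into one "fat" record per session.
thin_events = [
    {"logon_id": "0x3e7", "event": "logon",  "user": "alice"},
    {"logon_id": "0x3e7", "event": "access", "object": "payroll.xls"},
    {"logon_id": "0x3e7", "event": "logoff"},
]

def fatten(events, key="logon_id"):
    sessions = defaultdict(dict)
    for ev in events:
        # each event contributes its fields to the session's combined record
        sessions[ev[key]].update(ev)
    return dict(sessions)

fat = fatten(thin_events)
# fat["0x3e7"] now combines user, accessed object and final state in one place
</```

The hard part, of course, is knowing which field is the correlation key for each event type - that is exactly the resource the "Things to be done" section below calls for.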

Preprocessing the Log Data

First of all, we assume the log data is available to the analyzer as a stream of text strings. It does not matter how it becomes a text string and whether or not it is from a database or flat file. Let us assume that our preprocessor has already captured the message and converted it into a text string.

  • By doing so, the preprocessor may accidentally split a single log message spreading over multiple lines into multiple messages. The preprocessor should try to avoid this, but not at all costs. If it can't be avoided, we will accept it.
  • The text string will be made up of valid printable characters, but not necessarily be limited to the ANSI character set. It may be stored in Unicode to facilitate processing of Asian characters.

After this first step, we have a structured stream of log data. Now we can apply further formatting. The next stage of the preprocessor can

  • parse each single text message and extract name/value pairs for known entities like source and destination IP addresses. The idea behind this is to associate each entry with a set of well-known, emitter-ignorant properties. The key is that these known properties should be very well defined and available for all messages that can successfully be parsed. On the down side, the preprocessor needs to know each individual message format that it intends to parse. This is not practical for all events - but chances are good that it is for the most prominent emitters (e.g. major firewalls or major OS events).
  • For the known messages, the parameters will be stripped from the message part - this will later allow us to identify identical types of log entries.

After step 2, we have a well-defined stream of log entries with hopefully many well-defined properties assigned to each entry. We are now able to cluster identical types of log entries even when the actual data is different. We can of course sub-analyse those identical log entries based on the associated well-defined data. We still have, however, log entries which were unknown to the parser in step 2 and thus have not been split into generic and actual message parts. As we can not do any better, we need to live with that, and further stages should be designed with that in mind - maybe there is a way to extract some more meaning from them.
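A stage-2 parser for a known message format might look like the sketch below: one pattern per known format yields both the name/value properties and a parameter-free "type" string that lets us cluster identical entries. The sshd pattern is an illustrative assumption, not a vendor specification.

```python
import re

# Sketch: for message formats the preprocessor knows, pull out well-defined
# properties and keep a parameter-free type string for clustering.
KNOWN = [
    (re.compile(r"Failed password for (?P<user>\S+) from (?P<src_ip>\S+)"),
     "Failed password for <user> from <src_ip>"),
]

def classify(msg):
    for pattern, msg_type in KNOWN:
        m = pattern.search(msg)
        if m:
            return msg_type, m.groupdict()   # (type, name/value properties)
    return None, {}                          # unknown to the parser

mtype, props = classify("Failed password for root from 10.0.0.99")
# mtype identifies identical entry types even when user/src_ip differ
```

Two failed-logon lines with different users and addresses now map to the same type, which is the prerequisite for all the frequency analysis discussed later.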

[Idea: It might be possible to run an additional step over those log entries and do a textual analysis of what is identical in them and what is not. Chances are good we could strip the changing parameters off the static message text and thus identify the message type - but would this help anything at all???]
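The idea in the brackets above can be sketched quite simply: compare similar unknown lines token by token and mask the positions that vary, leaving the static message text as a type signature. This is purely illustrative - a real implementation would first have to group lines that are similar enough to compare, and handle differing token counts.

```python
# Sketch: derive a message-type template from lines of the same (unknown)
# format by masking the tokens that vary. Assumes equal token counts per line.
def template(lines):
    token_rows = [ln.split() for ln in lines]
    columns = zip(*token_rows)
    return " ".join(col[0] if len(set(col)) == 1 else "<*>" for col in columns)

sig = template([
    "connection from 10.0.0.1 port 4312",
    "connection from 10.0.0.7 port 5511",
])
# sig == "connection from <*> port <*>"
```

So yes, it would help: even without a hand-written parser, entries of the same type could be clustered and the variable parameters located.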

A Run-Down of the Preprocessing Phase

The following chart provides a quick (and idealized) rundown of the preprocessing stages outlined above.

Please note that at the end of the preprocessing stages, we have a set of well-defined log entries which have associated

  • a kind of type
  • some common properties (like date and time of their generation or originating device)
  • some well-defined name/value properties

[rgerhards: it may be smarter to run the first two preproc stages, and *then* run the "thin-to-fat" converter - it will benefit from the name/value pairs... One can also think about dropping log entries *before* they ever run into the preproc - this will save processing time. But we can do this at any time later, so let's not yet focus on performance.]


Baselines

Baselines are very important when it comes to what is normal. While I have to admit I do not yet know exactly how we can correlate log entries to the respective baseline, and I do not know exactly which granularity we need for the baseline, I am sure that baselines are a key to identifying "interesting" events. Only a baseline can teach me (my algorithm) what is normal (at a given time) and what is not.

So far, I strongly think that we need multiple baselines, e.g. on an hourly, (week)daily and monthly (really?) basis. All of these baselines can then be used to detect spikes of traffic that are not common.
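One simple way to use such a baseline is to flag the current period's event count when it lies far above the historical counts for the same period. The sketch below uses a 3-sigma threshold over an hourly baseline; both the threshold and the sample counts are arbitrary assumptions for illustration.

```python
from statistics import mean, stdev

# Sketch: flag the current hour's event count as a spike when it lies far
# above the baseline for that hour. The 3-sigma threshold is an assumption.
def is_spike(current, baseline_counts, sigmas=3.0):
    mu = mean(baseline_counts)
    sd = stdev(baseline_counts) or 1.0   # avoid a zero threshold width
    return current > mu + sigmas * sd

history = [12, 15, 9, 14, 11, 13, 10, 12]   # same hour on previous days
print(is_spike(240, history))  # True  - far outside the normal range
print(is_spike(14, history))   # False - within the normal range
```

Separate baselines per hour-of-day and per weekday would simply mean keeping one such history list per (hour, weekday) bucket.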

... more to be added ...

"Missing" Log Entries

Many currently existing approaches look into the log data they receive and try to find interesting events. However, the absence of events is also very interesting. So while digging through the mass of our log data, we must also try to dig out what is normally there (baseline!) but not present this time.
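With message types from the preprocessor and a baseline of what usually appears, the missing-event check is a plain set difference. The message-type names below are invented for illustration.

```python
# Sketch: events the baseline says should appear, but which did not, are
# themselves interesting. Message-type sets are illustrative assumptions.
def missing_events(baseline_types, observed_types):
    return set(baseline_types) - set(observed_types)

baseline = {"backup finished", "logrotate ran", "heartbeat"}
observed = {"heartbeat", "logrotate ran"}
print(missing_events(baseline, observed))  # {'backup finished'}
```

A missing "backup finished" message may well matter more than anything that was actually logged that night.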

... more to be added ....


Honeypots

section to be written

I am still thinking that honeypots - used intelligently - can play a major role in detecting attacks. This needs to be elaborated more; for now, see www.honeynet.org for some background.

Known Attack Signatures

section to be written

Known attack signatures allow us to positively identify an intrusion attempt. It can be the administrator's decision whether to be alarmed or not. All in all, I think known attack signatures are not as vitally important as the rest of the topics discussed here (when it comes to detecting interesting stuff). The reasons are a) when the rest of the algorithms work fine, they will find those attacks and b) if the attacks are known, the (security-aware) admin probably has countermeasures in place (I know this is a weak point ;)).

Removing Noise

There is a theory that those events that normally happen frequently are noise. Normally and frequently are the key words (and thus in bold ;)).

Normally is a time dimension. It means that during normal, typical operation, these events continuously happen. This may include events that happen every 5 minutes, but it may also mean events that happen only once a month, but typically always on, e.g., the 1st of the month. An example of normal, but not really periodic, activity might be an increased number of failed logons during the morning, when people come into the office and are not really awake.

Frequently is a volume dimension. It means how often a specific event happens within a given period. To be infrequent, an event must happen much less often than most of the other events. For example, an event is infrequent if it happened just twice per hour when other events have typically happened 60 times per hour.

As a general consideration, we can assume that on typical days we have a number of non-noise events that is very low and can easily be handled within the time that the admin typically allocates to such work. For any given administrator, I would assume this is fewer than 20 events per day. Please note that I am specifically talking about a single administrator - I assume that larger enterprises have more administrators and that the workload is distributed among them. I am also assuming that the administrator has many other tasks to perform. This number is definitely not meant for, e.g., an incident response team.

The key question is now "how can we remove the noise so that the administrator will only receive those valid positives?". Good question ;-).

Marcus J. Ranum introduced the approach of "Artificial Ignorance" [1]. In short, this means removing all those events that happen too frequently to be really useful. Marcus, please correct me if I did not summarize it right ;)

Basic ideas:

  • drop known noise from "noise list" (admin created), so we don't need to process them any further
  • group messages (by type and name/value where it makes sense)
  • messages that happen very frequently are noise (be sure to apply baseline! - admin may set exclusions)
  • only a few infrequent messages will be brought out to be considered
  • present them to admin
  • let admin decide which messages (by example) are noise, too --> create "noise list"
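The filtering steps above can be sketched as a small frequency filter in the spirit of "artificial ignorance": drop admin-listed noise, count the remaining message types, and surface only the rare ones. The threshold, sample types and counts are assumptions for illustration; baselines and exclusions are omitted.

```python
from collections import Counter

# Sketch of a frequency-based noise filter: drop known noise, then keep
# only message types occurring at most max_count times in the period.
# Thresholds and sample data are illustrative assumptions.
def interesting(entries, noise_list, max_count=2):
    counts = Counter(e for e in entries if e not in noise_list)
    return [t for t, n in counts.items() if n <= max_count]

log_types = (["MARK"] * 500                              # admin-listed noise
             + ["connection from <ip>"] * 120            # frequent -> noise
             + ["Failed password for <user>"] * 2        # rare -> keep
             + ["kernel: disk failure on sda"])          # rare -> keep
noise = {"MARK"}
print(interesting(log_types, noise))
# only the two rare types remain for the admin to review
```

Entries the admin then flags as uninteresting would be added to the noise list, closing the feedback loop described in the bullet points.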


Definitions

Noise - those things in the log that we are not interested in, because they happen frequently and/or are otherwise normal during typical operation.

Log Emitter - any device or application that emits log data in any way whatsoever (e.g. sending syslog messages or storing them to a file in the local file system).

Things to be done...

This section is more or less a reminder for the author... but if somebody has related stuff and is willing to share - well, I don't intend to reinvent everything ;) Just give me a hand at rgerhards@adiscon.com.

There are many things I can think of that need to be done. These are the most concrete ones:

  • build a resource with information for correlating Windows event log events. I mean, how to create "fat" entries out of the "thin" ones.
  • do the same at least for the most prominent PIX messages


Acknowledgments

We would like to thank the following people for their input on this paper or important thoughts they have published elsewhere:

  • Eric Fitzgerald -
  • Marcus J. Ranum -
  • Tina Bird of loganalysis.org for her wonderful moderation of the loganalysis mailing list and great help.

We have tried to include everyone who made a contribution, but someone might accidentally not be included. If you feel we have forgotten you, please accept my apologies and let me know.


Revision History

2003-02-28  Updated with new thoughts - too many to list specifically (still under initial construction)
2003-02-27  Initial version begun.


This document is copyrighted 2003 by Adiscon GmbH and Rainer Gerhards. Anybody is free to distribute it without paying a fee as long as it is distributed unaltered and there is only a reasonable fee charged for it (e.g. a copying fee for a printout handed out). Please note that "unaltered" means as either this web page or a printout of the same on paper. Any other use requires previous written authorization by Adiscon GmbH and Rainer Gerhards.

If you place the document on a web site or otherwise distribute it to a broader audience, I would appreciate if you let me know. This serves two needs: Number one is I am able to notify you when there is an update available (that is no promise!) and number two is I am a creature of curiosity and simply interested in where the paper pops up.

Author's Address

Rainer Gerhards
Adiscon GmbH


The information within this paper may change without notice. Use of this information constitutes acceptance for use in an AS IS condition. There are NO warranties with regard to this information. In no event shall the author be liable for any damages whatsoever arising out of or in connection with the use or spread of this information. Any use of this information is at the user's own risk.
