This is work in progress! It might be wrong, misleading, or even entirely useless. It covers the
author's thoughts so far and may change completely. Use it with care!
Finding the needle in the haystack...
Log analysis can serve multiple purposes: creating evidence, creating stats
and/or detecting something. This something being for example upcoming operational
problems or intrusions ... One can also think about a lot more things in log
Depending on what you intend to achieve, there are
different ways to do it. We will focus in this paper on detecting "something".
Common to this need is the creation of (semi-)automatic alerts that point the
human administrator to those potentially "interesting" log entries - while creating
as few false alarms (aka "false positives") as possible. Obviously,
what is interesting largely depends on the task you are doing ;) Fortunately, for many tasks
the way we can look at log data - and thus the algorithms to analyse it
- is pretty much the same (at least we think so
at the current state of this document...). The main goal
of our effort is to take some, say, millions
of lines of log data and then extract those 20 lines out of it that
are really worth looking at. I have decided to call this effort "finding the needle in the haystack"
which I suppose is a good graphical description of what we intend to do.
We are actively interested in feedback on anything in
this document - as well as algorithms currently being employed (or thought
about). We will most probably integrate the full feedback (with credits) into
this document, which will be available free of charge to everyone who asks (via
web request ;-)). Please note, however, that this work will most probably be
used to craft advanced algorithms for use in our MonitorWare
line of products. MonitorWare
products are moderately priced closed source. You may consider this before
providing feedback. Any feedback is highly welcome at firstname.lastname@example.org.
What is normal and what is not?
What is normal and what is not is the key question.
Obviously, what is normal largely depends on the individual situation. For
example, a high number of changed passwords might be normal for one organization
and not for another. In the security world, I have sometimes heard that
around 20% of the traffic hitting Microsoft's web servers is malicious. I don't
know if that is true or not - but I can at least envision that this is possible.
Anyhow, if we assume it is correct, we can see that "normal" operations for the
administrator of this web farm have a quite different meaning than for most others...
What is normal also depends on a time dimension.
For example, user logons might be normal during daytime but highly abnormal
during night time (if not working in shifts). Think about it - there are a
number of things that will come up in this regard...
Obviously, every administrator can define what he considers
to be normal and set the system to ignore it. Obviously, this is a lengthy and
error-prone process. What we try to accomplish is to find a way to (more or less)
automatically identify normal activity. Of course, no such system can be fully
automated. But we would like to keep human involvement as low as possible.
Honestly, I have to say I am not sure whether this is a little
bit too ambitious. But anyhow, we will at least give it a try and see where we arrive...
Unfortunately, log data comes in different
flavours. If we try to generalize things, we end up with a very basic definition:
Log data is emitted by devices and
applications and is typically free-form text. We can assume that log data will
be stored in a file (or can be written into one) by collector software. In
file format, log data consists of lines terminated by the OS line terminator (in
most cases CRLF or LF). Typically, one logged message occupies one line, but
there may exist some anomalies where more than a single line represents a single log message
(or there are multiple line terminators). Log lines will typically not contain non-printable characters or
characters with integer values over 254. Integer values between 128 and 254 should not
occur, but are quite common in most places of the world. Each line consists
of both a textual message string and parameter data (e.g. IP addresses, ports,
user names and the like). There is no standard for
the message contents (neither an RFC nor any "mainstream" industry
standard). Each log emitter uses its own, nonstandard format. Even the same emitter, in different
versions, may use a different format. It is up to the emitter's decision (or
configuration) which data will be emitted. There is a great variety in what emitters
think is worth logging. Log data is often transferred via the syslog protocol
or plainly written to text files, but there is also a variety
of other means by which it can be transferred or emitted.
This definition does not sound very promising. In fact,
the diverse nature of log data makes up the initial problem when dealing with
analysis. Fortunately, preprocessing of log data can at least resolve some of
Thin vs. Fat Log Entries
There are two fundamentally different logging philosophies - the thin and the
fat log. Eric Fitzgerald of Microsoft once gave a good, to-the-point
description, so let's listen to what he said on January 17th, 2003 on the
loganalysis mailing list:
Analysts like "fat" audits- where one event contains everything one would conceivably want to know about some occurrence on the system. Unfortunately "fat" events have several drawbacks: they require lots of processing to gather, translate, and format the information, which may require really convoluted code paths as well as modifications of standard APIs to carry information back and forth (and the hope that no intervening functions mess up your data). Also, "fat" audits often require the machine to keep additional copies of state information (process name, etc.) lying around, increasing memory usage and overall bloat of the security system. Lastly, "fat" audits sometimes require delay in reporting the original interesting occurrence while waiting for some of the data that will be logged.
On the other hand, developers like "thin" audits- where each audit is only generated at one point in the code, and only contains information available at that point in the code. This is the easiest to develop and maintain. However, it requires that related audits must be able to be correlated with each other, which is why we have data items like "Logon ID" and "Handle ID" in our audits.
In Windows we've typically followed the "thin" model. However, where practical, we've added extra information that might be redundant with other audits but is still useful. As a side effect there are some
inconsistencies- some audits such as process termination events now have image path name, for example, but some other audits that might benefit from it still don't.
To read the full original post, see
Thin vs. Fat is obviously an issue with many log
entries. In my experience, most vendors tend to produce thin logs for the
reasons Eric has outlined above.
Web server logs, as defined in the W3C standard, are a good example
of fat logging. If you look at a web log, you will see a
single line for each request and that single line contains all
information about the web request. For example, it contains the URL being
requested as well as the bytes received and sent while serving the request. This
implies that logging must be deferred until processing has finished - how
else could the total sent byte count be known? The advantage
obviously is that we have everything in one place. The big disadvantage is that if
someone crafts a malicious URL that succeeds in breaking into the web server,
that request will never be seen because the web server will never reach the log
stage of its processing. This is a general problem (or better: design
decision) with W3C logs, so it is a problem that all web servers I know
have, including Apache and Microsoft IIS.
The good news with thin logs is that we can create fat logs out of the thin
ones with the help of a good preprocessor and the necessary correlation logic.
The bad news, of course, is that this is not easy and automatic... ;)
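To make the idea of such correlation logic a bit more concrete, here is a minimal Python sketch. It assumes the thin events have already been parsed into dictionaries and that related events share a correlation key (here a hypothetical `logon_id` field, in the spirit of the "Logon ID" Eric mentions). The field names and sample events are invented for illustration; real correlation will need emitter-specific knowledge.

```python
from collections import defaultdict


def thin_to_fat(events, key="logon_id"):
    """Merge thin events sharing the same correlation key into fat records.

    `events` is an iterable of dicts. Events lacking the key cannot be
    correlated and are returned unchanged in a separate list.
    """
    fat = defaultdict(dict)
    uncorrelated = []
    for event in events:
        if key not in event:
            uncorrelated.append(event)
            continue
        record = fat[event[key]]
        for name, value in event.items():
            # Keep the first value seen; later thin events only add
            # properties that are still missing.
            record.setdefault(name, value)
    return dict(fat), uncorrelated


# Hypothetical thin events, invented for this example.
THIN_EVENTS = [
    {"logon_id": "0x3e7", "event": "logon", "user": "alice"},
    {"logon_id": "0x3e7", "image_path": "C:\\tools\\app.exe"},
    {"event": "service started"},
]

fat_records, leftovers = thin_to_fat(THIN_EVENTS)
```

Keeping the first value seen is just one possible policy; a real converter would also have to decide when a fat record is complete.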
Preprocessing the Log Data
First of all, we assume the log data is available to the
analyzer as a stream of text strings. It does not matter how it becomes a text
string and whether or not it is from a database or flat file. Let us assume that
our preprocessor has already captured the message and converted it into a text
string.
By doing so, the preprocessor may accidentally split a
single log message spreading over multiple lines into multiple messages. The
preprocessor should try to avoid this, but not at all costs. If it can't be
avoided, we will accept this.
The text string will be made up of valid printable
characters, but will not necessarily be limited to the ANSI character set. It may
be stored in Unicode to facilitate processing of Asian
After this first step, we have a structured stream of log
data. Now we can apply further formatting. The next stage of the preprocessor
will parse each text message in the stream and extract
name/value pairs for known entities like source and destination IP addresses.
The idea behind this is to associate each entry with a set of well-known,
emitter-independent properties. The key is that these known properties should be
very well defined and available for all messages that can successfully be
parsed. On the down side, the preprocessor needs to know each individual
message format that it intends to parse. This is not practical for all events
- but chances might be good it is for the most prominent emitters (e.g. major
firewalls or major OS events).
For the known messages, the parameters will be
stripped from the message part - this will later allow us to identify
identical types of log entries.
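As a rough sketch of this parsing stage, the following Python extracts name/value pairs from one known message format and replaces the parameters with placeholders, so that entries of the same type end up with the same text. The message format, the property names and the `parse_entry` function are all assumptions made up for this example; they do not describe any real firewall's log format.

```python
import re

# Hypothetical message format, invented for this example; a real
# preprocessor would need one pattern per known emitter format.
KNOWN_FORMAT = re.compile(
    r"Deny (?P<proto>\w+) src (?P<src_ip>[\d.]+) dst (?P<dst_ip>[\d.]+)"
)


def parse_entry(line):
    """Return (message_type, properties) for a log line.

    For a known format the parameters are replaced by placeholders, so
    entries of the same type share one message_type. Unknown lines are
    returned unchanged with no properties.
    """
    match = KNOWN_FORMAT.search(line)
    if match is None:
        return line, {}
    properties = match.groupdict()
    message_type = line
    # Replace from the rightmost parameter to the leftmost so that the
    # offsets of the remaining parameters stay valid.
    for name in sorted(properties, key=match.start, reverse=True):
        start, end = match.span(name)
        message_type = message_type[:start] + "<" + name + ">" + message_type[end:]
    return message_type, properties


message_type, properties = parse_entry("Deny tcp src 10.0.0.1 dst 10.0.0.2")
```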
After step 2, we have a well-defined stream of log
entries with hopefully many well-defined properties assigned to each entry. We
are now able to cluster identical types of log entries even when the actual data is
different. We can of course sub-analyse those identical log entries based on the
associated well-defined data. We still have, however, log entries which were
unknown to the parser in step 2 and thus have not been split into the generic and
actual message part. As we cannot do any better, we need to live with that, and
further stages should be designed with that in mind - maybe there is a way to
extract some more meaning from them.
[Idea: It might be possible to run an additional step over
those log entries and do a textual analysis on what is identical in them and
what is not. Chances are good we could strip the changing parameters off the static
message text and thus identify the message type - but would this help anything
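One very simple way such a textual analysis could look is sketched below in Python. It is only an illustration under a strong assumption: that entries of the same type have the same number of whitespace-separated tokens. Tokens that are identical across all lines are treated as static message text, everything else as a changing parameter. The function name and the `*` placeholder are invented here.

```python
def common_template(lines, placeholder="*"):
    """Derive a template from lines with the same number of tokens.

    Tokens identical in every line are kept; positions where any line
    differs are replaced by the placeholder. Returns None when the
    lines cannot be aligned token by token.
    """
    token_rows = [line.split() for line in lines]
    if not token_rows or len({len(row) for row in token_rows}) != 1:
        return None
    template = []
    for column in zip(*token_rows):
        template.append(column[0] if len(set(column)) == 1 else placeholder)
    return " ".join(template)


# Hypothetical unknown log entries, invented for this example.
template = common_template([
    "user bob logged in from 10.1.1.1",
    "user eve logged in from 10.1.1.9",
])
```

This does nothing clever for messages whose parameters contain spaces; it merely shows that the static part can sometimes be recovered without knowing the format.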
A Run-Down of the preprocessing Phase
The following chart provides a quick (and idealized) rundown of the preprocessing stages outlined
Please note that at the end of the preprocessing stages,
we have a set of well-defined log entries which have associated
- a kind of type
- some common properties (like date and time of their
generation or originating device)
- some well-defined name/value properties
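The result of preprocessing described above could be represented roughly like this in Python. The class and all field names are illustrative assumptions, not a fixed format; the sample entry is invented.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class LogEntry:
    """One preprocessed log entry (field names are illustrative)."""

    message_type: str    # the "kind of type" of the entry
    timestamp: datetime  # common property: date and time of generation
    device: str          # common property: originating device
    properties: dict = field(default_factory=dict)  # name/value pairs
    raw: str = ""        # original text, kept for the administrator


# Hypothetical entry, invented for this example.
entry = LogEntry(
    message_type="Deny <proto> src <src_ip> dst <dst_ip>",
    timestamp=datetime(2003, 2, 28, 9, 15),
    device="fw01",
    properties={"proto": "tcp", "src_ip": "10.0.0.1", "dst_ip": "10.0.0.2"},
)
```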
[rgerhards: it may be smarter to run the first two
preproc stages, and *then* run the "thin-to-fat" converter - it will benefit
from the name/value pairs... One can also think about dropping log entries
*before* they ever run into the preproc - this will save processing time. But we
can do this at any time later, so let's not yet focus on
Baselines are very important when it comes to what is
normal. While I have to admit I do not yet know exactly how we can correlate log entries to the respective baseline,
and I do not know exactly which granularity we need for the baseline, I am sure that
baselines are a key to identifying "interesting" events. Only a baseline can teach
me (my algorithm) what is normal (at a given time) and what is not.
So far, I strongly think that we need multiple baselines,
e.g. on an hourly, (week)daily and monthly (really?) basis. All of these baselines
can then be used to detect spikes of traffic that are not common.
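Since the exact form of a baseline is still open, the following Python is only one possible reading of an hourly baseline: for each hour of the day and message type it records the mean and standard deviation of the per-day counts, and flags a count as a spike when it exceeds the mean by a multiple of the standard deviation. The function names, the input layout and the threshold of 3 are assumptions chosen for illustration.

```python
from collections import Counter, defaultdict
from statistics import mean, pstdev


def build_hourly_baseline(history):
    """history: iterable of (day, hour, message_type) from past logs.

    Returns {(hour, message_type): (mean, stddev)} of the per-day counts.
    """
    per_day = Counter(history)  # (day, hour, type) -> count
    days = {day for day, _, _ in per_day}
    samples = defaultdict(dict)
    for (day, hour, mtype), count in per_day.items():
        samples[(hour, mtype)][day] = count
    baseline = {}
    for key, by_day in samples.items():
        # Days without the event count as zero occurrences.
        counts = [by_day.get(day, 0) for day in days]
        baseline[key] = (mean(counts), pstdev(counts))
    return baseline


def is_spike(baseline, hour, mtype, observed, threshold=3.0):
    """True if observed exceeds mean + threshold * stddev for that hour.

    Combinations never seen in the history have a baseline of zero, so
    any occurrence of them counts as a spike.
    """
    avg, dev = baseline.get((hour, mtype), (0.0, 0.0))
    return observed > avg + threshold * dev
```

The same structure could be keyed by weekday or day of month to obtain the other baselines mentioned above.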
... more to be added ...
"Missing" Log Entries
Many currently existing approaches look into the log data they
receive and try to find out interesting events. However, absence of events is also very interesting. So
while digging through the mass of our log data, we must also try
to dig out what is normally there (baseline!) but not present this time.
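A minimal Python sketch of this check, assuming a baseline is already available as a mapping from message type to its average count per period, might look as follows. The function name, the `min_expected` cut-off and the sample message types are invented for illustration.

```python
def missing_types(expected, observed_counts, min_expected=1.0):
    """Return message types the baseline expects but that did not occur.

    expected: {message_type: average count per period} from a baseline.
    observed_counts: {message_type: count} for the current period.
    Types expected less often than min_expected are ignored, as their
    absence in a single period is not surprising.
    """
    return sorted(
        mtype
        for mtype, avg in expected.items()
        if avg >= min_expected and observed_counts.get(mtype, 0) == 0
    )


# Hypothetical baseline and current period, invented for this example.
missing = missing_types(
    {"backup finished": 1.0, "rare event": 0.1, "heartbeat": 12.0},
    {"heartbeat": 11},
)
```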
... more to be added ....
section to be written
I am still thinking that honeypots - used intelligently -
can play a major role in detecting attacks. This needs to be elaborated more; for now
see www.honeynet.org for some background.
Known Attack Signatures
section to be written
Known attack signatures allow us to positively identify an
intrusion attempt. It can be the administrator's decision to be alarmed or not.
All in all, I think known attack signatures are not as vitally important as the rest
of the topics discussed here (when it comes to detecting interesting stuff). The
reasons are a) when the rest of the algorithms work fine, they will find those
attacks and b) if the attacks are known, the (security-aware) admin
probably has countermeasures in place (I know this is a weak point ;)).
There is a theory that those events that
normally happen frequently are noise.
Normally and frequently are the key words (and thus in bold ;)).
Normally is a time dimension. It means
that during normal, typical operation, these events continuously happen. This may
include events that happen every 5 minutes, but it may also mean events that
happen only once a month, but typically always e.g. on the 1st of the month. An
example of a normal, but not really periodic activity might be an increased
number of failed logons during the morning time when people come into the office
and are not really awake.
Frequently is a volume dimension. It
means how often a specific event happens within a given period. To be
infrequent, an event must happen much less often than most of the other events.
For example, an event is infrequent if it happens just twice per hour when other
events typically happen 60 times per hour.
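The volume dimension can be sketched in a few lines of Python. This is only one way to put "much less often than most of the other events" into numbers: a type counts as infrequent when its count is below a fraction of the median count of all types in the period. The function name and the ratio of 0.1 are assumptions.

```python
from collections import Counter
from statistics import median


def infrequent_types(message_types, ratio=0.1):
    """Return types whose count is below ratio * median count.

    message_types: iterable of message type strings for one period.
    ratio is an arbitrary assumption; 0.1 means "ten times rarer than
    the typical event".
    """
    counts = Counter(message_types)
    if not counts:
        return []
    typical = median(counts.values())
    return sorted(t for t, n in counts.items() if n < ratio * typical)


# Two types seen 60 times per hour, one seen only twice.
hour_of_events = ["a"] * 60 + ["b"] * 60 + ["c"] * 2
rare = infrequent_types(hour_of_events)
```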
As a general consideration, we can assume that on typical
days we have a number of non-noise events that is very low and can be easily
handled within the time that the admin typically allocates to such tasks. For any
given administrator, I would assume this is less than 20 events per day. Please
note that I am specifically talking about a single administrator - I assume that
larger enterprises have more administrators and the workload should be
distributed among them. I am also assuming that the administrator has many other
tasks to perform. This number is definitely not meant for e.g. an incident
The key question is now "how can we remove the noise so
that the administrator will only receive those valid positives?". Good question
Marcus J. Ranum introduced the approach of "Artificial Ignorance".
In short, this means removing all those events that happen too frequently to be
really useful. Marcus, please correct me if I did not summarize it right ;)
- drop known noise from the "noise list" (admin created), so we don't need to process it any further
- group messages (by type and name/value where it makes sense)
- messages that happen very frequently are noise (be sure to apply baseline! - admin may set exclusions)
- only a few infrequent messages will be brought out to be considered
- present them to the admin
- let the admin decide which messages (by example) are noise, too --> create "noise list"
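The steps above can be sketched in Python as follows. This is my own simplified illustration, not Marcus Ranum's implementation: the noise list is assumed to be a list of regular expressions, grouping is approximated by masking digits instead of using parsed message types, and the baseline step is left out. The function name and the `max_count` cut-off are invented.

```python
import re
from collections import Counter


def artificial_ignorance(lines, noise_patterns, max_count=2):
    """Return sorted (count, message) pairs that survive noise filtering.

    noise_patterns: the admin-created "noise list" as regex strings.
    max_count: groups seen more often than this are treated as noise;
    the value is an arbitrary assumption.
    """
    noise = [re.compile(pattern) for pattern in noise_patterns]
    # Drop known noise first, so it is not processed any further.
    kept = [
        line for line in lines
        if not any(pattern.search(line) for pattern in noise)
    ]
    # Group messages; digits are masked as a crude stand-in for
    # stripping parameters.
    groups = Counter(re.sub(r"\d+", "#", line) for line in kept)
    # Frequent groups are noise; only the infrequent rest is presented.
    return sorted(
        (count, message)
        for message, count in groups.items()
        if count <= max_count
    )


# Hypothetical log lines and noise list, invented for this example.
LINES = [
    "cron job 1 done",
    "cron job 2 done",
    "login failed for uid 7",
    "login ok uid 1",
    "login ok uid 2",
    "login ok uid 3",
]
report = artificial_ignorance(LINES, noise_patterns=["^cron"])
```

Whatever the admin marks as noise in the report would be added to `noise_patterns` for the next run.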
Noise - those things in the log that we
are not interested in, because they happen frequently and/or are otherwise
normal during typical operation.
Log Emitter - any device or application
that emits log data in any way whatsoever (e.g. sending syslog messages or
storing them to a file in the local file system).
Things to be done...
This section is more or less a reminder for the author ... but if somebody
has related stuff and is willing to share - well, I don't intend to reinvent
everything ;) Just give me a hand at email@example.com.
There are many things I can think of that still need to be done. These are the most
- build a resource with information for
correlating Windows event log events. I mean, how to create "fat" entries
out of the "thin" ones.
- do the same at least for the most prominent PIX messages
We would like to thank the following people for their input on this paper or
important thoughts they have published somewhere else:
- Eric Fitzgerald -
- Marcus J. Ranum -
- Tina Bird of loganalysis.org for her wonderful
moderation of the loganalysis mailing list and great help.
We have tried to include everyone who made a contribution, but someone might
accidentally not be included. If you feel we have forgotten you on the list, please
accept my apologies and let me know.
2003-02-28 | Updated with new thoughts - too many to list specifically (still under initial construction)
2003-02-27 | Initial version begun.
This document is copyrighted ©
2003 by Adiscon GmbH and Rainer Gerhards. Anybody is free to distribute it
without paying a fee as long as it is distributed unaltered and there is only a
reasonable fee charged for it (e.g. a copying fee for a printout handed out).
Please note that "unaltered" means as either this web page or a printout of the same on paper.
Any other use requires previous
written authorization by Adiscon GmbH and Rainer Gerhards.
If you place the document on a web
site or otherwise distribute it to a broader audience, I would appreciate it if you
let me know. This serves two needs: Number one is I am able to notify you when
there is an update available (that is no promise!) and number two is I am a
creature of curiosity and simply interested in where the paper pops up.
The information within this paper may change without notice. Use
of this information constitutes acceptance for use in an AS IS condition. There
are NO warranties with regard to this information. In no event shall the author
be liable for any damages whatsoever arising out of or in connection with the
use or spread of this information. Any use of this information is at the user's