An Algorithm for Traffic Baseline
Created 2003-08-04 by Wajih-ur-Rehman
Updated 2003-09-04 by Wajih-ur-Rehman
Abstract
This paper mainly describes a Base lining algorithm for Traffic
such that this baseline could be used to generate alerts to the user if the
traffic behavior goes out of the range as specified by the baseline. For an
algorithm of such sort, there are two main things that must be taken care
of:
- What is normal and what is not?
- The algorithm should be self-learning and adaptive.
The algorithm that has been described in this document caters for both of these
issues as you will see in the detailed description of the algorithm below.
Detailed Description of the Baseline Algorithm
Purpose
The main purpose of this algorithm is to define some baselines
using the historic traffic data to predict the behavior (or to be more
precise, the volume) of the current data and dynamically adapt itself to the
changing conditions / volume and hence to define
new baselines on the fly.
Inputs
To start this algorithm, there would be two inputs to it
initially:
- Holidays for each month. On a public holiday, the traffic
behavior might change abnormally and hence if it is not taken care of in the
algorithm, it might generate a false positive.
- At what percentage x, a number would be classified as an
outlier (lets say 20% or 30%). This is truly based on the user assessment and
may vary from environment to environment
Once the algorithm is started, no / little human interaction
would be required and the algorithm would learn itself using the historic data
and according to certain rules that have been defined within this algorithm.
Description
This algorithm has been divided into two parts:
- Learning Mode: In this part, the algorithm would
calculate a starting baseline value with which to compare the records in
future.
- Dynamic Mode: In this part, the algorithm would
dynamically adapt itself to the changing conditions and hence update the
baseline on the fly.
1. Learning Mode
When the algorithm starts, it will be in "Learning Mode".
In this mode, the algorithm would definitely need to calculate some baseline
value based on historic data in order to come up with some baseline value with
which to compare the volume in future.
This
algorithm defines baseline for each day of the week separately! The reason
behind this is that the traffic for one day might be very different from the
traffic on another day and hence should not be intermingled to spoil the
baselines. For example, the traffic on Sundays might be very less as compared to
traffic on Mondays or vice versa but the traffic on two Mondays or two Sundays
should be more or less the same.
Now to determine the initial
baseline, one very simple way is to take the average values of the days and come
up with a baseline. There is, however, a problem with this approach. The problem
lies in the fact that there could be certain outliers as well in the historic
data and the average value that has been taken will also be affected by the
outliers in the historic data. So its not a good starting point.
To get the initial
baseline value, Interquartile Range (IQR) has been used to eliminate the
outliers that might be present in the historic data. The
interquartile distance is the
difference between the values for the 25th-percentile and the 75th-percentile of
the test statistic. These values could be obtained by ordering the data from
smallest to largest. The 25th-percentile value is the value that divides the
ordered set such that 25% of the statistics are smaller and 75% are greater. The
75th-percentile value is the value that divides the ordered set such that 75% of
the statistics are smaller and 25% are greater. If the number of test statistics
cannot be evenly divided by 2, these values are calculated as the mean of the
two test statistics closest to the appropriate position. To determine the
outliers, multiply the interquartile distance by 1.5. Any values farther from
the median than 1.5 times the interquartile distance are considered outliers.
The example below would
elaborate it further.
Example
This example calculates the initial baseline for Monday only (The
counts given in the table below are only of Mondays). A similar procedure would
be adopted for calculating the initial baselines for the other days. Consider
the following table of total counts of events on Mondays after sorting:
| 1 |
800 |
| 2 |
950 |
| 3 |
990 |
| 4 |
998 |
| 5 |
1000 |
| 6 |
1050 |
| 7 |
1100 |
| 8 |
1200 |
| 9 |
1500 |
| 10 |
2000 |
| 11 |
2010 |
25th-percentile = 0.25 * (11+1) = 3rd value i.e. 990
75th-percentile = 0.75*(11+1) = 9th value i.e. 1500
IQR = 1500 - 990 = 510
Median = 1050
Upper Limit = 1050 + 510*1.5 = 1815
Lower Limit = 1050 - 510*1.5 = 285
Outliers = Values not present within limits = 2000 and 2010
After the outliers are discarded, the new data contents are:
| 1 |
800 |
| 2 |
950 |
| 3 |
990 |
| 4 |
998 |
| 5 |
1000 |
| 6 |
1050 |
| 7 |
1100 |
| 8 |
1200 |
| 9 |
1500 |
Now calculate the starting baseline as the average of these values
Starting Baseline = 1065
This value can now be considered as a refined
baseline for Monday. Similar procedure would be adopted for other days to
determine their initial baselines as well. Note that these baselines are just
some starting values. With the passage of time these baselines would continue to
adjust themselves.
Once the initial baselines for all the days have been calculated, the
algorithm would shift its mode to "Dynamic Mode"
2. Dynamic Mode
Now the algorithm can start comparing the day to day data with the baselines
that it calculated in the "Learning Mode" to generate alerts and also to update
the baselines. There can be two different scenarios in day to day operations.
- Every thing is running normally with a few outliers
on some days
- Some changes have been done to the system which drastically changed what
has been considered as normal before. Hence new baselines need to be computed.
a) Every thing is running normally with a few outliers
The algorithm will pick up the count of a specific day and
will check whether it is within the range x (specified by the user as input to
this algorithm) compared to its corresponding baseline that it calculated for
that day in the Learning Mode. If it is, it will be marked as “Count towards the
baseline”. If it is not, it will be marked as “Too High” or “Too Low” depending
upon from which side it is out of the specified range. On the next day of the
same week, it will take the average of all the previous values that have been
marked as “count towards the baseline” and compare the current value of this
day. If it is within the range, it will be marked as “Count towards the
baseline” otherwise, it will once again be marked as either “Too High” or “Too
Low” depending upon from which side it is out of range. e.g.
Assume:
- Baseline for Friday = 10000
- x = 40 %
- B.L. = short for Baseline
- !< is short for not less than
- !> is short for not greater than
| Day |
Count |
Calculations |
Status |
Comments |
| 1st Friday Of July |
10500 |
B.L. = 10000 10500 < 10000*1.4
10500 > 10000*0.6 |
Count towards B.L |
|
| 2nd Friday of July |
11987 |
B.L. = 10250 11987 < 10250*1.4
11987 > 10250*0.6
|
Count towards B.L |
B.L. is calculated using 10000 and 10500 since they count
towards B.L. |
| 3rd Friday of July |
15000 |
B.L. = 10829 15000 < 10829*1.4
15000 > 10829*0.6
|
Count towards B.L |
B.L. is calculated using 10000,10500, 11987
since they count towards B.L. |
| 4th Friday of July |
8000 |
B.L. = 11872 8000 < 11872*1.4
8000 > 11872*0.6
|
Count towards B.L |
B.L. is calculated using 10000,10500, 11987, 15000
since they count towards B.L. |
| 1st Friday of August |
20000 |
B.L. = 11098 20000 < 11098*1.4
20000 > 11098*0.6
|
Too High |
B.L. is calculated using 10000,10500, 11987, 15000,
8000 since they count towards B.L. |
| 2nd Friday of August |
8000 |
B.L. = 11098 8000 < 11098*1.4
8000 > 11098*0.6
|
Count towards B.L |
B.L. is calculated using
10000,10500, 11987, 15000, 8000 since they count towards B.L. Note that
20000 is skipped because it was an
outlier.
|
b) Some changes have been done to the system
One very important thing to keep in mind is that, consistent
high or consistent low values might not be actually outliers at all. You might
see these consistent high values because of certain changes that you have done
in your system and which caused that change. In this scenario, a new baseline
has to be defined because now what you consider as normal has been changed after
you made changes to your system!
According to this algorithm, if for 7 consecutive days (a
week), you see that there are outliers on the same side, it means that
the baseline has to be revised now. On the 8th day it will consider the current
value as the new Baseline and goes into the Learning Mode again because obviously
that new value is not at all the baseline. That might be an outlier as well. So,
the algorithm cannot be sure until it learns again. Now whenever it generates
any alert to the user, it will also indicate the user that it is currently
running in the learning mode and the generated alert might be a false positive.
It will perform the same task as explained above in the "Learning Mode" Section.
The only difference is that it will update the baseline each week for a specific
day (starting from the day when it switched to Learning Mode) and after 4 weeks
(when it has obtained 4 different values for a specific day, lets say Monday) ,
it will switch back to "Dynamic Mode" of operation.
This algorithm can be slightly improved if the user tells the algorithm when
he/she makes any changes to it. With this approach, whenever the user says "I
have made changes", it will switch to "Learning Mode" from "Dynamic Mode" and
would not have to wait for a period of 7 days as described above.
Conclusions
Using this algorithm, dynamic baselines for traffic for different days can be
defined and on the basis of these, alerts could be generated to the user when
the behavior of the traffic is abnormal as compared to the calculated baselines.
Acknowledgements
I thank Rainer Gerhards of Adiscon
for having detailed discussions with me on this algorithm and also for reviewing this paper.
I would also like to thank Sweth
Chandramouli for his valuable input.
References
-
http://www.terraseer.com/csr/help/index.htm#concepts/interquartile_distance.htm
-
http://www.statcan.ca/english/edu/power/ch12/range.htm
Author's Address
Wajih-ur-Rehman
wrehman@ro1.adiscon.com
Adiscon GmbH
Mozartstrasse 21
97950 Grossrinderfeld
Germany
Disclaimer
The information within
this paper may change without notice. Use of this information constitutes
acceptance for use in an AS IS condition. There are NO warranties with regard to
this information. In no event shall the authors be liable for any damages
whatsoever arising out of or in connection with the use or spread of this
information. Any use of this information is at the user's own risk.
|