Articles  
 

An Algorithm for Traffic Baseline

Created 2003-08-04 by Wajih-ur-Rehman
Updated 2003-09-04 by Wajih-ur-Rehman

Abstract

This paper mainly describes a Base lining algorithm for Traffic such that this baseline could be used to generate alerts to the user if the traffic behavior goes out of the range as specified by the baseline. For an algorithm of such sort, there are two main things that must be taken care of:

  1. What is normal and what is not?
  2. The algorithm should be self-learning and adaptive.

The algorithm that has been described in this document caters for both of these issues as you will see in the detailed description of the algorithm below.

Detailed Description of the Baseline Algorithm

Purpose

The main purpose of this algorithm is to define some baselines using the historic traffic data to predict the behavior (or to be more precise, the volume) of the current data and dynamically adapt itself to the changing conditions / volume and hence to define new baselines on the fly.

Inputs

To start this algorithm, there would be two inputs to it initially:

  1. Holidays for each month. On a public holiday, the traffic behavior might change abnormally and hence if it is not taken care of in the algorithm, it might generate a false positive.
  2. At what percentage x, a number would be classified as an outlier (lets say 20% or 30%). This is truly based on the user assessment and may vary from environment to environment

Once the algorithm is started, no / little human interaction would be required and the algorithm would learn itself using the historic data and according to certain rules that have been defined within this algorithm.

Description

This algorithm has been divided into two parts:

  1. Learning Mode: In this part, the algorithm would calculate a starting baseline value with which to compare the records in future.
  2. Dynamic Mode: In this part, the algorithm would dynamically adapt itself to the changing conditions and hence update the baseline on the fly.

1. Learning Mode

When the algorithm starts, it will be in "Learning Mode". In this mode, the algorithm would definitely need to calculate some baseline value based on historic data in order to come up with some baseline value with which to compare the volume in future. This algorithm defines baseline for each day of the week separately! The reason behind this is that the traffic for one day might be very different from the traffic on another day and hence should not be intermingled to spoil the baselines. For example, the traffic on Sundays might be very less as compared to traffic on Mondays or vice versa but the traffic on two Mondays or two Sundays should be more or less the same.

Now to determine the initial baseline, one very simple way is to take the average values of the days and come up with a baseline. There is, however, a problem with this approach. The problem lies in the fact that there could be certain outliers as well in the historic data and the average value that has been taken will also be affected by the outliers in the historic data. So its not a good starting point.

To get the initial baseline value, Interquartile Range (IQR) has been used to eliminate the outliers that might be present in the historic data. The interquartile distance is the difference between the values for the 25th-percentile and the 75th-percentile of the test statistic. These values could be obtained by ordering the data from smallest to largest. The 25th-percentile value is the value that divides the ordered set such that 25% of the statistics are smaller and 75% are greater. The 75th-percentile value is the value that divides the ordered set such that 75% of the statistics are smaller and 25% are greater. If the number of test statistics cannot be evenly divided by 2, these values are calculated as the mean of the two test statistics closest to the appropriate position. To determine the outliers, multiply the interquartile distance by 1.5. Any values farther from the median than 1.5 times the interquartile distance are considered outliers.

The example below would elaborate it further.

Example

This example calculates the initial baseline for Monday only (The counts given in the table below are only of Mondays). A similar procedure would be adopted for calculating the initial baselines for the other days. Consider the following table of total counts of events on Mondays after sorting:

1 800
2 950
3 990
4 998
5 1000
6 1050
7 1100
8 1200
9 1500
10 2000
11 2010

 

25th-percentile =  0.25 * (11+1) = 3rd value i.e. 990

75th-percentile = 0.75*(11+1) = 9th value i.e. 1500

IQR = 1500 - 990 = 510

Median = 1050

Upper Limit = 1050 + 510*1.5 = 1815

Lower Limit = 1050 - 510*1.5 =  285

Outliers = Values not present within limits = 2000 and 2010

After the outliers are discarded, the new data contents are:

1 800
2 950
3 990
4 998
5 1000
6 1050
7 1100
8 1200
9 1500

Now calculate the starting baseline as the average of these values

Starting Baseline = 1065

This value can now be considered as a refined baseline for Monday. Similar procedure would be adopted for other days to determine their initial baselines as well. Note that these baselines are just some starting values. With the passage of time these baselines would continue to adjust themselves.

Once the initial baselines for all the days have been calculated, the algorithm would shift its mode to "Dynamic Mode"

2. Dynamic Mode

Now the algorithm can start comparing the day to day data with the baselines that it calculated in the "Learning Mode" to generate alerts and also to update the baselines. There can be two different scenarios in day to day operations.

  • Every thing is running normally with a few outliers on some days
  • Some changes have been done to the system which drastically changed what has been considered as normal before. Hence new baselines need to be computed.

a) Every thing is running normally with a few outliers

The algorithm will pick up the count of a specific day and will check whether it is within the range x (specified by the user as input to this algorithm) compared to its corresponding baseline that it calculated for that day in the Learning Mode. If it is, it will be marked as “Count towards the baseline”. If it is not, it will be marked as “Too High” or “Too Low” depending upon from which side it is out of the specified range. On the next day of the same week, it will take the average of all the previous values that have been marked as “count towards the baseline” and compare the current value of this day. If it is within the range, it will be marked as “Count towards the baseline” otherwise, it will once again be marked as either “Too High” or “Too Low” depending upon from which side it is out of range. e.g.

Assume:

  • Baseline for Friday = 10000
  • x = 40 %
  • B.L. = short for Baseline
  • !< is short for not less than
  • !> is short for not greater than
Day Count Calculations Status Comments
1st Friday Of July 10500 B.L. = 10000

10500 < 10000*1.4

10500 > 10000*0.6

Count towards B.L  
2nd Friday of July 11987 B.L. = 10250

11987 < 10250*1.4

11987 > 10250*0.6

 

Count towards B.L B.L. is calculated using 10000 and 10500 since they count towards B.L.
3rd Friday of July 15000 B.L. = 10829

15000 < 10829*1.4

15000 > 10829*0.6

 

Count towards B.L B.L. is calculated using  10000,10500, 11987 since they count towards B.L.
4th Friday of July 8000 B.L. = 11872

8000 < 11872*1.4

8000 > 11872*0.6

 

Count towards B.L B.L. is calculated using  10000,10500, 11987, 15000 since they count towards B.L.
1st Friday of August 20000 B.L. = 11098

20000 < 11098*1.4

20000 > 11098*0.6

 

Too High B.L. is calculated using  10000,10500, 11987, 15000, 8000 since they count towards B.L.
2nd Friday of August 8000 B.L. = 11098

8000 < 11098*1.4

8000 > 11098*0.6

 

Count towards B.L

B.L. is calculated using  10000,10500, 11987, 15000, 8000 since they count towards B.L. Note that 20000 is skipped because it was an outlier.

b) Some changes have been done to the system

One very important thing to keep in mind is that, consistent high or consistent low values might not be actually outliers at all. You might see these consistent high values because of certain changes that you have done in your system and which caused that change. In this scenario, a new baseline has to be defined because now what you consider as normal has been changed after you made changes to your system!

According to this algorithm, if for 7 consecutive days (a week), you see that there are outliers on the same side, it means that the baseline has to be revised now. On the 8th day it will consider the current value as the new Baseline and goes into the Learning Mode again because obviously that new value is not at all the baseline. That might be an outlier as well. So, the algorithm cannot be sure until it learns again. Now whenever it generates any alert to the user, it will also indicate the user that it is currently running in the learning mode and the generated alert might be a false positive. It will perform the same task as explained above in the "Learning Mode" Section. The only difference is that it will update the baseline each week for a specific day (starting from the day when it switched to Learning Mode) and after 4 weeks (when it has obtained 4 different values for a specific day, lets say Monday) , it will switch back to "Dynamic Mode" of operation.

This algorithm can be slightly improved if the user tells the algorithm when he/she makes any changes to it. With this approach, whenever the user says "I have made changes", it will switch to "Learning Mode" from "Dynamic Mode" and would not have to wait for a period of 7 days as described above.

Conclusions

Using this algorithm, dynamic baselines for traffic for different days can be defined and on the basis of these, alerts could be generated to the user when the behavior of the traffic is abnormal as compared to the calculated baselines.

Acknowledgements

I thank Rainer Gerhards of Adiscon for having detailed discussions with me on this algorithm and also for reviewing this paper. I would also like to thank Sweth Chandramouli for his valuable input.

References

  1. http://www.terraseer.com/csr/help/index.htm#concepts/interquartile_distance.htm
  2. http://www.statcan.ca/english/edu/power/ch12/range.htm

Author's Address

Wajih-ur-Rehman
wrehman@ro1.adiscon.com

Adiscon GmbH
Mozartstrasse 21
97950 Grossrinderfeld
Germany

Disclaimer

The information within this paper may change without notice. Use of this information constitutes acceptance for use in an AS IS condition. There are NO warranties with regard to this information. In no event shall the authors be liable for any damages whatsoever arising out of or in connection with the use or spread of this information. Any use of this information is at the user's own risk.

MonitorWare
 Home
 The Products
MonitorWare Products
Product Comparison
Which one to Purchase?
Order and Pricing
Upgrade Insurance Info
News Releases
Version History
MonitorWare Tools
 Event Repository
 Download
 Reference library
General Information
Step-by-step guides
 - All
 - Installation and Configuration
 - Services related
 - Actions related
 - Central Monitoring
Common Uses
Syslog configuration
Syslog Log Samples
Security Reference
 Help
Support
Manual
FAQ
 - All
 - General questions
 - Configurations related
 - Monitorware Agent
 - Monitorware Console
Articles
Seminars Online
 - All
 - General
 - MonitorWare Console
 - MonitorWare Agent
 - WinSyslog related
 - EventReporter
 Order & pricing
Order now
Product Comparison
Pricing Information
Upgrade Insurance Info
Local Reseller
 Contact Us
 Search
 
 



Printer Version Send this page to a friend

Copyright © 1988-2005 Adiscon GmbH All rights reserved.
Contact us via Secure Web Response | Privacy Policy
Topic Links: syslog | Free Weblinks Directory