Dynamic Self-Learning Traffic Classification Architecture

Introduction

Recent years have seen a dramatic increase in the variety of applications using the Internet. In addition to 'traditional' applications (e.g. email, web or FTP), newer applications such as streaming, gaming and peer-to-peer (P2P) have gained strong momentum. The ability to dynamically identify and classify flows according to their network applications is highly beneficial for:

Trend analyses (estimating the size and origins of capacity demand trends for network planning)
Adaptive, network-based marking of traffic requiring specific Quality of Service (QoS) without direct client-application or end-host involvement
Dynamic access control (adaptive firewalls that can detect forbidden applications, Denial of Service (DoS) attacks or other unwanted traffic)
Lawful Interception (enabling minimally invasive warrants and wire-taps based on statistical summaries of traffic details)
Intrusion detection (detecting suspicious activities related to security breaches caused by malicious users or worms)

The most common traffic identification technique, inspection of ‘known port numbers’, is no longer accurate because many applications no longer use fixed, predictable port numbers. For instance, the work in [1] shows that traffic classification based solely on port numbers identifies only a fraction of the overall P2P traffic.

A more reliable technique used in many current industry products involves stateful reconstruction of session and application information from packet content. Although this technique avoids reliance on fixed port numbers, it imposes significant complexity and processing load on the traffic identification device.

More recently, signature-based methods have been proposed, primarily for classifying P2P traffic. Although these approaches are more efficient than stateful reconstruction and classify better than the port-based approach, they still depend on the application protocol (they require analysis of packet contents).

Previous research used a number of different parameters (features) to describe network traffic including the size and duration of flows, packet length and interarrival time distributions, flow idle times etc. We propose to use Machine Learning (ML) to automatically classify and identify network applications based on these features.

Machine Learning usually refers to systems performing tasks associated with Artificial Intelligence (AI). Such tasks involve recognition, diagnosis, planning, prediction etc. Mitchell [2] defines Machine Learning as follows: "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E". Witten and Frank [3] state: "things learn when they change their behavior in a way that makes them perform better in the future".

Machine Learning can be viewed as a general inductive process that automatically builds a model by learning the inherent structure of a dataset from its characteristics. Over the past decade, Machine Learning has evolved from a field of laboratory demonstrations to a field of significant commercial value [4]. Machine Learning techniques have been very successful in areas such as Data Mining, speech and voice recognition, text recognition and face recognition.

A good model should be descriptive (describe the training data), predictive (generalise well for unseen test data) and explanatory (provide a plausible description).

Architecture

The figure below visualizes the planned architecture. As input we use either previously captured traffic traces or network data captured in real time. Based on the packet data we perform packet classification (assigning packets to flows using IP addresses, TCP or UDP ports, and protocol) and compute the flow attributes (features) of each flow. It may be necessary to limit the number of flows passed to the learning algorithm by randomly sampling flows before the ML stage.
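The flow classification and feature computation described above can be sketched as follows. This is a minimal illustration only, not the project's implementation: the packet tuple layout, the function names (flow_key, group_into_flows, flow_features, sample_flows) and the chosen features are all assumptions made for the example, and flows are treated as unidirectional for simplicity.

```python
import random
import statistics
from collections import defaultdict

# Hypothetical packet record (an assumption for this sketch):
# (timestamp_s, src_ip, dst_ip, src_port, dst_port, protocol, length_bytes)

def flow_key(pkt):
    """Return the 5-tuple that identifies the (unidirectional) flow of a packet."""
    _, src_ip, dst_ip, src_port, dst_port, proto, _ = pkt
    return (src_ip, dst_ip, src_port, dst_port, proto)

def group_into_flows(packets):
    """Assign every packet to its flow."""
    flows = defaultdict(list)
    for pkt in packets:
        flows[flow_key(pkt)].append(pkt)
    return flows

def flow_features(pkts):
    """Compute a few simple per-flow attributes (illustrative selection)."""
    pkts = sorted(pkts, key=lambda p: p[0])
    times = [p[0] for p in pkts]
    lengths = [p[6] for p in pkts]
    gaps = [b - a for a, b in zip(times, times[1:])]
    return {
        "packets": len(pkts),
        "bytes": sum(lengths),
        "duration": times[-1] - times[0],
        "mean_length": statistics.mean(lengths),
        # A single-packet flow has no interarrival times.
        "mean_interarrival": statistics.mean(gaps) if gaps else 0.0,
    }

def sample_flows(flows, max_flows, seed=0):
    """Randomly keep at most max_flows flows before passing them to the learner."""
    keys = sorted(flows)
    if len(keys) > max_flows:
        keys = random.Random(seed).sample(keys, max_flows)
    return {k: flows[k] for k in keys}

packets = [
    (0.0, "10.0.0.1", "10.0.0.2", 40000, 80, "TCP", 100),
    (0.5, "10.0.0.1", "10.0.0.2", 40000, 80, "TCP", 300),
    (1.0, "10.0.0.1", "10.0.0.2", 40000, 80, "TCP", 200),
    (0.2, "10.0.0.3", "10.0.0.4", 5000, 53, "UDP", 60),
]
flows = group_into_flows(packets)
features = {key: flow_features(pkts) for key, pkts in flows.items()}
sampled = sample_flows(flows, max_flows=1)
```

A real deployment would likely also need flow timeouts and bidirectional flow matching, which this sketch leaves out.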

Then the flow characteristics and a model of the flow attributes are used to learn the classes (1). The attribute model depends entirely on the ML algorithm used; in the extreme case, a model might not be needed at all. Once the classes have been learned, new flows can be classified based on their attributes, the attribute model and the learned classes (2). The results of the learning and classification processes are exported solely for further analysis or evaluation; the classification results would then be the input for, e.g., trend analysis or QoS mapping.
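The two steps above, learning the classes (1) and classifying new flows (2), can be illustrated with a small sketch. The text does not name a specific ML algorithm, so the example swaps in a plain k-means clustering with nearest-centroid classification purely as a stand-in. The function names, the two features and the sample values are hypothetical.

```python
import math

def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest(centroids, point):
    """Index of the centroid closest to the given point."""
    return min(range(len(centroids)), key=lambda i: distance(centroids[i], point))

def learn_classes(points, k, iterations=20):
    """Step (1): learn k classes from flow feature vectors with k-means."""
    # Deterministic initialisation: the first k points are the initial centroids.
    centroids = [list(p) for p in points[:k]]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(centroids, p)].append(p)
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

def classify(centroids, point):
    """Step (2): assign a new flow to the closest learned class."""
    return nearest(centroids, point)

# Hypothetical flow features: (mean packet length in bytes, mean interarrival time in s)
training = [
    (100.0, 0.02), (1400.0, 0.5), (120.0, 0.03),
    (1350.0, 0.4), (90.0, 0.01), (1450.0, 0.6),
]
classes = learn_classes(training, k=2)
small_flow_class = classify(classes, (110.0, 0.02))
large_flow_class = classify(classes, (1300.0, 0.45))
```

The features are used unscaled here, so packet length dominates the distance; with real data the attributes would normally be normalised first. In this stand-in the learned centroids play the role of the learned classes, and no separate attribute model is needed.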

[Figure: architecture]

References

[1] Thomas Karagiannis, Andre Broido, Nevil Brownlee, kc claffy, “Is P2P dying or just hiding?”, Proceedings of Globecom 2004, November/December 2004.

[2] Tom M. Mitchell, “Machine Learning”, McGraw-Hill Education (ISE Editions), December 1997.

[3] Ian H. Witten, Eibe Frank, “Data Mining: Practical Machine Learning Tools and Techniques (Second Edition)”, Morgan Kaufmann, June 2005.

[4] Tom M. Mitchell, “Does Machine Learning Really Work?”, AI Magazine 18(3), pp. 11-20, 1997.
