Dynamic Self-Learning Traffic Classification Architecture
Introduction
Recent years have seen a dramatic increase in the variety of
applications using the Internet. In addition to 'traditional'
applications (e.g. email, web or FTP), new applications (e.g. streaming,
gaming or peer-to-peer (P2P)) have gained strong momentum. The
ability to dynamically identify and classify flows according to their
network applications is highly beneficial for:
- Trend analysis (estimating the size and origins of capacity demand
  trends for network planning)
- Adaptive, network-based marking of traffic requiring specific Quality
  of Service (QoS) without direct client-application or end-host
  involvement
- Dynamic access control (adaptive firewalls that can detect forbidden
  applications, Denial of Service (DoS) attacks or other unwanted
  traffic)
- Lawful interception (enabling minimally invasive warrants and
  wire-taps based on statistical summaries of traffic details)
- Intrusion detection (detecting suspicious activities related to
  security breaches due to malicious users or worms)
The most common traffic identification technique, based on the
inspection of ‘known port numbers’, is no longer accurate, as many
applications no longer use fixed, predictable port numbers. For
instance, the work in [1] shows that traffic classification based
solely on port numbers identifies only a fraction of the overall P2P
traffic.
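To make the limitation concrete, the following is a minimal sketch of port-based identification. The port table and the function name `classify_by_port` are assumptions made for illustration and are not part of the architecture described here; only three well-known port assignments are listed.

```python
# Illustrative port table (an assumption for this sketch); a real
# classifier would use a far larger registry of assigned ports.
WELL_KNOWN_PORTS = {21: "ftp", 25: "smtp", 80: "http"}


def classify_by_port(src_port, dst_port):
    """Return the application name for a flow, or 'unknown' if
    neither port appears in the table."""
    # Check the destination port first, then the source port.
    for port in (dst_port, src_port):
        if port in WELL_KNOWN_PORTS:
            return WELL_KNOWN_PORTS[port]
    return "unknown"


# A web flow to port 80 is identified.
print(classify_by_port(51234, 80))      # prints "http"
# A flow between two dynamically chosen ports cannot be identified.
print(classify_by_port(51234, 40321))   # prints "unknown"
```

Any application that negotiates its ports dynamically falls into the second case, which is why a classifier of this kind sees only part of such traffic.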
A more reliable technique used in many current
industry products involves stateful reconstruction of session and
application information from packet content. Although this technique
avoids reliance on fixed port numbers, it imposes significant
complexity and processing load on the traffic identification device.
More recently, signature-based methods have been
proposed, primarily for classifying P2P traffic. Although these
approaches are more efficient than stateful reconstruction and provide
better classification than the port-based approach, they still depend
on the application protocol (requiring analysis of packet contents).
Previous research has used a number of different
parameters (features) to describe network traffic, including the size
and duration of flows, packet length and interarrival time
distributions, and flow idle times. We propose to use Machine Learning
(ML) to automatically classify and identify network applications based
on these features.
Machine Learning (ML) usually refers to systems
performing tasks associated with Artificial Intelligence (AI). Such
tasks involve recognition, diagnosis, planning, prediction etc.
Mitchell [2] defines Machine Learning as follows: "A computer program
is said to learn from experience E with respect to some class of tasks
T and performance measure P, if its performance at tasks in T, as
measured by P, improves with experience E". Witten and Frank
[3] state: "things learn when they change their behavior in a way that
makes them perform better in the future".
Machine Learning can be viewed as a general inductive process that
automatically builds a model by learning the inherent structure of a
dataset from its characteristics. Over the past decade, Machine
Learning has evolved from a field of laboratory demonstrations to a
field of significant commercial value [4]. Machine Learning techniques
have been very successful in the areas of Data Mining, speech and voice
recognition, text recognition, face recognition etc.
A good model should be descriptive (describe the training
data), predictive (generalise well for unseen test data) and
explanatory (provide a plausible description).
Architecture
The figure below
visualizes the planned architecture. As input we use either traffic
traces or network data captured in real time. From the packet data
we classify packets into flows using IP addresses, TCP or UDP ports,
and protocol, and compute the flow attributes (features). It may be
necessary to limit the number of flows passed to the learning algorithm
by randomly sampling flows before the learning step.
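The packet-to-flow step described above can be sketched as follows. This is an illustration under stated assumptions, not the planned implementation: packets are assumed to be dictionaries with the hypothetical fields `ts`, `src`, `dst`, `sport`, `dport`, `proto` and `len`, flows are treated as unidirectional, and the chosen features (packet count, byte count, duration, mean packet length, mean interarrival time) are only a subset of those mentioned in the introduction.

```python
import random
import statistics
from collections import defaultdict


def flow_key(pkt):
    """Flow identifier: addresses, ports and protocol (unidirectional)."""
    return (pkt["src"], pkt["dst"], pkt["sport"], pkt["dport"], pkt["proto"])


def compute_flow_features(packets):
    """Group packets into flows and compute per-flow attributes."""
    flows = defaultdict(list)
    for pkt in packets:
        flows[flow_key(pkt)].append(pkt)

    features = {}
    for key, pkts in flows.items():
        pkts.sort(key=lambda p: p["ts"])
        lengths = [p["len"] for p in pkts]
        # Interarrival times between consecutive packets of the flow.
        gaps = [b["ts"] - a["ts"] for a, b in zip(pkts, pkts[1:])]
        features[key] = {
            "packets": len(pkts),
            "bytes": sum(lengths),
            "duration": pkts[-1]["ts"] - pkts[0]["ts"],
            "mean_len": statistics.mean(lengths),
            "mean_iat": statistics.mean(gaps) if gaps else 0.0,
        }
    return features


def sample_flows(features, max_flows, seed=0):
    """Randomly keep at most max_flows flows for the learning step."""
    keys = list(features)
    if len(keys) <= max_flows:
        return dict(features)
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    return {k: features[k] for k in rng.sample(keys, max_flows)}
```

A bidirectional flow definition would additionally need to map both directions of a connection onto the same key, for example by ordering the two endpoints.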
Then the flow characteristics and a model of the
flow attributes are used to learn the classes (1). The model of the
attributes depends entirely on the ML algorithm used. In the extreme
case a model might not be needed at all. Once the classes have been
learned, new flows can be classified based on their attributes, the
attribute model and the learned classes (2). The results of the
learning and classification process are exported for further analysis
or evaluation only. The classification results would, for example, be
the input for trend analysis or QoS mapping.
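The two steps, learning the classes (1) and classifying new flows (2), can be sketched as follows. The text above does not name an ML algorithm, so this sketch substitutes a simple nearest-centroid classifier purely as a stand-in, and it assumes that labelled training flows are available. The function names, the two features (mean packet length in bytes, mean interarrival time in seconds), the class labels and all numeric values are invented for illustration.

```python
import math
from collections import defaultdict


def learn_classes(training_flows):
    """Step (1): learn one centroid per class.

    training_flows is a list of (feature_vector, label) pairs.
    Returns a dict mapping each label to its mean feature vector.
    """
    grouped = defaultdict(list)
    for vector, label in training_flows:
        grouped[label].append(vector)
    return {
        label: [sum(column) / len(vectors) for column in zip(*vectors)]
        for label, vectors in grouped.items()
    }


def classify_flow(vector, centroids):
    """Step (2): assign a new flow to the class with the nearest
    centroid, using Euclidean distance."""
    return min(centroids, key=lambda label: math.dist(vector, centroids[label]))


# Made-up training data: [mean packet length, mean interarrival time].
training = [
    ([1200.0, 0.01], "bulk"),
    ([1300.0, 0.02], "bulk"),
    ([80.0, 0.5], "interactive"),
    ([100.0, 0.7], "interactive"),
]
centroids = learn_classes(training)
print(classify_flow([1100.0, 0.05], centroids))
print(classify_flow([120.0, 0.4], centroids))
```

Because the features are not normalised here, the packet length dominates the distance; a real deployment would scale the attributes or use an algorithm whose attribute model accounts for their different ranges.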
References
[1] Thomas Karagiannis, Andre Broido, Nevil Brownlee, kc claffy, “Is P2P
dying or just hiding?”, Proceedings of Globecom 2004, November/December
2004.
[2] Tom M. Mitchell, “Machine Learning”, McGraw-Hill Education (ISE
Editions), December 1997.
[3] Ian H. Witten, Eibe Frank, “Data Mining: Practical Machine Learning
Tools and Techniques (Second Edition)”, Morgan Kaufmann, June 2005.
[4] Tom M. Mitchell, “Does Machine Learning Really Work?”, AI Magazine
18(3), pp. 11-20, 1997.