Anomaly detection in network traffic with a relational clustering criterion

07/11/2017
Authors: Damien Nogues
Publication: GSI2017
DOI: 10.23723/17410/22607
OAI : oai:www.see.asso.fr:17410:22607

Abstract

Unsupervised anomaly detection is a very promising technique for intrusion detection. Among many other approaches, clustering algorithms have often been used to perform this task. However, network traffic is commonly described with both numerical and categorical variables, and most clustering algorithms are not well suited to such data. Few clustering algorithms have been proposed for heterogeneous data, and many of those have unsuitable computational complexity.
In this article, using Relational Analysis, we propose a new, unified clustering criterion. The criterion is based on a new similarity function for values in a lattice, which can be applied to both numerical and categorical variables. Finally, we propose an optimisation heuristic for this criterion and an anomaly score that outperforms many state-of-the-art solutions.


Licence

Creative Commons: None (all rights reserved)


Anomaly detection in network traffic with a relational clustering criterion

Damien Nogues¹,²
¹ Thales
² Université Pierre et Marie Curie, Complex Networks

Abstract. Unsupervised anomaly detection is a very promising technique for intrusion detection. Among many other approaches, clustering algorithms have often been used to perform this task. However, network traffic is commonly described with both numerical and categorical variables, and most clustering algorithms are not well suited to such data. Few clustering algorithms have been proposed for heterogeneous data, and many of those have unsuitable computational complexity. In this article, using Relational Analysis, we propose a new, unified clustering criterion. The criterion is based on a new similarity function for values in a lattice, which can be applied to both numerical and categorical variables. Finally, we propose an optimisation heuristic for this criterion and an anomaly score that outperforms many state-of-the-art solutions.

1 Anomaly detection using clustering: motivations

To detect breaches in information systems, most organisations rely on traditional approaches such as anti-viruses and firewalls. However, these approaches are often straightforward to evade, and many techniques designed to circumvent detection have been proposed in the literature [17, 8].

For this reason, new security systems have been developed: Intrusion Detection Systems (IDS). At their core, these systems act as a system-wide alarm. But most IDS use a misuse detection technique, meaning they rely on some sort of signature database to perform the detection. Unfortunately, signatures cannot detect new or modified attacks, so such systems are often insufficient against advanced, targeted attacks.

This is why anomaly-based IDS have been proposed. Furthermore, due to the prohibitive cost of labelling a training dataset of sufficient size, unsupervised anomaly detection is often preferred in intrusion detection. In unsupervised anomaly detection, some hypotheses are necessary: first, normal behaviour must be far more frequent than anomalies; second, anomalies and normal events must be structurally different.

Many anomaly detection techniques have been proposed. Interested readers are encouraged to consult the very complete survey [4] and references therein. In this paper, we focus solely on clustering-based anomaly detection. These techniques usually work in the following way: first the data is partitioned using a clustering algorithm, then large clusters are used to define the normal state of the system, which makes it possible to decide which small clusters are composed of anomalies.

However, most clustering algorithms are not well suited for intrusion detection. Some do not offer sufficient computational performance. Many algorithms are unable to create the very small clusters which contain the anomalies, especially when the number of clusters is a given parameter. And most are not able to handle both qualitative and numerical data. To address those issues, we developed a new clustering algorithm, able to handle heterogeneous data and particularly efficient for anomaly detection.

2 A new clustering criterion
Events used in intrusion detection are often described using both qualitative and numerical data. For example, network flows are described with numerical features, such as the duration of the communication and the quantity of exchanged information, and qualitative features, such as the protocol and port numbers used. But most clustering algorithms are not designed to handle heterogeneous data. Most algorithms, starting with the most common, k-means [15], rely on distances and are therefore only able to handle numerical data.

Some approaches are designed for categorical data, but relatively few tackle heterogeneous data. Most of the time, either the numerical data is discretized, or numerical values are assigned to the modalities of the categorical data. These approaches lead to either loss or addition of information, making them not really satisfactory. Our approach, however, relies on a unified notion of similarity and dissimilarity, defined on the values of lattices, which can be applied to both types of variables. In a way, it is similar to the approach introduced in [3], called "triordonnances". Unfortunately, that approach has cubic complexity, making it impractical.

2.1 Intuitive definition

Defining a clustering criterion using similarities and dissimilarities is quite straightforward. Often, the intuitive definition given for the goal of clustering is the following: clustering is the task of partitioning a set of objects so that objects within the same cluster are more similar to each other than they are to objects belonging to different clusters.

To define our clustering criterion, we will use the theory of Relational Analysis. This theory consists in representing relations between objects with a binary coding; readers unfamiliar with it can refer to [11, 12]. A clustering, being nothing more than a partition of the object set, can be thought of as the equivalence relation "being in the same cluster", represented with the following matrix:

$$X_{i,j} = \begin{cases} 1 & \text{if } o_i \text{ and } o_j \text{ belong to the same cluster} \\ 0 & \text{otherwise} \end{cases}$$

Using the formalism of Relational Analysis, the intuitive definition above can easily be translated into a mathematical formula. Let us assume we have n objects o_1, ..., o_n, and let us consider two functions s and s̄ which measure the similarity (resp. the dissimilarity) between pairs of objects. Let us also define the similarity matrix S (resp. the dissimilarity matrix S̄) with S_{i,j} = s(o_i, o_j) and S̄_{i,j} = s̄(o_i, o_j). We can now consider the following score for any given value of α:

$$Q_\alpha(X) = \sum_{1 \le i,j \le n} \alpha\, S_{i,j} X_{i,j} + (1-\alpha)\, \bar{S}_{i,j} \bar{X}_{i,j}, \quad \text{with } 0 \le \alpha \le 1 \qquad (1)$$

assuming the following constraints are satisfied:
– ∀i, j: X_{i,j} ∈ {0, 1}
– ∀i, j: X̄_{i,j} = 1 − X_{i,j}
– ∀i: X_{i,i} = 1 (reflexive relation)
– ∀i, j: X_{i,j} = X_{j,i} (symmetrical relation)
– ∀i, j, k: X_{i,j} + X_{j,k} − X_{i,k} ≤ 1 (transitive relation)

These constraints ensure we obtain an equivalence relation. The criterion is quite straightforward to analyse: $\sum_{1 \le i,j \le n} S_{i,j} X_{i,j}$ is the sum of similarities between objects belonging to the same cluster, and $\sum_{1 \le i,j \le n} \bar{S}_{i,j} \bar{X}_{i,j}$ is the sum of dissimilarities between objects from distinct clusters; these are the quantities we wish to maximize. Our clustering algorithm will then consist of a maximisation heuristic for Q_α(X).
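As a concrete illustration, the following sketch evaluates the criterion of Eq. (1) for a given partition. The function name and the representation of the partition as a label vector are our own illustrative choices, not the authors' implementation.

```python
import numpy as np

def q_alpha(S, S_bar, labels, alpha=0.5):
    """Evaluate the relational clustering criterion Q_alpha(X) of Eq. (1).

    S, S_bar : (n, n) similarity and dissimilarity matrices.
    labels   : length-n array, labels[i] being the cluster of object o_i.
    """
    labels = np.asarray(labels)
    # X[i, j] = 1 iff o_i and o_j share a cluster; built this way, X is
    # automatically reflexive, symmetric and transitive.
    X = (labels[:, None] == labels[None, :]).astype(float)
    # Within-cluster similarity plus between-cluster dissimilarity.
    return alpha * np.sum(S * X) + (1.0 - alpha) * np.sum(S_bar * (1.0 - X))
```

Note how α sets the granularity of the optimum: with α = 1 a single all-encompassing cluster maximises the score, with α = 0 singletons do, and intermediate values balance the two terms.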
We need only define the similarity and dissimilarity functions to obtain our criterion. This will be done with a single definition of similarity. The main difference between categorical and numerical values is the notion of order: there is a natural order on numerical values, but it makes no sense to compare two modalities of a categorical variable. This is why we use a special case of partially ordered set, lattices, to define our criterion.

2.2 Similarity and dissimilarity on a lattice

To obtain the similarity between objects described with both numerical and categorical values, we use a unified definition for values belonging to a lattice. A lattice is a partially ordered set (L, ≤) in which every pair of elements (x, y) has a unique least upper bound x ∨ y and a unique greatest lower bound x ∧ y. That is, for all x, y ∈ L there exist x ∧ y ∈ L and x ∨ y ∈ L such that:
– x ∧ y ≤ x ≤ x ∨ y
– x ∧ y ≤ y ≤ x ∨ y
– if z ≤ x and z ≤ y then z ≤ x ∧ y (resp. if x ≤ z and y ≤ z then x ∨ y ≤ z)

Using these notions we can generalise the notion of interval. Let us define I_{a,b}, the interval defined by two elements a and b of a lattice L, as

$$I_{a,b} = \{ x \in L,\ a \wedge b \le x \le a \vee b \}$$

and let us consider a finite sequence of n elements taking its values in L. The number of elements of the sequence belonging to the interval I_{a,b} is noted n_{a,b}. With these notations, the similarity between two elements a and b of the sequence is defined as:

$$S_{a,b} = 1 - \frac{n_{a,b}}{n}$$

This similarity function satisfies the intuitive properties of a similarity: it takes its values between 0 and 1, and any object is more similar to itself than it is to other objects: ∀a ≠ b, S_{a,a} > S_{a,b}. The function is also particularly interesting for anomaly detection, as it gives greater importance to values which define a sparse interval.

Using this similarity we can also define a dissimilarity function. The less similar two objects are, the more dissimilar they should be; the dissimilarity should be positive; and we may also wish the dissimilarity of an object with itself to be 0, so that the dissimilarity is a semi-metric. For those reasons,

$$\bar{S}_{a,b} = \frac{S_{a,a} + S_{b,b}}{2} - S_{a,b}$$

has been chosen as the dissimilarity function.

2.3 Similarity on heterogeneous data

With these notions we can now define similarity and dissimilarity functions for objects represented by heterogeneous variables. Each object is defined by several variables, so the similarity (resp. dissimilarity) between two objects is simply the sum of the similarities (resp. dissimilarities) over the variables. Let us now see how our definition of similarity on lattices can be applied to qualitative and numerical values; to that end, we show how lattices can be created from our variables.

Similarity on numerical variables. For numerical variables, the values of the variable and their natural order define a lattice. With this order, we simply obtain a ∨ b = max(a, b) and a ∧ b = min(a, b). It follows that if a variable takes the values a and b with a ≤ b, then n_{a,b} is the number of objects whose value for this variable lies in [a, b], and S_{a,b} is the proportion of objects whose value does not lie in [a, b].

This similarity offers several advantages. We do not use the actual values of the variables, only their relative order, so normalisation of the values is unnecessary. Unlike distance-based similarities, our similarity is not dominated by a few aberrant values. Finally, it also allows us to treat "ordinal" variables, such as a variable taking the values "low", "medium" and "high", exactly the same way as numerical variables.
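Here is a minimal sketch of this computation for a single numerical variable (the function names and the vectorisation are ours); only the order of the values is used, through binary searches on the sorted sample.

```python
import numpy as np

def numerical_similarity(values):
    """Pairwise similarities S[a, b] = 1 - n_{a,b}/n for one numerical variable,
    where n_{a,b} counts the observations in [min(a, b), max(a, b)]."""
    v = np.asarray(values, dtype=float)
    n = len(v)
    sorted_v = np.sort(v)
    lo = np.minimum(v[:, None], v[None, :])   # a ^ b = min(a, b)
    hi = np.maximum(v[:, None], v[None, :])   # a v b = max(a, b)
    # Number of observations falling in each interval [lo, hi].
    n_ab = (np.searchsorted(sorted_v, hi, side="right")
            - np.searchsorted(sorted_v, lo, side="left"))
    return 1.0 - n_ab / n

def dissimilarity(S):
    """The semi-metric of Section 2.2: (S_aa + S_bb)/2 - S_ab."""
    d = np.diag(S)
    return (d[:, None] + d[None, :]) / 2.0 - S
```

Materialising all n² pairs like this is only viable for small samples; Section 3.1 notes that the actual algorithm relies on an equivalent formula for the criterion to reach empirically linear complexity.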
Similarity on categorical variables. Comparing modalities of a categorical variable is meaningless, and this is reflected in the lattice we construct. For a variable whose set of modalities is U, consider the lattice (U ∪ {⊤, ⊥}, ≤) in which the values of U are pairwise incomparable, i.e. ∀a, b ∈ U, if a ≠ b then a ≰ b and b ≰ a, and ∀a ∈ U, ⊥ ≤ a ≤ ⊤. A graphical representation of such a lattice is given in Figure 1.

[Figure 1: Hasse diagram of the lattice associated with a categorical variable whose modalities are a, b, c, d, ...: ⊤ on top, ⊥ at the bottom, and the pairwise incomparable modalities in between.]

It follows that if a ≠ b, then I_{a,b} = I_{⊥,⊤}, hence n_{a,b} = n, the total number of objects. Also, n_{a,a} = n_a, the number of objects whose variable takes the modality a. The similarity function becomes:

$$S_{a,b} = \begin{cases} 1 - \frac{n_a}{n} & \text{if } a = b \\ 0 & \text{otherwise} \end{cases}$$
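The corresponding computation is straightforward; as before, the helper below is an illustrative sketch rather than the authors' code.

```python
import numpy as np

def categorical_similarity(values):
    """Pairwise similarities for one categorical variable:
    S[a, b] = 1 - n_a/n if a == b, and 0 otherwise."""
    v = np.asarray(values)
    n = len(v)
    # Map each modality to an integer id and count its frequency n_a.
    _, ids, counts = np.unique(v, return_inverse=True, return_counts=True)
    same = ids[:, None] == ids[None, :]
    self_sim = 1.0 - counts[ids] / n   # 1 - n_a/n for each object's modality
    return np.where(same, self_sim[:, None], 0.0)
```

Rare modalities thus have self-similarity close to 1, which is exactly what lets the criterion isolate infrequent, potentially anomalous values in small clusters.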
3 Optimisation of the criterion and anomaly detection

Having defined a new clustering criterion, we now present the optimisation heuristic we developed for it. As the aim of this work is anomaly detection, we also propose an anomaly score for the resulting clusters which offers good performance.

3.1 The optimisation procedure, a gradient descent

Computing the best clustering for a given criterion is a difficult task, and an exhaustive approach swiftly becomes infeasible: the number of partitions of a set of n elements is B_n, the n-th Bell number,

$$B_n = e^{-1} \sum_{k=1}^{\infty} \frac{k^n}{k!}$$

This value is gigantic even for relatively small values of n. For example, B_10 = 115975, and B_75 > 10^80, the estimated number of atoms in the observable universe. This explains why optimisation heuristics are usually employed.

Here, we propose a version of gradient descent [2] in the lattice of sub-partitions. Without going into detail, our approach looks for a good clustering by either merging two clusters or splitting a cluster into two sub-clusters. These operations are performed whenever they improve the criterion, until a local maximum has been found. Details about the algorithm can be found in [13].

This simple heuristic performs well. On small sets where the optimal solution is known, such as the "felin" or "iris" datasets, it finds the best solution. Also, by using small optimisations and an equivalent formula for the criterion, it achieves empirically linear complexity. Computation times are shown in Figure 2.

[Figure 2: Computational performance of our algorithm. (a) Mean computation time of the clustering algorithm (over 100 runs) as a function of the number of objects, for α = 0.6 (7.16 clusters) and α = 0.65 (2.75 clusters). Since the computation time depends on the structure of the computed clusters, the clustered data consists of multiple copies of the "German Credit" dataset [10] (1000 elements), replicated from 100 times (10^5 elements) to 1000 times (10^6 elements). (b) Mean computation time (over 100 runs) as a function of the number of variables, for α = 0.5, on randomly generated data.]

3.2 Anomaly score and performances

Our goal is to perform anomaly detection on network traffic using the computed clustering. As mentioned previously, we assume that normal traffic represents a broad majority of our data, which leads us to suppose that large clusters are composed of normal events. So, to detect anomalous clusters, we look for clusters that are highly dissimilar to the large ones.

The anomaly score used to assess whether a cluster is anomalous is a weighted sum of the mean dissimilarities between this cluster and the others. Formally, if C = C_1, ..., C_p is the computed clustering and n_1, ..., n_p are the sizes of those clusters, the mean dissimilarity between C_i and C_j is $\frac{1}{n_i n_j} \sum_{a \in C_i} \sum_{b \in C_j} \bar{S}_{a,b}$. The anomaly score for cluster C_i is then

$$A_i = \sum_{j \ne i} n_j \cdot \frac{1}{n_i n_j} \sum_{a \in C_i} \sum_{b \in C_j} \bar{S}_{a,b} \qquad (2)$$

$$= \frac{1}{n_i} \sum_{j \ne i} \sum_{a \in C_i} \sum_{b \in C_j} \bar{S}_{a,b} \qquad (3)$$
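A direct implementation of Eq. (3) is short; as before, the function below is an illustrative sketch with names of our choosing.

```python
import numpy as np

def anomaly_scores(S_bar, labels):
    """Anomaly score A_i of Eq. (3) for each cluster: the size-weighted mean
    dissimilarity between cluster C_i and all the other clusters."""
    labels = np.asarray(labels)
    scores = {}
    for c in np.unique(labels):
        inside = labels == c
        n_i = inside.sum()
        # Sum the dissimilarities between members of C_i and every object
        # outside C_i, then normalise by the cluster size n_i.
        scores[c] = S_bar[inside][:, ~inside].sum() / n_i
    return scores
```

Clusters are then ranked by A_i, and the highest-scoring (typically small) clusters are flagged as anomalous; sweeping a cut-off on this score is what produces the ROC curves evaluated below.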
Using this anomaly score we obtain excellent performance. To assess our approach we used subsets of the KDD 99 dataset; selecting subsets is necessary because Denial of Service (DoS) attacks represent roughly 80% of the records. For comparison, the performance of our approach is detailed in Tables 1a and 1b.

Table 1a: Comparison of the Areas Under the ROC Curve (AUC) obtained with various methods. The values and the dataset-selection method are taken from [9]: the KDD 99 dataset, from which the two most frequent denial-of-service attacks, "smurf" and "neptune", have been removed.

  Algorithm                     AUC
  our method                    0.990
  Modified clustering-TV [14]   0.973
  SVM [5]                       0.949
  Fixed-width clustering [5]    0.940
  K-NN [5]                      0.895
  fpMafia [9]                   0.867

Table 1b: Comparison of the Areas Under the ROC Curve (AUC) obtained with various methods. The values and the dataset-selection method are taken from [6]: the KDD 99 dataset, from which all but the User-to-Root attacks have been removed from the test dataset, leaving 60593 benign events and 228 attacks.

  Algorithm             AUC
  our method            0.994
  MLKR [6]              0.983
  LPF [16]              0.98
  RKOF [7]              ≤ 0.962
  Active learning [1]   0.935

To measure these performances, we used the area under the ROC curve (AUC). The ROC curve is created by plotting the true positive rate against the false positive rate for various anomaly-score cut-off values; a perfect classifier would yield an AUC of 1. On two different subsets of the dataset used in the literature, thanks to its ability to create small clusters of attacks, our method outperforms state-of-the-art techniques.

References

1. Naoki Abe, Bianca Zadrozny, and John Langford. Outlier detection by active learning. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 504–509. ACM, 2006.
2. Augustin Cauchy. Méthode générale pour la résolution des systèmes d'équations simultanées. Comptes rendus hebdomadaires des séances de l'Académie des sciences, 25:536–538, 1847.
3. Said Chah. Comparaisons par triplets en classification automatique. Revue de statistique appliquée, 34(1):61–79, 1986.
4. Varun Chandola, Arindam Banerjee, and Vipin Kumar. Anomaly detection: A survey. ACM Computing Surveys (CSUR), 41(3):15, 2009.
5. Eleazar Eskin, Andrew Arnold, Michael Prerau, Leonid Portnoy, and Sal Stolfo. A geometric framework for unsupervised anomaly detection. In Applications of Data Mining in Computer Security, pages 77–101. Springer, 2002.
6. Jun Gao, Weiming Hu, Wei Li, Zhongfei Zhang, and Ou Wu. Local outlier detection based on kernel regression. In Pattern Recognition (ICPR), 2010 20th International Conference on, pages 585–588. IEEE, 2010.
7. Jun Gao, Weiming Hu, Zhongfei Mark Zhang, Xiaoqin Zhang, and Ou Wu. RKOF: Robust kernel-based local outlier detection. In Advances in Knowledge Discovery and Data Mining, pages 270–283. Springer, 2011.
8. Mark Handley, Vern Paxson, and Christian Kreibich. Network intrusion detection: Evasion, traffic normalization, and end-to-end protocol semantics. In USENIX Security Symposium, pages 115–131, 2001.
9. Kingsly Leung and Christopher Leckie. Unsupervised anomaly detection in network intrusion detection using clusters. In Proceedings of the Twenty-eighth Australasian Conference on Computer Science, Volume 38, pages 333–342. Australian Computer Society, Inc., 2005.
10. Moshe Lichman. UCI machine learning repository, 2013.
11. Jean-François Marcotorchino and Pierre Michaud. Optimisation en analyse ordinale des données. Masson, 1979.
12. Jean-François Marcotorchino and Pierre Michaud. Heuristic approach of the similarity aggregation problem. Methods of Operations Research, 43:395–404, 1981.
13. Damien Nogues. Method for unsupervised classification of a plurality of objects and device for unsupervised classification associated with said method, 2015. EP Patent App. EP20,140,200,529.
14. Joshua Oldmeadow, Siddarth Ravinutala, and Christopher Leckie. Adaptive clustering for network intrusion detection. In Advances in Knowledge Discovery and Data Mining, pages 255–259. Springer, 2004.
15. Hugo Steinhaus. Sur la division des corps matériels en parties. Bull. Acad. Polon. Sci. Cl. III, 4:801–804, 1956.
16. Jian Yang, Ning Zhong, Yiyu Yao, and Jue Wang. Local peculiarity factor and its application in outlier detection. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '08, pages 776–784, New York, NY, USA, 2008. ACM.
17. Ilsun You and Kangbin Yim. Malware obfuscation techniques: A brief survey. In 2010 International Conference on Broadband, Wireless Computing, Communication and Applications, pages 297–300. IEEE, 2010.