Adithya Gangadhar Shetty
Aug 11, 2021

K-means Clustering

The k-means algorithm searches for a predetermined number of clusters in an unlabelled multidimensional dataset. It does this using a simple notion of what an optimal clustering looks like.

That notion rests on two assumptions:

  • Firstly, the cluster centre is the arithmetic mean (AM) of all the data points belonging to the cluster.
  • Secondly, each point is closer to its own cluster centre than to the other cluster centres.

These two assumptions are the foundation of the k-means clustering model.

You can think of the centre as a point that represents the mean of the cluster; it need not be a member of the dataset.
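A tiny illustration of that last point, using a made-up cluster of three 2-D points (the values are hypothetical and chosen only to keep the arithmetic easy):

```python
from statistics import mean

# Hypothetical 2-D cluster of three points.
cluster = [(0.0, 0.0), (2.0, 0.0), (1.0, 3.0)]

# The centre is the coordinate-wise arithmetic mean of the members.
centroid = tuple(mean(axis) for axis in zip(*cluster))

print(centroid)             # (1.0, 1.0)
print(centroid in cluster)  # False: the centre is not one of the data points
```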

In simple terms, k-means clustering groups data into several clusters by discovering the distinct categories in an unlabelled dataset on its own, without any need for labelled training data.

This is a centroid-based algorithm: each cluster is associated with a centroid, and the objective is to minimize the sum of squared distances between the data points and the centroids of their assigned clusters.
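The objective can be made concrete with a short sketch. It assumes the usual formulation, in which the quantity being minimized is the sum of squared Euclidean distances from each point to the centroid of its cluster. The function names and the sample points are illustrative, not taken from any library.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def within_cluster_sum_of_squares(points, labels, centroids):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(squared_distance(p, centroids[k]) for p, k in zip(points, labels))


# Hypothetical data: two obvious groups and one centroid per group.
points = [(1.0, 1.0), (1.0, 3.0), (8.0, 8.0), (10.0, 8.0)]
centroids = [(1.0, 2.0), (9.0, 8.0)]

good = within_cluster_sum_of_squares(points, [0, 0, 1, 1], centroids)
bad = within_cluster_sum_of_squares(points, [0, 1, 0, 1], centroids)

print(good)  # 4.0
print(bad)   # 176.0
```

The assignment that respects the two groups gives a much smaller value, which is exactly what the algorithm is trying to achieve.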

The algorithm takes an unlabelled dataset as input, splits it into k clusters, and repeats the process until it finds the best clusters. The value of k must be chosen in advance.

Specifically, the k-means algorithm performs two tasks:

  • It determines the best positions for the k centre points, or centroids, by an iterative process.
  • It assigns every data point to its nearest centroid; the data points closest to a particular centroid form a cluster.

As a result, the data points within a cluster are similar to one another and far from the other clusters.

How the K-means algorithm works

To process the training data, the K-means algorithm starts with an initial group of randomly selected centroids, which serve as the starting points for the clusters, and then performs iterative (repetitive) calculations to optimize the positions of the centroids.

It stops creating and optimizing clusters when either:

  • The centroids have stabilized: their values no longer change because the clustering has converged.
  • The defined number of iterations has been reached.
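The loop described above can be sketched in plain Python. This is a minimal illustration rather than a production implementation. It assumes Euclidean distance and initial centroids drawn at random from the data, and it keeps a centroid where it is if its cluster becomes empty. The names `k_means`, `assign` and `update` are made up for this example.

```python
import random


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign(points, centroids):
    """Assignment step: index of the nearest centroid for every point."""
    return [
        min(range(len(centroids)), key=lambda j: squared_distance(p, centroids[j]))
        for p in points
    ]


def update(points, labels, centroids):
    """Update step: move each centroid to the mean of its assigned points."""
    new_centroids = []
    for j, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == j]
        if members:
            new_centroids.append(
                tuple(sum(axis) / len(members) for axis in zip(*members))
            )
        else:
            new_centroids.append(old)  # assumption: an empty cluster keeps its centroid
    return new_centroids


def k_means(points, k, max_iterations=100, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # randomly selected starting centroids
    for _ in range(max_iterations):    # stop 2: iteration limit reached
        labels = assign(points, centroids)
        new_centroids = update(points, labels, centroids)
        if new_centroids == centroids:  # stop 1: centroids have stabilized
            break
        centroids = new_centroids
    return centroids, assign(points, centroids)


# Hypothetical data: two well-separated groups of three points each.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, labels = k_means(points, k=2)
print(centroids)
print(labels)
```

Real implementations usually compare centroids with a small tolerance instead of exact equality, and often run several random initializations and keep the best result.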

Network Security Based on K-Means Clustering Algorithm in Data Mining

Abstract. Networks now underpin almost everything, and network security has become one of today’s most urgent social problems. An intrusion detection system is a network security device that monitors network traffic in real time and takes corresponding measures when it detects suspicious transfers. Compared with traditional network security measures, intrusion detection systems have significant advantages: they overcome the shortcomings of purely passive defence and can respond before damage occurs. As a result, the intrusion detection system has become an important part of network security.

Preface. Computer network security has become a chief concern of the information society. As technology develops, network intrusions have become harder to spot and their means of destruction more complex; they are not restricted by time or place, and they do great harm to network security. Detecting and preventing intrusions is therefore a primary problem to solve, and research on intrusion detection systems has become extremely important. Using the k-means clustering algorithm from data mining, this paper studies network security and discusses how to create a secure and harmonious network environment.

An intrusion detection system monitors both software and hardware, and its practical value is high. It has become a main network security management tool: it collects information from different parts of the system and combines detection with automatic response. At its core, an intrusion detection system is a behaviour classifier that judges whether observed activity is intrusive or non-intrusive. Early on, Denning proposed a general intrusion detection system model, which laid a solid foundation for later research on intrusion detection systems.

Data Mining Algorithm

Data mining algorithms include cluster analysis, correlation analysis and classification. A clustering algorithm divides the objects in a dataset into classes of similar objects. Like a classification algorithm, it groups the data, but the groups follow from the algorithm's own definition of similarity, so that objects in the same cluster are highly similar. Cluster analysis is a common method in data mining; it can be used for unsupervised anomaly detection and can address problems with traditional data mining methods. It can be applied to a new database without relying on predetermined data categories or labelled samples. Cluster analysis therefore provides a good basis for building an intrusion detection system.

Establishment of Intrusion Detection Model

To set up a general intrusion detection model, a collection system first gathers connection records during operation, which yields a dataset for cluster analysis. A clustering algorithm then groups the connection records in order to distinguish normal records from abnormal ones. In this study, the k-means algorithm was used for the cluster analysis. The algorithm produces several clusters, each containing some connection records. Based on the properties of its connection records, each cluster can be judged to be either an abnormal cluster or a normal cluster: an abnormal cluster groups abnormal connection records, and a normal cluster groups normal ones. In practice, when labelled data is not available, it is not possible to tell directly whether a connection record is normal or abnormal, so the clusters themselves have to be tagged. Typically a threshold is used: clusters holding more connection records than the threshold are tagged as normal, and the others as abnormal. To classify a connection record with the clustering result, the record is first standardized, then the cluster whose centre is closest to it is found, and the record is classified according to that cluster's tag.
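The tagging and classification steps can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation. The threshold is taken to be a fraction of all connection records, the two features (duration and bytes transferred) are invented, and the cluster assignments are hard-coded as if k-means had already produced them.

```python
from collections import Counter
from statistics import mean, pstdev


def fit_standardizer(records):
    """Per-feature mean and standard deviation (1.0 if a feature is constant)."""
    columns = list(zip(*records))
    return [mean(c) for c in columns], [pstdev(c) or 1.0 for c in columns]


def standardize(record, means, stds):
    return tuple((x - m) / s for x, m, s in zip(record, means, stds))


def tag_clusters(assignments, k, threshold):
    """Assumption: a cluster holding more than `threshold` of all records is normal."""
    counts = Counter(assignments)
    total = len(assignments)
    return ["normal" if counts[j] / total > threshold else "abnormal" for j in range(k)]


def classify(record, centroids, tags, means, stds):
    """Standardize the record, find the nearest centroid, return that cluster's tag."""
    z = standardize(record, means, stds)
    nearest = min(
        range(len(centroids)),
        key=lambda j: sum((a - b) ** 2 for a, b in zip(z, centroids[j])),
    )
    return tags[nearest]


# Hypothetical connection records: (duration in seconds, bytes transferred).
records = [(1.0, 200.0), (2.0, 220.0), (1.0, 180.0), (2.0, 210.0), (1.0, 190.0),
           (2.0, 200.0), (1.0, 210.0), (2.0, 190.0), (60.0, 9000.0), (55.0, 9500.0)]
assignments = [0] * 8 + [1] * 2  # stand-in for the output of a k-means run

means, stds = fit_standardizer(records)
scaled = [standardize(r, means, stds) for r in records]
centroids = [
    tuple(mean(axis) for axis in zip(*[p for p, a in zip(scaled, assignments) if a == j]))
    for j in range(2)
]
tags = tag_clusters(assignments, k=2, threshold=0.5)

print(tags)                                                    # ['normal', 'abnormal']
print(classify((1.5, 205.0), centroids, tags, means, stds))    # normal
print(classify((58.0, 9200.0), centroids, tags, means, stds))  # abnormal
```

Tagging by size relies on the assumption that normal traffic far outnumbers intrusions. When that does not hold, for example during a flooding attack, the tags can come out inverted.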

Uses of K-Means clustering in Security Domain

Because it can categorize data and is simple to implement, many researchers have found K-means useful for anomaly detection, outlier detection, fraud detection and similar tasks.

Let us briefly look at these kinds of detection first:

Anomaly detection: Detection of behaviour in the system that does not conform to the expected behaviour.

Outlier detection: Detection of behaviour in the system that conforms to the expected behaviour but varies too much from the rest of the records.

Fraud detection: Detection of false pretences made with the intention of gaining something from the system.

In 2015, Chauhan and Shukla set out in their research paper an approach to outlier detection using the K-means clustering algorithm, which gives beginner researchers a useful grounding in the basic concepts of outlier detection.
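One common way to use k-means for outlier detection, which is not necessarily the approach taken in that paper, is to measure each record's distance to its nearest centroid and flag the records that lie further away than a chosen threshold. The sketch below assumes the centroids have already been computed; the points and the threshold value are made up for illustration.

```python
import math


def nearest_centroid_distance(point, centroids):
    """Euclidean distance from a point to the closest centroid."""
    return min(math.dist(point, c) for c in centroids)


def find_outliers(points, centroids, threshold):
    """Points whose distance to every centroid exceeds the threshold."""
    return [p for p in points if nearest_centroid_distance(p, centroids) > threshold]


# Hypothetical centroids from an earlier k-means run, plus some records.
centroids = [(1.0, 1.0), (9.0, 9.0)]
points = [(1.0, 2.0), (0.0, 1.0), (9.0, 8.0), (10.0, 9.0), (5.0, 5.0)]

outliers = find_outliers(points, centroids, threshold=3.0)
print(outliers)  # [(5.0, 5.0)]
```

The threshold has to be tuned for the data at hand. A fixed value such as the one here only makes sense when the features are on comparable scales.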

Another paper, published by Münz et al. in 2007, proposed a novel anomaly detection method based on the K-means clustering algorithm. The authors concluded that the resulting model allows fast detection of anomalies. The concept is essentially the one discussed earlier: the unlabelled records in the dataset are divided into clusters of regular traffic and anomalies using the K-means algorithm.

K-means is a basic algorithm, and it can be combined with other algorithms to improve performance. In 2018, Aung and Min published a paper in which they built a hybrid model for detecting Denial of Service (DoS) attacks, Probing (Probe) attacks, User-to-Root (U2R) attacks and Remote-to-Local (R2L) attacks. Let’s understand these in brief before going any further:

Denial of Service (DoS) attack: It is an attack meant to shut down a machine or network, making it inaccessible to its intended users.

Probing attack: The attacker scans a network or host to gather information about it, such as open ports and running services, usually in preparation for a later attack.

User-to-Root attack: The attacker first gains access to a normal user account and later gains root access by exploiting vulnerabilities in the system.

Remote-to-Local attack: Unauthorized access to the system from a remote machine, through methods such as password guessing.

In the paper, the authors found that the K-means algorithm can group attacks of a similar nature together. The Random Forest algorithm is then used to classify the records into normal and attack connections. They used the KDD Cup 99 dataset and reported impressive results.