Document Type : Original Research Paper

Authors

Department of Computer Science, Faculty of Mathematics and Computer Science, Amirkabir University of Technology (Tehran Polytechnic), Tehran, Iran.

Abstract

BACKGROUND AND OBJECTIVES: Clustering is a basic technique in data mining and machine learning that divides a data set into homogeneous subsets. Many clustering methods exist, each with its own strengths and weaknesses. Two of the main challenges are finding the optimal number of clusters and assigning the data optimally to those clusters. The genetic algorithm, an optimization method based on natural evolution, is well suited to solving complex problems and searching large solution spaces, and can therefore serve as an effective clustering tool. This article investigates the efficiency and accuracy of the genetic algorithm in data classification and compares it with traditional clustering methods used for classification. The algorithm is evaluated on several insurance data sets, and the results are analyzed with criteria such as accuracy. The parameters of the genetic algorithm are also examined, and their effect on final performance is studied, in order to determine the best settings for data classification.
METHODS: To form the chromosomes, the number of clusters was fixed first. Because each cluster center has as many components as there are features in the data set, the length of each chromosome equals the number of clusters multiplied by the number of features. New and diverse methods were used for the crossover, mutation, and survival steps. The evaluation criterion was chosen to be similar to the one used by the K-means algorithm, so that the search optimizes clustering performance. This approach improved the accuracy and efficiency of the classification process.
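The chromosome layout described above can be illustrated with a minimal sketch. It assumes that a chromosome is the flattened array of cluster centers and that the K-means-like criterion is the sum of squared distances from each point to its nearest center. The abstract does not give the exact criterion or the operators, so the function names (`random_chromosome`, `decode`, `sse_fitness`) and the initialization from randomly chosen data points are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def random_chromosome(X, n_clusters, rng):
    """Build one chromosome by picking n_clusters distinct data points as centers.

    Assumption: initial centers are drawn from the data; the paper's
    initialization is not described in the abstract.
    """
    idx = rng.choice(len(X), size=n_clusters, replace=False)
    # Flatten (n_clusters, n_features) into length n_clusters * n_features.
    return X[idx].ravel()


def decode(chromosome, n_clusters, n_features):
    """Reshape a flat chromosome back into an array of cluster centers."""
    return np.asarray(chromosome, dtype=float).reshape(n_clusters, n_features)


def sse_fitness(chromosome, X, n_clusters):
    """K-means-style cost (assumed): sum of squared distances to the nearest center.

    Lower values indicate a better chromosome.
    """
    centers = decode(chromosome, n_clusters, X.shape[1])
    # Squared distance between every point and every center: shape (n_points, n_clusters).
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return float(d2.min(axis=1).sum())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
    chrom = random_chromosome(X, n_clusters=2, rng=rng)
    print(len(chrom), sse_fitness(chrom, X, n_clusters=2))
```

A genetic algorithm built on this encoding would apply crossover and mutation directly to the flat arrays and keep the chromosomes with the lowest cost.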
FINDINGS: The method was applied to three insurance data sets for fraud detection. Compared with K-means and other methods, it achieved a 12% improvement in F1 and a 10% improvement in accuracy on the first data set, a 1% improvement in F1 and a 1% improvement in accuracy on the second data set, and a 1% improvement in F1 and a 2% improvement in accuracy on the third data set. Because the data have two classes, the problem is solved for two clusters; the best label for each cluster is then selected according to the true labels of the data, and the outcome is reported as a classification result. Significant improvements were also obtained in ARI and other clustering evaluation criteria, along with clear progress over the standard genetic algorithm.
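The step of turning two clusters into class predictions can be sketched as follows. The abstract says only that the best label for each cluster is chosen from the true labels, so the majority vote used here is an assumption, as are the helper names (`map_clusters_to_labels`, `accuracy`, `f1_binary`) and the convention that labels are the integers 0 and 1 with 1 as the positive (fraud) class.

```python
import numpy as np


def map_clusters_to_labels(cluster_ids, y_true):
    """Assign each cluster the most frequent true label among its members.

    Assumption: majority vote; the abstract does not state the exact rule.
    """
    cluster_ids = np.asarray(cluster_ids)
    y_true = np.asarray(y_true)
    mapping = {}
    for c in np.unique(cluster_ids):
        labels, counts = np.unique(y_true[cluster_ids == c], return_counts=True)
        mapping[int(c)] = int(labels[np.argmax(counts)])
    return mapping


def accuracy(y_true, y_pred):
    """Fraction of points whose predicted label matches the true label."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))


def f1_binary(y_true, y_pred, positive=1):
    """F1 score for the positive class: 2*TP / (2*TP + FP + FN)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_pred == positive) & (y_true == positive)))
    fp = int(np.sum((y_pred == positive) & (y_true != positive)))
    fn = int(np.sum((y_pred != positive) & (y_true == positive)))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


if __name__ == "__main__":
    cluster_ids = [0, 0, 0, 1, 1, 1]  # hypothetical cluster assignments
    y_true = [0, 0, 1, 1, 1, 0]       # hypothetical true labels
    mapping = map_clusters_to_labels(cluster_ids, y_true)
    y_pred = [mapping[c] for c in cluster_ids]
    print(mapping, accuracy(y_true, y_pred), f1_binary(y_true, y_pred))
```

Because the mapping uses the true labels, the resulting accuracy and F1 describe how well the clusters align with the classes rather than performance on unseen data.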
CONCLUSION: The genetic algorithm can handle complex problems that have no exact solution and can perform better in data clustering than traditional methods such as K-means. By combining probabilistic choices with randomness, the approach examines more candidate points as cluster centers and thereby improves clustering performance. The results show that in some cases the method outperforms well-known methods and provides a suitable structure for data clustering.

Keywords

Main Subjects
