Detecting car insurance fraud using improved clustering with genetic algorithm

Yousefimehr, Behnam; Ghatee, Mehdi; Moradi, Sina; Tafakor, Yasamin; Tavakoli, Sajed

doi:10.22056/ijir.2025.02.02

Article type: Research article

Authors

Department of Computer Science, Faculty of Mathematics and Computer Science, Amirkabir University of Technology (Tehran Polytechnic), Tehran, Iran.

https://doi.org/10.22056/ijir.2025.02.02

Abstract

Background and objectives: Clustering is a fundamental method in data mining and machine learning, used to partition a data set into homogeneous subsets. Many clustering methods exist, each with its own strengths and weaknesses. One of the main challenges in clustering is finding the optimal number of clusters and the optimal assignment of data points to those clusters. The genetic algorithm, an optimization method based on natural evolution, is well suited to solving complex problems and searching large solution spaces, and can serve as an effective clustering tool. This article examines the efficiency and accuracy of the genetic algorithm in classifying data and compares it with traditional clustering methods used for classification. To evaluate the algorithm, several insurance data sets are used and the results are analyzed with various criteria such as accuracy. The parameters of the genetic algorithm are also examined, and their effect on final performance is studied, in order to determine the best settings for data classification.
Methodology: In this study, to form the chromosomes, the number of clusters was first fixed. Because each cluster center has as many components as the data set has features, the length of each chromosome was set to the product of the number of clusters and the number of features. New and varied methods were used for the Crossover, Mutation, and Survival steps. An evaluation criterion similar to that of the K-means algorithm was chosen so that clustering performance is optimized. This approach led to improved accuracy and efficiency in the classification process.
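The chromosome layout described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `decode` and `sse` are hypothetical, and the fitness is assumed to be the within-cluster sum of squared distances, which is the usual K-means objective (the abstract only says the criterion is "similar" to K-means).

```python
from typing import List, Sequence

Point = Sequence[float]


def decode(chromosome: Sequence[float], n_clusters: int, n_features: int) -> List[List[float]]:
    """Split a flat chromosome into n_clusters centers of n_features genes each."""
    if len(chromosome) != n_clusters * n_features:
        raise ValueError("chromosome length must equal n_clusters * n_features")
    return [list(chromosome[i * n_features:(i + 1) * n_features])
            for i in range(n_clusters)]


def sq_dist(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def sse(chromosome: Sequence[float], data: Sequence[Point], n_clusters: int) -> float:
    """Assumed fitness: sum of squared distances from each point to its
    nearest decoded center (lower is better)."""
    centers = decode(chromosome, n_clusters, len(data[0]))
    return sum(min(sq_dist(p, c) for c in centers) for p in data)


# Toy data with 2 features; a chromosome for 2 clusters therefore has 2 * 2 = 4 genes.
data = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
chromosome = [0.0, 0.5, 5.0, 5.5]  # centers (0, 0.5) and (5, 5.5)
centers = decode(chromosome, 2, 2)
fitness = sse(chromosome, data, 2)
print(centers, fitness)
```

Keeping the chromosome flat means generic crossover and mutation operators can act gene by gene without knowing where one center ends and the next begins.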
Findings: Applying the method described in this article to fraud detection on three insurance data sets yielded a 12% improvement in F1 and a 10% increase in accuracy on the first data set, a 1% improvement in both F1 and accuracy on the second data set, and a 1% improvement in F1 and a 2% improvement in accuracy on the third data set, compared with K-means and other methods. Because the data in these sets have two classes, the problem was solved with two clusters, the best label for each cluster was selected according to the true labels of the data, and the results are reported in the form of classification results. Notable improvements were also obtained in ARI and other clustering evaluation metrics, as well as over the standard genetic algorithm.
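One way to turn a two-cluster result into classification metrics is sketched below. It assumes that "best label" means the majority true label within each cluster, which the abstract does not spell out; the helper names and the toy labels are hypothetical and are not data from the paper.

```python
from collections import Counter


def map_clusters_to_labels(cluster_ids, true_labels):
    """Give each cluster the most frequent true label among its members
    (majority vote is an assumption about how the best label is chosen)."""
    votes = {}
    for cluster, label in zip(cluster_ids, true_labels):
        votes.setdefault(cluster, Counter())[label] += 1
    return {cluster: counts.most_common(1)[0][0] for cluster, counts in votes.items()}


def accuracy_and_f1(predicted, true_labels, positive=1):
    """Accuracy over all samples and F1 for the positive (fraud) class."""
    pairs = list(zip(predicted, true_labels))
    tp = sum(1 for p, t in pairs if p == positive and t == positive)
    fp = sum(1 for p, t in pairs if p == positive and t != positive)
    fn = sum(1 for p, t in pairs if p != positive and t == positive)
    accuracy = sum(1 for p, t in pairs if p == t) / len(pairs)
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return accuracy, f1


# Illustrative labels only: 1 marks a fraudulent claim, 0 a legitimate one.
cluster_ids = [0, 0, 0, 1, 1, 1, 1, 0]
true_labels = [0, 0, 1, 1, 1, 1, 0, 0]
mapping = map_clusters_to_labels(cluster_ids, true_labels)
predicted = [mapping[c] for c in cluster_ids]
accuracy, f1 = accuracy_and_f1(predicted, true_labels)
print(mapping, accuracy, f1)
```

Because the mapping uses the true labels, the resulting scores measure how well the clusters align with the classes rather than out-of-sample predictive performance.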
Conclusion: The genetic algorithm can solve complex problems that have no definite solution and can cluster data better than traditional methods such as K-means. By combining probability and randomness, this approach makes it possible to examine more points as candidate cluster centers and to improve clustering performance. The results show that the method outperforms well-known methods in some cases and provides a suitable structure for clustering data.

Subjects

Novel insurance technologies

Article title [English]

Detecting car insurance fraud using improved clustering with genetic algorithm

Authors [English]

Behnam Yousefimehr
Mehdi Ghatee
Sina Moradi
Yasamin Tafakor
Sajed Tavakoli

Department of Computer Science, Faculty of Mathematics and Computer Science, Amirkabir University of Technology (Tehran Polytechnic), Tehran, Iran.

Abstract [English]

BACKGROUND AND OBJECTIVES: Clustering is one of the basic techniques in data mining and machine learning and is used to divide a data set into homogeneous subsets. There are different clustering methods, each with its own strengths and weaknesses. One of the main challenges in clustering is finding the optimal number of clusters and the optimal allocation of data to these clusters. The genetic algorithm, an optimization method based on natural evolution, is highly capable of solving complex problems and searching large solution spaces, and can be used as an effective tool in clustering. The purpose of this article is to investigate the efficiency and accuracy of the genetic algorithm in data classification and to compare it with traditional clustering methods used for classification. To evaluate the performance of the algorithm, several insurance data sets are used and the results are analyzed with different criteria such as accuracy. Different parameters of the genetic algorithm are also examined, and their effects on the final performance of the algorithm are studied, in order to determine the best settings for data classification.
METHODS: In this research, the number of clusters was determined first in order to form the chromosomes. Because each cluster center has as many components as there are features in the data set, the length of each chromosome was set to the number of clusters multiplied by the number of features. New and diverse methods were used for the Crossover, Mutation, and Survival processes. An evaluation criterion similar to that of the K-means algorithm was chosen to optimize clustering performance. This approach improved the accuracy and efficiency of the classification process.
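A bare-bones evolutionary loop over such chromosomes might look like the sketch below. The abstract does not specify the crossover, mutation, or survival operators the authors used, so uniform crossover, Gaussian mutation, and elitist survival are stand-ins chosen purely for illustration; the function names, parameter values, and toy data are likewise assumptions.

```python
import random


def sse(chrom, data, k):
    """K-means-style objective: squared distance of each point to its nearest center."""
    d = len(data[0])
    centers = [chrom[i * d:(i + 1) * d] for i in range(k)]
    return sum(min(sum((x - c) ** 2 for x, c in zip(p, ctr)) for ctr in centers)
               for p in data)


def evolve(data, k, pop_size=20, generations=30, mutation_rate=0.1, sigma=0.5, seed=0):
    rng = random.Random(seed)
    # Each initial chromosome concatenates k randomly chosen data points.
    population = [[v for p in rng.sample(data, k) for v in p] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        children = []
        for _ in range(pop_size):
            a, b = rng.sample(population, 2)
            # Uniform crossover (stand-in operator): pick each gene from either parent.
            child = [x if rng.random() < 0.5 else y for x, y in zip(a, b)]
            # Gaussian mutation (stand-in operator): perturb a gene with small probability.
            child = [g + rng.gauss(0.0, sigma) if rng.random() < mutation_rate else g
                     for g in child]
            children.append(child)
        # Elitist survival (stand-in operator): keep the best pop_size of parents + children.
        population = sorted(population + children, key=lambda c: sse(c, data, k))[:pop_size]
        history.append(sse(population[0], data, k))
    return population[0], history


# Two small, well-separated groups of 2-feature points (illustrative only).
data = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (5.0, 5.0), (5.2, 4.9), (4.9, 5.3)]
best, history = evolve(data, k=2)
print(best, history[-1])
```

With elitist survival the best objective value cannot get worse from one generation to the next, which makes the loop easy to monitor; other survival schemes trade that guarantee for more diversity.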
FINDINGS: Applying the method described in this article to three insurance data sets for fraud detection yielded a 12% improvement in F1 and a 10% increase in accuracy on the first data set, a 1% improvement in both F1 and accuracy on the second data set, and a 1% improvement in F1 and a 2% improvement in accuracy on the third data set, compared with the K-means method and other methods. Because the data in these data sets have two classes, the problem is solved for two clusters, the best label for each cluster is selected according to the true labels of the data, and the result is presented in the form of classification results. Significant improvements were also achieved in ARI and other clustering evaluation criteria, along with notable progress over the standard genetic algorithm.
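For readers unfamiliar with ARI, the sketch below computes it by pair counting, assuming ARI here refers to the Adjusted Rand Index in its usual form. The function name and the tiny label vectors are illustrative and are not taken from the paper.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index between two labelings of the same samples."""
    n = len(labels_a)
    if n < 2:
        return 1.0  # no pairs to disagree on
    # Pairs placed together in both labelings, via the contingency table cells.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)  # agreement expected by chance
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both labelings put everything in one group
    return (sum_cells - expected) / (max_index - expected)


same_partition = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
crossed_partition = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
print(same_partition, crossed_partition)
```

ARI ignores how clusters are named, so a clustering that recovers the classes with swapped identifiers still scores 1, while labelings that agree less than chance would predict score below 0.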
CONCLUSION: The genetic algorithm is able to solve complex problems that have no definite solution and can perform better in data clustering than traditional methods such as K-means. By combining probability and randomness, this approach makes it possible to examine more points as candidate cluster centers and to improve clustering performance. The results show that the method outperforms well-known methods in some cases and provides a suitable structure for data clustering.

Keywords [English]

Artificial intelligence
Car insurance
Clustering
Fraud detection
Genetic algorithm



Iranian Journal of Insurance Research


Volume 14, Issue 2 - Serial Number 52
Farvardin 1404 (March-April 2025)
Pages 109-118
