CLASSIFICATION BOOSTING IN IMBALANCED DATA
Abstract
Most existing classification approaches assume that the underlying training data set is evenly distributed across classes. In imbalanced classification, however, the number of training examples in the majority class can far exceed the number in the minority class. This is a problem because it usually produces biased classifiers that have higher predictive accuracy on the majority class but poorer predictive accuracy on the minority class. One popular method recently used to rectify this is SMOTE (Synthetic Minority Over-Sampling Technique), which addresses the imbalance at the data level. This paper presents an approach for learning from imbalanced data sets based on a combination of the SMOTE algorithm and the boosting procedure, focusing on a two-class problem. The Bidikmisi data set is imbalanced: the majority class has 15 times as many examples as the minority class. All models were evaluated using stratified 5-fold cross-validation, and the performance criteria (Recall, F-Value and G-Mean) were examined. The results show that the SMOTE-Boosting algorithms have better classification performance than the AdaBoost.M2 method, as the G-Mean value increases 4-fold after the SMOTE method is applied. We conclude that the SMOTE-Boosting algorithm is quite successful at combining the advantages of boosting algorithms with SMOTE. Whereas boosting affects the accuracy of the random forest by focusing on all data classes, the SMOTE algorithm alters the performance values of the random forest only on the minority class.
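To make the evaluation criteria and the data-level idea concrete, the following is a minimal Python sketch, not the authors' implementation. It computes Recall, F-Value and G-Mean from a two-class confusion matrix (minority class as positive) and generates synthetic minority points by interpolating between a minority example and one of its nearest minority neighbours, which is the core step of SMOTE. The function names `imbalance_metrics` and `smote_sketch`, the random choice of base points, and all counts are assumptions made for illustration only.

```python
import math
import random


def imbalance_metrics(tp, fn, fp, tn):
    """Accuracy, recall, F-value and G-mean; the minority class is positive."""
    recall = tp / (tp + fn) if (tp + fn) else 0.0        # minority-class accuracy
    specificity = tn / (tn + fp) if (tn + fp) else 0.0   # majority-class accuracy
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + recall
    f_value = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
        "recall": recall,
        "f_value": f_value,
        "g_mean": math.sqrt(recall * specificity),
    }


def smote_sketch(minority, n_synthetic, k=5, seed=0):
    """Simplified SMOTE: interpolate between minority points and their neighbours.

    Assumption: base points are drawn at random, whereas the original
    algorithm loops over every minority example in turn.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority examples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        x = rng.choice(minority)
        # k nearest minority neighbours of x by squared Euclidean distance
        others = [p for p in minority if p is not x]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment from x to neighbour
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, neighbour)))
    return synthetic


# Hypothetical 15:1 data set: 1500 majority and 100 minority examples.
# A classifier that always predicts the majority class:
always_majority = imbalance_metrics(tp=0, fn=100, fp=0, tn=1500)
print(always_majority)

# Four synthetic points from three hypothetical minority examples.
print(smote_sketch([(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)], n_synthetic=4, k=2))
```

In this hypothetical case the always-majority classifier reaches an accuracy of 0.9375 while its Recall and G-Mean are both 0, which is why criteria such as G-Mean, rather than overall accuracy, are examined for imbalanced data.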
Copyright: Rights of the Author(s)
- Effective 2007, it will become the policy of the Malaysian Journal of Science (published by the Faculty of Science, University of Malaya) to obtain copyrights of all manuscripts published. This is to facilitate:
(a) Protection against copyright infringement of the manuscript through copyright breaches or piracy.
(b) Timely handling of reproduction requests from authorized third parties that are addressed directly to the Faculty of Science, University of Malaya.
- As the author, you may publish the fore-mentioned manuscript, whole or any part thereof, provided acknowledgement regarding copyright notice and reference to first publication in the Malaysian Journal of Science and Faculty of Science, University of Malaya (as the publishers) are given.
- You may produce copies of your manuscript, whole or any part thereof, for teaching purposes or to be provided, on individual basis, to fellow researchers.
- You may include the fore-mentioned manuscript, whole or any part thereof, electronically on a secure network at your affiliated institution, provided acknowledgement regarding copyright notice and reference to first publication in the Malaysian Journal of Science and Faculty of Science, University of Malaya (as the publishers) are given.
- You may include the fore-mentioned manuscript, whole or any part thereof, on the World Wide Web, provided acknowledgement regarding copyright notice and reference to first publication in the Malaysian Journal of Science and Faculty of Science, University of Malaya (as the publishers) are given.
- In the event that your manuscript, whole or any part thereof, has been requested to be reproduced, for any purpose or in any form approved by the Malaysian Journal of Science and Faculty of Science, University of Malaya (as the publishers), you will be informed. It is requested that any changes to your contact details (especially e-mail addresses) are made known.
Copyright: Role and responsibility of the Author(s)
- If the manuscript to be published in the Malaysian Journal of Science contains materials previously copyrighted to others, it is the responsibility of the current author(s) to obtain written permission from the copyright owner or owners.
- This written permission should be submitted with the proof copy of the manuscript to be published in the Malaysian Journal of Science.