A Genetic-Bayesian Short Message Service Spam Filter with Text Normalization and Semantic Indexing

Erhiri, J, Adebayo, A.O, Akinsanya, A.O, Sodiya, A.S, Eze, M.O, Ebiesuwa Seun

DOI: https://doi.org/10.14445/22492593/IJCOT-V7I6P303

Research Article | Open Access

Volume 7 | Issue 6 | Year 2017 | Article Id. IJCOT-V7I6P303 | DOI: https://doi.org/10.14445/22492593/IJCOT-V7I6P303

Citation:

Erhiri, J, Adebayo, A.O, Akinsanya, A.O, Sodiya, A.S, Eze, M.O, Ebiesuwa Seun, "A Genetic-Bayesian Short Message Service Spam Filter with Text Normalization and Semantic Indexing," International Journal of Computer & Organization Trends (IJCOT), vol. 7, no. 6, pp. 14-18, 2017. Crossref, https://doi.org/10.14445/22492593/IJCOT-V7I6P303

Abstract

Since the Short Message Service (SMS) was introduced in 1993, its popularity has grown steadily, and SMS communication now constitutes a major segment of telecommunication. This popularity and extensive usage have drawn many researchers to the potential of harvesting data and metadata from SMS corpora for linguistic, diachronic, normalization, and sociolinguistic studies, and for validating and comparing different classifiers in SMS spam filters. However, freely available datasets containing this type of information are difficult to obtain for research purposes, mostly because SMS messages are confidential and users want to reveal as little of the contents of their phones as possible. This paper examines the techniques adopted in creating SMS corpora and the ethical considerations involved in protecting users’ interests and privacy. For successful SMS corpus creation, a main requirement is to protect the rights and interests of the message donors and of any other person mentioned in the text messages, without altering the original text, so that sufficient metadata can still be gathered. Existing work in the field was reviewed to ascertain the ethical practices adopted. Participant consent, data anonymization, and safe storage of participants’ information are the basic ethical considerations adopted to ensure successful SMS corpus creation.

Keywords

Corpora, Corpus, Metadata, Linguistic, Diachronic, Normalization, Sociolinguistic
