International Journal of Computer & Organization Trends

Research Article | Open Access

Volume 15 | Issue 2 | Year 2025 | Article Id. IJCOT-V15I2P304 | DOI : https://doi.org/10.14445/22492593/IJCOT-V15I2P304

Eliminating Video Music Separator Using K-Means Algorithm and MFCC


Eman Hato

Received: 02 Jun 2025 | Revised: 04 Jul 2025 | Accepted: 25 Jul 2025 | Published: 13 Aug 2025

Citation:

Eman Hato, "Eliminating Video Music Separator Using K-Means Algorithm and MFCC," International Journal of Computer & Organization Trends (IJCOT), vol. 15, no. 2, pp. 35-39, 2025. Crossref, https://doi.org/10.14445/22492593/IJCOT-V15I2P304

Abstract

The news theme serves as the first separator in a news video's well-defined framework. Because it consists of a sequence of rapidly changing interlaced pictures set to a distinctive musical accompaniment, this separator significantly raises false detection rates during temporal video segmentation. Eliminating the separator frames before segmentation begins therefore reduces extraneous frames and lowers false detections. To eliminate these unnecessary frames effectively and efficiently, this paper proposes an automatic technique for detecting and removing the music portion of a news video using Mel-Frequency Cepstral Coefficients (MFCC) and the K-means clustering algorithm. The proposed approach consists of two stages. In the first stage, MFCC features are computed from the audio signal extracted from the input video: the audio stream is divided into overlapping windows, each window is processed separately, and the result is a matrix of MFCC coefficients. In the second stage, the K-means algorithm is initialized with cluster centers drawn from a predefined matrix, so that each center corresponds closely to one of the two target clusters, music and speech; the algorithm then classifies the MFCC feature vectors into music and speech clusters. To locate the music intervals, runs of consecutive frames assigned to the music cluster are identified and removed from the input video. The results demonstrate the effectiveness of the proposed method, which achieves a clustering accuracy of 99%. Its efficiency is further evidenced by a reduction in segmentation errors and the elimination of irrelevant information.
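The two-stage idea described in the abstract can be sketched in pure numpy. This is a minimal illustration, not the paper's implementation: the MFCC computation (FFT, mel filterbank, DCT) is stood in for by a synthetic coefficient matrix, and all function names, frame sizes, and the `min_len` threshold are illustrative assumptions. In practice the MFCC matrix would come from an audio library, and the framing step below mirrors only the overlapping Hamming-window stage.

```python
import numpy as np

def frame_signal(x, frame_len=400, hop=160):
    """Stage one's windowing step: split a 1-D signal into
    overlapping Hamming-windowed frames (sizes are illustrative)."""
    n = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n)[:, None]
    return x[idx] * np.hamming(frame_len)

def kmeans(X, k=2, iters=100):
    """Minimal K-means over MFCC frame vectors; returns per-frame labels.
    Deterministic farthest-point seeding (handles the k = 2 case only)."""
    centers = np.stack([X[0], X[np.argmax(((X - X[0]) ** 2).sum(1))]])
    for _ in range(iters):
        # Assign each frame to the nearest center.
        labels = np.argmin(((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
        new = np.stack([X[labels == j].mean(0) for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return labels

def music_intervals(labels, music_label, min_len=5):
    """Collapse runs of consecutive music-labelled frames into
    (start, end) index pairs, so those frame ranges can be removed."""
    runs, start = [], None
    for i, lab in enumerate(labels):
        if lab == music_label and start is None:
            start = i
        elif lab != music_label and start is not None:
            if i - start >= min_len:
                runs.append((start, i))
            start = None
    if start is not None and len(labels) - start >= min_len:
        runs.append((start, len(labels)))
    return runs

# Synthetic "MFCC matrix": 30 music-like frames, 40 speech-like, 30 music-like.
rng = np.random.default_rng(1)
mfcc = np.vstack([rng.normal(5.0, 0.5, (30, 13)),
                  rng.normal(-5.0, 0.5, (40, 13)),
                  rng.normal(5.0, 0.5, (30, 13))])

labels = kmeans(mfcc, k=2)
music_label = labels[0]  # frame 0 is music in this toy sequence
print(music_intervals(labels, music_label))  # → [(0, 30), (70, 100)]
```

The `min_len` threshold guards against isolated misclassified frames being treated as a separator; the recovered intervals map directly to the video frame ranges that would be cut.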

Keywords

Hamming windows, K-Means, Feature extraction, MFCC, Audio clustering.
