
A Machine Learning Approach to Automatic Music Genre Classification

Carlos N. Silla Jr.¹, Alessandro L. Koerich² & Celso A. A. Kaestner³

¹ University of Kent – Computing Laboratory
Canterbury, CT2 7NF, Kent, United Kingdom
[email protected]

² Pontifical Catholic University of Paraná
R. Imaculada Conceição 1155, 80215-901
Curitiba - Paraná - Brazil
[email protected]

³ Federal University of Technology of Paraná
Av. Sete de Setembro 3165, 80230-901
Curitiba - Paraná - Brazil
[email protected]

Abstract

This paper presents a non-conventional approach to the automatic music genre classification problem. The proposed approach uses multiple feature vectors and a pattern recognition ensemble approach, according to space and time decomposition schemes. Although music genre classification is a multi-class problem, we accomplish the task using a set of binary classifiers, whose results are merged in order to produce the final music genre label (space decomposition). Music signals are also decomposed according to time segments obtained from the beginning, middle and end parts of the original music signal (time decomposition). The final classification is obtained from the set of individual results, according to a combination procedure. Classical machine learning algorithms such as Naïve-Bayes, Decision Trees, k-Nearest Neighbors, Support Vector Machines and Multi-Layer Perceptron neural nets are employed. Experiments were carried out on a novel dataset called the Latin Music Database, which contains 3,160 music pieces categorized in 10 musical genres. Experimental results show that the proposed ensemble approach produces better results than the ones obtained from global and individual segment classifiers in most cases. Some experiments related to feature selection were also conducted, using the genetic algorithm paradigm. They show that the most important features for the classification task vary according to their origin in the music signal.

Keywords: Machine Learning, Pattern Classification, Music Classification, Feature Selection

1. INTRODUCTION

Music is nowadays a significant part of the Internet content: the net is probably the most important source of music pieces, with several sites dedicated to the spreading, distribution and commercialization of music. In this context, automatic procedures capable of dealing with large amounts of music in digital formats are imperative, and Music Information Retrieval (MIR) has become an important research area.

One of the tasks addressed by MIR is the Automatic Music Genre Classification (AMGC) problem. In essence, music genres are categorical labels created by human experts in order to identify the style of the music. The music genre is a descriptor that is largely used to organize collections of digital music. It is not only crucial metadata


in large music databases and electronic music distribution (EMD) [28], but also the most frequent item used in search queries, as pointed out by several authors [29], [34], [7], [35], [18].

Up to now, the standard procedure for organizing music content has been the manual use of meta-information tags, such as the ID3 tags associated with music coded in the MPEG-1 Audio Layer 3 (MP3) compression format [13]. This metadata includes song title, album, year, track number and music genre, as well as other specific details of the file content. This information, however, is usually incomplete and inaccurate, since it depends on human subjectivity. As there is no industry standard, further confusion arises: for example, the same music piece is assigned to different music genres by different EMD sites, due to different interpretations and/or different codification schemes [17].

We emphasize that AMGC also poses an interesting research problem from a pattern recognition perspective. Music can be considered a high-dimensional, time-variant digital signal, and music databases can be very large [1]; thus, the problem offers a good opportunity for testing non-conventional pattern recognition approaches.


As each digital music piece can be associated with a representative vector, obtained by applying an extraction procedure that computes appropriate feature values, it is straightforward to cast this problem as a classical classification task in the pattern recognition framework [26]. Typically a music database (as in an EMD) contains thousands of pieces from dozens of manually defined music genres [28], [21], [31], characterizing a complex multi-class problem.

In this paper we present a non-conventional machine learning approach to the AMGC problem. We use the ensemble approach [15], [6] based on both space and time decomposition schemes. Firstly, the AMGC problem is naturally multi-class; as pointed out in the literature [6], [10], the more classes are involved, the higher the similarity among classes and the harder it is to distinguish them. This situation can be explained by considering that it is difficult for most classifiers to construct an adequate separation surface among many classes [22]. One solution is to decompose the original space, recasting the original multi-class problem as a series of binary classification problems, on which most known classifiers work better, and to merge the obtained results in order to produce the final result. Secondly, as music is a time-varying signal, several segments can be employed to extract feature vectors, producing a set of feature vectors that characterizes a decomposition of the original signal along the time dimension. We employ an ensemble approach that encompasses both decompositions, in such a way that classifiers are applied to space-time partial views of the music, and the obtained classification results are merged to produce the final class label.

We also conduct some experiments related to feature selection, employing a genetic algorithm framework, in order to show the relation between the relative importance of each feature and its origin in the music signal.

This paper is organized as follows: section 2 presents a formal view of the problem and summarizes the results of several research works on automatic music genre classification; section 3 presents our proposal, based on space-time decomposition strategies, also describing the employed features, classification algorithms and result combination procedures; section 4 presents the employed dataset and the experimental results obtained with the space-time decomposition; section 5 describes the employed feature selection procedure and the obtained experimental results; finally, section 6 presents the conclusions of our work.

2. PROBLEM DEFINITION AND RELATED WORKS

Nowadays the music signal representation is no longer analogous to the original sound wave. The analog signal is sampled several times per second and transformed by an analog-to-digital converter into a sequence of numeric values on a convenient scale. This sequence represents the digital audio signal of the music and can be employed to reproduce it [13].

Hence the digital audio signal can be represented by a sequence S = <s_1, s_2, ..., s_N>, where s_i stands for the signal sampled at instant i and N is the total number of samples of the music. This sequence contains a lot of acoustic information, and features related to timbral texture, rhythm and pitch content can be extracted from it. Initially the acoustic features are extracted from short frames of the audio signal; then they are aggregated into more abstract segment-level features [1]. Thus a feature vector X = <x_1, x_2, ..., x_D> can be generated, where each feature x_j is extracted from S (or some part of it) by an appropriate extraction procedure.

Now we can formally define the AMGC problem as a pattern classification problem, using segment-level features as input: from a finite set of music genres G we must select one class g which best represents the genre of the music associated with the signal S.

From a statistical perspective the goal is to find the most likely g ∈ G, given the feature vector X, that is,

    g = arg max_{g ∈ G} P(g|X)

where P(g|X) is the a posteriori probability of the music belonging to genre g given the features expressed by X.


Using Bayes' rule the equation can be rewritten as

    g = arg max_{g ∈ G} [ P(X|g) · P(g) / P(X) ]

where P(X|g) is the probability with which the feature vector X occurs in class g, P(g) is the a priori probability of music genre g (which can be estimated from the frequencies in the database) and P(X) is the probability of occurrence of the feature vector X. The last probability is in general unknown, but if the classifier computes the likelihoods for the entire set of genres, then

    Σ_{g ∈ G} P(g|X) = 1

and we can obtain the desired probability for each g ∈ G by

    P(g|X) = P(X|g) · P(g) / Σ_{g' ∈ G} P(X|g') · P(g')
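As an illustration of this normalization step (an editorial sketch, not part of the original paper; the likelihood and prior values are invented), the following Python fragment turns a vector of class-conditional likelihoods P(X|g) into posteriors P(g|X) and selects the most likely genre:

    import numpy as np

    def posteriors(likelihoods, priors):
        """Turn class-conditional likelihoods P(X|g) into posteriors P(g|X):
        multiply by the priors and divide by P(X) = sum over g of P(X|g).P(g)."""
        joint = likelihoods * priors
        return joint / joint.sum()

    # Toy example with 3 genres; the numbers below are invented.
    p_x_given_g = np.array([0.20, 0.05, 0.10])  # P(X|g) reported by the classifier
    p_g = np.array([0.5, 0.3, 0.2])             # P(g) estimated from the database
    p_g_given_x = posteriors(p_x_given_g, p_g)
    print(p_g_given_x)                          # posteriors, summing to 1
    print(p_g_given_x.argmax())                 # index of the most likely genre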

The AMGC problem was initially defined in the work of Tzanetakis and Cook [34]. In this work a comprehensive set of features was proposed to represent a music piece. These features are obtained from a signal processing perspective, and include timbral texture features, beat-related features and pitch-related features. As classification procedures they employ Gaussian classifiers, Gaussian mixture models and the k-Nearest Neighbors (k-NN) classifier. The experiments were carried out on a database called GTZAN that includes 1,000 samples from ten music genres, with features extracted from the first 30 seconds of each music piece. The obtained results indicate an accuracy of about 60% using a ten-fold cross-validation procedure. The employed feature set has become public, as part of the MARSYAS framework (Music Analysis, Retrieval and SYnthesis for Audio Signals, available at http://marsyas.sourceforge.net/), a free software platform for developing and evaluating computer audio applications [34].

Kosina [17] developed MUGRAT (MUsic Genre Recognition by Analysis of Texture, available at http://kyrah.net/mugrat), a prototypical system for music genre recognition based on a subset of the features given by the MARSYAS framework. In this case the features were extracted from 3-second segments randomly selected from the entire music signal. Experiments were carried out on a database composed of 186 music samples belonging to 3 music genres. Employing a 3-NN classifier, Kosina obtained an average accuracy of 88.35% using a ten-fold cross-validation procedure. In this work the author also confirms that manually made music genre classification is inconsistent: the very same music pieces obtained from different EMD sources were labeled differently in the ID3 genre tag.

Li, Ogihara and Li [19] present a comparative study between the features included in the MARSYAS framework and a set of features based on Daubechies Wavelet Coefficient Histograms (DWCH), also using other classification methods such as Support Vector Machines (SVM) and Linear Discriminant Analysis (LDA). For comparison purposes they employ two datasets: (a) the original dataset of Tzanetakis and Cook (GTZAN), with features extracted from the beginning of the music signal, and (b) a dataset composed of 755 music pieces from 5 music genres, with features extracted from the interval [31, 61] seconds. The conducted experiments show that the SVM classifier outperforms all other methods: in case (a) it improves accuracy to 72% using the original feature set and to 78% using the DWCH feature set; in case (b) the results were 71% for the MARSYAS feature set and 74% for the DWCH feature set. The authors also evaluate some space decomposition strategies: the original multi-class problem (5 classes) was decomposed into a series of binary classification problems, according to OAA and RR strategies. The best results were achieved with SVM and OAA space decomposition using the DWCH feature set. Accuracy was improved by 2 to 7% according to the employed feature set – DWCH and MARSYAS respectively – on dataset (a), and by 2 to 4% on dataset (b).

Grimaldi, Cunningham and Kokaram [11], [12] employ a space decomposition strategy for the AMGC problem, using specialized classifiers for each space view and an ensemble approach to obtain the final classification decision. The authors decompose the original problem according to one-against-all (OAA), round-robin (RR) – called pairwise comparison in [10] – and random subspace selection [14] methods. They also employ different feature selection procedures, such as ranking according to information gain (IG) and gain ratio (GR), and Principal Component Analysis (PCA). Experiments were conducted on a database of 200 music pieces from 5 music genres, using the k-NN classifier and a 5-fold cross-validation procedure. The feature set was obtained from the entire music piece, using the discrete wavelet packet transform (DWPT). For the k-NN classifier the PCA analysis proved to be the most effective feature selection technique, achieving an accuracy of 79%. The RR ensemble approach scores 81% for both IG and GR, proving to be an effective ensemble technique. When applying a forward sequential feature selection based on the GR ranking, the ensemble scores 84%.

The work of Meng, Ahrendt and Larsen [25] deals with the relative importance of the features. They employ features based on three time scales: (a) short-term features, computed over 30-millisecond windows and related to timbral texture; (b) middle-term features, obtained from 740-millisecond windows and related to modulation and/or instrumentation; and (c) long-term features, computed over 9.62-second windows and related to beat pattern and rhythm. They use two classifiers: a single-layer neural net and a Gaussian classifier based on the covariance matrix.


Experiments were made over two datasets, the first one with 100 music pieces from 5 music genres and the second with 354 music pieces from 6 music genres. The experiments show that a combination of middle- and long-term features produces better classification results.

Yaslan and Cataltepe [37] employ a large set of classifiers to study the problem of feature selection in the AMGC problem. They use the linear and quadratic discriminant classifiers, the Naïve-Bayes classifier, and variations of the k-NN classifier. They employ the GTZAN database and the MARSYAS framework [34] for feature extraction. The features were analyzed by group, for each one of the classifiers. They employ the Forward Feature Selection (FFS) and Backward Feature Selection (BFS) methods in order to find the best feature set for the problem. These methods are based on guided search in the feature space, starting from the empty set and from the entire set of features, respectively. They report positive results in classification with the use of the feature selection procedure.

Up to now most researchers have attacked the AMGC problem with ensemble techniques using only space decomposition. We emphasize that these techniques employ different views of the feature space, classifiers dedicated to these subspaces to produce partial classifications, and a combination procedure to obtain the final class label. Furthermore, the features used in these works are extracted from one specific part of the music signal or from the entire music signal.

One exception is the work of Bergstra et al. [1]. They use the ensemble learner AdaBoost [9], which performs the classification iteratively by combining the weighted votes of several weak learners. Their model uses simple decision stumps, each of which operates on a single feature dimension. The procedure proved to be effective on three music genre databases, winning the music genre identification task at MIREX 2005 (Music Information Retrieval EXchange).

The first work that used time decomposition with regular classifiers applied to complete feature vectors was proposed by Costa, Valle-Jr and Koerich [3]. This work presents experiments based on an ensemble of classifiers that uses three time segments of the music audio signal, where the final decision is given by the majority vote rule. They employ an MLP neural net and the k-NN classifier. Experiments were conducted on a database of 414 music pieces from 2 genres. However, the final results regarding the quality of the method for the classification task were inconclusive. Koerich and Poitevin [16] employ the same database and an ensemble approach with a different set of combination rules. Given the set of individual classifications and their corresponding scores – a number associated with each class, also obtained from the classification procedure – they use the maximum, the sum, the weighted sum, the product and the weighted product of the scores to assign the final class. Their experiments show better results than the individual classifications when using two segments and the weighted sum or the weighted product as result combination rules.

3. THE SPACE-TIME DECOMPOSITION APPROACH

In this paper we evaluate the effect of using the ensemble approach in the AMGC problem, where individual classifiers are applied to a special decomposition of the music signal that encompasses both space and time dimensions. We use feature space decomposition following the OAA and RR approaches, and also features extracted from different time segments [30], [31], [32]. Therefore several feature vectors and component classifiers are used in each music part, and a combination procedure is employed to produce the final class label for the music.

3.1. SPACE DECOMPOSITION

Music genre classification is naturally a multi-class problem. However, we employ a combination of binary classifiers, whose outputs are afterwards merged in order to produce the final music genre label.

This procedure characterizes a decomposition of the feature space, since features are used according to different views of the problem space. The approach is justified because for two-class problems classifiers tend to be simple and effective. This point is related to the type of separation surface constructed by the classifier, which is limited in several cases [22].

Two main techniques are employed to produce the desired decomposition: (a) in the one-against-all (OAA) approach, a classifier is constructed for each class, and all the examples of the remaining classes are considered negative examples of that class; and (b) in the round-robin (RR) approach, a classifier is constructed for each pair of classes, and the examples belonging to the other classes are discarded. Figures 1 and 2 schematically illustrate the two approaches. For an M-class problem (M music genres) several classification results arise: the OAA technique produces M class labels, whereas the RR technique generates M(M − 1)/2 class labels. These labels are combined according to a decision procedure in order to produce the final class label.

We emphasize that the decomposition here is made only by manipulating the instances in the considered database: conveniently relabeling the negative examples in the OAA approach, and discarding the examples of the non-considered classes in the RR approach. Hence each of the generated binary classification problems follows exactly the formal description presented in the previous section.
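As an editorial illustration of these two schemes (the function names and the use of NumPy below are ours, not part of the original work), the following Python sketch builds the OAA and RR binary problems purely by relabeling or discarding instances, as described above:

    import numpy as np

    def oaa_datasets(X, y, classes):
        """One-against-all: one binary problem per class; examples of all
        remaining classes are relabeled as negatives (0)."""
        return {c: (X, (y == c).astype(int)) for c in classes}

    def rr_datasets(X, y, classes):
        """Round-robin: one binary problem per pair of classes; examples
        belonging to the other classes are discarded."""
        problems = {}
        for i, a in enumerate(classes):
            for b in classes[i + 1:]:
                mask = (y == a) | (y == b)
                problems[(a, b)] = (X[mask], y[mask])
        return problems

    # For M classes, oaa_datasets yields M binary problems and rr_datasets
    # yields M * (M - 1) / 2, matching the counts given in the text.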

Figure 1. One-Against-All Space Decomposition Approach

Figure 2. Round-Robin Space Decomposition Approach

3.2. TIME DECOMPOSITION

An audio record of a music piece is a time-varying signal. The idea behind time decomposition is that we can obtain a more adequate representation of the music piece if we consider several time segments of the signal. This procedure aims to better handle the great variation that usually occurs along a music piece, and also permits comparing the discriminative power of features extracted from different parts of the music.

Figure 3 illustrates this point: it presents the average values of 30 features extracted from different music sub-intervals, obtained over 150 music pieces of the genre Salsa. We emphasize the irregularity of the results, showing that feature values vary depending on the interval from which they were obtained.

Figure 3. Average values over 150 music pieces of genre Salsa for 30 features extracted from different music sub-intervals

If we employ the formal description of the AMGC problem given in the previous section, the time decomposition can be formalized as follows. From the original music signal S = <s_1, s_2, ..., s_N> we obtain different sub-signals S_pq. Each sub-signal is simply a projection of S onto the interval [p, q] of samples, that is, S_pq = <s_p, ..., s_q>. In the generic case of K sub-signals, we obtain a sequence of feature vectors X_1, X_2, ..., X_K. A classifier is applied to each one of these feature vectors, generating the assigned music genres g_1, g_2, ..., g_K; these must then be combined to produce the final class assignment, as we will see in the following.

In our case we employ feature vectors extracted from 30-second segments taken from the beginning (S_beg), middle (S_mid) and end (S_end) parts of the original music signal. The corresponding feature vectors are denoted X_beg, X_mid and X_end. Figure 4 illustrates the time decomposition process.

Figure 4. Time Decomposition Approach

3.3. THE SET OF EMPLOYED FEATURES

The MARSYAS framework was employed for feature extraction, so we use the same feature set proposed by Tzanetakis and Cook [34]. These features can be grouped into three categories: timbral texture, beat related and pitch related. The timbral texture features account for the means and variances of the spectral centroid, rolloff, flux, the time-domain zero crossings, the first 5 Mel-Frequency Cepstral Coefficients (MFCC) and low energy.


The beat-related features (features 1 to 6) include the relative amplitudes and the beats per minute. The pitch-related features (features 26 to 30) include the maximum periods of the pitch peaks in the pitch histograms. The final feature vector is 30-dimensional (beat: 6; timbral texture: 19; pitch: 5). For a more detailed description of the features refer to [34] or [33].

A normalization procedure is applied in order to homogenize the input data for the classifiers: if maxV and minV are the maximum and minimum values that appear in the whole dataset for a given feature, a value V is replaced by newV using the equation

    newV = (V − minV) / (maxV − minV)
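A minimal Python sketch of this normalization follows (an editorial illustration; the guard against constant-valued features, which would make the denominator zero, is our addition):

    import numpy as np

    def min_max_normalize(X):
        """Rescale each feature (column) of X to [0, 1] using the minimum and
        maximum values observed over the whole dataset."""
        min_v, max_v = X.min(axis=0), X.max(axis=0)
        span = np.where(max_v > min_v, max_v - min_v, 1.0)  # avoid division by zero
        return (X - min_v) / span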

The final feature vector is outlined in Table 1.

Table 1. Feature vector description

Feature #   Description
1           Relative amplitude of the first histogram peak
2           Relative amplitude of the second histogram peak
3           Ratio between the amplitudes of the second peak and the first peak
4           Period of the first peak in bpm
5           Period of the second peak in bpm
6           Overall histogram sum (beat strength)
7           Spectral centroid mean
8           Spectral rolloff mean
9           Spectral flux mean
10          Zero crossing rate mean
11          Standard deviation of spectral centroid
12          Standard deviation of spectral rolloff
13          Standard deviation of spectral flux
14          Standard deviation of zero crossing rate
15          Low energy
16          1st MFCC mean
17          2nd MFCC mean
18          3rd MFCC mean
19          4th MFCC mean
20          5th MFCC mean
21          Standard deviation of 1st MFCC
22          Standard deviation of 2nd MFCC
23          Standard deviation of 3rd MFCC
24          Standard deviation of 4th MFCC
25          Standard deviation of 5th MFCC
26          Overall sum of the histogram (pitch strength)
27          Period of the maximum peak of the unfolded histogram
28          Amplitude of the maximum peak of the folded histogram
29          Period of the maximum peak of the folded histogram
30          Pitch interval between the two most prominent peaks of the folded histogram

We note that the feature vectors are always calculated over intervals; in fact several features, like means, variances and numbers of peaks, only have meaning if extracted from signal intervals. So they are calculated over S or one of its subintervals S_kl, that is, over an aggregate segment obtained from the elementary frames of the music audio signal.

3.4. CLASSIFICATION AND COMBINATION DECISION PROCEDURES

A large set of standard supervised machine learning algorithms is used to accomplish the AMGC task. We follow a homogeneous approach, that is, the very same classifier is employed as the individual component classifier for every music part. We use the following algorithms [26]: (a) a classic decision tree classifier (J48); (b) the instance-based k-NN classifier; (c) the Naïve-Bayes classifier (NB), which is based on conditional probabilities and attribute independence; (d) a Multi-Layer Perceptron neural network (MLP) with the backpropagation-with-momentum algorithm; and (e) a Support Vector Machine classifier (SVM) with pairwise classification. All experiments were performed within a framework based on the WEKA data mining tool [36], with default parameters.

As previously mentioned, a set of possibly different candidate classes is produced by the individual classifiers. These results can be considered according to the space and time decomposition dimensions. The time dimension yields K classification results, and the space dimension for an M-class problem produces M (OAA) or M(M − 1)/2 (RR) results, as already explained. These partial results must be combined to produce the final class label.

We employ a decision procedure in order to find the final class assigned by the ensemble of classifiers. Space decomposition results are combined by the majority vote rule in the case of the RR approach, and by a rule based on the a posteriori probabilities of the employed classifiers for the OAA approach. Time decomposition results are combined using the majority vote rule.
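These combination rules can be illustrated with the short Python sketch below (an editorial addition; the genre names and scores are invented for the example):

    from collections import Counter

    def majority_vote(labels):
        """Most frequent label among partial decisions (used for the RR and
        time decomposition results); ties are broken by first occurrence."""
        return Counter(labels).most_common(1)[0][0]

    def oaa_decision(score_by_class):
        """OAA combination: each binary classifier reports a score (e.g. an a
        posteriori probability) for 'its' own class; the largest score wins."""
        return max(score_by_class, key=score_by_class.get)

    # Hypothetical partial results for one music piece:
    print(majority_vote(["salsa", "merengue", "salsa"]))             # -> salsa
    print(oaa_decision({"salsa": 0.7, "tango": 0.4, "forro": 0.2}))  # -> salsa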

4. SPACE-TIME DECOMPOSITION EXPERIMENTS AND RESULTS

One common concern in AMGC research is how reliable the obtained experiments and results are, given the way musical genres are assigned to the music pieces in the databases employed so far. Craft, Wiggins and Crawford [4] raise this question and argue that genre labeling is affected by two major factors: the first is related to how the composer intended to draw upon stylistic elements from one or more music genres; the second is related to the social and cultural background of any participant involved in labeling the database. They conclude that the evaluation of a genre labeling system is highly dependent upon these two factors, and recommend that for the evaluation of genre labeling systems that use individual genres these cultural differences should be eliminated. However, according to McEnnis and Cunningham [24], cultural differences – or social context – should be preserved, because they play an important role in the human subjectivity associated with the task of assigning musical genres to music pieces. The work of McEnnis, McKay and Fujinaga [23] addresses the issue of properly constructing databases that can be useful for other research works.

Considering these concerns of the research community, and in order to accomplish the desired AMGC task, a new database was constructed: the Latin Music Database (feature vectors available at www.ppgia.pucpr.br/~silla/lmd/) [30], [32], [33]. This database contains 3,160 MP3 music pieces of 10 different Latin genres, originating from the works of 543 artists.

In this database music genre assignment was made manually by a group of human experts, based on the human perception of how each music piece is danced. The genre labeling was performed by two professional teachers with over 10 years of experience in teaching ballroom Latin and Brazilian dances. The professionals made a first selection of the music they considered relevant to a specific genre regarding the way it is danced; the project team then performed a second verification in order to avoid mistakes. The professionals classified around 300 music pieces per month, and the development of the complete database took around one year.

In order to verify the applicability of our proposal to the AMGC problem, an extensive set of tests was conducted. We consider two main goals: (a) to verify if feature vectors extracted from different parts of the audio signal have similar discriminative power in the AMGC task; and (b) to verify if the proposed ensemble approach, encompassing space and time decompositions, provides better results than classifiers applied to a single feature vector extracted from the music audio signal. Our primary evaluation measure is the classification accuracy, that is, the average rate of correctly classified music pieces.

The experiments were carried out on stratified training, validation and test datasets. In order to deal with balanced classes, 300 different song tracks from each genre were randomly selected. In the experiments we use a ten-fold cross-validation procedure, that is, the presented results are obtained from 10 randomly independent experiment repetitions.

For time decomposition we use three time segments of 30 seconds each, extracted from the beginning, middle and end parts of each music piece. We note that 30 seconds are equivalent to 1,153 frames of an MP3 file. According to the formalism explained above, for a music signal composed of N samples we have: S_beg = <s_1, ..., s_1153>, S_mid = <s_(N/3)+500, ..., s_(N/3)+1653>, and S_end = <s_(N−1153−300), ..., s_(N−300)>. An empirical displacement of 300 frames was employed in order to discard the final part of the music, which usually consists of silence or noise.
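The index arithmetic can be summarized in the following Python sketch (an editorial illustration using the paper's 1-based sample numbering; the example piece length is invented):

    def segment_bounds(n_samples, seg_len=1153, offset=500, tail_gap=300):
        """1-based sample intervals for the beginning, middle and end segments,
        following the formulas above (30 s = 1,153 MP3 frames)."""
        beg = (1, seg_len)
        mid = (n_samples // 3 + offset, n_samples // 3 + offset + seg_len)
        end = (n_samples - seg_len - tail_gap, n_samples - tail_gap)
        return beg, mid, end

    # For a hypothetical piece with 12,000 frames:
    print(segment_bounds(12000))  # ((1, 1153), (4500, 5653), (10547, 11700))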

Table 2 presents the results for each classifier in the segments S_beg, S_mid and S_end for the OAA and RR approaches. Column BL stands for baseline, and shows the results for the indicated classifier without space decomposition. As the employed SVM implementation already performs pairwise (RR) classification by default, BL results are omitted for this classifier.

The following analysis can be made of the results shown in Table 2: (a) for the J48 classifier the RR approach increases accuracy, but for the OAA approach the increment is not significant; (b) the 3-NN classifier presents the same results with and without space decomposition in every segment, although the results vary from segment to segment; (c) for the MLP classifier the decomposition strategies increase accuracy in the beginning segment for RR, and in the end segment for both OAA and RR; (d) for the NB classifier the decomposition approaches do not significantly increase classification performance; and (e) the SVM classifier presents not only the best classification results, when using RR, but also the worst ones in every segment when employing the OAA approach. A global view shows that the RR approach surpasses the OAA approach for most classifiers; the only exception is the MLP classifier for S_mid.

Table 3 presents the results for the space-time decomposition ensemble strategy in comparison with the space decomposition strategy applied to the entire music piece. In this table the TD column indicates values obtained without space decomposition but using time decomposition with the majority vote rule. The BL column stands for the application of the classifier to the entire music piece with no decomposition.

Table 3. Accuracy (%) using space-time decomposition versus entire music piece

              Space-time ensembles      Entire music
Classifier    TD      OAA     RR        BL      OAA     RR
J48           47.33   49.63   54.06     44.20   43.79   50.63
3-NN          60.46   59.96   61.12     57.96   57.96   59.93
MLP           59.43   61.03   59.79     56.46   58.76   57.86
NB            46.03   43.43   47.19     48.00   45.96   48.16
SVM           –       30.79   65.06     –       37.46   63.40

The results in Table 3 show that, for the entire music piece, the RR strategy increases the classification accuracy of every classifier relative to the baseline, whereas the OAA strategy presents superior results only for the MLP neural net classifier. When comparing the global results for the entire music piece, the RR strategy overcomes the OAA strategy in most cases.



Table 2. Accuracy (%) using OAA and RR approaches in the individual segments

              S_beg                   S_mid                   S_end
Classifier    BL      OAA     RR      BL      OAA     RR      BL      OAA     RR
J48           39.60   41.56   45.96   44.44   44.56   49.93   38.80   38.42   45.53
3-NN          45.83   45.83   45.83   56.26   56.26   56.26   48.43   48.43   48.43
MLP           53.96   52.53   55.06   56.40   53.08   54.59   48.26   51.96   51.92
NB            44.43   42.76   44.43   47.76   45.83   47.79   39.13   37.26   39.19
SVM           –       26.63   57.43   –       36.82   63.50   –       28.89   54.60

In the case of the combined space-time decomposition, both the OAA and RR strategies marginally increase classification accuracy. When comparing the entire music piece with the space-time decomposition, the results are similar to those of the previous experiments: for J48, 3-NN and MLP the decomposition results are better in all cases; for NB the results are inconclusive; and for SVM the results are superior only with the RR strategy. The best overall result is achieved using SVM with space-time decomposition and the RR approach.

5. FEATURE SELECTION AND RELATED EXPERIMENTS

The feature selection (FS) task consists of selecting a proper subset of the original feature set, in order to simplify and reduce the preprocessing and classification steps, while assuring the same or better final classification accuracy [2], [5].

Feature selection methods are often classified into two groups: the filter approach and the wrapper approach [27]. In the filter approach the feature selection process is carried out before the use of any recognition algorithm, as a preprocessing step. In the wrapper approach the pattern recognition algorithm is used as a subroutine of the selection system to evaluate the generated solutions.

In our system we employ several feature vectors, according to the space and time decompositions. The feature selection procedure is applied to the different time segment vectors, allowing us to compare the relative importance and/or discriminative power of each feature according to its time origin. Another goal is to verify how the results obtained with the ensemble-based method are affected by the features selected from the single segments.

The employed feature selection procedure is based on the genetic algorithm (GA) paradigm and uses the wrapper approach. Individuals – chromosomes in the GA paradigm – are F-dimensional binary vectors, where F is the maximum feature vector size; in our case F = 30, the number of features extracted by the MARSYAS framework.

The general GA procedure can be summarized as follows [33]: (a) each individual works as a binary mask for an associated feature vector; (b) an initial population is randomly generated: a value of 1 indicates that the corresponding feature must be used, and 0 that it must be discarded; (c) a classifier is trained using only the selected features; (d) the generated classification structure is applied to a validation set to determine the fitness value of the individual; (e) selection is applied to conserve the best individuals, and crossover and mutation operators are applied in order to obtain the next generation; and (f) steps (c) to (e) are repeated until a stopping criterion is attained.

In our feature selection procedure each generation is composed of 50 individuals, and the evolution process ends when it converges – that is, when there is no significant change between successive generations – or when a fixed maximum number of generations is reached.
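A minimal sketch of this wrapper loop is given below (an editorial illustration in Python, with scikit-learn's 3-NN standing in for the WEKA classifiers; the population size of 50 follows the text, while the elite fraction, mutation rate, seed and fixed generation budget are illustrative choices):

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    rng = np.random.default_rng(42)

    def fitness(mask, X_tr, y_tr, X_val, y_val):
        """Wrapper evaluation: train on the selected features only and use
        validation-set accuracy as the fitness of the individual."""
        if not mask.any():                        # an empty mask selects nothing
            return 0.0
        clf = KNeighborsClassifier(n_neighbors=3)
        clf.fit(X_tr[:, mask], y_tr)
        return clf.score(X_val[:, mask], y_val)

    def ga_select(X_tr, y_tr, X_val, y_val, n_feats=30, pop=50, gens=40, p_mut=0.05):
        """Genetic search over binary feature masks (wrapper approach)."""
        population = rng.random((pop, n_feats)) < 0.5        # random initial masks
        for _ in range(gens):                                # fixed generation budget
            scores = np.array([fitness(m, X_tr, y_tr, X_val, y_val)
                               for m in population])
            elite = population[np.argsort(scores)[-pop // 2:]]   # keep the best half
            # one-point crossover between randomly paired elite parents
            parents = elite[rng.integers(len(elite), size=(pop - len(elite), 2))]
            cut = rng.integers(1, n_feats, size=len(parents))
            children = np.where(np.arange(n_feats) < cut[:, None],
                                parents[:, 0], parents[:, 1])
            children ^= rng.random(children.shape) < p_mut       # bit-flip mutation
            population = np.vstack([elite, children])
        scores = np.array([fitness(m, X_tr, y_tr, X_val, y_val) for m in population])
        return population[scores.argmax()]                       # best mask found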

Tables 4, 5 and 6 present the results obtained with the feature selection procedure applied to the beginning, middle and end music segments, respectively [33]. In these tables the classifier is indicated in the first column; the second column presents a baseline (BL) result, obtained by applying the corresponding classifier directly to the complete feature vector given by the MARSYAS framework; columns 3 and 4 show the results for the OAA and RR space decomposition approaches without feature selection; columns FS, FSOAA and FSRR show the corresponding results with the feature selection procedure.

Table 4. Classification accuracy (%) using space decomposition for the beginning segment of the music (S_beg)

Classifier    BL      OAA     RR      FS      FSOAA   FSRR
J48           39.60   41.56   45.96   44.70   43.52   48.53
3-NN          45.83   45.83   45.83   51.19   51.73   53.36
MLP           53.96   52.53   55.06   52.73   53.99   54.13
NB            44.43   42.76   44.43   45.43   43.46   45.39
SVM           –       23.63   57.43   –       26.16   57.13

Table 5. Classification accuracy (%) using space decomposition for the middle segment of the music (S_mid)

Classifier    BL      OAA     RR      FS      FSOAA   FSRR
J48           44.44   44.56   49.93   45.76   45.09   50.86
3-NN          56.26   56.26   56.26   60.02   60.95   62.55
MLP           56.40   53.08   54.59   54.73   54.76   49.76
NB            47.76   45.83   47.79   50.09   48.79   50.69
SVM           –       38.62   63.50   –       32.86   59.70


Table 6. Classification accuracy (%) using space decomposition for the end segment of the music (S_end)

Classifier    BL      OAA     RR      FS      FSOAA   FSRR
J48           38.80   38.42   45.53   38.73   38.99   45.86
3-NN          48.43   48.43   48.43   51.11   51.10   53.49
MLP           48.26   51.96   51.92   47.86   50.53   49.64
NB            39.13   37.26   39.19   39.66   37.63   39.59
SVM           –       28.89   54.60   –       28.22   55.33

Analyzing these results for each classifier we can outline the following conclusions: (a) for J48 and 3-NN the feature selection method with the RR space decomposition approach produces better accuracy than the other options; (b) for the MLP classifier feature selection seems to be ineffective: the best results are obtained with the complete feature set; (c) for the NB classifier FS produces the best results without space decomposition in S_beg and S_end, and with the RR approach in S_mid; (d) for the SVM classifier the best results arise with the use of the RR approach, and FS increases accuracy only in the S_end segment. This classifier also presents the best overall result, using the RR space decomposition in S_mid without FS.

In order to consider the ensemble approach with time decomposition, Table 7 presents the results of the experiments using both space and time decompositions, for the OAA and RR approaches, with and without feature selection. We emphasize that this table encompasses the three time segments S_beg, S_mid and S_end, merged according to the combination procedure.

Table 7. Classification accuracy (%) using global space-time decompositions

Classifier    BL      OAA     RR      FS      FSOAA   FSRR
J48           47.33   49.63   54.06   50.10   50.03   55.46
3-NN          60.46   59.96   61.12   63.20   62.77   64.10
MLP           59.43   61.03   59.79   59.30   60.96   56.86
NB            46.03   43.43   47.19   47.10   44.96   49.79
SVM           –       30.79   65.06   –       29.47   63.03

Summarizing the results in Table 7, we conclude that the FSRR method improves classification accuracy for the J48, 3-NN and NB classifiers. The OAA and FSOAA methods present similar results for the MLP classifier, and only for the SVM classifier is the best result obtained without FS.

These results – and also the previous ones obtained for the individual segments – indicate that space decomposition and feature selection are more effective for classifiers that produce simpler separation surfaces between classes, like J48, 3-NN and NB, in contrast with the results obtained for the MLP and SVM classifiers, which can produce complex separation surfaces. This situation corroborates our initial hypothesis related to the use of space decomposition strategies.

As already mentioned, we also want to analyze whether different features have the same importance according to their time origin. Table 8 shows a schematic map indicating the features selected in each time segment. In this table we employ a binary BME mask – for the (B)eginning, (M)iddle and (E)nd time segments – where 0 indicates that the feature was not selected in the corresponding time segment, and 1 indicates that it was.

Table 8. Selected features in each time segment (BME mask)

Feature   3-NN   J48   MLP   NB    SVM   #
1         000    001   010   101   111   7
2         000    000   010   010   011   4
3         000    001   010   011   000   4
4         000    111   010   111   001   8
5         000    000   110   101   100   5
6         111    101   111   111   110   13
7         011    110   110   000   100   7
8         001    111   110   000   111   9
9         111    111   111   111   111   15
10        110    011   111   111   111   13
11        100    001   111   001   110   8
12        011    010   111   011   111   11
13        111    011   111   111   111   14
14        001    010   101   000   011   6
15        011    111   111   111   111   14
16        111    111   111   111   111   15
17        111    100   111   111   111   13
18        111    111   111   111   111   15
19        111    010   111   111   111   13
20        011    010   110   101   101   9
21        111    111   111   101   111   14
22        111    110   111   111   111   14
23        111    111   111   100   111   13
24        011    000   111   001   011   8
25        111    011   101   111   111   13
26        000    010   100   111   111   8
27        000    111   000   101   101   7
28        111    111   011   111   111   14
29        000    100   000   000   101   3
30        000    011   000   111   000   5

Several conclusions can be drawn from this table. The last column indicates how many times the corresponding feature was selected in the experiments (out of a maximum of 15 selections). Although different features can have different importance according to the classifier, we argue that this count gives a global idea of the discriminative power of each feature. For example, features 6, 9, 10, 13, 15, 16, 17, 18, 19, 21, 22, 23, 25 and 28 are highly selected, so they are important for music genre classification. For more discussion, see [30], [32] and [33]. We recall that features 1 to 6 are beat related, 7 to 25 are related to timbral texture, and 26 to 30 are pitch related.


6. CONCLUSIONS

In this paper we present a novel approach to the music genre classification problem, based on an ensemble approach and the decomposition of the music signal along the space and time dimensions. Feature vectors are extracted from different time segments of the beginning, middle and end parts of the music; in order to simplify the generated classifiers, space decomposition strategies based on the one-against-all and round-robin approaches were used. From the set of partial classification results originating from these views of the problem space, a unique final classification label is produced. A large set of classical classification algorithms was employed on the individual segments, and a heuristic combination procedure was used to produce the final music genre label.

In order to evaluate the proposal we conducted an extensive set of experiments on a relatively large database – the Latin Music Database, with more than 3,000 music pieces from 10 music genres – specially constructed for this research project. This database was methodically constructed and is open to new research projects in the area.

Several conclusions can be inferred from the obtained results. Firstly, we conclude that the use of the initial 30-second segment of the music piece – up to now the most frequent strategy for obtaining the music feature vector – is not adequate: our test results show that the middle part of the music signal is better than the initial or end parts for feature extraction (Table 2). We believe this occurs because in the middle part the music signal is more stable and more representative of the corresponding music genre. In fact, the results of the tests using the middle segment are similar to the ones using the complete music signal; in the latter case, however, the processing time is higher, since there is an obvious relation between the length of the time interval used for feature extraction and the computational complexity of the corresponding extraction procedure.

Secondly, we conclude that the use of three time segments and the ensemble of classifiers approach provides better accuracy for the AMGC task than the use of the individual segments (Table 3). This result is in accordance with the conclusions of Li, Ogihara and Li [19], who state that specific approaches must be used for the labeling of different music genres when some hierarchical classification is considered. We therefore believe that our space-time decomposition scheme provides better classification results. Unfortunately a direct comparison with the results of Tzanetakis and Cook [34], or with those of Li, Ogihara and Li [19], is not possible because the GTZAN database provides only the feature values for the initial 30-second segment of the music pieces. Our temporal approach also differs from the one employed by Bergstra et al. [1], which initially uses simple decision stumps applied individually to each feature, and then performs feature selection and classification in parallel using AdaBoost.

Thirdly, we can analyze the results concerning the space decomposition approaches. As already mentioned, the use of binary classifiers is adequate for problems that present complex class separation surfaces. Our results show that in general the RR approach presents superior results compared to the OAA approach (Tables 2 and 3). We justify this fact with the same explanation: in RR, individual instances are eliminated – in contrast to the relabeling in the OAA approach – so the construction of the separation surface by the classification algorithm is simplified. Our best classification accuracy was obtained with the SVM classifier and space-time decomposition following the RR approach.

We also evaluated the effect of using a feature selection procedure in the AMGC problem. Our FS procedure is based on the genetic algorithm paradigm. Each individual works as a mask that selects the set of features to be used for classification. The fitness of the individuals is based on the classification performance, following the wrapper approach. Classical genetic operations (crossover, mutation) are applied until a stopping criterion is attained.

The results achieved with feature selection show that this procedure is effective for the J48, k-NN and Naïve-Bayes classifiers; for MLP and SVM the FS procedure does not increase classification accuracy (Tables 4, 5, 6 and 7). These results are compatible with the ones presented in [37]. We note that using a reduced set of features implies a smaller processing time; this is an important issue in practical applications, where a compromise between accuracy and efficiency must be achieved.

We also note that the features have different importance in the classification according to their originating music segment (Table 8). It can be seen, however, that some features are present in almost every selection, showing that they have strong discriminative power in the classification task.

In summary, the use of space-time decomposition and the ensemble of classifiers approach provides better accuracy for music genre labeling than the use of the individual segments – initial, middle and end parts – of the music signal, and even better than when the classifier is trained on the whole music signal. Even if one considers that the increment in accuracy obtained by our proposal is not large, it represents an interesting trade-off between computational effort and classification accuracy, an important issue in practical applications. Indeed, the origin, number and duration of the time segments, the set of discriminative features, and the use of an adequate space decomposition strategy still remain open questions for the AMGC problem.


We intend to improve our proposal in order to increase classification accuracy by adding a second layer of binary classifiers to deal with the classes and/or partial state space views that present higher confusion.

REFERENCES

[1] J. Bergstra, N. Casagrande, D. Erhan, D. Eck and B. Kégl, "Aggregate features and AdaBoost for music classification," Machine Learning, 65(2-3):473–484, 2006.

[2] A. Blum and P. Langley, "Selection of Relevant Features and Examples in Machine Learning," Artificial Intelligence, 97(1-2):245–271, 1997.

[3] C.H.L. Costa, J.D. Valle-Jr and A.L. Koerich, "Automatic Classification of Audio Data," in IEEE International Conference on Systems, Man, and Cybernetics, pages 562–567, 2004.

[4] A.J.D. Craft, G.A. Wiggins and T. Crawford, "How Many Beans Make Five? The Consensus Problem in Music Genre Classification and a New Evaluation Method for Single-Genre Categorisation Systems," in Proceedings of the 8th International Conference on Music Information Retrieval, Vienna, Austria, pages 73–76, 2007.

[5] M. Dash and H. Liu, "Feature Selection for Classification," Intelligent Data Analysis, 1(1-4):131–156, 1997.

[6] T.G. Dietterich, "Ensemble Methods in Machine Learning," in Proceedings of the 1st International Workshop on Multiple Classifier Systems, Springer-Verlag, Lecture Notes in Computer Science, 1857:1–15, 2000.

[7] J.S. Downie and S.J. Cunningham, "Toward a theory of music information retrieval queries: System design implications," in Proceedings of the 3rd International Conference on Music Information Retrieval, pages 299–300, 2002.

[8] R. Fiebrink and I. Fujinaga, "Feature Selection Pitfalls and Music Classification," in Proceedings of the 7th International Conference on Music Information Retrieval, Victoria, Canada, pages 340–341, 2006.

[9] Y. Freund and R.E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," Journal of Computer and System Sciences, 55(1):119–139, 1997.

[10] J. Fürnkranz, "Pairwise Classification as an Ensemble Technique," in Proceedings of the 13th European Conference on Machine Learning, Helsinki, Springer-Verlag, pages 97–110, 2002.

[11] M. Grimaldi, P. Cunningham and A. Kokaram, "A Wavelet Packet representation of audio signals for music genre classification using different ensemble and feature selection techniques," in Proceedings of the 5th ACM SIGMM International Workshop on Multimedia Information Retrieval, ACM Press, pages 102–108, 2003.

[12] M. Grimaldi, P. Cunningham and A. Kokaram, "An Evaluation of Alternative Feature Selection Strategies and Ensemble Techniques for Classifying Music," in Workshop on Multimedia Discovery and Mining, 14th European Conference on Machine Learning / 7th European Conference on Principles and Practice of Knowledge Discovery in Databases, Dubrovnik, Croatia, 2003.

[13] S. Hacker, MP3: The Definitive Guide. O'Reilly, 2000.

[14] T.K. Ho, "Nearest neighbors in random subspaces," in Advances in Pattern Recognition, Joint IAPR International Workshops SSPR and SPR, Lecture Notes in Computer Science, 1451:640–648, Sydney, Australia, 1998.

[15] J. Kittler, M. Hatef, R.P.W. Duin and J. Matas, "On Combining Classifiers," IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(3):226–239, 1998.

[16] A.L. Koerich and C. Poitevin, "Combination of homogeneous classifiers for musical genre classification," in IEEE International Conference on Systems, Man and Cybernetics, IEEE Press, pages 554–559, Hawaii, USA, 2005.

[17] K. Kosina, "Music genre recognition," MSc. Dissertation, Fachhochschule Hagenberg, June 2002.

[18] J.H. Lee and J.S. Downie, "Survey of music information needs, uses, and seeking behaviours: preliminary findings," in Proceedings of the 5th International Conference on Music Information Retrieval, Barcelona, Spain, pages 441–446, 2004.

[19] T. Li, M. Ogihara and Q. Li, "A Comparative Study on Content-Based Music Genre Classification," in Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Toronto, ACM Press, pages 282–289, 2003.


[20] M. Li and R. Sleep, "Genre Classification via an LZ78-Based String Kernel," in Proceedings of the 6th International Conference on Music Information Retrieval, London, UK, pages 252–259, 2005.

[21] T. Li and M. Ogihara, "Music Genre Classification with Taxonomy," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Philadelphia, USA, pages 197–200, 2005.

[22] H. Liu and L. Yu, "Feature Extraction, Selection, and Construction," Chapter 16 in The Handbook of Data Mining, Lawrence Erlbaum, pages 409–424, 2003.

[23] D. McEnnis, C. McKay and I. Fujinaga, "Overview of OMEN (On-demand Metadata Extraction Network)," in Proceedings of the International Conference on Music Information Retrieval, Victoria, Canada, pages 7–12, 2006.

[24] D. McEnnis and S.J. Cunningham, "Sociology and Music Recommendation Systems," in Proceedings of the 8th International Conference on Music Information Retrieval, Vienna, Austria, pages 185–186, 2007.

[25] A. Meng, P. Ahrendt and J. Larsen, "Improving Music Genre Classification by Short-Time Feature Integration," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Philadelphia, USA, pages 497–500, 2005.

[26] T.M. Mitchell, Machine Learning. McGraw-Hill, 1997.

[27] L.C. Molina, L. Belanche and A. Nebot, "Feature Selection Algorithms: a Survey and Experimental Evaluation," in Proceedings of the IEEE International Conference on Data Mining, Maebashi City, Japan, pages 306–313, 2002.

[28] J.J. Aucouturier and F. Pachet, "Representing Musical Genre: A State of the Art," Journal of New Music Research, 32(1):83–93, 2003.

[29] E. Pampalk, A. Rauber and D. Merkl, "Content-Based Organization and Visualization of Music Archives," in Proceedings of ACM Multimedia, Juan-les-Pins, France, pages 570–579, 2002.

[30] C.N. Silla Jr., C.A.A. Kaestner and A.L. Koerich, "Time-Space Ensemble Strategies for Automatic Music Genre Classification," in Proceedings of the Brazilian Symposium on Artificial Intelligence, Ribeirão Preto, Brazil, Lecture Notes in Computer Science, 4140:339–348, 2006.

[31] C.N. Silla Jr., C.A.A. Kaestner and A.L. Koerich, "The Latin Music Database: a Database for the Automatic Classification of Music Genres (in Portuguese)," in Proceedings of the 11th Brazilian Symposium on Computer Music, São Paulo, Brazil, pages 167–174, 2007.

[32] C.N. Silla Jr., C.A.A. Kaestner and A.L. Koerich, "Automatic Music Genre Classification Using Ensembles of Classifiers," in Proceedings of the IEEE International Conference on Systems, Man and Cybernetics, Montreal, Canada, pages 1687–1692, 2007.

[33] C.N. Silla Jr., Classifiers Combination for Automatic Music Classification (in Portuguese). MSc. Dissertation, Graduate Program in Applied Computer Science, Pontifical Catholic University of Paraná, January 2007.

[34] G. Tzanetakis and P. Cook, "Musical Genre Classification of Audio Signals," IEEE Transactions on Speech and Audio Processing, 10(5):293–302, 2002.

[35] F. Vignoli, "Digital music interaction concepts: a user study," in Proceedings of the 5th International Conference on Music Information Retrieval, Barcelona, Spain, pages 415–420, 2004.

[36] I.H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, 2005.

[37] Y. Yaslan and Z. Cataltepe, "Audio Music Genre Classification Using Different Classifiers and Feature Selection Methods," in Proceedings of the International Conference on Pattern Recognition, Hong Kong, China, pages 573–576, 2006.
