
Audio Engineering Society

Convention Paper
Presented at the 118th Convention
2005 May 28–31 Barcelona, Spain

This convention paper has been reproduced from the author's advance manuscript, without editing, corrections, or consideration by the Review Board. The AES takes no responsibility for the contents. Additional papers may be obtained by sending request and remittance to Audio Engineering Society, 60 East 42nd Street, New York, New York 10165-2520, USA; also see www.aes.org. All rights reserved. Reproduction of this paper, or any portion thereof, is not permitted without direct permission from the Journal of the Audio Engineering Society.

On the usefulness of differentiated transient/steady-state processing in machine recognition of musical instruments

Slim ESSID¹, Pierre LEVEAU¹·², Gaël RICHARD¹, Laurent DAUDET², and Bertrand DAVID¹

¹ GET - ENST (Télécom Paris) - TSI, 46, rue Barrault - 75634 Paris Cedex 13 - FRANCE

² Laboratoire d'Acoustique Musicale, 11 rue de Lourmel, 75015 Paris

Correspondence should be addressed to Slim ESSID ([email protected])

ABSTRACT

This paper addresses the usefulness of the segmentation of musical sounds into transient/non-transient parts for the task of machine recognition of musical instruments. We highlight the discriminative power of the attack-transient segments on the basis of objective criteria, consistent with well-known psychoacoustics findings. The sound database used is composed of real-world mono-instrument phrases. Moreover, we show that, paradoxically, it is not always optimal to consider such a segmentation of the audio signal in a machine recognition system for a given decision window. Our evaluation exploits efficient automatic segmentation techniques, a wide variety of signal processing features, as well as feature selection algorithms and support vector machine classification.

1. INTRODUCTION

The attack and end transients of music notes carry a significant part of the information for musical instrument identification, as evidenced by music cognition and music acoustics studies [1, 2]. It is known that information about the production mode of the sound is essentially located at the beginning and at the end of the notes, like breath impulses for the wind instruments, bow strokes for the bowed strings, or plucking or hammering for percussive pitched instruments (for example piano and guitar). Additionally, music cognition experiments have shown that features related to the beginning of music notes (for example attack time [2]) can help humans discriminate different instrument notes.

For machine recognition tasks, signal processing features extracted from the attack transients (such as crest factor or onset duration) have also proved efficient for instrument family identification in previous work on isolated notes [3]. However, performing reliable extraction of such features on mono-instrument phrases in real playing conditions is not straightforward. In fact, state-of-the-art approaches to automatic musical instrument recognition on solo performances are based on cutting the signal into short analysis windows (about 30 ms), and they do not differentiate between transient and steady-state windows. Therefore, since non-transient segments are usually much longer than transient ones, the information carried by the transients gets diluted over the entire signal, and its impact on the final classification decision becomes weak.

Our study considers differentiated processing of the transient and non-transient parts of the musical phrases. It assumes that we have at hand at least one algorithm that performs automatic segmentation of the signal, with an estimated error rate. Adapted features can then be selected for each part.

We thus show, using class-separability criteria and recognition accuracy measures, that attack-transient segments of the musical notes are more informative than other segments for instrument classification. Subsequently, we discuss the efficiency of such a segmentation in the perspective of developing a realistic machine recognition system.

2. SIGNAL SEGMENTATION

The signal analysis is based on 32-ms constant-length windows, with a 50% overlap. After segmentation, each window is assigned to one of the following two categories: transient or non-transient.
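
To make the windowing concrete, here is a minimal NumPy sketch of this framing; the function and parameter names are our own, not from the paper:

    import numpy as np

    def frame_signal(x, sr=32000, win_ms=32.0, overlap=0.5):
        # 32 ms at 32 kHz -> 1024-sample windows; 50% overlap -> 512-sample hop
        win_len = int(sr * win_ms / 1000)
        hop = int(win_len * (1 - overlap))
        n_frames = 1 + max(0, (len(x) - win_len) // hop)
        # stack the overlapping constant-length analysis windows row by row
        return np.stack([x[i * hop : i * hop + win_len] for i in range(n_frames)])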

Two types of transient/non-transient segmentation are performed. The first type is based on an onset detector: when an onset is detected, a fixed number of windows including and following the onset are considered transient. The second type involves a continuous transientness criterion: windows for which this criterion exceeds a fixed threshold are considered transient. The next two sections (2.1 and 2.2) describe these two methods in detail.

2.1. Fixed-duration transient annotation based on onset detection

2.1.1. Onset Detection Algorithm

The automatic onset detection is based on a detection function that uses a spectral difference, taking the phase increment into account. The original method was introduced in [4]; it is based on the computation of a prediction error. If the signal is composed of stationary sinusoids, the first-order prediction $\hat{X}_{k,n}$ of the Discrete Fourier Transform (DFT) $X_{k,n}$ of the signal x at frequency bin k and time frame n is:

$$\hat{X}_{k,n} = |X_{k,n-1}|\, e^{j(2\phi_{k,n-1} - \phi_{k,n-2})}$$

where $\phi_{k,n}$ is the time-unwrapped phase of $X_{k,n}$.

When an onset occurs, there is a break in predictability, and therefore a peak in the prediction error. We thus define the function ρ:

$$\rho(n) = \sum_{k=1}^{K} \left| X_{k,n} - \hat{X}_{k,n} \right|$$

which exhibits peaks at onset locations. However, when evaluating this detection function on real sounds, we find that these peaks sometimes occur late with respect to the note onset times, because of the overly long rise times of the function's peaks. Although the onset is detected, the peak is located at the maximum spectral difference and not at the true onset time. We therefore perform an additional operation to sharpen the peaks of the detection function ρ: a differentiation (noted Δ) followed by a rectification:

$$\gamma(n) = \max\left(\Delta\rho(n),\, 0\right)$$

Onsets are then extracted by peak-picking the new detection function γ, called the Delta Complex Spectral Difference. A peak is selected if it exceeds a threshold δ(n), computed dynamically:

$$\delta(n) = \delta_{\mathrm{static}} + \lambda \cdot \operatorname{median}\left(\gamma(n-M), \ldots, \gamma(n+M)\right)$$
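
As an illustration, the following minimal sketch chains the three steps (prediction error, differentiation with rectification, dynamic thresholding). The STFT matrix X and the parameter values are assumptions for the example, not the paper's settings:

    import numpy as np

    def delta_complex_spectral_difference(X, delta_static=0.1, lam=1.0, M=8):
        # X: complex STFT matrix (frequency bins x time frames)
        mag = np.abs(X)
        phi = np.unwrap(np.angle(X), axis=1)  # time-unwrapped phases
        # first-order prediction of frame n from frames n-1 and n-2
        X_hat = mag[:, 1:-1] * np.exp(1j * (2 * phi[:, 1:-1] - phi[:, :-2]))
        rho = np.sum(np.abs(X[:, 2:] - X_hat), axis=0)        # prediction error
        gamma = np.maximum(np.diff(rho, prepend=rho[0]), 0)   # derivative + rectification
        onsets = []
        for n in range(len(gamma)):
            lo, hi = max(0, n - M), min(len(gamma), n + M + 1)
            delta_n = delta_static + lam * np.median(gamma[lo:hi])  # dynamic threshold
            if gamma[n] > delta_n and gamma[n] == gamma[lo:hi].max():  # local peak
                onsets.append(n + 2)  # shift back to STFT frame indices
        return np.array(onsets)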

2.1.2. Onset Detection Evaluation

The above onset detection algorithm was compared to other standard onset detection algorithms, such as the Spectral Difference (amplitude or complex domain) and the Phase Deviation [5]. It was also evaluated on a database of solo instrument recordings, manually annotated and cross-validated [6]. On the Receiver Operating Characteristic (ROC) curves of Figure 1, which plot good detections as a function of false alarms, the Delta Complex Spectral Difference shows a significant improvement over the other functions (its ROC curve lies consistently above all the others).

Fig. 1: ROC curves of the detection functions (good detections [%] versus false detections [%]). (square): Complex Domain Spectral Difference, (plus): Phase Deviation, (circle): Spectral Difference, (star): Delta Complex Spectral Difference.

2.2. Transientness criterion

One of the main limitations of the system described above is that it assumes that all transient regions have the same length. Obviously, this is overly simplistic when the signals considered range from percussive instruments (which have very sharp attack transients) to string or wind instruments (which can have very long attack durations). Therefore, a more signal-adaptive algorithm has been developed, based on the continuous transientness criterion introduced by Goodwin in [7]. Like most onset detection algorithms, it is based on a spectral difference, with some adaptation to the signal level. Given the spectral flux function f:

$$f(n) = \sum_{k=1}^{K} \left( |X_{k,n}| - |X_{k,n-1}| \right),$$

the following operation is performed:


$$\beta_n = \begin{cases} f(n) & \text{if } f(n) > \beta_{n-1} \\ \alpha\,\beta_{n-1} & \text{otherwise,} \end{cases} \qquad \alpha < 1$$

α must be set close to 1, and β initialized to the theoretical or empirical maximum of f. Figure 2 shows the adaptation provided by this operation. The windows for which the criterion β remains above a fixed threshold are considered transient windows.
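
A minimal sketch of this recursion, assuming an STFT matrix X is available (the decay factor and threshold below are illustrative values):

    import numpy as np

    def transientness(X, alpha=0.98):
        # spectral flux f(n), summed over the K frequency bins
        mag = np.abs(X)
        flux = np.sum(mag[:, 1:] - mag[:, :-1], axis=0)
        beta = np.empty_like(flux)
        b = flux.max()  # initialize beta to the empirical maximum of f
        for n, f in enumerate(flux):
            b = f if f > b else alpha * b  # jump on attacks, decay by alpha otherwise
            beta[n] = b
        return beta

    # windows whose normalized criterion exceeds a fixed threshold are transient:
    # beta = transientness(X); transient = beta / beta.max() > 0.5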

Fig. 2: Transientness criterion. Top: original signal (trumpet sample); middle: transientness criterion; bottom: normalized transientness criterion, with circles at 1 indicating transient windows (horizontal axis: time [s]).

2.3. A priori comparison of the segmentation methods

Once the two types of segmentation are performed, the coincidence between them can be evaluated. Since we lack a ground truth for transientness, we can only provide this indication of the robustness of our segmentation with respect to the employed method.

For nearly equal numbers of transient windows detected by both methods, we found that about 40% of each set of transient windows were common. For a random segmentation, the coincidence is about 7%. These quantities show that, although significantly correlated, the two methods are far from giving the same results. Results obtained only on transient windows may therefore vary according to the chosen segmentation method.
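
The coincidence figure above can be computed as in the following sketch, over boolean per-window labels (names ours):

    import numpy as np

    def coincidence(labels_a, labels_b):
        # fraction of each method's transient windows also marked by the other
        a = np.asarray(labels_a, dtype=bool)
        b = np.asarray(labels_b, dtype=bool)
        common = np.sum(a & b)
        return common / a.sum(), common / b.sum()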


3. FEATURE EXTRACTION AND SELECTION

3.1. Feature extraction

A wide selection of more than 300 signal processing features is considered, including some of the MPEG-7 descriptors. They are briefly described hereafter; interested readers are referred to [8] for a more detailed description.

3.1.1. Temporal features

• Autocorrelation Coefficients were reported to be useful in [9]; they represent the "signal spectral distribution in the time domain".

• Zero Crossing Rates (ZCR) are computed over short windows and long windows; they can discriminate periodic signals (small ZCR values) from noisy signals (high ZCR values).

• Local temporal waveform moments are measured, including the first four statistical moments (a sketch of these computations follows this list). The first and second time derivatives of these features were also taken to follow their variation over successive windows. The same moments were also computed from the waveform amplitude envelope over long windows; the amplitude envelope was obtained by low-pass filtering (10-ms half-Hanning window) the absolute value of the signal's complex envelope.

• Amplitude Modulation features are meant to describe the "tremolo" when measured in the 4-8 Hz frequency range, and the "graininess" or "roughness" of the played notes when the focus is put on the 10-40 Hz range [10]. A set of six coefficients was extracted as described in Eronen's work [10], namely AM frequency, AM strength and AM heuristic strength (for the two frequency ranges). Two coefficients were appended to cope with the fact that an AM frequency is measured systematically (even when there is no actual modulation in the signal): the product of tremolo frequency and tremolo strength, and the product of graininess frequency and graininess strength.
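
As announced above, here is a sketch of the ZCR and local-moment computations for one analysis window (function names and normalization conventions are ours):

    import numpy as np

    def zero_crossing_rate(frame):
        # fraction of sign changes between consecutive samples
        return np.mean(np.abs(np.diff(np.sign(frame))) > 0)

    def waveform_moments(frame):
        # first four statistical moments of the local waveform:
        # mean, variance, skewness, kurtosis
        mu = frame.mean()
        sigma = frame.std() + 1e-12
        z = (frame - mu) / sigma
        return mu, sigma**2, np.mean(z**3), np.mean(z**4)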

3.1.2. Cepstral features

Mel-Frequency Cepstral Coefficients (MFCC) were considered, as well as their first and second time derivatives [11]. The first few MFCC give an estimate of the spectral envelope of the signal.
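
For instance, such coefficients can be obtained with the librosa library (the file name and coefficient count below are illustrative assumptions, not the paper's setup):

    import librosa

    # load a solo phrase at the 32-kHz rate used in the experiments (Section 5.1)
    y, sr = librosa.load("solo_phrase.wav", sr=32000)   # hypothetical file
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # one vector per window
    d1 = librosa.feature.delta(mfcc)                    # first time derivative
    d2 = librosa.feature.delta(mfcc, order=2)           # second time derivative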

3.1.3. Spectral features

• The first two coefficients (excluding the constant 1) from an Auto-Regressive (AR) analysis of the signal are examined as an alternative description of the spectral envelope (which can be roughly approximated by the frequency response of this AR filter).

• A subset of features is obtained from the statistical moments of the spectrum (see the sketch after this list), namely the spectral centroid (from the first-order moment), the spectral width (from the second-order moment), the spectral asymmetry defined from the spectral skewness, and the spectral kurtosis describing the "peakedness/flatness" of the spectrum. These features have proven successful for drum loop transcription [12] and for musical instrument recognition [13]. Their first and second time derivatives were also computed in order to provide an insight into spectral shape variation over time.

• A precise description of the spectral flatness is obtained, namely the MPEG-7 Audio Spectrum Flatness (successfully used for instrument recognition [13]) and Spectral Crest Factors, which are computed over a number of frequency bands [14].

• Spectral slope is obtained as the slope of a line segment fit to the magnitude spectrum [8]; spectral decrease is also measured, describing the "decreasing of the spectral amplitude" [8], as well as spectral variation, representing the variation of the spectrum over time [8], frequency cutoff (frequency roll-off in some studies [8]), computed as the frequency below which 99% of the total spectrum energy is accounted for, and an alternative description of the spectral flatness computed over the whole frequency band [8].

• The frequency derivative of the constant-Q coefficients is extracted, describing spectral "irregularity" or "smoothness", and reported to be successful by Brown [15].

• Octave Band Signal Intensities are exploited to capture, in a rough manner, the power distribution of the different harmonics of a musical sound without resorting to pitch-detection techniques. Using a filterbank of overlapping octave-band filters, the log energy of each subband (OBSI) and the logarithm of the energy ratio of each subband sb to the previous one sb − 1 (OBSIR) are measured [16].
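
As referenced from the spectral-moments item above, here is a minimal sketch of the moment-based descriptors for one magnitude spectrum (the normalization conventions are our own assumptions):

    import numpy as np

    def spectral_moments(mag, freqs):
        # treat the normalized magnitude spectrum as a distribution over frequency
        p = mag / (mag.sum() + 1e-12)
        centroid = np.sum(freqs * p)                          # 1st moment
        width = np.sqrt(np.sum((freqs - centroid) ** 2 * p))  # spread
        skewness = np.sum((freqs - centroid) ** 3 * p) / (width ** 3 + 1e-12)
        kurtosis = np.sum((freqs - centroid) ** 4 * p) / (width ** 4 + 1e-12)
        return centroid, width, skewness, kurtosis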

3.1.4. Perceptual features

We consider relative specific loudness (Ld), representing "a sort of equalization curve of the sound"; sharpness (Sh), as a perceptual alternative to the spectral centroid, based on specific loudness measures; and spread (Sp), being the distance between the largest specific loudness and the total loudness [8]; as well as their variation over time.

3.2. Feature selection

Feature Selection (FS) arises from data mining problems where a subset of d features is to be selected from a larger set of D candidates. The selected subset is required to include the most relevant features, i.e. the combination yielding the best classification performance. Feature selection has been extensively addressed in the statistical machine learning community [17, 18, 19] and utilized for various classification tasks including instrument recognition [20, 16, 21]. The strategies proposed to tackle the problem can be classified into two major categories: "filter" algorithms use the initial set of features intrinsically, whereas "wrapper" algorithms relate the Feature Selection Algorithm (FSA) to the performance of the classifiers to be used. The latter are more efficient than the former, but more complex.

We chose a simple filter approach based on Fisher's Linear Discriminant Analysis (LDA) [22]. This algorithm computes the relevance of each candidate feature using the weights estimated by the LDA. We used the Spider toolbox for Matlab.

We perform feature selection class-pairwise, in the sense that we fetch a different subset of relevant features for every possible pair of classes. This approach proved more successful than the classic one, where a single set of attributes is used for all classes [16, 21].

In order to measure the efficiency of the selected features $x_i$, we use an average class separability criterion $S$, obtained as the mean value of the bi-class separabilities computed for each class pair $p$ according to:

$$S_p = \operatorname{tr}\left[\left(\sum_{c=1}^{2} \pi_c \Sigma_c\right)^{-1} \sum_{c=1}^{2} (\mu_c - M)'(\mu_c - M)\right],$$

where $\pi_c$ is the a priori probability of class $c$, $\Sigma_c$ and $\mu_c$ are respectively the covariance matrix and the mean of the class-$c$ observations, and $M$ is the global mean of all $N$ observations, $M = \frac{1}{N}\sum_{i=1}^{N} x_i$. The higher the measured value of $S$, the better the classes are discriminated.
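
A sketch of this bi-class criterion for two observation matrices whose rows are the selected feature vectors (the within-class matrix W is assumed nonsingular):

    import numpy as np

    def pairwise_separability(X1, X2):
        n1, n2 = len(X1), len(X2)
        pi1, pi2 = n1 / (n1 + n2), n2 / (n1 + n2)  # a priori class probabilities
        # pooled within-class covariance, weighted by the class priors
        W = pi1 * np.cov(X1, rowvar=False) + pi2 * np.cov(X2, rowvar=False)
        M = np.vstack([X1, X2]).mean(axis=0)        # global mean of observations
        d1, d2 = X1.mean(axis=0) - M, X2.mean(axis=0) - M
        B = np.outer(d1, d1) + np.outer(d2, d2)     # between-class scatter
        return np.trace(np.linalg.solve(W, B))      # tr(W^{-1} B)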

4. CLASSIFICATION SCHEME

Classification is based on Support Vector Machines (SVM). SVM are powerful classifiers arising from Structural Risk Minimization theory [23] that have proven efficient for various classification tasks. They are by essence binary classifiers which aim at finding the hyperplane that separates the features related to each class Ci with the maximum margin. In order to enable non-linear decision surfaces, SVM map the D-dimensional input feature space into a higher-dimensional space where the two classes become linearly separable, using a kernel function. Interested readers are referred to [24, 25] for further details.

We use SVM in a "one vs one" scheme. This means that as many binary classifiers as possible class pairs are trained, and test segments are classified by every binary classifier in order to arrive at a decision. After posterior class probabilities have been fit to the SVM outputs following Platt's approach [26], we use the usual Maximum a Posteriori (MAP) decision rule [22].
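
A minimal sketch with scikit-learn, whose SVC trains one-vs-one binary SVMs internally and fits Platt-style posterior probabilities when probability=True; this only approximates the scheme described, and the data below are random placeholders:

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(200, 40))     # placeholder: 40 selected features
    y_train = rng.integers(0, 10, size=200)  # placeholder: 10 instrument classes

    clf = SVC(kernel="rbf", probability=True).fit(X_train, y_train)

    def classify_frame(X_windows):
        # MAP decision over one decision frame: sum the per-window log posteriors
        log_post = np.log(clf.predict_proba(X_windows) + 1e-12).sum(axis=0)
        return clf.classes_[np.argmax(log_post)]

    print(classify_frame(rng.normal(size=(4, 40))))  # e.g. a 4-window frame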

5. EXPERIMENTAL STUDY

5.1. Experimental parameters

Ten instruments from all instrument families are considered, namely Alto Sax, Bassoon, Bb Clarinet, Flute, Oboe, Trumpet, French Horn, Violin, Cello and Piano. Solo sound samples were excerpted from Compact Disc (CD) recordings, mainly obtained from personal collections. Table 1 sums up the properties of the data used in the following experiments. There is a complete separation between the sources used for training and the sources used for testing, so as to assess the generalization capability of the recognition system.

Audio signals were down-sampled to a 32-kHz sampling rate, centered with respect to their long-term temporal means, and their amplitudes normalized with respect to their maximum values. All spectra were computed with a Fast Fourier Transform after Hamming windowing. Windows consisting of silence were detected by a heuristic approach based on power thresholding and discarded from both train and test datasets.
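
A sketch of this preprocessing (the power threshold below is an illustrative value, not the paper's):

    import numpy as np

    def preprocess(x, power_thresh=1e-4, win=1024, hop=512):
        x = x - x.mean()                   # remove long-term temporal mean
        x = x / (np.abs(x).max() + 1e-12)  # normalize amplitude to the maximum
        frames = [x[i:i + win] for i in range(0, len(x) - win + 1, hop)]
        # discard silent windows by power thresholding
        return [f for f in frames if np.mean(f ** 2) > power_thresh]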

Instrument     Sources   Train     Test
AltoSax           9       5'29"    3'2"
Bassoon           7       3'0"     2'12"
BbClarinet       11       6'6"     5'50"
Flute             8       4'17"    3'2"
Oboe              9       6'54"    4'17"
FrenchHorn        6       3'33"    2'46"
Trumpet           8       7'13"    5'17"
Cello             5       5'52"    4'38"
Violin            6      10'18"    7'39"
Piano            18      18'28"   12'30"

Table 1: Sound database used. "Sources" is the number of different sources used; "Train" and "Test" are respectively the total lengths (in minutes and seconds) of the train and test sets. Sound excerpts are considered to be from different sources if they come from recordings of different artists.

Segmentation of the whole sound database was performed with the two methods described in Section 2. For fixed-length segmentation, two lengths were used: 2 windows (about 60 ms) and 4 windows (about 120 ms). Each segmentation is used to generate two datasets: a transient-window dataset and its complement, the non-transient-window dataset. In Figure 3, two examples of decision frames are shown: the Cl(t,4) decision frame, taking 4 overlapping windows at the transient location, and the Cl(nt,2) frame, taking 2 overlapping windows in a non-transient part of the signal.

Fig. 3: Examples of decision frames, Cl(t,4) and Cl(nt,2); dashed-line rectangles represent overlapping analysis windows (amplitude versus time [s]).

5.2. Efficiency of features selected over different segments

Pairwise feature selection was performed on the following datasets in order to obtain the 40 most relevant features (for each pair):

• 3 datasets including observations from segments labeled as transient (the related selected feature sets will be referred to as FS(t,2), FS(t,4) and FS(t,a), where FS(t,2) (resp. FS(t,4)) is the feature set selected on transient segments of length 2 (resp. 4) windows, and FS(t,a) the feature set selected on the frames with adaptive transient lengths);

• 3 datasets including observations from the remaining segments, labeled as non-transient by the same segmentation methods (the FS(nt,2), FS(nt,4) and FS(nt,a) sets, with the same notations as above);

• the "baseline" dataset including all observations, regardless of the transientness of the signal (FS(b)).

Significant variability is observed in the subsets of features selected for each pair of instruments over the considered datasets. The FSA outputs have been posted on the web (www.tsi.enst.fr/essid/pub/pubAES118/) for interested readers to look into them in depth.

Fig. 4: Mean class separability S (over all class pairs) with features selected on different segments: FS(b), FS(t,2), FS(nt,2), FS(nt,4), FS(t,4), FS(t,a), FS(nt,a).

The class separability measures (see Section 3.2) resulting from all the previous feature sets are depicted in Figure 4, from which the following can be deduced:

• the S values obtained with transient-segment data (FS(t,2), FS(t,4), FS(t,a)) are greater than the values reached by the non-transient sets (FS(nt,2), FS(nt,4), FS(nt,a)) and by FS(b); hence, better class separability is achieved using descriptors specifically selected for the transient-segment data (regardless of the segmentation method used);

• among the segmentation methods, the adaptive one (FS(t,a)) gives rise to observations which, when processed with the adapted features, enable the best discrimination between instruments;

• data from the non-transient segments result in poor class separability, smaller than that yielded by the undifferentiated processing (FS(b)).

These results thus confirm the widespread assertion that attack transients are particularly relevant for instrument timbre discrimination.

5.3. Classification over different segments

Based on the different sets of selected features (described in Section 5.2), we proceed to SVM classification of the musical instruments exploiting only the transient segments, only the steady-state segments, or all the audio segments. Recognition success is evaluated over a number of decision frames. Each decision frame combines elementary decisions taken over Lt, Lnt or L consecutive analysis windows, respectively for the transient-based classifier, the non-transient-based classifier and the generic classifier (exploiting all audio segments).

Table 2 sums up the recognition accuracies found in the following situations:

• classification based on FS(t,Lt), FS(nt,Lnt) and FS(b) with Lt = Lnt = L = 2;

• classification based on FS(t,Lt), FS(nt,Lnt) and FS(b) with Lt = Lnt = L = 4.

The decision frame lengths were chosen so as to enable a fair comparison of the classification performance of the different schemes. Note that these lengths are imposed by the lengths of the transient segments, which implies Lnt = Lt and L = Lt.

% correct     Cl(t,2)  Cl(nt,2)  Cl(b,2)  Cl(t,4)  Cl(nt,4)  Cl(b,4)
Piano            94       93        92       95       93        93
AltoSax          76       77        83       79       77        85
Bassoon          75       53        59       77       54        61
BbClarinet       53       54        52       57       52        53
Flute            89       68        77       87       73        79
Oboe             24       42        39       24       42        39
FrenchHorn       58       44        49       58       45        50
Trumpet          52       54        52       54       55        53
Cello            98       89        93       98       90        95
Violin           74       78        81       77       77        82
Average          69       65        68       71       66        69

Table 2: Results of classification based on FS(t,Lt), FS(nt,Lt) and FS(b) with Lt = 2 and Lt = 4, respectively denoted Cl(t,2), Cl(nt,2), Cl(b,2), Cl(t,4), Cl(nt,4) and Cl(b,4).

On average, better classification is achieved when using the transient segments; this holds for both tested transient-segment lengths. Better results are found, on average, with Lt = 4. Transients thus appear essential for proper machine recognition of instruments, as the worst results are obtained when they are not taken into consideration.

Nevertheless, looking at individual accuracies, one can note interesting exceptions. A striking one is the oboe, which is clearly better classified when the focus is put on its non-transient segments (42% on non-transients against 24% on transients). Since we consider that 1% differences are not statistically significant, this is the only case where non-transient segments lead to better classification performance. It can also be noted that the recognition accuracies of the alto sax and the violin found with the generic classifier are better than with the transient-segment one; in these cases, the undifferentiated processing leads to more successful classification. The confusion matrices reveal that the alto sax is more frequently confused with the violin when examined over the transient segments, while the violin is more often classified as cello, Bb clarinet and alto sax (even though it is less confused with the trumpet).

Table 3 shows the recognition accuracies of a "more realistic system", where longer decision frames are tolerated, using a generic classifier. Better overall performance is achieved compared to a classification scheme exploiting only transient-segment decision frames. It can be concluded that processing only the information of the transient windows is not sufficient to improve the results of generic classifiers when the decision is taken over a fixed-length frame of realistic size. Given the high scores obtained on transient frames, developing a fusion system merging the transient- and non-transient-window information contained in such a frame could be of interest.

% correct    Cl(b,30)  Cl(b,120)
Piano           97        99
AltoSax         90        95
Bassoon         64        74
BbClarinet      57        62
Flute           84        89
Oboe            37        60
FrenchHorn      57        72
Trumpet         60        63
Cello           99       100
Violin          85        87
Average         73        80

Table 3: Classification results with L = 30 and L = 120.

6. CONCLUSIONS

In this paper we studied the pertinence of using differentiated transient/steady-state processing for automatic classification of musical instruments in solo performances. Transient windows tend to concentrate relevant information for musical instrument identification. In fact, it has been shown that, in most cases, transient-segment observations lead to better instrument discrimination (regardless of the method used to perform the segmentation).

Nevertheless, in the perspective of developing a realistic machine recognition system wherein a fixed decision-frame length is imposed (typically 1 or 2 s for real-time systems), it is not straightforward to optimally exploit such segmentations. Indeed, better classification performance can then be achieved when undifferentiated processing is performed on all signal windows, compared to the case where the decision is taken only on the transient-signal windows within the decision frame.

Systems adequately merging expert classifiers based on transient and steady-state segments should be designed to enable better overall performance. Furthermore, to complete this study, specific transient parameters could be developed.

7. ACKNOWLEDGEMENTS

The authors wish to thank Marine Campedel for fruitful discussions on current feature selection techniques.

8. REFERENCES

[1] M. Clark, P. Robertson, and D. A. Luce. A preliminary experiment on the perceptual basis for musical instrument families. Journal of the Audio Engineering Society, 12:199–203, 1964.

[2] S. McAdams, S. Winsberg, G. de Soete, and J. Krimphoff. Perceptual scaling of synthesized musical timbres: common dimensions, specificities and latent subject classes. Psychological Research, 58:177–192, 1995.

[3] A. Eronen. Comparison of features for musical instrument recognition. In Proceedings of WASPAA, 2001.

[4] J. P. Bello, C. Duxbury, M. Davies, and M. B. Sandler. On the use of phase and energy for musical onset detection in the complex domain. IEEE Signal Processing Letters, 2004.

[5] J. P. Bello, L. Daudet, S. Abdallah, C. Duxbury, M. Davies, and M. B. Sandler. A tutorial on onset detection in music signals. IEEE Transactions on Speech and Audio Processing, 2005. To be published.

[6] P. Leveau, L. Daudet, and G. Richard. Methodology and tools for the evaluation of automatic onset detection algorithms in music. In Proceedings of ISMIR 2004, 2004.

[7] M. Goodwin and C. Avendano. Enhancement of audio signals using transient detection and modification. In Proceedings of the 117th AES Convention, 2004.

[8] Geoffroy Peeters. A large set of audio features for sound description (similarity and classification) in the CUIDADO project. Technical report, IRCAM, 2004.

[9] Judith C. Brown. Musical instrument identification using autocorrelation coefficients. In International Symposium on Musical Acoustics, pages 291–295, 1998.

[10] Antti Eronen. Automatic musical instrument recognition. Master's thesis, Tampere University of Technology, April 2001.

[11] Lawrence R. Rabiner and Biing-Hwang Juang. Fundamentals of Speech Recognition. Prentice Hall Signal Processing Series. PTR Prentice-Hall, Inc., 1993.

[12] Olivier Gillet and Gaël Richard. Automatic transcription of drum loops. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Montreal, Canada, May 2004.

[13] Slim Essid, Gaël Richard, and Bertrand David. Efficient musical instrument recognition on solo performance music using basic features. In AES 25th International Conference, London, UK, June 2004.

[14] Information technology - Multimedia content description interface - Part 4: Audio, June 2001. ISO/IEC FDIS 15938-4:2001(E).


[15] Judith C. Brown, Olivier Houix, and Stephen McAdams. Feature dependence in the automatic identification of musical woodwind instruments. Journal of the Acoustical Society of America, 109:1064–1072, March 2000.

[16] Slim Essid, Gaël Richard, and Bertrand David. Musical instrument recognition based on class pairwise feature selection. In 5th International Conference on Music Information Retrieval (ISMIR), Barcelona, Spain, October 2004.

[17] Ron Kohavi and G. John. Wrappers for feature subset selection. Artificial Intelligence Journal, 97(1-2):273–324, 1997.

[18] A. L. Blum and P. Langley. Selection of relevant features and examples in machine learning. Artificial Intelligence Journal, 97(1-2):245–271, December 1997.

[19] I. Guyon and A. Elisseeff. An introduction to feature and variable selection. Journal of Machine Learning Research, 3:1157–1182, 2003.

[20] Geoffroy Peeters. Automatic classification of large musical instrument databases using hierarchical classifiers with inertia ratio maximization. In 115th AES Convention, New York, USA, October 2003.

[21] Slim Essid, Gaël Richard, and Bertrand David. Musical instrument recognition by pairwise classification strategies. IEEE Transactions on Speech and Audio Processing, 2004. To be published.

[22] Richard O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. Wiley-Interscience, John Wiley & Sons, 1973.

[23] Vladimir Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, 1995.

[24] Christopher J. C. Burges. A tutorial on support vector machines for pattern recognition. Journal of Data Mining and Knowledge Discovery, 2(2):1–43, 1998.

[25] B. Schölkopf and A. J. Smola. Learning with Kernels. The MIT Press, Cambridge, MA, 2002.

[26] John C. Platt. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In Advances in Large Margin Classifiers. MIT Press, 1999.
