A Comparison Study to Identify Birds Species Based on Bird Song Signals

Abstract: This paper presents a comparison study of automatic bird species identification based on bird acoustic signals, using audio files from the XENO-CANTO online database. Three feature sets are compared: Mel-Frequency Cepstral Coefficients (MFCC), geo-related meta-features, and their integration. The classifiers examined are Linear Discriminant Analysis (LDA), Support Vector Machine (SVM), k-Nearest Neighbor (kNN), and Ensemble Learning. Our experimental results show that ensemble learning using discriminant learners, with the integration of MFCC features and geo meta-features, obtains the best detection performance.


Introduction
Birds live worldwide and rank as the class of tetrapods with the most living species, at approximately ten thousand, of which more than half are passerines, also known as perching birds or songbirds. Bird vocalizations are traditionally divided into bird calls and bird songs. The distinction between songs and calls is based on complexity, length, and context, and has many exceptions. Songs are longer and more complex, are mainly produced by males, and are associated with courtship and mating during the breeding season, while calls tend to be shorter and simpler, are produced by both sexes throughout the year, and serve functions such as raising alarms or keeping members of a flock in contact [1].
Bird watching is a traditional and popular activity focused on the observation of birds. Because many birds are more easily heard than photographed, their sounds are a promising basis for convenient and reliable species identification. With the rapid development of digital technology, portable devices such as mobile phones are equipped with outdoor recording functionality, adequate storage capacity, and enough computational power to analyze recordings on site. It is easier than ever for bird watchers to record bird sounds in the field. Professionals such as ornithologists and ecologists, on the other hand, traditionally rely on long-term semi-automatic acoustic monitoring, without human presence at the recording site, for scientific research or ecosystem evaluation [2].
The motivation for bird species identification goes beyond bird watching. Birds are good indicators of the state of their surrounding ecosystem, since they are widely distributed and react quickly to changes in environmental conditions such as habitat loss, declining biodiversity, and climate change. Acoustic monitoring for tracking bird migration and for estimating the populations of bird species provides information for understanding and evaluating changes in the environment. Recognition by bird sounds would also be a powerful tool for automatically identifying birds in settings such as areas near airports, to prevent collisions with aircraft [2].
Automatic bird species identification based on birdsong is a recent application of machine learning, essentially of pattern recognition and classification. The challenge can be broken down into two main stages. First, each bird sound recording is analyzed, normally with signal processing tools, to produce a discriminative feature set that represents the original audio signal with respect to its species. Techniques widely used in feature extraction include temporal and spectral measurements, Linear Predictive Coding, and Mel Frequency Cepstral Coefficients (MFCC). Second, these feature sets serve as input to a classification system. Several algorithms have been employed, including probabilistic and instance-based classifiers, neural networks, and support vector machines [3].
Linear predictive coding (LPC) is a tool used mostly in audio signal processing and speech processing for representing the spectral envelope of a digital signal, using the information of a linear predictive model [4]. It is one of the most powerful speech analysis techniques. The basic idea behind this model is that a speech sample can be approximated by a linear combination of previous speech samples. LPC analysis determines the coefficients of a forward linear predictor by minimizing the prediction error in the least-squares sense.
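The least-squares idea behind LPC can be illustrated with a minimal sketch. It assumes NumPy is available; the function name `lpc_coefficients` and the toy signal are illustrative and not part of the study. Production implementations usually use the autocorrelation method with the Levinson-Durbin recursion instead of a direct least-squares solve.

```python
import numpy as np

def lpc_coefficients(signal, order):
    """Estimate forward linear-prediction coefficients by least squares.

    Models x[n] ~ sum_{i=1..order} a[i-1] * x[n-i] and returns the vector a.
    """
    x = np.asarray(signal, dtype=float)
    # Each row holds the `order` samples preceding the target, most recent first.
    past = np.array([x[n - order:n][::-1] for n in range(order, len(x))])
    target = x[order:]
    # Minimize ||past @ a - target||^2 over a.
    coeffs, *_ = np.linalg.lstsq(past, target, rcond=None)
    return coeffs

# Toy signal that exactly follows x[n] = 1.5 * x[n-1] - 0.7 * x[n-2].
x = [1.0, 0.5]
for _ in range(50):
    x.append(1.5 * x[-1] - 0.7 * x[-2])

a = lpc_coefficients(x, order=2)
print(a)  # should be close to the generating coefficients 1.5 and -0.7
```

Because the toy signal is generated by an exact second-order recursion, the prediction error can be driven to zero and the fit recovers the generating coefficients; real recordings leave a non-zero residual.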
The cepstral coefficients are obtained by taking the inverse Fourier transform of the logarithmic magnitude spectrum of a signal. LPC-derived cepstral coefficients (LPCCs) are a very effective representation for speech coding, analysis, synthesis, and recognition. A significant property of LPC spectral modeling is that the LPC spectrum matches the signal spectrum closely near the spectral peaks. LPCCs are therefore more robust and reliable features for speech recognition and have been shown to be more relevant than LPCs [5,8].
MFCC is a frame-level feature. It is computed by transforming the spectrum of a frame onto the Mel scale, which approximates the human auditory system's response more closely than the linear frequency scale. The traditional method for obtaining the coefficients starts by pre-emphasizing the sampled signal and then applying framing and windowing. The Fast Fourier Transform (FFT) of each windowed frame is taken, yielding a power spectrum. This spectrum is passed through a Mel filter bank, whose output length equals the number of filters. Finally, taking a discrete cosine transform of the log of the filter bank's output gives an array of features that describe the spectral shape of the signal [6,7,18].
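The first steps of this pipeline (pre-emphasis, framing, windowing, and the FFT power spectrum) can be sketched as follows. NumPy is assumed, and the pre-emphasis factor of 0.97, the Hamming window, the frame length of 512 samples, and the hop of 256 samples are common illustrative choices rather than values taken from the study.

```python
import numpy as np

def power_spectrogram(signal, frame_len=512, hop=256, pre_emph=0.97):
    """Return per-frame power spectra, shape (n_frames, frame_len // 2 + 1)."""
    x = np.asarray(signal, dtype=float)
    # Pre-emphasis boosts high frequencies: y[n] = x[n] - pre_emph * x[n-1].
    y = np.append(x[0], x[1:] - pre_emph * x[:-1])
    # Framing: cut the signal into overlapping frames of equal length.
    n_frames = 1 + (len(y) - frame_len) // hop
    window = np.hamming(frame_len)
    frames = np.stack([y[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    # FFT of each windowed frame, converted to a power spectrum.
    spectrum = np.fft.rfft(frames, axis=1)
    return (np.abs(spectrum) ** 2) / frame_len

# One second of a synthetic 1 kHz tone sampled at 44 100 Hz.
sr = 44100
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 1000.0 * t)
spec = power_spectrogram(tone)
peak_bin = int(spec.mean(axis=0).argmax())
print(spec.shape, peak_bin * sr / 512)  # frame count x bins, peak frequency in Hz
```

The resulting power spectra are what a Mel filter bank would then integrate into band energies before the logarithm and DCT are applied.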
Earlier work on birdsong identification focused on template matching, usually in conjunction with dynamic time warping (DTW). One first manually obtains a collection of templates, including intervals with no bird sound, and then slides them across a target spectrogram. The DTW algorithm stretches either the template or the target spectrogram in order to calculate a measure of similarity. This can be viewed as simultaneous segmentation and classification [9].
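The stretching performed by DTW can be illustrated with the classic dynamic-programming recurrence. This is a minimal sketch on one-dimensional toy sequences with an absolute-difference local cost; template matching on spectrograms would compare frame vectors instead, and the sequences below are made up for illustration.

```python
def dtw_distance(a, b):
    """Dynamic time warping cost between two 1-D sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = cheapest alignment of the first i items of a with the first j of b.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Extend the best of: stretch a, stretch b, or advance both.
            cost[i][j] = local + min(cost[i - 1][j],
                                     cost[i][j - 1],
                                     cost[i - 1][j - 1])
    return cost[n][m]

template = [0, 1, 2, 1, 0]
stretched = [0, 0, 1, 1, 2, 2, 1, 1, 0, 0]  # same shape, played at half speed
other = [2, 2, 2, 2, 2]                     # different shape

same_cost = dtw_distance(template, stretched)
diff_cost = dtw_distance(template, other)
print(same_cost, diff_cost)
```

The time-stretched copy aligns to the template at zero cost, while the differently shaped sequence does not, which is the property that makes DTW useful as a similarity measure between songs of varying tempo.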
Lopes et al. [3] compared the performance of three feature sets, originally implemented for music analysis, combined with a series of machine learning algorithms applied to the bird species identification problem. Experiments were conducted to evaluate various combinations of feature sets and classifiers on a database composed of 101 audio records from 3 bird species. Somervuo et al. [10] found that MFCC outperforms sinusoidal features and a collection of spectral features such as spectral centroid, bandwidth, roll-off, and flux. Lee et al. [11] designed image shape features to identify bird species based on the recognition of fixed-duration birdsong segments whose spectrograms are viewed as gray-level images. Chou and Liu [12] applied a wavelet transformation to sections of the bird songs; the first five orders of MFCCs are then computed, and MFCCs of the same order are aligned. Neal et al. [13] proposed a supervised time-frequency audio segmentation method using a Random Forest classifier. Springer [14] addressed the case in which multiple species of birds sing concurrently in the same recording. Zhao et al. [15] designed acoustic environment signatures that can be used for background noise recognition.
In this paper, we compare MFCC features and geo-metadata features using several machine learning classifiers.

Data Set
XENO-CANTO is a collaborative database containing more than 192k audio records that cover 9120 bird species, observed all around the world by more than 2000 contributors at the time of writing; these numbers keep growing every day. Recordings in the database do not consist only of bird song. A substantial part of them contains background noise such as sounds from other animals, wind and machine noise, or electric hum, which is close to real-world application conditions. Contributors also provide metadata for each audio file, including geographical information, date and/or time of day, and/or the presence of background species. The LifeCLEF 2014 Bird Identification Task is based on a subset of the XENO-CANTO database. It contains 14027 audio recordings belonging to 501 bird species from a South American region centered on Brazil [16]. Additional information includes the metadata associated with each audio file, in XML format.

Feature Extraction
The dataset is sourced from a large online collection of user-submitted recordings, and therefore suffers from inconsistent audio quality. The quality varies with atmospheric conditions such as wind and rain, interfering bird and insect calls, the quality of the recording equipment, and the experience of the recordist. To avoid bias in the evaluation related to recording devices, the organizers preprocessed the whole audio data to normalize the sampling frequency to 44.1 kHz. [...] temporal bins of a signal sampled at 44 100 Hz). Two successive frames overlap by 33%, i.e. 3.9 ms. The mel scale maps the physical frequency to a perceptual representation. The mapping between the physical frequency scale (Hz) and the perceptual frequency scale (mel) is approximately linear below 1000 Hz and logarithmic at higher frequencies. The relation between the physical frequency scale and the mel frequency scale can be described as:
$$\mathrm{mel}(f) = 2595 \log_{10}\left(1 + \frac{f}{700}\right) \qquad (1)$$
where $f$ is the spectral frequency of the input bioacoustic signal obtained using the short-time Fourier transform.
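The Hz-to-mel mapping can be tried numerically with a short sketch. It assumes the widely used form mel = 2595 log10(1 + f/700); the helper names and the choice of 24 filters between 0 Hz and 22 050 Hz are illustrative assumptions, not settings reported in the study.

```python
import math

def hz_to_mel(f_hz):
    """Map a physical frequency in Hz to the mel scale."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse mapping from mel back to Hz."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_center_frequencies(n_filters, f_min, f_max):
    """Center frequencies (Hz) of filters spaced evenly on the mel scale."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_filters + 1)
    return [mel_to_hz(lo + step * (k + 1)) for k in range(n_filters)]

centers = mel_center_frequencies(24, 0.0, 22050.0)
print(round(hz_to_mel(1000.0), 1))
print([round(c) for c in centers[:3]], [round(c) for c in centers[-3:]])
```

Printing the first and last few center frequencies makes the behavior described above visible: the filters are closely and almost evenly spaced at low frequencies and increasingly far apart at high frequencies.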
In the process of MFCC feature extraction, the Fourier spectrum is filtered by a set of mel-scale filters. The MFCCs are computed by performing a DCT on the logarithmic energy output by every bandpass filter:
$$c_m = \sum_{k=1}^{K} \log(E_k)\, \cos\left[\frac{m\,(k - 0.5)\,\pi}{K}\right], \qquad m = 0, 1, \ldots, L-1 \qquad (2)$$
where $K$ is the number of bandpass filters, $L$ is the desired length of the MFCCs, and $E_k$ is the energy of the output of the $k$-th bandpass filter. $L$ is set to 16 according to the study in reference [17].
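The DCT step can be sketched in a few lines. The sketch assumes the common DCT-II form c_m = sum over k of log(E_k) cos[m (k - 0.5) pi / K] for m = 0, ..., L-1; the function name and the choice of K = 24 filters in the example are illustrative assumptions.

```python
import math

def mfcc_from_energies(energies, n_coeffs=16):
    """DCT of the log filter-bank energies of one frame.

    energies: positive outputs E_1..E_K of the K mel bandpass filters.
    n_coeffs: number of coefficients L to keep.
    """
    n_filters = len(energies)
    log_e = [math.log(e) for e in energies]
    coeffs = []
    for m in range(n_coeffs):
        total = 0.0
        for k in range(1, n_filters + 1):
            total += log_e[k - 1] * math.cos(m * (k - 0.5) * math.pi / n_filters)
        coeffs.append(total)
    return coeffs

# A perfectly flat spectrum: every filter outputs the same energy e, so log(E_k) = 1.
flat = [math.e] * 24
c = mfcc_from_energies(flat, n_coeffs=16)
print([round(v, 6) for v in c[:4]])
```

For a flat spectrum only the zeroth coefficient is non-zero, which illustrates why the higher-order coefficients are read as descriptors of spectral shape rather than overall level.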
Let the MFCC features of a recording be denoted by the matrix $C = \{c_m^t\}$ ($m = 0, 1, \ldots, 15$; $t = 1, 2, \ldots, n$), where $n$ is the number of frames. The first 16 MFCC features are extracted based on equation (3). Additionally, we retrieve 16 MFCC features from the first derivative based on equation (4), and a further 16 MFCC features are designed based on equation (5). Based on equations (3), (4), and (5), a total of 48 MFCC features are extracted.
Additionally, the three geo-meta features Latitude, Longitude, and Elevation were extracted from the XML metadata file.
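Reading the three geo-meta features from a metadata file could look like the sketch below, which uses Python's standard `xml.etree.ElementTree`. The element names `Latitude`, `Longitude`, and `Elevation`, the `Audio` root element, and the sample values are hypothetical, since the actual schema of the metadata files is not described here.

```python
import xml.etree.ElementTree as ET

# Hypothetical metadata record; tag names and values are made up for illustration.
SAMPLE_XML = """<Audio>
  <Species>Example species</Species>
  <Latitude>-22.45</Latitude>
  <Longitude>-42.77</Longitude>
  <Elevation>1200</Elevation>
</Audio>"""

def geo_features(xml_text, tags=("Latitude", "Longitude", "Elevation")):
    """Return [latitude, longitude, elevation] as floats, NaN where unavailable."""
    root = ET.fromstring(xml_text)
    values = []
    for tag in tags:
        node = root.find(tag)
        text = node.text.strip() if node is not None and node.text else ""
        try:
            values.append(float(text))
        except ValueError:
            # Missing or non-numeric entries become NaN.
            values.append(float("nan"))
    return values

features = geo_features(SAMPLE_XML)
print(features)
```

Mapping missing or malformed entries to NaN keeps the feature vector at a fixed length of three, so that the handling of incomplete metadata can be decided later, for example by imputation or by dropping the recording.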

Experiments
For our comparison study, we select the following classifiers: Linear Discriminant Analysis (LDA), Support Vector Machine (SVM) with a linear kernel and with a quadratic kernel, k-Nearest Neighbor (kNN), and Ensemble Learning [19] using discriminant analysis and kNN learners, respectively. This section discusses the results obtained on subsets of the 2014 BirdCLEF dataset, with the aim of finding which classifier scales as the number of bird species increases. All results are obtained using 5-fold cross-validation. Table 1 lists the detection accuracy obtained by applying the six classifiers to three feature sets (48-dimensional MFCC, 3-dimensional META, and the integration of MFCC and META) for four different numbers of bird species: 5, 10, 50, and 100 (with 275, 494, 1892, and 3263 instances, respectively). The ROC curves and confusion matrices obtained using ensemble learning with the subspace discriminant classifier are provided in Figures 1 to 5.
The accuracies of all six classifiers decrease dramatically as the number of bird species increases from 5 to 10, 50, and 100. SVM and ensemble learning with discriminant learners on random subspaces have comparable accuracy, the highest among the classifiers, but in our experiments the training time of SVM is much longer than that of ensemble learning. The integration of MFCC with geo-meta features obtains the best detection accuracy.
The experimental results show that ensemble learning with discriminant learners has the best performance. To compare the detection performance under different subspace dimensionalities, we adjust the subspace parameter from the default value of 26 to 30, 40, and 50, respectively. Table 2 shows the detection accuracy under different subspace parameters with discriminant ensemble learning. Figure 6 shows the ROC curve for detecting all 501 bird species using MFCC and geo-meta features together.
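A comparison of this kind can be sketched with scikit-learn, which is an assumption here since the toolkit used for the experiments is not named. The data below are synthetic stand-ins (5 classes, 51 features), not the BirdCLEF features. The random-subspace ensembles are approximated with `BaggingClassifier` using feature subsampling and no bootstrap, and all hyperparameters, such as 30 learners and k = 5, are illustrative.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data: 5 "species", 40 recordings each, 48 + 3 features.
rng = np.random.default_rng(0)
n_species, per_species, n_features = 5, 40, 51
centers = rng.normal(scale=2.0, size=(n_species, n_features))
X = np.vstack([c + rng.normal(size=(per_species, n_features)) for c in centers])
y = np.repeat(np.arange(n_species), per_species)

def random_subspace(base, n_dims=26):
    # Each learner sees every sample but only a random subset of n_dims features.
    return BaggingClassifier(base, n_estimators=30, max_features=n_dims,
                             bootstrap=False, random_state=0)

classifiers = {
    "LDA": LinearDiscriminantAnalysis(),
    "SVM (linear)": make_pipeline(StandardScaler(), SVC(kernel="linear")),
    "SVM (quadratic)": make_pipeline(
        StandardScaler(), SVC(kernel="poly", degree=2, coef0=1.0)),
    "kNN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "Subspace discriminant": random_subspace(LinearDiscriminantAnalysis()),
    "Subspace kNN": random_subspace(KNeighborsClassifier(n_neighbors=5)),
}

# Mean accuracy over 5 stratified folds for each classifier.
accuracy = {name: cross_val_score(clf, X, y, cv=5).mean()
            for name, clf in classifiers.items()}
for name, acc in accuracy.items():
    print(f"{name}: {acc:.3f}")
```

Changing `n_dims` in `random_subspace` corresponds to varying the subspace dimensionality of the ensemble, and replacing `X` with MFCC-only, META-only, or combined feature matrices reproduces the structure of the feature-set comparison.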

Conclusions
In this study, we compared approaches to identifying bird species based on bird songs. Our study shows that the integration of MFCC features and geo-meta features obtains the best detection accuracy.

Figure 1: Confusion matrix of 5 species subsets with Ensemble: Subspace Discriminant classifiers using MFCC features only (a) and using MFCC and META features (b)

Figure 2: Confusion matrix of 10 species subsets with Ensemble: Subspace Discriminant classifiers using MFCC features only (a) and using MFCC and META features (b)

Table 2: Validation accuracy (%) on different species by discriminant ensemble