Have you ever woken up certain that some sound wasn't right, without knowing what it was?

Identifying sounds is one of the instincts that has kept human beings safe. Sounds play a significant role in our lives, from recognizing a predator nearby to being inspired by music, hearing human voices, or noticing the cry of a bird. Developing audio classifiers is therefore an important task.

In many cases it is crucial to classify the source of a sound, and such classifiers are already widely used. In music, for example, there are classifiers for genre. Recently, similar systems have started to be used to classify bird calls, work that was historically done by ornithologists. The goal is to identify which species is calling, which is difficult to do by ear in the field or in noisy environments.

Recently, deep learning (DL) has become one of the most popular technologies for solving a wide range of tasks, thanks to its accuracy and to improvements in computing hardware such as CPUs (Central Processing Units) and GPUs (Graphics Processing Units). The chart below shows the size of the deep learning market and its expected growth across software, hardware, and services.

In this post, we take the task of reading an audio file containing zero to a few bird calls and use deep learning to identify which birds are present, based on the Cornell Birdcall Identification Kaggle Challenge, where we won a silver medal.

How to deal with the data?

In our team's previous post, we explained how to load sound data and convert it to a spectrogram, and why that matters. Here's an example spectrogram of an Alder Flycatcher birdcall and a photo of the bird, in case you are curious.

Fig 3. Log-mel spectrogram of an Alder Flycatcher birdcall

Data processing speed is one of the keys to using a deep learning model effectively. Despite increases in computing power, audio processing is still expensive on CPUs. If we process the data on a better-suited resource such as a GPU, it can run roughly ten to one hundred times faster! In this post, we show how to compute spectrograms quickly with a library called torchlibrosa, which lets us process spectrograms on a GPU.

Build Spectrogram processor

torchlibrosa is a Python library that implements some audio processing functions in PyTorch, so they can use GPU resources. Here's an example of extracting spectrogram features using torchlibrosa.

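from torchlibrosa.stft import Spectrogram

# NOTE: the constructor arguments are cut off in the source; library defaults
# are used here as a placeholder. .cuda() puts the extractor on the GPU.
spectrogram_extractor = Spectrogram().cuda()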

Load audio data

We can load audio data with librosa, one of the most popular Python audio processing libraries.

import librosa

# get raw audio data
example, _ = librosa.load('img-kim/example.wav', sr=32000, mono=True)

Process Spectrogram

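import torch

# add a batch dimension and move the waveform to the GPU: shape (1, num_samples)
raw_audio = torch.Tensor(example).unsqueeze(0).cuda()

# the extractor must live on the same device as its input
spectrogram = spectrogram_extractor(raw_audio)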
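
To make concrete what a spectrogram extractor computes, here is a minimal pure-Python sketch of a power spectrogram: slice the signal into overlapping frames, apply a Hann window, and take the squared magnitude of each frame's Fourier transform. It is not torchlibrosa's implementation. The function names, the toy tone, and the `n_fft` and `hop_length` values are assumptions chosen for illustration, and the sketch skips the padding and mel filtering a real pipeline would add.

```python
import cmath
import math


def hann_window(length):
    """Periodic Hann window of the given length."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / length) for n in range(length)]


def power_spectrogram(signal, n_fft=64, hop_length=32):
    """Return a list of frames, each holding n_fft // 2 + 1 power values.

    Uses a naive O(n_fft^2) DFT per frame: fine for a toy signal,
    far too slow for real recordings.
    """
    window = hann_window(n_fft)
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop_length):
        chunk = [signal[start + n] * window[n] for n in range(n_fft)]
        frame = []
        for k in range(n_fft // 2 + 1):
            coeff = sum(
                chunk[n] * cmath.exp(-2j * math.pi * k * n / n_fft)
                for n in range(n_fft)
            )
            frame.append(abs(coeff) ** 2)
        frames.append(frame)
    return frames


# Toy input: a pure tone whose frequency falls exactly on bin 8
# (bin index = tone_hz * n_fft / sample_rate = 128 * 64 / 1024).
sample_rate = 1024
tone_hz = 128
signal = [math.sin(2.0 * math.pi * tone_hz * t / sample_rate) for t in range(256)]

spec = power_spectrogram(signal)
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
print(len(spec), len(spec[0]), peak_bin)
```

Each frame is an independent computation, which is why this workload maps well onto a GPU: all frames can be transformed in parallel rather than one after another.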

Benchmark processing speed

With torchlibrosa we can process audio data on the GPU. You may wonder how much faster the GPU is than the CPU, so here is a benchmark of processing speed on both devices. We selected an audio clip from the publicly available Cornell Birdcall Identification Kaggle Challenge dataset and compared how long processing takes on the CPU and on the GPU. We ran the test in a Colab environment so that it is easy to reproduce. Computing a log-mel spectrogram from about 5 minutes of audio was about 15 times faster on the GPU than on the CPU.

Fig 4. Processing time on CPU (Intel Xeon 2.20 GHz) and GPU (Nvidia T4). librosa is used for the CPU benchmark and torchlibrosa for the GPU benchmark.
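
For readers who want to run a similar comparison, a timing harness along these lines is one reasonable starting point. It is a sketch, not the script behind the figure above. The `benchmark` helper and the `toy_workload` stand-in are hypothetical names, and the workload should be swapped for the real CPU or GPU pipeline.

```python
import time


def benchmark(fn, repeats=5, warmup=1):
    """Return the best wall-clock time in seconds over `repeats` calls of fn()."""
    for _ in range(warmup):
        fn()  # warm-up calls are not timed (caches, lazy initialisation)
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return min(timings)


# Hypothetical stand-in workload; replace with the real processing function.
def toy_workload():
    return sum(i * i for i in range(10_000))


best_seconds = benchmark(toy_workload)
print(f"best of 5 runs: {best_seconds:.6f} s")
```

When timing GPU code, keep in mind that CUDA kernels are launched asynchronously. The timed function should call `torch.cuda.synchronize()` before returning, otherwise the timer may stop before the GPU has finished its work.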

How to classify a sound?

As mentioned above, deep learning also performs very well in the audio domain. It can capture the varied patterns of each target class in time-series data. More importantly for bird calls, the recording environment and the data matter. In environments such as fields or mountains, a lot of noise interferes with the birdcalls, and a long recording can contain many different birds. Considering these cases, we need to build a noise-robust, multi-label audio classifier.
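
To illustrate what "multi-label" means in practice, the sketch below decodes a vector of per-class scores with an independent sigmoid and a threshold, so a clip can yield several birds or none at all. This is a generic illustration rather than our competition code. The label codes, the scores, and the 0.5 threshold are assumptions.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def decode_multilabel(logits, class_names, threshold=0.5):
    """Return every class whose independent sigmoid probability passes the threshold."""
    probs = [sigmoid(z) for z in logits]
    return [name for name, p in zip(class_names, probs) if p >= threshold]


class_names = ["aldfly", "ameavo", "amebit"]  # hypothetical label codes

print(decode_multilabel([2.0, 1.5, -3.0], class_names))    # ['aldfly', 'ameavo']
print(decode_multilabel([-2.0, -1.0, -3.0], class_names))  # []
```

A softmax over classes would force the probabilities to sum to one and therefore always favor a single bird, which does not fit recordings with zero or several species.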

We are going to introduce the deep learning architecture our team (Dragonsong) used in the Cornell Birdcall Identification Kaggle Challenge.


We built a novel audio classifier architecture that captures time-series features effectively by combining CNN, RNN, and attention modules. Here is a brief diagram of the architecture we used in the Cornell Birdcall Identification Challenge.

Fig 5. Our architecture of birdcall classifier

We convert the raw audio into a log-mel spectrogram, which is the input to our architecture, and pass it through a ResNeSt50 backbone, an image classification architecture. We then feed the resulting features, which contain both spatial and temporal information, to RoI (Region of Interest) pooling and bi-GRU layers. These layers capture time-wise information again while reducing the feature dimension, because we believed that extracting temporal features is crucial for classifying many bird calls in long audio. Lastly, we pass the result to an attention module that scores each time step, to find out at which time steps birds are present.
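
The idea of scoring each time step can be sketched with a common attention-pooling scheme: a softmax over time produces weights, and the clip-level probability is the weighted sum of per-frame probabilities. This is an assumed, simplified formulation for a single class, not a transcription of our attention module. The function names and toy logits are illustrative.

```python
import math


def softmax(values):
    peak = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def attention_pool(att_logits, cls_logits):
    """Pool per-frame scores for one class into a clip-level probability.

    att_logits: attention score per time step (higher = more relevant)
    cls_logits: classification logit per time step
    Returns (clip_probability, attention_weights).
    """
    weights = softmax(att_logits)
    frame_probs = [sigmoid(z) for z in cls_logits]
    clip_prob = sum(w * p for w, p in zip(weights, frame_probs))
    return clip_prob, weights


# Toy clip with five time steps; the bird is audible only at step 2.
att_logits = [0.0, 0.0, 4.0, 0.0, 0.0]
cls_logits = [-3.0, -3.0, 5.0, -3.0, -3.0]

clip_prob, weights = attention_pool(att_logits, cls_logits)
best_step = max(range(len(weights)), key=lambda t: weights[t])
print(best_step, round(clip_prob, 3))
```

In this toy case plain averaging of the frame probabilities would give roughly 0.24, so a short call would be drowned out by the silent frames. The attention weights keep the clip-level score high and also indicate where in the clip the bird was heard.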

Training the model

Building a deep learning architecture to represent the data is not the only crucial part; so is how the model is trained (the training recipe). To classify audio that contains multiple bird calls in a noisy environment, we mix multiple bird calls into one clip and add noise such as white noise. To cover the wide variation in bird calls, we also augment time and pitch and mask some audio frames using SpecAugment. Here is a short example of the augmentations we applied.

import IPython.display as ipd

Fig 6. Augmented sample. The mixed version of Alder Flycatcher and American Avocet.
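
The mixing and masking steps described above can be sketched in a few lines of plain Python. This is an illustrative approximation, not our training code. The helper names, the 0.5 mixing gain, the noise level, and the label codes are assumptions, and only the time-masking part of SpecAugment is shown.

```python
import random


def mix_clips(clip_a, labels_a, clip_b, labels_b, noise_std=0.01, rng=None):
    """Overlay two equal-length clips, add white noise, and union their labels."""
    if rng is None:
        rng = random.Random()
    if len(clip_a) != len(clip_b):
        raise ValueError("clips must have the same length")
    mixed = [
        0.5 * (a + b) + rng.gauss(0.0, noise_std)
        for a, b in zip(clip_a, clip_b)
    ]
    return mixed, sorted(set(labels_a) | set(labels_b))


def mask_time_frames(spectrogram, max_width, rng=None):
    """SpecAugment-style time masking: zero one random run of consecutive frames."""
    if rng is None:
        rng = random.Random()
    width = rng.randint(0, min(max_width, len(spectrogram)))
    start = rng.randint(0, len(spectrogram) - width)
    return [
        [0.0] * len(frame) if start <= t < start + width else list(frame)
        for t, frame in enumerate(spectrogram)
    ]


rng = random.Random(0)  # fixed seed so the toy run is repeatable
mixed, labels = mix_clips(
    [0.2, -0.1, 0.4], ["aldfly"],  # hypothetical label codes
    [0.0, 0.3, -0.4], ["ameavo"],
    rng=rng,
)
masked = mask_time_frames([[1.0, 1.0]] * 6, max_width=2, rng=rng)
print(labels, len(mixed), len(masked))
```

Because the mixed clip carries the union of both label sets, the model is trained directly on the multi-label, noisy situation it will face in field recordings.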

As a result, we achieved a strong score on the Kaggle challenge.


Have you ever woken up certain that some sound wasn't right, without knowing what it was? With good algorithms, machines will be able to identify what it was and help you sleep better. Stay tuned!