Our Classifier
Below is a general block diagram of our classifier. We begin by pre-processing the signals to standardize each audio input, followed by feature gathering. Once we have collected our features, we set a few samples aside and perform cross-validation to generate different models from our features. Once the best model is selected, we test it on the held-out samples to generate our final output.
Pre-Processing
To begin, we pre-process the data, which helps increase our model's accuracy. By standardizing all of our samples, we can treat every sample the same way. Some of the modifications we make are:
- Trim samples to actual speech
- Convert stereo sound to mono
- Eliminate background noise
- Normalize signal amplitude
Below is an example of one of our samples before and after this step. As the before plot (top) shows, it is a stereo sample without a normalized amplitude. After pre-processing, the signal is a mono sample with a maximum amplitude of 1.
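For illustration, here is a minimal MATLAB sketch of this pre-processing step. The file name, amplitude threshold, and trimming approach are assumptions, not our exact implementation:

```matlab
% Pre-processing sketch: mono conversion, normalization, and trimming.
[x, fs] = audioread('sample.wav');   % hypothetical input file

% Convert stereo sound to mono by averaging the channels.
if size(x, 2) > 1
    x = mean(x, 2);
end

% Normalize the signal to a maximum amplitude of 1.
x = x / max(abs(x));

% Trim to actual speech: keep everything between the first and last
% samples that exceed a small amplitude threshold (assumed value).
thresh = 0.02;
active = find(abs(x) > thresh);
x = x(active(1):active(end));
```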
Feature Gathering
In order to differentiate and model our signals, we gather many specific features from them. The features collected are:
- Average value
- Standard deviation
- Minimum and maximum values
- Range
- Derivatives (up to the nth order)
These are collected in the time domain as well as from the Fourier coefficients, the Mel-frequency cepstral coefficients (MFCCs), and the pitch contour of the signal.
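As a rough sketch of what this looks like in MATLAB, the snippet below computes these statistics for a pre-processed signal x. The mfcc and pitch calls require the Audio Toolbox, and the exact feature set shown is an assumption:

```matlab
% Feature-gathering sketch: summary statistics over several representations.
% For matrix inputs (e.g., MFCCs), statistics are taken per column.
stats = @(v) [mean(v), std(v), min(v), max(v), max(v) - min(v)];

features = stats(x);                        % time-domain statistics
features = [features, stats(diff(x))];      % first derivative; higher orders
                                            % follow the same pattern
features = [features, stats(abs(fft(x)))];  % Fourier coefficient magnitudes
features = [features, stats(mfcc(x, fs))];  % MFCCs (Audio Toolbox)
features = [features, stats(pitch(x, fs))]; % pitch contour (Audio Toolbox)
```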
Cross-Validation
Our best model was generated through leave-one-out cross-validation. After setting aside a small portion of the data for final testing, we divided the remaining data into subsections. Each subsection takes a turn as the test set while the others train the model, so each fold of training and testing produces a different model. From these many models, the most accurate is ultimately selected.
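A sketch of this procedure in MATLAB might look like the following. The feature matrix F, the numeric label vector y, the number of folds, and the choice of a k-NN classifier (fitcknn) are all assumptions, since the model type is not named above; leave-one-out corresponds to setting k to the number of training samples.

```matlab
% Cross-validation sketch: hold out a test set, then run k-fold CV.
c = cvpartition(y, 'HoldOut', 0.2);        % set aside data for final testing
Ftr = F(training(c), :);  ytr = y(training(c));

k = 5;                                      % assumed number of folds
folds = cvpartition(ytr, 'KFold', k);
acc = zeros(k, 1);
for i = 1:k
    % Train on k-1 folds, test on the remaining fold.
    mdl  = fitcknn(Ftr(training(folds, i), :), ytr(training(folds, i)));
    pred = predict(mdl, Ftr(test(folds, i), :));
    acc(i) = mean(pred == ytr(test(folds, i)));
end
[~, best] = max(acc);                       % index of the most accurate fold
```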
Feature Selection
As a result of feature gathering, we had over 1000 features per signal. However, many of these are actually detrimental to the model's accuracy. To determine which features are most beneficial, we used the MATLAB function relieff, which rates each feature according to how much it helps or hurts the model. See below for a graph of every feature we gathered, with its effectiveness on the y-axis.
When choosing our final features to test, we kept the top 10. Below are the most beneficial features that we used when performing cross-validation.
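In MATLAB, this ranking step is a short call to relieff; the choice of 10 nearest neighbors is an assumed parameter:

```matlab
% Rank features with reliefF and keep the ten most beneficial ones.
[ranked, weights] = relieff(F, y, 10);  % positive weights help, negative hurt
top10 = ranked(1:10);
Fsel  = F(:, top10);                    % reduced feature matrix
```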
Results
Typically, our classifier's accuracy fell between 50% and 80%, with our highest achieved accuracy peaking at 90%. Below is a sample confusion matrix that shows how we evaluate the classifier's accuracy, with Chinese being Predicted Class 1 and Actual Class 0, and English being Predicted Class 2 and Actual Class 1. In this example, we achieved an accuracy of 60%: four samples were correctly predicted to be Chinese and two were correctly predicted to be English.
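To make the arithmetic concrete, here is a hypothetical set of labels that reproduces the 60% example (four correct Chinese predictions and two correct English predictions out of ten samples):

```matlab
% Hypothetical labels (1 = Chinese, 2 = English) matching the example above.
actual    = [1 1 1 1 1 2 2 2 2 2];
predicted = [1 1 1 1 2 1 1 1 2 2];
C = confusionmat(actual, predicted)       % rows: actual, columns: predicted
accuracy = sum(diag(C)) / numel(actual)   % = 0.60
```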