
Results

In an effort to learn more about the variables that impact our accuracy, we created a program that runs our models with varying parameters. With each run we changed the number of folds, the classification model, and other parameters. This approach allowed us to visualize how each of these variables affected our overall performance. For example, we found the best results with approximately 6 folds of cross-validation: additional folds made each split's sample size too small, while fewer folds prevented us from training an adequate number of models to choose from.
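As a rough sketch of this kind of parameter sweep (the classifiers, fold range, and randomly generated data here are placeholder assumptions, not our actual pipeline):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Hypothetical feature matrix (43 samples x 20 features) and labels
# (0 = English, 1 = Chinese); in practice these come from feature extraction.
rng = np.random.default_rng(0)
X = rng.normal(size=(43, 20))
y = np.array([0] * 21 + [1] * 22)

# Sweep over classifiers and fold counts, recording mean accuracy for each.
for model in (KNeighborsClassifier(n_neighbors=3), SVC(kernel="linear")):
    for k in range(2, 11):
        scores = cross_val_score(model, X, y, cv=k)
        print(f"{type(model).__name__}, {k} folds: "
              f"mean accuracy = {scores.mean():.2f}")
```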

Our model consistently preferred to guess that samples were Chinese. We routinely achieved an accuracy of 70-80%, but with a test set of only 10 audio samples the accuracy can swing considerably. To improve our accuracy, our next step would be to introduce additional samples. Currently we use 22 Chinese samples and 21 English samples, and we would like to grow this into a much larger data set. We would also like to look into other forms of feature selection, such as SequentialFS.

Learn more about the results of our project below!

What Worked Well

One of the major problems we faced was extracting time-dependent features from samples of different lengths. In order to compare these features we had to ensure that they were the same length. We decided to focus on the mean, maximum, minimum, median, and standard deviation of these time signals so that we produced a single value for the entire length of each signal.
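A minimal sketch of this summarization step, assuming librosa for MFCC extraction (our actual implementation may differ):

```python
import numpy as np
import librosa

def summarize_mfcc(path, n_mfcc=13):
    """Collapse a variable-length MFCC matrix into fixed-size statistics."""
    y, sr = librosa.load(path)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # (n_mfcc, frames)
    # One mean/max/min/median/std per MFCC bin, independent of clip length.
    stats = [np.mean(mfcc, axis=1), np.max(mfcc, axis=1),
             np.min(mfcc, axis=1), np.median(mfcc, axis=1),
             np.std(mfcc, axis=1)]
    return np.concatenate(stats)  # length = 5 * n_mfcc
```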

Some of these features, such as the maximum value of the 5th bin of the MFCC, were quite useful and were quickly chosen by our feature selection algorithm (ReliefF). We also differentiated the MFCCs with respect to time and computed the same statistics for this derivative signal. These features proved to be less useful, but were still better than others we had previously investigated.
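The sketch below shows the MFCC time derivative and the ReliefF ranking step, assuming librosa's delta function and the scikit-rebate package as stand-ins for whatever our pipeline actually used; the tone and the random feature matrix are placeholders:

```python
import numpy as np
import librosa
from skrebate import ReliefF  # scikit-rebate; an assumed stand-in

# Placeholder audio: a 1-second synthetic tone instead of a real speech clip.
sr = 22050
t = np.linspace(0, 1.0, sr, endpoint=False)
y = np.sin(2 * np.pi * 220 * t)

mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
mfcc_delta = librosa.feature.delta(mfcc)  # time derivative of each MFCC bin
feats = np.concatenate([mfcc.max(axis=1), mfcc_delta.max(axis=1)])

# Ranking a full feature matrix (one row per clip) with ReliefF:
rng = np.random.default_rng(0)
X = rng.normal(size=(43, feats.size))    # hypothetical feature matrix
labels = np.array([0] * 21 + [1] * 22)   # 0 = English, 1 = Chinese
fs = ReliefF(n_neighbors=10).fit(X, labels)
ranked = np.argsort(fs.feature_importances_)[::-1]  # best features first
```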

One of the last features we added was the maximum of the derivative of the pitch for each sample. We found that the derivative of the pitch was much higher in our Chinese samples than in our English samples. Unfortunately, we could not keep these features as a function of time and had to take the mean, maximum, and range of these values to tie them to an entire sample.
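Something like the following would compute those summaries, using librosa's yin pitch tracker as an assumed stand-in for the pitch estimator our pipeline actually used:

```python
import numpy as np
import librosa

def pitch_derivative_features(y, sr):
    """Summarize how quickly pitch changes across a clip."""
    f0 = librosa.yin(y, fmin=65, fmax=400, sr=sr)  # typical speech range
    df0 = np.abs(np.diff(f0))                      # frame-to-frame pitch change
    return np.mean(df0), np.max(df0), np.ptp(df0)  # mean, max, range
```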


What Didn't Work

Unfortunately, having samples of varying lengths meant that we could not extract features with respect to time unless we broke each of our samples into a multitude of equal-length windows and then analyzed each window individually. We looked into multiple ways to implement features with respect to time, but ultimately decided that none of them would be worth pursuing. In breaking our samples into individual windows, we risked cutting off phonetic sounds mid-window, and some windows would also contain significant amounts of silence compared to others.
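For reference, the windowing approach we decided against would look roughly like this (the window length is an arbitrary illustration):

```python
import numpy as np

def split_into_windows(y, sr, window_s=0.5):
    """Break a clip into fixed-length, non-overlapping windows.

    We decided against this: windows can cut phonetic sounds in half,
    and some windows end up containing mostly silence.
    """
    n = int(window_s * sr)
    usable = (len(y) // n) * n          # drop the ragged tail
    return y[:usable].reshape(-1, n)    # (num_windows, samples_per_window)
```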

Additionally, we attempted to look at the zero-crossing rate of our samples. This consisted of finding how often the signal changed from positive to negative, or vice versa, per second. Unfortunately, we did not find a significant difference in this feature between Chinese and English.
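The computation itself is simple; a sketch of it in terms of sign changes per second:

```python
import numpy as np

def zero_crossing_rate(y, sr):
    """Zero crossings per second: count sign changes, divide by duration."""
    crossings = np.sum(np.abs(np.diff(np.sign(y))) > 0)
    return crossings / (len(y) / sr)
```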

Our Fourier transform proved to be somewhat useful. We took the longest Fourier transform possible, based on the longest audio sample, and then used each index of the Fourier transform as an individual feature. In hindsight this likely led to several issues, as the frequency resolution was too fine. For example, our English samples could contain several peaks around 500Hz that do not exist in our Chinese samples, but if those peaks range from 490Hz to 510Hz they will not be identified as significant, because we have individual features corresponding to 490Hz, 492Hz, 494Hz, and so on. It would have been better to create overlapping windows of frequencies, e.g. 490Hz to 510Hz as a single feature, and 492Hz to 512Hz as another feature.
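A sketch of that proposed fix, with the window width, step, and frequency cap as illustrative assumptions:

```python
import numpy as np

def banded_fft_features(y, sr, width_hz=20, step_hz=2, fmax=4000):
    """Sum FFT magnitude over overlapping frequency windows.

    Instead of one feature per FFT bin (490Hz, 492Hz, ...), each feature
    covers a window such as 490-510Hz, so nearby peaks in different
    samples still map onto the same feature.
    """
    mag = np.abs(np.fft.rfft(y))
    freqs = np.fft.rfftfreq(len(y), d=1 / sr)
    feats = []
    for lo in np.arange(0, fmax - width_hz, step_hz):
        band = (freqs >= lo) & (freqs < lo + width_hz)
        feats.append(mag[band].sum())
    return np.array(feats)
```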


Future Features

Unfortunately, due to the large scale of our project, we were unable to integrate two planned functions: one that would have performed a cross-correlation of the retroflex R phoneme with our speech samples, and one that would have performed envelope detection and peak analysis in the hope of observing a significant difference in the rhythms of the two languages. Our feature selection process was already extensive, enumerating a combinatorially large number of feature subsets to achieve the best accuracy from the features provided. In future iterations, we would like to integrate these functions into our algorithm to provide a more comprehensive analysis of the features that distinguish the two languages and to obtain an even higher accuracy in language detection.
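A rough sketch of what those two functions might look like; the phoneme template is an assumed input our pipeline does not currently produce, and the normalization and peak spacing are illustrative choices:

```python
import numpy as np
from scipy.signal import correlate, hilbert, find_peaks

def retroflex_r_score(y, template):
    """Peak cross-correlation against a retroflex R phoneme template,
    roughly normalized by the signal energies."""
    corr = correlate(y, template, mode="valid")
    return np.max(np.abs(corr)) / (np.linalg.norm(y) * np.linalg.norm(template))

def rhythm_peak_rate(y, sr):
    """Envelope detection followed by peak counting as a rhythm proxy."""
    envelope = np.abs(hilbert(y))                            # amplitude envelope
    peaks, _ = find_peaks(envelope, distance=int(0.1 * sr))  # >=100 ms apart
    return len(peaks) / (len(y) / sr)  # syllable-like peaks per second
```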

SequentialFS would allow us to look at how features interact with each other by iteratively adding and removing features to see whether they help or hurt overall. This adds a layer on top of our ReliefF program, which evaluates each feature in isolation while ignoring the others. Unfortunately, when we tried to run SequentialFS we received an error that the covariance matrix could not converge. To fix this we would have to eliminate all features with a high degree of correlation; however, it is difficult to choose which features to remove, as we would be removing them blindly, before feature selection has told us how useful they are.
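The sketch below shows that two-step workaround, using sklearn's SequentialFeatureSelector as an assumed analogue of SequentialFS; the data, classifier, and thresholds are placeholders:

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

def drop_correlated(X, threshold=0.95):
    """Remove one feature from each highly correlated pair (blindly,
    which is exactly the drawback described above)."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    upper = np.triu(corr, k=1)
    keep = [j for j in range(X.shape[1]) if not np.any(upper[:, j] > threshold)]
    return X[:, keep], keep

# Hypothetical data: 43 clips x 50 features, labels 0 = English, 1 = Chinese.
rng = np.random.default_rng(0)
X = rng.normal(size=(43, 50))
y = np.array([0] * 21 + [1] * 22)

X_pruned, kept = drop_correlated(X)
sfs = SequentialFeatureSelector(KNeighborsClassifier(n_neighbors=3),
                                n_features_to_select=5, cv=4)
sfs.fit(X_pruned, y)
print("selected feature indices:", np.array(kept)[sfs.get_support()])
```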


Applications

Language identification has many important applications. As in this project, it can be used to detect different languages, or it can be tuned further to detect even regional dialects. As you can imagine, it could improve voice applications like Siri, Google Assistant, and Amazon Alexa, making them more automatic and more personal. Given enough time, we would want to develop our algorithm to detect multiple languages and to identify specific phonemes unique to each language to further increase accuracy.
