Investigation of robust speech feature extraction techniques for accents classification of Malaysian English speakers
Abstract
Automatic speech recognition (ASR) is not a new topic in speech processing and
human-machine interaction; it has been established for more than five decades.
However, accent, which manifests as differences in pronunciation and intonation among
people from different sociolinguistic backgrounds, remains a major challenge for current
ASR and is closely related to multilingualism. A large and growing body of
literature has shown that accent variation impairs ASR performance. Although English
accents have been the most studied accent varieties, because English is regarded as the
most important and prestigious international language,
Malaysian English (MalE), a new variety within the New Englishes of non-native
speakers, is still unexplored. In current commercial ASR products, the conventional
approach is to treat MalE as a uniform variety, although this notion is disputed by many
scholars and researchers who regard MalE as a reflection of localized ethnic speech
diversity. Past perceptual studies have reported a high likelihood of detecting ethnic
identities from Singapore English (SgE) and Brunei English (BruE) speech, which are
appropriate comparator varieties to MalE, in listening tests. To date, no research has
identified ethnic origin from MalE-accented speech samples using multiple speech
analysis techniques and machine learning algorithms for automatic classification, which
would provide more reliable, standardized and accurate experimental methods. This study
attempts to fill that gap, and for this purpose a new database of MalE accents has been
developed. The study elicits isolated-word and continuous speech from university
students of both genders from the three main ethnic groups, representing educated
Malay, Chinese and Indian speakers, using selected accent-sensitive words from previous
studies. The design of the proposed
system consists of pre-processing, feature extraction and classification stages. Apart
from basic pre-processing, this study proposes integrating a fuzzy inference system for
frame-based voiced-unvoiced segmentation (FIS V-UV), which by itself improves the
overall implementation over a conventional automatic accent classification
(AAC) system. A new method, named global statistical thresholds (GSTs), is proposed
for establishing the membership functions of the short-time energy and zero-crossing
rate inputs in the FIS V-UV segmentation. The proposed segmentation reduces the
portion of speech activity passed on to the feature extraction stage. The
experimental results demonstrate the efficacy of the proposed FIS V-UV-assisted AAC
using GSTs, with accuracy gains of up to 7.70% and a frame reduction rate of 24.26%
over the conventional AAC. In the second stage, acoustic features correlated
to the accents of these three ethnic groups are developed through filter bank
analysis, vocal tract modelling, hybrid analysis and fusion analysis. Out of eight formulated
feature vectors tested on the MalE database, statistical descriptors of Mel-band spectral
energy (MBSE), principal component analysis-transformed MBSE (PCA-MBSE), two
hybrid techniques of discrete wavelet transform-derived linear prediction coefficients
(DWT-LPC) and two spectral feature fusions (SFFs) of popular Mel-frequency cepstral
coefficients and linear prediction coefficients with five formants (MFCC-formants and
LPC-formants) are new approaches in this field. The experimental results from the final
stage suggest that the SFF techniques are the best approach for classifying the three
MalE accents in this database, with a best accuracy rate of 97.4%. This approach
outperforms the standard MFCC features by as much as 7.8%. In the robustness
analysis, the SFFs, followed by PCA-MBSE, show greater noise robustness than
the other features. This thesis also contributes a new feature selection technique called the
statistical band selection (SBS) algorithm, which uses a simple decision rule to select
bands based on the smallest within-class variances of the scores. The experimental
results reveal that SBS improves AAC performance, with accuracy rates improved by 3.9%
to 5.6%, memory requirements reduced by 22% to 55%, and processing about 70% faster on
average for the three-class accent problem. Comparing accent severity between different
genders, this study suggests that male speakers have a higher degree of accentedness,
given their consistently better classification rates regardless of the acoustic
feature technique used. It can also be concluded that continuous speech carries a higher
intensity of accent markers than isolated-word speech.
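
As an illustration of the two frame-level inputs named in this abstract, the following minimal Python sketch computes short-time energy and zero-crossing rate per frame and applies a crisp voiced-unvoiced rule. The frame length, hop size, function names and the use of per-utterance means as thresholds are assumptions made for illustration only; the sketch does not reproduce the fuzzy inference system or the GST method proposed in the thesis.

```python
import math


def frame_signal(samples, frame_len, hop):
    """Split a list of samples into overlapping frames (the ragged tail is dropped)."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]


def short_time_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(x * x for x in frame) / len(frame)


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def voiced_mask(samples, frame_len=200, hop=100):
    """Return one boolean per frame: True where the frame looks voiced."""
    frames = frame_signal(samples, frame_len, hop)
    ste = [short_time_energy(f) for f in frames]
    zcr = [zero_crossing_rate(f) for f in frames]
    # Assumption: the thresholds are simply the utterance-level means.
    # They stand in for the thesis's GSTs, whose definition is not given here.
    ste_thr = sum(ste) / len(ste)
    zcr_thr = sum(zcr) / len(zcr)
    # Crisp stand-in for the fuzzy rules: voiced = high energy and low ZCR.
    return [e > ste_thr and z < zcr_thr for e, z in zip(ste, zcr)]


# Synthetic example: a 100 Hz tone (voiced-like) followed by a weak,
# rapidly alternating signal (unvoiced-like), sampled at 8 kHz.
sample_rate = 8000
tone = [math.sin(2 * math.pi * 100 * n / sample_rate) for n in range(1000)]
hiss = [0.05 * (-1) ** n for n in range(1000)]
mask = voiced_mask(tone + hiss)
```

Only the frames flagged as voiced in `mask` would be passed on to feature extraction, which is the sense in which such a segmentation reduces the number of frames processed.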