Phoneme-based speech-to-text translation system for Malaysian English pronunciation
Abstract
Speech is the most common vocalized form of human communication.
Communication through speech conveys linguistic information and also
expresses information about a person’s social and regional origin, health and
emotional state. Recent improvements in phoneme-based speech-to-text translation
systems have made this one of the most exciting areas of speech signal processing:
because of major advances in the statistical modeling of speech, automatic speech
recognition systems have found widespread application in tasks that require a human-machine
interface. Speech-to-text translation systems can be used in many applications such as medical transcription (digital speech to text), automated transcription, telematics and air traffic control. In this research work, two isolated-word speech signal databases have been built, namely the Vowels Class Word Database (VCWD) and the Phonemes Class Word Database (PCWD). The VCWD was initially built to classify the isolated words based on the eleven classes of
vowels. The database has been analyzed using four different spectral analysis
techniques, namely Mel-Frequency Cepstral Coefficients (MFCC), Linear Predictive
Coefficients (LPC), Perceptual Linear Predictive Analysis (PLP) and Relative Spectral
Perceptual Linear Predictive Analysis (RASTA-PLP), to determine the best
discriminative features and to identify the network parameters. The PCWD has been
built to develop the phoneme-based speech-to-text translation system using Linear
Predictive Coefficients (LPC) and Multilayer Neural Network (MLNN) models with a
fusion concept for the classification of isolated words and phonemes. The isolated word
speech signals are recorded using a speech acquisition algorithm developed with a
MATLAB graphical user interface (GUI). The speech signals are recorded for 15
seconds at a 16 kHz sampling frequency. The recorded speech signals are pre-processed
to segment the voiced and unvoiced parts of the speech signal. A simple fuzzy
voice classifier has been proposed to extract the voiced portion using frame energy and
change in energy features. The extracted voiced portions are pre-processed and divided
into a number of frames. For each frame signal, the spectral features are extracted and
used as a feature set for classification. The classification tasks of the isolated words
and phonemes are associated with the extracted features to establish an input-output
mapping. The data are then normalized and randomized to rescale the values into a
definite range. The Multilayer Neural Network (MLNN) model has been developed
with four combinations of input and hidden activation functions. To improve the
performance rate and reduce the training time, a simple systole activation function has
been proposed. The neural network models are trained with 60%, 70% and 80% of the
total data samples. The trained neural network is validated with the remaining 40%,
30% and 20% of the data samples by simulating the network. The performance of the
network is evaluated by measuring the true positives, false negatives and classification
accuracy, and the results are compared. It is observed that the proposed fuzzy voice
classifier has lower complexity and yields better accuracy than the other
voiced/unvoiced classification methods available in the literature. The LPC
features show better discrimination, and the MLNN models trained using the LPC
spectral band features give better classification accuracy than those trained with the
other feature extraction algorithms. Also, the proposed systole activation function
reduces the training time and epoch count when compared with the other network
models.