Persian Speech Recognition using Deep Learning

Automatic speech recognition (ASR) is the translation of spoken words into text. Speech recognition has many applications, such as virtual speech assistants (e.g., Apple’s Siri, Google Now, and Microsoft’s Cortana), speech-to-speech translation, and voice dictation. As shown in Fig. 1, an ASR system has five main components: signal processing and feature extraction, the acoustic model (AM), the language model (LM), the lexicon, and hypothesis search. The signal processing and feature extraction component takes the audio signal as input, enhances the speech by removing noise, and extracts feature vectors. The acoustic model integrates knowledge about acoustics and phonetics, takes the features as its input, and recognizes phonemes. The language model contains information about the structure of the language. The lexicon includes all words that the audio signal can be mapped to. The hypothesis search component combines the AM and LM scores and outputs the word sequence with the highest score as the recognition result. There are several methods for creating the acoustic model, such as hidden Markov models (HMMs) [5] and artificial neural networks (ANNs) [1]. Because audio signals are sequential data, the current input depends on previous inputs. Recurrent neural networks (RNNs) [1] benefit from their ability to learn from sequential data, but for standard RNN architectures the range of context that can be accessed in practice is quite limited. This problem is often referred to in the literature as the vanishing gradient problem. In 1997, with the introduction of the long short-term memory (LSTM) [1-3] neural network, this limitation in processing sequential data was resolved. An LSTM network is the same as a standard RNN, except that the units in the hidden layer are replaced by memory...
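As a rough illustration of the kind of acoustic model described above, the sketch below defines a small bidirectional LSTM in PyTorch that maps a sequence of per-frame acoustic features to phoneme posteriors. The feature dimension, layer sizes, and number of phoneme classes are illustrative assumptions, not the configuration of our system.

```python
# A minimal sketch (not our actual system) of an LSTM-based acoustic model:
# it maps a sequence of acoustic feature vectors (e.g., 39-dim MFCCs) to
# per-frame phoneme posteriors. All sizes below are illustrative assumptions.
import torch
import torch.nn as nn

class LSTMAcousticModel(nn.Module):
    def __init__(self, feat_dim=39, hidden_dim=256, num_layers=2, num_phonemes=40):
        super().__init__()
        # Bidirectional LSTM over the frame sequence; memory cells let the
        # model use long-range context despite the vanishing-gradient issue
        # of plain RNNs.
        self.lstm = nn.LSTM(feat_dim, hidden_dim, num_layers=num_layers,
                            batch_first=True, bidirectional=True)
        # Per-frame classifier over phoneme classes.
        self.classifier = nn.Linear(2 * hidden_dim, num_phonemes)

    def forward(self, features):
        # features: (batch, num_frames, feat_dim)
        outputs, _ = self.lstm(features)
        logits = self.classifier(outputs)          # (batch, num_frames, num_phonemes)
        return torch.log_softmax(logits, dim=-1)   # log phoneme posteriors per frame

# Example: 3 utterances of 200 frames each, 39-dim features.
model = LSTMAcousticModel()
log_probs = model(torch.randn(3, 200, 39))
print(log_probs.shape)  # torch.Size([3, 200, 40])
```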
Speaker Recognition using Deep Belief Networks

Human society functions by communication between individuals. Language, in both its written and spoken forms, underpins all aspects of human interaction. Spoken language is the most fundamental, as it is how individuals communicate with one another using only the human vocal apparatus. Since spoken language is one of the easiest measures to acquire (all you need is a microphone), is used in a variety of transaction applications (e.g., telephone banking), and has the potential for security by surveillance, it comes as no surprise that speaker recognition is one of the key research areas in signal processing and pattern recognition. Deep Belief Networks (DBNs) have become a popular research area in machine learning and pattern recognition. In recent years, deep learning techniques have been successfully applied to the modeling of speech signals, in tasks such as speech recognition, acoustic modeling, speaker and language recognition, spectrogram coding, voice activity detection, acoustic-articulatory inversion mapping, 3D object recognition, intelligent video surveillance, and image recognition. DBNs use a powerful strategy of unsupervised training and multiple layers that provides parameter-efficient and accurate acoustic modeling. DBNs have been successfully used in speech recognition for modeling the posterior probability of a state given a feature vector. Feature vectors are typically standard frame-based acoustic representations (e.g., MFCCs) that are usually stacked across multiple frames. The DBN performs a nonlinear transformation of the input features and produces the probability that an output unit is active, given a wide context of input frames. The basic process for pre-training a DBN is based on stacking RBMs. RBMs are an undirected graphical model with visible and hidden units with only...
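The sketch below illustrates the RBM-stacking idea in plain NumPy: one Bernoulli RBM layer is trained with one-step contrastive divergence (CD-1), and its hidden activations then serve as the input to the next layer. The dimensions, learning rate, and number of training sweeps are illustrative assumptions, not the parameters used in our models.

```python
# A minimal sketch of one RBM layer trained with one-step contrastive
# divergence (CD-1), the building block that is stacked to pre-train a DBN.
# Bernoulli units and NumPy only; all sizes below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class RBM:
    def __init__(self, n_visible, n_hidden, lr=0.05):
        self.W = 0.01 * rng.standard_normal((n_visible, n_hidden))
        self.b_v = np.zeros(n_visible)   # visible biases
        self.b_h = np.zeros(n_hidden)    # hidden biases
        self.lr = lr

    def hidden_probs(self, v):
        return sigmoid(v @ self.W + self.b_h)

    def visible_probs(self, h):
        return sigmoid(h @ self.W.T + self.b_v)

    def cd1_update(self, v0):
        # Positive phase: hidden activations given the data.
        h0 = self.hidden_probs(v0)
        # Negative phase: one Gibbs step (sample hidden, reconstruct visible).
        h_sample = (rng.random(h0.shape) < h0).astype(float)
        v1 = self.visible_probs(h_sample)
        h1 = self.hidden_probs(v1)
        # Gradient approximation: <v h>_data - <v h>_model.
        batch = v0.shape[0]
        self.W   += self.lr * (v0.T @ h0 - v1.T @ h1) / batch
        self.b_v += self.lr * (v0 - v1).mean(axis=0)
        self.b_h += self.lr * (h0 - h1).mean(axis=0)

# Stacking: train one RBM, then use its hidden probabilities as the "data"
# for the next RBM; the stacked weights initialize the DBN before fine-tuning.
data = rng.random((64, 39))            # e.g., 64 frames of 39-dim features in [0, 1]
rbm1 = RBM(n_visible=39, n_hidden=128)
for _ in range(10):
    rbm1.cd1_update(data)
next_layer_input = rbm1.hidden_probs(data)
```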
Speech Enhancement

Another field of our research in DSPLab is speech quality enhancement. In addition to classical, well-known methods (such as spectral subtraction and the Wiener filter), we are now working on statistical approaches to speech enhancement in both trained and non-trained manners. ...
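For a concrete picture of the classical baseline, the sketch below implements basic magnitude spectral subtraction in NumPy: the noise spectrum is estimated from a noise-only segment, subtracted from each frame of the noisy signal, and the enhanced signal is resynthesized with the noisy phase. The frame length, hop size, and spectral floor are illustrative assumptions.

```python
# A minimal sketch of magnitude spectral subtraction (not our statistical
# approach): estimate the average noise magnitude spectrum, subtract it from
# each frame of the noisy signal, and overlap-add with the noisy phase.
import numpy as np

def spectral_subtraction(noisy, noise, frame_len=512, hop=256, floor=0.02):
    window = np.hanning(frame_len)
    # Average noise magnitude spectrum from a noise-only signal.
    noise_frames = [noise[i:i + frame_len] * window
                    for i in range(0, len(noise) - frame_len, hop)]
    noise_mag = np.mean([np.abs(np.fft.rfft(f)) for f in noise_frames], axis=0)

    enhanced = np.zeros(len(noisy))
    for i in range(0, len(noisy) - frame_len, hop):
        frame = noisy[i:i + frame_len] * window
        spec = np.fft.rfft(frame)
        mag, phase = np.abs(spec), np.angle(spec)
        # Subtract the noise estimate; floor negative values to limit musical noise.
        clean_mag = np.maximum(mag - noise_mag, floor * mag)
        # Resynthesize the frame with the noisy phase and overlap-add.
        enhanced[i:i + frame_len] += np.fft.irfft(clean_mag * np.exp(1j * phase))
    return enhanced
```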
Speaker Recognition

As a biometric, voice can be used to identify or verify a speaker. Speaker recognition denotes techniques for the identification/verification of the person who is speaking based on the characteristics of his/her voice. ...
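As a toy illustration of closed-set speaker identification (a classical GMM-over-MFCCs approach, not necessarily the method used in our work), the sketch below fits one Gaussian mixture per enrolled speaker and scores a test utterance against each model. The file names, sample rate, and model sizes are hypothetical.

```python
# A toy sketch of closed-set speaker identification with per-speaker GMMs
# over MFCC features; file names, sample rate, and model sizes are assumptions.
import librosa
import numpy as np
from sklearn.mixture import GaussianMixture

def mfcc_frames(path, sr=16000, n_mfcc=13):
    y, sr = librosa.load(path, sr=sr)
    # (frames, n_mfcc) matrix of per-frame features.
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T

# Enrollment: fit one GMM per known speaker on his/her training audio.
models = {
    spk: GaussianMixture(n_components=8, covariance_type="diag").fit(
        mfcc_frames(f"enroll_{spk}.wav"))            # hypothetical file names
    for spk in ["alice", "bob"]
}

# Identification: pick the speaker whose model assigns the test utterance
# the highest average log-likelihood.
test = mfcc_frames("test_utterance.wav")             # hypothetical file name
scores = {spk: gmm.score(test) for spk, gmm in models.items()}
print(max(scores, key=scores.get))
```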
Speech Recognition

Automatic Speech Recognition (ASR) denotes techniques that convert spoken speech into text. Our ASR work in DSPLab concentrates on Persian speech recognition and related challenges such as (1) robustness, (2) language modeling, and (3) spontaneous speech recognition. ...
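To give a minimal sense of the language-modeling component listed above, the sketch below builds a bigram language model with add-one smoothing over a tiny, purely illustrative corpus; in a full system such a score would be combined with the acoustic model score during hypothesis search.

```python
# A minimal sketch of a bigram language model with add-one (Laplace) smoothing;
# the tiny corpus is purely illustrative.
from collections import Counter
import math

corpus = [["in", "the", "room"], ["in", "the", "house"], ["near", "the", "room"]]

unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter((sent[i], sent[i + 1]) for sent in corpus for i in range(len(sent) - 1))
vocab_size = len(unigrams)

def bigram_logprob(w1, w2):
    # Smoothed conditional log-probability log P(w2 | w1).
    return math.log((bigrams[(w1, w2)] + 1) / (unigrams[w1] + vocab_size))

def sentence_score(words):
    # Sum of bigram log-probabilities over the word sequence.
    return sum(bigram_logprob(words[i], words[i + 1]) for i in range(len(words) - 1))

# A fluent word order scores higher than a scrambled one.
print(sentence_score(["in", "the", "room"]) > sentence_score(["room", "the", "in"]))  # True
```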