Speech Recognition Using MATLAB and Cross-Correlation Technique
Ledisi G. Kabari, Marcus B. Chigoziri

Abstract: Speech is a prominent means of communication among humans, whereas communication between humans and computers has been based on text user interfaces and graphical user interfaces. Speech recognition is used in many security projects, where the user speaks a password to the computer, and it is also used for automation. This paper demonstrates a model in which humans and computers interact via a voice user interface. In developing the model, cross-correlation was implemented in MATLAB to compare two or more signals and identify the closest match among them. We used cross-correlation to measure the similarity between our recorded signal files and the test signal. We were thus able to develop a model in which machines can differentiate between commands and act upon them.


I. INTRODUCTION
Speech is the most prominent means of communication among humans. Human-to-human interaction is based on speech, emotion and gestures, which makes it much easier for people to understand one another. Communication between humans and computers, on the other hand, is based on either a Text User Interface (TUI) or a Graphic User Interface (GUI). Recognizing a person's voice is far easier for humans than it is for computers. Hence, speech recognition in machine learning is a game changer, as developing machines that can understand and uniquely identify a person's voice would make human-computer interaction more intriguing.
Speech recognition is one of the next-generation technologies for human-computer interaction. It has been researched since the late 1950s, but its progress has been impeded by its computational complexity and by the limited computing capabilities of past decades. In laboratory settings, automatic speech recognition (ASR) systems have achieved high recognition accuracy, which tends to degrade in real-world environments [1].
Today, speech technologies play an important role. The technology is commercially and readily available for a range of uses. It enables machines to respond correctly and to provide valuable services. For security reasons, people are often unwilling to reveal their identity. Speech can therefore be used to identify a person, because every person has different speech characteristics. Thus, using the information carried in speech waves, we can identify the speaker [2].

A. Speech Recognition
Speech recognition methods can be divided into text-independent and text-dependent methods. In a text-independent system, speaker models capture characteristics of a person's speech that show up irrespective of what is being said. In a text-dependent system, on the other hand, the speaker's identity is recognized from his or her speaking one or more specific phrases, such as passwords, card numbers or PIN codes [3].
Every speech recognition application is designed to accomplish a specific task. Examples include recognizing the digits zero through nine and the words "yes" and "no" over the telephone, enabling bedridden patients to control the positioning of their beds, or implementing a VAT (voice-activated typewriter). Once a task is defined, a speech recognizer is chosen or designed for it [2]. The task of speech recognition is to convert speech into a sequence of words by means of a computer program. Speech recognition applications enable people to use speech as another input mode and so to interact with applications easily and effectively. Speech recognition interfaces in a native language will enable illiterate or semi-literate people to use the technology to a greater extent without knowing how to operate a computer keyboard or stylus [4].
There are different modes available for a speech recognition system:
Speaker Dependent / Independent System: The system must be trained in order to recognize accurately what has been said. To train it, the speaker is asked to record predefined words or sentences, which are analyzed and whose results are stored.
Isolated Word Recognition: This is the simplest mode and the least demanding in terms of CPU requirements. Each word is surrounded by silence, so its boundaries are well known.
Continuous Speech Recognition: This assumes that the system is able to recognize a sequence of words in a sentence.
Keyword Spotting: The system is able to identify, within a sentence, a word corresponding to a particular command. This mode was created to cover the gap between isolated and continuous systems.
Vocabulary Size: The larger the vocabulary, the more errors the system can make, so vocabulary size matters.

B. Cross Correlation
Cross-correlation is a standard method of measuring the similarity or relationship between two signals. It is a measure of the similarity of two series as a function of the displacement of one relative to the other. In some cases it is necessary to compare one reference signal with one or more other signals, both to determine how similar they are and to derive additional information from their relationship. Cross-correlation is also known as a sliding dot product or sliding inner product. It is commonly used for searching a long signal for a shorter, known feature. It has applications in pattern recognition, single particle analysis, electron tomography, averaging, cryptanalysis, and neurophysiology [5].
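The "sliding dot product" view can be illustrated with a small sketch. The signal values below are made up for illustration and are not taken from the recordings used in this paper; the sketch assumes the Signal Processing Toolbox, which provides xcorr.

```matlab
% Sketch: locate a short, known feature inside a longer signal.
% All values are illustrative; nothing here comes from the paper's data.
feature = [1; 3; -2; 4];                        % short known pattern
longSig = [zeros(5, 1); feature; zeros(7, 1)];  % pattern begins at sample 6

% xcorr returns the correlation values and the matching lag axis.
[r, lags] = xcorr(longSig, feature);

% The lag with the largest correlation is the shift at which the
% feature lines up best with the long signal.
[peak, idx] = max(r);
offset = lags(idx);          % shift in samples
startSample = offset + 1;    % 1-based index where the feature begins

fprintf('peak = %g at lag %d (feature starts at sample %d)\n', ...
        peak, offset, startSample);
```

The peak value equals the dot product of the feature with the stretch of the long signal it overlaps, which is why the best alignment produces the largest value.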
In probability and statistics, the term cross-correlation refers to the correlations between the entries of two random vectors X and Y, while the correlations of a single random vector X are the correlations between the entries of X itself, that is, those forming the correlation matrix of X.
In MATLAB, the cross-correlation function is xcorr, which computes the cross-correlation of sequences from a random process and includes autocorrelation as a special case. The basic syntax is r = xcorr(x,y), which returns the cross-correlation of two discrete-time sequences, x and y. Cross-correlation measures the similarity between x and shifted (lagged) copies of y as a function of the lag. If x and y have different lengths, the function appends zeros to the end of the shorter vector so that it has the same length, N, as the other [5].
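The zero-padding behaviour and the autocorrelation special case can be checked with a short sketch. The vectors are arbitrary illustrative values, and the variable names (yPadded, maxDiff, and so on) are hypothetical; xcorr requires the Signal Processing Toolbox.

```matlab
% Sketch: output length, zero-padding and autocorrelation with xcorr.
x = [1 2 3 4 5];
y = [1 1 1];                       % shorter vector, padded internally

[r, lags] = xcorr(x, y);
N = max(length(x), length(y));     % common length after padding

% The result covers lags -(N-1) to (N-1), i.e. 2N-1 values.
fprintf('N = %d, numel(r) = %d\n', N, numel(r));

% Padding the shorter vector by hand should give the same result.
yPadded = [y zeros(1, N - length(y))];
rPadded = xcorr(x, yPadded);
maxDiff = max(abs(r - rPadded));

% With a single input, xcorr returns the autocorrelation, whose peak
% sits at zero lag and equals the signal energy sum(x.^2).
[ra, lagsA] = xcorr(x);
[peakA, i0] = max(ra);
fprintf('autocorrelation peak %g at lag %d\n', peakA, lagsA(i0));
```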

III. METHODOLOGY
Five recorded WAV sound samples were stored in the database, and we wish to recognize a test recording, test.wav, by correlating it against them in MATLAB.
STEP1: We create a test file command as shown in figure 1.
STEP2: Figure 2 shows the code for the sample one.wav. This code is repeated for as many samples as are to be present. In this case we have five samples, so we repeated the code five times, replacing every occurrence of 1 with 2, 3, 4 and 5 respectively.
STEP3: We note that the test signal is stored in x, and it is this value that is compared with the y values of the samples. Hence, Z1=xcorr(x,y);
STEP4: We create a conditional statement in which m6=300; and m is the maximum value. If m<=M1, the machine sounds the allowed signal and plays the one.wav speech; otherwise it continues through the remaining samples in the same way until it finds a matching signal. If none matches, it sounds the denied.wav signal.
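The steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code shown in figures 1 and 2: synthetic tones stand in for the five WAV samples so that the sketch runs without any audio files, the names samples, peaks, threshold and best are hypothetical, and the correlation peak of each sample is assumed to be the quantity compared against m6 = 300. It requires the Signal Processing Toolbox for xcorr.

```matlab
% Sketch of the matching procedure with synthetic stand-in signals.
% In practice each signal would be read from disk, for example
%   [y, fs] = audioread('one.wav');
fs = 8000;                          % assumed sample rate in Hz
t = (0:fs/4 - 1)' / fs;             % 0.25 s time axis
freqs = [300 450 600 750 900];      % one made-up tone per stored sample

samples = cell(1, 5);
for k = 1:5
    samples{k} = sin(2*pi*freqs(k)*t);
end

% Test signal: a slightly distorted copy of sample 3.
x = 0.9*samples{3} + 0.05*sin(2*pi*50*t);

% Correlate the test signal with every sample and keep each peak.
peaks = zeros(1, 5);
for k = 1:5
    z = xcorr(x, samples{k});
    peaks(k) = max(z);
end

% The fixed value 300 plays the role of m6: if no peak exceeds it,
% the threshold itself is the maximum and the input is rejected.
threshold = 300;
[m, best] = max([peaks threshold]);

if best <= 5
    fprintf('Recognized sample %d (peak %.1f)\n', best, m);
    % sound(samples{best}, fs);     % play the matched sample
else
    disp('Denied: no sample exceeded the threshold');
end
```

Taking the maximum over the peaks and the threshold together replaces the chain of if and elseif comparisons with a single call, while keeping the same accept-or-deny outcome.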

IV. SPEECH RECOGNITION TEST AND RESULTS
In the command window, we typed the command "speechrecognition('test.wav')" and pressed the Enter key to obtain the spectrum graphs. In this particular case, test.wav and test2.wav represent the numbers 2 and 3 respectively.
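One way such per-sample graphs could be produced is sketched below. It assumes that each graph plots the cross-correlation of the test signal with one stored sample against the lag, which the text does not state explicitly. Synthetic tones again stand in for the recordings, and all variable names are hypothetical; xcorr requires the Signal Processing Toolbox.

```matlab
% Sketch: one correlation plot per stored sample, on a shared figure.
fs = 8000;                          % assumed sample rate in Hz
t = (0:fs/4 - 1)' / fs;
freqs = [300 450 600 750 900];      % made-up stand-ins for the samples

x = sin(2*pi*freqs(2)*t);           % test signal matching sample 2

peaks = zeros(1, numel(freqs));
figure;
for k = 1:numel(freqs)
    y = sin(2*pi*freqs(k)*t);
    [z, lags] = xcorr(x, y);
    peaks(k) = max(z);

    subplot(3, 2, k);               % 3-by-2 grid holds the five plots
    plot(lags, z);
    title(sprintf('Sample %d', k));
    xlabel('Lag (samples)');
    ylabel('Correlation');
end

[~, best] = max(peaks);             % graph with the tallest peak
fprintf('Tallest peak is in graph %d\n', best);
```

The graph belonging to the matching sample shows a much taller peak than the others, which is the visual cue used to pick the recognized word.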
We first ran a test for test2.wav. Because test2.wav is a recording of the number 3, graph 3, as shown in figure 4, is the most accurate spectrum.
We repeated the step for test.wav, which represents the number 2. From figure 5 we can see that graph 2 is the most accurate, as test.wav is a recording of the word "two". Figures 4 and 5 demonstrate a typical scenario in which the machine recognizes one of the audio files.
In this work, five recorded WAV audio files were used to demonstrate speech recognition. Using the cross-correlation method to find the similarities between the recorded audio files, we were able to develop a model in which machines can differentiate between commands and act upon them. The results show that machines can understand and interact with humans, although they are very sensitive to noise and to variations in pronunciation.