Effective Approach For Dialog Person Identification English Language Essay

Abstract

The work leading to this paper focused on building a text-independent, closed-set speaker recognition system. Unlike other recognition systems, this system was built in two parts in order to improve recognition accuracy. The first part is speaker pruning, performed by the KNN algorithm. To reduce gender misclassification in KNN, a novel technique was used in which pitch and MFCC features were combined. This technique not only reduces gender misclassification but also improves the overall performance of the pruning. The second part is DDHMM speaker recognition, performed on the speakers that 'survived' pruning. Adding the speaker pruning part increased the system's recognition accuracy by 9.3%. During the work, an English Language Speech Database for Speaker Recognition (ELSDSR) was built. The system was trained and tested with both the TIMIT and ELSDSR databases.

Keywords: feature extraction, MFCC, KNN, speaker pruning, DDHMM, speaker recognition, ELSDSR.

Introduction

Fig. 1.1 Automatic extraction of information transmitted in the speech signal

The main structure is taken from [1]. The speech signal carries rich information, and the three main recognition fields based on the speech signal that are of most interest and have been studied for several decades are speech recognition, language recognition, and speaker recognition. In this paper, we focus on the speaker recognition field.

1.2 Principles of Speaker Recognition

Speaker recognition, which comprises two applications, speaker identification and speaker verification, is the process of automatically recognizing who is speaking on the basis of individual information included in speech waves. This technique makes it possible to use the speaker's voice to verify their identity and control access to services such as voice dialing, banking by telephone, telephone shopping, database access services, information services, voice mail, security control for confidential information areas, and remote access to computers [2]. Speaker verification (SV) is the process of determining whether the speaker is who he or she claims to be. It performs a one-to-one comparison (also called a binary decision) between the features of an input voice and those of the claimed voice registered in the system.

There are three main components: Front-end Processing, Speaker Modeling, and Pattern Matching. Front-end processing is used to highlight the relevant features and remove the irrelevant ones.

Fig. 1.2 shows the basic structure of an SV system (SVS).

In this structure, front-end processing is performed first to obtain the feature vectors of the incoming speech signal. Then, depending on the models used, Pattern Matching computes match scores between the input features and the claimed speaker model registered in the database, and between the input features and the impostor model. If the match score is above a certain threshold, the identity claim is verified and the claimed speaker is accepted. With a high threshold, the system is safer and prevents impostors from being accepted, but at the same time it risks rejecting the genuine speaker, and vice versa.
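The threshold trade-off described above can be illustrated with a minimal sketch. The function names and all score values below are hypothetical and chosen only for illustration; they are not taken from the system in this paper.

```python
def verify(match_score, threshold):
    """Accept the identity claim when the match score exceeds the threshold."""
    return match_score > threshold


def error_rates(genuine_scores, impostor_scores, threshold):
    """Return (false_rejection_rate, false_acceptance_rate) at one threshold."""
    false_rejects = sum(1 for s in genuine_scores if not verify(s, threshold))
    false_accepts = sum(1 for s in impostor_scores if verify(s, threshold))
    return (false_rejects / len(genuine_scores),
            false_accepts / len(impostor_scores))


# Hypothetical match scores (assumption, for illustration only).
genuine = [0.9, 0.8, 0.75, 0.6]    # trials where the claim is true
impostor = [0.7, 0.5, 0.4, 0.2]    # trials where the claim is false

# A low threshold rejects no genuine speaker but lets impostors through.
low = error_rates(genuine, impostor, threshold=0.45)
# A high threshold blocks the impostors but rejects one genuine speaker.
high = error_rates(genuine, impostor, threshold=0.72)
print(low, high)
```

Raising the threshold moves errors from false acceptances to false rejections, which is the safety versus convenience trade-off the text describes.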

1.3 Speaker Identification (SI)

Speaker identification is the process of finding the identity of an unknown speaker by comparing his or her voice with the voices of registered speakers in the database. It is a one-to-many comparison. The basic structure of an SI system (SIS) is shown in Fig. 1.3. Depending on the situation, speaker recognition is often classified into closed-set recognition and open-set recognition. As the names suggest, closed-set refers to cases where the unknown voice must come from a set of known speakers, and open-set means the unknown voice may come from unregistered speakers, in which case a 'none of the above' option can be added to the identification system. Moreover, in practice speaker recognition systems (SRS) can also be divided according to speech modality into text-dependent and text-independent recognition. In a text-dependent SRS, speakers are only allowed to say specific sentences or words that are known to the system. In addition, text-dependent recognition is subdivided into fixed phrase and prompted phrase. In contrast, a text-independent SRS can process freely spoken speech, which is either a user-selected phrase or conversational speech. Compared with text-dependent SRS, text-independent SRS are more flexible but more complicated.

Fig. 1.3 Basic structure of Speaker Identification

The core components of an SIS are the same as those of an SVS. In an SIS, M speaker models are scored in parallel and the most likely one is reported. Consequently, the decision is one of the speaker IDs in the database, or 'none of the above' if and only if the matching score is below some threshold and the system is an open-set SIS.
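The closed-set and open-set decision rules described above can be sketched as follows. This is a minimal illustration that assumes each of the M speaker models has already produced a matching score where higher means more similar; the function name, speaker IDs, and score values are hypothetical.

```python
def identify(scores, open_set=False, threshold=0.0):
    """Pick the most likely speaker from per-model matching scores.

    scores: dict mapping speaker ID -> matching score (higher = more similar).
    Returns the best speaker ID, or None ('none of the above') when the
    system is open-set and the best score falls below the threshold.
    """
    best_id = max(scores, key=scores.get)
    if open_set and scores[best_id] < threshold:
        return None
    return best_id


# Hypothetical matching scores for M = 3 registered speakers (assumption).
scores = {"spk1": 0.2, "spk2": 0.7, "spk3": 0.4}

print(identify(scores))                                # closed-set decision
print(identify(scores, open_set=True, threshold=0.8))  # open-set rejection
```

In the closed-set case a speaker ID is always returned, whereas the open-set variant can answer 'none of the above' when even the best match is too weak.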

The detailed taxonomy of speech processing is shown in Fig. 1.4 to give a general overview.

Fig. 1.4 Taxonomy of speech processing

Speech signal processing can be divided into three different tasks: analysis, recognition, and coding. As shown in Fig. 1.1, the recognition research field can be subdivided into three parts: speech, speaker, and language recognition. In addition, according to the applications and situations in which recognition systems work, speaker recognition is classified into text-dependent, text-independent, closed-set, and open-set. Before proceeding, it is important to stress the difference between SV and speech recognition. The aim of a speech recognition system is to find out what the speaker is saying and to assist the speaker in accomplishing what he or she wants to do. A speaker verification system, however, is often used for security. The system asks speakers to say some specific words or numbers but, unlike a speech recognition system, it does not know whether the speakers have said what they were expected to say. Some literature also mentions voice recognition. The term is ambiguous: it usually refers to speech recognition, but it is sometimes also used as a synonym for speaker verification.

1.4 Phases of Speaker Identification

For almost all recognition systems, training is the first step. In an SIS, this step is called the enrollment phase, and the following step is called the identification phase. The first phase of verification systems is also enrollment. The enrollment phase obtains the speaker models, or voiceprints, that make up the speaker database, which is used later in the next phase, i.e. the identification phase. The front-end processing and speaker modeling algorithms should be consistent across both phases of an SIS (or SVS).

1.5 Development of Speaker Recognition Systems

The first type of speaker recognition machine, which used spectrograms of voices, was invented in the 1960s. It was called voiceprint analysis or visible speech. A voiceprint is the acoustic spectrum of the voice, and it is defined similarly to a fingerprint. Both belong to biometrics. However, voiceprint analysis could not achieve automatic recognition; manual determination by a human was needed. Since then, a number of feature extraction techniques that are commonly used in the speech recognition field have been applied. For the speaker recognition problem, different representations of the audio signal using different features have been addressed. Features can be calculated in the time domain, the frequency domain, or both. One system started from an earlier reference system and used features calculated in both domains. On its authors' own database, which was extracted from Italian television news, the system achieved a 99% recognition rate when 1.5 seconds of speech was used for identification. Furthermore, different classification paradigms using different modeling techniques for SRS can be found, such as the Gaussian Mixture Model (GMM) and the Hidden Markov Model (HMM), which are prevalent techniques in the SR field. The reference system has been frequently cited. It uses Mel-scale cepstral coefficients, which are a cepstral analysis in the frequency domain. Transformations based on this system have been made. In one example, Mel cepstral features were transformed to compensate for noise components in the audio channel, and formant features were then calculated and used in classification. Principal Component Analysis (PCA) has also been applied to the features in order to reduce the computational complexity of the classification phase.

Furthermore, speaker recognition applications have distinct constraints and work in different situations. Following the application requirements, recognition systems are divided into closed-set, open-set, text-independent, and text-dependent. According to their usage, systems are designed for a single speaker or for multiple speakers. As part of the information carried by a spoken utterance, emotions are attracting more and more attention, and vocal emotions have been studied as a separate topic. One study shows that the mean fundamental frequency increased and the range of the fundamental frequency widened when the speaker was in a stressful situation. More recently, MPEG-7 has been used as a new technique for speaker recognition. MPEG-7, formally named “Multimedia Content Description Interface”, is a standard for describing multimedia content data that supports some degree of interpretation of the information's meaning, which can be passed on to, or accessed by, a device or computer code. MPEG-7 is not aimed at any one application in particular; rather, the elements that MPEG standardizes support as broad a range of applications as possible. For the speaker recognition problem, the MPEG-7 Audio standard was used. The MPEG-7 Audio standard comprises descriptors and description schemes, which are divided into two classes: generic low-level tools and application-specific tools. There are 17 low-level audio descriptors (LLDs). [11] used a method of projection onto a low-dimensional subspace via reduced-rank spectral basis functions to extract speech features. Two LLDs were used, one of which is AudioSpectrumProjectionType. Using Independent Component Analysis (ICA), the speaker recognition accuracy was 91.2% for the small set and 93.6% for the large set, and the gender recognition accuracy for the small set was 100%.

Fig. 1.6 Classification paradigms used in SRS during the past 20 years (taken from CWJ's presentation slides). VQ, NN, HMM and GMM represent Vector Quantization, Neural Network, Hidden Markov Model and Gaussian Mixture Model respectively.

It has been shown that a continuous ergodic HMM method is superior to a discrete ergodic HMM method, and that a continuous ergodic HMM method is as robust as a VQ-based method when enough training data is available. However, when little data is available, the VQ-based method is more robust than a continuous HMM method. This paper uses HMMs for speaker modeling; note that for each speaker in the database there is a corresponding HMM. As Fig. 1.3, the basic structure of speaker identification (identification phase), shows, if there are M speakers in the database, speaker identification has to perform M pattern matches between the unknown speaker and the M known speakers. With a large number of speakers in the database, the performance of speaker recognition decreases. Furthermore, since an HMM is a doubly stochastic process, these models are very flexible and hard to train. As a result, high recognition accuracy is hard to achieve when there is a large number of speaker models to compare against.

Speaker pruning is one way to increase identification accuracy at the cost of a slightly longer recognition time. Referring to Fig. 1.3 again, speaker pruning is performed before pattern matching to reduce the number of speaker models in the matching process, and the pruned speakers are those most dissimilar to the unknown speaker. As a result, pattern matching between the unknown speaker's feature vectors and all the speakers in the database is reduced to matching between the unknown feature vectors and the candidates that 'survived' pruning. The simple KNN algorithm is used here for speaker pruning. In this paper, we first present the theory of KNN and then describe its application to speaker pruning.

1.6 K-Nearest Neighbor Algorithm

K-Nearest Neighbor (KNN) is a non-parametric algorithm. KNN stores all the given data examples $\{fv_i, L_i\}$, where $fv_i$ denotes the feature vectors in our pruning system and $L_i$ denotes the class label. It uses these examples to estimate the label $l_{new}$ for a new example $fv_{new}$. The label $l_{new}$ is assigned to the class with the most representatives among the nearest examples to $fv_{new}$. Here we use $N_K$ to denote the number of nearest neighbors, to distinguish it from the number of codewords $K$ in one codebook, which is defined as follows. The features from the speech signal are quantized by a vector quantization (VQ) procedure, such as the K-means algorithm. The VQ procedure aims at partitioning the acoustic space into non-overlapping regions, and each region is represented by one codeword $w_k$. The collection of these $K$ codewords makes up the codebook $\{w_k\}$. Therefore, in the DDHMM case, $b_j(o_t)$ is approximated by $b_j(w_k)$, where $w_k$ is the codeword closest to $o_t$. The quality of the codebook is measured in terms of distortion, which has the following form:

$$D = \sum_{k=1}^{K} \sum_{y_t \in w_k} \lVert y_t - \mu_k \rVert^2$$

where $K$ is the number of codewords, $\mu_k$ is the centroid of the $k$-th codeword, $y_t$ is a feature vector, and the inner sum runs over the feature vectors assigned to the $k$-th codeword.
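The quantization step and the codebook distortion can be sketched as below. This is a minimal illustration that assumes the codebook is already given (for example from K-means, which is not implemented here) and that distortion is measured as the total squared Euclidean distance between each feature vector and its closest codeword. The codebook and feature values are hypothetical.

```python
import math


def nearest_codeword(vector, codebook):
    """Return the index k of the codeword closest to the feature vector."""
    return min(range(len(codebook)),
               key=lambda k: math.dist(vector, codebook[k]))


def quantize(features, codebook):
    """Map each feature vector to a discrete symbol (codeword index)."""
    return [nearest_codeword(y, codebook) for y in features]


def distortion(features, codebook):
    """Total squared distance from each vector to its closest codeword."""
    total = 0.0
    for y in features:
        k = nearest_codeword(y, codebook)
        total += math.dist(y, codebook[k]) ** 2
    return total


# Hypothetical 2-D codebook with K = 2 codewords and three feature vectors.
codebook = [(0.0, 0.0), (10.0, 10.0)]
features = [(1.0, 0.0), (0.0, 2.0), (9.0, 10.0)]

symbols = quantize(features, codebook)   # discrete observations for a DDHMM
total = distortion(features, codebook)
print(symbols, total)
```

The resulting symbol sequence is what a discrete HMM would consume in place of the continuous observations, and a lower distortion indicates a codebook that represents the features more faithfully.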

Fig. 1.7 presents the KNN algorithm in the case of 5 nearest neighbors. The procedure of KNN is as follows: 1) Instead of building a model, all the training examples $\{fv_i, L_i\}$ are stored; 2) Calculate the similarity between the new example $fv_{new}$ and all the examples $fv_i$ in the training set; 3) Determine the $N_K$ nearest examples to $fv_{new}$; 4) Assign $l_{new}$ to the class that most of the $N_K$ nearest examples belong to.

First, all the examples from the training speakers are stored. Red triangles stand for Speaker 1, blue squares for Speaker 2, and yellow stars for Speaker 3. The pentagon denotes an example from an unknown speaker. Second, the Euclidean distances between the unknown example and all the examples in the training set are calculated, $d_E = \{d_{E1}, \ldots, d_{EN}\}$. Third, the distances are sorted to find the $N_K = 5$ nearest neighbors $D_1, \ldots, D_5$, which are enclosed in the light blue circle. Finally, the unknown example is assigned to the class that most of the 5 examples belong to, which is the blue square (Speaker 2) in this case. Normally the similarity between examples refers to the Euclidean distance, and the Euclidean distance between example vectors $m = (m_1, m_2, \ldots, m_J)$ and $n = (n_1, n_2, \ldots, n_J)$ is defined as:

$$d_E(m, n) = \sqrt{\sum_{j=1}^{J} (m_j - n_j)^2}$$

where $J$ is the dimension of the vectors.

To give a better understanding of KNN, we present one simple example. Suppose we know the grades of 6 people in different subjects and their ages, and know which of them are qualified for a contest. Based on this limited information, we have to decide the qualification of one new student using the KNN algorithm. The data is shown below. We define the training set as $\{fv_i, L_i\}$, where the vector $fv_i = \{Age_i, Math_i, Physics_i, Chemistry_i\}$ and $L_i$ is the decision, either qualified or not. The data for George forms the new example $fv_{new} = \{27, 13, 11, 11\}$. The Euclidean distances between $fv_{new}$ and all $fv_i$ in the training set need to be calculated, as shown in the following table. Following the procedure of the KNN algorithm, we find the $N_K = 3$ nearest neighbors for George, which are Lisa, Jerry and Tom. By majority vote, $l_{new}$ for George is assigned Yes, which means George is qualified.
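The KNN procedure and the Euclidean distance described above can be sketched as follows. This is a minimal illustration; the 2-D training points and speaker labels are hypothetical and are not the values from the paper's figure or table.

```python
import math
from collections import Counter


def knn_classify(train, new_example, n_neighbors):
    """Classify new_example by majority vote among its nearest neighbors.

    train: list of (feature_vector, label) pairs, stored as-is (no model).
    n_neighbors: the number of nearest neighbors, N_K in the text.
    Ties are broken in favor of the label encountered first among the
    nearest neighbors.
    """
    # Steps 2-3: sort stored examples by Euclidean distance to the new one.
    by_distance = sorted(train, key=lambda ex: math.dist(ex[0], new_example))
    nearest = by_distance[:n_neighbors]
    # Step 4: majority vote over the labels of the nearest examples.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical training set with three speakers (assumption).
train = [
    ((1.0, 1.0), "speaker1"), ((1.2, 0.8), "speaker1"),
    ((5.0, 5.0), "speaker2"), ((5.2, 4.9), "speaker2"),
    ((4.8, 5.1), "speaker2"), ((9.0, 1.0), "speaker3"),
]

label = knn_classify(train, (4.9, 4.7), n_neighbors=5)
print(label)
```

With five neighbors, three of the nearest examples belong to speaker2 and two to speaker1, so the vote assigns the new example to speaker2.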

One thing to emphasize here is the normalization of the variables in the example vectors. In some cases, the distance between neighbors may be dominated by variables with large variance. (Notice that here the variance of age is relatively larger than that of the other variables.) Therefore, it is necessary to normalize each variable by its largest value before executing KNN to avoid this domination. KNN is a simple algorithm and easy to implement, and choosing a higher value of $N_K$ reduces the sensitivity to noise in the training set. However, since KNN does not build any model and simply stores all the training data, a lot of computer storage is needed. Moreover, the Euclidean distance is sometimes not well suited to finding similar examples when there are irrelevant attributes in the training set.
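The normalization described above, dividing each variable by its largest value, can be sketched as follows. This is a minimal illustration with hypothetical data; the helper names are assumptions, and the choice to reuse the training maxima for new examples is a design assumption rather than something stated in the text.

```python
def column_maxima(examples):
    """Largest absolute value of each variable across the training set."""
    dims = len(examples[0])
    # Fall back to 1.0 for an all-zero variable to avoid division by zero.
    return [max(abs(ex[j]) for ex in examples) or 1.0 for j in range(dims)]


def scale(vector, maxima):
    """Divide each variable of one example by the stored maximum."""
    return tuple(v / m for v, m in zip(vector, maxima))


# Hypothetical examples: the first variable has a much larger range.
training = [(20.0, 10.0), (40.0, 5.0)]
maxima = column_maxima(training)
normalized = [scale(ex, maxima) for ex in training]

# A new example is scaled with the same maxima before distances are computed.
new_scaled = scale((30.0, 8.0), maxima)
print(maxima, normalized, new_scaled)
```

After scaling, every variable lies in a comparable range, so no single variable dominates the Euclidean distance.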

1.8 Speaker Pruning Using KNN

The purpose of introducing speaker pruning into our SRS is to increase recognition accuracy. It therefore differs from the commonly defined speaker pruning, where the pruning process is performed continuously until the unknown speaker's ID is found. Our pruning process is performed only once, to find the similarity between the unknown speaker and all the known speakers in the database; by selecting the most similar speakers, we eliminate the dissimilar ones. Before turning to the pruning method, some issues should be kept in mind when implementing KNN in the speaker pruning algorithm: 1) Which features will be used; 2) How to calculate the matching score; 3) The number of nearest neighbors; 4) The pruning criterion (how many speakers will be pruned); 5) Time consumption.

The number of speakers/candidates kept after pruning should be adjustable to the requirements of the recognition system. If more speakers are kept, the HMM has to perform more pattern matching, which decreases the HMM recognition accuracy. On the other hand, keeping more speakers raises the probability that the true speaker is among those kept. This accuracy trade-off between pruning and HMM can be resolved by observing the overall recognition accuracy of the system with different combinations, in order to find the desired number of 'survived' candidates. With the KNN algorithm, the pruning matching score becomes the Euclidean distance between the unknown speaker's examples and all the examples in the training database. The features used as data points are MFCCs. The figure below shows the experimental verification for choosing MFCC. To improve the performance of KNN in speaker recognition, we devised a new method that introduces pitch information into the KNN algorithm. The range of the parameter is decided by experimental experience.
Second, we introduce the parameter, written here as $\kappa$, into the Euclidean distances calculated using MFCC as follows:

$$d_{new} = \begin{cases} (1 - \kappa)\, d_{MFCC} & \text{for female speakers in the database} \\ (1 + \kappa)\, d_{MFCC} & \text{for male speakers in the database} \end{cases}$$

where $d_{MFCC}$ is the Euclidean distance calculated using MFCC only, and $d_{new}$ is the distance after adding the pitch influence. This method depends on the accuracy of pitch in separating genders. The Euclidean distances between the new speaker and the female speakers in the database are multiplied by the factor $(1 - \kappa)$, and for the remaining male speakers the Euclidean distances are multiplied by the factor $(1 + \kappa)$. Therefore, if the probability of the new speaker being female is less than 50%, which means the speaker is more likely to be male, $\kappa$ becomes negative and the distance between the new speaker and the female speakers is increased, and vice versa.

Time is a critical issue in our pruning, since introducing the speaker pruning step slightly decreases the recognition speed. We should therefore keep the time consumption as low as possible. The factors that affect time are the training set size, the test set size, the feature dimensionality, and the number of nearest neighbors $N_K$. At the same time, however, we also expect higher accuracy after pruning.
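The pitch-weighted distance and the selection of 'survived' candidates can be sketched as follows. This is only an illustration under stated assumptions: the linear mapping from the female probability to the parameter, its range (`max_shift`), the function names, and all distance values are hypothetical, since the text says only that the range is chosen experimentally. The per-speaker MFCC distances are taken as given.

```python
def pitch_parameter(p_female, max_shift=0.2):
    """Map the pitch-based probability of 'female' to the weighting parameter.

    Assumed linear mapping: 0 at 50 %, +max_shift at 100 %, -max_shift at 0 %.
    """
    return (p_female - 0.5) * 2.0 * max_shift


def adjusted_distance(d_mfcc, candidate_is_female, parameter):
    """Scale the MFCC distance by (1 - parameter) for female candidates
    and by (1 + parameter) for male candidates."""
    factor = (1.0 - parameter) if candidate_is_female else (1.0 + parameter)
    return d_mfcc * factor


def prune_speakers(distances, genders, p_female, keep):
    """Return the IDs of the `keep` candidates with the smallest adjusted distance."""
    parameter = pitch_parameter(p_female)
    adjusted = {
        spk: adjusted_distance(d, genders[spk] == "F", parameter)
        for spk, d in distances.items()
    }
    return sorted(adjusted, key=adjusted.get)[:keep]


# Hypothetical MFCC distances between the unknown speaker and four candidates.
distances = {"f1": 1.0, "m1": 0.95, "f2": 1.3, "m2": 1.05}
genders = {"f1": "F", "m1": "M", "f2": "F", "m2": "M"}

with_pitch = prune_speakers(distances, genders, p_female=0.9, keep=2)
without_pitch = prune_speakers(distances, genders, p_female=0.5, keep=2)
print(with_pitch, without_pitch)
```

When pitch strongly suggests a female speaker, the male candidate that was closest on MFCC alone is pushed out of the kept set, which is the intended reduction of gender misclassification.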

Experiments and Results

Notice that the envelope of the original signal (upper panel), which represents the low frequencies, was removed.

The system consists of the following steps: pre-emphasis; feature extraction; speaker pruning; and speaker modeling and recognition. More effort has been put into feature extraction, since the features are closely related to the performance of the whole system. Furthermore, speaker pruning, as a newly introduced step in our recognition system, needs special attention. To improve the pruning performance, a new method is devised that makes two features, pitch and MFCC, work together in the Euclidean distance calculation for the KNN algorithm. Finally, HMM is applied to the candidates remaining after speaker pruning to accomplish the speaker recognition task.
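The pre-emphasis step listed above can be sketched as a first-order high-pass filter. This is a minimal illustration; the filter form y[n] = x[n] - a·x[n-1] is the common textbook choice, and the coefficient 0.97 is an assumed typical value, not one stated in this paper.

```python
def pre_emphasis(signal, coeff=0.97):
    """Apply y[n] = x[n] - coeff * x[n-1]; the first sample is kept unchanged.

    coeff = 0.97 is an assumed, commonly used value.
    """
    if not signal:
        return []
    out = [signal[0]]
    for n in range(1, len(signal)):
        out.append(signal[n] - coeff * signal[n - 1])
    return out


# A constant (zero-frequency) signal is almost entirely suppressed.
flat = pre_emphasis([1.0, 1.0, 1.0, 1.0])
# A rapidly alternating (high-frequency) signal is boosted instead.
alternating = pre_emphasis([1.0, -1.0, 1.0, -1.0])
print(flat, alternating)
```

The slowly varying component is attenuated while fast changes are emphasized, which matches the removal of the low-frequency envelope noted in the text.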