Audio-Visual Speaker Identification Based on the Use of Dynamic Audio and Visual Features

Niall Fox, Richard B. Reilly
Dept. of Electronic and Electrical Engineering, University College Dublin, Belfield, Dublin 4, Ireland
{niall.fox, richard.reilly}@ee.ucd.ie

Abstract: This paper presents a speaker identification system based on dynamic features of both the audio and visual modes. Speakers are modeled using a text-dependent HMM methodology. Early and late audio-visual integration are investigated. Experiments were carried out for 252 speakers from the XM2VTS database. The experimental results show that the addition of the dynamic visual information improves speaker identification accuracy for both clean and noisy audio conditions compared to the audio-only case. The best audio, visual and audio-visual identification accuracies achieved were 86.91%, 57.14% and 94.05% respectively.

1 Introduction

Recently there has been significant interest in multi-modal human computer interfaces, especially audio-visual (AV) systems for applications in areas such as banking and security [1], [3]. It is known that humans perceive in a multi-modal manner, as the McGurk effect demonstrates [10]. People with impaired hearing use lip-reading to complement the information gleaned from their perceived, degraded audio signal. Indeed, synergistic integration has already been achieved for the purpose of AV speech recognition [18]. Previous work in this area is usually based on either audio alone [16], [15] or static facial images (face recognition) [2]. Previous multi-modal AV systems predominantly use the static facial image as the visual mode rather than dynamic visual features [1], [19]. It is expected that the addition of the dynamic visual mode should complement the audio mode, increase reliability under noisy conditions and even increase identification rates under clean conditions. It would also be considerably more difficult for an imposter to impersonate both the audio and the dynamic visual information simultaneously. Recently, some work has been carried out on the use of the dynamic visual mode for the purpose of speech recognition [9], [17]. Progress in speech-based bimodal recognition is documented in [4]. The aim of the current study was to implement and compare various methods of integrating dynamic visual and audio features for the purpose of speaker identification (ID), and to achieve a more reliable and secure system compared to the audio-only case.

2 Audio and Visual Segmentation

The XM2VTS database [8], [11] was used for the experiments described in this paper. The database consists of video data recorded from 295 subjects in four sessions, spaced monthly. The first recording per session of the third sentence ("Joe took father's green shoe bench out") was used for this research. The audio files were manually segmented into the seven words. The audio segmentation times were converted into visual frame numbers in order to carry out visual word segmentation. Some sentences had the start of "Joe" clipped or missing entirely. Due to this and other errors in the sentences, only 252 of the possible 295 subjects were used for the experiments. Visual features were extracted from the mouth region of interest (ROI). This ROI was segmented manually by locating the two labial corners, and a 98×98 pixel block was extracted as the ROI. Manual segmentation was carried out only for every 10th frame, and the ROI coordinates for the intermediate frames were interpolated.
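As a rough illustration of this interpolation step, the sketch below linearly interpolates the manually located ROI coordinates between the annotated frames. It is a minimal sketch only; the function name, the coordinate layout and the use of linear interpolation are assumptions for illustration, not details taken from the original system.

```python
import numpy as np

def interpolate_roi(annotated_frames, annotated_coords, num_frames):
    """Linearly interpolate ROI corner coordinates for unannotated frames.

    annotated_frames : indices of the manually segmented frames (every 10th frame).
    annotated_coords : array (num_annotated, 4) holding, e.g., (x_left, y_left,
                       x_right, y_right) of the two labial corners.
    num_frames       : total number of video frames in the word/sentence.
    """
    frames = np.arange(num_frames)
    coords = np.empty((num_frames, annotated_coords.shape[1]))
    for c in range(annotated_coords.shape[1]):
        coords[:, c] = np.interp(frames, annotated_frames, annotated_coords[:, c])
    return coords

# Hypothetical usage: corners annotated at frames 0, 10 and 20 of a 21-frame clip.
annotated = np.array([0, 10, 20])
corners = np.array([[40.0, 60.0, 138.0, 62.0],
                    [42.0, 61.0, 140.0, 63.0],
                    [41.0, 60.5, 139.0, 62.5]])
roi_coords = interpolate_roi(annotated, corners, 21)
```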
3 Audio and Visual Feature Extraction

The audio signal was first pre-emphasised to increase the acoustic power at higher frequencies using the filter $H(z) = 1 - 0.97z^{-1}$. The pre-emphasised signal was divided into frames using a Hamming window of length 20 ms with an overlap of 10 ms, giving an audio frame rate, F_A, of 100 Hz. Mel-frequency cepstral coefficients (MFCCs) [5] of dimension 8 were extracted from each frame. The energy [20] of each frame was also calculated and used as a ninth static feature. Nine first-order differences, or delta features, were calculated between adjacent frames and appended to the static audio features to give an audio feature vector of dimension 18.

Transform-based features were used to represent the visual information, based on the Discrete Cosine Transform (DCT) because of its high energy compaction [12]. The 98×98 colour pixel blocks were converted to gray scale. No further image pre-processing was implemented, and the DCT was applied to the gray scale pixel blocks. The first 15 coefficients, taken in a zig-zag pattern, were used. The visual feature vector is formed by calculating the difference of the DCT coefficients over k frames. This was carried out for two values of k, giving a visual feature vector of dimension 30. The static coefficients were discarded. The values of k depended on the visual feature frame rate. The visual features can have two frame rates:

1) Asynchronous visual features have a frame rate, F_V, of 25 fps (equivalently 25 Hz), i.e. that of the video sequence. The optimum values of k were determined empirically to be 1 and 2.

2) Synchronous visual features have a frame rate of 100 fps, i.e. that of the audio features. Since this frame rate is higher than in the asynchronous case, the values of k must be higher to give the same temporal difference. Two sets of synchronous k values, (3,6) and (5,10), were tested.

In general, delta(k1,k2) refers to the use of the k values k1 and k2 for the calculation of the visual feature vector, where k2 > k1. A sentence observation, O = O_1 ... O_k ... O_M, consists of M words, where M = 7 here. A particular AV word, O_k, has N_A audio frames and N_V visual frames. In general N_A ≠ 4×N_V, even though F_A = 4×F_V. This is because, when N_A and N_V were determined, the initial and final frame numbers were rounded down and up respectively. A sequence of audio or synchronous visual frame observations is given by Equation (1). When the visual features are calculated according to Equation (2), k2 frames are dropped. In Equation (2), o_n^{V} refers to the n-th synchronous visual feature vector of dimension 30, and T_m refers to the top 15 DCT transform coefficients of the m-th interpolated visual frame. Hence, to ensure that there are N_A visual feature frames, the N_V DCT visual frames are interpolated to N_A + k2 frames (refer to Fig. 1).

$O_k^{\{i\}} = o_1^{\{i\}} \ldots o_n^{\{i\}} \ldots o_{N_A}^{\{i\}}, \quad i \in \{A, V\}$   (1)

$o_n^{\{V\}} = [\,T_m - T_{m-k_1},\; T_m - T_{m-k_2}\,], \quad 1 \le n \le N_A, \; k_2 + 1 \le m \le N_A + k_2$   (2)

Fig. 1. Frame interpolation and visual feature calculation for a specific word consisting of N_A audio frames: the N_V visual DCT frames (25 Hz) are interpolated to N_A + k_2 frames, from which the N_A synchronous visual feature frames (100 Hz) are calculated; the asynchronous visual features are calculated directly from the DCT coefficients.
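To make the feature construction concrete, the sketch below follows the description above: the top 15 DCT coefficients of each gray-scale mouth ROI are taken in zig-zag order, the DCT coefficient frames are interpolated to N_A + k_2 frames, and the two differences of Equation (2) are stacked into a 30-dimensional vector. The function names, the use of linear interpolation of the coefficients and the exact zig-zag convention are assumptions for illustration, not the original implementation.

```python
import numpy as np
from scipy.fft import dctn

def zigzag_top(block, n_coeffs=15):
    """Top n_coeffs 2-D DCT coefficients of a gray-scale ROI block in zig-zag order."""
    coeffs = dctn(block, norm='ortho')
    h, w = coeffs.shape
    order = sorted(((i, j) for i in range(h) for j in range(w)),
                   key=lambda p: (p[0] + p[1],
                                  p[0] if (p[0] + p[1]) % 2 else p[1]))
    return np.array([coeffs[i, j] for (i, j) in order[:n_coeffs]])

def synchronous_visual_features(dct_frames, n_audio_frames, k1=5, k2=10):
    """30-dimensional synchronous visual features per Equation (2).

    dct_frames     : array (N_V, 15) of top-15 DCT coefficients, one row per
                     video frame (25 Hz), e.g. built with zigzag_top().
    n_audio_frames : N_A, the number of audio feature frames (100 Hz) for the word.
    Returns an array of shape (N_A, 30).
    """
    n_v = dct_frames.shape[0]
    # Interpolate the N_V DCT frames to N_A + k2 frames (refer to Fig. 1).
    src = np.linspace(0.0, 1.0, n_v)
    dst = np.linspace(0.0, 1.0, n_audio_frames + k2)
    T = np.stack([np.interp(dst, src, dct_frames[:, c])
                  for c in range(dct_frames.shape[1])], axis=1)
    # Differences over k1 and k2 frames; the first k2 interpolated frames are
    # dropped so that exactly N_A feature vectors remain, as in Equation (2).
    feats = [np.concatenate([T[m] - T[m - k1], T[m] - T[m - k2]])
             for m in range(k2, n_audio_frames + k2)]
    return np.array(feats)
```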
4 Speaker Identification and Hidden Markov Modeling

Speaker ID is discussed in this paper, as opposed to speaker verification. To test the importance of integration based on the use of dynamic audio and visual features, a text-dependent speaker ID methodology was used. In text-dependent modeling [6], the speaker says the same utterance for both training and testing. It was employed, as opposed to text-independent modeling [16], because of the database used in this study. Text independence has also been found to give lower performance than text dependence [7]. Each word consists of N_A or N_V frame observations as given above by Equation (1). Speaker S_i is represented by M speaker-dependent word models, S_ik, for i = 1 ... N, k = 1 ... M, where N = 252 and M = 7 here. There are M background models, B_k. Three sessions were used for training and one session for testing. The M background speaker-independent word HMMs were trained using three of the sessions for all the speakers. These background models capture the AV speech variation over the entire database. Since there were only three training word utterances per speaker, there was insufficient training data to train a speaker-dependent HMM initialized from a prototype model. Hence the background word models were used to initialise the training of the speaker-dependent word models.

A sentence observation, O, is tested against all N speakers, S_i, and the speaker that gives the maximum score is chosen as the identified speaker. To score an observation O against speaker S_i, M separate scores, P(O_k/S_ik), are calculated, one for each word in O, 1 ≤ k ≤ M. The M separate scores are normalised with respect to the frame length of each word by dividing by F_k, and are then summed to give an overall score P(O/S_i), as shown in Equation (3). O is also scored against the background models to give an additional score, P(O/B), also shown in Equation (3).

$\log P(O/S_i) = \frac{1}{M} \sum_{k=1}^{M} \frac{\log P(O_k/S_{ik})}{F_k}, \qquad \log P(O/B) = \frac{1}{M} \sum_{k=1}^{M} \frac{\log P(O_k/B_k)}{F_k}$   (3)

The two scores in Equation (3) are subtracted to give an overall measure of the likelihood that speaker S_i produced the observation O, as shown in Equation (4). The subtraction of the background score provides a normalisation of the complete speaker score, D_i. D_i is calculated for each of the N speakers and O is identified as speaker S_i using the maximum value of D_i, i = 1 ... N.

$D_i = \log P(O/S_i) - \log P(O/B)$   (4)
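A minimal sketch of this scoring scheme is given below, assuming the per-word log-likelihoods log P(O_k/S_ik) and log P(O_k/B_k) have already been produced by the word HMMs (e.g. by Viterbi decoding in HTK); the function and variable names are illustrative only.

```python
import numpy as np

def speaker_score(word_loglik_speaker, word_loglik_background, frame_lengths):
    """Background-normalised speaker score D_i of Equations (3) and (4).

    word_loglik_speaker    : array (M,) of log P(O_k / S_ik), one per word.
    word_loglik_background : array (M,) of log P(O_k / B_k), one per word.
    frame_lengths          : array (M,) of frame counts F_k for each word.
    """
    log_p_speaker = np.mean(word_loglik_speaker / frame_lengths)        # Equation (3), speaker
    log_p_background = np.mean(word_loglik_background / frame_lengths)  # Equation (3), background
    return log_p_speaker - log_p_background                             # Equation (4)

def identify(scores_per_speaker):
    """Return the index of the speaker with the maximum score D_i."""
    return int(np.argmax(scores_per_speaker))

# Hypothetical usage for N speakers and M = 7 words:
# D = [speaker_score(ll_spk[i], ll_bkg, F) for i in range(N)]
# identified = identify(D)
```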
5 Audio-Visual Integration

The two main problems concerning AV integration are when and how the integration should take place. Integration can take place at three levels: early, middle and late [6]. Only early and late integration are discussed in this study.

Early Integration (EI). The audio and visual modality features are combined and then used for training and testing of a single AV classifier. The visual frame rate is first synchronised with the audio frame rate. Equation (5) and Fig. 2 show how the synchronous visual feature vector is concatenated with the audio feature vector.

$o_n^{\{AV\}} = [\,o_n^{\{A\}},\, o_n^{\{V\}}\,], \quad 1 \le n \le N_A$   (5)

Fig. 2. Audio, delta(5,10) visual case and AV feature blocks: the 9 static and 9 delta audio coefficients are concatenated with the 15 delta-5 and 15 delta-10 visual coefficients.

This method of integration has several disadvantages. The quality of the audio or visual mode data is not taken into account, resulting in an equal weighting within the AV feature vector. The feature vector also has a higher dimension, requiring more training data, which is a problem when training speaker-dependent models. However, EI has the advantage that it is easy to implement in both training and classification.

Late Integration (LI). LI requires two independent classifiers to be trained, one for each mode. For speaker ID there are two options for the point at which the speaker scores are late-integrated: the Viterbi word scores may be integrated, or the scores according to Equation (3) may be integrated. The advantages of late integration include the ability to account for mode reliabilities, small feature vector dimensions and the ease of adding other modes to the system. For LI the two scores are weighted to account for the reliability of the modes, and they may be integrated via addition or multiplication. Equation (6) shows the use of weights for the case of additive integration, where λ_A is the weight of the audio score. The audio score can be late-integrated with either the asynchronous or the synchronous visual score. Prior to LI the audio and visual scores are normalised.

$P_{AV}(O/S_i) = \lambda_A P_A(O/S_i) + (1 - \lambda_A) P_V(O/S_i)$   (6)

6 Experiments

Left-to-right HMMs were used in the classification experiments. The models were trained using the Baum-Welch algorithm and tested using the Viterbi algorithm [14], implemented using the HMM toolkit HTK [20]. The audio features were calculated using HTK. The seven background models were trained using three sessions and tested using one session. This gave 3×N (756) training examples per background model. To verify that the background models were trained correctly and to test the fusion methodologies, speech recognition experiments were carried out. A six-state, two-mixture HMM topology was used for the audio and EI AV models. A one-state, one-mixture topology was used for the asynchronous models and a six-state, two-mixture topology for the synchronous models. Each model was tested N times to give 7×N (1764) word tests in total, where N = 252, the number of speakers tested.

Speaker ID experiments were carried out for the N subjects. A one-state, one-mixture HMM topology was used for the audio, asynchronous visual and EI AV modes. A one-state, two-mixture HMM topology was used for the synchronous visual mode. These HMM topologies, which gave the best results, were found by exhaustive search. The first three sessions were used for training and the fourth session was used for testing. Two sets of synchronous visual feature k values, (3,6) and (5,10), and one set of asynchronous visual feature k values, (1,2), were tested. Additive white Gaussian noise was applied to the clean audio at signal-to-noise ratios (SNR) ranging from 48 dB to -12 dB in steps of 6 dB. All models were trained using clean speech and tested at the various SNR values. Optimum λ_A values were determined by exhaustive search for each noise level, testing λ_A values from 0 to 1 in steps of 0.01.
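The late-integration fusion of Equation (6) and the exhaustive search over λ_A described above can be sketched as follows. The min-max normalisation of the scores and all names are assumptions for illustration; the paper states only that the scores are normalised prior to LI.

```python
import numpy as np

def late_integration(audio_scores, visual_scores, lambda_a):
    """Additive late integration of normalised speaker scores, per Equation (6).

    audio_scores, visual_scores : arrays (N,) of scores, one per enrolled speaker.
    lambda_a                    : audio weight in [0, 1].
    """
    def normalise(s):
        # Assumed min-max normalisation of each modality's scores before fusion.
        return (s - s.min()) / (s.max() - s.min() + 1e-12)
    a, v = normalise(audio_scores), normalise(visual_scores)
    return lambda_a * a + (1.0 - lambda_a) * v

def best_audio_weight(audio_scores, visual_scores, true_speakers, step=0.01):
    """Exhaustive search for the audio weight maximising identification accuracy.

    audio_scores, visual_scores : arrays (num_tests, N) of per-test speaker scores.
    true_speakers               : array (num_tests,) of correct speaker indices.
    """
    best_acc, best_lambda = -1.0, 0.0
    for lam in np.arange(0.0, 1.0 + step, step):
        fused = np.array([late_integration(a, v, lam)
                          for a, v in zip(audio_scores, visual_scores)])
        acc = np.mean(np.argmax(fused, axis=1) == true_speakers)
        if acc > best_acc:
            best_acc, best_lambda = acc, lam
    return best_lambda, best_acc
```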
7 Results and Discussion

Table 1 shows the experimental results for clean audio. The audio word recognition performed extremely well, with an accuracy of 99.04%, which verifies that the seven background word models were trained correctly. The asynchronous visual word recognition performed poorly (26.81%). This may be due to the low number of states and mixtures employed (both one), a consequence of the low number of asynchronous frames per word. Interpolation of the asynchronous visual frames has the effect of increasing the amount of training data. This permits better training of the HMMs, which resulted in a better visual accuracy of 52.84%. The two synchronous visual speaker ID results are similar, which suggests that further improvement in the AV system depends on the integration rather than on the features employed.

Table 1. Word recognition and speaker ID results for clean audio

Classifier modality     Visual features   Word recognition (%)   Speaker identification (%)
Audio                   N/A               99.04                  86.91
Synchronous visual      delta(5,10)       52.5                   57.14
Audio-visual (EI)       delta(5,10)       95.07                  80.16
Synchronous visual      delta(3,6)        52.84                  53.97
Audio-visual (EI)       delta(3,6)        97.17                  80.56
Asynchronous visual     delta(1,2)        26.81                  55.56

Fig. 3a shows the results for EI. There may be several reasons why EI performed so poorly. The visual features may not have been synchronised properly with the audio features; this may have occurred when visual frames were dropped to calculate the delta features, or because of the overlapping audio frames. Another reason for the poor EI performance may be the lack of training data for the dimensionally larger AV feature vectors. Fig. 3b shows the results of LI speaker ID. The AV LI scores are synergistic, giving a significant improvement over the audio case.

Fig. 3a. EI speaker ID rates (%) versus SNR (dB) for 252 speakers using delta(5,10) visual features (audio, asynchronous visual, synchronous visual and AV curves). Fig. 3b. LI speaker ID rates (%) versus SNR (dB) using delta(5,10) visual features, additive Viterbi-score LI of normalised scores (audio, synchronous visual, asynchronous visual, audio-synchronous visual and audio-asynchronous visual curves).

Fig. 4 shows how the audio weights varied with SNR. The continuous line shows the audio weights that gave the best LI results. The general profile is as expected, with higher SNRs requiring higher audio weights and vice versa. The vertical error bars show the audio weights that gave an LI score within 98% of the maximum score. This shows that some flexibility is permitted in the audio weights, which should be kept in mind when implementing adaptable audio weights.

Fig. 4a. Audio weights versus SNR (dB) for the synchronous visual case of Fig. 3b. Fig. 4b. Audio weights versus SNR (dB) for the asynchronous visual case of Fig. 3b.

8 Further Developments and Conclusion

The LI results based on the use of dynamic features are good. They show that the addition of the dynamic visual mode not only increases performance at low SNR values but also increases performance for clean audio, giving a speaker ID system of higher accuracy and greater robustness to audio noise. It was expected that the EI results were poor due to the lack of training data. However, both EI and LI speech recognition had 3×N training samples per model and synergistic EI was not achieved (see Table 1). This suggests that the use of more training data alone may not yield synergistic EI; further analysis of the feature extraction methods and of the AV feature synchronisation may be required. For an AV system that is robust to real-world conditions, it is not sufficient to prove robustness to audio noise alone. Robustness to visual mode degradation is also necessary.
Effects of visual degradation, such as frame rate decimation, noise and compression artifacts [13], have not been widely reported in the literature. It is expected that frame rate decimation would affect the dynamic visual features more than other visual degradations. Further image pre-processing may yield higher visual accuracies. ROI down-sampling may further compact the visual feature vector and may improve the EI results, due to the reduced amount of training data required. In conclusion, the results show that the addition of the dynamic visual information improves the speaker ID accuracies for both clean and noisy audio conditions compared to the audio-only case.

9 Acknowledgements

This work was supported by Enterprise Ireland's IP2000 program.

References

1. Brunelli, R. and Falavigna, D.: Person identification using multiple cues. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 17, no. 10, pp. 955-966, Oct. 1995
2. Brunelli, R. and Poggio, T.: Face Recognition: Features versus Templates. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 15, no. 10, pp. 1042-1052, 1993
3. Chen, T.: Audiovisual Speech Processing. IEEE Signal Processing Magazine, vol. 18, no. 1, pp. 9-21, Jan. 2001
4. Chibelushi, C. C., Deravi, F., and Mason, J. S. D.: A Review of Speech-Based Bimodal Recognition. IEEE Transactions on Multimedia, vol. 4, no. 1, pp. 23-36, Mar. 2002
5. Davis, S. and Mermelstein, P.: Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Sentences. IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 28, no. 4, pp. 357-366, 1980
6. Lucey, S.: Audio-Visual Speech Processing. PhD thesis, Queensland University of Technology, Brisbane, Australia, Apr. 2002
7. Luettin, J.: Speaker verification experiments on the XM2VTS database. IDIAP Communication 98-02, IDIAP, Martigny, Switzerland, Aug. 1999
8. Luettin, J. and Maitre, G.: Evaluation Protocol for the XM2VTSDB Database (Lausanne Protocol). IDIAP Communication 98-05, IDIAP, Martigny, Switzerland, Oct. 1998
9. Matthews, I., Cootes, T. F., Bangham, J. A., Cox, J. A., and Harvey, R.: Extraction of visual features for lipreading. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 2, pp. 198-213, Feb. 2002
10. McGurk, H. and MacDonald, J.: Hearing Lips and Seeing Voices. Nature, vol. 264, pp. 746-748, Dec. 1976
11. Messer, K., Matas, J., Kittler, J., Luettin, J., and Maitre, G.: XM2VTSDB: The Extended M2VTS Database. Proceedings of the Second International Conference on Audio- and Video-based Biometric Person Authentication (AVBPA'99), Washington D.C., pp. 72-77, Mar. 1999
12. Netravali, A. N. and Haskell, B. G.: Digital Pictures. Plenum Press, pp. 408-416, 1998
13. Potamianos, G., Graf, H., and Cosatto, E.: An Image Transform Approach for HMM Based Automatic Lipreading. Proceedings of the IEEE International Conference on Image Processing, Chicago, vol. 3, pp. 173-177, 1998
14. Rabiner, L. R.: A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proceedings of the IEEE, vol. 77, no. 2, pp. 257-286, Feb. 1989
15. Ramachandran, R. P., Zilovic, M. S., and Mammone, R. J.: A Comparative Study of Robust Linear Predictive Analysis Methods with Applications to Speaker Identification. IEEE Transactions on Speech and Audio Processing, vol. 3, no. 2, pp. 117-125, Mar. 1995
16. Reynolds, D. A. and Rose, R. C.: Robust Text-Independent Speaker Identification Using Gaussian Mixture Speaker Models. IEEE Transactions on Speech and Audio Processing, vol. 3, no. 1, Jan. 1995
17. Scanlon, P. and Reilly, R.: Visual Feature Analysis for Automatic Speechreading. DSP Research Group, UCD, Dublin, Ireland, 2001
18. Silsbee, P. and Bovik, A.: Computer Lipreading for Improved Accuracy in Automatic Speech Recognition. IEEE Transactions on Speech and Audio Processing, vol. 4, no. 5, pp. 337-350, Sept. 1996
19. Yacoub, S. B. and Luettin, J.: Audio-Visual Person Verification. IDIAP Communication 98-18, IDIAP, Martigny, Switzerland, Nov. 1998
20. Young, S., Evermann, G., Kershaw, D., Moore, G., Odell, J., Ollason, D., Valtchev, V., and Woodland, P.: The HTK Book (for HTK Version 3.1). Microsoft Corporation and Cambridge University Engineering Department, Nov. 2001