Audio-Visual Speaker Identification Based on the Use of 
Dynamic Audio and Visual Features 
Niall Fox1, Richard B. Reilly1 
1 Dept. of Electronic and Electrical Engineering, 
University College Dublin, Belfield, Dublin 4, Ireland. 
{niall.fox, richard.reilly}@ee.ucd.ie 
Abstract: This paper presents a speaker identification system based on 
dynamical features of both the audio and visual modes. Speakers are modeled 
using a text dependent HMM methodology. Early and late audio-visual 
integration are investigated. Experiments are carried out for 252 speakers from 
the XM2VTS database. From our experimental results, it has been shown that 
the addition of the dynamical visual information improves the speaker 
identification accuracies for both clean and noisy audio conditions compared to 
the audio only case. The best audio, visual and audio-visual identification 
accuracies achieved were 86.91%, 57.14% and 94.05% respectively. 
1 Introduction 
Recently there has been significant interest in multi-modal human computer 
interfaces, especially audio-visual (AV) systems for applications in areas such as 
banking, and security systems [1], [3]. It is known that humans perceive in a multi-
modal manner, and the McGurk effect demonstrates this fact [10]. People with 
impaired hearing use lip-reading to complement information gleaned from their 
perceived degraded audio signal. Indeed, synergistic integration has already been 
achieved for the purpose of AV speech recognition [18]. Previous work in this area is 
usually based on either the use of audio [16], [15] or static facial images (face 
recognition) [2]. Previous multi-modal AV systems predominantly use the static 
facial image as the visual mode rather than the dynamical visual features [1], [19]. It is 
expected that the addition of the dynamical visual mode should complement the audio 
mode, increase the reliability for noisy conditions and even increase the identification 
rates for clean conditions. Also, it would be increasingly difficult for an imposter to 
impersonate both audio and dynamical visual information simultaneously. Recently, 
some work has been carried out on the use of the dynamical visual mode for the purpose 
of speech recognition [9], [17]. Progress in speech based bimodal recognition is 
documented in [4]. The aim of the current study was to implement and compare 
various methods of integrating both dynamic visual and audio features for the purpose 
of speaker identification (ID) and to achieve a more reliable and secure system 
compared to the audio only case. 
 
 
2 Audio and Visual Segmentation 
The XM2VTS database [8], [11] was used for the experiments described in this paper. 
The database consists of video data recorded from 295 subjects in four sessions, 
spaced monthly. The first recording per session of the third sentence (“Joe took 
father’s green shoe bench out”) was used for this research. The audio files were 
manually segmented into the seven words. The audio segmentation times were 
converted into visual frame numbers, to carry out visual word segmentation. Some 
sentences had the start of “Joe” clipped or missing entirely. Due to this and other 
errors in the sentences, only 252 out of a possible 295 subjects were used for our 
experiments. Visual features were extracted from the mouth ROI (region of interest). 
This ROI was segmented manually by locating the two labial corners. A 98×98 pixel 
block was extracted as the ROI. Manual segmentation was only carried out for every 
10th frame, and the ROI coordinates for the intermediate frames were interpolated. 
3 Audio and Visual Feature Extraction 
The audio signal was first pre-emphasised to increase the acoustic power at higher 
frequencies using the filter H(z) = 1 − 0.97z⁻¹. The pre-emphasised signal was 
divided into frames using a Hamming window of length 20 ms, with an overlap of 10 ms, 
to give an audio frame rate, FA, of 100 Hz. Mel-frequency cepstral coefficients 
(MFCCs) [5] of dimension 8 were extracted from each frame. The energy [20] of 
each frame was also calculated and used as a 9th static feature. Nine first order 
differences, or delta features, were calculated between adjacent frames and appended 
to the static audio features to give an audio feature vector of dimension 18. 
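As a concrete illustration of this front end, the sketch below implements the pre-emphasis, the 20 ms / 10 ms Hamming-window framing and the first order delta computation in NumPy. The MFCC and energy computation itself is assumed to come from an external routine (the paper computed the audio features with HTK), so `mfcc_plus_energy` in the usage comment is a hypothetical placeholder, not part of the original system.

```python
import numpy as np

def frame_signal(x, fs, win_ms=20, hop_ms=10):
    """Pre-emphasise and split a speech signal into Hamming-windowed frames.

    A sketch of the front end described above: 20 ms windows with a 10 ms hop,
    giving an audio frame rate FA of 100 Hz.
    """
    x = np.append(x[0], x[1:] - 0.97 * x[:-1])      # pre-emphasis, H(z) = 1 - 0.97 z^-1
    win, hop = int(fs * win_ms / 1000), int(fs * hop_ms / 1000)
    n_frames = 1 + (len(x) - win) // hop
    idx = np.arange(win)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx] * np.hamming(win)                 # shape: (n_frames, win)

def add_deltas(static):
    """Append first order differences between adjacent frames (9 static + 9 delta = 18 dims)."""
    delta = np.diff(static, axis=0, prepend=static[:1])
    return np.hstack([static, delta])

# Usage sketch: 'mfcc_plus_energy' (8 MFCCs + energy per frame) is assumed to come
# from an external MFCC routine such as HTK's; it is not defined here.
# features = add_deltas(mfcc_plus_energy)          # (n_frames, 18)
```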
Transform based features were used to represent the visual information based on 
the Discrete Cosine Transform (DCT) because of its high energy compaction [12]. 
The 98×98 colour pixel blocks were converted to gray scale values. No further image 
pre-processing was implemented, and the DCT was applied to the gray scale pixel 
blocks. The first 15 coefficients were used, taken in a zig-zag pattern. Calculating 
the difference of the DCT coefficients over k frames forms the visual feature vector. 
This was carried out for two values of k giving a visual feature vector of dimension 
30. The static coefficients were discarded. The values of k depended on the visual 
feature frame rate. The visual features can have one of two frame rates: 
1) Asynchronous visual features have a frame rate, FV, of 25 fps or, 
equivalently, 25 Hz, i.e. that of the video sequence. The optimum values of k used 
were determined empirically to be 1 and 2. 
2) Synchronous visual features have a frame rate of 100 fps, i.e. that of the 
audio features. Since this frame rate is higher than in the asynchronous case, the values 
of k must be higher to give the same temporal difference. Two sets of synchronous k 
values, (3,6) and (5,10), were tested. In general, delta(k1,k2) refers to the use of the k 
values k1 and k2 for the calculation of the visual feature vector, where k2 > k1. 
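To make the visual feature extraction concrete, here is a minimal sketch (not the authors' code) that computes the 2-D DCT of a grayscale ROI, keeps the first 15 coefficients in zig-zag order and forms the delta(k1,k2) differences. With 15 coefficients the zig-zag selection amounts to the first five anti-diagonals of the DCT matrix, so the within-diagonal ordering does not change which coefficients are kept.

```python
import numpy as np
from scipy.fft import dctn  # 2-D DCT (type II)

def top_zigzag_dct(roi_gray, n_coeffs=15):
    """First n_coeffs 2-D DCT coefficients of a grayscale ROI, in zig-zag order."""
    coeffs = dctn(roi_gray.astype(float), norm='ortho')
    h, w = coeffs.shape
    # zig-zag: walk the anti-diagonals, alternating direction on odd/even diagonals
    order = sorted(((r, c) for r in range(h) for c in range(w)),
                   key=lambda rc: (rc[0] + rc[1],
                                   rc[0] if (rc[0] + rc[1]) % 2 else -rc[0]))
    return np.array([coeffs[r, c] for r, c in order[:n_coeffs]])

def delta_visual_features(T, k1, k2):
    """Form o_m = [T_m - T_{m-k1}, T_m - T_{m-k2}] for every frame m >= k2.

    T is an (n_frames, 15) array of per-frame zig-zag DCT coefficients; the
    first k2 frames are consumed by the differences, so k2 frames are dropped.
    """
    n = len(T)
    return np.hstack([T[k2:] - T[k2 - k1:n - k1], T[k2:] - T[:n - k2]])
```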
A sentence observation, O = O1 … Ok … OM, consists of M words, where M = 7 
here. A particular AV word, Ok, has NA audio frames and NV visual frames. In 
general NA ≠ 4×NV, even though FA = 4×FV. This is due to the fact that when NA and 
NV were determined, the initial frame number and final frame number values were 
rounded down and up respectively. A sequence of audio and synchronous visual 
frame observations is given by Equation (1). When the visual features are calculated 
according to Equation (2), k2 frames are dropped. In Equation (2), o_n^{V} refers to the 
n-th synchronous visual feature vector of dimension 30, and T_m refers to the top 15 
DCT transform coefficients of the m-th interpolated visual frame. Hence, to ensure that 
there are NA visual feature frames, the NV DCT visual frames are interpolated to NA + 
k2 frames (refer to Fig. 1). 
\[ O_k^{\{i\}} = o_1^{\{i\}} \, o_2^{\{i\}} \cdots o_{N_A}^{\{i\}}, \qquad i \in \{A, V\}. \tag{1} \]

\[ o_n^{\{V\}} = \bigl[\, T_m - T_{m-k_1},\; T_m - T_{m-k_2} \,\bigr], \qquad 1 \le n \le N_A,\;\; 1 + k_2 \le m \le N_A + k_2. \tag{2} \]
 
[Fig. 1 diagram: NA audio feature frames (100 Hz, 10 ms spacing) and NV visual DCT frames (25 Hz, 40 ms spacing); the DCT coefficients are interpolated to NA + k2 visual DCT frames, from which the NA synchronous visual feature frames (100 Hz) are calculated using the differences k1 and k2. Asynchronous visual features are calculated directly from the 25 Hz DCT coefficients.] 
Fig. 1. Frame interpolation and visual feature calculation for a specific word consisting of NA 
audio frames 
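The sketch below mirrors Fig. 1 and Equation (2): the NV DCT frames of a word are interpolated (per coefficient) up to NA + k2 frames, and the differences then yield exactly NA synchronous visual feature vectors. The reading m = n + k2 (i.e. that the dropped k2 frames are the leading ones) is my interpretation of Equation (2), and linear interpolation is likewise an assumption, since the paper does not state the interpolation method.

```python
import numpy as np

def synchronous_visual_features(T_video, n_audio_frames, k1, k2):
    """Interpolate NV DCT frames (25 Hz) to NA + k2 frames, then apply Equation (2).

    T_video:        (NV, 15) zig-zag DCT coefficients for one word
    n_audio_frames: NA, the number of audio feature frames for the word
    Returns an (NA, 30) array of synchronous visual feature vectors.
    """
    n_interp = n_audio_frames + k2
    src = np.linspace(0.0, 1.0, len(T_video))     # original 25 Hz time axis
    dst = np.linspace(0.0, 1.0, n_interp)         # denser, audio-rate time axis
    T = np.column_stack([np.interp(dst, src, T_video[:, j])
                         for j in range(T_video.shape[1])])
    # Equation (2) with m = n + k2: the first k2 interpolated frames are dropped
    return np.hstack([T[k2:] - T[k2 - k1:n_interp - k1], T[k2:] - T[:n_interp - k2]])
```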
4 Speaker Identification and Hidden Markov Modeling 
Speaker ID is discussed in this paper as opposed to speaker verification. To test the 
importance of integration based on the use of dynamic audio and visual features, a 
text dependent speaker ID methodology was used. For text dependent modeling [6], 
the speaker says the same utterance for both training and testing. It was employed, as 
opposed to text independent modeling [16], because of the database used in this study. 
Also, text independence has been found to give lower performance than text 
dependence [7]. 
Each word consists of NA or NV frame observations, as given above by Equation (1). 
Speaker Si is represented by M speaker dependent word models, Sik, for i = 1 … N, k = 
1 … M, where N = 252 and M = 7 here. There are M background models, Bk. Three 
 
 
sessions were used for training and one session for testing. The M background 
speaker independent word HMMs were trained using three of the sessions for all the 
speakers. These background models capture the AV speech variation over the entire 
database. Since there were only three training word utterances per speaker, there was 
insufficient training data to train a speaker dependent HMM initialised from a prototype 
model. Hence, the background word models were used to initialise the training of the 
speaker dependent word models. 
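As a sketch of this bootstrapping step (the paper used HTK; hmmlearn's GMMHMM is used here only to illustrate the idea), a speaker dependent word model can be seeded with the background word model's parameters and then re-estimated on the speaker's own three training utterances:

```python
import copy

def init_speaker_word_model(background_word_hmm, speaker_frames, utterance_lengths):
    """Seed a speaker dependent word HMM from the background word HMM and re-estimate it.

    background_word_hmm: a trained hmmlearn GMMHMM for this word
    speaker_frames:      the speaker's training frames for this word, stacked row-wise
    utterance_lengths:   frames per utterance (three utterances per speaker here)
    """
    model = copy.deepcopy(background_word_hmm)    # start from the background parameters
    model.init_params = ''                        # keep them as the Baum-Welch starting point
    model.fit(speaker_frames, utterance_lengths)  # re-estimate on the speaker's own data
    return model
```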
A sentence observation, O, is tested against all N speakers, Si, and the speaker that 
gives the maximum score is chosen as the identified speaker. To score an observation 
O against speaker Si, M separate scores, P(Ok|Sik), are calculated, one for each word in 
O, 1 ≤ k ≤ M. The M separate scores are normalised with respect to the frame length 
of each word by dividing by Fk, and are then summed to give an overall score, P(O|Si), 
as shown in Equation (3). O is also scored against the background models to give an 
additional score, P(O|B), also shown in Equation (3). 
\[ \log P(O \mid S_i) = \frac{1}{M} \sum_{k=1}^{M} \frac{\log P(O_k \mid S_{ik})}{F_k}, \qquad \log P(O \mid B) = \frac{1}{M} \sum_{k=1}^{M} \frac{\log P(O_k \mid B_k)}{F_k}. \tag{3} \]
 
The two scores in Equation (3) are subtracted to give an overall measure of the 
likelihood that speaker Si produced the observation O, as shown in Equation (4). 
The subtraction of the background score provides a normalisation of the complete 
speaker score, Di. Di is calculated for each of the N speakers and O is identified as 
speaker Si using the maximum value of Di, i = 1 … N. 
 
\[ D_i = \log P(O \mid S_i) - \log P(O \mid B). \tag{4} \]
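The identification rule of Equations (3) and (4) can be summarised as in the sketch below; `word_loglik` stands in for the HMM (Viterbi) word score and is an assumed helper, not a function from the paper.

```python
import numpy as np

def identify_speaker(word_obs, speaker_models, background_models, word_loglik):
    """Pick the speaker index maximising D_i = log P(O|S_i) - log P(O|B), Equations (3)-(4).

    word_obs:          list of M per-word observation arrays (frames x features)
    speaker_models:    list over speakers, each a list of M word models
    background_models: list of M background word models
    word_loglik:       assumed helper, word_loglik(model, obs) -> log P(obs | model)
    """
    M = len(word_obs)
    frame_lengths = [len(o) for o in word_obs]                          # F_k
    # background score, Equation (3)
    log_p_bg = np.mean([word_loglik(background_models[k], word_obs[k]) / frame_lengths[k]
                        for k in range(M)])
    scores = []
    for word_models in speaker_models:                                  # one entry per speaker S_i
        log_p_spk = np.mean([word_loglik(word_models[k], word_obs[k]) / frame_lengths[k]
                             for k in range(M)])
        scores.append(log_p_spk - log_p_bg)                             # D_i, Equation (4)
    return int(np.argmax(scores))
```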
5 Audio-Visual Integration 
The two main problems concerning AV integration are when and how the integration 
should take place. Integration can take place at three levels: early, middle and late 
[6]. Only early and late integration are discussed in this study. 
 
Early Integration (EI). The audio and visual modality features are combined, and 
then used for training and testing of a single AV classifier. The visual frame rate is 
first synchronised with the audio frame rate. Equation (5) and Fig. 2 show how the 
synchronous visual feature vector is concatenated to the audio feature vector. 
\[ o_n^{\{AV\}} = \bigl[\, o_n^{\{A\}},\; o_n^{\{V\}} \,\bigr], \qquad 1 \le n \le N_A. \tag{5} \]
[Fig. 2 diagram: the audio feature block (9 static A + 9 ∆A features), the delta(5,10) visual feature block (∆5 V (15) + ∆10 V (15) features) and their concatenation into the AV feature block.] 
Fig. 2. Audio, delta(5,10) visual case and AV feature blocks 
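Early integration then reduces to a frame-wise concatenation of the two synchronous streams, as in Equation (5); a minimal sketch:

```python
import numpy as np

def early_integration(audio_feats, synch_visual_feats):
    """Concatenate frame-synchronous audio (NA x 18) and visual (NA x 30) features, Equation (5)."""
    assert len(audio_feats) == len(synch_visual_feats), "EI needs frame-synchronous streams"
    return np.hstack([audio_feats, synch_visual_feats])   # (NA, 48) AV feature vectors
```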
 
 
This method of integration has several disadvantages. The quality of the audio or 
visual mode data is not taken into account, resulting in an equal weighting within the AV 
feature vector. The feature vector also has a higher dimension, requiring more training data, 
which is a problem for training speaker dependent models. However, EI has the advantage 
that it is easy to implement in both training and classification. 
 
Late Integration (LI). LI requires two independent classifiers to be trained, one 
classifier for each mode. For speaker ID there are two possible points at which to late 
integrate the speaker scores: the Viterbi word scores may be integrated, or the scores 
from Equation (3) may be integrated. The advantages of late integration 
include the ability to account for mode reliabilities, small feature vector dimensions 
and the ease of adding other modes to the system. For LI, the two scores are weighted to 
account for the reliability of the modes. The two scores may be integrated via addition 
or multiplication. Equation (6) shows the use of weights for the case of additive 
integration, where λA is the weight of the audio score. The audio score can be late 
integrated with either the asynchronous or the synchronous visual score. Prior to LI, 
the audio and visual scores are normalised. 
 
\[ P(O \mid S_i)_{AV} = \lambda_A \, P(O \mid S_i)_{A} + (1 - \lambda_A) \, P(O \mid S_i)_{V}. \tag{6} \]
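A minimal sketch of additive LI over the per-speaker scores follows. Z-score normalisation across speakers is an assumption here; the paper states only that the audio and visual scores are normalised prior to LI.

```python
import numpy as np

def late_integration(audio_scores, visual_scores, lambda_a):
    """Additive LI of per-speaker scores, Equation (6):
    P_AV = lambda_A * P_A + (1 - lambda_A) * P_V.

    The scores are normalised across speakers first; z-score normalisation is an
    assumption made for this sketch.
    """
    a = (audio_scores - np.mean(audio_scores)) / np.std(audio_scores)
    v = (visual_scores - np.mean(visual_scores)) / np.std(visual_scores)
    fused = lambda_a * a + (1.0 - lambda_a) * v
    return int(np.argmax(fused)), fused       # identified speaker index, fused scores
```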
 
6 Experiments 
Left-to-right HMMs were used in the classification experiments. The models were 
trained using the Baum-Welch algorithm and tested using the Viterbi algorithm [14], 
implemented using the HMM toolkit, HTK [20]. The audio features were calculated 
using HTK. The seven background models were trained using three sessions and 
tested using one session. This gave 3*N (756) training examples per background 
model. To test that the background models were trained correctly and to test the 
fusion methodologies, speech recognition experiments were carried out. A six state, 
two mixture HMM topology was used for the audio and EI AV models. A one state, 
one mixture topology was used for the asynchronous models and a six state, two 
mixture topology for the synchronous models. Each model was tested N times to give 
7*N (1764) word tests in total, where N = 252, the number of speakers tested. 
Speaker ID experiments were carried out for N subjects. A one state, one mixture 
HMM topology was used for the audio, asynchronous visual and EI AV modes. A one 
state, two mixture HMM topology was used for the synchronous visual mode. These 
HMM topologies, which gave the best results, were found by exhaustive search. The 
first three sessions were used for training and the fourth session was used for testing. 
Two sets of synchronous visual feature k values, (3,6) and (5,10), and one set of 
asynchronous visual feature k values, (1,2), were tested. 
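For illustration, a left-to-right word HMM of the kind described above can be constructed as follows, with hmmlearn's GMMHMM standing in for the HTK models used in the paper; the 0.5/0.5 initial transition split and diagonal covariances are assumptions, not values from the paper.

```python
import numpy as np
from hmmlearn.hmm import GMMHMM

def left_to_right_gmmhmm(n_states=6, n_mix=2):
    """Build a left-to-right word HMM (each state may only self-loop or advance by one)."""
    startprob = np.zeros(n_states)
    startprob[0] = 1.0                                   # always start in the first state
    transmat = np.zeros((n_states, n_states))
    for s in range(n_states):
        if s < n_states - 1:
            transmat[s, s] = transmat[s, s + 1] = 0.5    # stay or move right (assumed split)
        else:
            transmat[s, s] = 1.0                         # final state is absorbing
    model = GMMHMM(n_components=n_states, n_mix=n_mix,
                   covariance_type='diag',               # assumed covariance structure
                   init_params='mcw', params='stmcw')    # keep the topology set below
    model.startprob_ = startprob
    model.transmat_ = transmat
    return model
```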
Additive white Gaussian noise was applied to the clean audio at signal-to-noise 
ratios (SNR) ranging from 48dB to –12dB in steps of 6dB. All models were trained 
using clean speech and tested using the various SNR values. Optimum λA values were 
determined by exhaustive search for each noise level. This was achieved by testing λA 
values from 0 to 1 in steps of 0.01. 
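A sketch of the noise protocol and the weight search is given below; `fuse` can be the additive LI rule sketched earlier, and the way the per-test, per-speaker scores are stored is an assumption of this sketch.

```python
import numpy as np

def add_awgn(clean, snr_db):
    """Add white Gaussian noise to a clean speech signal at a target SNR (dB)."""
    signal_power = np.mean(clean ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    return clean + np.random.normal(0.0, np.sqrt(noise_power), size=clean.shape)

def best_audio_weight(audio_scores, visual_scores, true_ids, fuse):
    """Exhaustively search lambda_A in [0, 1] (step 0.01) for the best ID accuracy.

    audio_scores / visual_scores: per-test arrays of per-speaker scores
    fuse(a, v, lam) -> (predicted_speaker, fused_scores), e.g. the LI sketch above
    """
    best_lam, best_acc = 0.0, -1.0
    for lam in np.arange(0.0, 1.0 + 1e-9, 0.01):
        preds = [fuse(a, v, lam)[0] for a, v in zip(audio_scores, visual_scores)]
        acc = float(np.mean([p == t for p, t in zip(preds, true_ids)]))
        if acc > best_acc:
            best_lam, best_acc = lam, acc
    return best_lam, best_acc                 # optimum lambda_A and its accuracy
```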
 
 
7 Results and Discussion 
Table 1 shows the experimental results using audio. The audio word recognition 
performed extremely well, with an accuracy of 99.04%. This verifies that the seven 
background word models were trained correctly. The asynchronous word recognition 
performed poorly, at 26.81%. This may be due to the low number of states and mixtures 
employed (both one), a consequence of the low number of asynchronous frames per word. 
Interpolation of the asynchronous visual frames has the effect of increasing the 
amount of training data. This permits better training of the HMMs, which resulted in 
a better visual accuracy of 52.84%. In both cases of synchronous visual speaker ID 
the results are similar, which suggests that further improvement in the AV system 
depends on the integration and not on the features employed. 
Table 1. Word recognition and speaker ID results for clean audio 

Classifier Modality    | Visual Features | Word Recognition (%) | Speaker Identification (%)
Audio                  | N/A             | 99.04                | 86.91
Synchronous Visual     | delta(5,10)     | 52.5                 | 57.14
Audio-Visual (EI)      | delta(5,10)     | 95.07                | 80.16
Synchronous Visual     | delta(3,6)      | 52.84                | 53.97
Audio-Visual (EI)      | delta(3,6)      | 97.17                | 80.56
Asynchronous Visual    | delta(1,2)      | 26.81                | 55.56
 
Fig. 3a shows the results for EI. There may be several reasons why the EI 
performed so poorly. The visual features may not have been synchronised properly 
with the audio features. This may have occurred when visual frames were dropped to 
calculate the delta features or because of the overlapping audio frames. Another 
reason for poor EI performance may be the lack of training data for the dimensionally 
larger AV feature vectors. Fig. 3b shows the results of LI speaker ID. The AV LI 
scores are synergistic, giving a significant improvement over the audio case. 
 
[Fig. 3 plots: speaker identification score (%) versus SNR (dB) for 252 speakers. Left panel (EI): Audio, Asynch-Visual, Synch-Visual and A-V curves. Right panel (LI): Audio, SynchV, AsynchV, Audio-SynchV and Audio-AsynchV curves.] 
Fig. 3a. EI speaker ID rates using delta(5,10) visual features. Fig. 3b. LI speaker ID rates 
using delta(5,10) visual features, additive Viterbi score LI of normalised scores 
 
Fig. 4 shows how the audio weights varied with SNR. The continuous line shows 
the audio weights that gave the best LI results. The general profile is as expected, with 
higher SNRs requiring higher audio weights and vice versa. The vertical error bars 
show the audio weights that gave an LI score within a range of 98% of the maximum 
 
 
score. This shows that some flexibility is permitted in the audio weights and this 
should be kept in mind when implementing adaptable audio weights. 
 
[Fig. 4 plots: optimum audio weight versus SNR (dB). Left panel: Audio Weights vs SNR for SynchV. Right panel: Audio Weights vs SNR for AsynchV.] 
Fig. 4a. Audio weights for synchronous visual of Fig. 3b. Fig. 4b. Audio weights for 
asynchronous visual of Fig. 3b 
8 Further Developments and Conclusion 
LI based on the use of dynamic features performed well. The LI 
results show that the addition of the dynamic visual mode not only improves the 
accuracies at low SNR values but also improves the accuracy for clean audio, giving a 
speaker ID system of higher accuracy and greater robustness to audio noise. It was 
suspected that the EI results were poor due to a lack of training data. However, both the 
EI and LI speech recognition models had 3*N training samples per model, and synergistic EI 
was not achieved (see Table 1). This would suggest that the use of more training data 
may not yield synergistic EI. To achieve synergistic EI, further analysis of the feature 
extraction methods and AV feature synchronisation may be required. 
For an AV system that is robust to real world conditions, it is not sufficient to 
prove its robustness to audio noise alone. Robustness to visual mode degradation is 
also necessary. Effects of visual degradation, such as frame rate decimation, noise and 
compression artifacts [13], have not been reported widely in the literature. It is 
expected that frame rate decimation would affect the dynamic visual features more 
than other visual degradations. Further image pre-processing may yield higher visual 
accuracies. ROI down-sampling may further compact the visual feature vector and 
may improve the EI results, due to the reduced amount of training data required. 
In conclusion, the results show that the addition of the dynamical visual 
information improves the speaker ID accuracies for both clean and noisy audio 
conditions compared to the audio only case. 
9 Acknowledgements 
This work was supported by Enterprise Ireland's IP2000 program. 
 
 
References 
1. Brunelli, R. and Falavigna, D.: Person identification using multiple cues. IEEE 
Transactions on Pattern Analysis and Machine Intelligence, vol. 17, no. 10, pp. 955-966, 
Oct.1995 
2. Brunelli, R. and Poggio, T.: Face Recognition: Features versus Templates. IEEE 
Transactions on Pattern Analysis and Machine Intelligence, vol. 15, no. 10, pp. 1042-
1052, 1993 
3. Chen, T.: Audiovisual Speech Processing. IEEE Signal Processing Magazine, vol. 18, no. 
1, pp. 9-21, Jan.2001 
4. Chibelushi, C. C., Deravi, F., and Mason, J. S. D.: A Review of Speech-Based Bimodal 
Recognition. IEEE Transaction on Multimedia, vol. 4, no. 1, pp. 23-36, Mar.2002 
5. Davis, S. and Mermelstein, P.: Comparison of Parametric Representations for 
Monosyllabic Word Recognition in Continuously Spoken Sentences. IEEE Transactions 
on Acoustics, Speech, and Signal Processing, vol. 28, no. 4, pp. 357-366, 1980. 
6. Lucey, S.: Audio-Visual Speech Processing. PhD thesis, Queensland University of 
Technology, Brisbane, Australia, Apr.2002 
7. Luettin J.: Speaker verification experiments on the XM2VTS database. In IDIAP 
Communication 98-02, IDIAP, Martigny, Switzerland, Aug.1999 
8. Luettin, J. and Maitre, G.: Evaluation Protocol for the XM2VTSDB Database (Lausanne 
Protocol). In IDIAP Communication 98-05, IDIAP, Martigny, Switzerland, Oct.1998 
9. Matthews, I., Cootes, T. F., Bangham, J. A., Cox, J. A., and Harvey, R.: Extraction of 
visual features for lipreading. IEEE Transactions on Pattern Analysis and Machine 
Intelligence, vol. 24, no. 2, pp. 198-213, Feb.2002 
10. McGurk, H. and MacDonald, J.: Hearing Lips and Seeing Voices. Nature, vol. 264, pp. 
746-748, Dec.1976 
11. Messer, K., Matas, J., Kittler, J., Luettin J., and Maitre, G.: XM2VTSDB: The Extended 
M2VTS Database. The Proceedings of the Second International Conference on Audio and 
Video-based Biometric Person Authentication (AVBPA'99), Washington D.C., pp. 72-77, 
Mar.1999 
12. Netravali, A. N. and Haskell, B. G.: Digital Pictures. Plenum Press, pp. 408-416, 1998 
13. Potamianos, G., Graf, H., and Cosatto, E.: An Image Transform Approach for HMM 
Based Automatic Lipreading. Proceedings of the IEEE International Conference on Image 
Processing, Chicago, vol. 3, pp. 173-177, 1998 
14. Rabiner, L. R.: A Tutorial on Hidden Markov Models and Selected Applications in Speech 
Recognition. Proceedings of the IEEE, vol. 77, no. 2, pp. 257-286, Feb.1989 
15. Ramachandran, R. P., Zilovic, M. S., and Mammone, R. J.: A Comparative Study of 
Robust Linear Predictive Analysis Methods with Applications to Speaker Identification. 
IEEE Transactions on Speech and Audio Processing, vol. 3, no. 2, pp. 117-125, Mar.1995 
16. Reynolds, D. A. and Rose, R. C.: Robust Text-Independent Speaker Identification Using 
Gaussian Mixture Speaker Models. IEEE Transactions on Speech and Audio Processing, 
vol. 3, no. 1, Jan.1995 
17. Scanlon, P. and Reilly, R.: Visual Feature Analysis For Automatic Speechreading. DSP 
Research Group, UCD, Dublin, Ireland, 2001 
18. Silsbee, P. and Bovik, A.: Computer Lipreading for Improved Accuracy in Automatic 
Speech Recognition. IEEE Transactions on Speech and Audio Processing, vol. 4, no. 5, 
pp. 337-350, Sept. 1996 
19. Yacoub, S. B. and Luettin, J.: Audio-Visual Person Verification. In IDIAP Communication 
98-18, IDIAP, Martigny, Switzerland, Nov.1998 
20. Young, S., Evermann, G., Kershaw, D., Moore, G., Odell, J., Ollason, D., Valtchev, V., 
and Woodland, P.: The HTK Book (for HTK Version 3.1). Microsoft Corporation, 
Cambridge University Engineering Department, Nov.2001 