Audio-Visual Speaker Identification Based on the Use of Dynamic Audio and Visual Features

Niall Fox, Richard B. Reilly
Dept. of Electronic and Electrical Engineering, University College Dublin, Belfield, Dublin 4, Ireland
{niall.fox, richard.reilly}@ee.ucd.ie

Abstract: This paper presents a speaker identification system based on dynamic features of both the audio and visual modes. Speakers are modeled using a text-dependent HMM methodology. Early and late audio-visual integration are investigated. Experiments were carried out for 252 speakers from the XM2VTS database. The experimental results show that the addition of the dynamic visual information improves speaker identification accuracy for both clean and noisy audio conditions compared to the audio-only case. The best audio, visual and audio-visual identification accuracies achieved were 86.91%, 57.14% and 94.05% respectively.

1 Introduction

Recently there has been significant interest in multi-modal human computer interfaces, especially audio-visual (AV) systems for applications in areas such as banking and security [1], [3]. It is known that humans perceive in a multi-modal manner, as the McGurk effect demonstrates [10]. People with impaired hearing use lip-reading to complement the information gleaned from their perceived, degraded audio signal. Indeed, synergistic integration has already been achieved for the purpose of AV speech recognition [18]. Previous work in this area is usually based on either audio alone [16], [15] or static facial images (face recognition) [2]. Previous multi-modal AV systems predominantly use the static facial image as the visual mode rather than dynamic visual features [1], [19]. It is expected that the addition of the dynamic visual mode should complement the audio mode, increase reliability under noisy conditions and even increase identification rates under clean conditions. It would also be considerably more difficult for an imposter to impersonate both the audio and the dynamic visual information simultaneously. Recently, some work has been carried out on the use of the dynamic visual mode for the purpose of speech recognition [9], [17]. Progress in speech-based bimodal recognition is documented in [4]. The aim of the current study was to implement and compare various methods of integrating dynamic visual and audio features for the purpose of speaker identification (ID), and to achieve a more reliable and secure system compared to the audio-only case.

2 Audio and Visual Segmentation

The XM2VTS database [8], [11] was used for the experiments described in this paper. The database consists of video data recorded from 295 subjects in four sessions, spaced monthly. The first recording per session of the third sentence ("Joe took father's green shoe bench out") was used for this research. The audio files were manually segmented into the seven words. The audio segmentation times were converted into visual frame numbers in order to carry out visual word segmentation. Some sentences had the start of "Joe" clipped or missing entirely. Due to this and other errors in the sentences, only 252 of the possible 295 subjects were used for the experiments. Visual features were extracted from the mouth region of interest (ROI). This ROI was segmented manually by locating the two labial corners, and a 98×98 pixel block was extracted as the ROI. Manual segmentation was carried out only for every 10th frame, and the ROI coordinates for the intermediate frames were interpolated.
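As a rough illustration of this interpolation step, the sketch below linearly interpolates the manually located ROI coordinates between the annotated frames. It is a minimal sketch only; the function name, the coordinate layout and the use of linear interpolation are assumptions for illustration, not details taken from the original system.

```python
import numpy as np

def interpolate_roi(annotated_frames, annotated_coords, num_frames):
    """Linearly interpolate ROI corner coordinates for unannotated frames.

    annotated_frames : indices of the manually segmented frames (every 10th frame).
    annotated_coords : array (num_annotated, 4) holding, e.g., (x_left, y_left,
                       x_right, y_right) of the two labial corners.
    num_frames       : total number of video frames in the word/sentence.
    """
    frames = np.arange(num_frames)
    coords = np.empty((num_frames, annotated_coords.shape[1]))
    for c in range(annotated_coords.shape[1]):
        coords[:, c] = np.interp(frames, annotated_frames, annotated_coords[:, c])
    return coords

# Hypothetical usage: corners annotated at frames 0, 10 and 20 of a 21-frame clip.
annotated = np.array([0, 10, 20])
corners = np.array([[40.0, 60.0, 138.0, 62.0],
                    [42.0, 61.0, 140.0, 63.0],
                    [41.0, 60.5, 139.0, 62.5]])
roi_coords = interpolate_roi(annotated, corners, 21)
```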
3 Audio and Visual Feature Extraction

The audio signal was first pre-emphasised to increase the acoustic power at higher frequencies using the filter $H(z) = 1 - 0.97z^{-1}$. The pre-emphasised signal was divided into frames using a Hamming window of length 20 ms with an overlap of 10 ms, giving an audio frame rate, F_A, of 100 Hz. Mel-frequency cepstral coefficients (MFCCs) [5] of dimension 8 were extracted from each frame. The energy [20] of each frame was also calculated and used as a ninth static feature. Nine first-order differences, or delta features, were calculated between adjacent frames and appended to the static audio features to give an audio feature vector of dimension 18.

Transform-based features were used to represent the visual information, based on the Discrete Cosine Transform (DCT) because of its high energy compaction [12]. The 98×98 colour pixel blocks were converted to gray scale. No further image pre-processing was implemented, and the DCT was applied to the gray scale pixel blocks. The first 15 coefficients, taken in a zig-zag pattern, were used. The visual feature vector is formed by calculating the difference of the DCT coefficients over k frames. This was carried out for two values of k, giving a visual feature vector of dimension 30. The static coefficients were discarded. The values of k depended on the visual feature frame rate. The visual features can have two frame rates:

1) Asynchronous visual features have a frame rate, F_V, of 25 fps (equivalently 25 Hz), i.e. that of the video sequence. The optimum values of k were determined empirically to be 1 and 2.

2) Synchronous visual features have a frame rate of 100 fps, i.e. that of the audio features. Since this frame rate is higher than in the asynchronous case, the values of k must be higher to give the same temporal difference. Two sets of synchronous k values, (3,6) and (5,10), were tested.

In general, delta(k1,k2) refers to the use of the k values k1 and k2 for the calculation of the visual feature vector, where k2 > k1. A sentence observation, O = O_1 ... O_k ... O_M, consists of M words, where M = 7 here. A particular AV word, O_k, has N_A audio frames and N_V visual frames. In general N_A ≠ 4×N_V, even though F_A = 4×F_V. This is because, when N_A and N_V were determined, the initial and final frame numbers were rounded down and up respectively. A sequence of audio or synchronous visual frame observations is given by Equation (1). When the visual features are calculated according to Equation (2), k2 frames are dropped. In Equation (2), o_n^{V} refers to the n-th synchronous visual feature vector of dimension 30, and T_m refers to the top 15 DCT transform coefficients of the m-th interpolated visual frame. Hence, to ensure that there are N_A visual feature frames, the N_V DCT visual frames are interpolated to N_A + k2 frames (refer to Fig. 1).

$O_k^{\{i\}} = o_1^{\{i\}} \ldots o_n^{\{i\}} \ldots o_{N_A}^{\{i\}}, \quad i \in \{A, V\}$   (1)

$o_n^{\{V\}} = [\,T_m - T_{m-k_1},\; T_m - T_{m-k_2}\,], \quad 1 \le n \le N_A, \; k_2 + 1 \le m \le N_A + k_2$   (2)

Fig. 1. Frame interpolation and visual feature calculation for a specific word consisting of N_A audio frames: the N_V visual DCT frames (25 Hz) are interpolated to N_A + k_2 frames, from which the N_A synchronous visual feature frames (100 Hz) are calculated; the asynchronous visual features are calculated directly from the DCT coefficients.
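To make the feature construction concrete, the sketch below follows the description above: the top 15 DCT coefficients of each gray-scale mouth ROI are taken in zig-zag order, the DCT coefficient frames are interpolated to N_A + k_2 frames, and the two differences of Equation (2) are stacked into a 30-dimensional vector. The function names, the use of linear interpolation of the coefficients and the exact zig-zag convention are assumptions for illustration, not the original implementation.

```python
import numpy as np
from scipy.fft import dctn

def zigzag_top(block, n_coeffs=15):
    """Top n_coeffs 2-D DCT coefficients of a gray-scale ROI block in zig-zag order."""
    coeffs = dctn(block, norm='ortho')
    h, w = coeffs.shape
    order = sorted(((i, j) for i in range(h) for j in range(w)),
                   key=lambda p: (p[0] + p[1],
                                  p[0] if (p[0] + p[1]) % 2 else p[1]))
    return np.array([coeffs[i, j] for (i, j) in order[:n_coeffs]])

def synchronous_visual_features(dct_frames, n_audio_frames, k1=5, k2=10):
    """30-dimensional synchronous visual features per Equation (2).

    dct_frames     : array (N_V, 15) of top-15 DCT coefficients, one row per
                     video frame (25 Hz), e.g. built with zigzag_top().
    n_audio_frames : N_A, the number of audio feature frames (100 Hz) for the word.
    Returns an array of shape (N_A, 30).
    """
    n_v = dct_frames.shape[0]
    # Interpolate the N_V DCT frames to N_A + k2 frames (refer to Fig. 1).
    src = np.linspace(0.0, 1.0, n_v)
    dst = np.linspace(0.0, 1.0, n_audio_frames + k2)
    T = np.stack([np.interp(dst, src, dct_frames[:, c])
                  for c in range(dct_frames.shape[1])], axis=1)
    # Differences over k1 and k2 frames; the first k2 interpolated frames are
    # dropped so that exactly N_A feature vectors remain, as in Equation (2).
    feats = [np.concatenate([T[m] - T[m - k1], T[m] - T[m - k2]])
             for m in range(k2, n_audio_frames + k2)]
    return np.array(feats)
```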
4 Speaker Identification and Hidden Markov Modeling

Speaker ID is discussed in this paper, as opposed to speaker verification. To test the importance of integration based on the use of dynamic audio and visual features, a text-dependent speaker ID methodology was used. In text-dependent modeling [6], the speaker says the same utterance for both training and testing. It was employed, as opposed to text-independent modeling [16], because of the database used in this study. Text independence has also been found to give lower performance than text dependence [7]. Each word consists of N_A or N_V frame observations as given above by Equation (1). Speaker S_i is represented by M speaker-dependent word models, S_ik, for i = 1 ... N, k = 1 ... M, where N = 252 and M = 7 here. There are M background models, B_k. Three sessions were used for training and one session for testing. The M background speaker-independent word HMMs were trained using three of the sessions for all the speakers. These background models capture the AV speech variation over the entire database. Since there were only three training word utterances per speaker, there was insufficient training data to train a speaker-dependent HMM initialized from a prototype model. Hence the background word models were used to initialise the training of the speaker-dependent word models.

A sentence observation, O, is tested against all N speakers, S_i, and the speaker that gives the maximum score is chosen as the identified speaker. To score an observation O against speaker S_i, M separate scores, P(O_k/S_ik), are calculated, one for each word in O, 1 ≤ k ≤ M. The M separate scores are normalised with respect to the frame length of each word by dividing by F_k, and are then summed to give an overall score P(O/S_i), as shown in Equation (3). O is also scored against the background models to give an additional score, P(O/B), also shown in Equation (3).

$\log P(O/S_i) = \frac{1}{M} \sum_{k=1}^{M} \frac{\log P(O_k/S_{ik})}{F_k}, \qquad \log P(O/B) = \frac{1}{M} \sum_{k=1}^{M} \frac{\log P(O_k/B_k)}{F_k}$   (3)

The two scores in Equation (3) are subtracted to give an overall measure of the likelihood that speaker S_i produced the observation O, as shown in Equation (4). The subtraction of the background score provides a normalisation of the complete speaker score, D_i. D_i is calculated for each of the N speakers and O is identified as speaker S_i using the maximum value of D_i, i = 1 ... N.

$D_i = \log P(O/S_i) - \log P(O/B)$   (4)
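A minimal sketch of this scoring scheme is given below, assuming the per-word log-likelihoods log P(O_k/S_ik) and log P(O_k/B_k) have already been produced by the word HMMs (e.g. by Viterbi decoding in HTK); the function and variable names are illustrative only.

```python
import numpy as np

def speaker_score(word_loglik_speaker, word_loglik_background, frame_lengths):
    """Background-normalised speaker score D_i of Equations (3) and (4).

    word_loglik_speaker    : array (M,) of log P(O_k / S_ik), one per word.
    word_loglik_background : array (M,) of log P(O_k / B_k), one per word.
    frame_lengths          : array (M,) of frame counts F_k for each word.
    """
    log_p_speaker = np.mean(word_loglik_speaker / frame_lengths)        # Equation (3), speaker
    log_p_background = np.mean(word_loglik_background / frame_lengths)  # Equation (3), background
    return log_p_speaker - log_p_background                             # Equation (4)

def identify(scores_per_speaker):
    """Return the index of the speaker with the maximum score D_i."""
    return int(np.argmax(scores_per_speaker))

# Hypothetical usage for N speakers and M = 7 words:
# D = [speaker_score(ll_spk[i], ll_bkg, F) for i in range(N)]
# identified = identify(D)
```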
5 Audio-Visual Integration

The two main problems concerning AV integration are when and how the integration should take place. Integration can take place at three levels: early, middle and late [6]. Only early and late integration are discussed in this study.

Early Integration (EI). The audio and visual modality features are combined and then used for training and testing of a single AV classifier. The visual frame rate is first synchronised with the audio frame rate. Equation (5) and Fig. 2 show how the synchronous visual feature vector is concatenated with the audio feature vector.

$o_n^{\{AV\}} = [\,o_n^{\{A\}},\, o_n^{\{V\}}\,], \quad 1 \le n \le N_A$   (5)

Fig. 2. Audio, delta(5,10) visual case and AV feature blocks: the 9 static and 9 delta audio coefficients are concatenated with the 15 delta-5 and 15 delta-10 visual coefficients.

This method of integration has several disadvantages. The quality of the audio or visual mode data is not taken into account, resulting in an equal weighting within the AV feature vector. The feature vector also has a higher dimension, requiring more training data, which is a problem when training speaker-dependent models. However, EI has the advantage that it is easy to implement in both training and classification.

Late Integration (LI). LI requires two independent classifiers to be trained, one for each mode. For speaker ID there are two options for the point at which the speaker scores are late-integrated: the Viterbi word scores may be integrated, or the scores according to Equation (3) may be integrated. The advantages of late integration include the ability to account for mode reliabilities, small feature vector dimensions and the ease of adding other modes to the system. For LI the two scores are weighted to account for the reliability of the modes, and they may be integrated via addition or multiplication. Equation (6) shows the use of weights for the case of additive integration, where λ_A is the weight of the audio score. The audio score can be late-integrated with either the asynchronous or the synchronous visual score. Prior to LI the audio and visual scores are normalised.

$P_{AV}(O/S_i) = \lambda_A P_A(O/S_i) + (1 - \lambda_A) P_V(O/S_i)$   (6)

6 Experiments

Left-to-right HMMs were used in the classification experiments. The models were trained using the Baum-Welch algorithm and tested using the Viterbi algorithm [14], implemented using the HMM toolkit HTK [20]. The audio features were calculated using HTK. The seven background models were trained using three sessions and tested using one session. This gave 3×N (756) training examples per background model. To verify that the background models were trained correctly and to test the fusion methodologies, speech recognition experiments were carried out. A six-state, two-mixture HMM topology was used for the audio and EI AV models. A one-state, one-mixture topology was used for the asynchronous models and a six-state, two-mixture topology for the synchronous models. Each model was tested N times to give 7×N (1764) word tests in total, where N = 252, the number of speakers tested.

Speaker ID experiments were carried out for the N subjects. A one-state, one-mixture HMM topology was used for the audio, asynchronous visual and EI AV modes. A one-state, two-mixture HMM topology was used for the synchronous visual mode. These HMM topologies, which gave the best results, were found by exhaustive search. The first three sessions were used for training and the fourth session was used for testing. Two sets of synchronous visual feature k values, (3,6) and (5,10), and one set of asynchronous visual feature k values, (1,2), were tested. Additive white Gaussian noise was applied to the clean audio at signal-to-noise ratios (SNR) ranging from 48 dB to -12 dB in steps of 6 dB. All models were trained using clean speech and tested at the various SNR values. Optimum λ_A values were determined by exhaustive search for each noise level, testing λ_A values from 0 to 1 in steps of 0.01.
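The late-integration fusion of Equation (6) and the exhaustive search over λ_A described above can be sketched as follows. The min-max normalisation of the scores and all names are assumptions for illustration; the paper states only that the scores are normalised prior to LI.

```python
import numpy as np

def late_integration(audio_scores, visual_scores, lambda_a):
    """Additive late integration of normalised speaker scores, per Equation (6).

    audio_scores, visual_scores : arrays (N,) of scores, one per enrolled speaker.
    lambda_a                    : audio weight in [0, 1].
    """
    def normalise(s):
        # Assumed min-max normalisation of each modality's scores before fusion.
        return (s - s.min()) / (s.max() - s.min() + 1e-12)
    a, v = normalise(audio_scores), normalise(visual_scores)
    return lambda_a * a + (1.0 - lambda_a) * v

def best_audio_weight(audio_scores, visual_scores, true_speakers, step=0.01):
    """Exhaustive search for the audio weight maximising identification accuracy.

    audio_scores, visual_scores : arrays (num_tests, N) of per-test speaker scores.
    true_speakers               : array (num_tests,) of correct speaker indices.
    """
    best_acc, best_lambda = -1.0, 0.0
    for lam in np.arange(0.0, 1.0 + step, step):
        fused = np.array([late_integration(a, v, lam)
                          for a, v in zip(audio_scores, visual_scores)])
        acc = np.mean(np.argmax(fused, axis=1) == true_speakers)
        if acc > best_acc:
            best_acc, best_lambda = acc, lam
    return best_lambda, best_acc
```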
7 Results and Discussion

Table 1 shows the experimental results for clean audio. The audio word recognition performed extremely well, with an accuracy of 99.04%, which verifies that the seven background word models were trained correctly. The asynchronous visual word recognition performed poorly (26.81%). This may be due to the low number of states and mixtures employed (both one), a consequence of the low number of asynchronous frames per word. Interpolation of the asynchronous visual frames has the effect of increasing the amount of training data. This permits better training of the HMMs, which resulted in a better visual accuracy of 52.84%. The two synchronous visual speaker ID results are similar, which suggests that further improvement in the AV system depends on the integration rather than on the features employed.

Table 1. Word recognition and speaker ID results for clean audio

Classifier modality     Visual features   Word recognition (%)   Speaker identification (%)
Audio                   N/A               99.04                  86.91
Synchronous visual      delta(5,10)       52.5                   57.14
Audio-visual (EI)       delta(5,10)       95.07                  80.16
Synchronous visual      delta(3,6)        52.84                  53.97
Audio-visual (EI)       delta(3,6)        97.17                  80.56
Asynchronous visual     delta(1,2)        26.81                  55.56

Fig. 3a shows the results for EI. There may be several reasons why EI performed so poorly. The visual features may not have been synchronised properly with the audio features; this may have occurred when visual frames were dropped to calculate the delta features, or because of the overlapping audio frames. Another reason for the poor EI performance may be the lack of training data for the dimensionally larger AV feature vectors. Fig. 3b shows the results of LI speaker ID. The AV LI scores are synergistic, giving a significant improvement over the audio case.

Fig. 3a. EI speaker ID rates (%) versus SNR (dB) for 252 speakers using delta(5,10) visual features (audio, asynchronous visual, synchronous visual and AV curves). Fig. 3b. LI speaker ID rates (%) versus SNR (dB) using delta(5,10) visual features, additive Viterbi-score LI of normalised scores (audio, synchronous visual, asynchronous visual, audio-synchronous visual and audio-asynchronous visual curves).

Fig. 4 shows how the audio weights varied with SNR. The continuous line shows the audio weights that gave the best LI results. The general profile is as expected, with higher SNRs requiring higher audio weights and vice versa. The vertical error bars show the audio weights that gave an LI score within 98% of the maximum score. This shows that some flexibility is permitted in the audio weights, which should be kept in mind when implementing adaptable audio weights.

Fig. 4a. Audio weights versus SNR (dB) for the synchronous visual case of Fig. 3b. Fig. 4b. Audio weights versus SNR (dB) for the asynchronous visual case of Fig. 3b.

8 Further Developments and Conclusion

The LI results based on the use of dynamic features are good. They show that the addition of the dynamic visual mode not only increases performance at low SNR values but also increases performance for clean audio, giving a speaker ID system of higher accuracy and greater robustness to audio noise. It was expected that the EI results were poor due to the lack of training data. However, both EI and LI speech recognition had 3×N training samples per model and synergistic EI was not achieved (see Table 1). This suggests that the use of more training data alone may not yield synergistic EI; further analysis of the feature extraction methods and of the AV feature synchronisation may be required. For an AV system that is robust to real-world conditions, it is not sufficient to prove robustness to audio noise alone. Robustness to visual mode degradation is also necessary.
Effects of visual degradation, such as frame rate decimation, noise and compression artifacts [13], have not been widely reported in the literature. It is expected that frame rate decimation would affect the dynamic visual features more than other visual degradations. Further image pre-processing may yield higher visual accuracies. ROI down-sampling may further compact the visual feature vector and may improve the EI results, due to the reduced amount of training data required. In conclusion, the results show that the addition of the dynamic visual information improves the speaker ID accuracies for both clean and noisy audio conditions compared to the audio-only case.

9 Acknowledgements

This work was supported by Enterprise Ireland's IP2000 program.

References

1. Brunelli, R. and Falavigna, D.: Person identification using multiple cues. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 17, no. 10, pp. 955-966, Oct. 1995
2. Brunelli, R. and Poggio, T.: Face Recognition: Features versus Templates. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 15, no. 10, pp. 1042-1052, 1993
3. Chen, T.: Audiovisual Speech Processing. IEEE Signal Processing Magazine, vol. 18, no. 1, pp. 9-21, Jan. 2001
4. Chibelushi, C. C., Deravi, F., and Mason, J. S. D.: A Review of Speech-Based Bimodal Recognition. IEEE Transactions on Multimedia, vol. 4, no. 1, pp. 23-36, Mar. 2002
5. Davis, S. and Mermelstein, P.: Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Sentences. IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 28, no. 4, pp. 357-366, 1980
6. Lucey, S.: Audio-Visual Speech Processing. PhD thesis, Queensland University of Technology, Brisbane, Australia, Apr. 2002
7. Luettin, J.: Speaker verification experiments on the XM2VTS database. IDIAP Communication 98-02, IDIAP, Martigny, Switzerland, Aug. 1999
8. Luettin, J. and Maitre, G.: Evaluation Protocol for the XM2VTSDB Database (Lausanne Protocol). IDIAP Communication 98-05, IDIAP, Martigny, Switzerland, Oct. 1998
9. Matthews, I., Cootes, T. F., Bangham, J. A., Cox, J. A., and Harvey, R.: Extraction of visual features for lipreading. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 2, pp. 198-213, Feb. 2002
10. McGurk, H. and MacDonald, J.: Hearing Lips and Seeing Voices. Nature, vol. 264, pp. 746-748, Dec. 1976
11. Messer, K., Matas, J., Kittler, J., Luettin, J., and Maitre, G.: XM2VTSDB: The Extended M2VTS Database. Proceedings of the Second International Conference on Audio- and Video-based Biometric Person Authentication (AVBPA'99), Washington D.C., pp. 72-77, Mar. 1999
12. Netravali, A. N. and Haskell, B. G.: Digital Pictures. Plenum Press, pp. 408-416, 1998
13. Potamianos, G., Graf, H., and Cosatto, E.: An Image Transform Approach for HMM Based Automatic Lipreading. Proceedings of the IEEE International Conference on Image Processing, Chicago, vol. 3, pp. 173-177, 1998
14. Rabiner, L. R.: A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proceedings of the IEEE, vol. 77, no. 2, pp. 257-286, Feb. 1989
15. Ramachandran, R. P., Zilovic, M. S., and Mammone, R. J.: A Comparative Study of Robust Linear Predictive Analysis Methods with Applications to Speaker Identification. IEEE Transactions on Speech and Audio Processing, vol. 3, no. 2, pp. 117-125, Mar. 1995
16. Reynolds, D. A. and Rose, R. C.: Robust Text-Independent Speaker Identification Using Gaussian Mixture Speaker Models. IEEE Transactions on Speech and Audio Processing, vol. 3, no. 1, Jan. 1995
17. Scanlon, P. and Reilly, R.: Visual Feature Analysis for Automatic Speechreading. DSP Research Group, UCD, Dublin, Ireland, 2001
18. Silsbee, P. and Bovik, A.: Computer Lipreading for Improved Accuracy in Automatic Speech Recognition. IEEE Transactions on Speech and Audio Processing, vol. 4, no. 5, pp. 337-350, Sept. 1996
19. Yacoub, S. B. and Luettin, J.: Audio-Visual Person Verification. IDIAP Communication 98-18, IDIAP, Martigny, Switzerland, Nov. 1998
20. Young, S., Evermann, G., Kershaw, D., Moore, G., Odell, J., Ollason, D., Valtchev, V., and Woodland, P.: The HTK Book (for HTK Version 3.1). Microsoft Corporation and Cambridge University Engineering Department, Nov. 2001