28/04/2023, 12:39 Audiovisual speech synthesis: An overview of the state-of-the-art - ScienceDirect https://www.sciencedirect.com/science/article/abs/pii/S0167639314000818

Speech Communication
Volume 66, February 2015, Pages 182-217

Audiovisual speech synthesis: An overview of the state-of-the-art
Wesley Mattheyses, Werner Verhelst
https://doi.org/10.1016/j.specom.2014.11.001

Highlights
• Comprehensive overview of the various techniques for audiovisual speech synthesis.
• Novel categorization of the techniques based on multiple aspects.
• Important future directions for the field of audiovisual speech synthesis.
• Bundles a lot of information that was scattered in the scientific literature.

Abstract
We live in a world where there are countless interactions with computer systems in everyday situations. In the most ideal case, this interaction feels as familiar and as natural as the communication we experience with other humans. To this end, an ideal means of communication between a user and a computer system consists of audiovisual speech signals. Audiovisual text-to-speech technology allows the computer system to utter any spoken message towards its users. Over the last decades, a wide range of techniques for performing audiovisual speech synthesis has been developed. This paper gives a comprehensive overview of these approaches using a categorization of the systems based on multiple important aspects that determine the properties of the synthesized speech signals. The paper makes a clear distinction between the techniques that are used to model the virtual speaker and the techniques that are used to generate the appropriate speech gestures.
In addition, the paper discusses the evaluation of audiovisual speech synthesizers, elaborates on the hardware requirements for performing visual speech synthesis, and describes some important future directions that should stimulate the use of audiovisual speech synthesis technology in real-life applications.

Keywords
Audiovisual speech synthesis; Visual speech synthesis; Speech synthesis

Cited by (45)

Social fidelity in virtual agents: Impacts on presence and learning
2021, Computers in Human Behavior
https://www.sciencedirect.com/science/article/pii/S0747563220303113
Citation excerpt: …These studies were done with natural (i.e. not synthetic) speech, but they illustrate that visible speech can provide benefits in a pedagogical context by increasing comprehension of spoken messages. Various approaches to animating virtual agent speech have been used and evaluated in terms of subjective acceptability or naturalness, with considerably less work evaluating whether those approaches result in the improved comprehension observed with natural audio-visual speech (Mattheyses & Verhelst, 2015). When speech animation is based on performance capture from human actors, perception is similar to that of natural visible speech, except in the case of phonemic contrasts that rely on the visibility of internal articulators (Files, Tjan, Jiang, & Bernstein, 2015), which are generally difficult to capture (e.g., Crary, Kotzur, Gauger, Gorham, & Burton, 1996).…

Features and results of a speech improvement experiment on hard of hearing children
2019, Speech Communication
https://www.sciencedirect.com/science/article/pii/S0167639318301274
Citation excerpt: …Facial movements are carried out within a parametric model, i.e. a collection of polygons is manipulated using a set of parameters (terminal-analog system). This process allows control of a wide range of motions using a set of parameters associated with different articulation functions (Parke 1972, Mattheyses – Verhelst 2015). These features can be directly matched to particular movements of the lip, tongue, chin, eyes, eyelids, eyebrows and the whole face, and can vary from −1 to 1.…

Video-realistic expressive audio-visual speech synthesis for the Greek language
2017, Speech Communication
https://www.sciencedirect.com/science/article/pii/S0167639317300419
Citation excerpt: …Under the same assumption it is also desirable that the levels of expressiveness of both information channels are correlated - i.e., a full blown facial expression of anger is accompanied with a full blown vocal expression of anger. Audio-Visual Text-To-Speech synthesis (AVTTS) explores the generation of audio-visual speech signals (Mattheyses and Verhelst, 2015) (i.e., a talking head), and video-realistic AVTTS, more specifically, explores the generation of talking heads that highly resemble a human being as if a camera was recording it. Although naturalness in video-realistic AVTTS systems has increased greatly, the addition of expressions has proven to be a challenging task (Anderson et al., 2013; Schröder, 2009) due to the large variability they introduce, especially in extreme expressions such as expressions of anger or happiness, in both acoustic and visual modeling.…

CodeTalker: Speech-Driven 3D Facial Animation with Discrete Motion Prior
2023, arXiv
https://doi.org/10.48550/arXiv.2301.02379

Hard to hear but easy to see: Audio-visual perception of the /r/-/w/ contrast in Anglo-English
2022, Journal of the Acoustical Society of America
https://doi.org/10.1121/10.0012660

Deep Learning for Visual Speech Analysis: A Survey
2022, arXiv
https://doi.org/10.48550/arXiv.2205.10839

View all citing articles on Scopus: http://www.scopus.com/scopus/inward/citedby.url?partnerID=10&rel=3.0.0&eid=2-s2.0-84912553696&md5=1b70b98eeb1ae815646fa5b65f75677d

Copyright © 2014 Elsevier B.V. All rights reserved.