Cross Modal Evaluation of High Quality Emotional Speech Synthesis with the Virtual Human Toolkit

Keywords: Speech synthesis; Unit selection; Expressive speech synthesis; Emotion; Prosody; Facial animation
Journal title: Lecture Notes in Computer Science
Publication year: 2016
Volume: 10011
Issue: 1
Pages: 190-197
Full-text size: 540 KB
Author affiliations: Blaise Potard (19)
Matthew P. Aylett (19) (20)
David A. Baude (19)

19. CereProc Ltd., Edinburgh, UK
20. University of Edinburgh, Edinburgh, UK
Series title: Intelligent Virtual Agents
ISBN: 978-3-319-47665-0
Publication category: Computer Science
Publication subjects: Artificial Intelligence and Robotics
Computer Communication Networks
Software Engineering
Data Encryption
Database Management
Computation by Abstract Devices
Algorithm Analysis and Problem Complexity
Publisher: Springer Berlin / Heidelberg
ISSN: 1611-3349
Volume order: 10011

Abstract

Emotional expression is a key requirement for intelligent virtual agents. For an agent to produce dynamic spoken content, speech synthesis is required. However, despite substantial work with pre-recorded prompts, very little work has explored the combined effect of high-quality emotional speech synthesis and facial expression. In this paper we offer a baseline evaluation of the naturalness and emotional range available by combining the freely available SmartBody component of the Virtual Human Toolkit (VHTK) with the CereVoice text-to-speech (TTS) system. The results echo previous work using pre-recorded prompts: the visual modality is dominant and the modalities do not interact. This allows the speech synthesis to add gradual changes to the perceived emotion in terms of both valence and activation. The reported naturalness is good: 3.54 on a 5-point mean opinion score (MOS) scale.
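
As a reading aid for the naturalness figure above: a mean opinion score is the arithmetic mean of listener ratings on a fixed scale. The sketch below is a minimal illustration under stated assumptions. The function name `mean_opinion_score` and the ratings are hypothetical, chosen for this example; they are not the paper's data or evaluation code.

```python
from statistics import mean


def mean_opinion_score(ratings, low=1, high=5):
    """Return the MOS: the arithmetic mean of ratings on a low..high scale."""
    if not ratings:
        raise ValueError("at least one rating is required")
    for r in ratings:
        if not low <= r <= high:
            raise ValueError(f"rating {r} is outside the {low}-{high} scale")
    return mean(ratings)


# Hypothetical ratings for illustration only; NOT data from the paper.
example_ratings = [4, 3, 5, 3, 4, 2, 4, 3]
mos = mean_opinion_score(example_ratings)
print(f"MOS = {mos:.2f} over {len(example_ratings)} ratings")
```

Rejecting out-of-range ratings before averaging is a design choice of this sketch, so that a data-entry error cannot silently shift the score.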
