Speaker Adaptation of Hybrid NN/HMM Model for Speech Recognition Based on Singular Value Decomposition
  • Authors: Shaofei Xue; Hui Jiang; Lirong Dai; Qingfeng Liu
  • Keywords: Deep neural network (DNN); Hybrid DNN/HMM; Speaker adaptation; Singular value decomposition (SVD)
  • Journal: The Journal of VLSI Signal Processing
  • Year: 2016
  • Publication date: February 2016
  • Volume: 82
  • Issue: 2
  • Pages: 175-185
  • Full text size: 661 KB
  • Author affiliations: Shaofei Xue (1)
    Hui Jiang (2)
    Lirong Dai (1)
    Qingfeng Liu (1)

    1. National Engineering Laboratory of Speech and Language Information Processing, University of Science and Technology of China, Hefei, China
    2. Department of Electrical Engineering and Computer Science, York University, Toronto, Canada
  • Journal category: Engineering
  • Journal subjects: Electrical Engineering
    Circuits and Systems
    Computer Imaging, Vision, Pattern Recognition and Graphics
    Computer Systems Organization and Communication Networks
    Signal, Image and Speech Processing
    Mathematics of Computing
  • Publisher: Springer New York
  • ISSN: 1939-8115
Abstract
Recently, several speaker adaptation methods have been proposed for deep neural networks (DNNs) in large vocabulary continuous speech recognition (LVCSR) tasks. However, few of these methods tune the connection weights of a trained DNN directly, because direct weight adaptation is very prone to over-fitting, especially when some class labels are missing from the adaptation data. In this paper, we propose a new speaker adaptation method for the hybrid NN/HMM speech recognition model based on singular value decomposition (SVD). We apply SVD to the weight matrices of a trained DNN and then tune only the rectangular diagonal matrices of singular values with the adaptation data. This alleviates the over-fitting problem because the weight matrices change only slightly, through the singular values alone. We evaluate the proposed adaptation method on two standard speech recognition tasks: TIMIT phone recognition and large vocabulary speech recognition on Switchboard. Experimental results show that large DNN models can be adapted effectively with only a small amount of adaptation data. For example, on Switchboard the proposed SVD-based adaptation method achieves up to 3-6% relative error reduction using only a few dozen adaptation utterances per speaker.
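As a concrete illustration of the adaptation scheme described in the abstract, below is a minimal PyTorch sketch, not the authors' implementation: the class name SVDAdaptedLinear, the layer sizes, and the learning rate are all hypothetical. It factorizes a trained layer's weight matrix as W = U diag(s) V^T, freezes U and V from the speaker-independent model, and exposes only the singular values s as trainable parameters during speaker adaptation.

    # Minimal sketch of SVD-based speaker adaptation of one DNN layer (PyTorch).
    # Illustrative only; class name and sizes are hypothetical, not from the paper.
    import torch
    import torch.nn as nn

    class SVDAdaptedLinear(nn.Module):
        """Linear layer refactored as W = U * diag(s) * V^T.

        U and V come from the trained speaker-independent DNN and are frozen;
        only the singular values s are updated on the adaptation data.
        """

        def __init__(self, weight: torch.Tensor, bias: torch.Tensor):
            super().__init__()
            # weight has shape (out_dim, in_dim); full_matrices=False gives
            # U: (out_dim, k), s: (k,), Vh: (k, in_dim) with k = min(out_dim, in_dim).
            U, s, Vh = torch.linalg.svd(weight, full_matrices=False)
            self.register_buffer("U", U)          # frozen
            self.register_buffer("Vh", Vh)        # frozen
            self.s = nn.Parameter(s.clone())      # speaker-dependent, trainable
            self.register_buffer("bias", bias)    # frozen

        def forward(self, x):
            # Reconstruct the adapted weight matrix from the tuned singular values.
            W = self.U @ torch.diag(self.s) @ self.Vh
            return x @ W.t() + self.bias

    # Usage: wrap a trained layer and fine-tune only its singular values.
    layer = nn.Linear(440, 2048)                  # stand-in for a trained SI layer
    adapted = SVDAdaptedLinear(layer.weight.data, layer.bias.data)
    optimizer = torch.optim.SGD([adapted.s], lr=1e-3)   # only s is optimized

Under these assumptions, each speaker only needs to store one vector of min(in_dim, out_dim) singular values per adapted layer, which is why updating the singular values alone keeps the speaker-dependent footprint small and limits over-fitting on sparse adaptation data.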
