A linguistically-motivated speaker recognition front-end through session variability compensated cepstral trajectories in phone units
Entity: UAM. Departamento de Tecnología Electrónica y de las Comunicaciones
DOI: 10.1109/ICASSP.2012.6288892
2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2012. 4389 - 4392
ISBN: 978-1-4673-0045-2 (print); 978-1-4673-0044-5 (online)
Funded by: Supported by MEC grant PR-2010-123, MICINN project TEC09-14179, ForBayes project CCG10-UAM/TIC-5792 and Cátedra UAM-Telefónica.
Subjects: Speaker recognition; Linguistic units; Temporal trajectories; Session variability; Feature compensation; Telecommunications
Note: Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. J. González-Rodríguez, J. González-Domínguez, J. Franco-Pedroso, D. Ramos, "A linguistically-motivated speaker recognition front-end through session variability compensated cepstral trajectories in phone units" in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Kyoto (Japan), 2012, 4389 - 4392
Rights: © 2012 IEEE
This paper presents a new linguistically-motivated front-end that yields major performance improvements by using session variability compensated cepstral trajectories in phone units. Extending our recent work on temporal contours in linguistic units (TCLU), we combine the potential of those unit-dependent trajectories with the ability of feature-domain factor analysis techniques to compensate for session variability effects, which results in consistent and discriminant phone-dependent trajectories across different recording sessions. Evaluating on the NIST SRE04 English-only 1s1s task, we report EERs as low as 5.40% from the trajectories of a single phone, with 29 different phones each producing an EER below 10%, together with excellent per-unit calibration performance. Combining different units shows significant complementarity, with an EER of 1.63% (100×DCF=0.732) from a simple sum fusion of the 23 best phones, or 0.68% (100×DCF=0.304) when fusing them through logistic regression.
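To illustrate why summing per-phone scores can lower the EER, the following is a minimal sketch on synthetic data, not the authors' system. The function names `equal_error_rate` and `sum_fusion`, the Gaussian score distributions, and the assumption that the 23 units produce independent scores are all hypothetical choices made for this example; real phone units are correlated, so the gain shown here is optimistic.

```python
import numpy as np

def equal_error_rate(target_scores, nontarget_scores):
    """Approximate EER by sweeping thresholds over all observed scores and
    returning the operating point where miss and false-alarm rates are closest."""
    target_scores = np.asarray(target_scores, dtype=float)
    nontarget_scores = np.asarray(nontarget_scores, dtype=float)
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    # Miss: a target trial scores below the threshold.
    p_miss = np.array([(target_scores < t).mean() for t in thresholds])
    # False alarm: a non-target trial scores at or above the threshold.
    p_fa = np.array([(nontarget_scores >= t).mean() for t in thresholds])
    idx = np.argmin(np.abs(p_miss - p_fa))
    return 0.5 * (p_miss[idx] + p_fa[idx])

def sum_fusion(scores_per_unit):
    """Fuse an (n_units, n_trials) score matrix by summing over units."""
    return np.asarray(scores_per_unit, dtype=float).sum(axis=0)

# Synthetic scores (assumption): each unit is a weak, independent detector.
rng = np.random.default_rng(0)
n_units, n_target, n_nontarget = 23, 500, 5000
target = rng.normal(1.0, 1.0, size=(n_units, n_target))
nontarget = rng.normal(0.0, 1.0, size=(n_units, n_nontarget))

single_unit_eer = equal_error_rate(target[0], nontarget[0])
fused_eer = equal_error_rate(sum_fusion(target), sum_fusion(nontarget))
print(f"single unit EER: {single_unit_eer:.3f}, sum-fusion EER: {fused_eer:.3f}")
```

Logistic regression fusion would replace the equal weights of the sum with weights and an offset trained on held-out scores, which also calibrates the fused output.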
Google Scholar: González Rodríguez, Joaquín - González Domínguez, Javier - Franco-Pedroso, Javier - Ramos Castro, Daniel