Linguistically-constrained formant-based i-vectors for automatic speaker recognition
Entity
UAM. Departamento de Tecnología Electrónica y de las Comunicaciones
Publisher
Elsevier
Publication date
2016-02
Citation
10.1016/j.specom.2015.11.002
Speech Communication 76 (2016): 61–81
ISSN
0167-6393
DOI
10.1016/j.specom.2015.11.002
Funded by
This work has been supported by the Spanish Ministry of Economy and Competitiveness (project CMC-V2: Caracterizacion, Modelado y Compensacion de Variabilidad en la Señal de Voz, TEC2012-37585-C02-01). The authors would also like to thank SRI for providing the Decipher phonetic transcriptions of the NIST 2004, 2005 and 2006 SREs, which made this work possible.
Project
Gobierno de España. TEC2012-37585-C02-01
Publisher's version
http://dx.doi.org/10.1016/j.specom.2015.11.002
Subjects
Automatic speaker recognition; Formant dynamics; Formant frequencies; Linguistically-constrained systems; Telecommunications
Note
This is the author’s version of a work that was accepted for publication in Speech Communication. Changes resulting from the publishing process, such as peer review, editing, corrections, structural formatting, and other quality control mechanisms may not be reflected in this document. Changes may have been made to this work since it was submitted for publication. A definitive version was subsequently published in Speech Communication, VOL 76 (2016) DOI 10.1016/j.specom.2015.11.002
Rights
© 2016 Elsevier B.V. All rights reserved. This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International license.
Abstract
This paper presents a large-scale study of the discriminative abilities of formant frequencies for automatic speaker recognition. Exploiting both the static and dynamic information in formant frequencies, we present linguistically-constrained formant-based i-vector systems that provide well-calibrated likelihood ratios for each comparison of the occurrences of the same isolated linguistic unit in two given utterances. As a first result, the reported analysis of the discriminative and calibration properties of the different linguistic units provides useful insights, for instance, to forensic phonetic practitioners. Furthermore, it is shown that the set of most discriminative units varies from speaker to speaker. Secondly, linguistically-constrained systems are combined at score level through average and logistic regression speaker-independent fusion rules, exploiting the different speaker-distinguishing information spread among the different linguistic units. Testing on the English-only trials of the core condition of the NIST 2006 SRE (24,000 voice comparisons of 5-minute telephone conversations from 517 speakers, 219 male and 298 female), we report equal error rates (EER) of 9.57% and 12.89% for male and female speakers respectively, using only formant frequencies as speaker-discriminative information. Additionally, when the formant-based system is fused with a cepstral i-vector system, we obtain relative improvements of ∼6% in EER (from 6.54% to 6.13%) and ∼15% in minDCF (from 0.0327 to 0.0279) compared to the cepstral system alone.
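As a quick check of the arithmetic in the abstract, the following minimal Python sketch recomputes the two relative improvements from the reported figures and illustrates the general idea of average score-level fusion. The function names, the unit labels and the per-unit scores are hypothetical and chosen only for illustration; they are not data or code from the paper, and the logistic regression fusion rule is not shown.

```python
def relative_improvement(baseline, fused):
    """Relative reduction of an error metric, in percent."""
    return 100.0 * (baseline - fused) / baseline


def average_fusion(unit_scores):
    """Average the per-unit scores available for one trial.

    unit_scores maps a linguistic-unit label to the score obtained for
    that unit (hypothetical layout; the paper's data format is not given).
    """
    if not unit_scores:
        raise ValueError("no linguistic unit is shared by the two utterances")
    return sum(unit_scores.values()) / len(unit_scores)


# Figures reported in the abstract (cepstral system alone vs. fused system).
eer_gain = relative_improvement(6.54, 6.13)      # about 6.3 %
dcf_gain = relative_improvement(0.0327, 0.0279)  # about 14.7 %
print(f"EER: {eer_gain:.1f}%  minDCF: {dcf_gain:.1f}%")

# Made-up scores for two units, only to show the averaging rule.
fused_score = average_fusion({"AA": 1.0, "IY": -0.5})
print(fused_score)
```

The two computed values agree with the rounded ∼6% and ∼15% figures quoted above.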
Authors
Franco-Pedroso, Javier; González Rodríguez, Joaquín