Linguistically-constrained formant-based i-vectors for automatic speaker recognition

Franco-Pedroso, Javier; González Rodríguez, Joaquín

Author(s)

Franco-Pedroso, Javier; González Rodríguez, Joaquín

Institution

UAM. Departamento de Tecnología Electrónica y de las Comunicaciones

Publisher

Elsevier

Publication date

2016-02

Citation

Speech Communication 76 (2016): 61–81

ISSN

0167-6393

DOI

10.1016/j.specom.2015.11.002

Funded by

This work has been supported by the Spanish Ministry of Economy and Competitiveness (project CMC-V2: Caracterizacion, Modelado y Compensacion de Variabilidad en la Señal de Voz, TEC2012-37585-C02-01). The authors also thank SRI for providing the Decipher phonetic transcriptions of the NIST 2004, 2005 and 2006 SREs, which made this work possible.

Project

Gobierno de España. TEC2012-37585-C02-01

Publisher's version

http://dx.doi.org/10.1016/j.specom.2015.11.002

Subjects

Automatic speaker recognition; Formant dynamics; Formant frequencies; Linguistically-constrained systems; Telecommunications

URI

http://hdl.handle.net/10486/675247

Note

This is the author’s version of a work that was accepted for publication in Speech Communication. Changes resulting from the publishing process, such as peer review, editing, corrections, structural formatting, and other quality control mechanisms may not be reflected in this document. Changes may have been made to this work since it was submitted for publication. A definitive version was subsequently published in Speech Communication, VOL 76 (2016) DOI 10.1016/j.specom.2015.11.002

Rights

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International license.

Abstract

This paper presents a large-scale study of the discriminative abilities of formant frequencies for automatic speaker recognition. Exploiting both the static and dynamic information in formant frequencies, we present linguistically-constrained formant-based i-vector systems that provide well-calibrated likelihood ratios for each comparison of the occurrences of the same isolated linguistic unit in two given utterances. First, the reported analysis of the discriminative and calibration properties of the different linguistic units provides useful insights, for instance, to forensic phonetic practitioners. Furthermore, it is shown that the set of most discriminative units varies from speaker to speaker. Second, the linguistically-constrained systems are combined at score level through speaker-independent fusion rules (averaging and logistic regression) that exploit the different speaker-distinguishing information spread among the linguistic units. Testing on the English-only trials of the core condition of the NIST 2006 SRE (24,000 voice comparisons of 5-minute telephone conversations from 517 speakers: 219 male and 298 female), we report equal error rates (EER) of 9.57% and 12.89% for male and female speakers respectively, using only formant frequencies as speaker-discriminative information. Additionally, when the formant-based system is fused with a cepstral i-vector system, we obtain relative improvements of ∼6% in EER (from 6.54% to 6.13%) and ∼15% in minDCF (from 0.0327 to 0.0279) compared to the cepstral system alone.
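The relative improvements quoted in the abstract can be checked with a few lines of arithmetic. The sketch below is only an illustration: the helper name `relative_reduction` is hypothetical and does not come from the paper, and it assumes the improvements are measured as the reduction relative to the cepstral-only baseline.

```python
def relative_reduction(baseline: float, fused: float) -> float:
    """Return the relative reduction from baseline to fused, in percent.

    Hypothetical helper (not from the paper); assumes lower is better,
    as is the case for EER and minDCF.
    """
    return 100.0 * (baseline - fused) / baseline


# Figures taken from the abstract: cepstral system alone vs. fused system.
eer_gain = relative_reduction(6.54, 6.13)          # EER in percent
mindcf_gain = relative_reduction(0.0327, 0.0279)   # minDCF

print(f"Relative EER reduction:    {eer_gain:.2f}%")
print(f"Relative minDCF reduction: {mindcf_gain:.2f}%")
```

Under that assumption the reductions come out at roughly 6.3% for EER and 14.7% for minDCF, consistent with the rounded ∼6% and ∼15% reported.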

File list

Name

linguistically_franco_SC_2016_ps.pdf

Size

1.134Mb

Format

PDF

Collections containing this item

Open-access scientific output of the UAM (Producción científica en acceso abierto de la UAM)