Linguistically-constrained formant-based i-vectors for automatic speaker recognition
Entity: UAM. Departamento de Tecnología Electrónica y de las Comunicaciones
DOI: 10.1016/j.specom.2015.11.002. Speech Communication 76 (2016): 61–81
Funded by: This work has been supported by the Spanish Ministry of Economy and Competitiveness (project CMC-V2: Caracterizacion, Modelado y Compensacion de Variabilidad en la Señal de Voz, TEC2012-37585-C02-01). The authors would also like to thank SRI for providing the Decipher phonetic transcriptions of the NIST 2004, 2005 and 2006 SREs, which made this work possible.
Project: Gobierno de España. TEC2012-37585-C02-01
Subjects: Automatic speaker recognition; Formant dynamics; Formant frequencies; Linguistically-constrained systems; Telecommunications
Note: This is the author’s version of a work that was accepted for publication in Speech Communication. Changes resulting from the publishing process, such as peer review, editing, corrections, structural formatting, and other quality control mechanisms, may not be reflected in this document. Changes may have been made to this work since it was submitted for publication. A definitive version was subsequently published in Speech Communication, vol. 76 (2016), DOI 10.1016/j.specom.2015.11.002
Rights: © 2016 Elsevier B.V. All rights reserved
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International license.
This paper presents a large-scale study of the discriminative abilities of formant frequencies for automatic speaker recognition. Exploiting both the static and dynamic information in formant frequencies, we present linguistically-constrained formant-based i-vector systems that provide well-calibrated likelihood ratios for each comparison of the occurrences of the same isolated linguistic unit in two given utterances. First, the reported analysis of the discriminative and calibration properties of the different linguistic units provides useful insights, for instance, to forensic phonetic practitioners. Furthermore, it is shown that the set of most discriminative units varies from speaker to speaker. Second, the linguistically-constrained systems are combined at score level through speaker-independent fusion rules (averaging and logistic regression), exploiting the speaker-distinguishing information spread among the different linguistic units. Testing on the English-only trials of the core condition of the NIST 2006 SRE (24,000 voice comparisons of 5-minute telephone conversations from 517 speakers: 219 male and 298 female), we report equal error rates (EER) of 9.57% and 12.89% for male and female speakers respectively, using only formant frequencies as speaker-discriminative information. Additionally, when the formant-based system is fused with a cepstral i-vector system, we obtain relative improvements of ∼6% in EER (from 6.54% to 6.13%) and ∼15% in minDCF (from 0.0327 to 0.0279), compared to the cepstral system alone.
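As a rough illustration of two ideas in the abstract, the Python sketch below checks the reported relative improvements and shows what an average score-level fusion rule could look like. It is not the authors' implementation: the function names, the unit labels and the per-unit scores are hypothetical, and skipping units that do not occur in both utterances is an assumption made here for the example.

```python
from statistics import mean


def relative_improvement(baseline, fused):
    """Relative reduction of an error metric, in percent."""
    return 100.0 * (baseline - fused) / baseline


def average_fusion(unit_scores):
    """Average the per-unit log-likelihood ratios available for one trial.

    unit_scores maps a linguistic unit label to its score, or to None when
    the unit does not occur in both utterances (assumption of this sketch).
    """
    available = [s for s in unit_scores.values() if s is not None]
    if not available:
        raise ValueError("no linguistic unit shared by both utterances")
    return mean(available)


# Figures quoted in the abstract: cepstral system alone vs. fused system.
eer_gain = relative_improvement(6.54, 6.13)
dcf_gain = relative_improvement(0.0327, 0.0279)
print(f"EER: {eer_gain:.1f}%  minDCF: {dcf_gain:.1f}%")

# Hypothetical per-unit scores for a single voice comparison.
trial = {"AA": 1.2, "IY": -0.4, "N": None, "S": 0.7}
fused = average_fusion(trial)
print(f"fused score: {fused:.2f}")
```

The computed gains are consistent with the rounded ∼6% and ∼15% figures. A logistic regression fusion would replace the plain mean with weights and an offset trained on development scores.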