How does AGNITIO's Speaker Recognition Technology Work?

AGNITIO specializes in recognizing people by their voice, a biometric method that we could classify as natural because we can all recognize friends and family by their voices. However, the automatic processing of voice signals is a complex problem, and it has taken years to solve it in a satisfactory manner. The three basic steps for recognizing a speaker are capturing, digitizing, and processing the sound waves using a complex mathematical method. When a person speaks, the air from the lungs passes through the vocal cords and then through a part of the anatomy called the vocal tract, which includes the larynx, the oral cavity, and the nose. By modifying the physical structure of the vocal tract we articulate the various phonemes, and this is how we are able to communicate. These physical transformations can be followed with precision if we analyze the frequency components of the resulting sound wave.
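
To make the idea of "frequency components" concrete, here is a minimal sketch in Python. It is only an illustration and not AGNITIO's actual processing: the sample rate, the frame length, and the two tones standing in for a voiced sound are all assumptions chosen for the example, and the function name dft_magnitudes is hypothetical. It applies a plain discrete Fourier transform to one short frame of a synthetic signal and reports which frequencies carry the most energy.

```python
import cmath
import math

SAMPLE_RATE = 8000  # Hz; assumed value, typical of telephone-quality audio
FRAME_LEN = 200     # samples; 25 ms at 8 kHz, an assumed frame length


def dft_magnitudes(samples):
    """Return the magnitude of each DFT bin from 0 Hz up to the Nyquist frequency."""
    n = len(samples)
    magnitudes = []
    for k in range(n // 2 + 1):
        acc = sum(
            samples[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)
        )
        magnitudes.append(abs(acc))
    return magnitudes


# Synthetic stand-in for a voiced sound: a strong 400 Hz tone plus a weaker 1200 Hz tone.
frame = [
    math.sin(2 * math.pi * 400 * t / SAMPLE_RATE)
    + 0.5 * math.sin(2 * math.pi * 1200 * t / SAMPLE_RATE)
    for t in range(FRAME_LEN)
]

magnitudes = dft_magnitudes(frame)
bin_hz = SAMPLE_RATE / FRAME_LEN  # frequency spacing between adjacent bins

# Pick the two bins with the most energy and convert them back to Hz.
strongest = sorted(range(len(magnitudes)), key=lambda k: magnitudes[k], reverse=True)[:2]
peak_freqs = sorted(k * bin_hz for k in strongest)
print(peak_freqs)  # [400.0, 1200.0]
```

A real system would repeat this kind of analysis over many overlapping frames and would normally use a fast Fourier transform rather than the direct sum shown here.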

Much progress has been made in processing these signals, both to understand spoken words automatically and to produce a given utterance artificially by computer. Biometric speaker recognition technology extracts from the sound wave the fixed, personal characteristics that each individual generates. This information allows us to recognize a person independently of what he or she may be saying. Each of us possesses a unique physiology that leaves unique and personal information in the sound wave. Through mathematical processing we are able to eliminate the data in the sound wave that carries what you are saying and keep the personal information about who you are.

The following is an outline of how AGNITIO’s technology works:

  • We can obtain a model, or pattern, of a person's biometric vocal characteristics from less than a minute of speech. This process is called the training or registration of that person for later identification. The model can be obtained from substantially shorter utterances if the training and the identification are going to use the same words.
  • Once the models of a group of persons are stored, we can proceed to verify the identity of a speaker. To do so, we capture a short utterance from an unknown person (a minimum of 10 seconds for free text). We then extract the characteristic parameters of the sound wave and compare them with the model, obtaining a score. This score measures the likelihood that the model and the unknown voice belong to the same person.
  • Depending on the application, we employ different methods to make a decision about the identity of the person. We either use thresholds calibrated for a specific application, or we use more complex approaches such as comparison against a reference population to obtain likelihood ratios. With those results we make a decision on the identity of the person.
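
The three steps above (training, scoring, decision) can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not AGNITIO's implementation: each voice is reduced to a handful of invented two-dimensional feature vectors, the model is a simple per-dimension Gaussian, the score is a log-likelihood ratio against a reference population, and the threshold of 0.0 is arbitrary. All function names and data values are hypothetical.

```python
import math


def fit_gaussian(vectors):
    """Training: fit a per-dimension mean and variance to the feature vectors."""
    n = len(vectors)
    dims = len(vectors[0])
    means = [sum(v[d] for v in vectors) / n for d in range(dims)]
    variances = [
        max(sum((v[d] - means[d]) ** 2 for v in vectors) / n, 1e-6)  # floor avoids division by zero
        for d in range(dims)
    ]
    return means, variances


def avg_log_likelihood(vectors, model):
    """Average log-likelihood per feature vector under a diagonal Gaussian model."""
    means, variances = model
    total = 0.0
    for v in vectors:
        for x, m, var in zip(v, means, variances):
            total += -0.5 * (math.log(2 * math.pi * var) + (x - m) ** 2 / var)
    return total / len(vectors)


def llr_score(test_vectors, speaker_model, reference_model):
    """Scoring: log-likelihood ratio of the speaker model against the reference population."""
    return avg_log_likelihood(test_vectors, speaker_model) - avg_log_likelihood(
        test_vectors, reference_model
    )


def decide(score, threshold=0.0):
    """Decision: accept the claimed identity when the score reaches the threshold."""
    return score >= threshold


# Invented feature vectors; a real system would extract these from audio.
enrollment = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2], [1.1, 2.1]]
population = [[0.0, 0.0], [3.0, 4.0], [-1.0, 1.0], [2.0, -1.0], [1.0, 3.0]]
same_speaker = [[1.0, 2.1], [0.9, 1.9]]
other_speaker = [[3.0, -0.5], [2.8, 0.0]]

speaker_model = fit_gaussian(enrollment)
reference_model = fit_gaussian(population)

same_score = llr_score(same_speaker, speaker_model, reference_model)
other_score = llr_score(other_speaker, speaker_model, reference_model)

print("same speaker accepted:", decide(same_score))
print("other speaker accepted:", decide(other_score))
```

In this sketch the threshold stands in for the calibrated, application-specific threshold mentioned above; raising it trades fewer false acceptances for more false rejections.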

Each of these three processes has its own complexity and must be handled knowledgeably in order to obtain the desired results. As is the case with all biometric technology, inappropriate use in a given application can lead to inaccurate results, or to processes that are unusable because of their complexity or speed. Used properly, however, biometric technology becomes a simple and reliable method that is very easy to use in diverse environments.