Child speech

Confounding voice tech since 1962

With the advance of artificial intelligence technologies, voice as a human-machine interface is growing in popularity. Voice interfaces are especially useful for children, many of whom are pre-literate or have not yet developed the dexterity needed to operate a keyboard. The integration of speech technology into games, toys, TVs and other household devices has seen a huge surge in the market, as have platforms and services for literacy, language learning, and accessibility.

However, there is a growing realisation that voice interfaces under-perform for child speech. Wilpon and Jacobsen [1] estimate that the word error rate (WER) for child speech is more than twice that for adult speech, while a more recent study by Potamianos and Narayanan [2] puts the WER at two to five times that of adult speech. Both physiological and behavioural factors influence ASR performance for child speech. Children's voices are pitched higher than adults' voices, and children have shorter, narrower vocal tracts and thinner vocal folds [3].
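For readers unfamiliar with the metric: WER is the word-level edit distance between a reference transcript and the recogniser's hypothesis, divided by the number of reference words. A minimal sketch of the standard dynamic-programming computation (not taken from any of the cited papers; illustrative only):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + insertions + deletions) / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # all deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # all insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)
```

So a child saying "she sells sea shells" recognised as "she sell sea shell" scores a WER of 0.5 (two substitutions over four reference words); the "two to five times" figures above mean error rates in this range are routine for child speech where adult speech would score far lower.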

Children also differ from adults in their behaviour, often demonstrating unpredictable articulation styles as well as a tendency to create new words, hesitate, under-articulate and show high pronunciation variability. Acoustic variability is often cited as one of the major reasons why ASR systems for child speech are so challenging to build [6]. Much of this variability is due to the rapid developmental changes children undergo as they grow older. Something rarely mentioned in the literature is that children interact with technology differently from adults. Today's children have been born into a world of smartphones, apps and tablets, and it is second nature for kids to interact with hand-held technology in real-world, real-noise environments.

Unlike adults, children tend not to modify their behaviour to facilitate the operation of voice-enabled devices. They are unlikely to seek out a quieter environment in which to use technology, and most of their time is spent in environments with children and adults speaking at close range, or with a television, radio, running water, extractor fans, and other household noises from ovens, microwaves and kettles. Of the few data sets available for building child ASR, none contains training data that could be deemed representative of real-world environments for children.

References

[1] J. G. Wilpon and C. N. Jacobsen, "A study of speech recognition for children and the elderly," in Proceedings of ICASSP 1996, vol. 1, 1996, pp. 349–352.

[2] A. Potamianos and S. Narayanan, "Robust recognition of children's speech," IEEE Transactions on Speech and Audio Processing, vol. 11, no. 6, pp. 603–616, 2003.

[3] M. E. Beckman, A. R. Plummer, B. Munson, and P. F. Reidy, "Methods for eliciting, annotating, and analyzing databases for child speech development," Computer Speech & Language, vol. 45, pp. 278–299, Sep. 2017.

[4] Q. Li and M. J. Russell, "An analysis of the causes of increased error rates in children's speech recognition," in Proceedings of INTERSPEECH, 2002.

[5] D. Giuliani and M. Gerosa, "Investigating recognition of children's speech," in Proceedings of ICASSP 2003, pp. II-137–II-140.

[6] A. Potamianos, S. Narayanan, and S. Lee, "Automatic speech recognition for children," in European Conference on Speech Communication and Technology (Eurospeech), 1997.
