Voice tech in education

Helping students and teachers benefit from technology

Tracking the use of different talk moves in class provides essential feedback for any teacher wishing to monitor their students’ progress. However, providing teachers with feedback about their instruction, or about the conversational progress of the children in their classes, currently requires lengthy, detailed and expensive observation by experts, who must transcribe every classroom session and categorize each exchange, by hand, into individual talk moves (Suresh et al. 2018). A new strategy is needed to provide useful and timely feedback to teachers.

Automatic speech recognition (ASR), whereby a computer automatically transcribes audio, can provide this missing puzzle piece. The resulting transcripts can then be mined using natural language processing (NLP) techniques to infer intent (“I want to book a flight”), trigger actions (“Open Google Maps”) or determine sentiment (“we had a great time”). With one in four adults now owning a voice-enabled smart speaker, the applications of voice technology are becoming familiar to a wider audience.
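To make the ASR-then-NLP idea concrete, here is a deliberately minimal sketch: real systems use trained classifiers, but even a few hand-written rules show how a transcript can be mapped to an intent, an action, or a sentiment. The function name and labels below are illustrative, not part of any real product.

```python
def classify_intent(transcript: str) -> str:
    """Toy rule-based classifier over an ASR transcript.

    A real NLP pipeline would use a trained model; these
    keyword rules just illustrate the transcript -> label step.
    """
    text = transcript.lower()
    if "book a flight" in text:
        return "book_flight"            # inferred intent
    if text.startswith("open "):
        return "open_app"               # triggered action
    if any(w in text for w in ("great", "wonderful", "loved")):
        return "positive_sentiment"     # detected sentiment
    return "unknown"

print(classify_intent("I want to book a flight"))  # book_flight
print(classify_intent("Open Google Maps"))         # open_app
print(classify_intent("we had a great time"))      # positive_sentiment
```

The key point is that every downstream step depends on the transcript being right in the first place: if the ASR mishears the words, no amount of clever NLP can recover the speaker’s meaning.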

Children in particular are interacting more and more with voice technology, and the applications of child ASR are far-reaching. Educational platforms that assess pronunciation, literacy, fluency and reading accuracy are becoming more widespread, and all rely on accurate transcription of child speech. Unfortunately, the accuracy of child ASR remains poor, with some studies reporting an error rate five times that of adult ASR (Potamianos and Narayanan 2003).
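The “error rate” usually reported for ASR is the word error rate (WER): the number of word substitutions, insertions and deletions needed to turn the system’s transcript into the reference transcript, divided by the reference length. A minimal sketch of the standard edit-distance computation:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over word sequences.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)

# One dropped word out of six reference words:
print(word_error_rate("the cat sat on the mat",
                      "the cat sat on mat"))  # 1/6 ≈ 0.17
```

Under this metric, a fivefold gap is dramatic: an adult system making one error in twenty words would, at the same ratio, garble one word in four of a child’s speech.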

For the TalkBack project, this means that while the teacher’s instructions during class can be transcribed accurately, it is still not possible to reliably transcribe the children’s speech. Accurate transcriptions of both teacher and student speech are essential for categorizing and monitoring the use of talk moves in the classroom – and this can only be achieved with accurate child ASR.

My specialty for the last five years has been child speech technology: improving the language models, acoustic models and modeling techniques that make speech technology work for children’s voices. Both physiological and behavioural factors influence ASR performance for child speech. Children’s voices are pitched higher than adults’ voices, and kids have shorter, narrower vocal tracts and thinner vocal folds (Beckman et al. 2017).

Children also differ in their behaviour, demonstrating unpredictable articulation and a tendency to create new words, hesitate and under-articulate (Li and Russell 2002, Giuliani and Gerosa 2003). Acoustic variability is another major reason why ASR systems for child speech are so challenging to build (Potamianos et al. 1997). Much of this variability is due to children’s rapid cognitive and physical development over short spaces of time as they grow older.

Today’s children have been born into a world of smartphones, apps and tablets, and it is second nature for kids to interact with hand-held technology in real-world, real-noise environments. Unlike adults, children rarely modify their behaviour to facilitate the operation of voice-enabled devices, which can also contribute to poor ASR performance. What is needed is accurate child ASR that is built for child voices, trained on child-specific speech and language data, and works in real-world environments.

References

Potamianos, A., and Narayanan, S. (2003). “Robust recognition of children’s speech,” IEEE Transactions on Speech and Audio Processing, vol. 11, no. 6, pp. 603–616.

Beckman, M. E., Plummer, A. R., Munson, B., and Reidy, P. F. (2017). “Methods for eliciting, annotating, and analyzing databases for child speech development,” Computer Speech & Language, vol. 45, pp. 278–299.

Li, Q., and Russell, M. J. (2002). “An analysis of the causes of increased error rates in children’s speech recognition,” in Interspeech, Denver, CO, USA.

Giuliani, D., and Gerosa, M. (2003). “Investigating recognition of children’s speech,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. II-137–II-140.

Potamianos, A., Narayanan, S., and Lee, S. (1997). “Automatic speech recognition for children,” in Fifth European Conference on Speech Communication and Technology.
