The use of ensemble techniques in multiclass speech emotion recognition to improve both accuracy and confidence in classifications
Creating machines with the ability to reason, perceive, learn and make decisions based on a human-like intelligence has been an interest of artificial intelligence researchers for decades, with the long-term goal of developing a general intelligence capable of solving problems just as humans do. Affective computing is the area of these studies which focusses on the design and development of intelligent devices that can perceive, process and synthesize human emotion. Humans can interpret emotion in a number of different ways, for example by processing spoken utterances, non-verbal cues, facial expressions and written communication. Changes in our nervous system indirectly alter spoken utterances, which makes it possible for people to perceive how others feel by listening to them speak. These changes can also be interpreted by machines through the extraction of speech features. The field of Speech Emotion Recognition (SER) takes advantage of this capability and has offered many approaches to recognizing affect in spoken utterances. Our research focusses on this problem and offers a contribution to state-of-the-art systems which can increase not only the accuracy of predictions but also the reliability of, or confidence in, those predictions. The majority of state-of-the-art SER systems employ complex statistical algorithms to model the relationship between acoustic parameters extracted from spoken language. This model can then be used to classify new instances of emotionally expressive speech. Other SER systems use the content of spoken utterances, i.e. what is being said, along with acoustic parameters to make a more informed prediction. Our work highlights that state-of-the-art SER systems do not employ state-of-the-art text analysis techniques and therefore limit their predictive ability.
This thesis therefore presents a classification system which exploits best practices from both the acoustic and text-processing domains to create an SER system which exhibits more accurate and confident predictions than state-of-the-art systems to date.