===Emotional speech===
Various changes in the autonomic nervous system can indirectly alter a person's speech, and affective technologies can leverage this information to recognize emotion. For example, speech produced in a state of fear, anger, or joy becomes fast, loud, and precisely enunciated, with a higher and wider pitch range, whereas emotions such as tiredness, boredom, or sadness tend to generate slow, low-pitched, and slurred speech.<ref>{{cite journal | last1=Breazeal | first1=Cynthia | last2=Aryananda | first2=Lijin |title=Recognition of Affective Communicative Intent in Robot-Directed Speech | journal=Autonomous Robots | publisher=Springer | volume=12 | issue=1 | year=2002 | issn=0929-5593 | doi=10.1023/a:1013215010749 | pages=83–104 | s2cid=459892 |url=http://web.media.mit.edu/~cynthiab/Papers/breazeal-aryananda-AutoRo02.pdf}}</ref> Some emotions have been found to be easier to identify computationally than others, such as anger<ref name="Dellaert" /> or approval.<ref>{{Cite book|last1=Roy|first1=D.|last2=Pentland|first2=A.|title=Proceedings of the Second International Conference on Automatic Face and Gesture Recognition |chapter=Automatic spoken affect classification and analysis |date=1996-10-01|pages=363–367|doi=10.1109/AFGR.1996.557292|isbn=978-0-8186-7713-7|s2cid=23157273}}</ref>

Emotional speech processing technologies recognize the user's emotional state through computational analysis of speech features. Vocal parameters and [[prosody (linguistics)|prosodic]] features such as pitch variables and speech rate can be analyzed through pattern recognition techniques.<ref name="Dellaert">Dellaert, F., Polzin, T., and Waibel, A., "Recognizing Emotion in Speech", in Proc. of ICSLP 1996, Philadelphia, PA, pp. 1970–1973, 1996.</ref><ref name="Lee">Lee, C.M.; Narayanan, S.; Pieraccini, R., "Recognition of Negative Emotion in the Human Speech Signals", Workshop on Automatic Speech Recognition and Understanding, Dec 2001.</ref>

Speech analysis is an effective method of identifying affective state, with an average reported accuracy of 70 to 80% in research from 2003 and 2006.<ref>{{Cite journal|last1=Neiberg|first1=D|last2=Elenius|first2=K|last3=Laskowski|first3=K|date=2006|title=Emotion recognition in spontaneous speech using GMMs|url=http://www.speech.kth.se/prod/publications/files/1192.pdf|journal=Proceedings of Interspeech|doi=10.21437/Interspeech.2006-277|s2cid=5790745}}</ref><ref>{{Cite journal|last1=Yacoub|first1=Sherif|last2=Simske|first2=Steve|last3=Lin|first3=Xiaofan|last4=Burns|first4=John|date=2003|title=Recognition of Emotions in Interactive Voice Response Systems|journal=Proceedings of Eurospeech|pages=729–732|doi=10.21437/Eurospeech.2003-307|citeseerx=10.1.1.420.8158|s2cid=11671944}}</ref> These systems tend to outperform average human accuracy (approximately 60%<ref name="Dellaert" />) but are less accurate than systems that employ other modalities for emotion detection, such as physiological states or facial expressions.<ref name="Hudlicka-2003-p24">{{harvnb|Hudlicka|2003|p=24}}</ref> However, since many speech characteristics are independent of semantics or culture, this technique is considered a promising route for further research.<ref name="Hudlicka-2003-p25">{{harvnb|Hudlicka|2003|p=25}}</ref>
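As a rough illustration only (not drawn from the cited studies), prosodic measurements of the kind described above can be computed from a recording with an audio library such as librosa; the file name and the particular statistics below are placeholders:

<syntaxhighlight lang="python">
# Illustrative sketch: extract simple prosodic statistics (pitch and energy)
# of the kind that speech-based emotion recognizers take as input.
# Assumes the librosa library and a hypothetical recording "utterance.wav".
import numpy as np
import librosa

y, sr = librosa.load("utterance.wav", sr=None)

# Fundamental-frequency (pitch) track; unvoiced frames are returned as NaN.
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr)

features = {
    "mean_pitch_hz": float(np.nanmean(f0)),                    # tends to rise with fear, anger, joy
    "pitch_range_hz": float(np.nanmax(f0) - np.nanmin(f0)),
    "mean_energy": float(np.mean(librosa.feature.rms(y=y))),   # crude loudness proxy
}
print(features)
</syntaxhighlight>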
====Algorithms====
The process of speech/text affect detection requires the creation of a reliable [[database]], [[knowledge base]], or [[vector space model]],<ref name="Osgood75">{{cite book | author = Charles Osgood |author2=William May |author3=Murray Miron | title = Cross-Cultural Universals of Affective Meaning | url = https://archive.org/details/crossculturaluni00osgo | url-access = registration | publisher = Univ. of Illinois Press | isbn = 978-94-007-5069-2 | year = 1975 }}</ref> broad enough to fit every need of its application, as well as the selection of a successful classifier which allows for quick and accurate emotion identification.

{{Asof|2010}}, the most frequently used classifiers were linear discriminant classifiers (LDC), k-nearest neighbor (k-NN), Gaussian mixture models (GMM), support vector machines (SVM), artificial neural networks (ANN), decision tree algorithms and hidden Markov models (HMMs).<ref name="Scherer-2010-p241">{{harvnb|Scherer|Bänziger|Roesch|2010|p=241}}</ref> Various studies have shown that choosing the appropriate classifier can significantly enhance the overall performance of the system.<ref name="Hudlicka-2003-p24"/> The list below gives a brief description of each algorithm; an illustrative sketch of one of them follows the list.

* [[Linear classifier|LDC]] – classification is based on the value obtained from a linear combination of the feature values, which are usually provided as feature vectors.
* [[K-nearest neighbor algorithm|k-NN]] – classification happens by locating the object in the feature space and comparing it with the k nearest neighbors (training examples); a majority vote decides the classification.
* [[Gaussian mixture model|GMM]] – a probabilistic model used for representing the existence of subpopulations within the overall population. Each subpopulation is described by a mixture distribution, which allows observations to be classified into the subpopulations.<ref>[http://cnx.org/content/m13205/latest/ "Gaussian Mixture Model"]. Connexions – Sharing Knowledge and Building Communities. Retrieved 10 March 2011.</ref>
* [[Support vector machine|SVM]] – a type of (usually binary) linear classifier which decides into which of the two (or more) possible classes each input falls.
* [[Artificial neural network|ANN]] – a mathematical model, inspired by biological neural networks, that can better capture possible non-linearities of the feature space.
* [[Decision tree learning|Decision tree algorithms]] – work by following a decision tree in which leaves represent the classification outcome and branches represent the conjunctions of features that lead to that outcome.
* [[Hidden Markov model|HMMs]] – a statistical Markov model in which the states and state transitions are not directly observable. Instead, a series of outputs dependent on the states is visible. In affect recognition, the outputs represent the sequence of speech feature vectors, from which the sequence of states the model passed through can be deduced. The states can consist of various intermediate steps in the expression of an emotion, and each has a probability distribution over the possible output vectors. The state sequence allows prediction of the affective state being classified, and this is one of the most commonly used techniques within the area of speech affect detection.
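For example, GMM-based emotion recognition is often implemented by fitting one mixture model per emotion class and assigning a new utterance to the class whose model gives it the highest likelihood. The sketch below assumes scikit-learn and uses randomly generated placeholder feature vectors and labels rather than data from any cited study:

<syntaxhighlight lang="python">
# Illustrative sketch of a GMM-based affect classifier: one Gaussian mixture
# per emotion class; a new utterance gets the label of the best-scoring model.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X_train = rng.normal(size=(300, 12))     # 300 utterances x 12 speech features (placeholder data)
y_train = rng.integers(0, 4, size=300)   # 4 emotion classes (e.g. anger, joy, sadness, neutral)

# Fit one mixture model on the feature vectors of each emotion class.
models = {
    label: GaussianMixture(n_components=4, covariance_type="diag",
                           random_state=0).fit(X_train[y_train == label])
    for label in np.unique(y_train)
}

def classify(x):
    """Return the emotion label whose GMM assigns x the highest log-likelihood."""
    x = np.atleast_2d(x)
    return max(models, key=lambda label: models[label].score_samples(x)[0])

print(classify(rng.normal(size=12)))
</syntaxhighlight>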
It has been shown that, given enough acoustic evidence, the emotional state of a person can be classified by a set of majority-voting classifiers. One proposed set is based on three main classifiers: kNN, C4.5 and an SVM with an RBF kernel, and achieves better performance than each basic classifier taken separately. It was compared with two other sets of classifiers: a one-against-all (OAA) multiclass SVM with hybrid kernels, and a set consisting of the two basic classifiers C5.0 and a neural network; the proposed variant achieves better performance than both.<ref>{{cite journal|url=http://ntv.ifmo.ru/en/article/11200/raspoznavanie_i_prognozirovanie_dlitelnyh__emociy_v_rechi_(na_angl._yazyke).htm|title=Extended speech emotion recognition and prediction|author=S.E. Khoruzhnikov|journal=Scientific and Technical Journal of Information Technologies, Mechanics and Optics|volume=14|issue=6|page=137|year=2014|display-authors=etal}}</ref>
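A minimal sketch of this majority-voting idea, assuming scikit-learn, is shown below; DecisionTreeClassifier stands in for C4.5 (scikit-learn implements CART rather than C4.5), and the feature matrix and labels are placeholders rather than data from the cited study:

<syntaxhighlight lang="python">
# Illustrative majority-voting ensemble over kNN, a decision tree and an RBF-kernel SVM.
import numpy as np
from sklearn.ensemble import VotingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 12))       # placeholder acoustic feature vectors
y = rng.integers(0, 4, size=200)     # placeholder emotion labels

ensemble = VotingClassifier(
    estimators=[
        ("knn", KNeighborsClassifier(n_neighbors=5)),
        ("tree", DecisionTreeClassifier(max_depth=8)),   # CART stand-in for C4.5
        ("svm_rbf", SVC(kernel="rbf", gamma="scale")),
    ],
    voting="hard",   # each classifier casts one vote; the majority label wins
)
ensemble.fit(X, y)
print(ensemble.predict(X[:5]))
</syntaxhighlight>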
====Databases====
The vast majority of present systems are data-dependent. This creates one of the biggest challenges in detecting emotions based on speech, as it implies choosing an appropriate database with which to train the classifier. Most of the currently available data was obtained from actors and is thus a representation of archetypal emotions. These so-called acted databases are usually based on the Basic Emotions theory (by [[Paul Ekman]]), which assumes the existence of six basic emotions (anger, fear, disgust, surprise, joy, sadness), the others simply being a mix of the former.<ref name="Ekman, P. 1969">Ekman, P. & Friesen, W. V (1969). [http://www.communicationcache.com/uploads/1/0/8/8/10887248/the-repertoire-of-nonverbal-behavior-categories-origins-usage-and-coding.pdf The repertoire of nonverbal behavior: Categories, origins, usage, and coding]. Semiotica, 1, 49–98.</ref> Nevertheless, these still offer high audio quality and balanced classes (although often too few), which contribute to high success rates in recognizing emotions. However, for real-life application, naturalistic data is preferred.

A naturalistic database can be produced by observation and analysis of subjects in their natural context. Ultimately, such a database should allow the system to recognize emotions based on their context as well as work out the goals and outcomes of the interaction. The nature of this type of data allows for authentic real-life implementation, because it describes states naturally occurring during [[human–computer interaction]] (HCI). Despite the numerous advantages which naturalistic data has over acted data, it is difficult to obtain and usually has low emotional intensity. Moreover, data obtained in a natural context has lower signal quality, due to background noise and the distance of the subjects from the microphone. The first attempt to produce such a database was the FAU Aibo Emotion Corpus for CEICES (Combining Efforts for Improving Automatic Classification of Emotional User States), which was developed in the realistic context of children (aged 10–13) playing with Sony's Aibo robot pet.<ref name="Steidl-2011">{{cite web | last = Steidl | first = Stefan | title = FAU Aibo Emotion Corpus | publisher = Pattern Recognition Lab | date = 5 March 2011 | url = http://www5.cs.fau.de/de/mitarbeiter/steidl-stefan/fau-aibo-emotion-corpus/ }}</ref><ref name="Scherer-2010-p243">{{harvnb|Scherer|Bänziger|Roesch|2010|p=243}}</ref> Likewise, producing one standard database for all emotional research would provide a method of evaluating and comparing different affect recognition systems.

====Speech descriptors====
The complexity of the affect recognition process increases with the number of classes (affects) and speech descriptors used within the classifier. It is therefore crucial to select only the most relevant features, both to ensure that the model can successfully identify emotions and to increase performance, which is particularly significant for real-time detection. The range of possible choices is vast, with some studies mentioning the use of over 200 distinct features.<ref name="Scherer-2010-p241"/> Identifying those that are redundant and undesirable is crucial in order to optimize the system and increase the success rate of correct emotion detection. The most common speech characteristics are categorized into the groups below;<ref name="Steidl-2011"/><ref name="Scherer-2010-p243"/> a short extraction sketch follows the list.

# Frequency characteristics<ref>{{Cite book |doi=10.1109/ICCCI50826.2021.9402569|isbn=978-1-7281-5875-4|chapter=Non-linear frequency warping using constant-Q transformation for speech emotion recognition|title=2021 International Conference on Computer Communication and Informatics (ICCCI)|pages=1–4|year=2021|last1=Singh|first1=Premjeet|last2=Saha|first2=Goutam|last3=Sahidullah|first3=Md|arxiv=2102.04029|s2cid=231846518}}</ref>
#* Accent shape – affected by the rate of change of the fundamental frequency.
#* Average pitch – description of how high or low the speaker speaks relative to normal speech.
#* Contour slope – describes the tendency of the frequency change over time; it can be rising, falling or level.
#* Final lowering – the amount by which the frequency falls at the end of an utterance.
#* Pitch range – measures the spread between the maximum and minimum frequency of an utterance.
# Time-related features:
#* Speech rate – describes the rate of words or syllables uttered over a unit of time.
#* Stress frequency – measures the rate of occurrence of pitch-accented utterances.
# Voice quality parameters and energy descriptors:
#* Breathiness – measures the aspiration noise in speech.
#* Brilliance – describes the dominance of high or low frequencies in the speech.
#* Loudness – measures the amplitude of the speech waveform; translates to the energy of an utterance.
#* Pause discontinuity – describes the transitions between sound and silence.
#* Pitch discontinuity – describes the transitions of the fundamental frequency.
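As a rough illustration only, a few of these descriptors can be approximated from a waveform with the librosa library; the onset-based speech-rate proxy, the silence threshold, and the spectral-centroid stand-in for brilliance are illustrative choices rather than standard definitions:

<syntaxhighlight lang="python">
# Hedged sketch: approximate a few of the listed descriptors with librosa.
import numpy as np
import librosa

y, sr = librosa.load("utterance.wav", sr=None)    # hypothetical recording

# Frequency characteristics: contour slope from a linear fit to the pitch track.
f0, _, _ = librosa.pyin(y, fmin=librosa.note_to_hz("C2"),
                        fmax=librosa.note_to_hz("C7"), sr=sr)
t = np.arange(len(f0)) * (512 / sr)               # librosa's default hop length
voiced = ~np.isnan(f0)
contour_slope = np.polyfit(t[voiced], f0[voiced], 1)[0]   # Hz per second

# Time-related features: onset density as a crude speech-rate proxy.
onsets = librosa.onset.onset_detect(y=y, sr=sr)
speech_rate = len(onsets) / (len(y) / sr)         # onsets per second

# Voice quality and energy descriptors.
rms = librosa.feature.rms(y=y)[0]
loudness = float(rms.mean())                                   # energy of the utterance
pause_ratio = float((rms < 0.1 * rms.max()).mean())            # share of near-silent frames
brilliance = float(librosa.feature.spectral_centroid(y=y, sr=sr).mean())  # high- vs low-frequency dominance

print(dict(contour_slope=contour_slope, speech_rate=speech_rate,
           loudness=loudness, pause_ratio=pause_ratio, brilliance=brilliance))
</syntaxhighlight>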