After extracting these features from each fixed-length frame of the music stream, their first and second derivatives are added to the feature set of the corresponding frame in order to capture temporal characteristics across consecutive frames.
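The delta-feature step above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the use of `np.gradient` as the derivative estimator and the 13-dimensional example feature are assumptions (real systems often use a regression window instead of simple differencing).

```python
import numpy as np

def add_deltas(features):
    """Append first and second temporal derivatives (delta and
    delta-delta) to each frame's feature vector.

    features: (num_frames, dim) array of per-frame features.
    Returns an array of shape (num_frames, 3 * dim).
    """
    delta = np.gradient(features, axis=0)   # first derivative over time
    delta2 = np.gradient(delta, axis=0)     # second derivative over time
    return np.concatenate([features, delta, delta2], axis=1)

frames = np.random.rand(100, 13)            # e.g. 13 MFCCs per frame (illustrative)
extended = add_deltas(frames)
print(extended.shape)                       # (100, 39)
```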
The next step is to estimate the log-likelihood of the features on the respective acoustic models constructed for each type of musical mood. These acoustic models must therefore be trained in advance. In this study, the distribution of acoustic features extracted from the music data for each mood is modeled by a mixture of Gaussian density functions; that is, a Gaussian mixture model (GMM) is constructed for each musical mood following standard model-training procedures.
The log-likelihood of the feature vectors extracted from a given music signal is computed on each GMM, and the M log-likelihood results are then submitted to the emotion decision process.

Facial expression and speech are the representative indicators that directly convey human emotional information. The bimodal emotion recognition approach integrates the recognition results obtained from face and speech, respectively.
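The per-mood log-likelihood scoring can be sketched as below. This is an illustrative diagonal-covariance GMM scorer, not the paper's implementation; the parameter layout (`weights`, `means`, `variances`) is an assumption.

```python
import numpy as np

def gmm_log_likelihood(X, weights, means, variances):
    """Average log-likelihood of frames X under a diagonal-covariance GMM.

    X:         (num_frames, dim) feature matrix
    weights:   (K,) mixture weights summing to 1
    means:     (K, dim) component means
    variances: (K, dim) diagonal covariances
    """
    diff = X[:, None, :] - means[None, :, :]                  # (N, K, dim)
    # log N(x | mu_k, diag(var_k)) for every frame/component pair
    log_norm = -0.5 * (np.log(2 * np.pi * variances).sum(axis=1)
                       + (diff ** 2 / variances[None, :, :]).sum(axis=2))
    # log sum_k w_k N(...) via log-sum-exp for numerical stability
    log_joint = np.log(weights)[None, :] + log_norm            # (N, K)
    m = log_joint.max(axis=1, keepdims=True)
    frame_ll = m[:, 0] + np.log(np.exp(log_joint - m).sum(axis=1))
    return frame_ll.mean()

def score_moods(X, mood_gmms):
    """Score the same features against one GMM per mood."""
    return {mood: gmm_log_likelihood(X, *params)
            for mood, params in mood_gmms.items()}
```

The mood whose GMM yields the highest average log-likelihood is the best-matching mood; the full M-dimensional score vector is what feeds the decision process.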
In facial expression recognition, accurate detection of the face strongly influences recognition performance. A bottom-up, feature-based approach is widely used for robust face detection. This approach searches an image for a set of facial features indicated by color and shape, and then groups them into face candidates based on the geometric relationships among those features. Finally, a candidate region is confirmed as a face by locating the eyes within its eye region. The detected facial image is then submitted to the facial expression recognition module.
The first step of facial expression recognition is to normalize the captured image. Two kinds of features are then extracted on the basis of Ekman's facial expression features [28].
The first feature is a facial image consisting of three facial regions: the lips, eyebrows, and forehead. By applying histogram equalization and a threshold based on the standard deviation of the brightness distribution of the normalized facial image, each facial region is extracted from the entire image.
The second feature is an edge image of the same three facial regions, with the edges around the regions likewise extracted with the aid of histogram equalization. Next, the facial features are used to train a classifier that determines explicitly distinctive boundaries between emotions; each boundary serves as the criterion for deciding the emotional state of a given facial image.
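The equalization-plus-threshold step can be sketched as below. This is a crude stand-in, since the text does not specify the exact thresholding rule: the `num_std` parameter and the "darker than mean minus k standard deviations" criterion for isolating dark regions such as lips and eyebrows are assumptions.

```python
import numpy as np

def equalize_histogram(gray):
    """Spread the brightness distribution of an 8-bit grayscale image."""
    hist = np.bincount(gray.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf = (cdf - cdf.min()) * 255 / max(cdf.max() - cdf.min(), 1)
    return cdf[gray].astype(np.uint8)

def extract_dark_regions(gray, num_std=1.0):
    """Keep pixels darker than (mean - num_std * std) of the equalized
    image -- an illustrative rule for isolating lip/eyebrow regions."""
    eq = equalize_histogram(gray)
    threshold = eq.mean() - num_std * eq.std()
    return (eq < threshold).astype(np.uint8)
```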
Various techniques already in use for conventional pattern classification problems are likewise used as emotion classifiers. Among them, neural network (NN)-based approaches have been widely adopted for facial emotion recognition and have provided reliable performance [29-31].
A recent study on NN-based emotion recognition [32] reported the efficiency of the back-propagation (BP) algorithm proposed by Rumelhart and McClelland [33].
In this study, we follow the training procedure introduced in [31], which uses an advanced BP algorithm called error BP. Each of the extracted features is trained using two neural networks per emotion type.
Each neural network is composed of input nodes, six hidden nodes, and M output nodes. The input nodes receive the pixels of the input image, and the output nodes correspond to the M emotions. The number of hidden nodes was determined experimentally. Finally, decision logic determines the final emotion from the outputs of the two neural networks, using a weighted sum of the two results together with a voting method over the result transitions in the time domain. The overall process of emotion recognition through facial expression is shown in Figure 3.
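The weighted-sum-plus-voting decision logic can be sketched as follows. The equal 0.5/0.5 weighting, the five-frame voting window, and the example emotion set are illustrative assumptions, not values given in the text.

```python
import numpy as np
from collections import Counter

EMOTIONS = ["happy", "sad", "angry", "neutral"]   # example set; M = 4

def decide_face_emotion(out_region, out_edge, history, w=0.5, window=5):
    """Combine the two network outputs by a weighted sum, then smooth
    the decision by majority voting over recent results.

    out_region, out_edge: M-dimensional outputs of the networks trained
    on the region-image and edge-image features, respectively.
    history: mutable list of past per-frame decisions (emotion indices).
    """
    fused = w * np.asarray(out_region) + (1 - w) * np.asarray(out_edge)
    history.append(int(np.argmax(fused)))
    recent = history[-window:]
    # majority vote over the most recent per-frame decisions
    return EMOTIONS[Counter(recent).most_common(1)[0][0]]
```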
Once audio signals transmitted through a robot microphone are determined to be human voice signals, the speech emotion recognition module is activated. In the first step, several acoustic features representing emotional characteristics are estimated from the voice signals. Two types of acoustic features are extracted: a phonetic feature and a prosodic feature.
The MFCC and LPC features used for musical mood recognition are also employed as the phonetic features for speech emotion recognition, while spectral energy and pitch serve as the prosodic features. As in musical mood recognition, the first and second derivatives of all features are added to the feature set. Next, the acoustic features are recognized by a pattern classifier. Although various classifiers such as HMMs and SVMs have been applied to speech emotion recognition, we employ the neural-network-based classifier used in the facial expression recognition module, so that the fusion process integrating the recognition results of the two indicators can be handled efficiently.
We organize one sub-neural network for each emotion; all sub-networks share the same basic architecture. A sub-network comprises input nodes matching the dimension of the acoustic features, hidden nodes, and one output node. The number of hidden nodes varies according to the distinctness of the respective emotion.
When there are M emotions, the acoustic features extracted from the voice signals are fed simultaneously into the M sub-networks, yielding an M-dimensional vector as the recognition result. The configuration of the neural network is similar to that adopted in [17], but we adjust the internal learning weights of each sub-network and the normalization algorithm to suit the characteristics of the acoustic features. Figure 4 shows a simplified architecture of the proposed bimodal recognition when the number of emotions is M.
As a recognition result, an M-dimensional vector is obtained from each of facial expression and speech. Let R_face(t) and R_speech(t) denote the vectors obtained from the two indicators at time t.
The final procedure of bimodal emotion recognition is a fusion process in which the results R_face(t) and R_speech(t) are integrated. We calculate the vector R_bimodal(t), referred to as the fusion vector, as in (2). The weights are determined by reference to the recognition results for each indicator. In general, the performance of standard emotion recognition systems depends substantially on how the user expresses emotional states [6]. Thus, systems occasionally exhibit the common error of a rapid transition between detected emotional states.
To address this problem, we consider the general tendency that human emotional states rarely change quickly back and forth.
Hence, the proposed fusion process in (2) uses the two recognition results obtained just before the current time t, so as to reflect the emotional state demonstrated during the preceding period. The final procedure in the perception system is to determine an emotion on the basis of the bimodal fusion vector calculated in (2) and the mood recognition result estimated in (1). These two results lie on different scales but have the same dimension, corresponding to the number of emotions and moods.
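The fusion step can be sketched as below. Because the exact form of Eq. (2) is not reproduced in this excerpt, this is only an illustrative weighted combination: the parameters `w_face`, `w_speech`, and `alpha` (the influence of the two previous fusion vectors) are placeholder values, not the paper's.

```python
import numpy as np

def fuse_bimodal(r_face_t, r_speech_t, r_prev1, r_prev2,
                 w_face=0.5, w_speech=0.5, alpha=0.3):
    """Illustrative fusion of face and speech recognition results.

    The current weighted sum is smoothed with the two previous fusion
    vectors so the estimated emotion cannot flip back and forth abruptly,
    matching the tendency that human emotional states rarely change
    quickly back and forth.
    """
    current = w_face * np.asarray(r_face_t) + w_speech * np.asarray(r_speech_t)
    past = 0.5 * (np.asarray(r_prev1) + np.asarray(r_prev2))
    return (1 - alpha) * current + alpha * past
```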
Let R_bimodal(e) and R_music(e) denote the value of the e-th emotion in the fusion vector and that of the e-th mood in the mood recognition result, respectively. In addition to these two results, our decision mechanism utilizes supplementary information. This research originated from the general idea of a relationship between music and human emotion: we strongly believe that the music a person listens to correlates directly with the emotion that person feels.
Thus, if a robot detects a musical mood similar to one that a user has enjoyed in the past, the user is likely to be in an emotional state similar to the one the robot determined at that time. To capture this, we steadily record the bimodal recognition results according to musical mood whenever the three recognition modules are simultaneously active. The average values of the bimodal results for each type of musical mood are then stored in the memory system, which is described in the following section.
To utilize this value in the decision process, the mood of the music being played must first be determined. Finally, we determine an emotion on the basis of the three kinds of results, all of which are M-dimensional vectors.
This decision criterion applies only if either the bimodal indicators or the musical indicator is active. When only musical signals are detected, R_bimodal(e) is automatically set to zero.

The memory system consecutively records several specific types of emotions, such as happy, sad, or angry, from among the emotions that the perception system detects from the three kinds of indicators.
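The final decision over the three M-dimensional results can be sketched as follows. Since the combining equation is not reproduced in this excerpt, the min-max normalization and the weights `w_b`, `w_m`, `w_h` are illustrative assumptions; the normalization step reflects the text's note that the results lie on different scales.

```python
import numpy as np

def decide_emotion(r_bimodal, r_music, r_memory,
                   w_b=0.6, w_m=0.2, w_h=0.2, bimodal_active=True):
    """Illustrative final decision over M emotions.

    r_bimodal: bimodal fusion vector (face + speech)
    r_music:   musical mood recognition result
    r_memory:  remembered average bimodal result for the detected mood
    Returns the index of the winning emotion.
    """
    def normalize(v):
        # min-max normalization so differently scaled results are comparable
        v = np.asarray(v, dtype=float)
        rng = v.max() - v.min()
        return (v - v.min()) / rng if rng > 0 else np.zeros_like(v)

    # when only musical signals are detected, the bimodal term is zeroed
    r_b = normalize(r_bimodal) if bimodal_active else np.zeros(len(r_bimodal))
    score = w_b * r_b + w_m * normalize(r_music) + w_h * normalize(r_memory)
    return int(np.argmax(score))
```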
The system creates emotion records including the emotion type and time. Such emotion records can naturally be utilized for affective interaction with the user.