Music Affect Recognition tutorial @ ISMIR2012

“Music Affect Recognition: The State-of-the-art and Lessons Learned” by Xiao Hu and Yi-Hsuan (Eric) Yang is the most thorough overview of music emotion research in MIR I have ever seen. Here are the handouts; below are some of my thoughts on the topic.

Dr. Hu started with an elegant definition of terminology and scope, noting that ‘music mood’ and ‘music emotion’ are treated similarly in MIR despite their different meanings in psychology. The second point is that music emotion can be either expressed (by an artist) or induced (in a listener), and this tutorial is about the latter. Next, there was an overview of different mood models, from Ekman‘s 6 facial expressions to recent work by Lee on sources of mood induced by music. Dr. Hu discussed a list of issues in this field, including the lack of data (only four datasets available), the low degree of consensus among experts and the low performance of state-of-the-art methods. Then Dr. Hu gave a good overview of music mood taxonomies and categorical affect recognition methods, including her work on mood clustering based on Allmusic data. That research resulted in 5 mood clusters, which correspond to the mood hierarchy obtained by Laurier from the LastFM folksonomy.

In the second part of the tutorial, Eric Yang gave an overview of dimensional mood models, which embed mood metrics in continuous spaces. The most popular space is 2D valence/arousal, but it was nice to discover that multi-dimensional models also exist. Additional dimensions may represent a sense of potency/control and the predictability of music. Eric showed a demo of his project Mr.Emo, which is also capable of emotion-based music discovery and navigation. In conclusion, Eric mentioned research on temporal effects of music mood, affect research in neighbouring domains (e.g. video) and even affect recognition using neuroscanning (EEG etc.).

Other interesting things: a decent recent review of the topic, the DEAP dataset for emotion analysis using EEG, physiological and video signals, and the PsySound toolkit for extracting psychoacoustic features from a signal.

Now it’s time to speculate a bit. Usually, emotions in MIR are considered a subset of all possible music tags. This subset has a nice spatial representation (e.g. in valence/arousal space) and is correlated with music content features. The task is then treated as a multi-label classification problem and solved by ML machinery (in the categorical approach) or as something like metric embedding (when we need a continuous dimensional space of emotions).

However, there is probably something specific to those emotion tags: how humans actually perceive emotions and how perceived emotions are used in decision making. The music affect recognition problem is often considered without a particular purpose, but I assume that eventually we want to help the user find the proper music.

Some psychology studies suggest that there are ‘two selves’: the experiencing self and the remembering self. The experiencing self perceives the actual integral experience (such as the pain suffered each moment one holds a hand in cold water). The remembering self perceives a memory of the past experience (and answers when someone asks how painful it was). The two important observations are:

  1. These guys’ experiences are different. Most of the time, the remembering self neglects the duration of the experience and pays the most attention to the peak moment and the end of the experience.
  2. The remembering self is the one who actually makes decisions. That is, people tend to make decisions based on past experience, and when they do, they have only their memories.

For instance, people were asked to undergo two painful episodes. One was to experience pain for a minute, another to experience the same pain for a minute, and then lesser pain for another half minute. Afterwards, the subjects were asked to say which of the two episodes of pain was worse. Generally they said the shorter one. To test the firmness of their judgements, they were told they would undergo a third painful episode. They were told it would be a repeat of one of the two above and given a choice of which it should be. They generally chose the longer episode.
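To make the experiment concrete, here is a small sketch with made-up numbers. The pain levels (8 and 5 on an arbitrary scale) and the scoring rule (the average of the peak and the final moment) are my assumptions for illustration, not values taken from the study. The point is only that the two selves can rank the same episodes in opposite ways.

```python
# Hypothetical per-second pain levels on an arbitrary scale (assumed values).
short_episode = [8] * 60              # one minute of pain
long_episode = [8] * 60 + [5] * 30    # same minute, then 30 s of lesser pain


def experienced_total(samples):
    """Experiencing self: pain integrated over the whole episode."""
    return sum(samples)


def remembered_score(samples):
    """Remembering self, assumed here to average the peak and the last moment."""
    return (max(samples) + samples[-1]) / 2


for name, episode in [("short", short_episode), ("long", long_episode)]:
    print(name, experienced_total(episode), remembered_score(episode))
```

Under these assumptions the longer episode contains more total pain but gets the lower remembered score, which matches the choice the subjects made.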

So why bother when dealing with music affect? I suppose there are two implications. First, when we collect ground truth and ask people to label music they’ve listened to, it is probably the remembering self that answers. That is, if we seek the content features responsible for the answer, we should put more weight on the end of the music piece, less weight on the peak segment, even less on the beginning, and skip the rest entirely. This can be done in many ways; one of them is to build a generative model of the music affect labeling process and train it on the ground truth.
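The weighting idea above can be sketched in a few lines. This is only an illustration, not the generative model mentioned in the text: the function names, the concrete weights (0.5 / 0.3 / 0.2) and the per-segment `intensity` curve used to locate the peak (e.g. some loudness or arousal estimate) are all hypothetical choices of mine.

```python
def remembering_self_weights(intensity, w_end=0.5, w_peak=0.3, w_start=0.2):
    """Weight segments as: end > peak > beginning, everything else zero.

    `intensity` is a hypothetical per-segment salience curve; the default
    weights are arbitrary illustrative values, not fitted to any data.
    """
    n = len(intensity)
    if n == 0:
        raise ValueError("need at least one segment")
    weights = [0.0] * n
    peak = max(range(n), key=lambda i: intensity[i])
    # Use += so that coinciding positions (e.g. peak at the end) accumulate.
    weights[n - 1] += w_end
    weights[peak] += w_peak
    weights[0] += w_start
    total = sum(weights)
    return [w / total for w in weights]


def pool_features(features, weights):
    """Weighted average of per-segment feature vectors."""
    dims = len(features[0])
    return [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dims)]


# Five segments of a track with made-up intensity and 2-D feature vectors.
intensity = [0.2, 0.4, 0.9, 0.5, 0.3]
features = [[0.1, 0.9], [0.2, 0.8], [0.9, 0.3], [0.5, 0.5], [0.3, 0.6]]

weights = remembering_self_weights(intensity)
track_vector = pool_features(features, weights)
print(weights)
print(track_vector)
```

The pooled vector could then replace a plain average over the whole track as input to whatever classifier or regressor is trained on the labels. Whether that actually helps is exactly the hypothesis that would need testing.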

Second, decision making in music is different from choosing a third painful experiment, the next vacation, etc. Usually, the decision is whether or not to skip a track, and it is firmly made by the experiencing self.

That is, to achieve an accurate understanding of the factors behind music affect, we need to decompose the ground truth based on the remembering self model. To label music for a user, we probably need to draw labels from the experiencing self model (i.e. from the beginning of the track). I wish I had time to verify the hypothesis. However, even in the best case this would only move accuracy closer to the cap set by expert agreement, which is quite low for this problem.
