Four Papers from the Lab Accepted at INTERSPEECH 2024

Posted by: 王非凡 · Date: 2024-06-04

On June 4, four papers from the lab were accepted at INTERSPEECH 2024:

1. Hierarchical Distribution Adaptation for Unsupervised Cross-corpus Speech Emotion Recognition

Authors: 路成, 宗源, 赵焱, 连海伦, 齐天铧, Björn Schuller, 郑文明

Abstract: The primary issue in unsupervised cross-corpus speech emotion recognition (SER) is that the domain shift between training and testing data undermines the SER model's ability to generalize to unknown testing datasets. In this paper, we propose a straightforward and effective strategy, called Hierarchical Distribution Adaptation (HDA), to address the domain bias issue. HDA leverages a hierarchical emotion representation module based on nested Transformers to extract speech emotion features at different levels (e.g., frame/segment/utterance level), capturing multi-scale emotion correlations in speech. Furthermore, a hierarchical distribution adaptation module, including frame-level distribution adaptation (FDA), segment-level distribution adaptation (SDA), and utterance-level distribution adaptation (UDA), is developed to align the emotion representations of the training and testing speech samples at each level and effectively eliminate the domain discrepancy. Extensive experimental results demonstrate the superiority of our proposed HDA over other state-of-the-art (SOTA) methods.
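A common way to instantiate per-level distribution adaptation is a kernel-based maximum mean discrepancy (MMD) loss between source and target features. The abstract does not specify HDA's actual alignment criterion, so the NumPy sketch below is only an illustrative assumption (the function name `rbf_mmd2` and the RBF kernel choice are ours, not the paper's):

```python
import numpy as np

def rbf_mmd2(x, y, sigma=1.0):
    """Biased estimate of squared MMD between samples x (n, d) and y (m, d)
    under an RBF kernel. A per-level loss like this could be computed for
    frame-, segment-, and utterance-level features and then summed."""
    def k(a, b):
        # Pairwise squared Euclidean distances, then the RBF kernel.
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2.0 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2.0 * k(x, y).mean()
```

Minimizing this quantity for features of training (source) and testing (target) utterances pulls the two feature distributions together; identical distributions give a value near zero.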


2. Towards Realistic Emotional Voice Conversion using Controllable Emotional Intensity

Authors: 齐天铧, 王世炎, 路成, 赵焱, 宗源, 郑文明

Abstract: Realistic emotional voice conversion (EVC) aims to enhance the emotional diversity of converted audio, making synthesized voices more authentic and natural. To this end, we propose the Emotional Intensity-aware Network (EINet), which dynamically adjusts intonation and rhythm by incorporating controllable emotional intensity. To better capture nuances in emotional intensity, we go beyond mere distance measurements among acoustic features: an emotion evaluator is utilized to precisely quantify the speaker's emotional state. By employing an intensity mapper, intensity pseudo-labels are obtained to bridge the gap between emotional speech intensity modeling and run-time conversion. To ensure high speech quality while retaining controllability, an emotion renderer smoothly combines linguistic features with the manipulated emotional features at the frame level. Furthermore, we employ a duration predictor to facilitate adaptive prediction of rhythm changes conditioned on the specified intensity value. Experimental results show EINet's superior performance in naturalness and diversity of emotional expression compared to state-of-the-art EVC methods.
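The internals of EINet's intensity mapper and emotion renderer are not described in the abstract. One minimal way to picture frame-level combination under a controllable scalar intensity is to interpolate between neutral and emotional embeddings before adding them to the linguistic features; the sketch below is purely our assumption, and every name in it is hypothetical:

```python
import numpy as np

def render_with_intensity(linguistic, neutral_emb, emotion_emb, alpha):
    """Illustrative frame-level rendering: alpha in [0, 1] acts as the
    controllable intensity, interpolating neutral -> emotional embeddings.

    linguistic:  (frames, dim) frame-level linguistic features
    neutral_emb: (dim,) embedding of the neutral rendition (hypothetical)
    emotion_emb: (dim,) embedding of the full-intensity emotion (hypothetical)
    """
    alpha = float(np.clip(alpha, 0.0, 1.0))
    emo = (1.0 - alpha) * neutral_emb + alpha * emotion_emb
    # Additive combination at the frame level (illustrative; the paper's
    # renderer is presumably learned, not a fixed sum).
    return linguistic + emo
```

Setting `alpha=0` yields the neutral rendition and `alpha=1` the full-intensity emotion, with intermediate values blending smoothly between them.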


3. Confidence-aware Hypothesis Transfer Networks for Source-Free Cross-Corpus Speech Emotion Recognition

Authors: 王金岑, 赵焱, 路成, 连海伦, 常洪丽, 宗源, 郑文明

Abstract: The goal of source-free cross-corpus speech emotion recognition (SER) is to transfer emotion knowledge from a source corpus to a target one without access to the source data. To address this challenge, we develop a novel method named Confidence-aware Hypothesis Transfer Network (CaHTN), which consists of two modules. Specifically, the first module, hypothesis implicit transfer, leverages the frozen source classifier (hypothesis) to force target samples to implicitly align with the source hypothesis via information maximization. In addition, a bidirectional confident self-training module is designed to exploit not only the positive pseudo-labels but also the negative ones to enhance target feature extraction. To verify its effectiveness, we design twelve source-free cross-corpus SER tasks and conduct extensive experiments on CASIA, EmoDB, EMOVO, and eNTERFACE. Experimental results indicate that CaHTN achieves state-of-the-art performance on source-free cross-corpus SER.
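Confident self-training with both positive and negative pseudo-labels is typically built on thresholding the model's class probabilities: very confident predictions become positive pseudo-labels, while classes assigned near-zero probability become negative ("not this class") labels. The thresholds and selection rule below are our assumption for illustration, not CaHTN's actual design:

```python
import numpy as np

def select_pseudo_labels(probs, pos_thr=0.9, neg_thr=0.05):
    """Split target predictions into confident positive and negative labels.

    probs: (n_samples, n_classes) softmax outputs from the frozen hypothesis.
    Returns:
      pos_mask:   which samples get a positive pseudo-label
      pos_labels: argmax class per sample (meaningful only where pos_mask)
      neg_labels: per-class boolean "confidently NOT this class" evidence
    """
    pos_mask = probs.max(axis=1) >= pos_thr
    pos_labels = probs.argmax(axis=1)
    neg_labels = probs < neg_thr
    return pos_mask, pos_labels, neg_labels
```

A self-training loss would then apply cross-entropy on the positive subset and a complementary-label loss pushing probability away from the negative classes.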


4. Boosting Cross-Corpus Speech Emotion Recognition using CycleGAN with Contrastive Learning

Authors: 王金岑, 赵焱, 路成, 唐传高, 李溯南, 宗源, 郑文明

Abstract: The premise for the success of most classic speech emotion recognition (SER) algorithms is that training and testing samples are independent and identically distributed. However, this premise does not always hold in real life. Thus, in this paper, we propose a novel transfer learning method called contrastive cycle generative adversarial network (C2GAN) to address cross-corpus SER, where training and testing data originate from different corpora. Specifically, we first adapt CycleGAN to generate synthetic data, transforming samples between the source and target corpora, to enhance the variability of the source data. Then, an emotion-guided contrastive learning module is introduced to jointly optimize the original and synthetic data during training, leading to better class-level feature alignment. We conduct experiments on the eNTERFACE, CASIA, and EmoDB datasets with six different settings for evaluation. Extensive results confirm the excellent performance of C2GAN over other state-of-the-art methods.
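"Emotion-guided contrastive learning" plausibly resembles a supervised contrastive objective in which samples sharing an emotion label (whether original or CycleGAN-synthesized) act as positives and all others as negatives, which would produce the class-level alignment the abstract mentions. The NumPy sketch below implements a standard supervised contrastive loss under that assumption; it is not taken from the paper:

```python
import numpy as np

def sup_con_loss(feats, labels, tau=0.1):
    """Supervised contrastive loss: same-emotion pairs are positives.

    feats:  (n, d) feature vectors (original and synthetic samples mixed)
    labels: (n,) integer emotion labels
    tau:    temperature controlling the sharpness of the similarity scores
    """
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sim = f @ f.T / tau
    n = len(labels)
    off_diag = ~np.eye(n, dtype=bool)
    # Positives: same label, excluding each sample paired with itself.
    pos_mask = (labels[:, None] == labels[None, :]) & off_diag
    logits = sim - sim.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(logits) * off_diag
    log_prob = logits - np.log(exp.sum(axis=1, keepdims=True))
    return -(log_prob * pos_mask).sum() / pos_mask.sum()
```

Features that cluster by emotion label drive this loss toward zero, whereas mixed clusters are penalized, which is the class-level alignment effect described in the abstract.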