Recently, IEEE ICASSP 2025 announced the results of the first Multimodal Emotion and Intent Joint Understanding Challenge (MEIJU'25 Challenge). Our laboratory won first place in both the Chinese and English sub-tracks of Track 2: Imbalanced Emotion and Intent Recognition.
In the English sub-track, we proposed a Task-Specific Feature Learning (TSFL) method for multimodal emotion and intent joint recognition. TSFL comprises three main components: dual-stream LLM feature extraction, coarse-grained task-specific feature decomposition, and fine-grained task-specific feature learning. The dual-stream LLM feature extraction module uses pretrained Large Language Models (LLMs) to extract features from the video, audio, and text modalities. The coarse-grained task-specific feature decomposition module decomposes these features into emotion-specific and intent-specific parts. The fine-grained task-specific feature learning module then refines the extracted features, applying a conditional mutual learning strategy to ensure that the predictions align better with the ground-truth labels. Together, these components allow the model to efficiently capture high-level emotion-specific and intent-specific features, strengthening joint emotion and intent recognition.
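The coarse-grained decomposition step can be illustrated with a minimal NumPy sketch. Everything here is an assumption for illustration: the feature dimensions, the two placeholder projection matrices (which would be learned in the actual method), and the cosine-based orthogonality penalty used to encourage the emotion-specific and intent-specific parts to carry distinct information. This is not the paper's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: a fused multimodal (LLM-derived) feature of
# size 16, decomposed into two task-specific subspaces of size 8 each.
D_IN, D_TASK = 16, 8

# Placeholder projections; in the actual method these would be learned.
W_emotion = rng.standard_normal((D_IN, D_TASK))
W_intent = rng.standard_normal((D_IN, D_TASK))

def decompose(fused_feature):
    """Split one fused multimodal feature into two task-specific parts."""
    z_emo = fused_feature @ W_emotion   # emotion-specific representation
    z_int = fused_feature @ W_intent    # intent-specific representation
    return z_emo, z_int

def orthogonality_penalty(z_emo, z_int):
    """Penalize overlap between the two task-specific views via their
    squared cosine similarity (0 = orthogonal, 1 = fully aligned)."""
    num = float(z_emo @ z_int) ** 2
    den = float(z_emo @ z_emo) * float(z_int @ z_int) + 1e-8
    return num / den

x = rng.standard_normal(D_IN)           # stand-in for an LLM feature
z_e, z_i = decompose(x)
penalty = orthogonality_penalty(z_e, z_i)
print(z_e.shape, z_i.shape, penalty)
```

During training, such a penalty would be added to the task losses so that the two branches are pushed toward complementary, task-specific subspaces.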
In the Chinese sub-track, we proposed a Reliable Learning Framework (RLF) for multimodal emotion and intent joint recognition. RLF consists of a Hierarchical Interaction Network (HIN) and a Reliable Fusion Strategy (RFS). HIN mines emotion and intent cues from the high-level semantic features of multimodal data (video, audio, and text) generated by pretrained Large Language Models (LLMs), and performs inter-modality and inter-view feature interaction in turn through attention mechanisms and a mutual learning strategy; this multi-level interaction strengthens the representations of emotion and intent features. RFS then integrates multiple predictions from the HIN model to further improve the robustness and generalization of emotion and intent understanding.
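The reliable-fusion idea of integrating several predictions can be sketched as follows. This is an illustrative stand-in, not the paper's RFS: the logits, the number of runs, and the choice of entropy-based confidence weighting are all assumptions made for the example.

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def reliable_fuse(all_logits):
    """Fuse class predictions from several runs, weighting each run by
    its confidence (low predictive entropy -> higher weight)."""
    probs = softmax(np.asarray(all_logits))           # (runs, classes)
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=-1)
    weights = softmax(-entropy[None, :])[0]           # confident runs weigh more
    return weights @ probs                            # fused distribution

runs = [np.array([2.0, 0.5, 0.1]),   # three hypothetical prediction runs
        np.array([1.8, 0.7, 0.2]),
        np.array([0.4, 0.3, 0.5])]   # a less confident (near-uniform) run
fused = reliable_fuse(runs)
print(fused, fused.argmax())
```

The fused output remains a valid probability distribution, and the near-uniform run contributes less to the final decision than the two confident runs.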
Both competition papers have been accepted and published at IEEE ICASSP 2025. The paper details and award certificates are given below:
1. 【MEIJU'25 Challenge Track#2 (English Sub-track) Championship Paper】
Title: Enhancing Task-Specific Feature Learning with LLMs for Multimodal Emotion and Intent Joint Understanding
Authors: 李兆阳(#), 路成(#), 徐啸林, 张凯飞, 顾予佳 (senior undergraduate), 李邦华 (senior undergraduate), 宗源(*), 郑文明(*)
Abstract: This paper introduces our solution, the Task-Specific Feature Learning (TSFL) method, designed to address the second track of the MEIJU Challenge at ICASSP 2025, namely, Imbalanced Emotion and Intent Recognition (English). The TSFL method incorporates three core components: the use of LLM features to represent multimodal signals, coarse-grained task-specific feature decomposition, and fine-grained task-specific feature learning. These components enable the effective joint learning of emotion-discriminative and intent-discriminative features. As a result, our method achieved a JRBM score of 0.6230, significantly outperforming the official baseline result and surpassing all other competing teams to win the championship.
2. 【MEIJU'25 Challenge Track#2 (Chinese Sub-track) Championship Paper】
Title: Reliable Learning From LLM Features for Multimodal Emotion and Intent Joint Understanding
Authors: 徐啸林(#), 路成(#), 李兆阳, 刘宇韵, 马英豪, 罗嘉豪 (junior undergraduate), 宗源(*), 郑文明(*)
Abstract: This paper describes a Reliable Learning Framework (RLF) for the 1st Multimodal Emotion and Intent Joint Understanding (MEIJU) Challenge at ICASSP 2025. Our proposed RLF includes a Hierarchical Interaction Network and a Reliable Fusion Strategy. The former can excavate emotion and intent cues from the high-level semantic features of multimodal data (video, audio, and text) generated by pretrained Large Language Models (LLMs), to enhance their representations, and the latter reliably integrates multiple predictions to further improve the robustness of emotion and intent understanding. Our RLF method achieved first place on Track 2 (Mandarin) of MEIJU, with performance scores for emotion, intent, and joint recognition reaching 0.7285, 0.7456, and 0.7370, respectively.