Chaolong Li (李超龙)


Master's Student

Affective Information Processing Lab,

Key Laboratory of Child Development and Learning Science of Ministry of Education,

School of Biological Sciences and Medical Engineering,

Southeast University, Nanjing, Jiangsu Province, China.


Supervisors: Prof. Wenming Zheng and Prof. Zhen Cui

Email: lichaolong[at]


I received my B.Sc. degree in Neuro Education from Southeast University in June 2016. Since September 2016, I have been a master's student in the Affective Information Processing Lab (AIPL), Key Laboratory of Child Development and Learning Science of Ministry of Education, School of Biological Sciences and Medical Engineering, Southeast University, under the supervision of Prof. Wenming Zheng and Prof. Zhen Cui.

Research Interests

My main interests include affective computing, pattern recognition, computer vision, and deep learning. I focus especially on facial expression recognition, and on deep learning on graphs and its applications to skeleton-based action recognition.


  1. Chaolong Li, Zhen Cui, Wenming Zheng, Chunyan Xu, Rongrong Ji, Jian Yang, “Action-Attending Graphic Neural Network,” IEEE Transactions on Image Processing (TIP), vol. 27, no. 7, pp. 3657-3670, 2018. [CCF-A] (IF: 5.071)

     Abstract: The motion analysis of human skeletons is crucial for human action recognition, which is one of the most active topics in computer vision. In this paper, we propose a fully end-to-end action-attending graphic neural network (A²GNN) for skeleton-based action recognition, in which each irregular skeleton is structured as an undirected attribute graph. To extract high-level semantic representations from skeletons, we perform local spectral graph filtering on the constructed attribute graphs, analogous to the standard image convolution operation. Considering that not all joints are informative for action analysis, we design an action-attending layer to detect salient action units (AUs) by adaptively weighting skeletal joints. Herein the filtering responses are parameterized into a weighting function that is independent of the order of the input nodes. To further encode continuous motion variations, the deep features learnt from skeletal graphs are gathered along consecutive temporal slices and then fed into a recurrent gated network. Finally, the spectral graph filtering, action-attending, and recurrent temporal encoding are integrated and trained jointly for the sake of robust action recognition as well as the intelligibility of human actions. To evaluate our A²GNN, we conduct extensive experiments on four benchmark skeleton-based action datasets, including the large-scale challenging NTU RGB+D dataset. The experimental results demonstrate that our network achieves state-of-the-art performance.

     BibTeX:
        title={Action-Attending Graphic Neural Network},
        author={Li, Chaolong and Cui, Zhen and Zheng, Wenming and Xu, Chunyan and Ji, Rongrong and Yang, Jian},
        journal={IEEE Transactions on Image Processing},
  2. Chaolong Li, Zhen Cui, Wenming Zheng, Chunyan Xu, Jian Yang, “Spatio-Temporal Graph Convolution for Skeleton Based Action Recognition,” In Proc. AAAI, Feb. 2018, pp. 3482-3489. [CCF-A] [Spotlight]

     Abstract: Variations of human body skeletons may be considered as dynamic graphs, which are a generic data representation for numerous real-world applications. In this paper, we propose a spatio-temporal graph convolution (STGC) approach for assembling the successes of local convolutional filtering and the sequence learning ability of autoregressive moving average. To encode dynamic graphs, the constructed multi-scale local graph convolution filters, consisting of matrices of local receptive fields and signal mappings, are recursively performed on structured graph data in the temporal and spatial domains. The proposed model is generic and principled, as it can be generalized into other dynamic models. We theoretically prove the stability of STGC and provide an upper bound of the signal transformation to be learnt. Further, the proposed recursive model can be stacked into a multi-layer architecture. To evaluate our model, we conduct extensive experiments on four benchmark skeleton-based action datasets, including the large-scale challenging NTU RGB+D. The experimental results demonstrate the effectiveness of our proposed model and the improvement over the state-of-the-art.

     BibTeX:
        title={Spatio-Temporal Graph Convolution for Skeleton Based Action Recognition},
        author={Li, Chaolong and Cui, Zhen and Zheng, Wenming and Xu, Chunyan and Yang, Jian},
        booktitle={Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence},
        organization={AAAI Press}
  3. Cheng Lu, Wenming Zheng, Chaolong Li, Chuangao Tang, Suyuan Liu, Simeng Yan, Yuan Zong, “Multiple Spatio-temporal Feature Learning for Video-based Emotion Recognition in the Wild,” In Proc. ACM ICMI, 2018, pp. 646-652. [CCF-C]

     Abstract: The difficulty of emotion recognition in the wild (EmotiW) is how to train a robust model to deal with diverse scenarios and anomalies. The Audio-video Sub-challenge in EmotiW contains short audio-video clips with several emotional labels, and the task is to distinguish which label each video belongs to. For better emotion recognition in videos, we propose a multiple spatio-temporal feature fusion (MSFF) framework, which can more accurately depict emotional information in the spatial and temporal dimensions using two mutually complementary sources, the facial image and audio. The framework consists of two parts: the facial image model and the audio model. With respect to the facial image model, three different architectures of spatial-temporal neural networks are employed to extract discriminative features about different emotions in facial expression images. Firstly, the high-level spatial features are obtained by pre-trained convolutional neural networks (CNN), including VGG-Face and ResNet-50, which are all fed with the images generated from each video. Then, the features of all frames are sequentially input to the Bi-directional Long Short-Term Memory (BLSTM) so as to capture dynamic variations of facial appearance textures in a video. In addition to the CNN-RNN structure, another spatio-temporal network, namely the deep 3-Dimensional Convolutional Neural Network (3D CNN), which extends the 2D convolution kernel to 3D, is also applied to capture evolving emotional information encoded in multiple adjacent frames. For the audio model, the spectrogram images of speech, generated by preprocessing the audio, are also modeled in a VGG-BLSTM framework to characterize the affective fluctuation more efficiently. Finally, a fusion strategy over the score matrices of the different spatio-temporal networks obtained from the above framework is proposed to boost the performance of emotion recognition in a complementary way. Extensive experiments show that the overall accuracy of our proposed MSFF is 60.64%, which achieves a large improvement compared with the baseline and outperforms the result of the champion team in 2017.

     BibTeX:
        title={Multiple Spatio-temporal Feature Learning for Video-based Emotion Recognition in the Wild},
        author={Lu, Cheng and Zheng, Wenming and Li, Chaolong and Tang, Chuangao and Liu, Suyuan and Yan, Simeng and Zong, Yuan},
        booktitle={Proceedings of the 2018 on International Conference on Multimodal Interaction},
  4. Tong Zhang, Wenming Zheng, Zhen Cui, Chaolong Li, “Deep Manifold-to-Manifold Transforming Network,” In Proc. ICIP, 2018, pp. 4098-4102. [CCF-C]

     Abstract: In this paper, we propose an end-to-end deep manifold-to-manifold transforming network (DMT-Net), which makes SPD matrices flow from one Riemannian manifold to another, more discriminative one. For discriminative feature learning, two specific layers on manifolds are developed: (i) the local SPD convolutional layer and (ii) the non-linear SPD activation layer, where positive definiteness is satisfied for both layers. Further, to relieve the computational burden of kernels on relatively large-scale data, we design a batch-kernelized layer to favor batchwise kernel optimization of deep networks. Specifically, one reference set that changes dynamically with the network training is introduced to break the limitation of memory size. We evaluate our proposed method on action recognition datasets, where input signals are popularly modeled as SPD matrices. The experimental results demonstrate that our DMT-Net is more competitive than state-of-the-art methods.

     BibTeX:
        title={Deep Manifold-to-Manifold Transforming Network},
        author={Zhang, Tong and Zheng, Wenming and Cui, Zhen and Li, Chaolong},
        booktitle={2018 25th IEEE International Conference on Image Processing (ICIP)},

Research Project

  • Design of Scientific Literacy Assessment Platform Based on Sensor and Android, Student Innovation and Entrepreneurship Training Program of Jiangsu Province, 2015-2016, PI.

Honors and Awards

  • National Graduate Scholarship (2018)
  • Second Runner-up in the Video-based Emotion Recognition Challenge of the 6th EmotiW Challenge (2018)
  • Chien-Shiung Wu · BME Scholarship (2018)
  • Third Prize in the Thirteenth National Post-Graduate Mathematical Contest in Modeling (2016)
  • Merit Student of Southeast University (2014)
  • National Scholarship for Encouragement (2014)
  • Merit Student of Southeast University (2013)
  • National Scholarship for Encouragement (2013)
  • Zhang Zhiwei Scholarship (2013)
  • Outstanding Communist Youth League Member of Southeast University (2013)


      Room 318 (Middle), Liwenzheng Building, Southeast University, Sipailou 2#, Nanjing, Jiangsu Province, 210096 P. R. China.

Last Modified: 2019-02-20