Xin Wen (温鑫)

I am a Ph.D. candidate at CVMI Lab, The University of Hong Kong, advised by Prof. Xiaojuan Qi. Prior to that, I obtained my B.Eng. degree in computer science with a minor in applied mathematics at Tongji University. I have also spent time at ByteDance AI Lab, MEGVII Research, and Shanghai AI Laboratory.

My research interests mainly lie in learning robust and generalizable vision representations with minimal human intervention, and their application to recognition and perception tasks.

Please feel free to drop me an email if you are interested in my work or would like to explore possible collaborations.

Email  /  Github  /  Google Scholar  /  Twitter  /  LinkedIn

Invited Talks

  • CCVL Lab, Johns Hopkins University: "Self-Supervised Visual Representation Learning with Semantic Grouping", Nov. 2022.
Publications
    What Makes CLIP More Robust to Long-Tailed Pre-Training Data? A Controlled Study for Transferable Insights
    Xin Wen, Bingchen Zhao, Yilun Chen, Jiangmiao Pang, Xiaojuan Qi
    Conference on Neural Information Processing Systems (NeurIPS), 2024
    ArXiv / Code / Slides / Poster

We find CLIP to be relatively robust to pre-training data imbalance, design controlled experiments to identify the underlying mechanisms, and provide transferable insights for recognition and SSL models.

    Can OOD Object Detectors Learn from Foundation Models?
    Jiahui Liu, Xin Wen, Shizhen Zhao, Yingxian Chen, Xiaojuan Qi
    European Conference on Computer Vision (ECCV), 2024
    ArXiv / Code

    We introduce an automatic data curation process that leverages foundation models as tools to harvest meaningful data from text-to-image generation models for OOD object detection.

    Classes Are Not Equal: An Empirical Study on Image Recognition Fairness
    Jiequan Cui, Beier Zhu, Xin Wen, Xiaojuan Qi, Bei Yu, Hanwang Zhang
    IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2024
    ArXiv / Code

We show that models can still exhibit extremely biased behaviors even when trained on balanced ImageNet, investigate the reasons behind this, and provide some workarounds.

    What If the TV Was Off? Examining Counterfactual Reasoning Abilities of Multi-modal Language Models
    Letian Zhang, Xiaotong Zhai, Zhongkai Zhao, Yongshuo Zong, Xin Wen, Bingchen Zhao
    IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2024
    Project Page / ArXiv / Code

We build a counterfactual visual question answering benchmark, and show that even strong Vision-Language Models, including GPT-4, struggle with counterfactual reasoning.

    Towards Large-scale 3D Representation Learning with Multi-dataset Point Prompt Training
    Xiaoyang Wu, Zhuotao Tian, Xin Wen, Bohao Peng, Xihui Liu, Kaicheng Yu, Hengshuang Zhao
    IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2024
    ArXiv / Code

We investigate techniques to enable 3D representation learning at an unprecedented data scale.

    CoDet: Co-Occurrence Guided Region-Word Alignment for Open-Vocabulary Object Detection
    Chuofan Ma, Yi Jiang*, Xin Wen*, Zehuan Yuan, Xiaojuan Qi
    Conference on Neural Information Processing Systems (NeurIPS), 2023
    Project Page / ArXiv / Code

We bridge the gap between the vision and language spaces by reformulating region-word alignment as co-occurring object discovery, where images whose captions mention a shared concept are grouped together.

    Parametric Classification for Generalized Category Discovery: A Baseline Study
    Xin Wen*, Bingchen Zhao*, Xiaojuan Qi
    IEEE International Conference on Computer Vision (ICCV), 2023
    Project Page / ArXiv / Code / Slides / Poster

We revisit why previous parametric classifiers fail to recognise new classes in GCD, identify prediction biases between and within seen and novel classes as the key issue, and propose a simple yet strong framework that addresses these limitations and achieves state-of-the-art performance.

    Learning Semi-supervised Gaussian Mixture Models for Generalized Category Discovery
    Bingchen Zhao, Xin Wen, Kai Han
    IEEE International Conference on Computer Vision (ICCV), 2023
    ArXiv / Code

We tackle GCD without knowing the class number a priori, propose a semi-supervised variant of GMMs with stochastic splitting and merging to dynamically determine prototypes, and leverage prototypical contrastive learning for representation learning on partially labelled data.

    Masked Scene Contrast: A Scalable Framework for Unsupervised 3D Representation Learning
    Xiaoyang Wu, Xin Wen, Xihui Liu, Hengshuang Zhao
    IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023
    ArXiv / Code

    We propose the Masked Scene Contrast (MSC) framework for unsupervised 3D representation learning, which efficiently generates contrastive views directly on scene-level point clouds and enables large-scale 3D pre-training across multiple datasets.

    Self-Supervised Visual Representation Learning with Semantic Grouping
    Xin Wen, Bingchen Zhao, Anlin Zheng, Xiangyu Zhang, Xiaojuan Qi
    Conference on Neural Information Processing Systems (NeurIPS), 2022
    Project Page / ArXiv / Code / Slides / Poster

    We show that object discovery can be learned jointly with the representations from scratch on real-world scene-centric data, which leads to strong transfer learning results in various downstream tasks.

    Temporal Context Aggregation for Video Retrieval with Contrastive Learning
    Jie Shao*, Xin Wen*, Bingchen Zhao, Xiangyang Xue
    IEEE Winter Conference on Applications of Computer Vision (WACV), 2021
    ArXiv / Code / Slides

We present a contrastive learning-based video representation learning framework that incorporates long-range temporal information among frame-level features using self-attention.

    Distilling Visual Priors from Self-Supervised Learning
    Bingchen Zhao, Xin Wen
    European Conference on Computer Vision (ECCV) VIPriors Workshop, 2020
    ArXiv / Code / Slides

    We leverage self-supervised learning and knowledge distillation to improve the generalizability of CNN models for image classification under the data-deficient setting.

Research Experiences

  • Research Intern | Noah's Ark Lab, Huawei | London, UK | Oct. 2024 - Present
       Host: Dr. Ismail Elezi and Dr. Jiankang Deng
       Topic: autoregressive image modeling

  • Research Intern | OpenRobotLab, Shanghai AI Laboratory | Shanghai, China | Aug. 2023 - Oct. 2024
       Host: Dr. Yilun Chen and Dr. Jiangmiao Pang
       Topic: robustness of vision-language models, representation learning for robotics

  • Research Intern | Foundation Model Group, MEGVII Research | Remote | Apr. 2022 - June 2023
       Host: Anlin Zheng and Dr. Xiangyu Zhang
       Topic: unsupervised object-centric representation learning and open-world understanding

  • Research Intern | Visual Computing Group, ByteDance AI Lab | Shanghai, China | Jan. 2020 - June 2021
       Host: Dr. Jie Shao and Prof. Xiangyang Xue
       Topic: video retrieval, action recognition, and video-language pre-training
Honors and Awards

  • NeurIPS 2022 Scholar Award, Oct. 2022
  • Outstanding Graduates of Shanghai, June 2021
  • 2nd place, ECCV 2020 Workshop VIPriors Image Classification Challenge, July 2020
  • Qidi Scholarship of Tongji University (top 1%), June 2020
  • Regional Champion (China), Covestro International Data Science Hackathon, Nov. 2019
  • Silver Medal, the 43rd ACM International Collegiate Programming Contest (ICPC) Asia-East Continent Final, Dec. 2018
Academic Services

    Reviewer for TPAMI, IJCV, NeurIPS, ICLR, ICML, CVPR, ICCV, ECCV, WACV, CVinW, and OOD-CV.


    Template gratefully stolen from here.