Xin Wen

Xin Wen / 温鑫

I am a final-year Ph.D. candidate at HKU CVMI Lab, advised by Prof. Xiaojuan Qi. Before that, I received my B.Eng. degree in Computer Science (minor in Mathematics) from Tongji University.

I work on (self-supervised) representation learning, with an interest in learning robust, compositional, and generalizable visual representations from uncurated data with minimal human intervention, and its applications to perception & generation tasks.

I will join Meta FAIR as a Research Scientist Intern in June 2025, and will be on the job market for Research Scientist and Postdoc positions starting December 2025.

Email / GitHub / Google Scholar / Twitter

Publications

	"Principal Components" Enable A New Language of Images Xin Wen, Bingchen Zhao, Ismail Elezi, Jiankang Deng, Xiaojuan Qi IEEE International Conference on Computer Vision (ICCV), 2025 Project Page / ArXiv / Code / Poster We introduce a novel visual tokenization framework that embeds a provable PCA-like structure into the latent token space, enabling 1D coarse-to-fine tokenization.
	A Data-Centric Revisit of Pre-Trained Vision Models for Robot Learning Xin Wen, Bingchen Zhao, Yilun Chen, Jiangmiao Pang, Xiaojuan Qi IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2025 ArXiv / Code / Poster We designed SlotMIM, a method that induces object-centric representations from non-object-centric images, which we find facilitates robot learning.
	Learning from Neighbors: Category Extrapolation for Long-Tail Learning Shizhen Zhao, Xin Wen, Jiahui Liu, Chuofan Ma, Chunfeng Yuan, Xiaojuan Qi IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2025 ArXiv We find that extrapolating tail classes with novel classes that share similar semantics significantly improves long-tail recognition.
	What Makes CLIP More Robust to Long-Tailed Pre-Training Data? A Controlled Study for Transferable Insights Xin Wen, Bingchen Zhao, Yilun Chen, Jiangmiao Pang, Xiaojuan Qi Conference on Neural Information Processing Systems (NeurIPS), 2024 ArXiv / Code / Slides / Poster We find CLIP to be relatively robust to pre-training data imbalance, design and conduct controlled experiments to identify the underlying mechanisms, and provide insights for recognition and SSL models.
	Can OOD Object Detectors Learn from Foundation Models? Jiahui Liu, Xin Wen, Shizhen Zhao, Yingxian Chen, Xiaojuan Qi European Conference on Computer Vision (ECCV), 2024 ArXiv / Code We introduce an automatic data curation process that leverages foundation models as tools to harvest meaningful data from text-to-image generation models for OOD object detection.
	Classes Are Not Equal: An Empirical Study on Image Recognition Fairness Jiequan Cui, Beier Zhu, Xin Wen, Xiaojuan Qi, Bei Yu, Hanwang Zhang IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2024 ArXiv / Code We show that models can still exhibit extremely biased behavior when trained on balanced ImageNet, investigate the reasons behind it, and provide some workarounds.
	What If the TV Was Off? Examining Counterfactual Reasoning Abilities of Multi-modal Language Models Letian Zhang, Xiaotong Zhai, Zhongkai Zhao, Yongshuo Zong, Xin Wen, Bingchen Zhao IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2024 Project Page / ArXiv / Code We build a counterfactual visual question answering benchmark, and show that strong Vision-Language Models, even GPT-4, cannot handle it very well.
	Towards Large-scale 3D Representation Learning with Multi-dataset Point Prompt Training Xiaoyang Wu, Zhuotao Tian, Xin Wen, Bohao Peng, Xihui Liu, Kaicheng Yu, Hengshuang Zhao IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2024 ArXiv / Code We investigate techniques that enable 3D representation learning at an unprecedented data scale.
	CoDet: Co-Occurrence Guided Region-Word Alignment for Open-Vocabulary Object Detection Chuofan Ma, Yi Jiang, Xin Wen, Zehuan Yuan, Xiaojuan Qi Conference on Neural Information Processing Systems (NeurIPS), 2023 Project Page / ArXiv / Code We bridge the gap between vision & language spaces by reformulating region-word alignment as co-occurring object discovery, where images that mention a shared concept in their captions are grouped together.
	Parametric Classification for Generalized Category Discovery: A Baseline Study Xin Wen, Bingchen Zhao, Xiaojuan Qi IEEE International Conference on Computer Vision (ICCV), 2023 Project Page / ArXiv / Code / Slides / Poster We revisit why previous parametric classifiers fail to recognise new classes for GCD, identify the prediction biases between and within seen and novel classes as the key issue, and propose a simple yet strong framework that addresses these limitations and achieves state-of-the-art performance in this field.
	Learning Semi-supervised Gaussian Mixture Models for Generalized Category Discovery Bingchen Zhao, Xin Wen, Kai Han IEEE International Conference on Computer Vision (ICCV), 2023 ArXiv / Code We tackle GCD without knowing the class number a priori, propose a semi-supervised variant of GMM with stochastic splitting and merging to dynamically determine prototypes, and leverage prototypical contrastive learning for representation learning on partially labelled data.
	Masked Scene Contrast: A Scalable Framework for Unsupervised 3D Representation Learning Xiaoyang Wu, Xin Wen, Xihui Liu, Hengshuang Zhao IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023 ArXiv / Code We propose the Masked Scene Contrast (MSC) framework for unsupervised 3D representation learning, which efficiently generates contrastive views directly on scene-level point clouds and enables large-scale 3D pre-training across multiple datasets.
	Self-Supervised Visual Representation Learning with Semantic Grouping Xin Wen, Bingchen Zhao, Anlin Zheng, Xiangyu Zhang, Xiaojuan Qi Conference on Neural Information Processing Systems (NeurIPS), 2022 Project Page / ArXiv / Code / Slides / Poster We show that object discovery can be learned jointly with the representations from scratch on real-world scene-centric data, which leads to strong transfer learning results in various downstream tasks.
	Temporal Context Aggregation for Video Retrieval with Contrastive Learning Jie Shao, Xin Wen, Bingchen Zhao, Xiangyang Xue IEEE Winter Conference on Applications of Computer Vision (WACV), 2021 ArXiv / Code / Slides We present a contrastive learning-based video representation learning framework that incorporates long-range temporal information between frame-level features using self-attention.
	Distilling Visual Priors from Self-Supervised Learning Bingchen Zhao, Xin Wen European Conference on Computer Vision (ECCV) VIPriors Workshop, 2020 ArXiv / Code / Slides We leverage self-supervised learning and knowledge distillation to improve the generalizability of CNN models for image classification under the data-deficient setting.

Talks

Huawei Noah's Ark Lab London: "'Principal Components' Enable A New Language of Images", May 2025

ICCV 2023 VLAR Workshop: "What If the TV Was Off? Examining Counterfactual Reasoning Abilities of Multi-modal Language Models", Oct. 2023

CCVL Lab, Johns Hopkins University: "Self-Supervised Visual Representation Learning with Semantic Grouping", Nov. 2022

Awards

ICLR 2025 Notable Reviewer	May 2025
NeurIPS 2022 Scholar Award	Oct. 2022
Outstanding Graduates of Shanghai	June 2021
2nd place in the ECCV 2020 Workshop VIPriors Image Classification Challenge	July 2020
Qidi Scholarship of Tongji University (top 1%)	June 2020
Regional Champion (China) of the Covestro International Data Science Hackathon	Nov. 2019
Silver Medal of the 43rd ACM International Collegiate Programming Contest (ICPC) Asia-East Continent Final	Dec. 2018

Academic Services

Reviewer for TPAMI, IJCV, NeurIPS, ICLR, ICML, CVPR, ICCV, ECCV, WACV, CVinW, and OOD-CV.
