Hi there! I am a PhD Candidate in the Computer Science & Engineering (CSE) division at the University of Michigan, advised by Prof. Stella X. Yu. I earned my B.S. in CSE from the Ohio State University, Summa Cum Laude.
My research currently focuses on vision-language foundation models for context-aware, visually grounded, and agentic perception. My long-term pursuit is developing perception for embodied interaction with the physical world.
I am also motivated by applications that connect AI with other disciplines.
GaitSpan: Growing Humanoid Locomotion from Walking to Running
Kwan-Yee Lin*, Zilin Wang*, Janelle J. Liu, Stella X. Yu
Technical report
preprint / webpage / code (coming soon)
We introduce GaitSpan, which grows a pretrained walking policy into new gaits by treating walking as a seed skill. Through rhythm generation, stride shaping, and residual adaptation, it yields a single command-conditioned policy spanning walking, jogging, and running over a continuous speed range, transfers across morphologies (Booster T1, K1, and Unitree G1), and deploys zero-shot in the real world on unseen terrains and payloads.
Vision Harnessing Agent for Open Ad-hoc Segmentation
Zilin Wang, Stella X. Yu
Technical report
preprint / code
We introduce VASA, the first vision harnessing agent for open ad-hoc segmentation, which segments arbitrary user-defined concepts from natural-language instructions. Built on SAM3, VASA improves over SAM3 Agent through a visual workflow for state management, tool invocation, action constraints, long-horizon planning, visual scrutiny, and error recovery.
We propose CAFT, a hierarchical vision-language representation that aligns visual and linguistic hierarchies from long captions without region-level supervision. Trained on 30M image-text pairs, CAFT achieves SOTA on six long-text benchmarks, highlighting the value of hierarchical alignment for fine-grained vision-language understanding and visual grounding.
We introduce ImageNet-F, a large-scale benchmark with mixed-granularity labels reflecting real-world annotation variability. Using this dataset, our free-grain learning framework leverages semantic and visual guidance to improve hierarchical image classification under heterogeneous supervision.
Ad-hoc categories are created dynamically to achieve specific tasks based on the context at hand, such as things to sell at a garage sale.
We introduce open ad-hoc categorization (OAK), a new task that requires discovering novel classes across diverse contexts, and tackle it by learning contextualized visual features with text guidance based on CLIP.
Teaching
UM EECS 542: Advanced Topics in Computer Vision
[WN 2026] Guest lecture on Vision Foundation Models.
[FA 2025] Guest lecture on Vision Foundation Models.
[FA 2024] Guest lecture on Vision Foundation Models.