Zilin Wang

Hi there! I am a Ph.D. candidate in the Computer Science & Engineering (CSE) department at the University of Michigan, advised by Prof. Stella X. Yu.

Before my Ph.D., I earned an M.S. in CSE from the University of Michigan, where I worked on medical image segmentation with Dr. Sandaresh Ram and Dr. Craig Galban at Michigan Medicine, and with Dr. Acner Camino during an internship at Genentech. I earned my B.S. in CSE, summa cum laude, from the Ohio State University.

I am broadly interested in computer vision and deep learning, particularly in addressing the challenges of applying these techniques to real-world problems.

Email  /  Scholar  /  Github  /  Twitter  /  LinkedIn  /  WeChat


Research

Aligning Forest and Trees in Images and Long Captions for Cross-Domain Grounding
Byeongju Woo, Zilin Wang, Byeonghyun Pak, Sangwoo Mo, Stella X. Yu
In submission

We propose F-CAST, a hierarchical image–text representation framework that aligns visual and linguistic hierarchies from long captions without region-level supervision. Trained on 30M image–text pairs, F-CAST achieves state-of-the-art results on six long-text benchmarks, demonstrating the power of hierarchical alignment for fine-grained visual–language understanding and visual grounding.

Free-Grained Hierarchical Recognition
Seulki Park, Zilin Wang, Stella X. Yu
In submission

We introduce ImageNet-F, a large-scale benchmark with mixed-granularity labels reflecting real-world annotation variability. Using this dataset, our free-grain learning framework leverages semantic and visual guidance to improve hierarchical image classification under heterogeneous supervision.

Open Ad-Hoc Categorization with Contextualized Feature Learning
Zilin Wang*, Sangwoo Mo*, Stella X. Yu, Sima Behpour, Liu Ren
CVPR, 2025
paper / webpage / poster / code

Ad-hoc categories are created dynamically to accomplish specific tasks based on the context at hand, such as things to sell at a garage sale. We introduce open ad-hoc categorization (OAK), a new task that requires discovering novel classes across diverse contexts, and tackle it by learning contextualized visual features with text guidance based on CLIP.

Applied Research

Nationwide Building-Attribute Benchmarking and Scalable Zero-Shot Extraction with Foundation Models
Gustavo Perez*, Zilin Wang*, Brian Wang, Fei Pan, Frank McKenna, Stella X. Yu
In submission

We introduce a new nationwide dataset and a scalable zero-shot workflow for building attribute extraction using large vision-language models. Extensive benchmarking shows our approach significantly outperforms existing baselines and supervised methods.

Unsupervised Selective Labeling for Semi-Supervised Aerial Imagery Recognition
Zilin Wang, Stella X. Yu, Kyle L. Landolt, Mark D. Koneff, Bradley A. Pickens, Sierra Schuster, Luke J. Fara, Aaron C. Murphy, Jennifer Dieck, Timothy P. White
In submission

Annotating every aerial image available for training a classifier is unnecessary. Our algorithm selects a small subset of images that are diverse yet representative for labeling, leading to an effective semi-supervised classifier using only 6.4% labeled data in our case study.

Deep Learning Segmentation of Foveal Avascular Zone in Optical Coherence Tomography Angiography of Nonproliferative Diabetic Retinopathy
Acner Camino Benech*, Zilin Wang*, Aditi Basu Bal, Fethallah Benmansour, Richard Carano, Daniela Ferrara
ARVO, 2023; EURETINA, 2023
paper / slides / poster

Foveal Avascular Zone (FAZ) segmentation can be improved by adding auxiliary tasks such as FAZ boundary segmentation and by using weak supervision from vessel segmentation. This helps the model learn finer details and contextual cues, leading to state-of-the-art results.

Pulmonary Artery-Vein Segmentation in 3D Computed Tomography Images
Zilin Wang, Sandaresh Ram, Craig Galban

We present iSparseUnet, a segmentation model for 3D Computed Tomography (CT) images that leverages data sparsity, an octree structure, and invertible layers for efficiency. Unlike patch-based methods, it processes the entire volume at once, producing hierarchical outputs that preserve global context and maintain the connectivity of lung structures, both of which are essential for precise medical segmentation.

Teaching

UM EECS442: Computer Vision

UM EECS542: Advanced Topics in Computer Vision

UM EECS598: Action & Perception

UM SI670: Applied Machine Learning

Academic Service & Outreach


Design and template code from Jon Barron