Research

My recent research aims to advance computer vision and multimodal intelligence, connecting fundamental AI advances with real-world applications. By developing dense perception systems, efficient adaptation methods, and controllable generation techniques, as illustrated in the figure below, my work enables AI to understand complex visual environments while operating effectively under practical constraints. This research creates adaptable vision technologies that deliver both high performance and practical deployability across domains such as healthcare, autonomous systems, and creative applications, addressing pressing technological demands while accounting for their societal implications.

Figure: Research Overview

Dense Perception and Understanding

This direction develops methods for dense recognition and understanding built on vision and vision-language foundation models, covering tasks such as semantic segmentation, object detection, instance segmentation, and human pose estimation. Our work aims at pixel-level comprehension of images and videos to support applications that require fine-grained perception. We explore novel network architectures and learning paradigms that improve accuracy while maintaining computational efficiency. The future of this direction lies in models robust to real-world complexities such as extreme occlusion, varying lighting conditions, and open-world or open-vocabulary object categories, with potential applications in autonomous systems, augmented reality, and clinical healthcare.
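As a concrete, simplified illustration of the open-vocabulary setting mentioned above, the sketch below labels each pixel by comparing dense features from a frozen vision backbone against text embeddings of candidate class names. The backbone, text encoder, feature dimensions, and class list here are placeholder assumptions (random tensors), not the specific models from our papers.

```python
# Minimal sketch: open-vocabulary semantic segmentation by matching dense patch
# features from a frozen vision backbone against text embeddings of class names.
# Backbone and text encoder are stand-ins (random tensors) for illustration only.
import torch
import torch.nn.functional as F

B, H, W, D = 2, 32, 32, 512                      # batch, feature-map size, embed dim
num_classes = 5                                  # e.g. {"road", "car", "person", "tree", "sky"}

patch_feats = torch.randn(B, D, H, W)            # dense features from a frozen backbone (assumed)
text_embeds = torch.randn(num_classes, D)        # class-name embeddings from a text encoder (assumed)

# L2-normalize so the dot product acts as a cosine similarity.
patch_feats = F.normalize(patch_feats, dim=1)
text_embeds = F.normalize(text_embeds, dim=0 if text_embeds.dim() == 1 else 1)

# Per-pixel similarity logits: (B, num_classes, H, W)
logits = torch.einsum("bdhw,cd->bchw", patch_feats, text_embeds)

# Upsample to image resolution and take the per-pixel argmax as the predicted mask.
logits = F.interpolate(logits, scale_factor=16, mode="bilinear", align_corners=False)
pred_mask = logits.argmax(dim=1)                 # (B, 512, 512) class indices
print(pred_mask.shape)
```

In practice the dense features and text embeddings would come from a pre-trained vision-language model, and the candidate class list can be changed at inference time without retraining, which is what makes the setting open-vocabulary.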

Efficient Fine-tuning and Inference

Our research in this direction investigates methods for optimizing deep learning models for real-world deployment, focusing on reducing computational demands while preserving accuracy. We study efficient fine-tuning techniques, model compression approaches, and hardware-aware optimization strategies to make vision foundation models more accessible. The goal is to adapt large pre-trained models to novel domains with limited labeled training samples and computing resources. Future advances in this direction could transform edge computing, allowing sophisticated vision and vision-language models to run on resource-constrained devices, from smartphones to embedded systems, and thus extend the reach of computer vision technologies.
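One widely used parameter-efficient fine-tuning technique in this space is a low-rank (LoRA-style) adapter, sketched below. It is shown purely as an illustration of the general idea, not as the specific method from our work; the layer sizes and rank are arbitrary assumptions.

```python
# Minimal sketch of a LoRA-style low-rank adapter for parameter-efficient
# fine-tuning. The pre-trained weight stays frozen; only the low-rank factors
# A and B are trained, so trainable parameters drop from d_out*d_in to
# r*(d_in + d_out).
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)        # freeze pre-trained weights
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scaling = alpha / rank

    def forward(self, x):
        # Frozen path plus the trainable low-rank update B @ A.
        return self.base(x) + self.scaling * (x @ self.A.T) @ self.B.T

layer = LoRALinear(nn.Linear(768, 768), rank=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable params: {trainable} / {total}")     # roughly 12k of 603k
```

Because the frozen base weights are shared, only the small adapter matrices need to be stored and swapped per downstream domain, which is what makes adaptation feasible under tight data and compute budgets.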

Visual Editing and Content Generation

This research stream explores the intersection of visual understanding and generative AI, developing techniques for controllable visual content creation and manipulation. We investigate problems in image synthesis, inpainting, and video generation, with an emphasis on maintaining photorealism and editability. Our work also examines the implications of these technologies, including potential safeguards against misuse. The future of this direction promises increasingly intuitive interfaces where users can manipulate visual content with natural language or simple sketches, potentially transforming creative industries while raising important questions about digital authenticity.
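As a small illustration of text-guided editing, the sketch below runs an off-the-shelf inpainting pipeline from the Hugging Face diffusers library. The checkpoint name, file paths, and prompt are placeholder assumptions, and this is not the generative model developed in our own work.

```python
# Minimal sketch of text-guided inpainting with an off-the-shelf diffusion model
# via the Hugging Face `diffusers` library (illustrative only). White pixels in
# the mask are regenerated according to the prompt; black pixels are preserved.
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-inpainting",   # example public checkpoint
    torch_dtype=torch.float16,
).to("cuda")

image = Image.open("scene.png").convert("RGB").resize((512, 512))  # placeholder input photo
mask = Image.open("mask.png").convert("L").resize((512, 512))      # placeholder edit-region mask

result = pipe(
    prompt="a wooden park bench under a tree",
    image=image,
    mask_image=mask,
    num_inference_steps=30,
    guidance_scale=7.5,
).images[0]
result.save("edited_scene.png")
```

Interfaces of this kind, where a mask and a short text prompt specify the edit, are one step toward the intuitive, language-driven manipulation described above.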

Applications

Focusing on practical scenarios, this research extends advanced computer vision technologies to specialized domains where they can have significant impact. In medical imaging, we develop algorithms for disease detection, organ segmentation, and treatment planning that assist healthcare professionals. These applications pose unique challenges that require domain-specific adaptation of general vision models. As these fields evolve, we anticipate further interdisciplinary collaborations that push the boundaries of computer vision applications, from life-saving clinical breakthroughs to more sustainable environmental management.