Important Computer Vision Papers for the Week from 20/01 to 26/01
Stay Updated with Recent Computer Vision Research
Every week, researchers from top research labs, companies, and universities publish exciting breakthroughs in diffusion models, vision language models, image editing and generation, video processing and generation, and image recognition.
This article provides a comprehensive overview of the most significant papers published in the fourth week of January 2025, highlighting the latest research and advancements in computer vision.
Whether you’re a researcher, practitioner, or enthusiast, this article will provide valuable insights into the state-of-the-art techniques and tools in computer vision.
Table of Contents:
Diffusion Models
Image Generation
Video Understanding & Generation
My New E-Book: Efficient Python for Data Scientists
I am happy to announce the publication of my new e-book, Efficient Python for Data Scientists. It is your practical companion to mastering the art of writing clean, optimized, and high-performing Python code for data science. In this book, you'll explore actionable insights and strategies to transform your Python workflows, streamline data analysis, and maximize the potential of libraries like Pandas.
1. Diffusion Models
1.1. Hunyuan3D 2.0: Scaling Diffusion Models for High-Resolution Textured 3D Assets Generation
We present Hunyuan3D 2.0, an advanced large-scale 3D synthesis system for generating high-resolution textured 3D assets. This system includes two foundation components: a large-scale shape generation model — Hunyuan3D-DiT, and a large-scale texture synthesis model — Hunyuan3D-Paint.
The shape generative model, built on a scalable flow-based diffusion transformer, aims to create geometry that properly aligns with a given condition image, laying a solid foundation for downstream applications.
The texture synthesis model, benefiting from strong geometric and diffusion priors, produces high-resolution and vibrant texture maps for either generated or hand-crafted meshes.
Furthermore, we build Hunyuan3D-Studio — a versatile, user-friendly production platform that simplifies the re-creation process of 3D assets. It allows both professional and amateur users to manipulate or even animate their meshes efficiently.
We systematically evaluate our models, showing that Hunyuan3D 2.0 outperforms previous state-of-the-art models, both open-source and closed-source, in geometry details, condition alignment, texture quality, and more.
Hunyuan3D 2.0 is publicly released to help fill the gap in the open-source 3D community for large-scale foundation generative models.
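To make the two-stage design concrete, here is a minimal, hypothetical sketch of the workflow the abstract describes: a shape model turns a condition image into an untextured mesh, and a texture model then paints it. The class and method names below are illustrative placeholders, not the released Hunyuan3D 2.0 API.

```python
# Hypothetical sketch of the two-stage Hunyuan3D 2.0 workflow described above.
# ShapeGenerator stands in for Hunyuan3D-DiT (geometry), TexturePainter for
# Hunyuan3D-Paint (texture); both are placeholders, not the real interfaces.

from dataclasses import dataclass
from typing import Optional

@dataclass
class Mesh:
    vertices: list                      # (x, y, z) positions
    faces: list                         # vertex-index triplets
    texture: Optional[bytes] = None     # texture map filled in by the paint stage

class ShapeGenerator:
    def generate(self, condition_image_path: str) -> Mesh:
        # In the real system, a flow-based diffusion transformer samples geometry
        # conditioned on the input image.
        return Mesh(vertices=[], faces=[])

class TexturePainter:
    def paint(self, mesh: Mesh, condition_image_path: str) -> Mesh:
        # Uses geometric and diffusion priors to synthesize a high-resolution texture map.
        mesh.texture = b"placeholder-texture"
        return mesh

def image_to_3d_asset(image_path: str) -> Mesh:
    """Two-stage pipeline: geometry first, then texture, as in Hunyuan3D 2.0."""
    mesh = ShapeGenerator().generate(image_path)
    return TexturePainter().paint(mesh, image_path)
```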
2. Image Generation
2.1. Fast3R: Towards 3D Reconstruction of 1000+ Images in One Forward Pass
Multi-view 3D reconstruction remains a core challenge in computer vision, particularly in applications requiring accurate and scalable representations across diverse perspectives.
Current leading methods such as DUSt3R employ a fundamentally pairwise approach, processing images in pairs and necessitating costly global alignment procedures to reconstruct from multiple views.
In this work, we propose Fast 3D Reconstruction (Fast3R), a novel multi-view generalization of DUSt3R that achieves efficient and scalable 3D reconstruction by processing many views in parallel.
Fast3R’s Transformer-based architecture processes N images in a single forward pass, bypassing the need for iterative alignment. Through extensive experiments on camera pose estimation and 3D reconstruction, Fast3R demonstrates state-of-the-art performance, with significant improvements in inference speed and reduced error accumulation.
These results establish Fast3R as a robust alternative for multi-view applications, offering enhanced scalability without compromising reconstruction accuracy.
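The core idea is easy to picture in code: instead of matching image pairs and aligning them afterwards, all views are patchified, tagged with a view index, and fused in one transformer pass that predicts a pointmap per view. The toy model below is only a sketch of that idea, not the official Fast3R architecture, and every dimension is illustrative.

```python
# Toy sketch of joint multi-view fusion in a single forward pass (not Fast3R's code).

import torch
import torch.nn as nn

class ToyFast3R(nn.Module):
    def __init__(self, dim=256, patch=16, max_views=1024, depth=4):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.view_embed = nn.Embedding(max_views, dim)   # tells tokens which image they came from
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, 3)                    # per-patch 3D point (pointmap) prediction

    def forward(self, images):                           # images: (N, 3, H, W)
        n = images.shape[0]
        tokens = self.patch_embed(images).flatten(2).transpose(1, 2)        # (N, P, dim)
        tokens = tokens + self.view_embed(torch.arange(n))[:, None, :]      # mark each view
        fused = self.fusion(tokens.reshape(1, -1, tokens.shape[-1]))        # one joint pass over all views
        return self.head(fused).reshape(n, -1, 3)        # (N, P, 3) pointmaps in a shared frame

points = ToyFast3R()(torch.randn(8, 3, 64, 64))          # 8 views processed in parallel
```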
2.2. Can We Generate Images with CoT? Let’s Verify and Reinforce Image Generation Step by Step
Chain-of-thought (CoT) reasoning has been extensively explored in large models to tackle complex understanding tasks. However, it remains an open question whether such strategies can be applied to verifying and reinforcing image generation scenarios.
In this paper, we provide the first comprehensive investigation of the potential of CoT reasoning to enhance autoregressive image generation. We focus on three techniques: scaling test-time computation for verification, aligning model preferences with Direct Preference Optimization (DPO), and integrating these techniques for complementary effects.
Our results demonstrate that these approaches can be effectively adapted and combined to significantly improve image generation performance. Furthermore, given the pivotal role of reward models in our findings, we propose the Potential Assessment Reward Model (PARM) and PARM++, specialized for autoregressive image generation.
PARM adaptively assesses each generation step through a potential assessment approach, merging the strengths of existing reward models, while PARM++ further introduces a reflection mechanism to self-correct unsatisfactory generated images.
Using our investigated reasoning strategies, we enhance a baseline model, Show-o, to achieve superior results, with a significant +24% improvement on the GenEval benchmark, surpassing Stable Diffusion 3 by +15%.
We hope our study provides unique insights and paves a new path for integrating CoT reasoning with autoregressive image generation.
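The simplest of the investigated strategies, scaling test-time computation with a verifier, can be sketched as a best-of-N loop: sample several candidates, score each with a reward model, and keep the highest-scoring image. The generator and reward functions below are placeholders, not the paper's PARM models.

```python
# Hedged sketch of test-time verification via best-of-N sampling.

import random

def generate_candidate(prompt: str, seed: int) -> str:
    """Placeholder for one autoregressive image-generation rollout."""
    return f"image_for_{prompt}_seed{seed}"

def reward(prompt: str, image: str) -> float:
    """Placeholder verifier: a real reward model would score prompt-image alignment."""
    random.seed(hash((prompt, image)) % (2**32))
    return random.random()

def best_of_n(prompt: str, n: int = 8) -> str:
    candidates = [generate_candidate(prompt, seed) for seed in range(n)]
    return max(candidates, key=lambda img: reward(prompt, img))   # keep the highest-reward sample

print(best_of_n("a red cube on a blue table"))
```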
3. Video Understanding & Generation
3.1. VideoWorld: Exploring Knowledge Learning from Unlabeled Videos
This work explores whether a deep generative model can learn complex knowledge solely from visual input, in contrast to the prevalent focus on text-based models like large language models (LLMs).
We develop VideoWorld, an auto-regressive video generation model trained on unlabeled video data, and test its knowledge acquisition abilities in video-based Go and robotic control tasks.
Our experiments reveal two key findings:
Video-only training provides sufficient information for learning knowledge, including rules, reasoning, and planning capabilities.
The representation of visual change is crucial for knowledge acquisition.
To improve both the efficiency and efficacy of this process, we introduce the Latent Dynamics Model (LDM) as a key component of VideoWorld. Remarkably, VideoWorld reaches a 5-dan professional level in the Video-GoBench with just a 300-million-parameter model, without relying on search algorithms or reward mechanisms typical in reinforcement learning.
In robotic tasks, VideoWorld effectively learns diverse control operations and generalizes across environments, approaching the performance of Oracle models in CALVIN and RLBench. This study opens new avenues for knowledge acquisition from visual data, with all code, data, and models open-sourced for further research.
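A rough way to picture the Latent Dynamics Model is a module that compresses the change between consecutive frames into a compact code and predicts the next frame through that code. The toy module below is an illustrative sketch under that assumption, not the released VideoWorld implementation.

```python
# Toy sketch of a latent-dynamics idea: compress frame-to-frame change into a
# small code and reconstruct the next frame from (current frame, change code).

import torch
import torch.nn as nn

class LatentDynamics(nn.Module):
    def __init__(self, frame_dim=512, latent_dim=32):
        super().__init__()
        self.encode_change = nn.Linear(2 * frame_dim, latent_dim)        # compress (frame_t, frame_t+1) into a change code
        self.predict_next = nn.Linear(frame_dim + latent_dim, frame_dim)  # predict the next frame through that code

    def forward(self, frame_t, frame_next):
        change = self.encode_change(torch.cat([frame_t, frame_next], dim=-1))
        pred_next = self.predict_next(torch.cat([frame_t, change], dim=-1))
        return pred_next, change

model = LatentDynamics()
f_t, f_next = torch.randn(4, 512), torch.randn(4, 512)
pred, change_code = model(f_t, f_next)
loss = nn.functional.mse_loss(pred, f_next)   # train to reconstruct the next frame via the compact change code
```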
3.2. MMVU: Measuring Expert-Level Multi-Discipline Video Understanding
We introduce MMVU, a comprehensive expert-level, multi-discipline benchmark for evaluating foundation models in video understanding. MMVU includes 3,000 expert-annotated questions spanning 27 subjects across four core disciplines: Science, Healthcare, Humanities & Social Sciences, and Engineering.
Compared to prior benchmarks, MMVU features three key advancements. First, it challenges models to apply domain-specific knowledge and perform expert-level reasoning to analyze specialized-domain videos, moving beyond the basic visual perception typically assessed in current video benchmarks.
Second, each example is annotated by human experts from scratch. We implement strict data quality controls to ensure the high quality of the dataset.
Finally, each example is enriched with expert-annotated reasoning rationales and relevant domain knowledge, facilitating in-depth analysis. We conduct an extensive evaluation of 32 frontier multimodal foundation models on MMVU.
The latest System-2-capable models, o1 and Gemini 2.0 Flash Thinking, achieve the highest performance among the tested models. However, they still fall short of matching human expertise.
Through in-depth error analyses and case studies, we offer actionable insights for future advancements in expert-level, knowledge-intensive video understanding for specialized domains.
3.3. VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding
In this paper, we propose VideoLLaMA3, a more advanced multimodal foundation model for image and video understanding. The core design philosophy of VideoLLaMA3 is vision-centric.
The meaning of “vision-centric” is two-fold: the vision-centric training paradigm and vision-centric framework design. The key insight of our vision-centric training paradigm is that high-quality image-text data is crucial for both image and video understanding.
Instead of preparing massive video-text datasets, we focus on constructing large-scale and high-quality image-text datasets.
VideoLLaMA3 has four training stages:
Vision-centric alignment stage, which warms up the vision encoder and projector.
Vision-language pretraining stage, which jointly tunes the vision encoder, projector, and LLM with large-scale image-text data covering multiple types (including scene images, documents, and charts) as well as text-only data.
Multi-task fine-tuning stage, which incorporates image-text SFT data for downstream tasks and video-text data to establish a foundation for video understanding.
Video-centric fine-tuning stage, which further improves the model’s capability for video understanding.
As for the framework design, to better capture fine-grained details in images, the pretrained vision encoder is adapted to encode images of varying sizes into a correspondingly varying number of vision tokens, rather than a fixed number.
For video inputs, we reduce the number of vision tokens according to their similarity so that the video representation is more precise and compact.
Benefiting from these vision-centric designs, VideoLLaMA3 achieves compelling performance on both image and video understanding benchmarks.
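The similarity-based token reduction for video can be sketched as follows: keep all tokens of the first frame, and for later frames keep only the patch tokens that differ enough from the same position in the previous frame. The threshold and shapes below are illustrative, not the released VideoLLaMA3 code.

```python
# Minimal sketch of similarity-based pruning of video vision tokens.

import torch
import torch.nn.functional as F

def prune_video_tokens(tokens: torch.Tensor, threshold: float = 0.95) -> torch.Tensor:
    """tokens: (T, P, D) — T frames, P patch tokens per frame, D channels.
    Returns a compact token sequence; the first frame is always kept in full."""
    kept = [tokens[0]]
    for t in range(1, tokens.shape[0]):
        sim = F.cosine_similarity(tokens[t], tokens[t - 1], dim=-1)   # (P,) similarity to previous frame
        changed = sim < threshold                                      # keep only patches that changed enough
        kept.append(tokens[t][changed])
    return torch.cat(kept, dim=0)

video_tokens = torch.randn(16, 196, 1024)     # e.g. 16 frames, 14x14 patches, ViT features
print(prune_video_tokens(video_tokens).shape)
```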
3.4. Temporal Preference Optimization for Long-Form Video Understanding
Despite significant advancements in video large multimodal models (video-LMMs), achieving effective temporal grounding in long-form videos remains a challenge for existing models.
To address this limitation, we propose Temporal Preference Optimization (TPO), a novel post-training framework designed to enhance the temporal grounding capabilities of video-LMMs through preference learning.
TPO adopts a self-training approach that enables models to differentiate between well-grounded and less accurate temporal responses by leveraging curated preference datasets at two granularities: localized temporal grounding, which focuses on specific video segments, and comprehensive temporal grounding, which captures extended temporal dependencies across entire video sequences.
By optimizing over these preference datasets, TPO significantly enhances temporal understanding while reducing reliance on manually annotated data. Extensive experiments on three long-form video understanding benchmarks (LongVideoBench, MLVU, and Video-MME) demonstrate the effectiveness of TPO across two state-of-the-art video-LMMs.
Notably, LLaVA-Video-TPO establishes itself as the leading 7B model on the Video-MME benchmark, underscoring the potential of TPO as a scalable and efficient solution for advancing temporal reasoning in long-form video understanding.
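The preference-learning step can be illustrated with a standard DPO-style loss over a well-grounded (chosen) and a less accurate (rejected) response for the same video and query; TPO builds its preference pairs at the two granularities described above. The snippet below is a generic sketch of that loss, not the paper's exact implementation.

```python
# Generic DPO-style preference loss over chosen vs. rejected responses.

import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta: float = 0.1):
    """Inputs are summed log-probabilities of full responses, shape (batch,)."""
    margin = (policy_logp_chosen - policy_logp_rejected) - (ref_logp_chosen - ref_logp_rejected)
    return -F.logsigmoid(beta * margin).mean()   # push the policy margin above the reference margin

# toy tensors standing in for log-probs from the video-LMM and its frozen reference copy
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.0]))
print(loss)
```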
3.5. Improving Video Generation with Human Feedback
Video generation has achieved significant advances through rectified flow techniques, but issues like unsmooth motion and misalignment between videos and prompts persist.
In this work, we develop a systematic pipeline that harnesses human feedback to mitigate these problems and refine the video generation model.
Specifically, we begin by constructing a large-scale human preference dataset focused on modern video generation models, incorporating pairwise annotations across multiple dimensions.
We then introduce VideoReward, a multi-dimensional video reward model, and examine how annotations and various design choices impact its rewarding efficacy.
From a unified reinforcement learning perspective aimed at maximizing reward with KL regularization, we introduce three alignment algorithms for flow-based models by extending those from diffusion models.
These include two training-time strategies: direct preference optimization for flow (Flow-DPO) and reward-weighted regression for flow (Flow-RWR), and an inference-time technique, Flow-NRG, which applies reward guidance directly to noisy videos.
Experimental results indicate that VideoReward significantly outperforms existing reward models, and Flow-DPO demonstrates superior performance compared to both Flow-RWR and standard supervised fine-tuning methods. Additionally, Flow-NRG lets users assign custom weights to multiple objectives during inference, meeting personalized video quality needs.
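In the spirit of Flow-NRG, inference-time guidance can be sketched as scoring a noisy latent with several reward heads, combining them with user-chosen weights, and nudging the latent along the gradient of that weighted reward during the flow update. Everything below, including the reward heads and the guidance scale, is a hypothetical illustration rather than the paper's method.

```python
# Hedged sketch of inference-time, multi-objective reward guidance for a flow model.

import torch

def combined_reward(latent, reward_heads, weights):
    """reward_heads: dict name -> callable(latent) -> scalar; weights: matching dict of user weights."""
    return sum(weights[name] * head(latent) for name, head in reward_heads.items())

def guided_step(latent, velocity, reward_heads, weights, dt=0.05, guidance=0.5):
    latent = latent.detach().requires_grad_(True)
    r = combined_reward(latent, reward_heads, weights)
    grad = torch.autograd.grad(r, latent)[0]              # direction that increases the weighted reward
    with torch.no_grad():
        return latent + dt * velocity + guidance * grad   # flow update plus reward guidance

# toy usage: two reward dimensions with user-specified weights
heads = {"motion_quality": lambda z: -z.pow(2).mean(), "text_alignment": lambda z: z.mean()}
weights = {"motion_quality": 0.7, "text_alignment": 0.3}
z = torch.randn(1, 8, 16, 16)
z = guided_step(z, velocity=torch.zeros_like(z), reward_heads=heads, weights=weights)
```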
3.6. Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos
Humans acquire knowledge through three cognitive stages: perceiving information, comprehending knowledge, and adapting knowledge to solve novel problems.
Videos serve as an effective medium for this learning process, facilitating a progression through these cognitive stages. However, existing video benchmarks fail to systematically evaluate the knowledge acquisition capabilities in Large Multimodal Models (LMMs).
To address this gap, we introduce Video-MMMU, a multi-modal, multi-disciplinary benchmark designed to assess LMMs’ ability to acquire and utilize knowledge from videos. Video-MMMU features a curated collection of 300 expert-level videos and 900 human-annotated questions across six disciplines, evaluating knowledge acquisition through stage-aligned question-answer pairs: Perception, Comprehension, and Adaptation.
A proposed knowledge gain metric, Delta knowledge, quantifies improvement in performance after video viewing. Evaluation of LMMs reveals a steep decline in performance as cognitive demands increase and highlights a significant gap between human and model knowledge acquisition, underscoring the need for methods to enhance LMMs’ capability to learn and adapt from videos.
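One natural way to formalize such a knowledge-gain metric is as the share of available headroom recovered after watching the video; the exact definition used in Video-MMMU may differ, so treat the snippet below purely as an illustration.

```python
# Illustrative knowledge-gain metric: normalized accuracy improvement after video viewing.

def knowledge_gain(acc_before: float, acc_after: float) -> float:
    """Accuracies in percent; returns the share of remaining headroom recovered."""
    if acc_before >= 100.0:
        return 0.0                      # nothing left to gain
    return (acc_after - acc_before) / (100.0 - acc_before) * 100.0

print(knowledge_gain(acc_before=40.0, acc_after=55.0))   # 25.0: a quarter of the headroom recovered
```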
Are you looking to start a career in data science and AI but do not know how? I offer data science mentoring sessions and long-term career mentoring:
Mentoring sessions: https://lnkd.in/dXeg3KPW
Long-term mentoring: https://lnkd.in/dtdUYBrM