Top Important Computer Vision Papers for the Fourth Week of August 2024
Stay Updated with Recent Computer Vision Research
Every week, researchers from top research labs, companies, and universities publish exciting breakthroughs in various topics such as diffusion models, vision language models, image editing and generation, video processing and generation, and image recognition.
This article provides a comprehensive overview of the most significant papers published in the Fourth Week of August 2024, highlighting the latest research and advancements in computer vision.
Whether you’re a researcher, practitioner, or enthusiast, this article will provide valuable insights into the state-of-the-art techniques and tools in computer vision.
Table of Contents:
Diffusion Models
Vision Language Models (VLMs)
Video Understanding & Generation
Text to Image Generation
Segmentation
My New E-Book: LLM Roadmap from Beginner to Advanced Level
I am pleased to announce that I have published my new e-book, LLM Roadmap from Beginner to Advanced Level. This e-book provides all the resources you need to start your journey towards mastering LLMs.
1. Diffusion Models
1.1. SwiftBrush v2: Make Your One-step Diffusion Model Better Than Its Teacher
In this paper, we aim to enhance the performance of SwiftBrush, a prominent one-step text-to-image diffusion model, to be competitive with its multi-step Stable Diffusion counterpart.
Initially, we explore the quality-diversity trade-off between SwiftBrush and SD Turbo: the former excels in image diversity, while the latter excels in image quality. This observation motivates our proposed modifications in the training methodology, including better weight initialization and efficient LoRA training.
Moreover, our introduction of a novel clamped CLIP loss enhances image-text alignment and results in improved image quality. Remarkably, by combining the weights of models trained with efficient LoRA and full training, we achieve a new state-of-the-art one-step diffusion model with an FID of 8.14, surpassing all GAN-based and multi-step Stable Diffusion models. The evaluation code is publicly available.
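To make these two ingredients more concrete, here is a minimal, hypothetical sketch of a clamped CLIP-style alignment loss and of merging a fully trained checkpoint with a LoRA-trained one. The clamping form, the threshold, and the interpolation coefficient below are illustrative assumptions, not the paper's exact definitions.

```python
import torch
import torch.nn.functional as F

def clamped_clip_loss(image_emb, text_emb, clamp_value=0.3):
    # Hypothetical clamped alignment loss: penalize low image-text cosine
    # similarity, but stop contributing once a sample is "aligned enough".
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    similarity = (image_emb * text_emb).sum(dim=-1)  # per-sample cosine similarity
    return (torch.clamp(1.0 - similarity, min=clamp_value) - clamp_value).mean()

def merge_checkpoints(full_sd, lora_sd, alpha=0.5):
    # Uniform linear interpolation of two state dicts with matching keys.
    return {k: alpha * full_sd[k] + (1.0 - alpha) * lora_sd[k] for k in full_sd}
```

In practice, the interpolation coefficient would be tuned on a small validation set before committing to a merged checkpoint.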
1.2. Diffusion Models Are Real-Time Game Engines
We present GameNGen, the first game engine powered entirely by a neural model that enables real-time interaction with a complex environment over long trajectories at high quality.
GameNGen can interactively simulate the classic game DOOM at over 20 frames per second on a single TPU. Next frame prediction achieves a PSNR of 29.4, comparable to lossy JPEG compression. Human raters are only slightly better than random chance at distinguishing short clips of the game from clips of the simulation.
GameNGen is trained in two phases:
An RL agent learns to play the game, and the training sessions are recorded.
A diffusion model is trained to produce the next frame, conditioned on the sequence of past frames and actions.
Conditioning augmentations enable stable auto-regressive generation over long trajectories.
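A minimal sketch of such an autoregressive rollout is shown below; `model.sample`, `encode`, and `decode` are stand-ins for the latent diffusion sampler and autoencoder, and adding noise to the context here only illustrates the conditioning augmentation the paper applies during training.

```python
from collections import deque
import torch

def rollout(model, encode, decode, init_frames, actions, context_len=8, noise_std=0.0):
    """Autoregressive game-simulation sketch: each new frame is sampled by a
    diffusion model conditioned on the last `context_len` frames and actions."""
    frames = deque([encode(f) for f in init_frames], maxlen=context_len)
    past_actions = deque(actions[:len(init_frames)], maxlen=context_len)
    outputs = []
    for action in actions[len(init_frames):]:
        context = torch.stack(list(frames))
        if noise_std > 0:
            # Conditioning augmentation: the paper corrupts past frames with noise
            # during training so the model tolerates its own prediction errors.
            context = context + noise_std * torch.randn_like(context)
        past_actions.append(action)
        next_latent = model.sample(context, list(past_actions))  # one full diffusion sampling pass
        frames.append(next_latent)
        outputs.append(decode(next_latent))
    return outputs
```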
1.3. ReconX: Reconstruct Any Scene from Sparse Views with Video Diffusion Model
Advancements in 3D scene reconstruction have made it possible to turn real-world 2D images into realistic 3D models, given hundreds of input photos.
Despite great success in dense-view reconstruction scenarios, rendering a detailed scene from too few captured views remains an ill-posed optimization problem, often resulting in artifacts and distortions in unseen areas.
In this paper, we propose ReconX, a novel 3D scene reconstruction paradigm that reframes the ambiguous reconstruction challenge as a temporal generation task. The key insight is to unleash the strong generative prior of large pre-trained video diffusion models for sparse-view reconstruction.
However, video frames generated directly by pre-trained models struggle to preserve accurate 3D view consistency. To address this, given limited input views, the proposed ReconX first constructs a global point cloud and encodes it into a contextual space as the 3D structure condition.
Guided by the condition, the video diffusion model then synthesizes video frames that are both detail-preserved and exhibit a high degree of 3D consistency, ensuring the coherence of the scene from various perspectives.
Finally, we recover the 3D scene from the generated video through a confidence-aware 3D Gaussian Splatting optimization scheme. Extensive experiments on various real-world datasets show the superiority of our ReconX over state-of-the-art methods in terms of quality and generalizability.
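The overall flow can be summarized in a short, hypothetical sketch; the four callables passed in below are stand-ins for the actual components, not a released API.

```python
def reconx_pipeline(sparse_views, cameras, build_point_cloud, encode_structure,
                    video_diffusion, optimize_gaussians):
    """Hypothetical sketch of the ReconX flow described above."""
    # 1) Lift the few input views into a global point cloud (coarse 3D structure).
    point_cloud = build_point_cloud(sparse_views, cameras)
    # 2) Encode the point cloud into a contextual condition for the video model.
    structure_condition = encode_structure(point_cloud)
    # 3) Generate a dense, 3D-consistent frame sweep guided by that condition.
    frames, confidences = video_diffusion(sparse_views, structure_condition)
    # 4) Fit 3D Gaussians to the generated frames, weighted by per-frame confidence.
    return optimize_gaussians(frames, confidences)
```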
2. Vision Language Models (VLMs)
2.1. Building and better understanding vision-language models: insights and future directions
The field of vision-language models (VLMs), which take images and texts as inputs and output texts, is rapidly evolving and has yet to reach a consensus on several key aspects of the development pipeline, including data, architecture, and training methods.
This paper can be seen as a tutorial for building a VLM. We begin by providing a comprehensive overview of the current state-of-the-art approaches, highlighting their strengths and weaknesses, addressing the major challenges in the field, and suggesting promising research directions for underexplored areas.
We then walk through the practical steps to build Idefics3-8B, a powerful VLM that significantly outperforms its predecessor Idefics2-8B, while being trained efficiently, exclusively on open datasets, and using a straightforward pipeline.
These steps include the creation of Docmatix, a dataset for improving document understanding capabilities, which is 240 times larger than previously available datasets. We release the model along with the datasets created for its training.
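Since the model and datasets are released, inference should look roughly like the sketch below, assuming the checkpoint is published on the Hugging Face Hub as HuggingFaceM4/Idefics3-8B-Llama3 and that your installed transformers version includes Idefics3 support.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

model_id = "HuggingFaceM4/Idefics3-8B-Llama3"  # assumed Hub id of the released checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

image = Image.open("document_page.png")  # any local image, e.g. a scanned document
messages = [{"role": "user",
             "content": [{"type": "image"},
                         {"type": "text", "text": "What is this document about?"}]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(model.device)

generated = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(generated, skip_special_tokens=True)[0])
```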
2.2. Law of Vision Representation in MLLMs
We present the “Law of Vision Representation” in multimodal large language models (MLLMs). It reveals a strong correlation between MLLM performance and the combination of cross-modal alignment and correspondence in the vision representation.
We quantify the two factors using the cross-modal Alignment and Correspondence score (AC score). Through extensive experiments involving thirteen different vision representation settings and evaluations across eight benchmarks, we find that the AC score is linearly correlated to model performance.
By leveraging this relationship, we can identify and train only the optimal vision representation, without finetuning the language model for every candidate, resulting in a 99.7% reduction in computational cost.
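A toy illustration of using such a linear relationship to rank candidate vision representations is sketched below; the numbers are made up and only stand in for measured AC scores and benchmark results.

```python
import numpy as np

# Hypothetical (AC score, benchmark accuracy) pairs for already-evaluated settings.
ac_scores = np.array([0.41, 0.48, 0.55, 0.61, 0.67, 0.72])
benchmark = np.array([52.3, 55.1, 58.4, 60.9, 63.0, 65.2])

# Least-squares linear fit: performance ≈ slope * AC + intercept.
slope, intercept = np.polyfit(ac_scores, benchmark, deg=1)

# Rank new candidate representations by predicted performance, without
# finetuning the language model for each one.
candidates = np.array([0.50, 0.58, 0.70])
predicted = slope * candidates + intercept
print(f"best candidate index: {int(np.argmax(predicted))}, predicted score: {predicted.max():.1f}")
```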
2.3. CogVLM2: Visual Language Models for Image and Video Understanding
Beginning with VisualGLM and CogVLM, we are continuously exploring VLMs in pursuit of enhanced vision-language fusion, efficient higher-resolution architecture, and broader modalities and applications.
Here we propose the CogVLM2 family, a new generation of visual language models for image and video understanding, including CogVLM2, CogVLM2-Video, and GLM-4V.
As an image understanding model, CogVLM2 inherits the visual expert architecture with improved training recipes in both pre-training and post-training stages, supporting input resolution up to 1344 × 1344 pixels.
As a video understanding model, CogVLM2-Video integrates multi-frame input with timestamps and proposes automated temporal grounding data construction. Notably, the CogVLM2 family has achieved state-of-the-art results on benchmarks like MMBench, MM-Vet, TextVQA, MVBench, and VCGBench.
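The idea of multi-frame input with timestamps can be illustrated with a small sketch that samples frames uniformly and tags each with its time offset; the OpenCV-based sampling and the prompt format below are assumptions for illustration, not CogVLM2-Video's actual template.

```python
import cv2  # pip install opencv-python

def sample_frames_with_timestamps(video_path, num_frames=8):
    """Uniformly sample frames and return (timestamp_seconds, frame) pairs."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = [int(i * (total - 1) / (num_frames - 1)) for i in range(num_frames)]
    samples = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if ok:
            samples.append((idx / fps, frame))
    cap.release()
    return samples

frames = sample_frames_with_timestamps("clip.mp4")
# Illustrative prompt: interleave timestamps with frame placeholders.
prompt = " ".join(f"[t={t:.1f}s]<frame>" for t, _ in frames) + " Describe the events in order."
```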
3. Video Understanding & Generation
3.1. Training-free Long Video Generation with Chain of Diffusion Model Experts
Video generation models hold substantial potential in areas such as filmmaking. However, current video diffusion models incur high computational costs and produce suboptimal results due to the high complexity of video generation tasks.
In this paper, we propose ConFiner, an efficient high-quality video generation framework that decouples video generation into easier subtasks: structure control and spatial-temporal refinement. It can generate high-quality videos with a chain of off-the-shelf diffusion model experts, each expert responsible for a decoupled subtask.
During the refinement, we introduce coordinated denoising, which can merge multiple diffusion experts’ capabilities into a single sampling. Furthermore, we design the ConFiner-Long framework, which can generate long coherent videos with three constraint strategies on ConFiner.
Experimental results indicate that with only 10% of the inference cost, ConFiner surpasses representative models like Lavie and Modelscope across all objective and subjective metrics, and ConFiner-Long can generate high-quality, coherent videos with up to 600 frames.
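Coordinated denoising can be pictured as several expert denoisers contributing to one sampling trajectory; the weighted blending and the diffusers-style scheduler interface in the sketch below are assumptions, not ConFiner's exact formulation.

```python
import torch

def coordinated_denoise(experts, weights, latents, timesteps, scheduler):
    """`experts` are callables (latents, t) -> predicted noise; `scheduler` is assumed
    to follow a diffusers-style `step(noise_pred, t, latents).prev_sample` interface."""
    for t in timesteps:
        # Blend the experts' noise estimates for this step into a single prediction.
        noise_pred = torch.stack(
            [w * expert(latents, t) for expert, w in zip(experts, weights)]
        ).sum(dim=0)
        latents = scheduler.step(noise_pred, t, latents).prev_sample
    return latents
```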
3.2. Generative Inbetweening: Adapting Image-to-Video Models for Keyframe Interpolation
We present a method for generating video sequences with coherent motion between a pair of input keyframes. We adapt a pretrained large-scale image-to-video diffusion model (originally trained to generate videos moving forward in time from a single input image) for keyframe interpolation, i.e., to produce a video in between two input frames.
We accomplish this adaptation through a lightweight fine-tuning technique that produces a version of the model that instead predicts videos moving backward in time from a single input image. This model (along with the original forward-moving model) is subsequently used in a dual-directional diffusion sampling process that combines the overlapping model estimates starting from each of the two keyframes.
Our experiments show that our method outperforms both existing diffusion-based methods and traditional frame interpolation techniques.
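A minimal sketch of such dual-directional sampling is given below: a forward denoiser conditioned on the first keyframe and a time-reversed denoiser conditioned on the last keyframe are fused at every step. Simple averaging after flipping the backward estimate is an assumption; the paper defines its own way of combining the overlapping estimates.

```python
import torch

def dual_directional_sample(forward_model, backward_model, key_a, key_b,
                            latents, timesteps, scheduler):
    """latents: (num_frames, C, H, W) noisy video latents; the two models are callables
    (latents, t, keyframe) -> predicted noise; scheduler follows a diffusers-style API."""
    for t in timesteps:
        eps_fwd = forward_model(latents, t, key_a)                          # forward in time from keyframe A
        eps_bwd = backward_model(torch.flip(latents, dims=[0]), t, key_b)   # backward in time from keyframe B
        eps = 0.5 * (eps_fwd + torch.flip(eps_bwd, dims=[0]))               # fuse the two overlapping estimates
        latents = scheduler.step(eps, t, latents).prev_sample
    return latents
```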
4. Text to Image Generation
4.1. GenCA: A Text-conditioned Generative Model for Realistic and Drivable Codec Avatars
Photo-realistic and controllable 3D avatars are crucial for various applications such as virtual and mixed reality (VR/MR), telepresence, gaming, and film production.
Traditional methods for avatar creation often involve time-consuming scanning and reconstruction processes for each avatar, which limits their scalability. Furthermore, these methods do not offer the flexibility to sample new identities or modify existing ones.
On the other hand, by learning a strong prior from data, generative models provide a promising alternative to traditional reconstruction methods, easing the time constraints for both data capture and processing.
Additionally, generative methods enable downstream applications beyond reconstruction, such as editing and stylization. Nonetheless, the research on generative 3D avatars is still in its infancy, and therefore current methods still have limitations such as creating static avatars, lacking photo-realism, having incomplete facial details, or having limited drivability.
To address this, we propose a text-conditioned generative model that can generate photo-realistic facial avatars of diverse identities, with more complete details like hair, eyes, and mouth interior, and which can be driven through a powerful non-parametric latent expression space.
Specifically, we integrate the generative and editing capabilities of latent diffusion models with a strong prior model for avatar expression driving. Our model can generate and control high-fidelity avatars, even those out-of-distribution. We also highlight its potential for downstream applications, including avatar editing and single-shot avatar reconstruction.
5. Segmentation
5.1. SAM2Point: Segment Any 3D Videos in Zero-shot and Promptable Manners
We introduce SAM2Point, a preliminary exploration adapting Segment Anything Model 2 (SAM 2) for zero-shot and promptable 3D segmentation. SAM2Point interprets any 3D data as a series of multi-directional videos, and leverages SAM 2 for 3D-space segmentation, without further training or 2D-3D projection.
Our framework supports various prompt types, including 3D points, boxes, and masks, and can generalize across diverse scenarios, such as 3D objects, indoor scenes, outdoor environments, and raw sparse LiDAR. Demonstrations on multiple 3D datasets, e.g., Objaverse, S3DIS, ScanNet, Semantic3D, and KITTI, highlight the robust generalization capabilities of SAM2Point.
To our knowledge, we present the most faithful implementation of SAM in 3D, which may serve as a starting point for future research in promptable 3D segmentation.
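The core trick of treating 3D data as multi-directional videos can be sketched as follows: voxelize the point cloud, slice the grid along each axis into sequences of 2D frames, segment each sequence with a SAM 2-style video predictor, and fuse the masks back into voxel space. `segment_video`, the grid resolution, and the union-based fusion are assumptions for illustration, not SAM2Point's exact implementation.

```python
import numpy as np

def voxelize(points, colors, resolution=128):
    """points: (N, 3) float array; colors: (N, 3) in [0, 1]. Returns an RGB voxel grid."""
    mins, maxs = points.min(axis=0), points.max(axis=0)
    idx = ((points - mins) / (maxs - mins + 1e-8) * (resolution - 1)).astype(int)
    grid = np.zeros((resolution, resolution, resolution, 3), dtype=np.float32)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = colors
    return grid

def grid_to_frames(grid, axis):
    """Slice the voxel grid along one axis into a sequence of 2D RGB frames."""
    return [np.take(grid, i, axis=axis) for i in range(grid.shape[axis])]

def segment_3d(points, colors, segment_video, prompt):
    """Run a SAM 2-style video segmenter (`segment_video`, a stand-in callable that
    returns one boolean mask per frame) along each axis and fuse the masks by union."""
    grid = voxelize(points, colors)
    masks = []
    for axis in range(3):
        frame_masks = segment_video(grid_to_frames(grid, axis), prompt)
        masks.append(np.moveaxis(np.stack(frame_masks), 0, axis))
    return np.logical_or.reduce(masks)
```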
Are you looking to start a career in data science and AI but do not know how? I offer data science mentoring sessions and long-term career mentoring:
Mentoring sessions: https://lnkd.in/dXeg3KPW
Long-term mentoring: https://lnkd.in/dtdUYBrM