Top Important Computer Vision Papers for the Week from 27/11 to 03/12
Stay Updated with Recent Computer Vision Research
Every week, top-tier academic conferences and journals showcase innovative research in computer vision, presenting exciting breakthroughs in subfields such as image recognition, vision model optimization, generative adversarial networks (GANs), image segmentation, video analysis, and more.
This article provides a comprehensive overview of the most significant papers published in the first week of December 2023, highlighting the latest research and advancements in computer vision. Whether you’re a researcher, practitioner, or enthusiast, this article will provide valuable insights into the state-of-the-art techniques and tools in computer vision.
1. VideoBooth: Diffusion-based Video Generation with Image Prompts
Text-driven video generation has witnessed rapid progress. However, text prompts alone are often not enough to depict a subject appearance that accurately aligns with users’ intent, especially for customized content creation. In this paper, the authors study the task of video generation with image prompts, which provide more accurate and direct content control than text prompts.
Specifically, the authors propose VideoBooth, a feed-forward framework with two dedicated designs:
They propose to embed image prompts in a coarse-to-fine manner. Coarse visual embeddings from the image encoder provide high-level encodings of image prompts, while fine visual embeddings from the proposed attention injection module provide multi-scale and detailed encoding of image prompts. These two complementary embeddings can faithfully capture the desired appearance.
In the attention injection module, at the fine level, multi-scale image prompts are fed into different cross-frame attention layers as additional keys and values. This extra spatial information refines the details in the first frame and is then propagated to the remaining frames, maintaining temporal consistency.
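To make the second design more concrete, here is a minimal, hypothetical PyTorch sketch of appending image-prompt features as extra keys and values in an attention layer. The function name and tensor shapes are illustrative assumptions, not VideoBooth’s actual implementation.

```python
import torch

def cross_frame_attention_with_image_prompt(frame_feats, prompt_feats, to_q, to_k, to_v):
    """Illustrative sketch: inject image-prompt features as additional keys/values.

    frame_feats:  (B, N_frame_tokens, C) latent tokens of the video frames
    prompt_feats: (B, N_prompt_tokens, C) multi-scale image-prompt features
    to_q/to_k/to_v: the linear projections of the attention layer
    """
    q = to_q(frame_feats)
    # Keys and values come from both the frame tokens and the image prompt.
    kv_input = torch.cat([frame_feats, prompt_feats], dim=1)
    k, v = to_k(kv_input), to_v(kv_input)
    attn = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
    return attn @ v
```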
Extensive experiments demonstrate that VideoBooth achieves state-of-the-art performance in generating customized high-quality videos with subjects specified in image prompts. Notably, VideoBooth is a generalizable framework where a single model works for a wide range of image prompts with feed-forward passes.
2. GraphDreamer: Compositional 3D Scene Synthesis from Scene Graphs
As pretrained text-to-image diffusion models become increasingly powerful, recent efforts have been made to distill knowledge from these text-to-image pretrained models for optimizing a text-guided 3D model. Most of the existing methods generate a holistic 3D model from a plain text input. This can be problematic when the text describes a complex scene with multiple objects because the vectorized text embeddings are inherently unable to capture a complex description with multiple entities and relationships.
Holistic 3D modeling of the entire scene further prevents accurate grounding of text entities and concepts. To address this limitation, the authors propose GraphDreamer, a novel framework to generate compositional 3D scenes from scene graphs, where objects are represented as nodes and their interactions as edges.
By exploiting node and edge information in scene graphs, the method makes better use of the pretrained text-to-image diffusion model and can fully disentangle different objects without image-level supervision. To facilitate the modeling of object-wise relationships, they use signed distance fields as the representation and impose a constraint that prevents objects from inter-penetrating.
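The non-penetration constraint can be pictured as a simple penalty over sampled 3D points: a point should not lie inside two objects at once, i.e. both signed distance values should not be negative simultaneously. The sketch below illustrates that general idea and is not the exact loss used in GraphDreamer.

```python
import torch

def interpenetration_penalty(sdf_a, sdf_b):
    """Illustrative penalty (assumed form): positive only where a sampled point
    is inside both objects, i.e. both signed distances are negative."""
    inside_both = torch.relu(-sdf_a) * torch.relu(-sdf_b)
    return inside_both.mean()
```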
To avoid manual scene graph creation, they design a text prompt that lets ChatGPT generate scene graphs from plain text inputs. They conduct both qualitative and quantitative experiments to validate the effectiveness of GraphDreamer in generating high-fidelity compositional 3D scenes with disentangled object entities.
3. HiFi Tuner: High-Fidelity Subject-Driven Fine-Tuning for Diffusion Models
This paper explores advancements in high-fidelity personalized image generation through the utilization of pre-trained text-to-image diffusion models.
While previous approaches have made significant strides in generating versatile scenes based on text descriptions and a few input images, challenges persist in maintaining the subject fidelity within the generated images.
In this work, the authors introduce an innovative algorithm named HiFi Tuner to enhance the appearance preservation of objects during personalized image generation.
The proposed method employs a parameter-efficient fine-tuning framework, comprising a denoising process and a pivotal inversion process. Key enhancements include the utilization of mask guidance, a novel parameter regularization technique, and the incorporation of step-wise subject representations to elevate the sample fidelity.
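As a rough illustration of what a parameter regularizer in such a fine-tuning setup can look like, the sketch below anchors trainable weights to their pretrained values. This is a generic example under that assumption; HiFi Tuner’s novel regularization technique is defined in the paper and may differ.

```python
def parameter_regularization(model, pretrained_params, weight=1e-3):
    """Generic weight-anchoring regularizer (illustrative, not HiFi Tuner's exact loss).

    model:             the fine-tuned torch.nn.Module
    pretrained_params: dict mapping parameter names to frozen pretrained tensors
    """
    reg = 0.0
    for name, param in model.named_parameters():
        if param.requires_grad:
            reg = reg + ((param - pretrained_params[name]) ** 2).sum()
    return weight * reg
```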
Additionally, they propose a reference-guided generation approach that leverages the pivotal inversion of a reference image to mitigate unwanted subject variations and artifacts. They further extend the method to a novel image editing task: substituting the subject in an image through textual manipulations.
Experimental evaluations conducted on the DreamBooth dataset using the Stable Diffusion model show promising results. Fine-tuning solely on textual embeddings improves the CLIP-T score by 3.6 points and the DINO score by 9.6 points over Textual Inversion. When all parameters are fine-tuned, HiFi Tuner improves the CLIP-T score by 1.2 points and the DINO score by 1.2 points over DreamBooth, establishing a new state of the art.
4. Dolphins: Multimodal Language Model for Driving
Fully autonomous vehicles (AVs) capable of navigating complex real-world scenarios with human-like understanding and responsiveness remain a long-standing quest. In this paper, the authors introduce Dolphins, a novel vision-language model architected to act as a conversational driving assistant with human-like abilities.
Dolphins is adept at processing multimodal inputs comprising video (or image) data, text instructions, and historical control signals to generate informed outputs corresponding to the provided instructions.
Building upon the open-sourced pretrained Vision-Language Model, OpenFlamingo, they first enhance Dolphins’ reasoning capabilities through an innovative Grounded Chain of Thought (GCoT) process.
They then tailor Dolphins to the driving domain by constructing driving-specific instruction data and conducting instruction tuning. Using the BDD-X dataset, they design and consolidate four distinct AV tasks into Dolphins to foster a holistic understanding of intricate driving scenarios.
As a result, the distinctive features of Dolphins are characterized into two dimensions:
The ability to provide a comprehensive understanding of complex and long-tailed open-world driving scenarios and to solve a spectrum of AV tasks.
The emergence of human-like capabilities including gradient-free instant adaptation via in-context learning and error recovery via reflection.
5. MoMask: Generative Masked Modeling of 3D Human Motions
This paper introduces MoMask, a novel masked modeling framework for text-driven 3D human motion generation. In MoMask, a hierarchical quantization scheme is employed to represent human motion as multi-layer discrete motion tokens with high-fidelity details.
Starting from the base layer, which holds a sequence of motion tokens obtained by vector quantization, residual tokens of increasing order are derived and stored at the subsequent layers of the hierarchy.
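This hierarchical scheme is essentially residual vector quantization: each layer quantizes whatever the previous layers could not capture. The following is a minimal sketch under that reading, with hypothetical tensor shapes; MoMask’s actual tokenizer is trained end to end.

```python
import torch

def residual_vq(features, codebooks):
    """Illustrative residual quantization: the base layer quantizes the motion
    features, and each subsequent layer quantizes the remaining residual.

    features:  (T, D) per-frame motion features
    codebooks: list of (K, D) codebooks, one per hierarchy layer
    """
    residual, token_layers = features, []
    for codebook in codebooks:
        dists = torch.cdist(residual, codebook)   # (T, K) distances to codewords
        tokens = dists.argmin(dim=-1)             # nearest codeword per frame
        quantized = codebook[tokens]
        token_layers.append(tokens)
        residual = residual - quantized           # pass the remainder down a layer
    return token_layers
```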
These hierarchical tokens are then modeled by two distinct bidirectional transformers. For the base-layer motion tokens, a Masked Transformer is trained to predict randomly masked motion tokens conditioned on the text input.
During the generation (i.e., inference) stage, starting from an empty sequence, the Masked Transformer iteratively fills in the missing tokens. A Residual Transformer then learns to progressively predict the next-layer tokens based on the results from the current layer.
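The iterative fill-in follows the familiar masked-generation recipe (MaskGIT-style decoding): predict all masked positions, keep the most confident predictions, and re-mask the rest for the next round. The sketch below illustrates that pattern with a hypothetical `masked_transformer` callable and a simple linear schedule; MoMask’s exact procedure may differ.

```python
import torch

def iterative_mask_fill(masked_transformer, text_emb, seq_len, num_iters=10, mask_id=0):
    """Confidence-based iterative decoding (illustrative, not MoMask's exact code)."""
    tokens = torch.full((1, seq_len), mask_id, dtype=torch.long)
    for step in range(num_iters):
        logits = masked_transformer(tokens, text_emb)       # (1, seq_len, vocab_size)
        conf, pred = logits.softmax(dim=-1).max(dim=-1)
        # Keep the most confident predictions; re-mask the rest.
        num_masked = int(seq_len * (1 - (step + 1) / num_iters))
        if num_masked > 0:
            threshold = conf.squeeze(0).sort().values[num_masked]
            tokens = torch.where(conf > threshold, pred, torch.full_like(pred, mask_id))
        else:
            tokens = pred                                    # final step: accept everything
    return tokens
```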
Extensive experiments demonstrate that MoMask outperforms state-of-the-art methods on the text-to-motion generation task, with an FID of 0.045 (vs. 0.141 for T2M-GPT) on the HumanML3D dataset and 0.228 (vs. 0.514) on KIT-ML. MoMask can also be applied seamlessly to related tasks, such as text-guided temporal inpainting, without further model fine-tuning.
6. DREAM: Diffusion Rectification and Estimation-Adaptive Models
This paper presents DREAM, a novel training framework whose name stands for Diffusion Rectification and Estimation-Adaptive Models. It requires minimal code changes (just three lines) yet significantly improves the alignment between training and sampling in diffusion models.
DREAM features two components: diffusion rectification, which adjusts training to reflect the sampling process, and estimation adaptation, which balances perception against distortion. When applied to image super-resolution (SR), DREAM adeptly navigates the tradeoff between minimizing distortion and preserving high image quality.
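To give a feel for what “adjusting training to reflect sampling” can look like, the sketch below adds one extra no-gradient forward pass and blends the model’s own noise estimate back into the training input before computing the loss. This is only an illustrative pattern under assumed coefficients; DREAM’s actual rectification and its estimation-adaptive weighting are specified in the paper.

```python
import torch
import torch.nn.functional as F

def dream_style_training_step(model, x0, t, alphas_cumprod, lam=1.0):
    """Illustrative 'train on your own estimate' step (assumed form, not DREAM's exact code)."""
    noise = torch.randn_like(x0)
    a = alphas_cumprod[t].sqrt().view(-1, 1, 1, 1)
    s = (1 - alphas_cumprod[t]).sqrt().view(-1, 1, 1, 1)
    xt = a * x0 + s * noise
    with torch.no_grad():                       # one extra forward pass, no gradients
        est_noise = model(xt, t)                # model's current noise estimate
    # Rectified noise: blend the true noise with the model's own estimate so that
    # training inputs look more like what the sampler will actually produce.
    rect_noise = noise + lam * (est_noise - noise)
    xt_rect = a * x0 + s * rect_noise
    return F.mse_loss(model(xt_rect, t), rect_noise)
```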
Experiments demonstrate DREAM’s superiority over standard diffusion-based SR methods, showing 2 to 3 times faster training convergence and a 10- to 20-fold reduction in the number of sampling steps needed to achieve comparable or superior results. The authors hope DREAM will inspire a rethinking of diffusion model training paradigms.
7. Text-Guided 3D Face Synthesis — From Generation to Editing
Text-guided 3D face synthesis has achieved remarkable results by leveraging text-to-image (T2I) diffusion models. However, most existing works focus solely on direct generation and ignore editing, which prevents them from synthesizing customized 3D faces through iterative adjustments.
In this paper, the authors propose a unified text-guided framework covering face generation through editing. In the generation stage, they propose a geometry-texture decoupled generation scheme to mitigate the loss of geometric details caused by coupling. The decoupling also enables the method to use the generated geometry as a condition for texture generation, yielding highly geometry-texture-aligned results.
They further employ a fine-tuned texture diffusion model to enhance texture quality in both RGB and YUV space. In the editing stage, they first employ a pre-trained diffusion model to update facial geometry or texture based on the texts. To enable sequential editing, they introduce a UV domain consistency preservation regularization, preventing unintentional changes to irrelevant facial attributes.
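A UV-space consistency term of this kind can be pictured as follows: outside the region that the current edit is supposed to touch, the new texture should stay close to the previous one. The mask and the exact formulation below are assumptions for illustration, not the paper’s definition.

```python
import torch

def uv_consistency_regularization(uv_new, uv_ref, relevance_mask):
    """Illustrative UV-domain consistency term (assumed form).

    uv_new, uv_ref:  (H, W, 3) edited and previous UV texture maps
    relevance_mask:  (H, W, 1) soft mask of the region the edit should affect
    """
    irrelevant = 1.0 - relevance_mask
    return (irrelevant * (uv_new - uv_ref) ** 2).mean()
```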
In addition, they propose a self-guided consistency weight strategy to improve editing efficacy while preserving consistency. Through comprehensive experiments, they showcase the method’s superiority in face synthesis.
8. FSGS: Real-Time Few-shot View Synthesis using Gaussian Splatting
Novel view synthesis from limited observations remains an important and persistent task. However, existing NeRF-based few-shot view synthesis methods often trade efficiency for an accurate 3D representation. To address this challenge, the authors propose a few-shot view synthesis framework based on 3D Gaussian Splatting that enables real-time, photo-realistic view synthesis with as few as three training views.
The proposed method, dubbed FSGS, handles the extremely sparse initialized SfM points with a thoughtfully designed Gaussian Unpooling process. The method iteratively distributes new Gaussians around the most representative locations and subsequently fills in local details in vacant areas. The authors also integrate a large-scale pretrained monocular depth estimator into the Gaussian optimization process, leveraging online augmented views to guide the geometric optimization toward an optimal solution.
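One simple way to picture Gaussian Unpooling is as a densification step that proposes new Gaussians between each existing Gaussian and its nearest neighbours. The sketch below is a hedged illustration of that idea using only the Gaussian centers; how FSGS actually selects the most representative locations and initializes the new Gaussians’ attributes follows the paper.

```python
import torch

def gaussian_unpooling(means, k=3):
    """Illustrative densification: add new Gaussians at the midpoints between
    each Gaussian center and its k nearest neighbours (assumed scheme)."""
    dists = torch.cdist(means, means)                    # (N, N) pairwise distances
    dists.fill_diagonal_(float("inf"))                   # ignore self-distance
    knn_idx = dists.topk(k, largest=False).indices       # (N, k) nearest neighbours
    neighbours = means[knn_idx]                          # (N, k, 3)
    new_means = (means.unsqueeze(1) + neighbours) / 2.0  # midpoints
    return torch.cat([means, new_means.reshape(-1, 3)], dim=0)
```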
Starting from sparse points observed from limited input viewpoints, FSGS can accurately grow into unseen regions, comprehensively covering the scene and boosting the rendering quality of novel views. Overall, FSGS achieves state-of-the-art performance in both accuracy and rendering efficiency across diverse datasets, including LLFF, Mip-NeRF360, and Blender.
9. X-Dreamer: Creating High-quality 3D Content by Bridging the Domain Gap Between Text-to-2D and Text-to-3D Generation
In recent times, automatic text-to-3D content creation has made significant progress, driven by the development of pretrained 2D diffusion models. Existing text-to-3D methods typically optimize the 3D representation to ensure that the rendered image aligns well with the given text, as evaluated by the pretrained 2D diffusion model.
Nevertheless, a substantial domain gap exists between 2D images and 3D assets, primarily attributed to variations in camera-related attributes and the exclusive presence of foreground objects. Consequently, employing 2D diffusion models directly for optimizing 3D representations may lead to suboptimal outcomes. To address this issue, the authors present X-Dreamer, a novel approach for high-quality text-to-3D content creation that effectively bridges the gap between text-to-2D and text-to-3D synthesis.
The key components of X-Dreamer are two innovative designs: Camera-Guided Low-Rank Adaptation (CG-LoRA) and Attention-Mask Alignment (AMA) Loss. CG-LoRA dynamically incorporates camera information into the pretrained diffusion models by employing camera-dependent generation for trainable parameters.
This integration enhances the alignment between the generated 3D assets and the camera’s perspective. The AMA loss guides the attention map of the pretrained diffusion model using the binary mask of the 3D object, prioritizing the creation of the foreground object and ensuring that the model focuses on generating accurate, detailed foreground content. Extensive evaluations demonstrate the effectiveness of the proposed method compared to existing text-to-3D approaches.
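To make these two components more tangible, here is a hedged sketch: a LoRA-style linear layer whose low-rank factors are generated from a camera embedding, and an attention-mask alignment term that pulls an attention map toward the rendered foreground mask. Class names, dimensions, and the exact loss form are assumptions, not X-Dreamer’s implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CameraGuidedLoRA(nn.Module):
    """Illustrative CG-LoRA-style layer: the low-rank update is generated from
    the camera parameters, so the adaptation is camera-dependent."""

    def __init__(self, in_dim, out_dim, rank=4, cam_dim=16):
        super().__init__()
        self.base = nn.Linear(in_dim, out_dim)   # stands in for a frozen pretrained weight
        self.base.requires_grad_(False)
        self.to_a = nn.Linear(cam_dim, in_dim * rank)
        self.to_b = nn.Linear(cam_dim, rank * out_dim)
        self.rank = rank

    def forward(self, x, cam_emb):
        # x: (B, N, in_dim) tokens, cam_emb: (B, cam_dim) camera embedding
        a = self.to_a(cam_emb).view(-1, x.shape[-1], self.rank)
        b = self.to_b(cam_emb).view(-1, self.rank, self.base.out_features)
        return self.base(x) + torch.bmm(torch.bmm(x, a), b)

def attention_mask_alignment_loss(attn_map, foreground_mask):
    """Illustrative AMA-style term: align the attention map with the object's
    binary foreground mask (assumed form)."""
    return F.mse_loss(attn_map, foreground_mask)
```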
10. StyleCrafter: Enhancing Stylized Text-to-Video Generation with Style Adapter
Text-to-video (T2V) models have shown remarkable capabilities in generating diverse videos. However, they struggle to produce user-desired stylized videos due to:
Text’s inherent clumsiness in expressing specific styles.
The generally degraded style fidelity of existing models.
To address these challenges, the authors introduce StyleCrafter, a generic method that enhances pre-trained T2V models with a style control adapter, enabling video generation in any style by providing a reference image.
Considering the scarcity of stylized video datasets, the authors propose to first train a style control adapter on style-rich image datasets and then transfer the learned stylization ability to video generation through a tailor-made fine-tuning paradigm. To promote content-style disentanglement, they remove style descriptions from the text prompt and extract style information solely from the reference image using a decoupling learning strategy.
Additionally, they design a scale-adaptive fusion module to balance the influence of text-based content features and image-based style features, which helps generalization across various text and style combinations. StyleCrafter efficiently generates high-quality stylized videos that align with the textual content and resemble the style of the reference images. Experiments demonstrate that the approach is more flexible and efficient than existing competitors.
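The fusion module can be pictured as a learned gate that decides, per token, how much style to blend into the content features. The sketch below is a minimal assumed design; StyleCrafter’s actual module operates per scale inside the T2V backbone and may be structured differently.

```python
import torch
import torch.nn as nn

class ScaleAdaptiveFusion(nn.Module):
    """Illustrative fusion gate between content and style features (assumed design)."""

    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.SiLU(), nn.Linear(dim, 1))

    def forward(self, content_feat, style_feat):
        # content_feat / style_feat: (B, N, C) outputs of the text and style branches
        w = torch.sigmoid(self.gate(torch.cat([content_feat, style_feat], dim=-1)))
        return content_feat + w * style_feat
```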
Are you looking to start a career in data science and AI but do not know how? I offer data science mentoring sessions and long-term career mentoring:
Mentoring sessions: https://lnkd.in/dXeg3KPW
Long-term mentoring: https://lnkd.in/dtdUYBrM