Top Computer Vision Papers for the Week from 30/10 to 5/11
Stay Up to Date with Recent Computer Vision Research
Every week, top-tier academic conferences and journals showcase innovative research in computer vision, presenting exciting breakthroughs in subfields such as image recognition, vision model optimization, generative adversarial networks (GANs), image segmentation, video analysis, and more.
This article provides a comprehensive overview of the most significant papers published in the first week of November 2023, highlighting the latest research and advancements in computer vision. Whether you’re a researcher, practitioner, or enthusiast, this article will provide valuable insights into the state-of-the-art techniques and tools in computer vision.
Are you looking to start a career in data science and AI but do not know how? I offer data science mentoring sessions and long-term career mentoring:
Mentoring sessions: https://lnkd.in/dXeg3KPW
Long-term mentoring: https://lnkd.in/dtdUYBrM
1. Image Generation
1.1. De-Diffusion Makes Text a Strong Cross-Modal Interface
This paper demonstrates that text can serve as a strong cross-modal interface. Rather than relying on deep embeddings to connect image and language as the interface representation, this approach represents an image as text, thereby enjoying the interpretability and flexibility inherent to natural language.
The authors employ an autoencoder that uses a pre-trained text-to-image diffusion model for decoding. The encoder is trained to transform an input image into text, which is then fed into the fixed text-to-image diffusion decoder to reconstruct the original input — a process they term De-Diffusion.
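To make that setup concrete, below is a minimal PyTorch-style sketch of the autoencoding structure, assuming a trainable image-to-text encoder and a frozen text-to-image decoder; every module name, size, and the toy decoder itself are hypothetical stand-ins, not the paper's actual components.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImageToTextEncoder(nn.Module):
    """Trainable encoder: image -> a short sequence of discrete, text-like tokens."""
    def __init__(self, img_dim=3 * 32 * 32, vocab_size=1000, num_tokens=16, dim=128):
        super().__init__()
        self.backbone = nn.Sequential(nn.Flatten(), nn.Linear(img_dim, dim), nn.ReLU())
        self.to_logits = nn.Linear(dim, num_tokens * vocab_size)
        self.num_tokens, self.vocab_size = num_tokens, vocab_size

    def forward(self, images):
        logits = self.to_logits(self.backbone(images))
        logits = logits.view(images.size(0), self.num_tokens, self.vocab_size)
        # Straight-through Gumbel-softmax keeps the bottleneck discrete (token-like)
        # while remaining differentiable end to end.
        return F.gumbel_softmax(logits, tau=1.0, hard=True)

class FrozenTextToImageDecoder(nn.Module):
    """Toy stand-in for a frozen, pre-trained text-to-image diffusion decoder."""
    def __init__(self, vocab_size=1000, img_dim=3 * 32 * 32, dim=128):
        super().__init__()
        self.token_embed = nn.Linear(vocab_size, dim)
        self.denoiser = nn.Linear(img_dim + dim, img_dim)
        for p in self.parameters():
            p.requires_grad_(False)  # the decoder stays frozen throughout training

    def denoising_loss(self, images, tokens):
        x0 = images.flatten(1)
        noise = torch.randn_like(x0)
        noisy = x0 + noise                           # crude single-level noising
        cond = self.token_embed(tokens).mean(dim=1)  # pool token embeddings as conditioning
        pred_noise = self.denoiser(torch.cat([noisy, cond], dim=-1))
        return F.mse_loss(pred_noise, noise)

encoder, decoder = ImageToTextEncoder(), FrozenTextToImageDecoder()
opt = torch.optim.Adam(encoder.parameters(), lr=1e-4)

images = torch.randn(4, 3, 32, 32)                      # dummy batch
loss = decoder.denoising_loss(images, encoder(images))  # image -> "text" -> image
opt.zero_grad()
loss.backward()  # gradients flow through the frozen decoder into the encoder only
opt.step()
```

Because only the encoder is updated, the text bottleneck has to carry everything the frozen decoder needs to reconstruct the image, which is what makes the resulting text usable as a cross-modal interface.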
Experiments validate both the precision and comprehensiveness of De-Diffusion text in representing images: it can be readily ingested by off-the-shelf text-to-image tools and LLMs for diverse multi-modal tasks. For example, a single De-Diffusion model can generalize to provide transferable prompts for different text-to-image tools and also achieves a new state of the art on open-ended vision-language tasks by simply prompting large language models with few-shot examples.
1.2. CapsFusion: Rethinking Image-Text Data at Scale
Large multimodal models demonstrate remarkable generalist ability to perform diverse multimodal tasks in a zero-shot manner. Large-scale web-based image-text pairs contribute fundamentally to this success but suffer from excessive noise.
Recent studies use alternative captions synthesized by captioning models and have achieved notable benchmark performance. However, the authors reveal significant Scalability Deficiency and World Knowledge Loss issues in models trained with synthetic captions, issues that have been largely obscured by their initial benchmark success.
Upon closer examination, the authors identify the root cause as the overly simplified language structure and lack of knowledge details in existing synthetic captions. To provide higher-quality and more scalable multimodal pretraining data, the authors propose CapsFusion, an advanced framework that leverages large language models to consolidate and refine information from both web-based image-text pairs and synthetic captions.
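As a rough illustration of that consolidation step, the sketch below asks a generic instruction-tuned LLM to merge a raw web caption with a synthetic caption; the `chat` helper and the prompt wording are placeholders, not CapsFusion's actual prompt or pipeline.

```python
# Hypothetical caption-fusion step: an LLM merges a noisy web caption with a
# synthetic caption into one refined caption. `chat` is a placeholder for any
# instruction-tuned LLM call (API or local model).

FUSION_PROMPT = (
    "Merge the two image captions below into one fluent caption. "
    "Keep real-world knowledge (names, places, brands) from the raw caption "
    "and the visual detail from the synthetic caption. Remove noise and URLs.\n\n"
    "Raw web caption: {raw}\n"
    "Synthetic caption: {synthetic}\n\n"
    "Fused caption:"
)

def chat(prompt: str) -> str:
    """Placeholder for an LLM call; wire this to whichever model you use."""
    raise NotImplementedError

def fuse_caption(raw_caption: str, synthetic_caption: str) -> str:
    return chat(FUSION_PROMPT.format(raw=raw_caption, synthetic=synthetic_caption))

# Usage over a corpus of image-text pairs (toy example):
# fused = [fuse_caption(raw, syn) for raw, syn in zip(raw_captions, synthetic_captions)]
```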
Extensive experiments show that CapsFusion captions exhibit remarkable all-round superiority over existing captions in terms of model performance (e.g., improvements of 18.8 and 18.3 CIDEr points on COCO and NoCaps), sample efficiency (requiring 11–16 times less computation than baselines), world knowledge depth, and scalability. These advantages in effectiveness, efficiency, and scalability position CapsFusion as a promising candidate for future scaling of LMM training.
1.3. Beyond U: Making Diffusion Models Faster & Lighter
Diffusion models are a family of generative models that yield record-breaking performance in tasks such as image synthesis, video generation, and molecule design. Despite their capabilities, their efficiency, especially in the reverse denoising process, remains a challenge due to slow convergence rates and high computational costs.
In this work, the authors introduce an approach that leverages continuous dynamical systems to design a novel denoising network for diffusion models that is more parameter-efficient, exhibits faster convergence, and demonstrates increased noise robustness.
This framework operates with approximately a quarter of the parameters and 30% of the floating point operations (FLOPs) of the standard U-Nets used in Denoising Diffusion Probabilistic Models (DDPMs). Furthermore, the model is up to 70% faster at inference than the baseline models under equal conditions, while converging to better-quality solutions.
1.4. SEINE: Short-to-Long Video Diffusion Model for Generative Transition and Prediction
Recently, video generation has achieved substantial progress with realistic results. Nevertheless, existing AI-generated videos are usually very short clips (“shot-level”) depicting a single scene. To deliver a coherent long video (“story-level”), it is desirable to have creative transition and prediction effects across different clips.
This paper presents a short-to-long video diffusion model, SEINE, that focuses on generative transition and prediction. The goal is to generate high-quality long videos with smooth and creative transitions between scenes and varying lengths of shot-level videos.
Specifically, the authors propose a random-mask video diffusion model to automatically generate transitions based on textual descriptions. By providing the images of different scenes as inputs, combined with text-based control, the proposed model generates transition videos that ensure coherence and visual quality.
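A minimal sketch of one plausible random-mask conditioning scheme is shown below: the first and last frames (the given scene images) are kept visible while the frames in between are masked for the diffusion model to synthesize. The exact masking strategy here is an assumption, not the paper's implementation.

```python
import torch

def make_transition_mask(num_frames: int, keep_first: int = 1, keep_last: int = 1,
                         p_keep: float = 0.0) -> torch.Tensor:
    """Return a (num_frames,) mask: 1 = frame given as a condition, 0 = frame to generate.

    For a scene transition, the first and last frames are kept and the frames in
    between are masked so the diffusion model must synthesize them. A small random
    keep probability can expose the model to other masking patterns during training
    (an assumption, not necessarily the paper's scheme)."""
    mask = (torch.rand(num_frames) < p_keep).float()
    mask[:keep_first] = 1.0
    mask[-keep_last:] = 1.0
    return mask

# Conditioning input for a masked video diffusion model: visible frames pass through,
# masked frames are zeroed out and must be generated from noise plus the text prompt.
frames = torch.randn(16, 3, 64, 64)             # toy video: T x C x H x W
mask = make_transition_mask(num_frames=16)
conditioned = frames * mask.view(-1, 1, 1, 1)   # keep only the given scene images
```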
Furthermore, the model can be readily extended to various tasks such as image-to-video animation and autoregressive video prediction. To conduct a comprehensive evaluation of this new generative task, the authors propose three assessing criteria for smooth and creative transition: temporal consistency, semantic similarity, and video-text semantic alignment.
Extensive experiments validate the effectiveness of the proposed approach over existing methods for generative transition and prediction, enabling the creation of story-level long videos.
1.5. VideoCrafter1: Open Diffusion Models for High-Quality Video Generation
Video generation has increasingly gained interest in both academia and industry. Although commercial tools can generate plausible videos, there is a limited number of open-source models available for researchers and engineers. In this work, the authors introduce two diffusion models for high-quality video generation, namely text-to-video (T2V) and image-to-video (I2V) models.
T2V models synthesize a video based on a given text input, while I2V models incorporate an additional image input. The proposed T2V model can generate realistic and cinematic-quality videos with a resolution of 1024 × 576, outperforming other open-source T2V models in terms of quality. The I2V model is designed to produce videos that strictly adhere to the provided reference image, preserving its content, structure, and style.
This model is the first open-source I2V foundation model capable of transforming a given image into a video clip while maintaining content preservation constraints. The authors believe that these open-source video generation models will contribute significantly to the technological advancements within the community.
1.6. Text-to-3D with Classifier Score Distillation
Text-to-3D generation has made remarkable progress recently, particularly with methods based on Score Distillation Sampling (SDS), which leverages pre-trained 2D diffusion models.
While the usage of classifier-free guidance is well acknowledged to be crucial for successful optimization, it is considered an auxiliary trick rather than the most essential component. In this paper, the authors re-evaluate the role of classifier-free guidance in score distillation and discover a surprising finding: the guidance alone is enough for effective text-to-3D generation tasks.
The authors name this method Classifier Score Distillation (CSD), which can be interpreted as using an implicit classification model for generation. This new perspective offers new insights for understanding existing techniques.
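To make the distinction concrete, here is a hedged sketch of how the classifier-free-guidance term can be separated out and used on its own in score distillation; the interface and weighting below are simplified assumptions, not the paper's exact formulation.

```python
import torch

def distillation_direction(eps_cond, eps_uncond, eps_noise,
                           guidance_scale=7.5, use_csd=False):
    """Gradient direction for distilling a 2D diffusion prior into a 3D representation.

    eps_cond / eps_uncond: the diffusion model's noise predictions with and without
    the text prompt; eps_noise: the Gaussian noise added to the rendered image.
    This is a simplified sketch of the decomposition, not the paper's implementation."""
    classifier_direction = eps_cond - eps_uncond              # implicit classifier score
    if use_csd:
        # Classifier Score Distillation: the guidance term alone drives the update.
        return guidance_scale * classifier_direction
    # Standard SDS with classifier-free guidance:
    eps_cfg = eps_uncond + guidance_scale * classifier_direction
    return eps_cfg - eps_noise

# The chosen direction is back-propagated through the rendered image into the
# 3D parameters, e.g. rendered.backward(gradient=weight_t * direction).
```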
The authors validate the effectiveness of CSD across a variety of text-to-3D tasks including shape generation, texture synthesis, and shape editing, achieving results superior to those of state-of-the-art methods.
1.7. CustomNet: Zero-shot Object Customization with Variable-Viewpoints in Text-to-Image Diffusion Models
Incorporating a customized object into image generation presents an attractive feature in text-to-image generation. However, existing optimization-based and encoder-based methods are hindered by drawbacks such as time-consuming optimization, insufficient identity preservation, and a prevalent copy-pasting effect.
To overcome these limitations, the authors introduce CustomNet, a novel object customization approach that explicitly incorporates 3D novel view synthesis capabilities into the object customization process. This integration facilitates the adjustment of spatial position relationships and viewpoints, yielding diverse outputs while effectively preserving object identity.
Moreover, the authors introduce delicate designs to enable location control and flexible background control through textual descriptions or specific user-defined images, overcoming the limitations of existing 3D novel view synthesis methods.
The authors further leverage a dataset construction pipeline that can better handle real-world objects and complex backgrounds. Equipped with these designs, the method facilitates zero-shot object customization without test-time optimization, offering simultaneous control over viewpoint, location, and background. As a result, CustomNet ensures enhanced identity preservation and generates diverse, harmonious outputs.
1.8. ZeroNVS: Zero-Shot 360-Degree View Synthesis from a Single Real Image
This paper introduces a 3D-aware diffusion model, ZeroNVS, for single-image novel view synthesis of in-the-wild scenes. While existing methods are designed for single objects with masked backgrounds, the authors propose new techniques to address the challenges introduced by in-the-wild multi-object scenes with complex backgrounds.
Specifically, they train a generative prior on a mixture of data sources that capture object-centric, indoor, and outdoor scenes. To address issues from data mixtures such as depth-scale ambiguity, they propose a novel camera conditioning parameterization and normalization scheme.
Further, they observe that Score Distillation Sampling (SDS) tends to truncate the distribution of complex backgrounds during the distillation of 360-degree scenes and propose “SDS anchoring” to improve the diversity of synthesized novel views.
The proposed model sets a new state-of-the-art result in LPIPS on the DTU dataset in the zero-shot setting, even outperforming methods specifically trained on DTU.
They further adapt the challenging Mip-NeRF 360 dataset as a new benchmark for single-image novel view synthesis and demonstrate strong performance in this setting.
2. Image Recognition
2.1. Battle of the Backbones: A Large-Scale Comparison of Pretrained Models across Computer Vision Tasks
Neural network-based computer vision systems are typically built on a backbone, a pretrained or randomly initialized feature extractor. Several years ago, the default option was an ImageNet-trained convolutional neural network.
However, the recent past has seen the emergence of countless backbones pretrained using various algorithms and datasets. While this abundance of choice has led to performance increases for a range of systems, it is difficult for practitioners to make informed decisions about which backbone to choose.
Battle of the Backbones (BoB) makes this choice easier by benchmarking a diverse suite of pretrained models, including vision-language models, those trained via self-supervised learning, and the Stable Diffusion backbone, across a diverse set of computer vision tasks ranging from classification to object detection to OOD generalization and more.
Furthermore, BoB sheds light on promising directions for the research community to advance computer vision by illuminating the strengths and weaknesses of existing approaches through a comprehensive analysis conducted on more than 1500 training runs.
While vision transformers (ViTs) and self-supervised learning (SSL) are increasingly popular, the authors find that convolutional neural networks pretrained in a supervised fashion on large training sets still perform best on most tasks among the models considered.
Moreover, in apples-to-apples comparisons on the same architectures and similarly sized pretraining datasets, they find that SSL backbones are highly competitive, indicating that future works should perform SSL pretraining with advanced architectures and larger pretraining datasets.
2.2. Grounding Visual Illusions in Language: Do Vision-Language Models Perceive Illusions Like Humans?
Vision-Language Models (VLMs) are trained on vast amounts of data captured by humans, emulating our understanding of the world. However, human perception of reality is not always faithful to the physical world, a phenomenon known as visual illusions.
This raises a key question: do VLMs have similar kinds of illusions as humans do, or do they faithfully learn to represent reality? To investigate this question, the authors build a dataset containing five types of visual illusions and formulate four tasks to examine visual illusions in state-of-the-art VLMs.
The findings show that although the overall alignment is low, larger models are closer to human perception and more susceptible to visual illusions. The dataset and initial findings will promote a better understanding of visual illusions in humans and machines and provide a stepping stone for future computational models that can better align humans and machines in perceiving and communicating about the shared visual world.
2.3. Idempotent Generative Network
This paper proposes a new approach for generative modeling based on training a neural network to be idempotent. An idempotent operator is one that can be applied sequentially without changing the result beyond the initial application, namely f(f(z))=f(z).
The proposed model f is trained to map a source distribution (e.g., Gaussian noise) to a target distribution (e.g., realistic images) using the following objectives:
Instances from the target distribution should map to themselves, namely f(x)=x. The target manifold is defined as the set of all instances that f maps to themselves.
Instances from the source distribution should map onto this target manifold. This is achieved by optimizing the idempotence term f(f(z))=f(z), which encourages the range of f(z) to lie on the target manifold. Under ideal assumptions, such a process provably converges to the target distribution.
This strategy results in a model capable of generating an output in one step while maintaining a consistent latent space and also allowing sequential applications for refinement. Additionally, the authors find that by processing inputs from both target and source distributions, the model adeptly projects corrupted or modified data back onto the target manifold. This work is a first step towards a “global projector” that enables projecting any input into a target data distribution.
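Below is a simplified PyTorch sketch of the two objectives described above; the actual method involves additional terms and gradient-handling details, so treat this only as an illustration of the idempotence idea.

```python
import torch
import torch.nn as nn

# Toy illustration of the reconstruction and idempotence objectives; the network,
# sizes, and the inner-pass detachment are assumptions, not the paper's code.
f = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 64))  # toy generator
opt = torch.optim.Adam(f.parameters(), lr=1e-4)

x = torch.randn(32, 64)   # stand-in for real data (target distribution)
z = torch.randn(32, 64)   # source distribution (Gaussian noise)

fx = f(x)
fz = f(z)
ffz = f(fz.detach())      # second application; detaching the inner pass is an
                          # assumption made here to keep the sketch simple

loss_rec = (fx - x).pow(2).mean()                # f(x) = x: real data are fixed points
loss_idem = (ffz - fz.detach()).pow(2).mean()    # f(f(z)) = f(z): range lies on the manifold

loss = loss_rec + loss_idem
opt.zero_grad()
loss.backward()
opt.step()

# Inference is a single forward pass: sample = f(torch.randn(1, 64))
```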
3. Image Segmentation
3.1. LLaVA-Interactive: An All-in-One Demo for Image Chat, Segmentation, Generation and Editing
LLaVA-Interactive is a research prototype for multimodal human-AI interaction. The system can have multi-turn dialogues with human users by taking multimodal user inputs and generating multimodal responses.
Importantly, LLaVA-Interactive goes beyond language prompts: visual prompts are also supported to align human intent during the interaction. The development of LLaVA-Interactive is extremely cost-efficient, as the system combines three multimodal skills of pre-built AI models without additional model training: visual chat from LLaVA, image segmentation from SEEM, and image generation and editing from GLIGEN.
A diverse set of application scenarios is presented to demonstrate the promises of LLaVA-Interactive and to inspire future research in multimodal interactive systems.
4. Video Analysis & Understanding
4.1. MM-VID: Advancing Video Understanding with GPT-4V(ision)
This paper presents MM-VID, an integrated system that harnesses the capabilities of GPT-4V, combined with specialized tools in vision, audio, and speech, to facilitate advanced video understanding.
MM-VID is designed to address the challenges posed by long-form videos and intricate tasks such as reasoning within hour-long content and grasping storylines spanning multiple episodes. MM-VID uses video-to-script generation with GPT-4V to transcribe multimodal elements into a long textual script.
The generated script details character movements, actions, expressions, and dialogues, paving the way for large language models (LLMs) to achieve video understanding. This enables advanced capabilities, including audio description, character identification, and multimodal high-level comprehension.
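A hypothetical outline of such a video-to-script pipeline might look like the sketch below; every helper function is a placeholder, not MM-VID's actual tooling or prompts.

```python
from typing import List

def sample_frames(video_path: str, every_n_seconds: int = 2) -> List[bytes]:
    """Placeholder: extract frames, e.g. with ffmpeg or OpenCV."""
    raise NotImplementedError

def transcribe_audio(video_path: str) -> str:
    """Placeholder: automatic speech recognition for the dialogue track."""
    raise NotImplementedError

def vision_llm(frames: List[bytes], prompt: str) -> str:
    """Placeholder: a GPT-4V-style call that describes the sampled frames."""
    raise NotImplementedError

def llm(prompt: str) -> str:
    """Placeholder: a text-only LLM call for reasoning over the script."""
    raise NotImplementedError

def video_to_script(video_path: str) -> str:
    """Turn a clip into a textual script combining visual description and dialogue."""
    frames = sample_frames(video_path)
    visual = vision_llm(frames, "Describe characters, actions, expressions, and scene changes.")
    dialogue = transcribe_audio(video_path)
    return f"VISUAL:\n{visual}\n\nDIALOGUE:\n{dialogue}"

def answer_about_video(video_path: str, question: str) -> str:
    """Answer a question about a long video by reasoning over its generated script."""
    script = video_to_script(video_path)
    return llm(f"Video script:\n{script}\n\nQuestion: {question}\nAnswer:")
```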
Experimental results demonstrate the effectiveness of MM-VID in handling distinct video genres with various video lengths. Additionally, the authors showcase its potential when applied to interactive environments, such as video games and graphical user interfaces.
Are you looking to start a career in data science and AI but do not know how? I offer data science mentoring sessions and long-term career mentoring:
Mentoring sessions: https://lnkd.in/dXeg3KPW
Long-term mentoring: https://lnkd.in/dtdUYBrM