Computer vision, a field of artificial intelligence focused on enabling machines to interpret and understand the visual world, is rapidly evolving with groundbreaking research and technological advancements.
Every week, top-tier academic conferences and journals showcase innovative research in computer vision, presenting exciting breakthroughs in subfields such as image recognition, vision model optimization, generative adversarial networks (GANs), image segmentation, video analysis, and more.
In this article, we will provide a comprehensive overview of the most significant papers published in the second week of July 2023, highlighting the latest research and advancements in computer vision. Whether you’re a researcher, practitioner, or enthusiast, this article will provide valuable insights into the state-of-the-art techniques and tools in the field of computer vision.
Table of Contents:
Image Recognition
Vision Model Optimization
Image Segmentation
Video Analysis & Generation
Image Generation
Looking to start a career in data science and AI but don't know how? I offer data science mentoring sessions and long-term career mentoring:
Mentoring sessions: https://lnkd.in/dXeg3KPW
Long-term mentoring: https://lnkd.in/dtdUYBrM
All the resources and tools you need to teach yourself Data Science for free!
The best interactive roadmaps for Data Science roles. With links to free learning resources. Start here: https://aigents.co/learn/roadmaps/intro
The search engine for Data Science learning resources. 100K handpicked articles and tutorials, with GPT-powered summaries and explanations. https://aigents.co/learn
Teach yourself Data Science with the help of an AI tutor (powered by GPT-4). https://community.aigents.co/spaces/10362739/
1. Image Recognition
1.1. GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest
Instruction tuning large language models (LLMs) on image-text pairs has achieved unprecedented vision-language multimodal abilities. However, their vision-language alignments are built only at the image level; the lack of region-level alignment limits their progress toward fine-grained multimodal understanding.
In this paper, the authors propose instruction tuning on regions of interest. The key design is to reformulate bounding boxes as spatial instructions. The interleaved sequence of visual features extracted from the spatial instructions and the language embeddings is fed to the LLM, which is trained on region-text data transformed into the instruction-tuning format. The resulting region-level vision-language model, termed GPT4RoI, brings brand-new conversational and interactive experiences beyond image-level understanding.
Controllability: Users can interact with the model through both language and spatial instructions to flexibly adjust the level of detail of the question.
Capacities: The model supports not only single-region but also multi-region spatial instructions. This unlocks more region-level multimodal capacities, such as detailed region captioning and complex region reasoning.
Composition: Any off-the-shelf object detector can serve as a spatial instruction provider, making it possible to mine informative object attributes from the model, like color, shape, material, action, and relation to other objects.
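To make the notion of a spatial instruction concrete, here is a minimal sketch of how a bounding box could be turned into a region placeholder whose slot is filled with RoI-pooled visual features. The placeholder format, the 1/14 feature stride, and the assumption that region features are already projected to the LLM embedding width are illustrative choices, not the authors' actual interface.

```python
# Hypothetical sketch of GPT4RoI-style spatial instructions (illustrative names only).
import torch
from torchvision.ops import roi_align

def build_spatial_instruction(question, boxes, feature_map, embed_text):
    """question: e.g. "What is <region1> doing next to <region2> ?"
    boxes: float Tensor[K, 4] in (x1, y1, x2, y2) image coordinates.
    feature_map: Tensor[1, C, H, W] from the vision encoder.
    embed_text: callable mapping a token string to a Tensor[C] embedding.
    Assumes region features are already at the LLM embedding width (projection omitted)."""
    # Pool one feature vector per region of interest (assumed 1/14 feature stride).
    region_feats = roi_align(feature_map, [boxes], output_size=1, spatial_scale=1.0 / 14)
    region_feats = region_feats.flatten(1)            # Tensor[K, C]

    # Interleave language-token embeddings with region features at the placeholders.
    sequence = []
    for token in question.split():
        if token.startswith("<region"):
            idx = int("".join(ch for ch in token if ch.isdigit())) - 1
            sequence.append(region_feats[idx])         # visual feature in place of the token
        else:
            sequence.append(embed_text(token))         # ordinary language embedding
    return torch.stack(sequence)                       # input sequence for the LLM
```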
Here are the links to the project page and the paper:
Project Page: https://github.com/jshilong/GPT4RoI
Paper: https://arxiv.org/pdf/2307.03601.pdf
1.2. Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution
The ubiquitous and demonstrably suboptimal choice of resizing images to a fixed resolution before processing them with computer vision models has not yet been successfully challenged. However, models such as the Vision Transformer (ViT) offer flexible sequence-based modeling, and hence varying input sequence lengths. The authors take advantage of this with NaViT (Native Resolution ViT), which uses sequence packing during training to process inputs of arbitrary resolutions and aspect ratios.
Alongside flexible model usage, they demonstrate improved training efficiency for large-scale supervised and contrastive image-text pretraining. NaViT can be efficiently transferred to standard tasks such as image and video classification, object detection, and semantic segmentation and leads to improved results on robustness and fairness benchmarks. At inference time, the input resolution flexibility can be used to smoothly navigate the test-time cost-performance trade-off. They believe that NaViT marks a departure from the standard, CNN-designed, input and modeling pipeline used by most computer vision models, and represents a promising direction for ViTs.
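To illustrate the packing idea, here is a simplified sketch that patchifies images of different resolutions and greedily packs them into fixed-length sequences. The patch size, the first-fit packing strategy, and the omission of attention masking and token dropping are simplifications for illustration, not the paper's implementation.

```python
# A minimal sketch of NaViT-style sequence packing (simplified, illustrative only).
import torch

PATCH = 16          # assumed patch size
MAX_TOKENS = 256    # packed sequence length

def patchify(img):
    """img: Tensor[C, H, W] with H, W divisible by PATCH -> Tensor[N, C*PATCH*PATCH]."""
    c, h, w = img.shape
    patches = img.unfold(1, PATCH, PATCH).unfold(2, PATCH, PATCH)  # C, H/P, W/P, P, P
    return patches.permute(1, 2, 0, 3, 4).reshape(-1, c * PATCH * PATCH)

def pack(images):
    """Greedy first-fit: concatenate patch sequences from several images into
    fixed-length packed examples, remembering which image each token came from."""
    packs, ids = [], []
    cur_tokens, cur_ids = [], []
    for i, img in enumerate(images):
        p = patchify(img)
        if cur_tokens and sum(t.shape[0] for t in cur_tokens) + p.shape[0] > MAX_TOKENS:
            packs.append(torch.cat(cur_tokens)); ids.append(torch.cat(cur_ids))
            cur_tokens, cur_ids = [], []
        cur_tokens.append(p)
        cur_ids.append(torch.full((p.shape[0],), i))
    if cur_tokens:
        packs.append(torch.cat(cur_tokens)); ids.append(torch.cat(cur_ids))
    # Attention should later be masked so tokens only attend within their own image id.
    return packs, ids
```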
Paper: https://arxiv.org/pdf/2307.06304.pdf
2. Vision Model Optimization
2.1. SVIT: Scaling up Visual Instruction Tuning
With the emergence of foundation models, large language and vision models are being integrated to acquire multimodal abilities such as visual captioning, dialogue, and question answering. Although existing multimodal models show impressive visual understanding and reasoning performance, their limits remain largely under-explored due to the scarcity of high-quality instruction tuning data.
To push the limits of multimodal capability, the authors Scale up Visual Instruction Tuning (SVIT) by constructing a dataset of 3.2 million visual instruction tuning examples, including 1.6M conversation question-answer (QA) pairs, 1.6M complex reasoning QA pairs, and 106K detailed image descriptions. Beyond its volume, the proposed dataset also features high quality and rich diversity, as it is generated by prompting GPT-4 with the abundant manual annotations of the images. They empirically verify that training multimodal models on SVIT can significantly improve multimodal performance in terms of visual perception, reasoning, and planning.
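To make the data format concrete, here is a hypothetical example of what a single SVIT-style instruction-tuning record might look like; the field names and values are purely illustrative, not the released schema.

```python
# A hypothetical SVIT-style record (illustrative only, not the released schema).
example = {
    "image_id": "coco_000000123456",          # image with rich manual annotations
    "task": "complex_reasoning",              # or "conversation" / "detailed_description"
    "conversations": [
        {"from": "human", "value": "Why might the cyclist be wearing a reflective vest?"},
        {"from": "gpt", "value": "The street appears dimly lit, so the vest likely keeps the cyclist visible to drivers."},
    ],
}
```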
Paper: https://arxiv.org/pdf/2307.04087.pdf
3. Image Segmentation
3.1. Semantic-SAM: Segment and Recognize Anything at Any Granularity
In this paper, the authors introduce Semantic-SAM, a universal image segmentation model that enables segmenting and recognizing anything at any desired granularity. The model offers two key advantages: semantic awareness and granularity abundance. To achieve semantic awareness, the authors consolidate multiple datasets across three granularities and introduce decoupled classification for objects and parts. This allows the model to capture rich semantic information. For the multi-granularity capability, they propose a multi-choice learning scheme during training, enabling each click to generate masks at multiple levels that correspond to multiple ground-truth masks. Notably, this work represents the first attempt to jointly train a model on SA-1B, generic, and part segmentation datasets. Experimental results and visualizations demonstrate that the model successfully achieves semantic awareness and granularity abundance. Furthermore, combining SA-1B training with other segmentation tasks, such as panoptic and part segmentation, leads to performance improvements.
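The multi-choice learning scheme can be viewed as matching the several masks predicted for a single click against the multiple ground-truth masks at different granularities. The hedged sketch below uses Hungarian matching with a simple per-pixel BCE cost; this is an illustrative stand-in rather than the authors' exact training loss.

```python
# Illustrative many-mask matching for one click (not the authors' exact loss).
import torch
from scipy.optimize import linear_sum_assignment

def match_click_masks(pred_masks, gt_masks):
    """pred_masks: Tensor[K, H, W] logits for one click (K granularity proposals).
    gt_masks:   Tensor[G, H, W] binary ground-truth masks (e.g. part, object, whole).
    Returns matched (pred_idx, gt_idx) pairs minimizing a per-pixel BCE cost."""
    K, G = pred_masks.shape[0], gt_masks.shape[0]
    cost = torch.zeros(K, G)
    for i in range(K):
        for j in range(G):
            cost[i, j] = torch.nn.functional.binary_cross_entropy_with_logits(
                pred_masks[i], gt_masks[j].float())
    rows, cols = linear_sum_assignment(cost.numpy())
    return list(zip(rows.tolist(), cols.tolist()))  # only matched pairs receive a mask loss
```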
Paper: https://huggingface.co/papers/2307.04767
4. Video Analysis & Generation
4.1. EgoVLPv2: Egocentric Video-Language Pre-training with Fusion in the Backbone
Video-language pre-training (VLP) has become increasingly important due to its ability to generalize to various vision and language tasks. However, existing egocentric VLP frameworks utilize separate video and language encoders and learn task-specific cross-modal information only during fine-tuning, limiting the development of a unified system.
In this work, the authors introduce the second generation of egocentric video-language pre-training (EgoVLPv2), a significant improvement over the previous generation, by incorporating cross-modal fusion directly into the video and language backbones. EgoVLPv2 learns a strong video-text representation during pre-training and reuses the cross-modal attention modules to support different downstream tasks in a flexible and efficient manner, reducing fine-tuning costs. Moreover, the proposed fusion-in-the-backbone strategy is more lightweight and compute-efficient than stacking additional fusion-specific layers. Extensive experiments on a wide range of VL tasks demonstrate the effectiveness of EgoVLPv2, which achieves consistent state-of-the-art performance over strong baselines across all downstream tasks.
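One way to picture fusion in the backbone is a cross-attention block placed inside the encoder layers that can also be bypassed, so the same weights serve both fused and separate-encoder use. Below is a minimal sketch of such a block with illustrative names; it is not the authors' architecture or code.

```python
# Illustrative gated cross-modal fusion block (names and design are assumptions).
import torch
import torch.nn as nn

class FusionBlock(nn.Module):
    """Cross-modal attention inserted inside a backbone layer (sketch only)."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, video_tokens, text_tokens, fuse=True):
        if not fuse:
            return video_tokens                      # behave like a unimodal encoder layer
        q = self.norm(video_tokens)
        attended, _ = self.cross_attn(q, text_tokens, text_tokens)
        return video_tokens + attended               # residual cross-modal fusion
```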
Project Page: https://shramanpramanick.github.io/EgoVLPv2/
Paper: https://arxiv.org/pdf/2307.05463.pdf
4.2. Test-Time Training on Video Streams
Prior work has established test-time training (TTT) as a general framework for further improving a trained model at test time. Before making a prediction on each test instance, the model is trained on that same instance using a self-supervised task, such as image reconstruction with masked autoencoders. The authors extend TTT to the streaming setting, where multiple test instances, video frames in this case, arrive in temporal order.
The extension is online TTT: The current model is initialized from the previous model, then trained on the current frame and a small window of frames immediately before. Online TTT significantly outperforms the fixed-model baseline for four tasks, on three real-world datasets. The relative improvement is 45% and 66% for instance and panoptic segmentation. Surprisingly, online TTT also outperforms its offline variant that accesses more information, training on all frames from the entire test video regardless of temporal order. This differs from previous findings using synthetic videos. They conceptualize locality as the advantage of online over offline TTT. They analyze the role of locality with ablations and a theory based on the bias-variance trade-off.
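Conceptually, online TTT carries the adapted weights forward from frame to frame and takes a few self-supervised gradient steps on a small window of recent frames before predicting on the current one. A minimal sketch of that loop follows; the self-supervised loss and the main-task prediction function are placeholders supplied by the caller, and the optimizer choice and step counts are illustrative.

```python
# A minimal sketch of online test-time training on a video stream (illustrative).
import torch

def online_ttt(model, frames, self_supervised_loss, predict, window=3, lr=1e-4, steps=1):
    """frames: iterable of Tensor[C, H, W] arriving in temporal order.
    self_supervised_loss(model, clip) -> differentiable loss (e.g. masked reconstruction).
    predict(model, frame) -> main-task output. Both are user-supplied placeholders."""
    history, outputs = [], []
    for frame in frames:
        history.append(frame)
        clip = torch.stack(history[-window:])         # current frame + a short past window
        opt = torch.optim.SGD(model.parameters(), lr=lr)
        for _ in range(steps):                        # model weights carry over between frames
            loss = self_supervised_loss(model, clip)
            opt.zero_grad(); loss.backward(); opt.step()
        outputs.append(predict(model, frame))         # prediction with the freshly adapted model
    return outputs
```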
Paper: https://arxiv.org/pdf/2307.05014.pdf
4.3. InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation
This paper introduces InternVid, a large-scale video-centric multimodal dataset that enables learning powerful and transferable video-text representations for multimodal understanding and generation. The InternVid dataset contains over 7 million videos lasting nearly 760K hours, yielding 234M video clips accompanied by detailed descriptions of a total of 4.1B words.
The core contribution is a scalable approach to autonomously build a high-quality video-text dataset with large language models (LLMs), showcasing its efficacy in learning video-language representations at scale. Specifically, they utilize a multi-scale approach to generate video-related descriptions. Furthermore, they introduce ViCLIP, a video-text representation learning model based on ViT-L. Trained on InternVid via contrastive learning, this model demonstrates leading zero-shot action recognition and competitive video retrieval performance. Beyond basic video understanding tasks like recognition and retrieval, the dataset and model have broad applications. They are particularly beneficial for generating interleaved video-text data for learning a video-centric dialogue system and for advancing video-to-text and text-to-video generation research. These resources provide a tool for researchers and practitioners interested in multimodal video understanding and generation.
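ViCLIP is learned with video-text contrastive training. The sketch below shows the generic symmetric InfoNCE objective over a batch of paired clip and caption embeddings; this is the standard CLIP-style loss rather than the authors' exact implementation, and the temperature value is illustrative.

```python
# Generic CLIP-style contrastive loss for paired video/text embeddings (sketch).
import torch
import torch.nn.functional as F

def clip_style_loss(video_emb, text_emb, temperature=0.07):
    """video_emb, text_emb: Tensor[B, D] embeddings of paired clips and captions."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.t() / temperature                  # B x B similarity matrix
    targets = torch.arange(v.shape[0], device=v.device)
    # Symmetric cross-entropy: match each clip to its caption and vice versa.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2
```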
Paper: https://arxiv.org/pdf/2307.06942.pdf
4.4. Animate-A-Story: Storytelling with Retrieval-Augmented Video Generation
Generating videos for visual storytelling can be a tedious and complex process that typically requires either live-action filming or graphics animation rendering. To bypass these challenges, the key idea is to utilize the abundance of existing video clips and synthesize a coherent storytelling video by customizing their appearances. The authors achieve this with a framework comprising two functional modules: (i) Motion Structure Retrieval, which provides video candidates with a desired scene or motion context described by query texts, and (ii) Structure-Guided Text-to-Video Synthesis, which generates plot-aligned videos under the guidance of motion structure and text prompts.
For the first module, they leverage an off-the-shelf video retrieval system and extract video depths as motion structure. For the second module, they propose a controllable video generation model that offers flexible controls over structure and characters. The videos are synthesized by following the structural guidance and appearance instruction. To ensure visual consistency across clips, they propose an effective concept personalization approach, which allows the desired character identities to be specified through text prompts. Extensive experiments demonstrate that their approach exhibits significant advantages over various existing baselines.
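A rough orchestration of the two modules might look like the sketch below. All callables (retrieve_clip, estimate_depth, generate_video) are hypothetical placeholders standing in for the retrieval system, depth estimator, and structure-guided generator, not the authors' actual components.

```python
# Hypothetical orchestration of the retrieval-augmented storytelling pipeline.
def animate_story(plot_sentences, retrieve_clip, estimate_depth, generate_video, character_prompt):
    """plot_sentences: list of text prompts, one per shot.
    retrieve_clip, estimate_depth, generate_video: user-supplied callables (placeholders)."""
    shots = []
    for text in plot_sentences:
        clip = retrieve_clip(text)                # (i) motion structure retrieval
        structure = estimate_depth(clip)          # per-frame depth used as motion structure
        # (ii) structure-guided text-to-video synthesis with a personalized character
        shots.append(generate_video(text + ", " + character_prompt, structure))
    return shots                                   # shots are then concatenated into the story video
```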
Paper: https://arxiv.org/pdf/2307.06940.pdf
5. Image Generation
5.1. Sketch-A-Shape: Zero-Shot Sketch-to-3D Shape Generation
Significant progress has recently been made in creative applications of large pre-trained models for downstream tasks in 3D vision, such as text-to-shape generation. This motivates researchers to investigate how these pre-trained models can be used effectively to generate 3D shapes from sketches, which has largely remained an open challenge due to the limited sketch-shape paired datasets and the varying level of abstraction in the sketches.
The authors discovered that conditioning a 3D generative model on the features (obtained from a frozen, large pre-trained vision model) of synthetic renderings during training enables them to effectively generate 3D shapes from sketches at inference time. This suggests that the pre-trained vision model's features carry semantic signals that are resilient to domain shifts, i.e., the model can be trained on RGB renderings only yet generalize to sketches at inference time. They conduct a comprehensive set of experiments investigating different design factors and demonstrate the effectiveness of their straightforward approach for generating multiple 3D shapes per input sketch, regardless of its level of abstraction, without requiring any paired datasets during training.
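The key mechanism is conditioning the generator on features from a frozen, pre-trained vision encoder, computed from RGB renderings during training and from sketches at inference. The sketch below uses a frozen torchvision ResNet purely as a stand-in encoder and leaves the shape generator as a placeholder; neither matches the authors' actual models.

```python
# Illustrative conditioning on frozen pre-trained vision features (stand-in encoder).
import torch
from torchvision.models import resnet50, ResNet50_Weights

encoder = resnet50(weights=ResNet50_Weights.DEFAULT).eval()   # frozen pre-trained vision model
for p in encoder.parameters():
    p.requires_grad_(False)

def condition_features(images):
    """images: Tensor[B, 3, 224, 224], synthetic renderings at training time
    or sketches at inference time; the encoder is never fine-tuned."""
    with torch.no_grad():
        feats = encoder(images)   # Tensor[B, 1000]; a real system would use pooled backbone features
    return feats

# Training:   shape_generator(condition_features(renderings))  -> 3D shape (placeholder generator)
# Inference:  shape_generator(condition_features(sketches))    -> generalizes via robust features
```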
Paper: https://arxiv.org/pdf/2307.03869.pdf
5.2. AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning
With the advance of text-to-image models (e.g., Stable Diffusion) and corresponding personalization techniques such as DreamBooth and LoRA, everyone can manifest their imagination into high-quality images at an affordable cost. Subsequently, there is a great demand for image animation techniques to further combine generated static images with motion dynamics.
In this paper, the authors propose a practical framework to animate most of the existing personalized text-to-image models once and for all, saving efforts in model-specific tuning. At the core of the proposed framework is to insert a newly initialized motion modeling module into the frozen text-to-image model and train it on video clips to distill reasonable motion priors. Once trained, by simply injecting this motion modeling module, all personalized versions derived from the same base T2I readily become text-driven models that produce diverse and personalized animated images. They conduct evaluations on several public representative personalized text-to-image models across anime pictures and realistic photographs and demonstrate that the proposed framework helps these models generate temporally smooth animation clips while preserving the domain and diversity of their outputs.
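The motion modeling module is essentially temporal self-attention applied across the frame axis while the image layers of the T2I model stay frozen. Below is a minimal sketch of such a module with illustrative tensor shapes and design choices; it is not the authors' released code or the diffusers API.

```python
# Illustrative temporal attention module inserted into a frozen T2I model (sketch).
import torch
import torch.nn as nn

class MotionModule(nn.Module):
    """Temporal self-attention over the frame axis; spatial layers remain frozen."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        # x: Tensor[B, T, N, C] latent features for T frames, N spatial positions.
        b, t, n, c = x.shape
        h = x.permute(0, 2, 1, 3).reshape(b * n, t, c)   # attend across time per spatial location
        out, _ = self.attn(self.norm(h), self.norm(h), self.norm(h))
        out = out.reshape(b, n, t, c).permute(0, 2, 1, 3)
        return x + out                                    # residual temporal attention across frames
```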
Code and Pre-trained Weights: https://animatediff.github.io/
Paper: https://arxiv.org/pdf/2307.04725.pdf
5.3. Collaborative Score Distillation for Consistent Visual Synthesis
Generative priors of large-scale text-to-image diffusion models enable a wide range of new generation and editing applications on diverse visual modalities. However, when adapting these priors to complex visual modalities, often represented as multiple images (e.g., video), achieving consistency across a set of images is challenging. In this paper, the authors address this challenge with a novel method, Collaborative Score Distillation (CSD).
CSD is based on the Stein Variational Gradient Descent (SVGD). Specifically, they propose to consider multiple samples as "particles" in the SVGD update and combine their score functions to distill generative priors over a set of images synchronously. Thus, CSD facilitates the seamless integration of information across 2D images, leading to a consistent visual synthesis across multiple samples. They show the effectiveness of CSD in a variety of tasks, encompassing the visual editing of panorama images, videos, and 3D scenes. The results underline the competency of CSD as a versatile method for enhancing inter-sample consistency, thereby broadening the applicability of text-to-image diffusion models.
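The SVGD update at the heart of CSD treats the images as interacting particles: each particle's update mixes its own score (the gradient of the log-density) with those of the others through a kernel, plus a repulsive kernel-gradient term. Below is a minimal, generic SVGD step with an RBF kernel; the diffusion-model score and the distillation specifics are abstracted into score_fn, so this is a sketch of the underlying update, not the paper's method.

```python
# Generic SVGD particle update with an RBF kernel (illustrative sketch).
import torch

def svgd_step(particles, score_fn, step_size=0.1, bandwidth=1.0):
    """particles: Tensor[N, D]; score_fn(x) returns d log p(x)/dx with shape [N, D]."""
    n = particles.shape[0]
    diff = particles.unsqueeze(1) - particles.unsqueeze(0)       # diff[i, j] = x_i - x_j
    sq_dist = (diff ** 2).sum(-1)                                 # [N, N]
    k = torch.exp(-sq_dist / (2 * bandwidth ** 2))                # RBF kernel matrix
    grad_k = (diff * k.unsqueeze(-1) / bandwidth ** 2).sum(1)     # sum_j grad_{x_j} k(x_j, x_i)
    phi = (k @ score_fn(particles) + grad_k) / n                  # SVGD direction per particle
    return particles + step_size * phi
```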
Paper: https://arxiv.org/pdf/2307.04787.pdf
5.4. AutoDecoding Latent 3D Diffusion Models
This paper presents a novel approach to the generation of static and articulated 3D assets that has a 3D auto-decoder at its core. The 3D auto-decoder framework embeds properties learned from the target dataset in the latent space, which can then be decoded into a volumetric representation for rendering view-consistent appearance and geometry.
They then identify the appropriate intermediate volumetric latent space and introduce robust normalization and de-normalization operations to learn a 3D diffusion from 2D images or monocular videos of rigid or articulated objects. This approach is flexible enough to use either existing camera supervision or no camera information at all, instead learning it efficiently during training. The evaluations demonstrate that the generated results outperform state-of-the-art alternatives on various benchmark datasets and metrics, including multi-view image datasets of synthetic objects, real in-the-wild videos of moving people, and a large-scale, real video dataset of static objects.
Paper: https://arxiv.org/pdf/2307.05445.pdf
5.5. Efficient 3D Articulated Human Generation with Layered Surface Volumes
Access to high-quality and diverse 3D articulated digital human assets is crucial in various applications, ranging from virtual reality to social platforms. Generative approaches, such as 3D generative adversarial networks (GANs), are rapidly replacing laborious manual content creation tools. However, existing 3D GAN frameworks typically rely on scene representations that leverage either template meshes, which are fast but offer limited quality, or volumes, which offer high capacity but are slow to render, thereby limiting the 3D fidelity in GAN settings.
In this work, the authors introduce layered surface volumes (LSVs) as a new 3D object representation for articulated digital humans. LSVs represent a human body using multiple textured mesh layers around a conventional template. These layers are rendered using alpha compositing with fast differentiable rasterization, and they can be interpreted as a volumetric representation that allocates its capacity to a manifold of finite thickness around the template. Unlike conventional single-layer templates, which struggle to represent fine off-surface details like hair or accessories, the surface volumes naturally capture such details. LSVs can be articulated, and they exhibit exceptional efficiency in GAN settings, where a 2D generator learns to synthesize the RGBA textures for the individual layers. Trained on unstructured, single-view 2D image datasets, the LSV-GAN generates high-quality and view-consistent 3D articulated digital humans without the need for view-inconsistent 2D upsampling networks.
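Rendering the layered surface volumes amounts to rasterizing each textured layer to an RGBA image and alpha-compositing them front to back. The compositing step is sketched below; the differentiable rasterizer that produces the per-layer RGBA images is assumed to exist upstream.

```python
# Front-to-back alpha compositing of rasterized RGBA layers (illustrative sketch).
import torch

def composite_layers(rgba_layers):
    """rgba_layers: Tensor[L, 4, H, W], ordered front (near) to back (far),
    as produced by rasterizing each textured mesh layer (rasterizer not shown)."""
    out = torch.zeros(3, *rgba_layers.shape[-2:])
    transmittance = torch.ones(1, *rgba_layers.shape[-2:])
    for layer in rgba_layers:                        # "over" compositing, front to back
        rgb, alpha = layer[:3], layer[3:4]
        out = out + transmittance * alpha * rgb
        transmittance = transmittance * (1 - alpha)
    return out                                        # composited RGB image
```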
Paper: https://arxiv.org/pdf/2307.05462.pdf
5.6. T2I-CompBench: A Comprehensive Benchmark for Open-world Compositional Text-to-image Generation
Despite the stunning ability of recent text-to-image models to generate high-quality images, current approaches often struggle to effectively compose objects with different attributes and relationships into a complex and coherent scene. The authors propose T2I-CompBench, a comprehensive benchmark for open-world compositional text-to-image generation, consisting of 6,000 compositional text prompts from 3 categories (attribute binding, object relationships, and complex compositions) and 6 sub-categories (color binding, shape binding, texture binding, spatial relationships, non-spatial relationships, and complex compositions).
They further propose several evaluation metrics specifically designed to evaluate compositional text-to-image generation. They also introduce a new approach, Generative mOdel fine-tuning with Reward-driven Sample selection (GORS), to boost the compositional text-to-image generation abilities of pretrained text-to-image models. Extensive experiments and evaluations are conducted to benchmark previous methods on T2I-CompBench and to validate the effectiveness of the proposed evaluation metrics and the GORS approach.
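GORS can be pictured as generating images for compositional prompts, scoring them with a reward that measures compositional alignment, and fine-tuning on the best-scoring samples with reward weighting. The sketch below shows only that selection-and-weighting loop; generate, reward, and finetune_loss are hypothetical placeholders, and the threshold is illustrative, not the authors' implementation.

```python
# Illustrative reward-driven sample selection and fine-tuning step (placeholders throughout).
def gors_finetune_step(prompts, generate, reward, finetune_loss, optimizer, threshold=0.7):
    """generate(prompt) -> image; reward(image, prompt) -> float alignment score;
    finetune_loss(image, prompt) -> differentiable loss. All three are user-supplied."""
    selected = []
    for prompt in prompts:
        image = generate(prompt)
        score = reward(image, prompt)
        if score >= threshold:                      # keep only well-aligned generations
            selected.append((image, prompt, score))
    if not selected:
        return None
    # Reward-weighted fine-tuning on the selected samples.
    loss = sum(s * finetune_loss(img, p) for img, p, s in selected) / len(selected)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return float(loss)
```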
Project page: https://karine-h.github.io/T2I-CompBench/
Paper: https://arxiv.org/pdf/2307.06350.pdf
Looking to start a career in data science and AI but don't know how? I offer data science mentoring sessions and long-term career mentoring:
Mentoring sessions: https://lnkd.in/dXeg3KPW
Long-term mentoring: https://lnkd.in/dtdUYBrM