Top Computer Vision Papers of the Week (July 3 to July 9, 2023)
Stay Updated With Recent Computer Vision Research Output
Computer vision, a field of artificial intelligence focused on enabling machines to interpret and understand the visual world, is rapidly evolving with groundbreaking research and technological advancements.
Every week, top-tier academic conferences and journals showcase innovative research in computer vision, presenting exciting breakthroughs in subfields such as image recognition, vision model optimization, generative adversarial networks (GANs), image segmentation, video analysis, and more.
In this article, we will provide a comprehensive overview of the most significant papers published in the first week of July 2023, highlighting the latest research and advancements in computer vision. Whether you’re a researcher, practitioner, or enthusiast, this article will provide valuable insights into the state-of-the-art techniques and tools in the field of computer vision.
Table of Contents:
Image Recognition
Image Segmentation
Video Analysis
Image & Video Editing
Image Generation
Action Recognition
If you like the article and would like not to miss any future articles and letters make sure to subscribe from the button below.
Looking to start a career in data science and AI and do not know how? I offer data science mentoring sessions and long-term career mentoring:
Mentoring sessions: https://lnkd.in/dXeg3KPW
Long-term mentoring: https://lnkd.in/dtdUYBrM
1. Image Recognition
1.1. Hardwiring ViT Patch Selectivity into CNNs using Patch Mixing
Vision transformers (ViTs) have significantly changed the computer vision landscape and have periodically exhibited superior performance in vision tasks compared to convolutional neural networks (CNNs). Although the jury is still out on which model type is superior, each has unique inductive biases that shape their learning and generalization performance.
For example, ViTs have interesting properties with respect to early layer non-local feature dependence, as well as self-attention mechanisms which enhance learning flexibility, enabling them to ignore out-of-context image information more effectively. The authors hypothesize that this power to ignore out-of-context information (which they name patch selectivity) while integrating in-context information in a non-local manner in early layers, allows ViTs to more easily handle occlusion.
In this study, the aim is to see whether CNNs can simulate this ability of patch selectivity by effectively hardwiring this inductive bias using Patch Mixing data augmentation, which consists of inserting patches from another image onto a training image and interpolating labels between the two image classes.
Project Page: https://arielnlee.github.io/PatchMixing/
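To make the idea concrete, below is a minimal PyTorch sketch of a Patch-Mixing-style augmentation, assuming square, non-overlapping patches and soft label interpolation; the function name and its parameters are illustrative and not the authors' implementation.

```python
import torch

def patch_mix(images, labels, num_classes, patch_size=16, mix_ratio=0.3):
    """Patch-Mixing-style augmentation (illustrative sketch, not the official code).

    Replaces a random subset of non-overlapping patches in each image with the
    corresponding patches from another image in the batch, and interpolates the
    one-hot labels in proportion to the number of replaced patches.
    """
    b, c, h, w = images.shape
    perm = torch.randperm(b)                       # partner image for each sample
    gh, gw = h // patch_size, w // patch_size      # patch grid dimensions
    num_patches = gh * gw
    num_mixed = int(mix_ratio * num_patches)

    mixed = images.clone()
    for i in range(b):
        # choose which patches to replace for sample i
        idx = torch.randperm(num_patches)[:num_mixed]
        for p in idx.tolist():
            r, col = divmod(p, gw)
            ys, xs = r * patch_size, col * patch_size
            mixed[i, :, ys:ys + patch_size, xs:xs + patch_size] = \
                images[perm[i], :, ys:ys + patch_size, xs:xs + patch_size]

    # interpolate labels between the two image classes
    lam = num_mixed / num_patches
    one_hot = torch.nn.functional.one_hot(labels, num_classes).float()
    mixed_labels = (1 - lam) * one_hot + lam * one_hot[perm]
    return mixed, mixed_labels
```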
1.2. MSViT: Dynamic Mixed-Scale Tokenization for Vision Transformers
The input tokens to Vision Transformers carry little semantic meaning as they are defined as regular equal-sized patches of the input image, regardless of its content. However, processing uniform background areas of an image should not necessitate as much computing as dense, cluttered areas. To address this issue, the authors propose a dynamic mixed-scale tokenization scheme for ViT, MSViT.
This method introduces a conditional gating mechanism that selects the optimal token scale for every image region, such that the number of tokens is dynamically determined per input. The proposed gating module is lightweight, agnostic to the choice of transformer backbone, and trained within a few epochs (e.g., 20 epochs on ImageNet) with little training overhead. In addition, to enhance the conditional behavior of the gate during training, they introduce a novel generalization of the batch-shaping loss. They show that the gating module is able to learn meaningful semantics despite operating locally at the coarse patch level. They validate MSViT on classification and segmentation tasks, where it leads to an improved accuracy-complexity trade-off.
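The snippet below is a conceptual PyTorch sketch of mixed-scale tokenization: a lightweight gate scores each coarse patch, and only the high-scoring (dense) regions are re-tokenized at the fine scale. The module layout, the hard threshold, and the per-image token lists are simplifying assumptions for illustration; the actual MSViT gate is trained with a generalized batch-shaping loss rather than a fixed threshold.

```python
import torch
import torch.nn as nn

class MixedScaleTokenizer(nn.Module):
    """Illustrative sketch of mixed-scale tokenization (not the official MSViT code).

    A small gating MLP scores each coarse patch; high-scoring, information-dense
    regions are re-tokenized with a fine patch size, while uniform regions keep a
    single coarse token, so the number of tokens varies per image.
    """
    def __init__(self, in_ch=3, embed_dim=192, coarse=32, fine=16):
        super().__init__()
        self.coarse, self.fine = coarse, fine
        self.coarse_proj = nn.Conv2d(in_ch, embed_dim, kernel_size=coarse, stride=coarse)
        self.fine_proj = nn.Conv2d(in_ch, embed_dim, kernel_size=fine, stride=fine)
        self.gate = nn.Sequential(nn.Linear(embed_dim, 64), nn.GELU(), nn.Linear(64, 1))

    def forward(self, x, threshold=0.0):
        B, _, H, W = x.shape
        k = self.coarse // self.fine                  # fine cells per coarse cell side
        gc_h, gc_w = H // self.coarse, W // self.coarse
        gf_w = W // self.fine

        coarse_tok = self.coarse_proj(x).flatten(2).transpose(1, 2)   # (B, Nc, D)
        fine_tok = self.fine_proj(x).flatten(2).transpose(1, 2)       # (B, Nf, D)
        scores = self.gate(coarse_tok).squeeze(-1)                    # (B, Nc)

        tokens_per_image = []
        for b in range(B):
            toks = []
            for idx in range(gc_h * gc_w):
                r, c = divmod(idx, gc_w)
                if scores[b, idx] > threshold:        # dense region -> fine tokens
                    rows = torch.arange(r * k, (r + 1) * k)
                    cols = torch.arange(c * k, (c + 1) * k)
                    fine_ids = (rows[:, None] * gf_w + cols[None, :]).reshape(-1)
                    toks.append(fine_tok[b, fine_ids])
                else:                                 # uniform region -> one coarse token
                    toks.append(coarse_tok[b, idx:idx + 1])
            tokens_per_image.append(torch.cat(toks, dim=0))
        return tokens_per_image    # variable-length token sequences, one per image
```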
2. Image Segmentation
2.1. ReMaX: Relaxing for Better Training on Efficient Panoptic Segmentation
This paper presents a new mechanism to facilitate the training of mask transformers for efficient panoptic segmentation, democratizing its deployment. The authors observe that, due to its high complexity, the training objective of panoptic segmentation inevitably leads to much higher false-positive penalization. Such an unbalanced loss makes the training of end-to-end mask-transformer-based architectures difficult, especially for efficient models. The authors present ReMaX, which adds relaxation to mask predictions and class predictions during training for panoptic segmentation. They demonstrate that, via these simple relaxation techniques during training, the model can be consistently improved by a clear margin without any extra computational cost at inference. By combining the proposed method with efficient backbones like MobileNetV3-Small, ReMaX achieves new state-of-the-art results for efficient panoptic segmentation on COCO, ADE20K, and Cityscapes.
Code and pre-trained checkpoints: https://github.com/google-research/deeplab2.
2.2. Segment Anything Meets Point Tracking
The Segment Anything Model (SAM) has established itself as a powerful zero-shot image segmentation model, employing interactive prompts such as points to generate masks. This paper presents SAM-PT, a method extending SAM’s capability to track and segment anything in dynamic videos.
SAM-PT leverages robust and sparse point selection and propagation techniques for mask generation, demonstrating that a SAM-based segmentation tracker can yield strong zero-shot performance across popular video object segmentation benchmarks, including DAVIS, YouTube-VOS, and MOSE.
Compared to traditional object-centric mask propagation strategies, SAM-PT uniquely uses point propagation to exploit local structure information that is agnostic to object semantics.
Project Page: https://github.com/SysCV/sam-pt.
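A high-level sketch of such a pipeline is shown below, using the public segment_anything API for mask prediction; the track_points helper stands in for an off-the-shelf point tracker and is purely hypothetical, so this is an illustration of the idea rather than the SAM-PT codebase.

```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

def track_points(prev_frame, next_frame, points):
    """Placeholder for a point tracker (hypothetical helper).
    Returns the query points propagated from prev_frame to next_frame."""
    raise NotImplementedError

def segment_video(frames, query_points, sam_checkpoint="sam_vit_h_4b8939.pth"):
    """Sketch of a SAM-PT-style pipeline (illustrative, not the official code):
    propagate sparse query points frame-to-frame, then prompt SAM with the
    propagated points to obtain a mask for every frame."""
    sam = sam_model_registry["vit_h"](checkpoint=sam_checkpoint)
    predictor = SamPredictor(sam)

    masks, points = [], np.asarray(query_points, dtype=np.float32)
    for t, frame in enumerate(frames):                 # frames: HxWx3 uint8 RGB arrays
        if t > 0:
            points = track_points(frames[t - 1], frame, points)  # propagate points
        predictor.set_image(frame)
        mask, _, _ = predictor.predict(
            point_coords=points,
            point_labels=np.ones(len(points), dtype=int),  # all points as positive prompts
            multimask_output=False,
        )
        masks.append(mask[0])
    return masks
```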
3. Video Analysis
3.1. VideoGLUE: Video General Understanding Evaluation of Foundation Models
In this paper, the authors evaluate the video understanding capabilities of existing foundation models using a carefully designed experimental protocol consisting of three hallmark tasks (action recognition, temporal localization, and spatiotemporal localization), eight datasets well received by the community, and four adaptation methods for tailoring a foundation model (FM) to a downstream task. Moreover, they propose a scalar VideoGLUE score (VGS) to measure an FM's efficacy and efficiency when adapting to general video understanding tasks.
The main findings are as follows:
First, task-specialized models significantly outperform the six FMs studied in this work, in sharp contrast to what FMs have achieved in natural language and image understanding.
Second, video-native FMs, whose pretraining data contains the video modality, are generally better than image-native FMs at classifying motion-rich videos, localizing actions in time, and understanding videos of more than one action.
Third, video-native FMs can perform well on video tasks under light adaptations to downstream tasks (e.g., freezing the FM backbones), while image-native FMs win in full end-to-end fine-tuning.
The first two observations reveal the need and tremendous opportunities to conduct research on video-focused FMs, and the last confirms that both tasks and adaptation methods matter when it comes to the evaluation of FMs.
4. Image & Video Editing
4.1. LEDITS: Real Image Editing with DDPM Inversion and Semantic Guidance
Recent large-scale text-guided diffusion models provide powerful image-generation capabilities. Currently, significant effort is devoted to enabling the modification of these images using text alone, as a means of offering intuitive and versatile editing. However, editing proves difficult for these generative models because editing techniques inherently involve preserving certain content from the original image. At the same time, in text-based models, even minor modifications to the text prompt frequently result in an entirely different image, making it exceedingly challenging to attain a one-shot generation that accurately corresponds to the user's intent. In addition, to edit a real image using these state-of-the-art tools, one must first invert the image into the pre-trained model's domain, adding another factor that affects edit quality as well as latency.
In this exploratory report, the authors propose LEDITS, a combined lightweight approach to real-image editing that integrates the Edit Friendly DDPM inversion technique with Semantic Guidance, thus extending Semantic Guidance to real-image editing while also harnessing the editing capabilities of DDPM inversion. The approach achieves versatile edits, both subtle and extensive, as well as alterations in composition and style, while requiring neither optimization nor extensions to the architecture.
5. Image Generation
5.1. Magic123: One Image to High-Quality 3D Object Generation Using Both 2D and 3D Diffusion Priors
Magic123 is a two-stage, coarse-to-fine approach for generating high-quality, textured 3D meshes from a single unposed in-the-wild image, using both 2D and 3D priors.
In the first stage, they optimize a neural radiance field to produce a coarse geometry. In the second stage, they adopt a memory-efficient differentiable mesh representation to yield a high-resolution mesh with a visually appealing texture. In both stages, the 3D content is learned through reference view supervision and novel views guided by a combination of 2D and 3D diffusion priors.
They introduce a single trade-off parameter between the 2D and 3D priors to control the exploration (more imaginative) and exploitation (more precise) of the generated geometry. Additionally, they employ textual inversion and monocular depth regularization to encourage consistent appearances across views and to prevent degenerate solutions, respectively. Magic123 demonstrates a significant improvement over previous image-to-3D techniques, as validated through extensive experiments on synthetic benchmarks and diverse real-world images.
Project Page: https://github.com/guochengqian/Magic123.
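Schematically, the joint guidance can be thought of as a weighted combination of two score-distillation-style losses, as in the hedged sketch below; the function names and the single weight lam are placeholders for illustration, not the authors' exact formulation.

```python
def joint_prior_loss(rendered_views, sds_2d, sds_3d, lam=1.0):
    """Schematic combination of 2D and 3D diffusion guidance (not the official code).

    `sds_2d` and `sds_3d` are assumed to be callables returning score-distillation
    losses from a 2D text-to-image prior and a 3D-aware, view-conditioned prior.
    A single trade-off weight `lam` balances exploration (more imaginative, low lam)
    against exploitation (more geometrically precise, high lam).
    """
    loss_2d = sds_2d(rendered_views)   # encourages plausible, imaginative appearance
    loss_3d = sds_3d(rendered_views)   # encourages multi-view-consistent geometry
    return loss_2d + lam * loss_3d
```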
5.2. JourneyDB: A Benchmark for Generative Image Understanding
While recent advancements in vision-language models have revolutionized multi-modal understanding, it remains unclear whether these models can comprehend generated images. Compared to real data, synthetic images exhibit a higher degree of diversity in both content and style, which makes them significantly more difficult for models to fully apprehend.
To this end, the authors present a large-scale dataset, JourneyDB, for multi-modal visual understanding in generative images. This curated dataset covers 4 million diverse and high-quality generated images paired with the text prompts used to produce them. They further design 4 benchmarks to quantify the performance of generated image understanding in terms of both content and style interpretation. These benchmarks include prompt inversion, style retrieval, image captioning, and visual question answering. Lastly, they assess the performance of current state-of-the-art multi-modal models when applied to JourneyDB, and provide an in-depth analysis of their strengths and limitations in generated content understanding.
Dataset: https://journeydb.github.io.
5.3. MVDiffusion: Enabling Holistic Multi-view Image Generation with Correspondence-Aware Diffusion
This paper introduces MVDiffusion, a simple yet effective multi-view image generation method for scenarios where pixel-to-pixel correspondences are available, such as perspective crops from panorama or multi-view images given geometry (depth maps and poses).
Unlike prior models that rely on iterative image warping and inpainting, MVDiffusion concurrently generates all images with a global awareness, encompassing high resolution and rich content, effectively addressing the error accumulation prevalent in preceding models. MVDiffusion specifically incorporates a correspondence-aware attention mechanism, enabling effective cross-view interaction. This mechanism underpins three pivotal modules:
A generation module that produces low-resolution images while maintaining global correspondence.
An interpolation module that densifies spatial coverage between images.
A super-resolution module that upscales the results into high-resolution outputs. For panoramic imagery, MVDiffusion can generate high-resolution photorealistic images of up to 1024×1024 pixels.
For geometry-conditioned multi-view image generation, MVDiffusion is demonstrated to be the first method capable of generating a texture map of a scene mesh.
Project Page: https://mvdiffusion.github.io.
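Below is a conceptual PyTorch sketch of correspondence-aware cross-view attention, assuming the pixel-to-pixel correspondences are precomputed and provided as token indices; the module layout is illustrative and not the official MVDiffusion implementation.

```python
import torch
import torch.nn as nn

class CorrespondenceAwareAttention(nn.Module):
    """Conceptual sketch of correspondence-aware cross-view attention
    (illustrative only, not the official MVDiffusion module).

    For each token of the view being denoised, keys and values are gathered from a
    neighbouring view at the K source tokens around its precomputed correspondence
    (e.g., derived from panorama geometry or depth maps and poses)."""
    def __init__(self, dim):
        super().__init__()
        self.q, self.k, self.v = nn.Linear(dim, dim), nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, tgt_feats, src_feats, corr_idx):
        # tgt_feats: (B, N, D) tokens of the view being denoised
        # src_feats: (B, M, D) tokens of a neighbouring view
        # corr_idx:  (B, N, K) indices of the K source tokens around each
        #            target token's correspondence (precomputed from geometry)
        B, N, D = tgt_feats.shape
        K = corr_idx.size(-1)

        # gather the K matched source tokens for every target token -> (B, N, K, D)
        idx = corr_idx.reshape(B, N * K, 1).expand(-1, -1, D)
        gathered = torch.gather(src_feats, 1, idx).reshape(B, N, K, D)

        q = self.q(tgt_feats).unsqueeze(2)                 # (B, N, 1, D)
        k, v = self.k(gathered), self.v(gathered)          # (B, N, K, D)
        attn = torch.softmax((q * k).sum(-1) / D ** 0.5, dim=-1)   # (B, N, K)
        out = (attn.unsqueeze(-1) * v).sum(2)              # (B, N, D)
        return tgt_feats + self.proj(out)                  # residual cross-view update
```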
5.4. SketchMetaFace: A Learning-based Sketching Interface for High-fidelity 3D Character Face Modeling
Modeling 3D avatars benefits various application scenarios such as AR/VR, gaming, and filmmaking. Character faces, as a vital component of avatars, contribute significant diversity and vividness. However, building 3D character face models usually requires a heavy workload with commercial tools, even for experienced artists. Various existing sketch-based tools fail to support amateurs in modeling diverse facial shapes and rich geometric details. In this paper, the authors present SketchMetaFace, a sketching system that targets amateur users and lets them model high-fidelity 3D faces in minutes.
They carefully design both the user interface and the underlying algorithm. First, curvature-aware strokes are adopted to better support control when carving facial details. Second, to address the key problem of mapping a 2D sketch map to a 3D model, they develop a novel learning-based method termed Implicit and Depth Guided Mesh Modeling (IDGMM), which fuses the advantages of mesh, implicit, and depth representations to achieve high-quality results with high efficiency. In addition, to further improve usability, they present a coarse-to-fine 2D sketching interface design and a data-driven stroke suggestion tool. User studies demonstrate the superiority of the system over existing modeling tools in terms of ease of use and visual quality of the results. Experimental analyses also show that IDGMM reaches a better trade-off between accuracy and efficiency.
Project Page: https://zhongjinluo.github.io/SketchMetaFace/.
5.5. DiT-3D: Exploring Plain Diffusion Transformers for 3D Shape Generation
Recent Diffusion Transformers (e.g., DiT) have demonstrated their effectiveness in generating high-quality 2D images. However, it remains unclear whether the Transformer architecture performs equally well for 3D shape generation, as previous 3D diffusion methods have mostly adopted the U-Net architecture. To bridge this gap, the authors propose DiT-3D, a novel Diffusion Transformer for 3D shape generation that can directly operate the denoising process on voxelized point clouds using plain Transformers.
Compared to existing U-Net approaches, DiT-3D is more scalable in model size and produces much higher-quality generations. Specifically, DiT-3D adopts the design philosophy of DiT but modifies it by incorporating 3D positional and patch embeddings to adaptively aggregate input from voxelized point clouds. Because the additional voxel dimension increases the 3D token length and thus the computational cost of self-attention, they incorporate 3D window attention into the Transformer blocks. Finally, linear and devoxelization layers are used to predict the denoised point clouds.
In addition, the transformer architecture supports efficient fine-tuning from 2D to 3D: a DiT-2D checkpoint pre-trained on ImageNet can significantly improve DiT-3D on ShapeNet. Experimental results on the ShapeNet dataset demonstrate that the proposed DiT-3D achieves state-of-the-art performance in high-fidelity and diverse 3D point cloud generation. In particular, DiT-3D decreases the 1-Nearest Neighbor Accuracy of the previous state-of-the-art method by 4.59 and increases the Coverage metric by 3.51 when evaluated on Chamfer Distance.
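The sketch below illustrates the voxelization and 3D patch-embedding idea in PyTorch, assuming a normalized point cloud and a simple occupancy grid; the grid size, patch size, and embedding dimension are illustrative choices, not the official DiT-3D configuration.

```python
import torch
import torch.nn as nn

def voxelize(points, grid=32):
    """Scatter an (N, 3) point cloud, assumed normalized to [-1, 1]^3, onto a dense
    occupancy grid of shape (1, grid, grid, grid). Illustrative helper."""
    vox = torch.zeros(1, grid, grid, grid)
    idx = ((points + 1) / 2 * (grid - 1)).long().clamp(0, grid - 1)
    vox[0, idx[:, 0], idx[:, 1], idx[:, 2]] = 1.0
    return vox

class VoxelPatchEmbed3D(nn.Module):
    """Sketch of 3D patch embedding for a DiT-3D-style backbone (not the official code):
    a Conv3d splits the voxel grid into non-overlapping 3D patches and projects each
    patch to a token, then a learnable 3D positional embedding is added."""
    def __init__(self, grid=32, patch=4, in_ch=1, dim=384):
        super().__init__()
        self.proj = nn.Conv3d(in_ch, dim, kernel_size=patch, stride=patch)
        n_tokens = (grid // patch) ** 3
        self.pos_embed = nn.Parameter(torch.zeros(1, n_tokens, dim))

    def forward(self, vox):                                   # vox: (B, 1, G, G, G)
        tokens = self.proj(vox).flatten(2).transpose(1, 2)    # (B, (G/p)^3, dim)
        return tokens + self.pos_embed    # tokens ready for plain Transformer blocks
```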
5.6. DragonDiffusion: Enabling Drag-style Manipulation on Diffusion Models
Despite the ability of existing large-scale text-to-image (T2I) models to generate high-quality images from detailed textual descriptions, they often lack the ability to precisely edit the generated or real images. In this paper, the authors propose a novel image editing method, DragonDiffusion, enabling Drag-style manipulation on Diffusion models.
Specifically, they construct classifier guidance based on the strong correspondence of intermediate features in the diffusion model. It transforms the editing signals into gradients via a feature correspondence loss to modify the intermediate representation of the diffusion model. Based on this guidance strategy, they also build multi-scale guidance to consider both semantic and geometric alignment. Moreover, a cross-branch self-attention mechanism is added to maintain consistency between the original image and the editing result.
This method, through an efficient design, achieves various editing modes for the generated or real images, such as object moving, object resizing, object appearance replacement, and content dragging. It is worth noting that all editing and content preservation signals come from the image itself, and the model does not require fine-tuning or additional modules.
Project Page: https://github.com/MC-E/DragonDiffusion.
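The following is a schematic sketch of such feature-correspondence guidance, in which a similarity loss between diffusion features at the source and target regions is turned into a gradient on the latent; extract_features, the masks, and the update rule are assumptions for illustration rather than the DragonDiffusion code.

```python
import torch
import torch.nn.functional as F

def correspondence_guidance(latent, extract_features, src_mask, tgt_mask, scale=1.0):
    """Schematic drag-style guidance step (illustrative, not the DragonDiffusion code).

    `extract_features` is assumed to return intermediate diffusion-UNet features of
    shape (B, C, H, W) for the current latent. The loss pulls features inside the
    target region towards the (detached) features of the source region, and its
    gradient w.r.t. the latent acts as classifier-style guidance.
    """
    latent = latent.detach().requires_grad_(True)
    feats = extract_features(latent)                 # (B, C, H, W), differentiable in latent

    # boolean (H, W) masks; assumed to select the same number of pixels
    src = feats[:, :, src_mask]                      # features at the source region (B, C, K)
    tgt = feats[:, :, tgt_mask]                      # features at the target region (B, C, K)
    loss = 1.0 - F.cosine_similarity(tgt, src.detach(), dim=1).mean()

    grad = torch.autograd.grad(loss, latent)[0]      # editing signal as a gradient
    return latent - scale * grad                     # nudge the latent toward the edit
```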
5.7. SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis
In this paper, the authors present SDXL, a latent diffusion model for text-to-image synthesis. Compared to previous versions of Stable Diffusion, SDXL leverages a UNet backbone that is three times larger: the increase in model parameters is mainly due to more attention blocks and a larger cross-attention context, as SDXL uses a second text encoder. The authors design multiple novel conditioning schemes and train SDXL on multiple aspect ratios. They also introduce a refinement model that improves the visual fidelity of samples generated by SDXL using a post-hoc image-to-image technique. They demonstrate that SDXL shows drastically improved performance compared to previous versions of Stable Diffusion and achieves results competitive with black-box state-of-the-art image generators.
Project Page: https://github.com/Stability-AI/generative-models
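For readers who want to try it, a minimal usage sketch with the diffusers library is shown below; the checkpoint identifiers are assumed to be the publicly released SDXL base and refiner models, and a CUDA GPU is assumed.

```python
import torch
from diffusers import StableDiffusionXLPipeline, StableDiffusionXLImg2ImgPipeline

# Load the base text-to-image pipeline and the image-to-image refiner
# (model ids assumed to be the public SDXL checkpoints).
base = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")
refiner = StableDiffusionXLImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-refiner-1.0", torch_dtype=torch.float16
).to("cuda")

prompt = "a photograph of an astronaut riding a horse on the moon"
image = base(prompt=prompt).images[0]                    # base model generates the sample
refined = refiner(prompt=prompt, image=image).images[0]  # refiner improves visual fidelity
refined.save("sdxl_sample.png")
```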
6. Action Recognition
6.1. Real-time Monocular Full-body Capture in World Space via Sequential Proxy-to-Motion Learning
Learning-based approaches to monocular motion capture have recently shown promising results by regressing body motion in a data-driven manner. However, due to challenges in data collection and network design, it remains difficult for existing solutions to achieve real-time full-body capture that is also accurate in world space.
In this work, the authors contribute a sequential proxy-to-motion learning scheme together with a proxy dataset of 2D skeleton sequences and 3D rotational motions in world space. Such proxy data enables building a learning-based network with accurate full-body supervision while also mitigating generalization issues. For more accurate and physically plausible predictions, a contact-aware neural motion descent module is proposed so that the network is aware of foot-ground contact and of motion misalignment with the proxy observations.
Additionally, they share body-hand context information in the network for more compatible wrist pose recovery with the full-body model. With the proposed learning-based solution, they demonstrate the first real-time monocular full-body capture system with plausible foot-ground contact in world space.
Project Page: https://liuyebin.com/proxycap.
If you like the article and would like not to miss any future articles and letters make sure to subscribe from the button below.
Looking to start a career in data science and AI and do not know how? I offer data science mentoring sessions and long-term career mentoring:
Mentoring sessions: https://lnkd.in/dXeg3KPW
Long-term mentoring: https://lnkd.in/dtdUYBrM