The integration of multiple data modalities is crucial for enhancing the effectiveness of retrieval-augmented generation (RAG) applications. This article, the fourth in our series on building multimodal RAG applications, delves into the critical process of video preprocessing.
In this article, I provide a step-by-step guide to setting up the working environment, downloading video corpora, and extracting valuable insights from video content and its transcripts.
Readers will learn how to handle various scenarios, including downloading videos and transcripts from specified links and managing situations where transcripts may not be readily available.
Additionally, we explore the implementation of large vision-language model (LVLM) inference for extracting frames and metadata from videos without spoken language, enriching the dataset for improved information retrieval and generation.
By the end of this article, readers will be equipped with the knowledge and tools necessary to preprocess videos effectively, laying the groundwork for developing robust multimodal RAG applications.
This article is the fourth in the ongoing series on Building Multimodal RAG Applications:
Introduction to Multimodal RAG Applications (Published)
Multimodal Embeddings (Published)
Multimodal RAG Application Architecture (Published)
Processing Videos for Multimodal RAG (You are here)
Multimodal Retrieval from Vector Stores (Coming soon!)
Large Vision Language Models (LVLMs) (Coming soon!)
Multimodal RAG with Multimodal LangChain (Coming soon!)
Putting it All Together! Building Multimodal RAG Application (Coming soon!)
You can find the code and datasets used in this series in this GitHub Repo.
Table of Contents:
1. Setting Up Working Environment
2. Downloading Video Corpora
2.1. Download Videos & Transcripts from Given Link
2.2. Video Corpus and Its Transcript Are Available
2.3. Video Corpus without Available Transcript
3. Video Corpus without Language
3.1. LVLM Inference Example
3.2. Extract Frames and Metadata for Videos Using LVLM Inference
1. Setting Up Working Environment
To kick off the video preprocessing pipeline in our multimodal RAG setup, we’ll start by importing a set of libraries tailored for handling video, audio, and image data.
The pathlib and os libraries streamline filesystem navigation, making it easy to access and manipulate files across different operating systems. For data handling, we’ll use json to work with structured metadata files.
Then, cv2 from OpenCV, paired with moviepy’s VideoFileClip, empowers us with robust video manipulation capabilities, from reading frames to editing clips.
For handling audio and captions, we're bringing in whisper for transcription and webvtt to parse subtitles. Additionally, PIL.Image will assist with image processing, and base64 ensures we can encode image data when needed.
Together, these libraries form the core toolkit for building a versatile preprocessing pipeline that caters to the diverse data requirements of a multimodal application.
from pathlib import Path
import os
from os import path as osp
import json
import cv2
import webvtt
import whisper
from moviepy.editor import VideoFileClip
from PIL import Image
import base64
Next, we will define two helper functions. The first one is str2time, which plays a crucial role in converting time strings from the WebVTT format into milliseconds — a unit that’s more manageable for processing video timestamps.
The function starts by removing any extraneous quotation marks from the input string. It then splits the cleaned string into hours, minutes, and seconds, converting each component into a floating-point number. By using simple arithmetic, we calculate the total duration in seconds and subsequently convert it into milliseconds.
This conversion is essential for synchronizing various elements of our multimodal application, ensuring that we can accurately align video frames with their corresponding captions or audio snippets.
# a helper function that converts a time string in `webvtt` format into milliseconds
def str2time(strtime):
    # strip the character " if it exists
    strtime = strtime.strip('"')
    # get hours, minutes, and seconds from the time string
    hrs, mins, seconds = [float(c) for c in strtime.split(':')]
    # compute the corresponding total time in seconds
    total_seconds = hrs * 60**2 + mins * 60 + seconds
    total_milliseconds = total_seconds * 1000
    return total_milliseconds
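As a quick check, a WebVTT timestamp such as 00:00:03.620 converts to 3620 milliseconds:

# 0 h, 0 min, 3.62 s -> 3.62 * 1000 = 3620.0 ms
print(str2time('00:00:03.620'))  # 3620.0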
The second helper function is maintain_aspect_ratio_resize, which resizes images while preserving their aspect ratio; this is vital for maintaining the integrity of visual content in our multimodal application.
It begins by checking the dimensions of the input image, obtaining its height and width. If neither a new width nor height is specified, the function simply returns the original image, ensuring no unnecessary processing occurs.
If only the height is provided, it calculates the appropriate width that maintains the original aspect ratio. Conversely, if the width is given without height, the function computes the new height based on the same principle.
Finally, the function uses OpenCV’s cv2.resize method to resize the image, applying the specified interpolation method to ensure the best possible quality. This resizing capability is crucial for preparing images for further processing, ensuring they fit within the constraints of our application while keeping their proportions intact.
# Resizes an image while maintaining its aspect ratio
def maintain_aspect_ratio_resize(image, width=None, height=None, inter=cv2.INTER_AREA):
# Grab the image size and initialize dimensions
dim = None
(h, w) = image.shape[:2]
# Return original image if no need to resize
if width is None and height is None:
return image
# We are resizing height if width is none
if width is None:
# Calculate the ratio of the height and construct the dimensions
r = height / float(h)
dim = (int(w * r), height)
# We are resizing width if height is none
else:
# Calculate the ratio of the width and construct the dimensions
r = width / float(w)
dim = (width, int(h * r))
# Return the resized image
return cv2.resize(image, dim, interpolation=inter)
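For example, given a 1920×1080 frame (a synthetic array here, but any frame read with cv2 behaves the same way), requesting a height of 360 yields a 640×360 result, preserving the 16:9 ratio:

import numpy as np

# a synthetic 1080p frame standing in for a real video frame
frame = np.zeros((1080, 1920, 3), dtype=np.uint8)
resized = maintain_aspect_ratio_resize(frame, height=360)
print(resized.shape)  # (360, 640, 3)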
Now that the working environment is ready, we can move on to the second step: downloading the video corpora that will be used to build the multimodal RAG application.
2. Downloading Video Corpora
2.1. Download Videos & Transcripts from Given Link
We will start by defining three helper functions to help us extract video transcripts from YouTube, specifically formatted as WebVTT files for easy integration into our application.
The first function, get_video_id_from_url, parses the provided YouTube video URL to extract the unique video ID. It handles various URL formats, including shortened links (youtu.be), embedded videos, and standard watch links, ensuring compatibility across different URL structures.
from youtube_transcript_api import YouTubeTranscriptApi
from youtube_transcript_api.formatters import WebVTTFormatter
def get_video_id_from_url(video_url):
import urllib.parse
url = urllib.parse.urlparse(video_url)
if url.hostname == 'youtu.be':
return url.path[1:]
if url.hostname in ('www.youtube.com', 'youtube.com'):
if url.path == '/watch':
p = urllib.parse.parse_qs(url.query)
return p['v'][0]
if url.path[:7] == '/embed/':
return url.path.split('/')[2]
if url.path[:3] == '/v/':
return url.path.split('/')[2]
return video_url
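To illustrate (with a made-up video ID), all three URL shapes resolve to the same ID:

# hypothetical video ID used purely for illustration
for url in ['https://youtu.be/abc123XYZ',
            'https://www.youtube.com/watch?v=abc123XYZ',
            'https://www.youtube.com/embed/abc123XYZ']:
    print(get_video_id_from_url(url))  # abc123XYZ each time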
Next, the get_transcript_vtt function utilizes the YouTubeTranscriptApi to fetch the video transcript in the specified languages (in this case, English). If a transcript file already exists at the specified path, the function returns its location to avoid redundant downloads.
Otherwise, it formats the transcript using WebVTTFormatter and writes the resulting WebVTT data to a file, making it ready for subsequent processing in our multimodal RAG pipeline. This capability is essential for ensuring that we have accurate captions available for audio-visual content, enhancing the overall performance of our application.
def get_transcript_vtt(video_url, path='/tmp'):
video_id = get_video_id_from_url(video_url)
filepath = os.path.join(path,'captions.vtt')
if os.path.exists(filepath):
return filepath
transcript = YouTubeTranscriptApi.get_transcript(video_id, languages=['en-GB', 'en'])
formatter = WebVTTFormatter()
webvtt_formatted = formatter.format_transcript(transcript)
with open(filepath, 'w', encoding='utf-8') as webvtt_file:
webvtt_file.write(webvtt_formatted)
return filepath
The final function is download_video, which handles acquiring the video content itself.
It starts by logging the video URL, providing feedback on the ongoing process. If the input URL does not start with "http", the function assumes it's a local file name and simply returns the path where the video would be stored.
For valid URLs, the function checks the specified directory for existing .mp4 files using the glob module; if it finds any, it returns the first match, which prevents redundant downloads. Only when no local copy exists does it fetch the video, a step sketched below with yt-dlp (one reasonable choice of downloader). This design ensures that we manage video assets efficiently, avoiding unnecessary duplication while streamlining the workflow for subsequent processing tasks.
import glob
import yt_dlp  # an assumed choice of downloader; any tool that saves an .mp4 works
def download_video(video_url, path='/tmp/'):
    print(f'Getting video information for {video_url}')
    if not video_url.startswith('http'):
        return os.path.join(path, video_url)
    filepath = glob.glob(os.path.join(path, '*.mp4'))
    if len(filepath) > 0:
        return filepath[0]
    # download the video as an .mp4 (sketched with yt-dlp; not necessarily the author's tool)
    opts = {'format': 'mp4', 'outtmpl': os.path.join(path, '%(title)s.%(ext)s')}
    with yt_dlp.YoutubeDL(opts) as ydl:
        ydl.download([video_url])
    return glob.glob(os.path.join(path, '*.mp4'))[0]
Now let's put this into action: download the following YouTube video, save it in a given path, and get its transcript.
# first video's url
vid1_url = ""
# download the YouTube video to ./data/videos/video1
vid1_dir = "./data/videos/video1"
vid1_filepath = download_video(vid1_url, vid1_dir)
# download the YouTube video's subtitles to ./data/videos/video1
vid1_transcript_filepath = get_transcript_vtt(vid1_url, vid1_dir)
Getting video information for “ “
# show the paths to video1 and its transcription
print(vid1_filepath)
print(vid1_transcript_filepath)
./data/videos/video1/Welcome back to Planet Earth.mp4
./data/videos/video1/captions.vtt
Now let's see what the video transcript looks like:
!head -n15 {vid1_transcript_filepath}
WEBVTT

00:00:03.620 --> 00:00:06.879
As I look back on the the mission that we've had here

00:00:06.879 --> 00:00:10.559
on the International Space Station,
I'm proud to have been a part of much of

00:00:10.559 --> 00:00:13.679
the science activities that happened over the last

00:00:13.680 --> 00:00:14.420
two months.
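Since webvtt was imported earlier but not yet used, here is a minimal sketch (assuming the captions file above) of how these cues can be iterated and their timestamps converted to milliseconds with str2time:

# iterate over the caption cues and convert their timestamps to milliseconds
for caption in webvtt.read(vid1_transcript_filepath):
    start_ms = str2time(caption.start)
    end_ms = str2time(caption.end)
    print(f'{start_ms:.0f}-{end_ms:.0f} ms: {caption.text}')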
We can also download a video hosted on Amazon S3 and save it to a specific directory. To ensure the directory exists before downloading, we use Path.mkdir() from the pathlib module, which creates the necessary folders if they aren’t already present.
Finally, we utilize urlretrieve from the urllib.request module to download the video directly from the specified URL, saving it with the chosen filename in the specified directory.
from urllib.request import urlretrieve
# second video's url
vid2_url = (
"https://multimedia-commons.s3-us-west-2.amazonaws.com/"
"data/videos/mp4/010/a07/010a074acb1975c4d6d6e43c1faeb8.mp4"
)
vid2_dir = "./shared_data/videos/video2"
vid2_name = "toddler_in_playground.mp4"
# create folder to which video2 will be downloaded
Path(vid2_dir).mkdir(parents=True, exist_ok=True)
vid2_filepath = urlretrieve(
vid2_url,
osp.join(vid2_dir, vid2_name)
)[0]
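As an optional sanity check, we can open the downloaded clip with the VideoFileClip class imported earlier and inspect its basic properties:

# verify the download by reading the clip's duration and frame rate
clip = VideoFileClip(vid2_filepath)
print(f'duration: {clip.duration:.1f}s, fps: {clip.fps}')
clip.close()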