Building an Object Detection Assistant Application: A Step-by-Step Guide

Developing Your Own Object Detection Assistant: A Step-by-Step Manual

Jul 27, 2024

Object detection is one of the most important tasks in computer vision and one of its most transformative applications. This article provides a step-by-step guide to building a personalized object detection assistant, from concept to demo deployment.

In this article, you will explore and use computer vision models to build a practical application. The main goal is to create an assistant that can help a visually impaired person understand what is in a picture.

The assistant uses state-of-the-art computer vision techniques to detect the objects in an image, summarizes the detections in a sentence, and finally converts that sentence to speech.

1. Setting Up the Environment

We will start by installing and importing the required packages. They provide the tools we need to build our computer vision application: the transformers library for model handling, Gradio for creating user interfaces, and timm, inflect, and phonemizer for additional processing. The imports that follow the install commands are the ones the helper functions in this section rely on.

    !pip install transformers
    !pip install gradio
    !pip install timm
    !pip install inflect
    !pip install phonemizer

    import io
    import inflect
    import matplotlib.pyplot as plt
    import requests
    from PIL import Image
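Before going further, it can help to confirm that every package actually installed. The sketch below is not part of the original setup: `check_packages` is a hypothetical helper that uses only the standard library's `importlib` to report which of the packages named above can be found, without importing them.

```python
from importlib.util import find_spec

# Packages installed in the setup step above
REQUIRED = ["transformers", "gradio", "timm", "inflect", "phonemizer"]

def check_packages(names):
    """Map each top-level package name to True if it can be found, else False."""
    return {name: find_spec(name) is not None for name in names}

for name, found in check_packages(REQUIRED).items():
    print(f"{name}: {'ok' if found else 'MISSING'}")
```

If any package is reported as missing, rerunning the matching install command is usually enough.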

Next, we will define some helper functions, starting with load_image_from_url, which loads an image from a given URL:

def load_image_from_url(url):
    return Image.open(requests.get(url, stream=True).raw)
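The helper above sets no timeout and accepts any URL. As a possible variation, here is a minimal sketch of a more defensive download step. It swaps requests for the standard library's `urllib`, and `fetch_image_bytes` with its 10-second `timeout` default is a hypothetical name and value, not something from the original code.

```python
import io
import urllib.request
from urllib.parse import urlparse

def fetch_image_bytes(url, timeout=10):
    """Download a URL and return its body as an in-memory binary buffer.

    Hypothetical helper, not from the original article.
    """
    # Reject anything that is not plain HTTP(S), such as file:// paths
    if urlparse(url).scheme not in ("http", "https"):
        raise ValueError(f"Unsupported URL scheme: {url!r}")
    # urlopen raises an error on 4xx/5xx responses and on timeouts
    # instead of hanging on a slow server
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return io.BytesIO(response.read())
```

The returned buffer is a file-like object, so it can be passed to PIL, for example `Image.open(fetch_image_bytes(url))`.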

The second helper, render_results_in_image, visualizes the results of an object detection model by overlaying bounding boxes and labels on an image. It takes two inputs:

in_pil_img: A PIL image object that represents the input image to be processed.
in_results: A list of prediction results, where each prediction includes the bounding box coordinates, the label of the detected object, and the confidence score.
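To make the expected shape of in_results concrete, here is a small hand-written example. The values are made up for illustration and do not come from a real model run; the keys (box, label, score) match the ones render_results_in_image reads. `box_to_xywh` is a hypothetical helper showing how corner coordinates become the origin, width, and height that a rectangle needs.

```python
# Hand-written sample in the shape render_results_in_image expects.
# The numbers are illustrative, not real model output.
sample_results = [
    {"score": 0.98, "label": "dog",
     "box": {"xmin": 40, "ymin": 60, "xmax": 240, "ymax": 310}},
    {"score": 0.87, "label": "cat",
     "box": {"xmin": 300, "ymin": 120, "xmax": 420, "ymax": 260}},
]

def box_to_xywh(box):
    """Convert corner coordinates to (x, y, width, height)."""
    return (box["xmin"], box["ymin"],
            box["xmax"] - box["xmin"], box["ymax"] - box["ymin"])

for prediction in sample_results:
    x, y, w, h = box_to_xywh(prediction["box"])
    print(f"{prediction['label']}: origin=({x}, {y}), size={w}x{h}")
```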

The function processes these inputs to create a visual representation of the object detection results. It uses the matplotlib library to draw rectangles around detected objects and annotate them with labels and confidence scores.

The annotated figure is saved to an in-memory BytesIO buffer, reopened as a PIL image, and returned without being displayed, making it suitable for further processing or display elsewhere.

def render_results_in_image(in_pil_img, in_results):
    plt.figure(figsize=(16, 10))
    plt.imshow(in_pil_img)

    ax = plt.gca()

    for prediction in in_results:

        x, y = prediction['box']['xmin'], prediction['box']['ymin']
        w = prediction['box']['xmax'] - prediction['box']['xmin']
        h = prediction['box']['ymax'] - prediction['box']['ymin']

        ax.add_patch(plt.Rectangle((x, y),
                                   w,
                                   h,
                                   fill=False,
                                   color="green",
                                   linewidth=2))
        ax.text(
            x,
            y,
            f"{prediction['label']}: {round(prediction['score']*100, 1)}%",
            color='red'
        )

    plt.axis("off")

    # Save the modified image to a BytesIO object
    img_buf = io.BytesIO()
    plt.savefig(img_buf, format='png',
                bbox_inches='tight',
                pad_inches=0)
    img_buf.seek(0)
    modified_image = Image.open(img_buf)

    # Close the plot to prevent it from being displayed
    plt.close()

    return modified_image
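The `img_buf.seek(0)` call in the function above is easy to overlook. Writing to a BytesIO buffer leaves the position at the end of the data, so any consumer that reads from the current position gets nothing back. A minimal standard-library sketch of that behavior, using placeholder bytes rather than a real PNG:

```python
import io

buf = io.BytesIO()
buf.write(b"fake image bytes")   # placeholder payload, not a real PNG

# After writing, the position sits at the end of the buffer
position_after_write = buf.tell()
unread = buf.read()              # nothing left to read from here

buf.seek(0)                      # rewind to the start
payload = buf.read()             # now the full payload comes back

print(position_after_write, unread, payload)
```

Rewinding explicitly is a safe habit before handing a freshly written buffer to another reader.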

The third helper, summarize_predictions_natural_language, turns the object detection results into a natural-language description. It takes the list of predictions and uses only each prediction's label, which names the type of object detected.

It creates a dictionary (summary) to count the occurrences of each label and then constructs a descriptive sentence, using the inflect library to convert numerical counts into words (e.g., "three cats").

The function builds the sentence by iterating through the dictionary, appending each count and label, pluralizing labels that occur more than once, and inserting "and" before the last item. Finally, it returns a complete sentence that describes the detected objects in the image, formatted for human readability.

def summarize_predictions_natural_language(predictions):
    summary = {}
    p = inflect.engine()

    # Count how many times each label was detected
    for prediction in predictions:
        label = prediction['label']
        summary[label] = summary.get(label, 0) + 1

    result_string = "In this image, there are "
    for i, (label, count) in enumerate(summary.items()):
        count_string = p.number_to_words(count)
        # inflect pluralizes the label only when count is not 1
        result_string += f"{count_string} {p.plural(label, count)} "

        # Insert "and" before the last item
        if i == len(summary) - 2:
            result_string += "and "

    # Strip the trailing space before adding the period
    result_string = result_string.rstrip() + "."

    return result_string
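The loop above separates items with spaces only, so three or more labels run together (for example "two dogs one cat and three birds"). As a possible variation, not the original author's code, the sketch below counts labels with `collections.Counter` and joins the phrases with commas. `count_labels` and `join_with_and` are hypothetical names, and the phrases passed to `join_with_and` are assumed to be already pluralized.

```python
from collections import Counter

def count_labels(predictions):
    """Count detections per label, keeping first-seen order."""
    return Counter(prediction["label"] for prediction in predictions)

def join_with_and(phrases):
    """Join phrases as 'a', 'a and b', or 'a, b and c'."""
    phrases = list(phrases)
    if len(phrases) <= 1:
        return "".join(phrases)
    return ", ".join(phrases[:-1]) + " and " + phrases[-1]

predictions = [{"label": "dog"}, {"label": "cat"}, {"label": "dog"}]
counts = count_labels(predictions)
print(dict(counts))
print(join_with_and(["two dogs", "one cat", "three birds"]))
```

An empty list yields an empty string, which makes it easy for the caller to substitute a sentence such as "no objects were detected" when nothing is found.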

To Data & Beyond
