If you want to build a local OCR application that does not require sending your documents to external APIs, you can use SmolDocling, an ultra-compact vision-language model for end-to-end multi-modal document conversion that runs entirely locally.
SmolDocling is a compact, 256M-parameter, open-source vision-language model designed for OCR. It offers end-to-end document conversion without complex pipelines, allowing a single small model to handle everything. It’s fast and efficient, processing a page in just 0.35 seconds on a consumer GPU while using less than 500MB of VRAM. Despite its small size, it delivers high accuracy, outperforming models 27× larger in full-page transcription, layout detection, and code recognition.
In this two-part hands-on tutorial, we’ll build a local OCR application using SmolDocling. In the first part, we’ll develop the OCR pipeline step by step, breaking down each component. In the second part, we’ll integrate everything and create the application interface.
Table of contents:
Introduction to OCR & SmolDocling Model
Setting Up the Working Environment
Initializing the Processor and Model
Preparing OCR Prompt
Generating OCR Output
Wrapping the OCR Pipeline
You can find the code and data used in this article in this GitHub Repo.
My New E-Book: LLM Roadmap from Beginner to Advanced Level
I am pleased to announce that I have published my new ebook, LLM Roadmap from Beginner to Advanced Level. This ebook provides all the resources you need to start your journey toward mastering LLMs.
1. Introduction to OCR & SmolDocling Model
Optical Character Recognition (OCR) is the process that converts an image of text into a machine-readable text format. For example, if you scan a form or a receipt, your computer saves the scan as an image file. You cannot use a text editor to edit, search, or count the words in the image file. However, you can use OCR to convert the image into a text document, and its contents can be stored as text data.
OCR software can take advantage of artificial intelligence (AI) to implement more advanced methods of intelligent character recognition (ICR) for identifying languages or handwriting. Organizations often use the process of OCR to turn printed legal or historical documents into PDF documents so that users can edit, format, and search the documents as if created with a word processor.
In this tutorial, we will use the SmolDocling model as our core AI engine to extract the text from the image. SmolDocling is an ultra-compact vision-language model targeting end-to-end document conversion. The model comprehensively processes entire pages by generating DocTags, a new universal markup format that captures all page elements in their full context with location.
Unlike existing approaches that rely on large foundational models, or ensemble solutions built on handcrafted pipelines of multiple specialized models, SmolDocling offers end-to-end conversion that accurately captures the content, structure, and spatial location of document elements in a single 256M-parameter vision-language model.
SmolDocling exhibits robust performance in correctly reproducing document features such as code listings, tables, equations, charts, lists, and more across a diverse range of document types including business documents, academic papers, technical reports, patents, and forms, significantly extending beyond the commonly observed focus on scientific papers.
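To make DocTags concrete, here is a simplified, illustrative sketch of the kind of output the model produces (the tag names, text, and location values below are illustrative examples, not output from a real run):

<doctag>
  <text><loc_58><loc_44><loc_426><loc_91>First paragraph of the page ...</text>
  <picture><loc_77><loc_100><loc_423><loc_190></picture>
</doctag>

Each page element is wrapped in a tag describing its type, and the location tokens encode its bounding box on the page, which is how DocTags preserves both the content and the spatial layout of a document.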
2. Setting Up the Working Environment
The first step in implementing the OCR pipeline is to set up the working environment. We will start by installing the required libraries:
torch: PyTorch for handling deep learning models.
docling_core: The core Docling library, providing the document data structures used to parse SmolDocling’s output.
transformers: Provides pre-trained models and utilities from Hugging Face.
flash-attn: An optional optimization for faster attention computation on compatible GPUs (FlashAttention only supports Ampere GPUs or newer).
!pip install torch -q
!pip install docling_core -q
!pip install transformers -q
!pip install flash-attn --no-build-isolation -q
Next, we import the necessary modules:
torch: Checks GPU availability and loads the model efficiently.
DoclingDocument & DocTagsDocument: These are data structures from docling_core used to represent OCR-extracted text.
AutoProcessor: Used to preprocess images and text before feeding them into the model.
AutoModelForVision2Seq: Loads the SmolDocling OCR model.
load_image: A helper function to load images from a file or URL.
import os
import torch
from docling_core.types.doc import DoclingDocument
from docling_core.types.doc.document import DocTagsDocument
from transformers import AutoProcessor, AutoModelForVision2Seq
from transformers.image_utils import load_image
Finally, we will check whether a GPU (CUDA) is available and set the device accordingly:
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
If a GPU is available, the model runs on "cuda" for faster execution; otherwise, it falls back to "cpu".
3. Initializing the Processor and Model
The next step is to initialize the processor and the model. The processor (AutoProcessor) handles preprocessing, including image conversion and text tokenization.
The model (AutoModelForVision2Seq) loads the SmolDocling OCR model. We set torch_dtype=torch.bfloat16 to use a lower-precision format (bfloat16) that saves memory without sacrificing much accuracy.
We also apply .to(DEVICE), which moves the model to the GPU (if available) for better performance. You can uncomment the _attn_implementation line to enable FlashAttention 2 if your GPU supports it.
processor = AutoProcessor.from_pretrained("ds4sd/SmolDocling-256M-preview")
model = AutoModelForVision2Seq.from_pretrained(
    "ds4sd/SmolDocling-256M-preview",
    torch_dtype=torch.bfloat16,
    # _attn_implementation="flash_attention_2" if DEVICE == "cuda" else "eager",
).to(DEVICE)
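As an optional sanity check (not part of the pipeline itself), you can confirm which device and precision the model ended up with:

# Optional: verify where the weights were placed and their precision
print(next(model.parameters()).device)  # cuda:0 if a GPU is available, otherwise cpu
print(next(model.parameters()).dtype)   # torch.bfloat16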
4. Preparing OCR Prompt
With the model and processor ready, we will define the instruction template for the OCR model. The image is passed to the model along with the text prompt "Convert this page to docling.", which tells the model to extract structured text from the image.
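Here is a minimal sketch of this step, following the standard transformers chat-template API (the image URL below is a placeholder; point load_image at your own document file or URL):

# Load the document image (placeholder URL; replace with your own file or URL)
image = load_image("https://example.com/sample_page.png")

# Build the chat-style message: one image plus the instruction text
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Convert this page to docling."},
        ],
    },
]

# Render the message into the model's expected prompt format
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)

# Preprocess the image and tokenize the prompt, then move the tensors to the device
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(DEVICE)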