Best Practices for Reading PDFs in Laravel
〜How Should We Combine pdftotext, OCR, and Generative AI?〜
1. Conclusion First: The “Three-Layer” Best Practice
If you prioritize accuracy,
the best practice for reading PDFs in Laravel can roughly be summarized as three layers:
1. Determine the type of PDF
   - Is it a text-embedded PDF?
   - A scanned image PDF (text can’t be selected)?
   - Or a hybrid where both coexist?
2. Separate the extraction layer
   - Text PDFs → extract via pdftotext-style tools (e.g., spatie/pdf-to-text)
   - Image PDFs → OCR (e.g., Tesseract + a PHP wrapper / LaraOCR, or cloud OCR)
3. Use generative AI for structuring, summarizing, and field extraction
   - Feed the extracted text to an LLM (generative AI) for interpretation and structuring
   - Things like “invoice field extraction,” “work history extraction from a CV,” or “summarization” are done here
The key idea is:
“Reading characters” = pdftotext / OCR
“Understanding and organizing meaning” = generative AI
Make this separation explicit.
If you mix these into a single layer, you get:
- Slow processing
- High cost
- More AI “hallucinations” — answers that sound plausible but are wrong
…which all reduce accuracy.
From here, we’ll look at the relative strengths of each approach
and walk through how to put them together concretely in Laravel.
2. Why Reading PDFs Is Hard, and How to Tell the “Types” Apart
2-1. There Are Two Major Types of PDFs
Even if they look the same on screen, PDFs can have completely different internal structures.
- Text-embedded PDFs (digital PDFs)
  - Generated directly from web systems, Word/Excel, LaTeX, etc.
  - The text is embedded as “text objects” in the PDF
  - You can drag and copy text on screen
- Image PDFs (scanned PDFs)
  - Created by scanning paper and bundling the images as a PDF
  - Internally, it’s just “images”: no embedded text at all
  - You can’t drag to select text on screen
More recently, we also see many hybrid PDFs, for example:
- Only the cover is an image
- The inner pages are text PDFs
2-2. First, Add a “Text PDF or Not” Check
On the Laravel side, a robust flow looks something like this:
- User uploads a PDF
- Use spatie/pdf-to-text or similar to attempt text extraction
- Judge based on the length and content of the extracted text
  - If there’s a certain amount of text → treat as a “text PDF”
  - If it’s almost empty → treat as an “image PDF” and route to OCR
use Spatie\PdfToText\Pdf;

$text = Pdf::getText(storage_path('app/'.$path));

if (mb_strlen(trim($text)) < 50) {
    // Very few characters → high likelihood it’s a scanned PDF
    // → route to OCR
} else {
    // Handle as a text PDF
}
spatie/pdf-to-text uses the pdftotext command internally,
and is a de-facto standard in Laravel tutorials on PDF text extraction.
3. Approach ①: pdftotext (for Text PDFs)
3-1. How It Works and the Laravel Standard
pdftotext is a CLI tool included in the Poppler PDF library.
It parses the text objects inside a PDF and outputs plain text.
In Laravel, this package is practically a standard:
- spatie/pdf-to-text
  - A simple PHP wrapper around pdftotext
  - Frequently used in Laravel tutorials
3-2. Benefits (Competitive Advantages)
1. Near-100% accuracy for text PDFs
   - It reads the “text objects” exactly as they are embedded in the PDF, so there are essentially no recognition-based typos.
   - Unlike OCR, it’s not “guessing characters from images,” which gives it a huge advantage in accuracy.
2. Fast and cheap
   - It’s just plain text extraction, so processing is light and suitable for batch processing large numbers of PDFs.
   - As long as poppler-utils is installed on the server, licensing is straightforward and running costs are basically just server costs.
3. Fully on-premise and strong for sensitive data
- No need to send data to an external API,
so even personal or confidential documents can be processed entirely on an in-house server.
3-3. Drawbacks and Limits
- It returns nothing for image PDFs (almost zero characters)
- Layout info such as tables and multi-column text is easily broken
- Tables become just line breaks and spaces
- Two-column papers tend to have intermingled text
In other words:
For text PDFs it’s “fast and accurate,”
but it doesn’t take care of layout or structure.
That’s essentially its role.
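If layout matters, pdftotext itself has a -layout flag that keeps the physical layout of each page, and spatie/pdf-to-text can forward such options. Below is a minimal sketch of that; check the option names against the package version you actually use.

use Spatie\PdfToText\Pdf;

// Forward pdftotext's -layout flag (option names are passed without the dash)
// so that columns and table spacing stay closer to the on-screen layout.
$text = (new Pdf())
    ->setPdf(storage_path('app/'.$path))
    ->setOptions(['layout'])
    ->text();

Even with -layout, complex tables usually still need a later structuring step; this only mitigates the problem.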
4. Approach ②: OCR (for Image / Scanned PDFs)
4-1. OCR vs Text Extraction
- Text extraction (pdftotext, etc.)
  - Reads “text objects” already present inside the PDF
  - If no text is embedded, nothing can be extracted
- OCR (Optical Character Recognition)
  - Looks at the image pixels and infers the characters drawn there
  - Technology that “reconstructs” text from scanned PDFs or photos
In data-extraction-related articles, you’ll often see the distinction:
OCR = role of “turning images into text”
Text extraction = role of “picking out info from text already there”
4-2. OCR Engines Often Used with Laravel
1. Tesseract OCR
   - A battle-tested open-source OCR engine
   - Supports many languages, including Japanese
   - From PHP you typically use wrappers like:
     - thiagoalessio/tesseract_ocr (PHP wrapper)
     - LaraOCR (Laravel-oriented wrapper)
2. Examples of Laravel integration packages
   - NilGems/laravel-textract
     - A unified package that uses Tesseract for images and pdftotext for PDFs
   - (Newer example) laravelsmartocr/laravel-smart-ocr
     - Advertises OCR + AI cleansing, templates, etc.
4-3. Benefits of OCR (Competitive Advantages)
1. Can read scanned PDFs and photos
- This is a world where “without OCR, you’re stuck.”
- Paper contracts scanned into PDFs, photos of receipts, fax PDFs —
these still appear frequently in real-world workflows.
2. Can somewhat preserve layout
   - Depending on preprocessing and settings, it may output text that preserves columns and table structure to some extent.
   - But it’s not perfect, so for tables and forms it’s safer to plan to refine the output with AI or a dedicated parser later.
4-4. Drawbacks and Caveats of OCR
1. You’ll never get 100% perfect text
- Accuracy varies with resolution, fonts, skew, noise, etc.
- Mistakes in numbers and symbols are inevitable, so for critical fields like
amounts or IDs you either need human review or dual checks.
2. Heavy, slow, and potentially costly
   - The flow is PDF page → image → OCR, so it’s much heavier than pdftotext.
   - Higher resolution (300–400 dpi) improves accuracy but increases server load and processing time.
3. Pure OCR does not understand “meaning”
- OCR is only “transcribing characters.”
- It does not know which part is an invoice number or a date.
That “understanding” piece is where
generative AI or cloud document-analysis services
(Textract / Document AI / Form Recognizer, etc.) come in.
5. Approach ③: Where Does Generative AI (LLM) Fit In?
5-1. Role Split Between OCR and AI
Recent writeups often summarize it as:
OCR is the technology to “read characters.”
AI (LLM) is the technology to “understand and organize meaning.”
- OCR: turns images into raw strings like “A”, “B”, “3,000”
- LLM: takes that text and decides things like
- This is the invoice total
- This is a date
- This invoice is from company X
…then outputs structured data like JSON.
NVIDIA and various blogs also say in practice:
For PDF extraction, the realistic approach is a combination
of OCR + layout analysis + LLM.
5-2. Strengths of Generative AI
1. Robust against OCR noise
Even with some typos,
it can often interpret the intended meaning from context.
2. Excellent at structuring
Tasks like:
- “Is this PDF an invoice?”
- “Extract supplier name, bank account, total amount, and due date.”
tend to be handled more flexibly by LLMs than by rule-based systems.
3. Handles semi-structured / unstructured docs well
For documents without rigid templates, such as:
- CVs / résumés
- Meeting minutes
- Long contracts
you can still do things like “summarize,” “extract key clauses,” etc.
5-3. Weaknesses of Generative AI and What to Watch For
1. Hallucinations (plausible but incorrect answers)
- It may “fill in” details that aren’t actually in the original PDF.
- Where accuracy is crucial, you need safeguards like:
- Prompting: “Do NOT hypothesize anything outside the source text,” and
- Cross-checking extracted values against the original text.
2. Cost and latency
- Throwing a large PDF directly into an LLM
explodes the token count, making it very heavy in cost and time.
3. Risky as the “primary extraction source”
   - A workflow like “PDF → straight into the LLM, let the LLM handle all of the OCR and reading” currently tends to miss text and misread things.
   - As the base text-extraction layer, this is still too risky.
So the realistic separation is:
Character extraction: pdftotext / OCR
Meaning and structure: LLM
This layered design makes sense from both an accuracy and cost perspective.
6. Best-Practice Architecture in Laravel (Implementation Sketch)
From here, let’s imagine a common set of requirements
and sketch out how the Laravel app could be structured.
6-1. Overall Architecture
- Upload & metadata storage
  - pdfs table: file path, status, page count, type (text/image/hybrid), etc.
- Extraction jobs (queued)
  - DetectPdfTypeJob: do a trial read with spatie/pdf-to-text and decide the type
  - ExtractPdfTextJob:
    - Text PDFs → use pdftotext to extract all pages
    - Image PDFs → render each page as an image → OCR per page
- AI structuring job
  - AnalyzePdfContentJob: feed the extracted text to an LLM and, for invoices, get JSON like { supplier, total_amount, due_date, invoice_number }
- Review / admin UI
  - Screen where operators can review and correct extraction results
  - For critical fields (amounts, dates, etc.), assume human review
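As a rough sketch of how these queued jobs could be wired together (the job class names are just the placeholders used in this article, not a ready-made package), Laravel’s job chaining fits this pipeline naturally:

use App\Jobs\DetectPdfTypeJob;
use App\Jobs\ExtractPdfTextJob;
use App\Jobs\AnalyzePdfContentJob;
use Illuminate\Support\Facades\Bus;

// After the upload has been stored and the `pdfs` row ($pdf) created:
Bus::chain([
    new DetectPdfTypeJob($pdf),     // trial read → set type to text / image / hybrid
    new ExtractPdfTextJob($pdf),    // pdftotext or page-by-page OCR depending on type
    new AnalyzePdfContentJob($pdf), // LLM structuring into JSON fields
])->dispatch();

Each job in the chain only runs if the previous one succeeded, which keeps failures (e.g., OCR errors) visible per stage.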
6-2. Using pdftotext (spatie/pdf-to-text)
# Install poppler-utils on the server (Ubuntu example)
apt install poppler-utils
# Add the package to your Laravel project
composer require spatie/pdf-to-text
use Spatie\PdfToText\Pdf;
$pdfPath = storage_path('app/'.$pdf->path);
// Extract as a single string
$text = Pdf::getText($pdfPath);
// If you want to process per page, either split the PDF (pdftk, etc.)
// or leverage pdftotext options for page ranges.
Laravel tutorials often introduce
almost exactly this setup.
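For per-page processing, two lightweight options are sketched below: forwarding pdftotext’s -f / -l page-range options through setOptions(), or splitting the full output on the form-feed character that pdftotext inserts between pages by default. Verify both against your installed Poppler and package versions.

use Spatie\PdfToText\Pdf;

// Option A: extract only pages 1–3 via pdftotext's -f (first) / -l (last) options.
$firstPages = (new Pdf())
    ->setPdf($pdfPath)
    ->setOptions(['f 1', 'l 3'])
    ->text();

// Option B: pdftotext separates pages with form feeds ("\f") by default,
// so the full output can be split into per-page chunks.
$pages = explode("\f", Pdf::getText($pdfPath));

foreach ($pages as $pageNumber => $pageText) {
    // e.g., store trimmed per-page text in a (hypothetical) pdf_pages table
}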
6-3. Using OCR (Tesseract)
1) Tesseract + PHP wrapper
# Tesseract itself
apt install tesseract-ocr tesseract-ocr-jpn
composer require thiagoalessio/tesseract_ocr
use thiagoalessio\TesseractOCR\TesseractOCR;
$imgPath = storage_path('app/pages/page-1.png');
$text = (new TesseractOCR($imgPath))
    ->lang('jpn', 'eng')
    ->psm(3) // page segmentation mode
    ->run();
Using Tesseract from Laravel like this
is a standard pattern, introduced in LaraOCR and various blogs/Q&A.
2) Converting PDF to images
The best-practice flow is:
- Convert each PDF page to an image (e.g., 300 dpi) via Imagick or Ghostscript
- Run Tesseract on each image to extract the text
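Here is a minimal sketch of the PDF-to-image step using the Imagick PHP extension (which relies on Ghostscript under the hood for PDFs); the output paths and the 300 dpi value are examples only.

$pdfPath = storage_path('app/'.$pdf->path);

$imagick = new Imagick();
$imagick->setResolution(300, 300);   // must be set before reading the PDF
$imagick->readImage($pdfPath);       // loads every page as a separate frame

foreach ($imagick as $index => $page) {
    $page->setImageFormat('png');
    $page->writeImage(storage_path("app/pages/page-{$index}.png"));
}

$imagick->clear();

Each resulting PNG can then be fed to the TesseractOCR wrapper shown above.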
6-4. Structuring with Generative AI (Example: Invoice)
Once you’ve stored the extracted text in the DB,
you pass “text + extraction format” into the LLM.
$prompt = <<<EOT
You are an assistant for extracting data from invoices.
From the given text, extract the following fields in JSON.
- supplier_name: name of the issuing company
- invoice_number: invoice number
- issue_date: issue date (YYYY-MM-DD)
- due_date: payment due date (YYYY-MM-DD)
- total_amount: total amount (numbers only)
Notes:
- If you cannot determine a field, set it to null.
- Do NOT assume or infer values that do not appear in the original text.
- Respond with JSON only.
=== TEXT ===
{$plainText}
EOT;
By explicitly saying “no guessing” and “unknown → null,”
you can significantly reduce hallucinations.
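How you actually call the LLM depends on your provider; the sketch below assumes an OpenAI-compatible chat completions endpoint called via Laravel’s Http client (the model name and the services.openai.key config entry are placeholders), then parses and sanity-checks the JSON response.

use Illuminate\Support\Facades\Http;

$response = Http::withToken(config('services.openai.key'))
    ->post('https://api.openai.com/v1/chat/completions', [
        'model' => 'gpt-4o-mini',   // example model name
        'temperature' => 0,
        'messages' => [
            ['role' => 'user', 'content' => $prompt],
        ],
    ]);

$fields = json_decode($response->json('choices.0.message.content') ?? '', true) ?: [];

// Guard against hallucinated values: only keep a total_amount whose digits
// actually appear in the extracted text; otherwise route it to human review.
$totalAmount = $fields['total_amount'] ?? null;
if ($totalAmount !== null &&
    !str_contains(preg_replace('/[,\s]/u', '', $plainText), (string) $totalAmount)) {
    $totalAmount = null;
}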
7. Best Combinations by Use Case
7-1. CV / Résumé Search and Auto-Tagging
- Most are text-embedded PDFs exported from Word, etc.
- Layouts vary, and content is free-form.
Recommended setup
- Extraction: pdftotext (spatie/pdf-to-text)
- Structuring: use an LLM to extract
  - work history list
  - skills list
  - desired work location, etc.
- Search: feed extracted fields + full text into Elasticsearch / Meilisearch (see the Scout sketch below)
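If the search layer is Laravel Scout with Meilisearch, a sketch of indexing both the LLM-extracted fields and the raw pdftotext output could look like this (the Candidate model and its column names are hypothetical):

use Illuminate\Database\Eloquent\Model;
use Laravel\Scout\Searchable;

class Candidate extends Model
{
    use Searchable;

    public function toSearchableArray(): array
    {
        return [
            'name'             => $this->name,
            'skills'           => $this->skills,           // e.g., array extracted by the LLM
            'work_history'     => $this->work_history,
            'desired_location' => $this->desired_location,
            'full_text'        => $this->extracted_text,   // raw pdftotext output
        ];
    }
}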
7-2. Automatic Processing of Invoices, Quotes, Receipts
- Scanned PDFs may be mixed in
- Many “one-character mistakes are unacceptable” fields (amounts, dates, etc.)
Recommended setup
- Extraction:
  - Text PDFs → pdftotext
  - Image PDFs → Tesseract OCR or a cloud OCR (Textract / Document AI / Form Recognizer)
- Structuring:
  - Use an LLM to output JSON
  - For critical fields (amounts, dates, bank info), use
    - human double-checks, and/or
    - dual checks against regex/rule-based extraction (see the sketch below)
If accuracy requirements are strict,
cloud IDP (Intelligent Document Processing) solutions can be worth considering instead of pure Tesseract.
Since they combine OCR + layout analysis + ML/LLM as a
“document-specific AI,” they’re often reported to be more accurate and stable than plain OCR.
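As a sketch of the dual-check idea above: extract the total amount a second time with a simple rule, compare it with the LLM output, and route mismatches to human review. The regex, field names, and helper below are examples only, not a general-purpose parser, and $plainText / $fields refer to the extraction and LLM steps shown earlier.

// Hypothetical rule-based extraction, used only to cross-check the LLM output.
function extractTotalAmountByRule(string $text): ?int
{
    // Matches patterns like "Total: 30,000" or "合計 ¥30,000" (example pattern only).
    if (preg_match('/(?:total|合計)[^0-9]{0,10}([0-9,]+)/iu', $text, $m)) {
        return (int) str_replace(',', '', $m[1]);
    }

    return null;
}

$ruleAmount = extractTotalAmountByRule($plainText);
$llmAmount  = $fields['total_amount'] ?? null;

if ($ruleAmount !== null && $llmAmount !== null && $ruleAmount !== (int) $llmAmount) {
    // The two extraction paths disagree → flag this invoice for human review.
}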
7-3. Summarizing and Turning Contracts / Terms / Reports into Knowledge
- Mostly text PDFs
- Goal is capturing the gist, not fully structuring everything
Recommended setup
- Extraction: pdftotext (for text PDFs)
- Structuring: use an LLM for
  - summarization
  - clause classification
  - listing risk items
- Optionally store the full text in a vector store for RAG search (see the embedding sketch below)
In this case, the priority is less
“every single character must be perfect” and more
“no critical points are missed,”
so it’s a good idea to have humans sample and review the summaries.
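For the optional RAG step, here is a rough sketch of chunking the extracted text and fetching embeddings via an OpenAI-compatible embeddings endpoint; the endpoint, model name, and chunk size are examples, so swap in whatever provider and vector store you actually use.

use Illuminate\Support\Facades\Http;

// Naive fixed-size chunking; smarter splitting (by heading or clause) usually works better.
$chunks = array_filter(array_map('trim', mb_str_split($plainText, 2000)));

foreach ($chunks as $chunk) {
    $embedding = Http::withToken(config('services.openai.key'))
        ->post('https://api.openai.com/v1/embeddings', [
            'model' => 'text-embedding-3-small',   // example model name
            'input' => $chunk,
        ])
        ->json('data.0.embedding');

    // Store $chunk and $embedding in your vector store (pgvector, Meilisearch, etc.).
}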
8. Operational Tips for Maximizing Accuracy
Finally, here are some key points
if you truly care about accuracy.
- Always store “original PDF → extracted text → structured data”
  - So you can always run diff checks later
- Automate OCR quality checks
  - For example:
    - Are full-width and half-width digits mixed?
    - Do amount fields contain anything other than digits, commas, or dots?
  - Use rule-based validation to flag “suspicious” entries for human review (a minimal sketch follows this list)
- Limit LLM input to “what’s really needed”
  - Send text by page or section
  - Strip redundant headers/footers beforehand
  - This stabilizes both cost and accuracy
- Treat the role of generative AI as “organizing, summarizing, and extracting”
  - If you try to make it a full replacement for OCR, it becomes very hard to validate what it missed or misread.
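A minimal sketch of the rule-based validation mentioned above, applied to an amount field coming out of OCR (the rules and the $ocrAmountField variable are examples; tune them to your documents):

function looksSuspiciousAmount(string $raw): bool
{
    // Normalize full-width digits (０-９) to half-width before checking.
    $normalized = mb_convert_kana($raw, 'n');

    // Anything other than digits, commas, and dots is suspicious for an amount field.
    return preg_match('/^[0-9,.]+$/', trim($normalized)) !== 1;
}

if (looksSuspiciousAmount($ocrAmountField)) {
    // Flag the record for human review instead of importing it automatically.
}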
9. Summary
- The best practice is a three-layer approach
  - Determine the PDF type
  - First-pass extraction via pdftotext / OCR
  - LLM for understanding and structuring
- pdftotext is ideal for text PDFs
  - Fast, accurate, and cheap; spatie/pdf-to-text is the de-facto standard in Laravel
- OCR is essential for image PDFs
  - Tesseract + Laravel wrappers (LaraOCR, etc.)
  - For critical documents, consider cloud IDP solutions as well
- Generative AI’s job is not “reading characters” but “understanding meaning”
  - It shines at field extraction, classification, summarization, and error correction
- The keys to higher accuracy are “layer separation” and a “validation flow”
  - Separate the extraction layer and the AI layer
  - Use machine + human dual checks for critical fields
If you design with this mindset,
you’re less likely to run into “not accurate enough” or “too expensive”
when adding PDF ingestion features to your Laravel application.
Reference Links (English / Japanese)
- spatie/pdf-to-text – PHP library for extracting text from PDFs (uses pdftotext)
- How to use pdftotext with Laravel to read text from PDFs (Japanese article)
- Tutorial: Reading content from PDF files in Laravel 12 using spatie/pdf-to-text
- LaraOCR – Tesseract OCR wrapper for Laravel
- thiagoalessio/tesseract_ocr – PHP wrapper for Tesseract
- NilGems/laravel-textract – Unified Laravel text-extraction package using pdftotext + Tesseract
- Overview of Tesseract OCR and using it from PHP (Japanese article)
- Difference between data extraction, OCR, and IDP (Intelligent Document Processing)
- Document data extraction in 2025: LLMs vs OCRs
