How do you extract data from documents, receipts, and forms with AI?
Use OCR to turn the image into text, then use a language model to pull named fields into a structured record. The hard part isn’t reading the words; it’s handling messy layouts, validating numbers, and routing the result into the system that needs it. Done by hand, it’s slow and error-prone. Modern AI gets you roughly 90% of the way on clean documents, and the last 10% is where the work is.
Every business runs on documents it didn’t design. A leasing agent gets a pay stub as a phone photo. A clinic gets an intake form scanned at an angle. A logistics coordinator gets a bill of lading as a faxed PDF. An insurance agency gets an application where half the fields are typed and half are scrawled in the margins. Someone has to turn all of that into clean rows in a system, and that someone is usually a person retyping it by hand.
The work is hard for a boring reason: documents are inconsistent. The same field sits in a different place on every vendor’s form, half the scans are crooked, and the numbers have to be exactly right or you get a billing dispute. AI has gotten genuinely good at the reading part. It’s the rest that decides whether the result is usable.
What actually decides the outcome
Whether extraction works on your documents comes down to a few things, in roughly this order.
- Document quality. A digital PDF born from software extracts almost perfectly. A clean scan does well. A photo taken at an angle in bad light, or a third-generation fax, is where errors creep in. Garbage in stays garbage.
- Layout variability. If every invoice comes from the same template, this is easy. If you take bills of lading from forty carriers, each with its own format, the model has to find fields by meaning rather than position. That’s the difference between a positional rule and real extraction.
- Field criticality. A typo in a customer’s middle name is annoying. A typo in a dollar amount, a policy number, or a ship date causes a downstream failure. The fields that matter most are usually the numbers, and numbers are exactly where you need validation, not just reading.
- Where it has to land. Pulling fields out is half the job. Getting them into Epic, Salesforce, an Oracle transport system, or a dashboard, mapped to the right columns in the right format, is the other half, and it’s often the harder half.
- What you do with uncertainty. Good extraction returns a confidence score per field. The teams who get this right don’t aim for a perfect single pass. They auto-accept high-confidence fields and send the low-confidence ones to a human. That’s how you get speed without silent errors.
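The auto-accept-or-review split in the last point can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: the `Field` shape, the field names, and the threshold values are all assumptions chosen for the example, and a real extractor will report confidence in its own format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Field:
    name: str
    value: Optional[str]
    confidence: float  # 0.0 to 1.0, as reported by the extractor (assumed scale)


# Hypothetical cutoffs: critical fields get a higher bar than the default.
DEFAULT_THRESHOLD = 0.90
CRITICAL_THRESHOLDS = {"total_amount": 0.98, "policy_number": 0.98}


def route(fields):
    """Split fields into auto-accepted ones and ones a person should check."""
    accepted, review = [], []
    for f in fields:
        threshold = CRITICAL_THRESHOLDS.get(f.name, DEFAULT_THRESHOLD)
        if f.confidence >= threshold:
            accepted.append(f)
        else:
            review.append(f)
    return accepted, review


# Example: the dollar amount scores 0.93, which passes the default bar
# but not the stricter bar for critical fields, so it goes to review.
fields = [
    Field("customer_name", "Dana Reyes", 0.95),
    Field("total_amount", "1,204.50", 0.93),
]
accepted, review = route(fields)
```

Setting a stricter threshold for the fields that cause downstream failures is one way to reflect field criticality; where to put the cutoffs is something to tune against your own documents.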
How to do it by hand
You can build a workable extraction pass yourself with free or cheap tools. The honest steps:
- Get clean text first. Run the document through OCR (Tesseract is free; Google Document AI, Azure AI Document Intelligence, and Amazon Textract are paid and better on messy scans). Digital PDFs may already have a text layer you can pull directly. Straighten and crop crooked photos before this step; it helps more than anything else.
- Define the fields you actually need. Write them down explicitly: invoice number, gross pay, date of birth, container number, whatever your case is. Vague requests get vague results.
- Ask a language model to extract them. Paste the OCR text and ask for the named fields back as structured data. Tell it to return null when a field is missing rather than guessing.
- Validate the output. Check formats: does the date parse, is the total a number, does the policy number match the expected pattern. This catches many of the errors OCR introduces, though not a wrong digit that still parses.
- Map and load. Put the fields into the right columns of your destination system. For one document this is copy-paste. For a stack of them, this is where a person’s afternoon goes.
For a handful of documents, this is fine. The pain shows up at volume and variety.
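The define, ask, and validate steps above might look something like this in Python. It is a sketch under stated assumptions: the field names, the date format, and the policy-number pattern are invented for illustration, and the call to the language model itself is left out because it depends on the provider you use. Only the prompt going in and the reply coming out are shown.

```python
import json
import re
from datetime import datetime
from decimal import Decimal, InvalidOperation

# Hypothetical field list; write down whatever your own case needs.
FIELDS = ["invoice_number", "invoice_date", "total", "policy_number"]

# Assumed policy-number format (two letters, dash, six digits).
POLICY_PATTERN = re.compile(r"^[A-Z]{2}-\d{6}$")


def build_prompt(ocr_text):
    """Ask for exactly the named fields as JSON, with null for anything missing."""
    return (
        "Extract these fields from the document text below and reply with "
        "a single JSON object using exactly these keys: "
        + ", ".join(FIELDS)
        + ". Use null for any field that is not present; do not guess.\n\n"
        + ocr_text
    )


def parse_reply(reply):
    """Keep only the requested keys; anything absent becomes None."""
    data = json.loads(reply)
    return {name: data.get(name) for name in FIELDS}


def validate(record):
    """Return {field: problem} for every non-null value that fails a format check."""
    problems = {}
    if record["invoice_date"] is not None:
        try:
            datetime.strptime(str(record["invoice_date"]), "%Y-%m-%d")
        except ValueError:
            problems["invoice_date"] = "not a valid YYYY-MM-DD date"
    if record["total"] is not None:
        try:
            Decimal(str(record["total"]).replace(",", ""))
        except InvalidOperation:
            problems["total"] = "not a number"
    policy = record["policy_number"]
    if policy is not None and not POLICY_PATTERN.match(str(policy)):
        problems["policy_number"] = "does not match the expected pattern"
    return problems


# Example reply as a model might return it: an impossible date, and a
# letter O where OCR misread a zero in the total. No policy number present.
reply = '{"invoice_number": "INV-1042", "invoice_date": "2024-02-30", "total": "1,2O4.50"}'
record = parse_reply(reply)
problems = validate(record)
```

Here `record["policy_number"]` comes back as `None` rather than an invented value, and `problems` flags the date and the total. In practice a model may wrap its JSON in extra text, so `parse_reply` would need to tolerate that before calling `json.loads`.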
Where it goes wrong
- Trusting the first pass. Models will confidently return a wrong number. Without per-field confidence and a validation check, you won’t know which ones to doubt.
- Hallucinated fields. Ask for a field that isn’t on the page and a model may invent a plausible value. Always allow and expect nulls.
- Tables and line items. Pulling one header value is easy. Pulling thirty line items from a freight manifest or an itemized receipt, in order, without dropping rows, is where naive approaches fall apart.
- The mapping tax. Even with perfect extraction, loading data into a strict system like an EHR or a CRM means matching field names, formats, and required-field rules. This quietly eats the time you thought you saved.
- Regulated data in the wrong place. Pasting pay stubs or patient forms into a consumer chatbot is a compliance problem, not a workflow.
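For the line-item problem in particular, one inexpensive safeguard is to reconcile the extracted rows against numbers printed elsewhere on the document. The sketch below assumes each row carries an `amount` string and that the document states a total (and optionally a row count) to compare against; both the function name and the row shape are hypothetical.

```python
from decimal import Decimal


def reconcile(line_items, stated_total, expected_count=None):
    """Return a list of problems; an empty list means the rows add up."""
    problems = []
    if expected_count is not None and len(line_items) != expected_count:
        problems.append(f"expected {expected_count} rows, got {len(line_items)}")
    # Decimal avoids the rounding drift that floats introduce with money.
    computed = sum((Decimal(item["amount"]) for item in line_items), Decimal("0"))
    if computed != Decimal(stated_total):
        problems.append(f"line items sum to {computed}, document says {stated_total}")
    return problems


items = [
    {"description": "Pallet wrap", "amount": "40.00"},
    {"description": "Freight", "amount": "310.25"},
]
complete = reconcile(items, "350.25", expected_count=2)
dropped = reconcile(items[:1], "350.25", expected_count=2)
```

With both rows present, `complete` is empty; with one row dropped, `dropped` reports both the row count and the sum mismatch. A check like this only works when the stated total covers exactly the extracted rows, so documents with separate tax or fee lines would need those included in the comparison.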
Doing it yourself vs. handing it to Physea
By hand, you can get the reading and the extraction working. What stays manual is the chain around it: pulling the file from wherever it arrived, running OCR, extracting, validating, mapping to your system’s exact fields, and loading it, every time, for every document. That chain is the actual chore.
Physea’s Liminality runs that whole route end to end over MCP, across the tools you already use. It reaches your documents where they live, extracts and checks the fields, and lands them in the system that needs them, grounded in your specifics and reused so the second batch is faster than the first. You connect your tools and get the finished record back, not the data-entry shift.
Common questions
- Can AI read handwritten and scanned documents accurately?
  Printed and digital PDFs extract reliably. Clean handwriting and good scans do well; faded fax pages, skewed photos, and cramped handwriting are where accuracy drops. The practical move is to keep a confidence score on every field and route low-confidence ones to a human instead of trusting a single pass blindly. Physea can run that extract-then-verify loop across your own documents.
- Is it safe to send pay stubs, medical intake, or insurance PDFs to an AI tool?
  Only if the data is handled correctly. For pay stubs, patient demographics, or insurance applications you want a tool that processes inside an environment you control, doesn’t train on your files, and keeps an audit trail. Consumer chatbots are the wrong home for regulated data. Physea processes your documents through connections you authorize rather than pasting into a public tool.
- What’s the difference between OCR and AI extraction?
  OCR converts an image of text into raw characters. It gives you a wall of text, not a record. AI extraction is the next step: it reads that text and decides which string is the invoice number, which is the ship date, which is the gross pay, and returns named fields you can file. You usually need both, OCR first and a model second.