
What AI Vendors Miss About Data Preparation

Michael Goldman

Every enterprise AI conversation eventually arrives at the same place: the data. Not the model architecture, not the inference speed, not the user interface. The data. And yet, in most vendor presentations and product demos, data preparation receives roughly the same attention as the fine print on a software license agreement.

The Sprinklenet team has deployed AI systems across commercial enterprises ranging from startups to large regulated organizations. The single most consistent pattern across all of these engagements is this: the work that matters most is the work that gets discussed least. This post is about that work.

The Hidden 80 Percent

Across the enterprise AI projects the Sprinklenet team has delivered, somewhere between 60 and 80 percent of the total effort goes into data preparation. Not model training. Not prompt engineering. Not fine-tuning. Data preparation.

This ratio surprises many executives who have watched polished demos where a chatbot answers questions about internal documents in real time. What those demos do not show is the weeks of work that preceded them: cataloging data sources, resolving format inconsistencies, handling duplicates, designing chunking strategies, building ingestion pipelines, and validating output quality against known-good answers.

This is not a sign that something has gone wrong. It is the normal, expected shape of a well-run AI project. Organizations that understand this from the outset plan better, budget more accurately, and reach production faster than those that treat data preparation as a preliminary step to rush through.

The challenge is that many vendor conversations skip this part entirely. The pitch moves from “we connect to your data” straight to “and here are your AI-powered insights.” The gap between those two statements is where most of the real engineering lives.

What Data Preparation Actually Involves

Data preparation is not a single activity. It is a sequence of distinct disciplines, each with its own complexity. Here is what the Sprinklenet team typically works through with clients:

Discovery

Before anything else, the team needs to know where data lives. In most organizations, the answer is “everywhere.” SharePoint sites, shared drives, legacy databases, email archives, PDF repositories, ticketing systems, wikis, and often a handful of spreadsheets that one person maintains and everyone depends on. Discovery is the process of building a complete map of these sources, understanding their volume, and identifying which ones are relevant to the AI use case at hand.

This phase alone can take weeks in large organizations. As the Sprinklenet team described in “Gold in the Basement,” many agencies are sitting on enormously valuable data assets that have never been inventoried, let alone made accessible to modern tools.
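For file-based sources, a first pass at this map can be scripted. The sketch below is illustrative only and is not the Sprinklenet team's tooling: the function name `inventory` and the demo file names are assumptions, and it covers only a local or mounted file tree, not SharePoint, databases, or ticketing systems.

```python
import os
import tempfile
from collections import defaultdict

def inventory(root):
    """Count files and total bytes per file extension under root."""
    summary = defaultdict(lambda: {"files": 0, "bytes": 0})
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            ext = os.path.splitext(name)[1].lower() or "(none)"
            path = os.path.join(dirpath, name)
            try:
                size = os.path.getsize(path)
            except OSError:
                continue  # unreadable or vanished file: skip rather than abort
            summary[ext]["files"] += 1
            summary[ext]["bytes"] += size
    return dict(summary)

# Demo on a throwaway directory with made-up files.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "archive"))
    for rel, body in [("plan.PDF", b"abc"), ("archive/old.pdf", b"de"), ("README", b"x")]:
        with open(os.path.join(root, rel), "wb") as handle:
            handle.write(body)
    report = inventory(root)
```

A summary like this says nothing about relevance, but it gives the volume picture that the rest of discovery builds on.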

Assessment

Once data sources are identified, the next step is understanding their quality. How complete is the data? How consistent are the formats? Are there fields that are supposed to be populated but routinely are not? Are dates stored as strings in six different formats? Is the same entity referred to by different names in different systems?

Assessment produces a realistic picture of what the data can and cannot support. It is far better to discover quality issues here than after a model has been trained on flawed inputs.
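Two of the questions above, field completeness and mixed date formats, lend themselves to a quick profile. This is a minimal sketch under stated assumptions: the candidate format list, the field names, and the sample records are all hypothetical, and a real assessment would derive the format list from the data itself.

```python
from collections import Counter
from datetime import datetime

# Hypothetical candidate formats, tried in order.
DATE_FORMATS = ["%Y-%m-%d", "%m/%d/%Y", "%d %b %Y", "%Y%m%d"]

def detect_date_format(value):
    """Return the first candidate format that parses value, or None."""
    for fmt in DATE_FORMATS:
        try:
            datetime.strptime(value, fmt)
            return fmt
        except ValueError:
            continue
    return None

def profile(records, required_fields, date_field):
    """Report the filled fraction per required field and the date formats seen."""
    total = len(records)
    filled = Counter()
    formats = Counter()
    for rec in records:
        for field in required_fields:
            # Treats None, empty, and whitespace-only values as missing.
            if str(rec.get(field) or "").strip():
                filled[field] += 1
        raw = str(rec.get(date_field) or "").strip()
        if raw:
            formats[detect_date_format(raw) or "unrecognized"] += 1
    completeness = {f: filled[f] / total for f in required_fields} if total else {}
    return completeness, dict(formats)

records = [
    {"id": "1", "owner": "Finance", "modified": "2024-03-01"},
    {"id": "2", "owner": "", "modified": "03/01/2024"},
    {"id": "3", "owner": "HR", "modified": "1 Mar 2024"},
    {"id": "4", "owner": None, "modified": "sometime in March"},
]
completeness, formats = profile(records, ["id", "owner"], "modified")
```

On this sample, `owner` is filled in half the records and the four dates fall into four different buckets, one of them unrecognized. That is the kind of finding an assessment should surface before any modeling starts.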

Cleaning

With the assessment complete, the work of cleaning begins. This includes deduplication, normalization of values and formats, handling missing data (through imputation, flagging, or exclusion), resolving encoding issues, and stripping artifacts from document conversions. In document-heavy environments like federal agencies, PDF extraction quality is a major factor. A PDF that renders beautifully on screen may produce garbled text when parsed programmatically, especially if it was scanned rather than digitally created.
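As a rough illustration of the normalization and deduplication steps, the sketch below uses only the Python standard library. It is an assumption-laden example rather than a complete cleaning stage: it catches only documents that are identical after normalization, and near-duplicates, OCR errors, and missing-value handling would need separate treatment.

```python
import hashlib
import re
import unicodedata

def normalize_text(text):
    """Normalize Unicode, drop stray control characters, collapse whitespace."""
    text = unicodedata.normalize("NFKC", text)
    # Control characters (category "Cc") other than newline and tab often
    # survive document conversions, for example form feeds between pages.
    text = "".join(
        ch for ch in text if ch in "\n\t" or unicodedata.category(ch) != "Cc"
    )
    return re.sub(r"\s+", " ", text).strip()

def dedupe(documents):
    """Keep the first document for each distinct normalized, case-folded body."""
    seen = set()
    unique = []
    for doc in documents:
        cleaned = normalize_text(doc["text"])
        digest = hashlib.sha256(cleaned.casefold().encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        unique.append({**doc, "text": cleaned})
    return unique

# Made-up documents: "b" duplicates "a" apart from case and spacing.
docs = [
    {"id": "a", "text": "Travel  Policy\u00a0v2\n\nAll trips require approval."},
    {"id": "b", "text": "travel policy v2 all trips require approval."},
    {"id": "c", "text": "Procurement thresholds\x0c for small purchases."},
]
cleaned = dedupe(docs)
```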

Structuring

Clean data still needs to be structured for the specific AI application. For retrieval-augmented generation (RAG) systems, this means designing a chunking strategy, defining schemas, and establishing how relationships between data elements will be preserved. For analytical AI, it means building feature sets and ensuring temporal consistency. For classification systems, it means creating or validating label taxonomies.

Pipeline Engineering

Finally, all of the above needs to be automated. Production AI systems cannot rely on one-time data loads. They need pipelines that ingest new data, apply the same cleaning and structuring logic, handle errors gracefully, and keep the AI system’s knowledge base current. This is software engineering, and it requires the same rigor as any other production system.
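One way to picture such a pipeline is an incremental ingestion step that reprocesses only new or changed sources and records failures instead of halting. The sketch below is a simplified illustration: the in-memory `index` dict, the `ingest` function, and the `split_paragraphs` stand-in are hypothetical, and a production pipeline would add persistence, retries, and alerting.

```python
import hashlib

def ingest(sources, index, process):
    """Process new or changed sources, skip unchanged ones, collect failures.

    sources: dict mapping source id to raw text
    index:   dict mapping source id to {"hash", "chunks"}; updated in place
    process: callable that turns raw text into a list of chunks
    """
    report = {"updated": [], "skipped": [], "failed": []}
    for source_id, raw in sources.items():
        digest = hashlib.sha256(raw.encode("utf-8")).hexdigest()
        if index.get(source_id, {}).get("hash") == digest:
            report["skipped"].append(source_id)
            continue
        try:
            chunks = process(raw)
        except Exception as exc:  # one bad document should not stop the run
            report["failed"].append((source_id, str(exc)))
            continue
        index[source_id] = {"hash": digest, "chunks": chunks}
        report["updated"].append(source_id)
    return report

def split_paragraphs(raw):
    """Stand-in for the real cleaning and structuring logic."""
    if not raw.strip():
        raise ValueError("empty document")
    return [p.strip() for p in raw.split("\n\n") if p.strip()]

index = {}
first = ingest({"policy": "Intro.\n\nScope.", "memo": "   "}, index, split_paragraphs)
second = ingest({"policy": "Intro.\n\nScope.", "memo": "Fixed."}, index, split_paragraphs)
```

On the second run the unchanged policy is skipped and the corrected memo is picked up, which is the behavior that keeps a knowledge base current without a full reload.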

Chunking Strategy Matters More Than Model Choice

This is the point in the conversation where the Sprinklenet team tends to get the most surprised reactions: how you split your documents for retrieval in a RAG system often has more impact on answer quality than which large language model you use.

Consider a 200-page policy manual. If you chunk it into arbitrary 500-token blocks, you will lose context at every boundary. A question about “the approval process for Category B expenditures” might pull a chunk that contains the procedure but not the definition of Category B, or vice versa. The model will either hallucinate the missing context or produce an incomplete answer.

Different data types demand different chunking strategies:

Structured databases: Chunk by logical record. Each row or entity becomes its own retrievable unit, with column headers preserved as context.
Long-form documents: Use overlapping chunks with section headers carried forward, so each chunk retains its position in the document hierarchy. A paragraph from Section 4.2.1 should arrive at the model with “Section 4: Procurement > 4.2: Thresholds > 4.2.1: Small Purchases” as context.
Forms and structured PDFs: Extract field-value pairs rather than treating the document as prose. A DD-254 has a defined structure; chunking it as free text discards that structure.
Email and correspondence: Preserve thread context, sender/recipient relationships, and temporal ordering. An email reply means nothing without the message it is responding to.
Meeting notes and transcripts: Chunk by topic or agenda item rather than by length, with speaker attribution preserved.
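Two of the strategies above, overlapping chunks that carry section headers forward and one-row-per-chunk for structured records, can be sketched briefly. This is an illustration under assumptions rather than a reference implementation: word counts stand in for tokens, the function names are hypothetical, and the breadcrumb reuses the Section 4.2.1 example from the list.

```python
def chunk_section(path, text, max_words=120, overlap=20):
    """Split one section into overlapping word windows, each carrying the
    section breadcrumb so it keeps its place in the document hierarchy."""
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be non-negative and smaller than max_words")
    breadcrumb = " > ".join(path)
    words = text.split()
    chunks = []
    start = 0
    while start < len(words):
        window = words[start:start + max_words]
        chunks.append({"context": breadcrumb, "text": " ".join(window)})
        if start + max_words >= len(words):
            break  # this window reached the end of the section
        start += max_words - overlap
    return chunks

def chunk_record(table, row):
    """Render one database row as a retrievable unit with column names kept."""
    body = "; ".join(f"{column}: {value}" for column, value in row.items())
    return {"context": table, "text": body}

path = ["Section 4: Procurement", "4.2: Thresholds", "4.2.1: Small Purchases"]
text = " ".join(f"w{i}" for i in range(10))  # placeholder words
section_chunks = chunk_section(path, text, max_words=4, overlap=1)
record_chunk = chunk_record("vendors", {"name": "Acme", "tier": "B"})
```

With a window of four words and an overlap of one, the ten placeholder words yield three chunks, each beginning with the last word of the previous chunk and each tagged with the full breadcrumb.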

The Sprinklenet team has seen cases where switching from naive fixed-length chunking to context-aware chunking improved retrieval accuracy by double-digit percentages, with no change to the underlying model. That is the kind of improvement that no amount of prompt engineering can replicate.
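Comparing chunking strategies requires a retrieval metric measured against known-good answers. One common choice is hit rate at k: the share of test questions for which at least one relevant chunk appears in the top k results. The sketch below assumes a `retrieve` callable that returns ranked chunk ids. The chunk ids and canned rankings are invented purely to show the calculation and are not measurements from any real system.

```python
def hit_rate_at_k(retrieve, labeled_queries, k=3):
    """Fraction of queries whose top-k results include a relevant chunk id."""
    if not labeled_queries:
        return 0.0
    hits = 0
    for query, relevant_ids in labeled_queries:
        top_ids = retrieve(query)[:k]
        if any(chunk_id in relevant_ids for chunk_id in top_ids):
            hits += 1
    return hits / len(labeled_queries)

# Each test question is paired with the chunk ids a reviewer marked relevant.
labeled = [
    ("approval process for Category B expenditures", {"c7"}),
    ("current travel policy", {"c2", "c3"}),
]
# Canned rankings standing in for a real retriever.
canned = {
    "approval process for Category B expenditures": ["c1", "c7", "c9"],
    "current travel policy": ["c5", "c8", "c2"],
}
score_at_3 = hit_rate_at_k(canned.get, labeled, k=3)
score_at_2 = hit_rate_at_k(canned.get, labeled, k=2)
```

Running the same labeled questions against two differently chunked indexes, with the model held fixed, is what makes a before-and-after comparison meaningful.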

The Metadata Layer

Raw text chunks, no matter how well structured, are only part of the picture. Production-grade RAG systems need a metadata layer that enriches each chunk with attributes that enable filtering, attribution, and governance.

At minimum, each chunk should carry:

Source: Which system, document, or database did this come from?
Timestamp: When was the source last modified? When was this chunk ingested?
Author or owner: Who created or maintains the source material?
Access level: Who is authorized to see this information? This is critical in environments with mixed classification or sensitivity levels.
Document type: Is this a policy, a procedure, a regulation, a memo, a contract, or something else?
Version: Is this the current version of the document, or has it been superseded?

This metadata enables capabilities that matter enormously in practice. A user asking about current travel policy should not receive an answer based on a superseded regulation. A junior analyst should not receive chunks from documents above their access level. An auditor reviewing AI-generated responses should be able to trace every claim back to its source document and verify its currency.
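A minimal sketch of how these attributes might be carried and applied at retrieval time follows. The `Chunk` fields mirror the list above, but the numeric access scale, the `is_current` flag, and the function names are assumptions for illustration, not a description of any particular platform.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class Chunk:
    text: str
    source: str
    modified: date
    ingested: date
    owner: str
    access_level: int  # hypothetical scale: higher means more restricted
    doc_type: str
    version: str
    is_current: bool   # False once the document has been superseded

def eligible(chunks, user_clearance, doc_type=None):
    """Keep chunks from current documents that the user is cleared to see."""
    return [
        c for c in chunks
        if c.is_current
        and c.access_level <= user_clearance
        and (doc_type is None or c.doc_type == doc_type)
    ]

def citation(chunk):
    """Render a source line an auditor could trace back to the document."""
    return f"{chunk.source} (v{chunk.version}, modified {chunk.modified.isoformat()})"

# Made-up chunks sharing the same source document.
base = dict(source="travel-policy.pdf", ingested=date(2024, 6, 1),
            owner="Finance", doc_type="policy")
chunks = [
    Chunk(text="Per diem follows the 2024 table.", modified=date(2024, 5, 1),
          access_level=1, version="3", is_current=True, **base),
    Chunk(text="Per diem follows the 2019 table.", modified=date(2019, 2, 1),
          access_level=1, version="2", is_current=False, **base),
    Chunk(text="Executive travel exceptions.", modified=date(2024, 5, 1),
          access_level=3, version="3", is_current=True, **base),
]
visible = eligible(chunks, user_clearance=1)
```

Filtering before the chunks reach the model, rather than asking the model to ignore what it should not use, is the design choice that makes access control and currency enforceable.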

Without metadata, an AI system can answer questions. With metadata, it can answer questions responsibly.

Data Governance as a Prerequisite

Many organizations approach AI hoping to leapfrog their existing data management challenges. The Sprinklenet team’s experience is that AI does not bypass data governance problems; it amplifies them.

Before connecting an AI system to enterprise data, several questions need clear answers:

Where does the data live? Not just which systems, but which instances, which environments, and which backup or archive tiers.
Who owns it? Data ownership in large organizations is often ambiguous. AI projects have a way of forcing clarity on this question.
How current is it? A system built on stale data will produce confidently wrong answers, which is worse than producing no answers at all.
What access controls apply? Especially in DoW and federal contexts, mixing data across authorization boundaries is not just a quality issue; it is a compliance issue.
What are the retention and disposition rules? AI systems that ingest data need to respect the same records management policies as the source systems.

The Sprinklenet team often recommends an AI data audit as a first engagement. This is a structured assessment that maps data sources, evaluates their readiness for AI consumption, identifies governance gaps, and produces a prioritized remediation roadmap. As outlined in the AI Readiness framework, understanding the current state of the data estate is the foundation everything else builds on. Running the audit up front is far less expensive than discovering these issues mid-deployment.

The Ongoing Commitment

Perhaps the most important thing vendors tend to understate: data preparation is not a project with a completion date. It is an ongoing operational commitment.

Source systems change. New document types appear. Personnel turnover means that tribal knowledge about data quirks walks out the door. Schema migrations break ingestion pipelines. Quality that was acceptable at launch degrades as edge cases accumulate.

Production AI systems need continuous attention to their data layer:

Pipeline monitoring: Automated alerts when ingestion fails, when data volumes shift unexpectedly, or when quality metrics drift.
Periodic re-assessment: Scheduled reviews of data sources to catch changes in format, quality, or relevance.
Feedback loops: When end users report incorrect or outdated answers, the root cause is almost always in the data, not the model. A clear process for tracing issues back to their data source and correcting them is essential.
Capacity planning: As data volumes grow, chunking strategies, vector databases, and retrieval pipelines need to scale with them.
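The volume check in pipeline monitoring can be as simple as comparing today's ingested-document count with a recent average. The sketch below is a minimal example under assumptions: the 50 percent tolerance, the function name, and the sample counts are hypothetical, and real monitoring would also track ingestion failures and quality metrics.

```python
from statistics import mean

def volume_alert(history, today, tolerance=0.5):
    """Return an alert message if today's count deviates from the recent
    average by more than the tolerance fraction, otherwise None."""
    if not history:
        return None  # nothing to compare against yet
    baseline = mean(history)
    if baseline == 0:
        return "volume changed from zero baseline" if today else None
    change = (today - baseline) / baseline
    if abs(change) > tolerance:
        return f"ingestion volume changed {change:+.0%} versus recent average"
    return None

recent_counts = [100, 110, 90]  # made-up daily document counts
quiet_day = volume_alert(recent_counts, 105)
bad_day = volume_alert(recent_counts, 30)
```

A sudden drop like the second case often points to a broken connector or a schema change upstream, well before users notice stale answers.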

Organizations that budget for ongoing data maintenance alongside their AI platform costs are the ones that see sustained value. Those that treat data preparation as a one-time setup cost are the ones that wonder, six months later, why their AI system’s answers are getting worse.

How the Sprinklenet Team Approaches This

Knowledge Spaces, the Sprinklenet platform for enterprise AI, was designed around the reality described in this post. The platform includes 15+ data connectors that handle ingestion from SharePoint, S3, databases, APIs, file systems, and other common enterprise sources. But connectors alone are not the point.

The point is what happens between ingestion and retrieval. Knowledge Spaces applies configurable chunking strategies tailored to each data type, enriches every chunk with source metadata, enforces access controls at the retrieval layer, and provides monitoring tools that surface data quality issues before they reach end users.

The Sprinklenet team’s standard engagement model reflects this priority. Before configuring a single model parameter, the team works with the client to complete data discovery, assess source quality, design the chunking and metadata strategy, and build automated pipelines. Only after the data layer is solid does model configuration begin.

This approach takes longer to show a first demo. It takes less time to reach production. And it produces systems that remain reliable after the initial excitement fades and the real work of serving users begins.

Data preparation is not the glamorous part of AI. It does not make for compelling keynote presentations or viral LinkedIn posts. But it is where enterprise AI projects succeed or fail, and understanding that from the start is the single best investment an organization can make.

About the Author

LLM Evaluation Analyst, Sprinklenet Research

Michael Goldman is a Sprinklenet Research contributor focused on retrieval quality, model behavior, prompt risk, and audit controls for enterprise AI systems.

His work examines where AI systems fail in practice, including weak grounding, fragile handoffs, unclear review paths, and brittle integrations.
