From Data Platforms to AI Platforms: How LLMs Change the Playbook for Data Engineers and Architects
Large Language Models (LLMs) didn’t replace data engineering—they turned it into the backbone of a new generation of intelligent systems. The shift from predictive ML to generative AI changes how we design data flows, how we govern information, and how we run production systems at scale.
This article is a practical guide for data engineers and architects who want to ship high‑impact AI features without gambling on reliability, cost, or compliance. We’ll cover the architectural patterns, operational practices, and skill shifts that matter most in the LLM era.
Why LLMs change the architecture conversation
Traditional ML stacks optimize for structured features and batch scoring. LLM‑powered systems optimize for knowledge access, context construction, and safe generation—often in real time. That pushes the platform to evolve in three big ways:
- Retrieval becomes a first‑class citizen
  - You don’t just “store data”; you curate knowledge that LLMs can reliably retrieve.
  - Document parsing, chunking, embeddings, and hybrid search (dense + sparse) become core data flows.
- Governance shifts from models to context
  - For RAG (Retrieval‑Augmented Generation), the answer is only as good as the retrieved context.
  - Provenance, freshness, lineage, and access control are central to trust and compliance.
- Operability moves from batch SLAs to conversational SLOs
  - Latency, cost per request, groundedness, and safety are now top‑level metrics.
  - You need LLM‑aware tracing, evaluation, and feedback loops (LLMOps).
A reference AI platform for the real world
Think of the AI platform as an evolution of your data platform. Here’s a pragmatic reference architecture and how each layer maps to responsibilities you likely have today.
- Data foundations (unchanged in importance, expanded in scope)
  - Batch and streaming ingestion, quality gates, and contracts
  - Document pipelines (PDFs, HTML, Office docs) with robust parsing/OCR
  - Metadata everywhere: source, timestamp, owner, PII flags, legal hold
- Retrieval layer
  - Embedding pipelines with consistent tokenization/normalization
  - Chunking strategies (semantic, layout-aware, or hybrid)
  - Vector database choices: Pinecone, Weaviate, Qdrant, pgvector, or vendor-native
  - Hybrid search (BM25 + vectors) with filters and re‑ranking where needed
- Orchestration and reasoning (a minimal request‑flow sketch follows this list)
  - Prompt templates, tools/functions, and routing
  - Frameworks: LangChain, LlamaIndex, or homegrown orchestration for control
  - Guardrails: input/output validation, policy checks, PII redaction, citation enforcement
- Model access and policy
  - Provider abstraction (OpenAI, Azure OpenAI, Bedrock, Vertex, local models)
  - Cost/latency routing, caching (prompt/response and embeddings), fallbacks, canaries
- LLMOps and evaluation
  - Tracing and analytics (latency, token usage, cache hits, failure modes)
  - Quality evaluation: groundedness, factuality, instruction‑following, toxicity
  - Golden sets and task‑specific evals (RAGAS or custom rubrics)
  - Human feedback loops (thumbs up/down, error reports, corrections)
- Security and governance
  - Row/column‑level access and purpose‑based policy
  - Redaction and masking pre‑embedding and pre‑generation
  - Auditability: “why did the model answer X?” (context, sources, versions)
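To make the layering concrete, here is a minimal sketch of how a single request might flow through the retrieval, governance, orchestration, and model‑access layers. The `search_index` and `call_model` callables and the `Chunk` fields are hypothetical stand‑ins for your own vector store and provider abstraction; the point is the separation of concerns, not a specific API.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    source: str       # document URI, kept for citations and lineage
    version: str      # source version, used for cache invalidation
    sensitivity: str  # e.g. "public", "internal", "restricted"

# Rank sensitivity labels so access checks are not string comparisons.
SENSITIVITY_RANK = {"public": 0, "internal": 1, "restricted": 2}

def answer_question(question: str, user_clearance: str,
                    search_index, call_model) -> dict:
    """Hypothetical request flow: retrieve -> filter -> assemble -> generate.
    search_index(query, top_k) returns Chunks; call_model(prompt) returns text."""
    # Retrieval layer: hybrid search returns candidate chunks with metadata.
    candidates = search_index(question, top_k=20)

    # Governance: enforce access policy before anything reaches the prompt.
    allowed = [c for c in candidates
               if SENSITIVITY_RANK[c.sensitivity] <= SENSITIVITY_RANK[user_clearance]]

    # Orchestration: build a grounded prompt with explicit, numbered citations.
    context = "\n\n".join(f"[{i}] ({c.source}) {c.text}"
                          for i, c in enumerate(allowed[:5], start=1))
    prompt = ("Answer using only the numbered context below and cite sources as [n].\n\n"
              f"Context:\n{context}\n\nQuestion: {question}")

    # Model access: the provider abstraction decides which model actually runs.
    answer = call_model(prompt)

    # Return sources with the answer so "why did it answer X?" stays auditable.
    return {"answer": answer, "sources": [c.source for c in allowed[:5]]}
```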
RAG vs. fine‑tuning: pick your battles
- Start with RAG when:
  - The knowledge changes frequently or must reflect the source of truth.
  - You need citations, explainability, or strict governance.
  - You want faster iteration with lower model/platform lock‑in.
- Consider fine‑tuning when:
  - The style/format must be learned and repeated consistently.
  - The task is narrow and the corpus is stable.
  - You’ve exhausted prompt and retrieval improvements.
In practice, many teams blend both: RAG provides up‑to‑date facts; a lightly tuned model handles style or domain terminology.
Retrieval that works: 7 practical tips
- Parse the document you actually have. Use layout‑aware parsing (headings, tables, lists) and preserve structure in metadata.
- Chunk for meaning, not just length. Semantic chunking or layout‑driven chunking often outperforms naive fixed windows.
- Store rich metadata. Source, section, page, timestamp, version, sensitivity labels—these power precise filtering.
- Use hybrid search. Combine dense vectors with keyword (BM25) and metadata filters; re‑rank when the corpus is large or noisy (a scoring sketch follows this list).
- Cache embeddings and responses. Avoid recomputation and keep costs predictable; invalidate with document versioning.
- Evaluate retrieval, not just answers. Track top‑k coverage, overlap with ground truth, and citation correctness.
- Prefer freshness and authority. Retrieval ranking should boost recent, authoritative, and internally verified content.
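As a sketch of the hybrid‑search tip above: combine the keyword and vector rankings with reciprocal‑rank fusion, then apply metadata filters. The `bm25_search` and `vector_search` callables are placeholders for whatever keyword index and embedding search you actually run; only the fusion and filtering logic is the point.

```python
def reciprocal_rank_fusion(keyword_ranked, vector_ranked, k=60):
    """Fuse two ranked lists of document ids (e.g. BM25 and dense retrieval)
    with reciprocal-rank fusion: score(d) = sum over lists of 1 / (k + rank)."""
    scores = {}
    for ranking in (keyword_ranked, vector_ranked):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

def hybrid_search(query, bm25_search, vector_search, metadata, allowed_labels, top_k=10):
    """Hypothetical hybrid retrieval: run both retrievers, fuse the rankings,
    then filter by metadata (e.g. sensitivity label) and truncate."""
    fused = reciprocal_rank_fusion(bm25_search(query), vector_search(query))
    permitted = [d for d in fused if metadata[d]["label"] in allowed_labels]
    return permitted[:top_k]
```

Reciprocal‑rank fusion is only one simple option; a dedicated re‑ranking model can replace it when the corpus is large or noisy.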
Data contracts, lineage, and the right to be forgotten
LLMs amplify small data mistakes. That means:
- Contracts must include unstructured data: content type, size limits, PII flags, retention.
- Lineage must be bidirectional: show which answers came from which sources and versions.
- Redaction and deletion must propagate: if a source is retracted, invalidate related embeddings and caches (a minimal sketch follows below).
This is both a governance obligation and a quality win.
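A minimal sketch of that propagation, assuming embeddings and cached responses are keyed by source id, and that the store interfaces (`delete_where`, `invalidate_by_source`) are hypothetical placeholders for your own vector store and cache:

```python
def retract_source(source_id: str, vector_store, response_cache, audit_log) -> None:
    """Propagate a retraction (or a right-to-be-forgotten request) to every
    derived artifact, not just the original document. The store interfaces
    used here are hypothetical."""
    # 1. Delete all chunks/embeddings derived from the source.
    deleted_chunks = vector_store.delete_where(source_id=source_id)

    # 2. Invalidate any cached responses that cited the source.
    invalidated = response_cache.invalidate_by_source(source_id)

    # 3. Record the action so "why did the model answer X?" stays answerable.
    audit_log.append({
        "action": "retract",
        "source_id": source_id,
        "chunks_deleted": deleted_chunks,
        "responses_invalidated": invalidated,
    })
```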
Cost and latency without the guesswork
Data engineers and architects own the economics of AI systems.
- Pre‑filter before vector search to reduce candidate sets (by time, owner, label).
- Compress prompts and context; prefer citations over dumping entire documents.
- Token‑aware chunking and prompt templates reduce waste.
- Cache aggressively (embedding and response caches); track hit rates.
- Route by cost/latency/SLA: cheap summarization provider, premium reasoning provider, and local fallback if allowed.
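As a sketch of the caching and routing bullets above: hash the prompt for a response cache, choose a provider by task type, and fall back when the preferred call fails. The provider names, prices, and stub `call` functions are illustrative assumptions, not real endpoints.

```python
import hashlib

# Hypothetical provider table: rough cost per 1K tokens plus a stand-in call.
PROVIDERS = {
    "cheap-summarizer": {"cost_per_1k": 0.0005, "call": lambda p: f"[cheap] {p[:40]}..."},
    "premium-reasoner": {"cost_per_1k": 0.0100, "call": lambda p: f"[premium] {p[:40]}..."},
}

_response_cache = {}  # prompt hash -> cached answer; track the hit rate in production

def route_request(prompt: str, task: str) -> str:
    """Serve from cache when possible; otherwise pick a provider by task type
    and fall back to the other one if the preferred call fails."""
    key = hashlib.sha256(f"{task}:{prompt}".encode()).hexdigest()
    if key in _response_cache:
        return _response_cache[key]

    order = (["premium-reasoner", "cheap-summarizer"] if task == "reasoning"
             else ["cheap-summarizer", "premium-reasoner"])
    for name in order:
        try:
            answer = PROVIDERS[name]["call"](prompt)
            _response_cache[key] = answer
            return answer
        except Exception:
            continue  # provider outage or timeout: fall through to the next one
    raise RuntimeError("all providers failed")
```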
KPIs to watch:
- P50/P95 latency; cost per request; cache hit rate; grounding/accuracy; containment of sensitive data.
The new skills for data engineers and architects
- Retrieval engineering: document parsing, chunking, embeddings, hybrid search, and vector DB design.
- LLMOps and evaluation: tracing, dataset curation for evals, groundedness checks, human‑in‑the‑loop.
- Security and governance for GenAI: PII detection/redaction, least‑privilege retrieval, audit trails, policy‑based routing.
- Prompt and workflow design: system prompts, tool usage, retries, fallbacks, canaries, and shadow traffic.
- Cost/latency management: token accounting, caching strategies, multi‑provider abstraction.
If you’ve built streaming pipelines, feature stores, and data products, you already have 70% of the skills. The rest is applying them to retrieval and generation.
A pragmatic 30‑60‑90 day roadmap
0–30 days: Discover and validate
- Pick 1–2 high‑value use cases (internal search, support assistant, policy Q&A).
- Build thin RAG prototypes with basic ingestion, embeddings, and search.
- Set quality and safety metrics; instrument tracing.
31–60 days: Stabilize and govern
- Productionize ingestion: parsing, chunking, metadata, contracts, and lineage.
- Introduce guardrails (PII redaction, policy checks, citation enforcement).
- Add LLM evals and golden sets; establish staging and canary releases.
61–90 days: Scale and optimize
- Add caching, multi‑provider routing, and cost/latency dashboards.
- Close the loop: user feedback, error reports, automatic retraining of retrieval.
- Define incident playbooks (provider outage, cost spike, data recall).
Common pitfalls (and how to avoid them)
- Hallucinations: usually a symptom of missing or irrelevant retrieval. Fix chunking and ranking; require citations.
- Over‑contexting: Dumping too much text. Use filters, compression, and relevance thresholds.
- No evals: Relying on vibes. Build golden sets and task‑specific metrics early (a minimal eval sketch follows this list).
- Weak governance: Unredacted PII in embeddings or context. Add detection, masking, and access control.
- Vendor lock‑in: Hard‑coding to one model or database. Abstract providers and retrieval; keep data portable.
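On the “no evals” pitfall, even a tiny golden set beats vibes. A minimal sketch, assuming each golden record carries a question, the sources that must be retrieved, and keywords the answer should contain (this record format is an assumption, not a standard):

```python
def run_golden_set(golden_set, answer_fn):
    """Score a RAG pipeline against a hand-curated golden set.
    answer_fn(question) is expected to return {"answer": str, "sources": [str]}."""
    retrieval_hits = answer_hits = 0
    for case in golden_set:
        result = answer_fn(case["question"])
        # Retrieval metric: did we surface at least one expected source?
        if set(case["expected_sources"]) & set(result["sources"]):
            retrieval_hits += 1
        # Answer metric: crude keyword check; swap in a rubric or LLM judge later.
        if all(kw.lower() in result["answer"].lower() for kw in case["expected_keywords"]):
            answer_hits += 1
    n = len(golden_set)
    return {"retrieval_recall": retrieval_hits / n, "answer_pass_rate": answer_hits / n}
```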
In closing: AI platforms are data platforms, with context
The LLM era rewards teams that treat knowledge as a product: well‑parsed, well‑chunked, richly annotated, and safely retrievable. If you already own the data platform, you’re closer than you think to owning the AI platform.
Want help turning your data platform into an AI platform? I’m happy to review your use case and propose a practical path to production.
— Evgeni Altshul

