AI is infrastructure
Embeddings, vector databases, RAG, and the one question nobody asks: who owns the switch?
I. The illusion of intelligence
For the last few years, we have treated AI like a parlor trick—a clever API call where you send a string of text and pray for a coherent response. In the early stages of any technology, this kind of experimentation is necessary. But the era of the “magic black box” is ending. We are currently witnessing the transition of AI from an experimental API call to a core piece of persistent infrastructure.
The problem with the current “wrapper” economy is one of sovereignty. Most engineers are building on foundations they neither control nor fully understand. If your entire product logic is a sequence of prompts sent to a third-party provider, you aren’t building a system; you are renting an outcome. From an engineering perspective, this is a massive single point of failure. From an investment perspective, it’s a business with no moat and high platform risk.
When you treat AI as a transient API call, you are essentially dealing with a stateless function. It has no memory, no context, and no way to interact with the messy, unstructured reality of your specific data. To move beyond this, we have to stop thinking about “chatting” and start thinking about architecting.
Infrastructure requires three things that a simple API call lacks: state, memory, and interfaces.
State is knowing where you are in a process.
Memory is the ability to retrieve the right information at the right millisecond.
Interfaces are the protocols that allow different parts of the machine to talk to each other without custom, brittle glue code.
If you look at the history of computing, we’ve seen this pattern before. Databases were once specialized tools for academics; now they are the bedrock of every application. Version control was a niche utility; now it’s the heartbeat of production. AI is following the same trajectory. It is moving out of the “playground” and into the “server room.”
To build in this new environment, we have to look under the hood at the components that actually make a system persistent: embeddings, vector storage, and orchestration. Once you understand these, the “magic” disappears, and you’re left with something much more useful: an engineering problem.
II. Meaning as math: the embedding layer
To build persistent AI infrastructure, you first have to solve the problem of representation. Computers are exceptionally good at comparing integers; they are historically terrible at comparing ideas.
The breakthrough that makes modern AI possible is the embedding. At its most basic level, an embedding is just a list of numbers—a high-dimensional vector. But its value isn’t in the numbers themselves; it’s in their position.
Think of an embedding as a coordinate in a massive, multidimensional map of human meaning. In this space, “closeness” equals similarity. The phrase “write project report” and “draft team summary” may share zero keywords, but their vectors will be mathematically adjacent because their intent is nearly identical. This is a fundamental shift from keyword matching to semantic retrieval.
For an engineer, embeddings are the “universal adapter” for unstructured data. We no longer need to write complex regex or brittle heuristics to categorize data. Whether it’s an email, a snippet of Python code, or a Slack message, we can project it into this vector space. Once everything is a coordinate, the problem of “understanding” a task becomes a problem of geometry.
From a systems perspective, this is where the “intelligence” of the model is actually operationalized. When you ask an AI to find your top priorities, it isn’t “thinking” in the human sense. It is performing a mathematical calculation to find which coordinates in your data map are closest to the coordinate of your question.
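That geometric calculation can be made concrete. Below is a minimal sketch of cosine similarity over toy 4-dimensional vectors; real embedding models emit hundreds or thousands of dimensions, and the numbers here are invented purely for illustration:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Similarity = cosine of the angle between two vectors; 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for real embeddings (invented for illustration).
report = [0.9, 0.1, 0.8, 0.2]   # "write project report"
summary = [0.8, 0.2, 0.9, 0.1]  # "draft team summary"
recipe = [0.1, 0.9, 0.0, 0.7]   # "bake sourdough bread"

print(cosine_similarity(report, summary))  # close to 1.0: similar intent
print(cosine_similarity(report, recipe))   # much lower: unrelated intent
```

Zero shared keywords, but the first pair scores high because their coordinates point the same way. That is the entire trick behind semantic retrieval.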
However, a coordinate is useless if you have nowhere to store the map. If you’re regenerating these vectors every time a user asks a question, your latency will kill the product, and your API bill will kill the company. This leads us to the next requirement of the stack: a way to make these mathematical meanings persistent and searchable at scale.
III. The persistence problem: vector databases
If an embedding is a coordinate, a vector database is the physical medium that holds the map.
In traditional software architecture, we rely on relational databases (Postgres, MySQL) or document stores (MongoDB) to handle state. These systems are optimized for exact matches. If you query an ID or a specific timestamp, the database uses B-tree indices to find that record in logarithmic time. This works perfectly for structured business logic, but it fails completely for AI.
You cannot use a B-tree to find “things that are kind of like this.” If you tried to perform a similarity search in a standard SQL database by calculating the distance between a query vector and every vector in your table, you would be running an O(n) operation. At a scale of millions of vectors—representing years of notes, code, and documentation—your system would grind to a halt.
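To see exactly what that O(n) scan looks like, here is a sketch of the naive approach in pure Python over invented random data. This full-collection scan is the pattern ANN indexes exist to avoid:

```python
import math
import random

def euclidean(a: list[float], b: list[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def brute_force_search(query, vectors, k=3):
    """O(n) over the whole collection: fine for 10,000 vectors, fatal for 100 million."""
    scored = [(euclidean(query, v), i) for i, v in enumerate(vectors)]
    scored.sort()
    return [i for _, i in scored[:k]]

# 10,000 random 8-dimensional vectors standing in for real embeddings.
random.seed(0)
store = [[random.random() for _ in range(8)] for _ in range(10_000)]
print(brute_force_search(store[42], store, k=3))  # index 42 is its own nearest neighbor
```

Every query touches every vector. An HNSW or IVF index trades a small amount of recall for sub-linear search, which is the whole reason a dedicated vector database exists.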
This is the persistence problem. For AI to become infrastructure, it needs a “Long-Term Memory” that can perform Approximate Nearest Neighbor (ANN) searches in milliseconds.
Vector databases like Milvus, Weaviate, and Pinecone are designed for this specific workload. They don’t just store data; they organize it using specialized indexing algorithms like HNSW (Hierarchical Navigable Small World) or IVF (Inverted File Index).
From an engineering perspective, the choice of a vector database is a classic trade-off between latency, recall (accuracy), and cost.
Managed services (Pinecone) offer low operational overhead but introduce a dependency on external cloud reliability.
Self-hosted solutions (Milvus, FAISS) offer sovereignty and potentially lower long-term costs but require significant DevOps expertise to scale.
I look at the vector database layer as the most defensible part of the AI infrastructure stack. Models are becoming commodities; the data—and the efficiency with which you can retrieve it—is the moat. If your AI assistant feels like it “remembers” the right detail at the right time, it isn’t because the model is brilliant. It’s because your vector database is performing a high-speed retrieval of the relevant context before the model even begins to generate a response.
Without this layer, AI remains a “goldfish”—brilliant in the moment, but incapable of learning from the past.
IV. Grounding the model: RAG vs. fine-tuning
Even the most sophisticated large language model has a glaring weakness: it is a frozen snapshot of the past. Once a model finishes training, its “worldview” is locked. It doesn’t know about the code you pushed ten minutes ago, the meeting you just finished, or the current state of the market.
In the early days of this cycle, many teams assumed the solution was fine-tuning—the process of retraining a model on a specific dataset to bake in new knowledge. But as an engineering pattern, fine-tuning for knowledge retrieval is often a mistake. It is expensive, slow, and non-deterministic. You are essentially trying to force a student to memorize a library by whispering the books to them while they sleep. They might remember some of it, but they will likely hallucinate the rest.
The superior architectural pattern for persistent infrastructure is Retrieval-Augmented Generation (RAG).
Think of the difference between a closed-book exam and an open-book exam. Fine-tuning is the closed-book exam; the model relies solely on its internal weights. RAG is the open-book exam. When a user asks a question, the system first consults the library (the vector database), pulls out the relevant “pages” (the context), and hands them to the model along with the question.
The model’s job changes from remembering to synthesizing.
From a systems design perspective, RAG offers three critical advantages:
Truth grounding: because the model is looking at a specific piece of retrieved text, the probability of hallucination drops significantly. You can even force the model to cite its sources, creating an audit trail that is impossible with a raw LLM.
Zero-latency updates: if your data changes, you don’t need to retrain a model. You simply update the vector database. The next time the RAG pipeline runs, the model sees the new information immediately.
Cost efficiency: running a vector search and a standard inference call is orders of magnitude cheaper and faster than maintaining a custom fine-tuned model pipeline.
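The whole pattern fits in a few lines of control flow. A minimal sketch, with `embed`, `vector_search`, and `llm` injected as stand-ins for whatever embedding model, vector store, and inference endpoint you actually run (the stubs below are toys, not a real stack):

```python
def rag_answer(question: str, embed, vector_search, llm, k: int = 3) -> str:
    """Open-book pattern: retrieve first, then ask the model to synthesize.
    `embed`, `vector_search`, and `llm` are injected stand-ins for your stack."""
    query_vec = embed(question)
    passages = vector_search(query_vec, k=k)  # the "pages" pulled from the library
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    prompt = (
        "Answer using ONLY the sources below. Cite them as [n].\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}"
    )
    return llm(prompt)

# Toy stubs to show the control flow (not a real model or index).
fake_embed = lambda text: [float(len(text))]
fake_search = lambda vec, k: ["Q3 revenue grew 12%.", "Headcount is flat."][:k]
fake_llm = lambda prompt: "Revenue grew 12% in Q3 [1]."
print(rag_answer("How did Q3 go?", fake_embed, fake_search, fake_llm))
# -> Revenue grew 12% in Q3 [1].
```

Note where the leverage is: the model never changes. Updating the system means updating what `vector_search` returns, which is why data changes propagate with zero retraining.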
For the investor, RAG is where the “Value” in Value Investing lives. It allows a company to put its proprietary data to work—its real-world edge—without leaking that data into a general-purpose model’s training set. It turns the AI from a generalist philosopher into a specialist with access to your private company records.
RAG turns the AI into a tool that is not just smart, but current. But having a smart, current advisor is still passive. The final leap in the stack is moving from an advisor that talks to an agent that acts.
V. From chat to agency: orchestration and ReAct
Retrieval-Augmented Generation (RAG) makes an AI knowledgeable, but it doesn’t make it useful. A smart advisor who can find your notes is a convenience; a system that can actually do the work is an asset. To cross this gap, we move from passive text generation to Agentic Orchestration.
In the “experimental API” phase, we treated the model as a one-shot oracle: you ask a question, it gives an answer. In the “infrastructure” phase, we treat the model as a reasoning engine within a closed loop. This is often formalized through a pattern called ReAct (Reason + Act).
The ReAct loop is essentially a state machine. Instead of trying to guess the final answer in one go, the model is prompted to follow a structured sequence:
Thought: the model analyzes the request and determines what it needs.
Action: the model selects a tool (an API call, a database query, or a search).
Observation: the model consumes the output of that tool and updates its “thought” process.
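The three steps above can be sketched as a loop. This is an illustrative skeleton, not a production framework: `reason` stands in for the LLM call, and the scripted decisions below are invented to show the control flow:

```python
def react_loop(task, reason, tools, max_steps=5):
    """Thought -> Action -> Observation, repeated until the model signals done.
    `reason` stands in for an LLM call; `tools` maps tool names to callables."""
    transcript = [f"Task: {task}"]
    for _ in range(max_steps):
        step = reason(transcript)  # model returns a structured decision
        transcript.append(f"Thought: {step['thought']}")
        if step["action"] == "finish":
            return step["answer"]
        observation = tools[step["action"]](step["input"])
        transcript.append(f"Observation: {observation}")
    raise RuntimeError("Agent exceeded step budget without finishing")

# Scripted stand-in for the model: look up a fact, then finish.
script = iter([
    {"thought": "I need the task list.", "action": "search_tasks", "input": "top"},
    {"thought": "I have what I need.", "action": "finish", "answer": "3 tasks found"},
])
tools = {"search_tasks": lambda q: "deploy, review, invoice"}
print(react_loop("summarize top tasks", lambda t: next(script), tools))
# -> 3 tasks found
```

Notice the `max_steps` budget: an agent without a hard step limit is an infinite loop waiting to happen.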
From an engineering perspective, this is where AI starts to look like traditional distributed systems. We aren’t just “chatting”; we are managing control flow. When an agent decides to “summarize the top three tasks and email them,” it is executing a multi-step transaction. It must verify the tasks exist (Vector DB), prioritize them (Reasoning), and then hit an external SMTP or Graph API endpoint (Action).
This introduces a new set of architectural challenges: reliability and latency. Every step in a reasoning loop adds “hops” to the process. If your model takes 2 seconds to “think” and you have a 5-step loop, your user is waiting 10 seconds. For a senior engineer, the focus here isn’t on the “intelligence” of the agent, but on the throughput and error handling of the orchestration framework (like LangChain or Haystack). If the email API returns a 500 error, does the agent know how to retry, or does it hallucinate a success message?
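The hallucinated-success failure mode is worth spelling out in code. A sketch of the orchestration-layer fix: wrap every tool call so that a real failure surfaces as an explicit observation the agent can reason about, rather than a gap it fills with fiction. The flaky tool below is invented for illustration:

```python
import time

def call_with_retry(tool, *args, attempts=3, base_delay=0.5):
    """Surface real failures to the agent instead of letting it hallucinate success."""
    for attempt in range(attempts):
        try:
            return {"ok": True, "result": tool(*args)}
        except Exception as exc:
            if attempt == attempts - 1:
                # Final failure becomes an explicit observation, not a silent guess.
                return {"ok": False, "error": str(exc)}
            time.sleep(base_delay * 2 ** attempt)  # exponential backoff

# A flaky stand-in for an email API that 500s twice, then succeeds.
calls = {"n": 0}
def flaky_send(msg):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("500 Internal Server Error")
    return "sent"

result = call_with_retry(flaky_send, "weekly summary", base_delay=0.01)
print(result)  # {'ok': True, 'result': 'sent'}
```

The structured `{"ok": False, "error": ...}` return is the important part: it becomes an observation in the ReAct transcript, so the model's next "thought" is grounded in the actual failure.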
From the investor’s seat, orchestration is the bridge to true productivity. The value isn’t in the model itself—which is depreciating toward zero cost—but in the proprietary “chains” or “graphs” of logic a company builds around it. A system that can autonomously negotiate a schedule or reconcile a ledger is fundamentally more valuable than one that merely writes poems about them.
However, as we give agents more power to act, we run into the “Integration Mess.” If every tool requires a custom, bespoke connector, the system becomes a maintenance nightmare. This leads us to the need for a standardized interface.
VI. The universal port: Model Context Protocol (MCP)
As we move from passive chat to active agency, we hit a scaling wall known as the N×M integration problem.
If you have N different AI models and you want them to interact with M different enterprise tools—Slack, Jira, Salesforce, internal databases—you traditionally need to write N×M custom connectors. Every time a tool updates its API or a new model is released, the “glue code” breaks. For a senior engineer, this is technical debt in its purest, most toxic form. For the business, it’s a massive integration tax that slows down every new feature.
The solution that has emerged as the industry standard in 2026 is the Model Context Protocol (MCP).
Often described as the “USB-C for AI,” MCP is an open standard—now governed by the Linux Foundation’s Agentic AI Foundation—that provides a universal interface for AI agents. Instead of building bespoke middleware for every app, you build an MCP Server that exposes your data or tools in a standardized format. Any MCP Client (like a coding assistant or a corporate agent) can then discover and use those tools securely. The connector matrix collapses from N×M bespoke integrations to N clients plus M servers.
From an architectural standpoint, MCP introduces a clean separation of concerns:
The server: wraps the data source (e.g., your Postgres DB or your GitHub repo) and tells the agent what it can do.
The client: the protocol connector inside the AI application that maintains the session with a server and relays tool calls on behalf of the reasoning loop.
The host: the secure environment (like an IDE or a sandboxed container) where the interaction happens.
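The key architectural idea is that the server declares a contract and the client discovers it, with no bespoke glue in between. The toy registry below illustrates that separation of concerns in plain Python; it is NOT the real MCP wire protocol (which is JSON-RPC based), and all names are invented:

```python
class ToyMCPServer:
    """Toy illustration of MCP's separation of concerns, not the real protocol.
    Wraps a data source and advertises what an agent may do with it."""

    def __init__(self, name):
        self.name = name
        self._tools = {}

    def tool(self, fn):
        """Register a callable; its name and docstring become the contract."""
        self._tools[fn.__name__] = fn
        return fn

    def list_tools(self):
        """Discovery: a client asks what this server can do."""
        return {n: (f.__doc__ or "").strip() for n, f in self._tools.items()}

    def call(self, tool_name, **kwargs):
        """Invocation: a client calls a tool by name, not by custom glue code."""
        return self._tools[tool_name](**kwargs)

docs = ToyMCPServer("internal-docs")

@docs.tool
def search_docs(query: str) -> str:
    """Search internal documentation by keyword."""
    return f"2 pages matching {query!r}"

# Any client can discover the contract, then invoke it:
print(docs.list_tools())
print(docs.call("search_docs", query="vpn"))  # 2 pages matching 'vpn'
```

Because the contract (tool names plus descriptions) is self-describing, a new model or a new tool plugs in without touching the other side. That is the N×M problem dissolving.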
By 2026, we’ve moved beyond simple text-based commands. Modern MCP implementations support sampling—where the server can ask the model to reason about intermediate steps—and elicitation, which allows the system to pause and ask the user for permission or clarification (like an OAuth flow or a payment confirmation).
For the engineer, MCP means “write once, run anywhere.” You build one MCP server for your internal documentation, and it instantly works with Claude, GPT-5, and your self-hosted Llama models.
For the investor, MCP represents the death of the “connector moat.” Companies that previously charged high premiums just to sync data between two apps are seeing their value evaporate. The value has shifted upstream to the quality of the data and downstream to the sophistication of the agent’s reasoning.
However, as we connect more “live” tools to our AI, the surface area for failure increases. We aren’t just sending prompts anymore; we are opening bidirectional data pipes into our most sensitive internal systems. This brings us to the final, and most critical, layer of the stack: Sovereignty and Security.
VII. The infrastructure pivot: sovereignty and security
So far, we have mapped out a sophisticated stack: embeddings for meaning, vector databases for memory, RAG for grounding, and MCP for connectivity. But as a systems thinker, you have to ask the most uncomfortable question in the room: Who owns the switch?
Right now, for the vast majority of companies, the answer is “someone else.” Your embeddings are computed by OpenAI; your reasoning is handled by Anthropic; your vector database is a managed instance on a third-party cloud. This is a fragile architecture. If a provider suffers an outage—as we’ve seen with major LLM providers—your “persistent infrastructure” evaporates instantly.
From an engineering perspective, this is a single point of failure. From an investment perspective, it’s a failure of risk management. You are essentially building a skyscraper on land where the lease can be revoked at any moment.
This is why we are seeing a massive pivot toward sovereign AI infrastructure. Serious engineering teams are pulling the stack in-house:
Self-hosted models: using inference engines like vLLM or Ollama to run open-weights models on private GPU clusters.
Private vector stores: running Milvus or Qdrant inside a private Kubernetes cluster rather than relying on a managed SaaS.
Local embedding pipelines: ensuring data never leaves the corporate network.
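In practice, sovereignty is partly just failover logic. Here is a sketch of routing inference to a private cluster first and degrading gracefully to a hosted API; the backends are hypothetical stand-ins (a real version would call a vLLM or Ollama HTTP endpoint), invented here to show the pattern:

```python
def resilient_generate(prompt, backends):
    """Try each inference backend in priority order: private cluster first,
    hosted API as a fallback. `backends` is a list of (name, callable) pairs."""
    errors = []
    for name, backend in backends:
        try:
            return {"backend": name, "text": backend(prompt)}
        except Exception as exc:
            errors.append(f"{name}: {exc}")
    raise RuntimeError("All backends failed: " + "; ".join(errors))

def local_llm(prompt):   # stand-in for a private vLLM/Ollama endpoint
    raise ConnectionError("GPU cluster unreachable")

def hosted_llm(prompt):  # stand-in for a third-party API
    return "fallback answer"

result = resilient_generate("status?", [("local", local_llm), ("hosted", hosted_llm)])
print(result)  # {'backend': 'hosted', 'text': 'fallback answer'}
```

The ordering encodes the policy: private capacity is the default, and the third party is a contingency rather than a dependency.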
The goal is redundancy, cost control, and data privacy. But this pivot introduces a new, lethal problem: the access dilemma.
Once you move your models and data behind a firewall, how do your agents and developers actually reach them? The “old way” was to set up a bulky VPN or, worse, open a port. A VPN is a blunt instrument; it gives a user access to the entire network when they only need access to a single API endpoint. It’s the antithesis of modern security.
The “infrastructure” approach to this problem is Zero Trust. Instead of trusting anyone on the network, you authenticate based on identity. This is where tools like Twingate come into play. They allow you to define a specific LLM endpoint as a “protected resource.” An encrypted tunnel is established only for authenticated users through their existing identity provider (Okta, Google, GitHub). The connector lives inside your cluster, forwarding requests directly to the model.
The payoff is ownership. You stop paying “rent” to the cloud providers and start building “equity” in your own infrastructure. You gain the ability to run your system during a global outage, you eliminate the “privacy tax” of sending data to third parties, and you secure your proprietary logic behind a zero-trust perimeter.
In the 1990s, companies realized they couldn’t just rent time on mainframes; they needed their own servers. In the 2020s, we are realizing we cannot just rent “intelligence” via an API. We need to own the stack.
VIII. Conclusion: AI is infrastructure
The history of computing is a history of demystification. We begin by treating new capabilities as magic, then as luxuries, and finally as mundane necessities. We are currently at the final stage of that cycle with artificial intelligence. The “magic” of a chatting machine has worn off, replaced by the sober realization that AI is simply a new layer of the modern infrastructure stack.
When we look back at the “AI Hype” of the early 2020s, we will see it not as the birth of a new god, but as the chaotic childhood of a new utility. The transition from an experimental API call to a sovereign, persistent system is the signal that the industry is maturing. We are moving away from “AI-powered” apps—which were often just thin wrappers around someone else’s model—toward AI-integrated systems that own their memory, their logic, and their security.