The Real Reason Early AI Projects Fail
Most early-stage AI projects die quietly. Not from bad models. Not from wrong architecture choices. They die because nobody knows what the thing is actually doing. Features ship, demos look great, and then six months later someone asks a reasonable question — "is this working?" — and nobody has an answer.
That's an instrumentation problem. And it kills more AI projects than hallucinations ever will.
Features Are Easy. Visibility Is Hard.
Here's what I see over and over: a team gets excited, picks a model, builds something genuinely clever, and ships it. The first two weeks feel like magic. Then the cracks appear. A user gets a bad response. A workflow silently degrades. A prompt that worked last month stops working after a model update. Nobody catches it because nobody built the infrastructure to catch it.
One of our local AI community members put it perfectly when she described using an AI agent to audit CRM behaviour across her team. The question wasn't "does this tool work?" It was "is our team actually using it, and why isn't the process sticking?" That's the right question. That's instrumentation thinking. She ran the audit and the findings were gold. Not because the AI was sophisticated, but because she built in the mechanism to observe what was happening.
That's the difference between a project that scales and one that quietly becomes obsolete.
What Instrumentation Actually Means in AI
Instrumentation in AI consulting isn't glamorous. It doesn't make it into pitch decks. But it's what separates clients who can improve their systems from clients who are flying blind.
At a minimum, it means:
Logging inputs and outputs. Every prompt, every response, every tool call. Not forever, but long enough to catch problems before they become crises. You cannot debug what you cannot see. (There's a minimal sketch of this kind of log just after this list.)
Tracking latency and cost per call. One of our community members flagged something that stuck with me: after six months, local compute pays for itself in R&D mode. That calculation is only possible if you're tracking cost. Snowflake, Databricks, cloud compute: these things add up fast if you're not watching.
Watching for model drift. Models change. Providers update weights, change defaults, retire versions. A pipeline that ran cleanly in January can quietly degrade by March. You need version pinning and regression tests to catch it; there's a small sketch of that after the list as well.
Measuring actual usage, not assumed usage. This is the CRM problem again. Paying for a tool doesn't mean your team is using it correctly — or at all. Agent-based audits of user behaviour aren't a luxury. They're how you find out what's actually happening.
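To make the first two items concrete, here's roughly what that log can look like. This is a minimal sketch, not a prescribed stack: it assumes an OpenAI-style chat client, the per-token prices are placeholders, and every name in it is illustrative. An append-only JSONL file is enough to start.

```python
import json
import time
import uuid
from datetime import datetime, timezone

LOG_PATH = "llm_calls.jsonl"  # append-only structured log, one JSON record per call

def log_llm_call(user, model, prompt, response_text, latency_s,
                 prompt_tokens, completion_tokens, cost_usd):
    """Write one structured record per model call: who triggered it, input, output, latency, cost."""
    record = {
        "id": str(uuid.uuid4()),
        "ts": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "model": model,
        "prompt": prompt,
        "response": response_text,
        "latency_s": round(latency_s, 3),
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "cost_usd": round(cost_usd, 6),
    }
    with open(LOG_PATH, "a") as f:
        f.write(json.dumps(record) + "\n")

def call_and_log(client, user, model, prompt,
                 price_in_per_1k=0.0005, price_out_per_1k=0.0015):
    """Wrap a chat completion with timing and a cost estimate (prices here are placeholders)."""
    start = time.monotonic()
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    latency = time.monotonic() - start
    text = resp.choices[0].message.content
    cost = (resp.usage.prompt_tokens / 1000) * price_in_per_1k \
         + (resp.usage.completion_tokens / 1000) * price_out_per_1k
    log_llm_call(user, model, prompt, text, latency,
                 resp.usage.prompt_tokens, resp.usage.completion_tokens, cost)
    return text
```

Tool calls and retries slot into the same record. The point isn't the schema; it's that every call leaves a row you can query later.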
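Drift-watching is mostly two habits: pin the exact model version you validated rather than a floating alias, and re-run a small set of known prompts against it on a schedule. Here's a rough sketch; the cases and checks are invented for illustration, and the assertions are deliberately loose because model output isn't deterministic.

```python
import json

# Pin the dated snapshot you actually validated, not a floating "latest" alias.
PINNED_MODEL = "gpt-4o-2024-08-06"  # illustrative; use whatever version you tested against

REGRESSION_CASES = [
    # Assert properties of the output, not exact strings.
    {"prompt": "Extract the invoice total from: 'Total due: $1,240.50'", "must_contain": "1,240.50"},
    {"prompt": "Reply with a JSON object containing a 'status' key.", "must_be_json_with_key": "status"},
]

def run_regression(call_model):
    """call_model(model, prompt) -> response text. Returns failures to surface in CI or a nightly alert."""
    failures = []
    for case in REGRESSION_CASES:
        out = call_model(PINNED_MODEL, case["prompt"])
        if "must_contain" in case and case["must_contain"] not in out:
            failures.append((case["prompt"], "expected substring missing"))
        if "must_be_json_with_key" in case:
            try:
                if case["must_be_json_with_key"] not in json.loads(out):
                    failures.append((case["prompt"], "key missing from JSON"))
            except (ValueError, TypeError):
                failures.append((case["prompt"], "response was not valid JSON"))
    return failures
```

Run it nightly and after every provider update, and the January-to-March degradation shows up as a failing case instead of a support ticket.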
"The Ground Truth Is the Running System"
We have a phrase at Millwater: “get to the ground truth”. It means what's actually true in the running system right now — not what the docs say, not what you remember, not what the demo showed. When something breaks at 2am, the ground truth is the only thing that saves you.
Building instrumentation is how you access the ground truth at 2pm on a Tuesday before something breaks. It's the difference between reactive firefighting and proactive improvement.
Doing It Once, Doing It Right
We also say: let's do it once, and do it right. The temptation on early-stage AI projects is to skip instrumentation because it feels like overhead and it’s not “sexy”. You'll add it later. You'll set up proper logging when things get serious. You won't. The backlog fills up, the system grows, and retrofitting observability into a system that wasn't designed for it is genuinely painful.
Build it in from day one. Even a simple structured log of inputs, outputs, latency, and cost changes the entire trajectory of a project. It gives you the data to make real decisions instead of guesses.
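It also means the "is this working?" questions have cheap answers. A few lines over the same JSONL log sketched earlier (same assumed field names) gives you cost and latency per day and usage per person, which is most of the dashboard:

```python
import json
from collections import defaultdict

def summarise(log_path="llm_calls.jsonl"):
    """Roll the structured log up into per-day cost/latency and per-user call counts."""
    by_day = defaultdict(lambda: {"calls": 0, "cost_usd": 0.0, "latency_s": 0.0})
    by_user = defaultdict(int)
    with open(log_path) as f:
        for line in f:
            rec = json.loads(line)
            day = rec["ts"][:10]  # YYYY-MM-DD
            by_day[day]["calls"] += 1
            by_day[day]["cost_usd"] += rec["cost_usd"]
            by_day[day]["latency_s"] += rec["latency_s"]
            by_user[rec["user"]] += 1
    for day, s in sorted(by_day.items()):
        print(f"{day}  calls={s['calls']:4d}  cost=${s['cost_usd']:.2f}  "
              f"avg_latency={s['latency_s'] / s['calls']:.2f}s")
    for user, calls in sorted(by_user.items(), key=lambda kv: -kv[1]):
        print(f"{user}: {calls} calls")
```

That per-user count is the CRM question from earlier: you stop assuming adoption and start seeing it.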
What This Looks Like in Practice
For clients at Millwater, instrumentation is non-negotiable from the first sprint. Before we talk about which model, which agent framework, which vector database — we talk about how we'll know if this is working. What does success look like? How will we measure it? What will tell us it's degrading?
Those questions aren't blockers. They're the foundation that gives you the data to take to your board when they ask what you've been working on, or “how is that AI feature tracking?” Features are built on top of observability, not the other way around.
The projects that win aren't the ones with the most impressive demos. They're the ones where, six months in, a stakeholder asks "how is this performing?" and someone can pull up a dashboard and actually answer.
Build the foundation first. The features will thank you later.
— Rishi Prasad, Lead Full Stack Developer, Millwater Consulting https://millwater.consulting







