What We Learned Trying to Build Our Own AI Agent at VidMob 

By: Kelly Flood (Product Manager at VidMob)

From Skepticism to Agentic Reality: What Building (and Rebuilding) Maddie Taught Me About AI Agents

If you’ve ever tried building an AI agent inside a complex product, you probably know the feeling: it seems straightforward at first, and then it slowly turns into a massive prompt engineering experiment that delivers less value than you expected and far less value than you hoped.

This was our experience with Maddie, our original creative intelligence AI agent. At VidMob, we help brands understand what drives creative performance by turning ads into structured data — analyzing elements like messaging, visuals, pacing, and talent, and connecting those signals directly to outcomes. Our users aren’t just reviewing dashboards; they’re trying to uncover nuanced patterns and make strategic decisions about how creative should evolve. Because of that, we work with complex datasets and fairly specialized workflows, which initially made us believe that building an AI agent in-house was the only realistic way to create something truly useful.

For a while, it made me question whether agentic experiences could ever really work in a practical product environment. I thought agents had potential in limited use cases, but I doubted a true agentic workflow was possible.

I want to share our story of what didn’t work, what finally did, and what it taught us about building AI agents — especially how flexibility, iteration speed, and a healthy amount of skepticism ended up mattering more than trying to perfectly control intelligence from the start.

Phase 1: A Focused Proof of Concept

Before Maddie existed as a product experience, we started a very focused proof of concept with a known GenAI vendor. The goal was intentionally narrow: generate meaningful creative insights from our Analytics data. One of our key reports helps users uncover the creative drivers of performance. What’s the right setting to use in my ads? Should I focus more on functional messaging or emotional? How much of an impact does including a celebrity spokesperson have? A frequent piece of feedback is that the data is valuable, but it takes a lot of digging to actually uncover a valuable insight. To address that pain point, we wanted an agent that could identify meaningful performance drivers and present them in a structured insight format.

At the time, this felt like an achievable goal. Rather than starting with a general-purpose agent, we wanted to teach a model how to produce structured insights using a specific framework. 

My role in that project was defining what “good” looked like. We developed a structured approach:

  • Identify the creative element driving performance
  • Ground it in a specific datapoint
  • Contextualize how the element was used
  • Explain why it matters
  • Provide a recommendation for what to do
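The five-part framework above could be expressed as a simple structured-output schema that an LLM is asked to fill in. Here is a minimal sketch in Python; the class name, field names, and example values are illustrative assumptions, not VidMob's actual implementation.

```python
from dataclasses import dataclass


@dataclass
class CreativeInsight:
    """One insight following the five-part framework (illustrative schema)."""
    element: str          # the creative element driving performance
    datapoint: str        # the specific datapoint that grounds it
    context: str          # how the element was used
    why_it_matters: str   # explanation of why it matters
    recommendation: str   # what to do about it

    def render(self) -> str:
        # Assemble the parts into a single readable insight statement.
        return (
            f"{self.element}: {self.datapoint}. "
            f"{self.context} {self.why_it_matters} "
            f"Recommendation: {self.recommendation}"
        )


# Made-up example values, only to show the shape of a filled-in insight:
insight = CreativeInsight(
    element="Celebrity spokesperson",
    datapoint="appears in the highest-performing ads in this report",
    context="Used in the opening seconds of top performers.",
    why_it_matters="Early recognition appears to hold attention longer.",
    recommendation="Test opening future ads with the spokesperson on screen.",
)
print(insight.render())
```

Enforcing a schema like this makes the outputs consistent, but as the next section describes, consistency alone did not make them compelling.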

In theory, this seemed like an ideal use case for an LLM. The instructions were clear, the data was organized and rich, and the outcome was well-defined.

We only ever got partway there. The outputs were technically accurate but often felt lukewarm. While they followed the insights formula, it always felt like something was missing – not enough context, an underwhelming “why” statement, or uninspiring recommendations. It felt as if fixing one thing, like accuracy, led to a tradeoff in something else, like relevance.

We went back to the drawing board again and again. We kept refining instructions, experimenting with different approaches, layering in more tooling – but we never reached the point of feeling confident putting it directly in front of customers.

That experience taught us a lot, as I’ll share below, but it also sowed the first seeds of doubt for me. If a narrow, highly structured use case was this difficult to get right, how much harder would a fully agentic system be? Maybe we would always need a human in the loop to produce insights that actually inspire.

Phase 2: Building an Agent In-House

Alongside that proof of concept, the team also developed in-house versions of Maddie. The internal agent focused initially on analytics workflows:

  • Generating creative insights
  • Answering performance questions
  • Providing normative, industry-specific learnings

As we tried to build a thoughtful architecture where the agent could correctly interpret users’ questions and generate the right insights, we found ourselves adding more and more control. We needed precise instructions for each type of creative element. We couldn’t easily add new reports and their data to the model; each new report required essentially starting from scratch. We ultimately ended up with a rigid, inflexible system.

That rigidity still didn’t get us the outcomes we wanted. While the insights weren’t wrong, the overall feedback was that they didn’t feel interesting or compelling. Our beta testers found the insights too generic, and there were enough accuracy issues that they doubted the model’s reliability.

We only ever launched Ask Maddie behind a feature flag for friendly testers. Even within that group, adoption remained low – no one was pushing for broader rollout. 

Eventually, the pattern became clear: We were investing a huge amount of effort into building infrastructure that still didn’t deliver the level of insight quality or flexibility we needed.

Phase 3: Trying Something Off-the-Shelf

At this point, we had spent a long time trying to make our internal approach work. We kept iterating, adjusting instructions, refining the architecture — but we weren’t seeing the kind of progress that made us confident enough to fully roll it out.

So we decided to try something different. We ran a short proof of concept using Foldspace — an off-the-shelf agent framework — mostly just to see what would happen if we stopped trying to build everything ourselves. I went in curious but without expectations. I wasn’t sure how much better the outputs would be compared to our previous attempts.

The difference became clear immediately.

The setup was scrappy in the best way. It was just me and an engineer sitting side by side, defining actions, writing instructions, connecting the code, testing immediately, and adjusting as we went. There was no long ramp-up period where we had to build infrastructure, launch, wait, and gather feedback. We were interacting with the agent almost immediately.

And within a week, we had something that felt more flexible than what we had spent months building internally. Not only could the agent generate well-reasoned, compelling insights in our key reports (our goal with previous builds), but it delivered a number of other benefits too. It could infer users’ intent and bring them to the right destination on the platform – no more needing a map to find the right report. Users simply ask something like “How are my ads performing today?”, and the agent takes them to the right place and loads sensible default filters, giving them immediate access to our creative data. It also automatically generated executive summaries at the top of our key reports, giving users a quick taste of the takeaways rather than forcing them to dive in and make sense of everything themselves.

The speed and breadth of what we could launch, just the two of us, in such a short time felt like a huge unlock.

The Real Shift

The biggest difference wasn’t that the agent suddenly became smarter – it was that the system wasn’t rigid anymore.

Instead of creating predefined flows for every possible scenario, we focused on defining tools and actions the agent could use. We gave the agent a brain filled with all of our Help Center articles to make it an expert on our platform. The rest came naturally. This allowed the agent to operate across all of our reports and workflows – not just the limited scope we initially built on – without needing a completely new setup each time. Whereas we previously had to start almost from scratch when adding a new dataset, this approach made expansion simple and efficient.

Really, though, the biggest change wasn’t just about setup or outcomes – it was about iteration.

Before this, testing felt slow and heavy. We would make changes, wait for deployment, run feedback sessions, analyze results, and then start the cycle again. Even small adjustments took time, which made experimentation feel expensive. With Foldspace, iteration became part of my normal workflow as a product manager.

I could read user conversations with the agent myself, see where responses felt strong or weak, tweak instructions, publish changes, and immediately see the impact. I rarely needed an engineer to step in and change the code. Switching models became a toggle. Adjusting reasoning levels became a setting. Updating suggested prompts could be done in minutes. Rolling something back wasn’t a major decision — it was just another experiment.

This is where things really started to shift. Building the agent felt like true, agile product work instead of a long-running engineering project. Instead of trying to design the perfect system upfront, we could continuously shape it through feedback and iteration.

Build vs Buy Isn’t the Question — Defining the Problem Is

After everything we went through, I don’t think the takeaway is simply “always buy instead of build.” There are absolutely situations where building internally makes sense, especially if agent infrastructure itself is core to your product or if you need extremely specialized control over how reasoning happens. At the beginning, building our own agent felt like the obvious path. We have a unique dataset, a specific way that strategists interpret creative performance, and workflows that don’t necessarily map cleanly onto generic tools. It felt reasonable to assume that an off-the-shelf solution wouldn’t understand our world well enough.

What changed for me wasn’t realizing that doing it ourselves was wrong. It was realizing that we had been solving the wrong problem. We thought the biggest challenge was training the model and tailoring it to our data. In reality, the bigger challenge was flexibility and how quickly we could iterate once the agent was live.

Building internally gave us a lot of theoretical control, but it slowed down experimentation. Every change required more structure, more coordination, more data science and engineering work. That made it harder to learn from real usage and harder to evolve the system over time.

Moving to an off-the-shelf framework didn’t eliminate control; it just moved it and made it much easier to manage. Instead of controlling the architecture upfront, we gained control through iteration. We could experiment more quickly, test different approaches, and refine behavior based on what actually worked instead of what we predicted would work – often without any additional engineering work.

For us, that tradeoff ended up being worth it.

The question isn’t really “build or buy.” It’s where you want to spend your complexity budget: on infrastructure or on iteration.

My Advice to Other PMs Building Agents

If you’re a PM thinking about building an agent inside an existing product, here are a few things I wish I had internalized earlier.

First, don’t assume that intelligence is the hardest problem. It’s easy to believe that success depends on picking the right model or designing the perfect prompt framework. In our experience, the harder problem was building something flexible enough to evolve as we learned.

Second, iteration speed matters more than you think. Agents are inherently probabilistic; it takes many rounds of trial and error before you get things right. The faster you can test ideas, observe real interactions, and make adjustments, the faster you’ll get to something genuinely useful.

Third, be honest about how much setup your users need to do and are willing to do. One of the biggest barriers we ran into early on was requiring too much input before the agent could provide value. The closer you can get to a single click to value, the more likely users will be to adopt.

Lastly, it’s okay to be skeptical. For a long time, I thought agents might only ever get you halfway to something meaningful — maybe 30–50% of the way to a useful insight — and that a human would always need to finish the job.

It’s actually better to keep some doubts in mind while you explore what’s possible. Because I wasn’t assuming the agent would magically solve everything, it forced us to focus on the hardest questions: Were the insights accurate? Were they relevant? Did they actually feel compelling?

Those questions became the bar we had to meet, and they ultimately made the final experience stronger.

What changed my perspective wasn’t a breakthrough in intelligence. It was realizing that with the right level of flexibility and iteration, that gap starts to shrink.

Agents aren’t magic, but when the system is designed to evolve quickly, they can become far more practical than one might initially believe.