
From Human-in-the-Loop to Agentic (Part 2): A Pattern We Discovered Building Page2Play


Part 2: Building the Agent (And Teaching It Taste)

By Olu • March 2026


In Part 1, I explained how Page2Play’s 75% frame accuracy wasn’t production-ready, and how building simplified UX controls for human reviewers accidentally created the exact toolkit an AI agent would need: four discrete editing methods that turned complex prompt engineering into structured, predictable actions.

Now: what happened when we actually built the agent, and the problem nobody warns you about.

Building the Agent

Once we recognized the pattern, building the agent was surprisingly direct. It took about three weeks to get it working. Here’s how it operates:

Assessment: For each frame in a storyboard, the agent compares the generated image against the character lineup: how the character is defined in the reference material versus how they appear in the rendered scene, within the context of the story. It produces a consistency score from 0 to 100.

Decision: If the score is 90 or above, the frame passes. If it’s below 90, the agent identifies what’s off and determines which adjustment controls to use: the same four methods a human user would choose from.

Correction: The agent adjusts the relevant parameters and triggers a regeneration. New image, new assessment, new score.

Retry Budget: The agent tries this loop a maximum of three times. If after three attempts it still hasn’t crossed the 90 threshold, it selects the highest-scoring version, sets it as the active image for that frame, keeps the other attempts as alternative renders, and moves on, ready for human review if needed.

No infinite loops. No burning tokens chasing perfection. A clear budget, a clear threshold, and a graceful fallback.
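
For concreteness, here’s a minimal sketch of that loop in Python. The helpers (generate_frame, score_consistency, choose_adjustments) are hypothetical stand-ins for our internal pipeline, not real APIs:

```python
PASS_THRESHOLD = 90   # consistency score a frame needs to pass
MAX_ATTEMPTS = 3      # the retry budget: no infinite loops

def revise_frame(frame, character_lineup):
    """Assess, correct, and regenerate a frame within the retry budget."""
    attempts = []
    params = frame.initial_params
    for _ in range(MAX_ATTEMPTS):
        image = generate_frame(frame, params)
        # Assessment: compare against the lineup in story context (0-100).
        score = score_consistency(image, character_lineup, frame.story_context)
        attempts.append((score, image))
        if score >= PASS_THRESHOLD:
            return image, attempts, True   # frame passes
        # Decision: pick which of the four editing methods to apply,
        # then retry with adjusted parameters.
        params = choose_adjustments(image, character_lineup, params)
    # Budget exhausted: set the highest-scoring attempt as the active
    # image, keep the rest as alternative renders, flag for human review.
    _, best_image = max(attempts, key=lambda a: a[0])
    return best_image, attempts, False
```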

How It Performs in Practice

On a typical storyboard run:

  • 70-80% of frames score 90 or above on the first attempt and pass immediately, with no corrections needed
  • Of the 20-30% that fail initially, about 65% pass on the second attempt after the agent adjusts the parameters
  • The remaining frames get a third attempt with further refinements
  • After three tries, any frames still below 90 are flagged with their highest-scoring version set as active

This entire process gets us to 95-98% of frames being production-ready, measured the same way we measured the original 75%: manual review by me and the team. The exact percentage depends on the story type and the level of action in the scenes.
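
Working the midpoints of those numbers through shows how the funnel compounds (illustrative arithmetic, not a precise measurement):

```python
p1 = 0.75                      # ~70-80% pass on the first attempt
p2 = p1 + (1 - p1) * 0.65      # second attempt recovers ~65% of failures
print(f"after two attempts: {p2:.0%}")  # -> 91%
# A third attempt on the remaining ~9% of frames closes most of the
# gap to the observed 95-98% production-ready rate.
```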

The Unexpected Problem: Teaching the Agent Taste

This is the part nobody warns you about.

The agent didn’t just match human performance. It exceeded it, in ways we didn’t want.

Let me show you what I mean with two real examples from a “Slow and Steady Wins the Race” storyboard:

Example 1: The Tortoise’s Scarf

The agent flagged a frame where the tortoise’s scarf had shifted slightly in color tone compared to the character lineup. To a human reviewer, it looked fine; the overall character was clearly the same tortoise. But the agent was right: there was a subtle drift that would compound across dozens of frames.

Example 2: The Missing Helmet

In another story, the character Olu was designed with a helmet in his character lineup. In one scene, the helmet was off because the story context was that he had returned home. A human reviewer would pass this immediately; the story called for no helmet. But the agent scored it 75 and flagged it as inconsistent with the character design, not understanding the narrative context that made the change appropriate.

The issue is that humans assess holistically. We see the overall frame, understand the story context, feel whether the character “looks right,” and forgive irrelevant details. The agent assesses exhaustively. Every pixel deviation carries equal weight unless you tell it otherwise. A shifted scarf color and a missing helmet that’s supposed to be missing: both get flagged the same way.

So we had to teach it taste.

How We Built the Taste Layer

We implemented weighted scoring where different elements carry different importance:

  • Facial features are weighted heavily; a missing eye or wrong expression matters a lot
  • Clothing color variations are weighted less; minor shade differences don’t break consistency
  • Proportion and positioning are contextual; if a character is facing away from the camera, we reduce the weight of facial details
  • Character names vs. descriptions in prompts: we taught the agent to use actual character names (like “Olu”) instead of descriptions (like “the character in the grey shirt”), which prevents the model from inventing unnecessary details
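
Here’s a rough sketch of what that weighting can look like. The element names, weight values, and the facing-away rule below are illustrative assumptions, not our production configuration:

```python
def weighted_consistency(sub_scores, facing_away=False):
    """Combine per-element consistency scores (each 0-100) into one score."""
    weights = {
        "facial_features": 0.50,  # heavy: a wrong expression breaks the frame
        "clothing_color":  0.10,  # light: minor shade drift is forgivable
        "proportions":     0.25,
        "positioning":     0.15,
    }
    if facing_away:
        # Contextual reweighting: facial detail matters less from behind.
        weights["facial_features"] = 0.10
        weights["proportions"] = 0.45
        weights["positioning"] = 0.30
    total = sum(weights.values())
    return sum(sub_scores[name] * w for name, w in weights.items()) / total
```

The character-name rule is different in kind: it lives upstream, in how prompts are constructed, rather than in how scores are weighted.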

We’re still refining this taste layer. Teaching an agent which types of adjustments matter for production quality and which are noise is genuinely hard. It’s encoding human aesthetic judgment into an automated scoring system. I think this layer, the taste layer, is where a lot of the real work in agentic AI lives. Not in the automation itself, but in teaching the agent what to care about.

What This Unlocked

Character consistency went from 75% to 95-98% across storyboard panels. The need for manual frame-by-frame review dropped dramatically. But the bigger unlock was downstream.

Once the frame revision agent was reliable, it created room to connect the other automated processes in the pipeline. Each agent handles its piece, and the final output, the completed storyboard, is the only thing that needs human review. Not each frame. Not each parameter. The finished product.

That’s the real shift from human-in-the-loop to agentic. The human isn’t removed. They’re moved to where their judgment actually matters: the final output, the creative direction, the overall quality assessment. Not the pixel-level corrections an agent handles better anyway.

The total timeline from starting with pure manual review to having a working agent was about 5-6 months:

  • 2 months of manual review (understanding the problem)
  • 8-12 weeks building and evolving the prompt-free UX
  • 3 weeks building and training the agent

What’s Next: The Pattern Repeats

The current agent adjusts parameters within a fixed composition. The camera angle, the shot type, the framing: those stay the same. The agent works within those constraints.

The next step is giving the agent a second dimension: the ability to change the composition itself. If a character doesn’t look right from a front-facing medium shot, maybe a three-quarter angle or a different shot type would express the scene better.

Here’s what’s interesting: we’ve already built UX controls for humans to change camera angles and shot types. Following the same pattern we discovered before, those human-designed controls will become the agent’s toolkit for composition changes. We’re not building agent capabilities from scratch; we’re recognizing that the tools already exist in the UX we designed for people.
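
As a hypothetical sketch of what that could look like: the bounded options humans already pick from in the UX become an enumerable action space for the agent. All names here are illustrative, not our actual controls:

```python
from enum import Enum

class ShotType(Enum):
    WIDE = "wide"
    MEDIUM = "medium"
    CLOSE_UP = "close-up"

class CameraAngle(Enum):
    FRONT = "front"
    THREE_QUARTER = "three-quarter"
    PROFILE = "profile"

# Because every parameter is discrete and bounded, the agent can
# enumerate and score candidate compositions instead of free-form
# prompting its way to one.
COMPOSITION_ACTIONS = [
    (shot, angle) for shot in ShotType for angle in CameraAngle
]
```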

That’s the next 3-6 months: teaching the agent not just to perfect a composition, but to choose better compositions. Same pattern. Same progression.

The Pattern Worth Taking Away

If you’re building AI-powered tools and thinking about when and how to introduce agents, here’s what our experience taught us:

1. Start with the human workflow. Don’t design for agents first. Design an excellent human experience. Clear controls, structured parameters, simplified decision points. Solve the problem for people. We spent 8-12 weeks iterating on UX before we even thought about building an agent.

2. Recognize the action space. The structured tools you built for humans define what an agent can do. If the controls are clear and the parameters are discrete, you’ve already built the agent’s toolkit. You just didn’t know it yet. Our four editing methods became the agent’s four intervention strategies.

3. Give the agent a budget, not unlimited freedom. Three attempts. A score threshold. A graceful fallback. Agents that run indefinitely chasing perfection burn resources and ship nothing. Our agent stops after three tries and selects the best attempt: good enough to move forward, flagged for human review if needed.

4. Build the taste layer. The hardest part isn’t automating the work. It’s teaching the agent what matters. What to flag and what to forgive. This is where human judgment still lives in an agentic system: not in the doing, but in the standards. We’re still refining how the agent weighs facial features versus clothing details, and how it understands story context versus visual consistency.

5. Move the human to where they matter most. The goal isn’t to remove humans. It’s to stop spending human attention on things an agent handles better, so that attention goes where it’s irreplaceable: creative direction, final quality assessment, and the big-picture decisions that require taste and context.

A More Honest Way to Build

We didn’t start with an architecture diagram mapping out human-in-the-loop to agentic workflows. We started with a problem: 75% accuracy wasn’t good enough. We built tools to solve it for humans. Then we recognized that those tools were the architecture for automation.

If you take one thing from this, it’s the pattern. If your product involves human review or creative control, here’s how to apply it:

Don’t start by designing for agents. Start by designing the best possible human workflow. Structure the controls. Simplify the parameters. Make the decisions discrete. This isn’t just good UX; it creates a natural action space for automation.

Check your existing UX for hidden agent toolkits:

  • Do you have discrete actions (checkboxes, toggles, sliders)?
  • Are parameters clear and bounded (not freeform input)?
  • Are outputs predictable (not creative exploration)?

If yes, the agent toolkit might already be there. You just need to recognize it.

This pattern works best when:

  • Humans are already doing the task manually
  • You need human judgment for final approval
  • Quality matters more than pure automation speed
  • The workflow is structured and repeatable

This pattern is a poor fit when:

  • The work is pure background automation with no human touchpoint
  • Agents need capabilities beyond human scale
  • Real-time requirements rule out manual workflows
  • The task is exploratory, not production-focused

For creative production with human quality control, like what we built at Page2Play, this emergent approach was more effective than trying to architect the agent system upfront. The human workflow revealed what the agent needed to do.

You might find the same is true for what you’re building.

Olu is the founder of Page2Play, an AI-powered platform that turns written scripts into production-ready videos.

[← Read Part 1: The Accidental Toolkit]
