How I embedded as the de facto product lead at one of the world's largest pharmaceutical companies to design and ship an AI-powered MLR review platform — navigating enterprise regulation, a distributed engineering team across four countries, and the messy reality of probabilistic AI in a compliance-driven environment.
At Eli Lilly, every piece of promotional and medical content — from web banners to HCP materials — goes through a Medical, Legal, and Regulatory (MLR) review before it can be published. For a company operating across 125 countries with thousands of branded assets, the stakes of getting that review wrong are significant: regulatory fines, product recalls, and reputational damage.
But the review process itself was deeply manual and inconsistent. There was no formal training or onboarding documentation for new content reviewers. Instead, new hires shadowed experienced reviewers for weeks before being left to develop their own approach, complete with personal cheat sheets, bookmark folders, and SharePoint documents that nobody else used, plus rules of thumb that lived entirely in their own heads.
Creatives had a name for the result: "volleys," submissions that bounced back with conflicting feedback depending on which reviewer picked them up. One reviewer might flag an issue another had never mentioned. The same content, reviewed twice, could get two different answers. Every volley meant rework, delay, and eroding trust between creative and review teams.
The opportunity was clear: if the review rules lived in code rather than people's heads, every submission would be evaluated against the same standards, every time. And when a rule changed, it would propagate instantly — no retraining, no interpretation drift, no lag.
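To make the idea of "rules in code" concrete, here is a minimal sketch in Python. It is a deterministic stand-in, not the platform's implementation: the real agents ran on an LLM platform, and the rule IDs, the `ExampleBrand` name, and the `Rule` and `review` helpers below are all hypothetical. What it shows is the property that matters: every submission runs through the same rule list, and changing a rule in one place changes it for every review.

```python
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Rule:
    rule_id: str
    description: str
    violates: Callable[[str], bool]  # returns True when the text breaks the rule


# Hypothetical rules for illustration only; not Lilly's actual business rules.
RULES: List[Rule] = [
    Rule(
        "TM-001",
        "Brand name must be followed by the registered-trademark symbol",
        lambda text: re.search(r"\bExampleBrand\b(?!®)", text) is not None,
    ),
    Rule(
        "DISC-001",
        "Content must reference important safety information",
        lambda text: "important safety information" not in text.lower(),
    ),
]


def review(text: str, rules: List[Rule] = RULES) -> List[str]:
    """Return the IDs of every rule the text violates, in rule order."""
    return [rule.rule_id for rule in rules if rule.violates(text)]


compliant = "ExampleBrand® helps. See Important Safety Information."
non_compliant = "ExampleBrand helps."

print(review(compliant))      # no findings
print(review(non_compliant))  # both rules flagged
```

Because `review` is a pure function of the text and the rule list, reviewing the same content twice cannot produce two different answers, which is exactly what the volleys lacked.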
Because I was a third-party consultant, Lilly's governance policies prevented me from holding formal system ownership; that sat with a Lilly employee. In practice, however, I functioned as the full product lead: running discovery, rewriting the PRD from scratch after inheriting incomplete requirements, building the product roadmap, defining the release strategy, and managing the engineering backlog day to day.
My Lilly counterpart focused on timeline management, navigating internal bureaucracy, and stakeholder communications — the organizational interface a consultant can't fully occupy. The product thinking, the engineering partnership, and the decisions about what to build and in what order were mine.
I also used Microsoft Copilot agents to generate the PRD, Epics, and User Stories — reducing documentation overhead by 30% and freeing time for the work that actually required judgment. The product I was building used AI to eliminate manual review. It would have been inconsistent not to apply the same thinking to my own workflow.
Lilly's internal AI platform, Cortex, sat on top of multiple LLMs including GPT-4.5. On paper, it could do a lot. In practice, nobody on the project had shipped anything on it before. Rather than commit to a full multi-agent architecture before understanding what the platform could actually do, I designed the Alpha as a technical probe: one agent, one brand, one file type, one user type.
This constraint — a single agent reviewing a single brand — wasn't timidity. It was risk management. The Alpha would tell us what Cortex could and couldn't do, which would inform every architectural decision that followed. Without that signal, we'd be building on assumptions.
The team I inherited was distributed across Canada, Mexico, Costa Rica, and India. We hadn't worked together before. The business requirements I received were incomplete enough that I rewrote the PRD from scratch before the first sprint. And we were operating inside Lilly's enterprise governance framework, which required formal application registration, architecture solution reviews, cybersecurity reviews, and environment configuration before a single user could log in.
In six weeks, the team delivered: repositories and deployment pipelines, separate Dev and QA environments, application registration within Lilly's governance framework, a full web UI integrated into the CATS ecosystem, core logging and error handling, and the Legal agent processing six business rules on day one.
Throughout development, I worked directly with the engineering team on prompt engineering and refinement — not just defining what the agent should check, but how it should reason about edge cases. I also sourced and created positive and negative test cases to feed the evaluation framework, building the quality foundation the agents would be measured against.
When my Lilly counterpart performed UAT on the Alpha release, she was not satisfied with the agent's accuracy. The complaints were understandable. What surfaced underneath them was a more fundamental misalignment that nobody had explicitly addressed: Lilly expected the AI to behave like traditional software — either right or wrong, with 100% accuracy as the baseline expectation.
Our team's model was different. For an Alpha release, we were targeting 65% accuracy as a reasonable floor for a first agent on a new platform. The path to Production ran through a 95% accuracy threshold — but you don't get there by waiting until everything is perfect before releasing. You get there by releasing, observing, and improving.
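The two thresholds can be expressed as explicit release gates over a labeled test set. The sketch below assumes a simple definition of accuracy (the share of test cases where the agent's flag matches the human label). The 65% and 95% figures come from the targets described here; the `LabeledCase` structure, the stage names, and the sample data are illustrative assumptions, not the project's evaluation framework.

```python
from dataclasses import dataclass
from typing import List

ALPHA_FLOOR = 0.65      # Alpha target described in the text
PRODUCTION_GATE = 0.95  # Production threshold described in the text


@dataclass(frozen=True)
class LabeledCase:
    case_id: str
    expected_violation: bool  # label assigned by a human reviewer
    agent_flagged: bool       # what the agent reported for the same content


def accuracy(cases: List[LabeledCase]) -> float:
    """Fraction of cases where the agent agrees with the human label."""
    if not cases:
        raise ValueError("accuracy is undefined for an empty test set")
    correct = sum(c.expected_violation == c.agent_flagged for c in cases)
    return correct / len(cases)


def release_stage(acc: float) -> str:
    """Map an accuracy score onto the release gates."""
    if acc >= PRODUCTION_GATE:
        return "production-ready"
    if acc >= ALPHA_FLOOR:
        return "alpha-acceptable"
    return "below-alpha-floor"


# Hypothetical test set: 20 cases alternating positive and negative labels.
# The agent matches the label on the first 14 and misses the last 6.
cases = [
    LabeledCase(
        case_id=f"case-{i}",
        expected_violation=(i % 2 == 0),
        agent_flagged=(i % 2 == 0) if i < 14 else (i % 2 != 0),
    )
    for i in range(20)
]

score = accuracy(cases)
print(f"{score:.0%} -> {release_stage(score)}")
```

Writing the gates down this way turns "is it accurate enough?" from an argument into a question with a stated answer for each release stage.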
Errors from a probabilistic AI system are not software bugs waiting to be fixed. A Legal agent that's right 68% of the time in Alpha isn't broken; it's a starting point. The feedback loop between user accepts/rejects and prompt refinement is the improvement mechanism. This distinction between deterministic and probabilistic systems is one of the most common, and most consequential, gaps in enterprise AI adoption.
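One way to picture that feedback loop: aggregate reviewer accepts and rejects per rule, then queue the rules with the lowest acceptance rate for prompt refinement first. This is a sketch under assumptions, not the project's actual pipeline; the event format, the rule IDs, and the reuse of 95% as the cut-off are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Each event is (rule_id, accepted); accepted=True means the reviewer
# agreed with the agent's finding for that rule.
Feedback = Tuple[str, bool]


def acceptance_rates(events: List[Feedback]) -> Dict[str, float]:
    """Share of the agent's findings that reviewers accepted, per rule."""
    totals: Dict[str, int] = defaultdict(int)
    accepts: Dict[str, int] = defaultdict(int)
    for rule_id, accepted in events:
        totals[rule_id] += 1
        if accepted:
            accepts[rule_id] += 1
    return {rule_id: accepts[rule_id] / totals[rule_id] for rule_id in totals}


def refinement_queue(events: List[Feedback], threshold: float = 0.95) -> List[str]:
    """Rules below the threshold, worst acceptance rate first."""
    rates = acceptance_rates(events)
    below = [rule_id for rule_id, rate in rates.items() if rate < threshold]
    return sorted(below, key=lambda rule_id: rates[rule_id])


# Hypothetical reviewer feedback for three rules.
events = [
    ("TM-001", True), ("TM-001", True), ("TM-001", True), ("TM-001", False),
    ("PRIV-002", True), ("PRIV-002", False), ("PRIV-002", False), ("PRIV-002", False),
    ("DISC-001", True), ("DISC-001", True), ("DISC-001", True), ("DISC-001", True),
]

print(acceptance_rates(events))
print(refinement_queue(events))
```

The point of the ranking is that rejections stop being complaints about the tool and become a prioritized work list for the next round of prompt changes.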
We proceeded to Beta with a clear goal: hit 95% accuracy, complete all regulatory documentation, and push to Production within 8 weeks. The team expanded the Legal agent from 6 to 41 business rules across Trademark & IP, Privacy & Consent, Digital & Web Compliance, and Disclaimers & Branding.
While we were navigating Lilly's regulatory requirements for a Production release — the architecture reviews, security approvals, documentation, and environment hardening that enterprise software deployments require — an internal Lilly team, working in isolation, built a competing prototype in three weeks. Leveraging the business rules and requirements our team had spent months gathering, they demonstrated a tool with more visible features in a fraction of the time.
The project was placed on hold. I understood the decision. If the internal team had built something better and faster, pausing to evaluate was the right call for the business.
The internal team built in a room, without the overhead our team carried: no environment setup, no security reviews, no application registration, no enterprise governance. They built a prototype. We built a production-ready platform. Whether they'll have to navigate the same regulatory waters on the path to Production, and whether the business rules they inherited will hold up to Medical and Regulatory scrutiny, remains to be seen.
I should have advocated more forcefully for getting the Alpha in front of end users despite the accuracy concerns. Yes, some confidence might have been lost in the short term. But the iterative feedback loop — real reviewers accepting and rejecting agent findings — is the mechanism that improves the model. Waiting for perfection in the lower environments was the wrong trade-off.
I relied too heavily on my Lilly counterpart to keep stakeholders informed. In hindsight, I should have created more forcing functions — regular written updates, a visible dashboard of agent accuracy progress, something that made the team's work legible to people who weren't in the sprints. Stakeholders who feel in the dark become nervous stakeholders, and nervous stakeholders are a project risk.
We were building an AI-forward product, but the engineering team wasn't using AI to build it. In retrospect, that was a missed opportunity. AI-assisted development tools could have accelerated our work through the complex integration challenges, particularly the RAG pipeline tuning and the accuracy evaluation loops that consumed significant sprint capacity. I should have pushed for earlier adoption of these tools as a standard part of the team's workflow.