
EDS AI Copilot: Turning Manual QA Into a 30-Second Scan

$3.1M annual savings | 19× ROI with 10-month payback | 90% manual QA elimination | 15× team capacity scaling

ROLE: Staff Product Designer, DS Lead
ORGANIZATION: Estée Lauder Companies
TIMELINE: Q3–Q4 2025
SCOPE: AI Strategy / Design Systems

Overview

Estée Lauder Companies' (ELC) Enterprise Design System (EDS) team was burning $600K a year on manual QA with zero strategic return. I built the business case, architecture, and prototype that turned that cost center into a $3.1M savings model, validated bottom-up against 200+ Jira tickets.

$3.1M annual savings model validated against internal Jira data, not industry benchmarks. The bottom-up methodology was the reason executives approved the $200K pilot.

What the Evidence Supported

This project didn't ship to production during my tenure. In Q3 2026, ELC approved a $200K pilot budget based on the deliverables below.

1. The business case reframed Design Systems (DS) as a capital investment. The $3.1M model gave the VP of Experience Design a framework to present DS investment to the CTO using the same language as engineering infrastructure proposals. This was the first time a DS initiative at ELC was positioned as a capital proposal rather than a headcount request.

2. A Model Context Protocol (MCP) Server architecture that passed engineering review. Front-end engineers confirmed that the three-source integration pattern (Storybook, Confluence, Token Studio) was technically sound and that the canonical data layer was the right constraint model for enterprise governance.

3. A Jira-validated methodology structured for reuse. Extract operational data, categorize by task type, build the ROI bottom-up, present with an audit trail.

Background

My Role
Staff Product Designer, DS Lead: end-to-end ownership of business case, prototype design, ROI validation, and stakeholder documentation
Duration
Q3–Q4 2025 (6 weeks)
Company
Estée Lauder Companies (ELC)
Team
DS team of 3 designers supporting 20 product designers across 17 brand teams. Validated with the Design System Director, Design Excellence leadership, UX/UI designers, front-end engineers, and PM/Analytics.
Context
Expanded from an AI research assignment into a full business case with functional prototypes and stakeholder validation.
Project Status
$200K pilot approved in Q3 2026. Proposal advanced to CTO via VP of Experience Design executive review.
Tools
Figma Design, Figma Make, Figma MCP Server, Jira, Confluence

Personas

Three roles, each losing hours to the same root cause: every quality decision required a human in the DS team's queue.

  • DS Designer: 60%+ of the day on manual Figma layer review, no time for architecture or governance. → Lint Scan, Generate Documentation
  • Front-End Engineer: Hours lost to back-and-forth when handoff docs are missing or inconsistent. → Generate Documentation, Ask EDS
  • E-Commerce UX/UI Designer: 3-5 day wait for DS validation on every design cycle. → Ask EDS, Lint Scan

The Challenge

ELC's EDS served hundreds of digital commerce and brand websites across 25+ global brands in 150+ markets. It was built as a component library, not an execution system. Every quality decision required a human in the loop: manual Figma layer review, Slack threads for questions, Jira tickets for resolution, and 3-5 day turnaround per cycle.

60% of DS capacity was consumed by this overhead. $600K+ annually in billable hours with no strategic return (50 releases × 160h manual QA × $75/hr). The VP of Experience Design needed AI efficiency metrics to demonstrate the team's strategic value to senior leadership.
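The baseline figure above is simple arithmetic, which can be made explicit. A minimal sketch reproducing the $600K annual cost from the stated inputs (50 releases, 160 hours of manual QA per release, $75/hr blended rate):

```python
# Baseline manual-QA cost model from the case study:
# 50 releases/year × 160 hours of manual QA per release × $75/hr blended rate.
releases_per_year = 50
qa_hours_per_release = 160
blended_rate_usd = 75

annual_qa_cost = releases_per_year * qa_hours_per_release * blended_rate_usd
print(f"${annual_qa_cost:,}")  # → $600,000
```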

The manual QA loop that consumed 60% of DS capacity: every review cycle required human triage through four handoff points, each adding 3-5 days. This is the workflow the MCP Server architecture was designed to eliminate.

Process

Methodology

Every number in the impact model traces to one source: Q3 2025 Jira analysis across 200+ tickets, categorized by QA, documentation, support, and prototyping task types, with time allocation validated against DS Director estimates. This bottom-up approach replaced the industry benchmarks I initially proposed after executives questioned their credibility.

Feature Design and Validation

Jira categorization determined feature scope. The three highest-volume task types became three features: Lint Scan (QA validation), Ask EDS (support questions), and Generate Documentation. I wrote a PRD for each. Lower-volume categories like brand onboarding automation were scoped out; they required org-level process changes beyond the pilot's mandate.

I built a high-fidelity Figma Make prototype with pre-populated ELC component data. Across three structured validation sessions, designers engaged with live prototype flows against real EDS components. One consistent observation: designers ignored the advisory-level results until all blocking violations were resolved, which validated the severity-tiered hierarchy and confirmed that collapsing advisory notes by default was the right interaction pattern. Front-end engineers flagged that the MCP Server's Confluence chunking strategy needed a versioning layer; I added a version-anchor requirement to the architecture spec before the executive review.

Cross-Functional Collaboration

The most consequential design decision came from a challenge, not from my own analysis. The Design System Director rejected my initial results hierarchy because it mirrored Figma's data model, not the team's triage workflow. I redesigned the entire hierarchy around severity tiers: blocking violations surface first with inline fix guidance, advisory notes collapse below. That conversation changed how I approach governance UX: start with the team's workflow, not the tool's data model (see Expanded: Lint Scan Results Hierarchy below).

I replaced ad-hoc Slack feedback with structured weekly reviews against three criteria: governance compliance, interaction completeness, and edge-case coverage. Finance challenged the blended hourly rate; I re-derived it from contractor billing data and it held within 3%. The VP of Experience Design pushed to reframe the narrative around capital investment rather than cost savings; I restructured the executive presentation around that lens. By presentation day, no number in the deck was unvetted. The VP of Experience Design approved the $200K pilot and advanced the proposal to the CTO.

Solution

System Architecture: Governance as the Foundation

Every feature expresses one architectural principle: EDS governance is enforced at the data layer, not in prompts, not in the UI, but in the mechanism that controls what the AI model can see and return. The MCP Server is the load-bearing element. Without it, every AI response is a hallucination risk. With it, non-EDS outputs are structurally impossible.

The MCP Server as canonical data layer is the core architectural decision: by bridging Storybook, Confluence, and Token Studio through a single governance layer, the system eliminates non-EDS outputs at the data level rather than relying on policy enforcement.
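The canonical-data-layer principle can be sketched in miniature. This is an illustrative Python sketch, not the actual MCP Server implementation; all names (`build_manifest`, `ground`, source keys) are hypothetical. The point it demonstrates is structural: only the three canonical sources ever enter the manifest, so a response that can't be grounded in the manifest fails by construction rather than by policy.

```python
# Hypothetical sketch of the canonical data layer. Only the three canonical
# sources are merged into the manifest; everything else is structurally
# excluded before the AI model ever sees it.
CANONICAL_SOURCES = ("storybook", "confluence", "token_studio")

def build_manifest(sources: dict) -> dict:
    """Merge component records, keeping only canonical sources."""
    manifest = {}
    for name, records in sources.items():
        if name not in CANONICAL_SOURCES:
            continue  # non-canonical data never enters the manifest
        for record in records:
            manifest[record["id"]] = record
    return manifest

def ground(component_id: str, manifest: dict):
    """Return canonical data, or None to force a low-confidence state."""
    return manifest.get(component_id)

manifest = build_manifest({
    "storybook": [{"id": "eds/button", "status": "stable"}],
    "marketing_wiki": [{"id": "legacy/button", "status": "unknown"}],
})
assert ground("eds/button", manifest) is not None
assert ground("legacy/button", manifest) is None  # structurally impossible to return
```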

Design Decisions

Three core UX decisions shaped the product. Each resolved a tension between simplicity and trust.

Decision: Lint Scan Results Hierarchy
Options considered: Flat list by component · Flat list by violation type · Severity-tiered hierarchy
Tradeoff: Flat lists transfer triage to the designer: the cognitive overhead the tool was supposed to eliminate
Final direction: Two-tier "Blocking vs. Advisory" hierarchy based on DS Director review of governance criteria

Decision: Ask EDS Response Format
Options considered: Single answer, no attribution · Answer with source link · Answer with source + version anchor + confidence indicator
Tradeoff: Simpler formats reduce friction but transfer the trust problem; an unverifiable answer undermines the premise
Final direction: Answer + source attribution + version anchor. Low-confidence state surfaced when MCP can't ground the response

Decision: Generate Docs Scope
Options considered: Comprehensive spec (all states, tokens) · Minimal spec (name + defaults) · Decision-point spec (default + divergences)
Tradeoff: Comprehensive specs move filtering from generation to reading. Minimal specs shift clarification back to Slack
Final direction: Three-tier decision-point scope: component identity, deviating states, brand overrides. Only what requires an engineering decision
Interaction walkthrough of the EDS Copilot prototype built in Figma Make.

Expanded: Lint Scan Results Hierarchy

The central UX challenge wasn't the scan; it was what happens after. A complex Figma file can surface dozens of violations simultaneously, and the wrong hierarchy makes the tool unusable.

My initial design organized results by Figma component: each component listed its violations underneath, mirroring the layer panel's structure. I chose this because it matched the mental model of navigating a Figma file. The Design System Director rejected it in the first review. The problem: when a designer runs a scan, they need to know what to fix first, not which component to look at first. A component-based hierarchy transfers the triage decision to the designer, which is exactly the cognitive overhead the tool was supposed to eliminate.

The redesigned hierarchy organizes by severity. Blocking violations (wrong token applied, non-EDS component used, accessibility failure) surface at the top with inline fix guidance. Advisory notes (spacing within tolerance, style preferences) collapse below. In subsequent validation sessions, designers completed triage tasks faster because the hierarchy matched their actual workflow: fix what's broken, then address what's optional.
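The triage logic described above is simple enough to express directly. A hedged sketch (class and field names are hypothetical, not from the actual plugin) of the severity-tiered hierarchy: blocking violations surface first with fix guidance, advisory notes are grouped for a collapsed section:

```python
# Hypothetical sketch of the severity-tiered results hierarchy:
# blocking violations surface first with inline fix guidance;
# advisory notes are grouped to render collapsed below.
from dataclasses import dataclass

@dataclass
class Violation:
    message: str
    severity: str  # "blocking" or "advisory"
    fix_hint: str = ""

def triage(violations):
    blocking = [v for v in violations if v.severity == "blocking"]
    advisory = [v for v in violations if v.severity == "advisory"]
    return {"blocking": blocking, "advisory_collapsed": advisory}

results = triage([
    Violation("Non-EDS component used", "blocking", "Swap to eds/button"),
    Violation("Spacing within tolerance", "advisory"),
    Violation("Wrong token applied", "blocking", "Use color.primary.500"),
])
assert len(results["blocking"]) == 2  # fix what's broken first
assert len(results["advisory_collapsed"]) == 1  # then address what's optional
```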

The severity-tiered hierarchy resolved the core UX tension: organizing by governance priority (blocking vs. advisory) rather than by Figma layer structure, based on DS Director feedback on how the team actually triages violations.
Answer with source attribution, deep link, and version/confidence indicator: the response format decision that prevents the trust problem of unverifiable AI answers.
Three-tier decision-point spec: component identity, deviating states, and brand overrides. Only what requires an engineering decision, avoiding the comprehensive-spec problem where filtering shifts from generation to reading.

EDS Copilot

Three features, each targeting a specific bottleneck, each governed by the MCP Server:

  • Lint Scan: Automated component validation in 30 seconds, replacing 3-5 day manual review cycles. Includes Accessibility Scan for automated contrast ratio checks and WCAG compliance with auto-fix suggestions.
  • Ask EDS: Instant AI-powered answers grounded in canonical component and token data via the MCP Server. Replaces synchronous Slack threads with asynchronous self-service.
  • Generate Documentation: 10-second spec generation, replacing the 4-6 hour manual process of screenshots, token copying, Confluence formatting, and notifications.
Full feature walkthrough demonstrating all three features against live ELC component data, validating that the MCP Server's canonical data layer produces governance-compliant outputs across feature types.
Lint Scan surfaces compliance violations in 30 seconds, proving the automated scan can replace the 3-5 day manual review that consumed 60% of DS capacity.
Ask EDS answers an implementation question with a documentation-grounded response via the MCP Server, demonstrating the self-service model that replaces synchronous Slack support threads.
Generate Documentation produces a decision-point spec in 10 seconds, proving the three-tier scope model (identity, deviating states, brand overrides) eliminates the 4-6 hour manual documentation workflow.

Plugin UX: Governance Feedback States

The system enforces governance at the data and validation layers, but what the designer actually sees is where those decisions surface. Each feature resolves to one of three Plugin UI states: a validated result, a blocked result caught by Backend API validation, or a low-confidence state when the MCP Server can't ground the request in canonical EDS data.

The three states use progressive visual weight to match the required user response. Validated results render in a compact, low-emphasis format so the designer's attention stays on their work. Blocked results use high-contrast error styling with an expanded detail panel, because a governance violation requires the designer to stop and act. The low-confidence state uses an amber treatment with a collapsible source-inspection panel, giving the designer the choice to proceed with caution or escalate to the DS team.
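The state logic above reduces to two booleans: whether the MCP Server could ground the request, and whether Backend API validation passed. A minimal sketch (state names and styling keys are illustrative, not the plugin's actual API):

```python
# Illustrative mapping from governance outcome to Plugin UI state.
# Visual weight scales with the required user response.
def plugin_state(validated: bool, grounded: bool) -> dict:
    if not grounded:
        # MCP can't ground the request: proceed with caution or escalate
        return {"state": "low_confidence", "style": "amber",
                "panel": "collapsible_sources"}
    if validated:
        # Compact, low-emphasis: attention stays on the designer's work
        return {"state": "validated", "style": "compact_low_emphasis",
                "panel": None}
    # Governance violation caught by Backend API: stop and act
    return {"state": "blocked", "style": "high_contrast_error",
            "panel": "expanded_detail"}

assert plugin_state(True, True)["state"] == "validated"
assert plugin_state(False, True)["state"] == "blocked"
assert plugin_state(False, False)["state"] == "low_confidence"
```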

Three governance outcomes enforce the architectural principle at the UI layer: validated results pass through, blocked results are caught by the Backend API, and low-confidence states surface when canonical data can't ground the response.

Technical Architecture

  • Figma Plugin: Designer-facing interface for all three features, built inside Figma with no context switching
  • MCP Server: Canonical data layer bridging Storybook, Confluence, and Token Studio. Full component manifest injected at request time
  • Backend API: Brand-isolated authentication and output validation. Schema-checks all token IDs and component keys before results reach the Plugin
  • Claude 3.5 Sonnet: Powers Q&A, auto-fix, and documentation generation. $15/day cost ceiling at pilot-scale
  • GitHub + CI/CD: Auto-syncs design tokens and triggers Storybook deployments. No net-new tooling required
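The Backend API's validation step can be sketched as a schema check against the canonical manifest. This is a hedged illustration with made-up token and component IDs, not the production validator; it shows the principle that every token ID and component key in a model response must resolve against canonical data before the Plugin renders it:

```python
# Hypothetical sketch of Backend API output validation: schema-check all
# token IDs and component keys against canonical EDS data before results
# reach the Plugin. IDs below are illustrative.
CANONICAL_TOKENS = {"color.primary.500", "space.200"}
CANONICAL_COMPONENTS = {"eds/button", "eds/card"}

def validate_output(response: dict) -> list:
    """Return violations; an empty list means the response passes through."""
    errors = []
    for token in response.get("tokens", []):
        if token not in CANONICAL_TOKENS:
            errors.append(f"unknown token: {token}")
    for key in response.get("components", []):
        if key not in CANONICAL_COMPONENTS:
            errors.append(f"unknown component: {key}")
    return errors

assert validate_output({"tokens": ["color.primary.500"]}) == []
assert validate_output({"components": ["brand/custom-button"]}) != []  # blocked
```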

Impact

  • $3.1M annual savings (projected recurring value)
  • 19× ROI (Year 1 return on $165K investment)
  • 90% manual QA eliminated (QA time automated away)
  • 15× capacity scaling (team throughput multiplier)

All outcomes represent modeled results validated against Q3 Jira analysis. Five segments, each traced to a specific Jira task type, build the annual recurring figure. The largest segment (prototype acceleration, $2.05M) depends on adoption rate, which I couldn't validate pre-launch. I modeled three scenarios against the $165K investment: 25% adoption ($1.1M, 5.7× ROI), 50% adoption ($1.9M, 10× ROI), and 75% adoption ($2.5M, 14× ROI). Even the conservative floor clears the investment by 5×. The 50% tier was the target I presented, based on ELC's Figma migration hitting roughly 60% active usage within 90 days of structured onboarding.
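The scenario math above can be reproduced from the stated inputs. A sketch of the net-ROI calculation per adoption tier (the case study rounds the results to 5.7×, 10×, and 14×):

```python
# Adoption-scenario model: net ROI on the $165K investment at each tier.
# Savings figures per tier are the ones stated in the case study.
investment = 165_000
scenarios = {0.25: 1_100_000, 0.50: 1_900_000, 0.75: 2_500_000}

for adoption, savings in scenarios.items():
    net_roi = (savings - investment) / investment
    print(f"{adoption:.0%} adoption: ${savings:,} savings, {net_roi:.1f}x net ROI")
# The 25% floor alone clears the investment several times over.
```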

KPI tree showing how the $3.1M breaks down by feature: feature-level attribution connects each savings segment to a specific product decision, strengthening the strategic case beyond task-type segmentation alone.

"A chatbot that can answer questions and tell people how to use a component? That is scalable to a thousand designers across the world. That's a much bigger thing."

Design Excellence Leadership, after reviewing the 15× capacity model and Ask EDS prototype

Planned Measurement Framework

For unshipped work, the measurement plan is the credibility proof. I designed a 2-brand pilot instrumented on three signals:

  • Lint scan adoption rate: Scans per active Figma file per week, targeting 3+ scans per file within 30 days of rollout.
  • Ask EDS query volume vs. Slack DS support threads: Target 50% reduction in synchronous support requests within 60 days.
  • Support ticket deflection rate: Percentage of Jira DS-support tickets avoided, measured against the Q3 baseline.

These three signals map directly to the three largest savings segments in the ROI model. If the pilot validates the 50% adoption tier, the business case supports full 25-brand rollout.
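The third signal, ticket deflection, is the most directly computable. A hypothetical instrumentation sketch (function name and the example ticket counts are illustrative, not pilot data):

```python
# Hypothetical deflection-rate calculation for the pilot: percentage of
# DS-support Jira tickets avoided relative to the Q3 baseline.
def deflection_rate(baseline_tickets: int, pilot_tickets: int) -> float:
    return (baseline_tickets - pilot_tickets) / baseline_tickets

# Illustrative example: 200 baseline tickets, 90 in an equivalent pilot window.
rate = deflection_rate(200, 90)
print(f"{rate:.0%} deflected")  # → 55% deflected
```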

Learnings

When I initially presented industry benchmarks, the room pushed back on credibility. When I rebuilt the model from 200+ internal Jira tickets categorized by task type, the same executives approved a $200K pilot. The lesson: how you prove the number matters as much as the number itself.

When the Design System Director and the VP of Experience Design had conflicting concerns about governance, the architecture that satisfied both was the right one. I now design governance as a user-facing feature, not a background policy. On this project, that meant the MCP Server's data layer enforced compliance automatically, so designers never had to think about it and leadership never had to police it.