Shaping AI's role in high-stakes hiring decisions

I defined how AI should support interviewers without replacing their judgment. Because this was our first high-stakes AI initiative, we had no established benchmarks for 'correct' behaviour. The design isn't complex. The decisions are. Every choice balanced user convenience against accountability risk.

Role

I led design and research and shaped product direction, working with science, product, and engineering.

Scope

Amazon's first AI initiative in hiring decisions. Cross-functional collaboration with science, product, and engineering teams.

Outcome

Early adopter release: 62% continued usage, 12% improvement in feedback quality. Findings shaped how the tool balances AI assistance with interviewer accountability before global release.

At global release, 30,000+ users adopted within two months with 98% continued usage.

Business background & challenge

AI in hiring decisions carries accountability risk. The real risk wasn't bad AI. It was undermining interviewer ownership of their own assessments.


Amazon's recruiting platform serves over 200k interviewers, from first-timers to seasoned Bar Raisers. Off-the-shelf AI tools weren't enough. They lacked Amazon-specific context and produced generic outputs.

What I needed to solve: How should AI enter a workflow where human judgment is the product? I broke this into three design questions:

How should AI fit into the feedback workflow?

Where AI assists, where it stays out of the way, and how interviewers maintain control.

What should AI provide, and where's the line?

How much AI output helps versus how much starts replacing interviewer judgment.

How do we help interviewers write stronger feedback?

Beyond generating content, how AI and design can improve feedback quality.

Designing the flow

With no PM in the early phase, I led the exploration to define how AI should assist without distracting from the task of writing evidence-based feedback. I shaped a single workflow that supports different entry points in how interviewers start their feedback.

AI support that meets interviewers where they start

Whether interviewers generate an AI summary or write from scratch, the same tools support them through to submission.

Interviewers can generate AI feedback or start writing on their own

AI suggestions and quality checks help strengthen feedback before submission

Treating AI summary as reference, not the answer

Most AI tools optimize for zero-click automation. I tested whether that approach would work for high-stakes hiring decisions. Across usability sessions with interviewers and Bar Raisers, when AI content was auto-populated into the feedback field, participants tended to make surface-level edits rather than adding their own observations.


I chose to keep the AI output in a separate space to introduce intentional friction. This forces a cognitive checkpoint, requiring interviewers to manually move the content into their feedback field. I made this call because ownership mattered more than speed.

Comparison: Prioritizing "Zero-click" speed vs Designing for accountability

Auto-populate: AI output appears directly in the field

What I chose: friction to preserve ownership

The interviewer reviews the AI summary and pulls what they need into their own feedback

Setting the right boundary for AI-generated draft

I collaborated with the Science team to determine the optimal depth of AI output.


Testing a spectrum from minimal summaries to deep analysis, I identified the threshold where AI maximizes efficiency without encroaching on interviewer judgment.


The decision was one reliable default. While I initially considered a "choose your mode" approach, testing revealed that choice added friction. I defined a single experience to provide a clear starting point while keeping the interviewer in control.

Raw notes and interview context provided to participants

Three AI styles that participants evaluated and compared

I partnered with the science team to test three output styles

Option A was too safe

I assumed a basic summary was the best starting point, but testing showed it didn't save enough time to be worthwhile.

Option C was too risky

The AI's evaluation felt like a black box. Users lost trust when the logic felt forced and subjective, undermining their own judgment.

Option B hit the sweet spot

Some testers wanted more polish, but I advocated for this level: a structured draft that supports judgment rather than replacing it.

Helping interviewers strengthen their feedback

During the pilot, 15% of interviewers submitted AI summaries with little to no modification, risking feedback quality and hiring decisions. For global release, I introduced a check aligned with how the science team evaluates strong feedback. Divergence from AI alone doesn't mean the feedback is good. When gaps exist, the check surfaces targeted suggestions on what to strengthen.

Feedback quality check highlights gaps and suggests targeted improvements before submission

Proactive prompts surface when limited modifications are detected, nudging interviewers to add their own observations.

Feedback quality check evaluates feedback across four dimensions: overall quality, analytical depth, balance between strengths and concerns, and coherence.

Actionable suggestions: each dimension surfaces specific suggestions to help interviewers address gaps.
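To make the idea of detecting "limited modifications" concrete, here is a minimal sketch of one way such a signal could be computed. It is not the production logic: the function names, the word-level similarity from Python's `difflib`, and the 0.9 threshold are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical threshold: similarity at or above this value is treated as
# "limited modification". The real cutoff is not stated in this case study.
LIMITED_EDIT_THRESHOLD = 0.9


def similarity(ai_summary: str, feedback: str) -> float:
    """Word-level similarity (0.0 to 1.0) between the AI summary and the feedback."""
    ai_words = ai_summary.lower().split()
    feedback_words = feedback.lower().split()
    if not ai_words or not feedback_words:
        # No AI summary was generated, or nothing was written yet:
        # there is nothing meaningful to compare.
        return 0.0
    return SequenceMatcher(None, ai_words, feedback_words).ratio()


def needs_nudge(ai_summary: str, feedback: str,
                threshold: float = LIMITED_EDIT_THRESHOLD) -> bool:
    """Return True when the feedback is close enough to the AI summary to prompt the interviewer."""
    return similarity(ai_summary, feedback) >= threshold


if __name__ == "__main__":
    summary = "Candidate described a migration project and explained tradeoffs clearly"
    own_words = ("I observed strong ownership when the candidate walked "
                 "through rollback planning in detail")
    print(needs_nudge(summary, summary))    # unmodified copy of the summary
    print(needs_nudge(summary, own_words))  # feedback written by the interviewer
```

A signal like this only says how close the feedback is to the AI summary. It says nothing about whether the feedback is strong, which is why it would sit alongside the quality check and not replace it.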

I considered adding rating-feedback alignment checks but decided against them. The AI only sees text, not the context the interviewer observed directly. Flagging mismatches could anchor interviewers to the AI's interpretation or penalize nuanced assessments.

Impact and what came next

I partnered with PM to define success metrics. Early adopter release showed 62% continued usage and 12% improvement in feedback quality. Pilot insights drove iteration before global launch.


The same approach now extends across the interviewer experience. I designed the question preparation tool and consulted on debrief and outcome summaries for recruiters. In each case, AI provides a starting point; humans review and finalize.

GLOBAL RELEASE

52%

Adoption in 2 months

30K+

Unique users

98%

Continued usage

Built by Fangru Wu 2026

Craft matters