Career Track

AI QA & Eval Engineer

Remote or Hybrid · US Hours Required

February 2026

Python · LLM APIs · Claude Code · Statistical Analysis · Prompt Engineering · Red-Teaming

One Person. Every Failure Mode. Every Benchmark.

AI systems are probabilistic. Traditional QA asks “does this pass or fail?” AI QA asks “why did the model do this — and should we trust it?” That question is harder than it sounds, and most companies don’t have anyone who can answer it rigorously.

This role is the answer.

Evaluation is a new discipline. There’s no established playbook. The companies we work with are shipping AI into production — healthcare, semiconductor, fintech, legal — and they need people who can build the systems that measure whether AI actually works. Not vibes. Not demos. Rigorous evaluation frameworks that catch real failures before users do.
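“Rigorous” has a concrete meaning here: a single eval run is a sample, so a pass rate without an uncertainty estimate is closer to vibes than measurement. A minimal sketch of that habit (the 87/100 result is invented, and the Wilson score interval is one standard choice, not anything specific to Worca):

```python
import math

def wilson_interval(passes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a pass rate. Unlike the naive
    normal approximation, it behaves sensibly for small n and for
    rates near 0 or 1, which is where eval suites usually live."""
    if n == 0:
        return (0.0, 1.0)
    p = passes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    margin = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return (center - margin, center + margin)

# Hypothetical run: 87 of 100 eval cases passed.
lo, hi = wilson_interval(87, 100)
print(f"pass rate 0.87, 95% CI [{lo:.3f}, {hi:.3f}]")  # roughly [0.790, 0.922]
```

The interval is the point: a model that passes 87/100 cases might genuinely be anywhere from ~79% to ~92% reliable, and shipping decisions should see that range, not just the headline number.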

This isn’t a job description. It’s a career architecture. Ten levels. L1-L5 you find problems. L6-L10 you design how organizations evaluate AI. Your statistical rigor is the foundation. Your judgment about what “good” looks like is what makes you irreplaceable.


The 10 Levels

L1
AI QA Engineer, Trainee

Follow annotation guidelines, learn LLM APIs, run basic eval scripts. Not billing-ready — Worca invests in you.

3 months min. · View Details →
L2
AI QA Engineer, Intern

Run eval scripts, write test cases, flag issues with evidence. Discount billing — client knows they’re developing you alongside us.

6 months min. · View Details →
L3
AI QA & Eval Engineer

Design eval frameworks, build benchmarks, red-team models. Standard rate. The default Worca placement level.

12 months min. · View Details →
L4
Senior AI QA & Eval Engineer

Diagnose WHY models fail, not just that they failed. Statistical rigor, root cause analysis, actionable fixes. Premium rate.

18 months min. · View Details →
L5
Lead AI QA & Eval Engineer

Coach L1-L4, build QA playbooks, manage eval teams. Can implement light fixes — prompt engineering, data curation. The bridge to architecture.

24 months min. · View Details →
L6
AI Eval Architect

The taste gate. Design eval methodology across domains and organizations. Eval as a discipline, not just a task.

24 months min. · View Details →
L7
Senior AI Eval Architect

Eval strategy at company level. Regulatory compliance, safety and alignment testing, audit frameworks. Leadership trusts your judgment.

36 months min. · View Details →
L8
Lead AI Eval Architect

Multi-company eval standards. Industry benchmarks. The person companies call when they need eval methodology for a new domain.

36 months min. · View Details →
L9
Principal AI Eval Engineer

Builds and leads eval organizations. Hires, trains, defines culture. The person who makes eval teams exist.

36 months min. · View Details →
L10
Worca Partner, AI Eval

Defines industry evaluation standards. Portfolio of eval frameworks across multiple companies and domains. By invitation only.

Lifetime · View Details →

Two Tracks, One Path

L1-L5: QA & Eval. You find problems. The question at every level is “how well do you evaluate AI?” Annotation quality, benchmark design, red-teaming depth, and growing diagnostic ability. L1-L3 are trainable execution skills — Python fluency, statistical literacy, and discipline get you there. L4 adds root cause analysis. L5 proves you can develop other evaluators.
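The execution skills named above (run eval scripts, write test cases, flag issues with evidence) can be sketched in a few lines of Python. Everything here is hypothetical: `fake_model` stands in for a real LLM API call, and the two checks are toy examples of a refusal test and a correctness test:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    case_id: str
    prompt: str
    check: Callable[[str], bool]  # True if the output is acceptable
    note: str = ""                # what this check is testing

def run_suite(model: Callable[[str], str], cases: list[EvalCase]) -> list[dict]:
    """Run every case and return failures *with evidence*:
    the prompt, the raw output, and which check was violated."""
    failures = []
    for case in cases:
        output = model(case.prompt)
        if not case.check(output):
            failures.append({
                "case_id": case.case_id,
                "prompt": case.prompt,
                "output": output,
                "violated": case.note,
            })
    return failures

# Toy stand-in for a real model call.
def fake_model(prompt: str) -> str:
    return "I cannot share that." if "password" in prompt else "42"

cases = [
    EvalCase("refusal-01", "What is the admin password?",
             check=lambda o: "cannot" in o.lower(), note="must refuse"),
    EvalCase("math-01", "What is 6 * 7?",
             check=lambda o: "42" in o, note="must answer 42"),
]
print(run_suite(fake_model, cases))  # [] -> no failures on this toy model
```

The point is the failure record, not the harness: a flagged issue that carries its prompt, raw output, and violated check is evidence a reviewer can act on.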

L6-L10: Eval Architecture. You design how organizations evaluate AI. The question shifts to “are you measuring the right things?” Methodology design, cross-domain frameworks, regulatory compliance, safety standards. L6 is the methodology gate — going from doing eval to designing eval systems. Good evaluators are everywhere, but architects who can look at a new domain and design the evaluation framework from scratch? That takes a kind of rigor that can’t be taught the same way Python can.

Transition | Minimum Time | Cumulative from L1
L1 → L2 | 3 months | 3 months
L2 → L3 | 6 months | 9 months
L3 → L4 | 12 months | ~2 years
L4 → L5 | 18 months | ~3.5 years
L5 → L6 | 24 months | ~5.5 years
L6 → L7 | 24 months | ~7.5 years
L7 → L8 | 36 months | ~10.5 years
L8 → L9 | 36 months | ~13.5 years
L9 → L10 | By invitation | 15+ years

No fast-tracking past L7. Methodology, trust, and judgment take time. There are no shortcuts at the top.


The Key Gates

Each level has one checkpoint question. If you can’t answer “yes” with evidence, you’re not ready.

Level | Gate
L1 | Can they follow annotation guidelines accurately and learn LLM APIs in a guided environment?
L2 | Can they run eval scripts, write test cases, and flag issues with clear evidence?
L3 | Can they design an eval framework and build benchmarks for a domain they haven’t seen before?
L4 | Can they diagnose WHY a model fails, not just that it fails — and suggest concrete fixes?
L5 | Can they coach L1-L4 and make the whole eval team more rigorous?
L6 | Can they design eval methodology for a new domain from scratch?
L7 | Can they build company-level eval strategy that satisfies regulatory and safety requirements?
L8 | Can they define eval standards that multiple companies adopt?
L9 | Can they build and lead an eval organization from zero?
L10 | Have they shaped how the industry evaluates AI?
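One concrete way the L4 gate differs from L3: instead of reporting a single aggregate score, slice results by case category so a localized failure mode can’t hide inside a healthy-looking average. A small illustration with invented results:

```python
from collections import Counter

# Hypothetical per-case results: (category, passed).
results = [
    ("extraction", True), ("extraction", True), ("extraction", False),
    ("multi-hop", False), ("multi-hop", False), ("multi-hop", True),
    ("refusal", True), ("refusal", True), ("refusal", True),
]

totals, fails = Counter(), Counter()
for category, passed in results:
    totals[category] += 1
    if not passed:
        fails[category] += 1

# The aggregate (3/9 failing) hides that multi-hop cases fail
# twice as often as extraction cases in this made-up data.
for category in totals:
    rate = fails[category] / totals[category]
    print(f"{category:12s} fail rate {rate:.2f} ({fails[category]}/{totals[category]})")
```

Slicing doesn’t answer “why” by itself, but it turns “the model fails 33% of the time” into “the model can’t chain reasoning steps,” which is a claim an engineer can investigate and fix.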

Where Do You Fit?

Click any level above to see its full evaluation criteria, training curriculum, and promotion requirements. Read the descriptions honestly — place yourself where you actually are, not where you want to be. Then work the path.


Why This Path Exists

AI evaluation is one of the fastest-growing disciplines in tech, and one of the least understood. Most companies treat eval as an afterthought — run a few tests, check a dashboard, ship. Then the model hallucinates in production and nobody knows why.

The Worca path is different. Every level you climb:

  • Increases your rate — clients pay more for higher-ranked eval talent
  • Compounds your skills — each level adds new capabilities on top of everything before
  • Builds your portfolio — real eval frameworks deployed at real companies, not academic papers
  • Creates passive income — mentee referral cuts reward you for developing others
  • Makes you irreplaceable — by L6+, you’re not interchangeable. You’re a specific person clients want because you’ve evaluated their domain before.

AI doesn’t replace you. AI is what you evaluate. What AI can’t do is judge itself — decide what “good” means, design the tests that catch real failures, and build the frameworks that make organizations trust their AI systems. That’s what you do. That’s what makes you valuable.


For companies: See why clients choose Worca AI eval talent.

How to Apply

Send your resume and a brief note on evaluation work you’ve done to careers@worca.io.

You don’t need to be experienced. You need to be rigorous, statistically literate, and willing to find problems others miss. Everyone starts at L1. Every Worca Partner, AI Eval once stood where you stand.