Quality Engineer
Willow runs AI agents on behalf of enterprises. Every agent action touches sensitive systems, executes real tool calls, and gets logged for audit. The product spans a deterministic surface (dashboards, APIs, MCP runtime) and a non-deterministic one (agents, skills, tool behavior across 1000+ integrations). If either side breaks, customers feel it immediately.
Quality is what makes the difference between a platform enterprises trust with their data and one they don't. That's your job - and you own it end to end. From the bug that ships to production to the regression that almost did, it's yours.
You'll be testing the product today and building the system that tests it tomorrow - in parallel, on purpose.
What You'll Do
- Own the ship/no-ship call. Every release goes through you. You decide what's ready and what isn't, and you back the call with evidence - not vibes. When something slips through, you own the post-mortem and the fix that makes sure it can't slip through that way again.
- Test today, automate tomorrow - every week. Be the human safety net for what's shipping now: dashboards, APIs, MCP tool calls, integrations, on-prem deploys. While you're doing it, build the system that does it for you next time. The patterns you catch manually this month should be catching themselves next month.
- Build the quality platform for a product that talks to 10k+ tools. Regression suites for the deterministic stuff. Eval harnesses and cron agents for the non-deterministic stuff - are tool calls correct, are skills behaving, is a Gmail connector still working after the vendor changed their API at 3am. You design how we know the product works, continuously.
- Make agents test the product. We build AI infrastructure; you should be using AI infrastructure to test it. Cron-driven agents that exercise real flows end to end, LLM-as-judge for skill output quality, automated probes for the long tail of integrations no human has time to click through every day. If a regression makes it to a customer, the agent that should have caught it gets built before the next release.
- Cover the surface no one else will. SaaS, dedicated cloud, on-prem (Kubernetes via Helm) - they all have to work. The MCP runtime authenticating real OAuth tokens, the encryption envelope decrypting under load, the SCIM provisioning flow doing what Okta expects. You go where the gaps are. No one is going to assign you tickets.
- Remove yourself as a bottleneck. Document the patterns. Templatize the eval harnesses. Write the test utilities devs can use without asking you. Your job isn't to be the only person who knows if something works - it's to make "is this broken?" answerable by anyone on the team in under a minute.
You'll be a fit if
- You've built real automation in code. Playwright, Cypress, pytest, vitest, custom harnesses - whatever the stack, you've written and maintained test suites that survived contact with a real codebase. Manual QA experience alone isn't enough. Low-code tools aren't enough. This is non-negotiable.
- You've owned a release process end to end. You've been the person who said "ship it" or "don't ship it" - and lived with the consequences either way. Not a reviewer, not a gate someone else managed. The call was yours. This is non-negotiable.
- You're proactive. You don't wait for tickets. You see what's broken, what's fragile, what's about to break, and you go fix it. If something falls between two engineers' desks, you pick it up. The phrase "that's not my job" doesn't exist in your vocabulary.
- You think in systems, not test cases. A bug isn't just a bug - it's a signal about which part of the system has no safety net. You ask why it wasn't caught before you ask how to fix it.
- You move fast. Decisions in hours, not weeks. You'd rather ship the test suite that catches 80% of regressions today than the perfect framework in three months.
- You've owned quality at a startup or small team - not one slice of QA at a big company. You know what it means when there's no one else to escalate to.
Bonus points
- You've worked on non-deterministic systems - AI agents, LLM applications, ML pipelines, anything where the same input doesn't always produce the same output. You know that "did it work" is a harder question than it sounds.
- You've built evals. LLM-as-judge, golden datasets, scoring harnesses, drift detection. You've shipped a system that answered "is the model still good?" and trusted its answer enough to make decisions on it.
- You've operated infrastructure for AI workloads - MCP servers, tool-calling pipelines, sandboxed agent execution, anything where the "system under test" is itself an autonomous process.
- You've been the first or second quality hire somewhere before.
Why Willow
You'll build the quality layer for the AI era.
AI adoption is accelerating faster than any technology shift before it. Soon, every employee will delegate thousands of tasks to AI agents - running in parallel, accessing sensitive data, performing actions across internal systems and SaaS tools. Today, this layer has zero governance, and almost no one is testing it the way it needs to be tested.
Willow is building the control layer for AI agents: zero-trust authentication, app-aware permissions, centralized audit trails. We enable adoption without compromising security. Every agent, every tool, one control plane - and one person making sure all of it actually works.
We're already in production with forward-thinking enterprises. The market is moving fast - and we're positioned to lead it.
What else:
- Founding impact: You're not joining a team - you're building the quality function from scratch. Early equity, real ownership, direct influence on how the product is built and shipped.
- Ship to real customers: No theoretical exercises. Enterprises are using what we build today, and the bugs you catch (or miss) reach them in hours.
- Exceptional team: Small, senior, low-ego. Everyone here can build.
- The right moment: Strong traction, active fundraising, and a market that's exploding. This is the window.
