Hi, I'm Steel.

I build AI agents and the evaluation systems that keep them trustworthy at scale.

Most interested in AI safety, evaluation methodology, and the production end of the agentic stack.

Now: building AI agent memory and eval systems at Shopify and Basin.

Personalization and conversational AI for tens of millions of users.

$4M projected annual savings · 30M+ users · Multimodal: in flight

Python · LLM-as-judge · DSPy/GEPA · Qwen 4B · Distillation

Built a two-step agentic flow generating personalized conversation starters referencing product imagery, with a custom LLM-as-judge eval pipeline gating rollouts.
Distilled the production LLM into Qwen 4B (open weights), for more than $4M in projected annual savings; now extending the model to process product imagery directly (vision integration).
Designed the offline eval framework that gates agent rollouts: automated LLM judges, benchmark suites, observability for agent reasoning loops and tool selection.

Co-founder/CTO at a real Waco, TX climbing gym. I'm the only engineer; I built and run the entire stack.

~14,000 customers · iOS app live on App Store · Two production AI agents

React Native + Expo · FastAPI + Heroku · Neon Postgres · Supabase · S3 · Anthropic

Member iOS app — zero to App Store in 5 weeks. Sign-in via OTP, multi-member households, rewards with QR redemption, embedded Cliff chat.
Two production AI agents on multi-agent / MCP architecture: customer-facing Cliff (24/25 tests passing) and Cliff Analytics (91%+ on hard questions, 56-test suite).
CRM with two-way SMS, lead tracking, follow-up engine; agentic Klaviyo flow generator; multi-source data pipeline + executive dashboards.

ML systems for trust and safety at billions-of-users scale.

10M MAU lift · 20% abuse reduction · Billions of users

Python · Deep learning · Behavioral ML · Large-scale recommenders

Shipped a deep learning classifier, developed through rigorous offline experimentation, that detects coordinated automated behavior (scripted friending) and cut it by 10%.
Diagnosed impersonation attack vectors targeting vulnerable users; built detection + a complementary mitigation that hid friend lists from risky accounts — coupling threat modeling, detection, and deployed defense in one loop.
Restructured the integrity-filtering layer of the recommender system in a change that lifted Facebook MAU by 10M.

See About for background, full career timeline, and how to get in touch.