Hi, I'm Steel.
I build AI agents and the evaluation systems that keep them trustworthy at scale.
I'm most interested in AI safety, evaluation methodology, and the production end of the agentic stack.
Now: building AI agent memory and eval systems at Shopify and Basin.
Selected work
Shopify — Shop.com AI agent + eval framework
Personalization and conversational AI for tens of millions of users.
$4M projected annual savings · 30M+ users · Multimodal: in flight
Python · LLM-as-judge · DSPy/GEPA · Qwen 4B · Distillation
- Built a two-step agentic flow generating personalized conversation starters referencing product imagery, with a custom LLM-as-judge eval pipeline gating rollouts.
- Distilled the production LLM into Qwen 4B (open weights), shipping >$4M in projected annual savings; work is under way to process product imagery directly (vision integration).
- Designed the offline eval framework that gates agent rollouts: automated LLM judges, benchmark suites, observability for agent reasoning loops and tool selection.
Basin Climbing — solo-engineered platform
Co-founder/CTO at a real Waco, TX climbing gym. I'm the only engineer; I built and run the entire stack.
~14,000 customers · iOS app live on App Store · Two production AI agents
React Native + Expo · FastAPI + Heroku · Neon Postgres · Supabase · S3 · Anthropic
- Member iOS app — zero to App Store in 5 weeks. Sign-in via OTP, multi-member households, rewards with QR redemption, embedded Cliff chat.
- Two production AI agents on a multi-agent / MCP architecture: customer-facing Cliff (24/25 tests passing) and Cliff Analytics (91%+ on hard questions, 56-test suite).
- CRM with two-way SMS, lead tracking, follow-up engine; agentic Klaviyo flow generator; multi-source data pipeline + executive dashboards.
App Store · Basin app showcase · Cliff (customer) showcase · Cliff Analytics showcase
Meta — Facebook Integrity
ML systems for trust and safety at billions-of-users scale.
10M MAU lift · 20% abuse reduction · Billions of users
Python · Deep learning · Behavioral ML · Large-scale recommenders
- Shipped a deep learning classifier detecting coordinated automated behavior (scripted friending), cutting it by 10% through rigorous offline experimentation.
- Diagnosed impersonation attack vectors targeting vulnerable users; built detection + a complementary mitigation that hid friend lists from risky accounts — coupling threat modeling, detection, and deployed defense in one loop.
- Restructured the integrity-filtering layer of the recommender system in a change that lifted Facebook MAU by 10M.
More about me
See About for background, full career timeline, and how to get in touch.