Steel Ferguson

Hi, I'm Steel.

I build AI agents and the evaluation systems that keep them trustworthy at scale.

I'm most interested in AI safety, evaluation methodology, and the production end of the agentic stack.

Now: building AI agent memory and eval systems at Shopify and Basin.

Selected work

Shopify — Shop.com AI agent + eval framework

Personalization and conversational AI for tens of millions of users.

$4M projected annual savings · 30M+ users · Multimodal: in flight
Python · LLM-as-judge · DSPy/GEPA · Qwen 4B · Distillation
  • Built a two-step agentic flow generating personalized conversation starters referencing product imagery, with a custom LLM-as-judge eval pipeline gating rollouts.
  • Distilled the production LLM into Qwen 4B (open weights), for more than $4M in projected annual savings; now extending the model to process product imagery directly (vision integration).
  • Designed the offline eval framework that gates agent rollouts: automated LLM judges, benchmark suites, observability for agent reasoning loops and tool selection.

Basin Climbing — solo-engineered platform

Co-founder/CTO of a real climbing gym in Waco, TX. I'm the only engineer; I built and run the entire stack.

~14,000 customers · iOS app live on App Store · Two production AI agents
React Native + Expo · FastAPI + Heroku · Neon Postgres · Supabase · S3 · Anthropic
  • Member iOS app — zero to App Store in 5 weeks. Sign-in via OTP, multi-member households, rewards with QR redemption, embedded Cliff chat.
  • Two production AI agents on a multi-agent / MCP architecture: customer-facing Cliff (24/25 tests passing) and Cliff Analytics (91%+ on hard questions, 56-test suite).
  • CRM with two-way SMS, lead tracking, follow-up engine; agentic Klaviyo flow generator; multi-source data pipeline + executive dashboards.

App Store · Basin app showcase · Cliff (customer) showcase · Cliff Analytics showcase

Meta — Facebook Integrity

ML systems for trust and safety at billions-of-users scale.

10M MAU lift · 20% abuse reduction · Billions of users
Python · Deep learning · Behavioral ML · Large-scale recommenders
  • Shipped a deep learning classifier detecting coordinated automated behavior (scripted friending), cutting it by 10% through rigorous offline experimentation.
  • Diagnosed impersonation attack vectors targeting vulnerable users; built detection + a complementary mitigation that hid friend lists from risky accounts — coupling threat modeling, detection, and deployed defense in one loop.
  • Restructured the integrity-filtering layer of the recommender system in a change that lifted Facebook MAU by 10M.

More about me

See About for background, full career timeline, and how to get in touch.