This is a remote position.
The company builds an OnchainGPT for autonomous trading: an AI-powered mentor, advisor, and companion that helps users explore the onchain world, find alpha, auto-trade, and outsmart the market.
They are looking for a Staff Software Engineer (Backend and AI Infra) who will own two critical workstreams:
- the agent runtime and backend infrastructure that powers every trade in their fleet,
- and the migration of model hosting and agent deployment in-house, moving the company off third-party LLM providers and hosted agent platforms onto its own infrastructure.
NOTE: This is NOT a pure DevOps role. It is a building role in which you will spend 80% of your time writing Go, Python, and TypeScript that ships to production.
You’re writing the backend services, runtime engine, and deployment systems that the entire company's agent fleet runs on.
When you ship, every agent in the fleet immediately gets faster, more reliable, and more autonomous.
You are also in charge of operating the very infrastructure you build: as its designer, you are the best person to run it.
You’re building the backend for autonomous AI agents that manage real money in real time.
The runtime you build determines whether positions are protected.
The model hosting you stand up determines whether agents can think.
The deployment pipeline you create determines whether the fleet can evolve.
This is foundational infrastructure for a new category of software.
MISSIONS
Agent Runtime & Backend (+/- 50%)
The runtime is the engine that makes every agent work. You’ll own the core systems:
- Plugin Runtime: the per-agent process that runs position tracking (10s polling), the RatchetStop exit engine (tiered trailing stops with sub-second evaluation), and DSL state management. Currently Go + Python; migrating to a centralized Go service with Postgres state and real-time websocket price feeds
- Scanner Gateway / Rules Engine: a YAML-configurable evaluation layer that sits between scanners and execution. Scanners produce raw signal variables; the rules engine applies gates, scoring, and filters defined in YAML. Users customize trading behavior without touching Python. This is the next major runtime feature
- RatchetStop Backend: centralized profit-trailing service that protects positions even when the agent is offline. Evaluates tier upgrades and places stop-loss orders on Hyperliquid via websocket, replacing per-agent polling with condition-based evaluation across all positions
- Execution Layer: the MCP (Model Context Protocol) server that bridges agents to the company's 48+ platform tools: position creation, clearinghouse state, market data, Smart Money intelligence. You’ll own auth, rate limiting, and the contract between agents and the exchange
- Data Layer: enriched Hyperfeed pipeline (top 1K trader positions, momentum events, market concentration) flowing through Redis, Postgres, and ClickHouse. Real-time ingestion, 4-hour rolling windows, and the APIs that every scanner calls
Model & Agent Hosting Migration (+/- 30%)
The company is moving off third-party hosted agents and external LLM inference onto infrastructure it owns. You’ll lead the technical execution:
- Agent deployment platform: migrate agents from Railway/OpenClaw to self-hosted infrastructure. Each agent needs an isolated workspace, cron scheduling, state persistence, MCP connectivity, and Telegram notifications. Target: deploy any skill from a GitHub repo with one command
- Model hosting: evaluate and implement the path from external LLM APIs (Anthropic, Google) to self-hosted inference. Options range from proxied external models with full telemetry capture to fine-tuned models running on company-owned GPUs. You’ll own the decision and execution
- Agent telemetry: capture every scanner evaluation, every trade decision, and every signal score across all agents. This data feeds the self-reinforcing loop: agents learn from fleet-wide performance, fork winning strategies, and improve autonomously
- Deployment pipeline: CI/CD for shipping scanner updates, runtime patches, and skill configs to 50+ live agents without interrupting open positions. Rollouts must be zero-downtime, because downtime means unprotected capital
Infrastructure & Operations (+/- 20%)
- Build monitoring and alerting that catches agent failures, orphaned positions, state corruption, and auth expiration before they cost money
- Manage cloud infrastructure (AWS/EKS) with infrastructure-as-code
- Own incident response: in a trading system, every minute of downtime puts real dollars at risk
- Monitor the health of the agent fleet: which agents are scanning, which are stuck, which have the midnight rollover bug