Location: Remote - EST timezone
Remote | Full-time
Compensation: $100K - $130K
We are hiring on behalf of a client that is building infrastructure to launch a highly distributed fleet of managed, single-tenant personal AI trading agents. Each agent runs continuously as an isolated process dedicated to a single user's portfolio, executing high-frequency, complex financial workflows natively on blockchain infrastructure. The client is seeking an exceptional, production-proven Infrastructure & DevOps Engineer to take full ownership of the deployment, secure networking, architectural lifecycle, and overall reliability of this agent fleet from day one.
Key Responsibilities
- Fleet Orchestration & Scaling: Architect, provision, and scale the core user agent fleet across a hybrid Railway and AWS ecosystem, ensuring each user retains an isolated, secure, and predictable containerized process with optimized cost tracking and precise lifecycle hooks.
- Secure Network Engineering: Establish, manage, and continuously harden private overlay networks using Tailscale in production, linking disparate user agents securely with core Model Context Protocol (MCP) servers and the underlying live trading runtimes.
- Automated User Provisioning: Design and build an end-to-end, zero-touch deployment pipeline using infrastructure-as-code and CI/CD best practices, enabling single-click provisioning of containers, secrets, and environment configuration for new users.
- Operational Resilience & SRE: Define, build, and maintain monitoring, telemetry, alerting, and automated incident response so that live in-flight transaction state is preserved across sudden host restarts, scheduled key rotations, and regional cloud outages.
- Incident Management: Oversee system health and participate directly in live incident response and on-call rotations to maintain operational continuity for the global fleet.
Requirements
- Container PaaS Orchestration: Proven professional experience deploying, monitoring, and scaling complex production architectures on Railway or an equivalent container platform-as-a-service (such as Fly.io, Render, or Northflank).
- Advanced AWS Proficiency: In-depth technical mastery of Amazon Web Services (AWS), with practical expertise spanning Virtual Private Clouds (VPC), Identity & Access Management (IAM), Secrets Manager, and elastic compute services (ECS, AWS Lambda).
- Production-Grade Tailscale Networking: Demonstrated experience implementing Tailscale within a high-security production environment, with distinct competence configuring Access Control Lists (ACLs), complex s