onsite
Staff/Senior Machine Learning Engineer, Infrastructure AI
As a Staff/Senior Machine Learning Engineer, you will design and implement AI/ML solutions for infrastructure automation, including predictive autoscaling, anomaly detection, and intelligent cost optimization across multi-cloud environments. You will also build AI-driven monitoring systems, develop automated incident response workflows, and integrate AI into CI/CD pipelines to enhance reliability and security.
About the role
As a Staff/Senior Machine Learning Engineer focusing on Infrastructure AI at Xsolla, you will design and implement AI/ML solutions that optimize and automate our infrastructure. The role spans multiple cloud environments and involves integrating AI into critical infrastructure operations.
Responsibilities
- Design and implement AI/ML-powered solutions for infrastructure use cases, including predictive autoscaling, anomaly detection, intelligent cost optimization, and automated remediation across GCP and multi-cloud environments.
- Build and maintain AI-driven monitoring and observability systems that correlate logs, metrics, and traces to surface root causes, predict bottlenecks, and reduce mean time to resolution (MTTR).
- Develop and operate automated incident response workflows using AI-powered playbooks that diagnose, contain, and resolve infrastructure issues with minimal manual intervention.
- Integrate AI tooling into CI/CD pipelines to improve deployment reliability, automate test prediction, score release health, and support rollback automation.
- Contribute to the development of internal AI agents and virtual assistants integrated into developer workflows (Slack, IDEs, Confluence), enabling self-service for provisioning, troubleshooting, and infrastructure guidance.
- Implement AI/ML-based anomaly detection and automated vulnerability management workflows to enhance the security posture of Xsolla's infrastructure.
- Prototype and productionize Generative AI solutions for infrastructure automation, including auto-generation of Terraform/Puppet modules, IaC configurations, runbooks, and change documentation.
- Collaborate with senior engineers and leadership to evolve and execute the infrastructure AI strategy across its implementation phases.
- Maintain clear documentation of AI tools, integrations, and automated workflows; share knowledge and best practices across the team.