onsite
Principal AI/ML Engineer, Platform Engineering
The Principal AI/ML Engineer will design and implement AI/ML-powered solutions to optimize Xsolla's infrastructure, enhance security, and improve developer productivity. This role involves developing predictive autoscaling, anomaly detection, and automated remediation systems, as well as integrating AI into CI/CD pipelines and creating Generative AI solutions for infrastructure automation across GCP and multi-cloud environments.
About the role
As a Principal AI/ML Engineer in Platform Engineering, you will be a key contributor to Xsolla's infrastructure innovation, designing and implementing advanced AI/ML solutions that optimize infrastructure operations, enhance security, and improve developer productivity across multi-cloud environments, with a strong focus on GCP.
Responsibilities
- Design and implement AI/ML-powered solutions for infrastructure use cases, including predictive autoscaling, anomaly detection, intelligent cost optimization, and automated remediation across GCP and multi-cloud environments.
- Build and maintain AI-driven monitoring and observability systems that correlate logs, metrics, and traces to surface root causes, predict bottlenecks, and reduce mean time to resolution (MTTR).
- Develop and operate automated incident response workflows using AI-powered playbooks that diagnose, contain, and resolve infrastructure issues with minimal manual intervention.
- Integrate AI tooling into CI/CD pipelines to improve deployment reliability, automate test prediction, score release health, and support rollback automation.
- Contribute to the development of internal AI agents and virtual assistants integrated into developer workflows (Slack, IDEs, Confluence) that enable self-service for provisioning, troubleshooting, and infrastructure guidance.
- Implement AI/ML-based anomaly detection and automated vulnerability management workflows to enhance the security posture of Xsolla's infrastructure.
- Prototype and productionize Generative AI solutions for infrastructure automation, including auto-generation of Terraform/Puppet modules, IaC configurations, runbooks, and change documentation.
- Collaborate with senior engineers and leadership to evolve and execute the infrastructure AI strategy across its implementation phases.
- Maintain clear documentation of AI tools, integrations, and automated workflows; share knowledge and best practices across the team.
Requirements
- 8+ years of experience in AI/ML engineering, with a strong focus on infrastructure and platform-related applications.
- Proven track record of designing, building, and deploying production-grade AI/ML systems.
- Extensive experience with cloud platforms, particularly GCP, and familiarity with multi-cloud environments.
- Proficiency in developing solutions for monitoring, observability, and incident response.
- Strong understanding of CI/CD principles and practices.
- Experience with Generative AI technologies and their application in infrastructure automation (e.g., auto-generation of Terraform/Puppet modules, IaC configurations).
- Excellent collaboration and communication skills.