About Xsolla
Xsolla is a global video game commerce company with a set of tools and services designed to help developers and publishers manage, market, and monetize their games worldwide. We focus on giving our partners the solutions they need to succeed in the competitive gaming industry.
About the Role
As a Principal AI/ML Engineer, Platform at Xsolla, you will be a key contributor to our infrastructure team, focusing on leveraging Artificial Intelligence and Machine Learning to enhance the reliability, efficiency, and security of our platform across multi-cloud environments, with a strong emphasis on GCP. You will design, implement, and maintain advanced AI/ML solutions that drive our infrastructure automation strategy, from predictive autoscaling to intelligent incident response and generative AI applications for infrastructure as code.
Responsibilities
- Design and implement AI/ML-powered solutions for infrastructure use cases, including predictive autoscaling, anomaly detection, intelligent cost optimization, and automated remediation across GCP and multi-cloud environments.
- Build and maintain AI-driven monitoring and observability systems that correlate logs, metrics, and traces to surface root causes, predict bottlenecks, and reduce mean time to resolution (MTTR).
- Develop and operate automated incident response workflows using AI-powered playbooks that diagnose, contain, and resolve infrastructure issues with minimal manual intervention.
- Integrate AI tooling into CI/CD pipelines to improve deployment reliability, automate test prediction, score release health, and support rollback automation.
- Contribute to the development of internal AI agents and virtual assistants integrated into developer workflows (Slack, IDEs, Confluence) that enable self-service for provisioning, troubleshooting, and infrastructure guidance.
- Implement AI/ML-based anomaly detection and automated vulnerability management workflows to enhance the security posture of Xsolla's infrastructure.
- Prototype and productionize Generative AI solutions for infrastructure automation, including auto-generation of Terraform/Puppet modules, IaC configurations, runbooks, and change documentation.
- Collaborate with senior engineers and leadership to evolve and execute the infrastructure AI strategy across its implementation phases.
- Maintain clear documentation of AI tools, integrations, and automated workflows; share knowledge and best practices across the team.
Requirements
- Proven experience designing, developing, and deploying AI/ML solutions in production, specifically for infrastructure or platform engineering.
- Strong expertise in cloud platforms, particularly GCP, and experience with multi-cloud environments.
- Solid understanding of MLOps principles and practices.
- Experience with containerization and orchestration technologies (e.g., Kubernetes).
- Proficiency in programming languages commonly used in AI/ML and infrastructure (e.g., Python, Go).
- Familiarity with infrastructure as code tools (e.g., Terraform, Puppet).
- Excellent problem-solving skills and the ability to work both independently and as part of a team.
- Strong communication and collaboration skills.