Onsite
IT Operations Engineer - OLIX
Systems Engineer
Lead the design, deployment, and maintenance of scalable, secure infrastructure for AI workloads, using Linux, AWS, automation, monitoring, and container orchestration to keep systems highly available and performant.
About the role
Key Responsibilities
- Design, implement, and manage cloud and on‑prem infrastructure for AI training and inference workloads.
- Automate provisioning and configuration using Ansible, Terraform, and CI/CD pipelines.
- Monitor system health with Prometheus, Grafana, and custom alerts; troubleshoot performance bottlenecks.
- Collaborate with hardware and software teams to optimize resource utilization and cost efficiency.
- Maintain security compliance, patch management, and disaster recovery procedures.
Requirements
- 3+ years of experience in Linux system administration and cloud operations.
- Proficiency with AWS services (EC2, S3, EKS, RDS) and container orchestration (Kubernetes).
- Hands‑on experience with automation tools (Ansible, Terraform) and CI/CD pipelines.
- Strong scripting skills in Bash or Python.
- Excellent problem‑solving, communication, and teamwork abilities.
Skills
Linux, AWS, Ansible, Prometheus, Kubernetes