Senior Network & Site Reliability Engineer - Alembic Pharmaceuticals Ltd
The Senior Network & Site Reliability Engineer builds and operates high‑availability, scalable infrastructure for an AI platform, using Kubernetes, Prometheus, Grafana, AWS, Docker, and CI/CD pipelines to keep services robust, secure, and performant.
About the role
Key Responsibilities
Design, implement, and maintain highly available network and infrastructure solutions across on‑prem and cloud environments.
Lead SRE initiatives: monitoring, alerting, incident response, and post‑mortem analysis using Prometheus, Grafana, and custom dashboards.
Automate deployment pipelines with Docker, Kubernetes, Helm, and CI/CD tools to accelerate feature delivery and reduce manual toil.
Collaborate with security, compliance, and DevOps teams to enforce best practices, harden systems, and manage access controls.
Drive capacity planning, performance tuning, and cost optimization for large‑scale AI workloads on AWS and private supercomputing resources.
Requirements
5+ years of experience in network engineering and site reliability roles.
Proficient with Kubernetes, Docker, Helm, and cloud platforms (AWS preferred).
Strong scripting skills (Python, Bash) and experience with CI/CD pipelines.
Hands‑on experience with monitoring and alerting tools such as Prometheus, Grafana, and the ELK stack.
Excellent problem‑solving, communication, and collaboration skills in a fast‑paced, high‑impact environment.