onsite
Cloud Operations Lead - SRE / DevOps / P... - Newvision Softcom & Consultancy Pvt. Ltd
Site Reliability Engineer
Lead SRE/DevOps engineer responsible for designing, implementing, and operating highly available cloud infrastructure on AWS, automating deployments with CI/CD pipelines, and ensuring reliability through Kubernetes orchestration and observability tools.
About the role
Key Responsibilities
- Design, build, and maintain scalable, secure AWS cloud environments supporting critical production workloads.
- Develop and manage end‑to‑end CI/CD pipelines to automate application delivery and infrastructure provisioning.
- Architect, deploy, and operate Kubernetes clusters, ensuring high availability and performance.
- Implement comprehensive monitoring, alerting, and logging with Datadog and Prometheus to support proactive incident response.
- Lead production support activities, perform root‑cause analysis, and drive continuous improvement of reliability processes.
- Mentor junior engineers and collaborate with cross‑functional teams to align operational practices with business goals.
Requirements
- 9–12 years of hands‑on experience in Linux system administration and AWS cloud services.
- Proven expertise in Kubernetes orchestration and container lifecycle management.
- Strong background in building CI/CD pipelines with tools such as Jenkins, GitLab CI, or similar.
- Deep knowledge of observability platforms, specifically Datadog and Prometheus, for metrics, tracing, and alerting.
- Demonstrated ability to lead production support, perform incident management, and drive reliability improvements.
Skills
Linux, AWS, Kubernetes, Datadog, Prometheus