On-site
Site Reliability Developer - Oracle
Software Engineer
Develop and automate infrastructure for large‑scale distributed systems, enhancing availability, scalability, and performance using Python, Go, Kubernetes, Terraform, and AWS. Collaborate with SRE teams to design resilient architectures and drive capacity planning and system tuning.
About the role
Key Responsibilities
- Design, develop, and deploy automation scripts and tools in Python and Go to improve system reliability and reduce incident recurrence.
- Build and maintain Kubernetes clusters, Terraform modules, and AWS infrastructure to support high‑availability services.
- Collaborate with SRE teams on full‑stack ownership, capacity planning, and demand forecasting for distributed services.
- Analyze software performance, conduct system tuning, and implement monitoring solutions using CloudWatch and custom metrics.
- Define and enforce architecture standards, best practices, and operational procedures across the organization.
Requirements
- 3+ years of experience in site reliability engineering or DevOps roles.
- Proficiency in Python, Go, Kubernetes, Terraform, and AWS services.
- Strong understanding of distributed systems, performance analysis, and capacity planning.
- Experience with monitoring, logging, and alerting tools (e.g., CloudWatch, Prometheus).
- Excellent problem‑solving skills and the ability to work collaboratively in a fast‑paced environment.
Skills
Python, Go, Kubernetes, Terraform, AWS