onsite
Cloud Infrastructure Engineer (MLOps)
PayPay is hiring a Cloud Infrastructure Engineer with 5+ years of experience to build and maintain cloud infrastructure for AI/ML projects on AWS. This role involves developing, deploying, and optimizing ML models using AWS services, managing data, and collaborating with cross-functional teams to ensure scalable and efficient AI solutions.
About the Role
PayPay is looking for an experienced Cloud-Based AI and ML Engineer. The role involves using cloud-based AI/ML services to build infrastructure; developing, deploying, and maintaining ML models; collaborating with cross-functional teams; and ensuring scalable and efficient AI solutions, particularly on Amazon Web Services (AWS).
Main Responsibilities
- Cloud Infrastructure Management
  - Architect and maintain cloud infrastructure for AI/ML projects using AWS tools.
  - Implement best practices for security, cost management, and high availability.
  - Monitor and manage cloud resources to ensure seamless operation of ML services.
- Model Development and Deployment
  - Design, develop, and deploy machine learning models using AWS services such as SageMaker.
  - Collaborate with data scientists and data engineers to create scalable ML workflows.
  - Optimize models for performance and scalability on AWS infrastructure.
  - Implement CI/CD pipelines to streamline and accelerate model development and deployment.
  - Set up a cloud-based development environment where data engineers and data scientists can develop models and perform exploratory data analysis.
  - Implement monitoring, logging, and observability to streamline operations and efficiently manage models deployed in production.
- Data Management
  - Work with structured and unstructured data to train robust ML models.
  - Use AWS data storage and processing services such as S3, RDS, Redshift, or DynamoDB.
  - Ensure data integrity and compliance with established security regulations and standards.
- Collaboration and Communication
  - Collaborate with cross-functional teams, including DevOps, Data Engineering, and Product Management.
  - Communicate technical concepts effectively to non-technical stakeholders.
- Continuous Improvement and Innovation
  - Stay up to date with the latest advancements in AI/ML technologies and AWS services.
  - Provide automation that lets developers easily develop and deploy their AI/ML models on AWS.
Tech Stack
- AWS: VPC, EC2, ECS, EKS, Lambda, MWAA, RDS, ElastiCache, DynamoDB, OpenSearch, S3, CloudWatch, Cognito, SQS, KMS, Secrets Manager, MSK, Amazon Kinesis, CodeCommit, CodeBuild, CodeDeploy, CodePipeline, AWS Lake Formation, AWS Glue, SageMaker, and other AI services.
- Terraform, GitHub Actions, Prometheus, Grafana, Atlantis
- OSS (administration experience with these tools): Jupyter, MLflow, Argo Workflows, Airflow
Required Skills and Experiences
- 5+ years of technical experience in cloud-based infrastructure with a focus on AI and ML platforms.
- Extensive technical hands-on experience with computing, storage, and analytical services on AWS.
- Demonstrated skill in programming and scripting languages, including Python, Shell Scripting, Go, and Rust.
- Experience with infrastructure as code (IaC) tools for AWS, such as Terraform, CloudFormation, and CDK.
- Proficient in Linux internals and system administration.
- Experience in production-level infrastructure change management and releases for business-critical systems.
- Experience managing the availability, performance, and cost of cloud infrastructure and platform systems.
- Strong understanding of cloud security best practices and payment industry compliance standards.
- Experience with cloud services monitoring, detection, and response, as well as performance tuning and cost control.
- Familiarity with cloud infrastructure service patching and upgrades.
- Excellent oral, written, and interpersonal communication skills.
Preferred Qualifications
- Bachelor’s degree or higher in a technology-related field.
- Experience with other cloud service providers (e.g. GCP, Azure).
- Experience with Kubernetes.
- Experience with Event-Driven Architecture (Kafka preferred).
- Experience using and contributing to Open Source tools.
- Experience in managing IT compliance and security risk.
- Published papers, blog posts, or articles.
- Relevant and verifiable certifications.