About the Role
A talented MLOps Engineer will help operationalize machine learning models at scale. The ideal candidate will have a strong background in machine learning, software engineering, and DevOps practices, with expertise in deploying, monitoring, and maintaining ML models in production environments. The candidate should have worked with, or at least have a good understanding of, LangChain, LangGraph, LangSmith, grounding techniques, RAG, embeddings, and related areas for building GenAI-based solutions, along with Python and MLOps skills.
Experience
- 5-7 years of experience in MLOps, DevOps, or related fields.
Requirements
- Proficiency in Python and experience with ML frameworks such as TensorFlow, PyTorch, or Scikit-learn.
- Hands-on experience with cloud platforms (e.g., AWS, GCP, or Azure) and their ML services.
- Knowledge of containerization and orchestration tools (e.g., Docker, Kubernetes).
- Experience with CI/CD tools (e.g., GitHub Actions or Jenkins).
- Familiarity with monitoring tools for ML models (e.g., Dynatrace, Prometheus, Grafana, or MLflow).
- Strong understanding of version control for models and data (e.g., Git).
- Knowledge of scripting in Python and Unix shell (Bash).
- Good understanding of LangChain, LangGraph, LangSmith, grounding techniques, RAG, and embeddings for building GenAI-based solutions.
Roles & Responsibilities
- Strong communication and coordination skills, with a proactive approach.
- Self-driven, customer-centric, and innovative.
- Check deployment pipelines for machine learning models.
- Review code changes and pull requests from the data science team.
- Trigger CI/CD pipelines after code approvals.
- Monitor pipelines and ensure that all tests pass and that model artifacts are generated and stored correctly.
- Deploy updated models to production after pipeline completion.
- Work closely with the software engineering and DevOps teams to ensure smooth integration.
- Containerize models using Docker and deploy them on cloud platforms (e.g., AWS, GCP, or Azure).
- Set up monitoring tools to track metrics such as response time, error rates, and resource utilization.
- Establish alerts and notifications to quickly detect anomalies or deviations from expected behavior.
- Analyze monitoring data, log files, and system metrics.
- Collaborate with the data science team to develop updated pipelines that address any faults found.
- Document troubleshooting steps, changes, and optimizations.
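
As an illustration of the monitoring and alerting responsibilities listed above, the following is a minimal Python sketch of a threshold check over response time, error rate, and resource utilization. The metric names, limits, and the find_breaches helper are hypothetical and chosen only for this example; in practice this logic would more likely live in a monitoring tool's alert rules than in hand-written code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    """Upper bound for one metric. Names and limits are illustrative assumptions."""
    metric: str
    max_value: float


# Hypothetical limits; real values depend on the service's own objectives.
THRESHOLDS = [
    Threshold("p95_response_time_ms", 500.0),
    Threshold("error_rate", 0.02),
    Threshold("cpu_utilization", 0.85),
]


def find_breaches(sample: dict[str, float], thresholds: list[Threshold]) -> list[str]:
    """Return one alert message per metric that is missing or above its limit."""
    alerts = []
    for t in thresholds:
        value = sample.get(t.metric)
        if value is None:
            # A metric that stops reporting is treated as an anomaly too.
            alerts.append(f"{t.metric}: no data reported")
        elif value > t.max_value:
            alerts.append(f"{t.metric}: {value} exceeds limit {t.max_value}")
    return alerts


if __name__ == "__main__":
    # Example sample: slow responses, healthy error rate, CPU metric missing.
    sample = {"p95_response_time_ms": 620.0, "error_rate": 0.01}
    for message in find_breaches(sample, THRESHOLDS):
        print(message)
```

Treating a missing metric as an alert is a design choice made for this sketch, so that a silent exporter does not pass as a healthy one.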