onsite

Principal Machine Learning Engineer - Agentic AI, GenAI/LLM, MCP, Python/Golang

Red Hat is seeking a Principal Machine Learning Engineer to lead the development and delivery of strategic AI agents and MCP Servers. This role involves end-to-end ownership of ML systems, including GenAI development, building and productionizing ML/NLP pipelines, and collaborating with cross-functional teams to integrate AI capabilities. The ideal candidate will have over 5 years of experience in NLP, strong Python skills, and expertise in various ML frameworks and Generative AI applications.

About The Job

The Data and AI team at Red Hat leads digital-first execution and transformation, leveraging data and AI strategically for our customers, partners, and associates. The engineering team is dedicated to building and delivering strategic AI agents and MCP servers, designed to augment human capabilities, accelerate business workflows, and scale operations across the enterprise.

As a Principal Machine Learning Engineer, you will take ownership of end-to-end ML systems, champion best practices, and deliver impactful, production-grade models. You will work autonomously, mentor others, and collaborate with data, engineering, and product teams to bring AI & agentic capabilities into production.

What You Will Do

GenAI Development: Lead the research and implementation of advanced algorithms and tools for NLP/GenAI tasks. Drive the development of next-generation AI/ML applications in a highly collaborative environment.
Solution Delivery: Contribute to the design, implementation, and delivery of AI platform capabilities & agentic solutions from concept to production.
Build ML/NLP Pipelines: Design, build, and evolve ML pipelines covering data ingestion, preprocessing, feature engineering, training, validation, deployment, and monitoring. Ensure successful training and evaluation of NLP models, refining them based on statistical analysis.
Productionize Models: Translate research prototypes and models into production-quality code, ensuring robustness, scalability, and maintainability.
Model Evaluation & Tuning: Select appropriate algorithms and modeling techniques, perform hyperparameter tuning, and conduct comparative experimentation. Evaluate and validate model performance using advanced metrics (e.g., ROC-AUC, precision/recall curves, calibration, fairness, drift) and set up continuous validation/regression checks.
Build & Operate Systems: Design, build, and evolve MCP servers and agents that empower Red Hatters to do business efficiently. Instrument models and systems with monitoring, logging, alerting, and automated healing or scaling mechanisms.
Troubleshoot & Support: Troubleshoot and resolve production incidents, root-cause errors, data drifts, performance regressions, or infrastructure issues.
Collaborate & Mentor: Collaborate with cross-functional teams, including finance, operations, sales, and marketing, to understand and meet business needs. Collaborate closely with software engineers, data engineers, and SRE/DevOps to integrate ML services into broader systems.
Set Standards: Mentor more junior engineers, lead code reviews, and help establish ML lifecycle and quality standards. Stay current with emerging ML research, frameworks, and tooling, and proactively propose improvements or experiments.

What You Will Bring

Education & Experience: Bachelor’s degree or above in Computer Science, Math, Computational Linguistics, Computer Engineering, or other related fields.
NLP Expertise: 5+ years of professional experience in NLP, with a strong command of Python and frameworks such as spaCy and Hugging Face.
ML Lifecycle Mastery: Proven expertise in designing and delivering NLP applications across all stages of the data science lifecycle.
GenAI & Frameworks: Deep understanding of machine learning frameworks and experience in Generative AI application development. This includes working knowledge of TensorFlow, TensorFlow Serving, Keras, and PyTorch, as well as experience with LLMs, embedding models, and vector databases.
Software Engineering Excellence: Exceptional software engineering skills that lead to an elegant and maintainable data platform. Proficiency in at least one general-purpose programming language (e.g., Python, Go, Java, Rust, etc.).
Experience with LangGraph, LangChain, AutoGen, and/or Python/Java-based AI libraries for GenAI applications.
Scalable Systems: Experience developing highly scalable backend microservices in AWS.
Enterprise Focus: Past experience in building enterprise data platforms that have a high level of governance and compliance requirements.
Collaborative Mindset: Comfortable working with a small team in a fast-paced, highly collaborative environment.
Communication & Business Acumen: Excellent communication, presentation, and writing skills. Experience working with cross-functional business and engineering teams, and the ability to analyze business needs.
Personal Drive: Motivated with a passion for quality, learning, and contributing to collective goals, with a bias for action.
User Empathy: Deep empathy for your platform's users, leading to a constant focus on removing friction, increasing adoption, and delivering business results.

Optional Bonus Skills

Familiarity with building and running MCP servers and agents.
Familiarity with working with LLMs.
Familiarity with open source or inner source development and processes.
Familiarity with data mesh architectural principles.
Experience with Snowflake, Fivetran, dbt, Airflow / Astronomer.
