About the Role
We are hiring an applied AI engineer to own the intelligence inside Centralize. The product's value depends on AI systems that map stakeholders, analyze deal health, and turn unstructured customer conversations into actions that drive revenue. You will own those systems end to end across the full AI stack: the multi-agent architectures and LLM pipelines, the classical ML and data science work that powers ranking, scoring, and entity resolution, and the eval and data infrastructure that makes all of it better over time.
This is a production engineering role with both an LLM lens and an ML/DS lens. Some problems at Centralize are best solved with a frontier model and a well-designed agent loop. Others are best solved with a classifier, an embedding model, a custom retriever, or a feature pipeline. You'll know which is which, and you'll build whichever one moves the metric.
This role is well-suited to engineers who have shipped LLM-powered products and trained or fine-tuned models in production, who think about evals and reliability before model selection, and who can move fluidly between prompt engineering, fine-tuning, and traditional ML when the problem demands it.
What You Will Do
- Design and ship multi-agent systems that handle the hardest reasoning problems in the product: stakeholder mapping, account research, deal health analysis, and conversation intelligence.
- Own the LLM pipelines end to end: prompt engineering, retrieval, tool use, structured outputs, guardrails, and the orchestration glue that ties it all together.
- Build and maintain the ML and DS systems for problems where LLMs aren't the right tool: ranking models, classifiers, embedding models, entity resolution across messy CRM data, and signal extraction from sales conversations.
- Fine-tune models when frontier APIs aren't enough. Curate training data, design eval sets, run experiments, and ship the results to production.
- Build the eval infrastructure that lets us ship AI features without breaking them: LLM-as-judge, human-in-the-loop review, classical metrics for ML systems, and regression suites. We grade on what works in production.
- Own the data flywheel. The product generates rich signal from customer conversations, deal outcomes, and stakeholder interactions. Turn that into training data, eval data, and the feedback loops that compound over time.
- Stay on the frontier. New models drop monthly. You'll know which ones move the needle for our use cases, when to switch, and when to wait.
- Talk to customers. Sit on calls, see what's actually broken, and translate that into the AI capabilities that matter.
What We Are Looking For
- Demonstrated experience shipping LLM-powered products to production with real customers and real evals. We can tell the difference between someone who's built demos and someone who's lived through the operational reality.
- Demonstrated experience training, fine-tuning, or shipping classical ML models in production: ranking, classification, embeddings, retrieval. You know when a 50ms classifier beats a $0.10 LLM call, and you know when it doesn't.
- Strong fluency with multi-agent systems, tool use, function calling, RAG, and the orchestration patterns that make them reliable. Frameworks are tools, not religion.
- Real expertise in evaluation across both LLM and ML systems. You think about evals before you think about prompts or features, because you've learned the hard way that you can't improve what you can't measure.
- Strong backend engineering fundamentals. Most of this work lives in production services, not notebooks. Python is required; familiarity with TypeScript, Postgres, queues, and AWS is a major plus.
- Sharp instinct for cost, latency, and reliability tradeoffs across the AI stack. You know when to reach for a frontier model, when to fine-tune a smaller one, and when to write a regex.
- Excellent written and verbal English communication. You can write a doc that explains a model's behavior to a non-technical PM and deliver a customer demo that closes a deal.
- Demonstrated ability to operate independently. We give you the goal, not the steps.
Preferred Qualifications
- Background as an MLE who has flexed into LLM application work, or as an LLM engineer with deep MLE foundations. The best candidates for this role are fluent in both worlds.
- Experience fine-tuning open or closed models for specific tasks, including data curation, training infrastructure, and post-training evaluation.
- Experience with multi-agent orchestration frameworks (LangGraph, Mastra, custom orchestrators) at production scale.
- Experience with classical ML systems in production: ranking models, embedding models, entity resolution, recommendation systems.
- Open-source contributions, technical writing, or conference presentations.