About Meshy
Headquartered in Silicon Valley, Meshy is the leading 3D generative AI company on a mission to Unleash 3D Creativity by transforming the content creation pipeline. Meshy makes it effortless for both professional artists and hobbyists to create unique 3D assets, turning text and images into stunning 3D models in just minutes. Our world-class team of experts in computer graphics, AI, and art includes alumni from MIT, Stanford, and Berkeley, as well as veterans from Nvidia and Microsoft. Meshy is trusted by top developers, backed by premier venture capital firms like Sequoia and GGV, and has raised $52 million in funding. Meshy is the market leader, recognized as No. 1 in popularity among 3D AI tools and No. 1 in website traffic. The platform has over 5 million users and has generated 40 million models.
About the Role
- This role sits at the intersection of platform engineering, site reliability, and applied ML systems. The function owns the reliability, scalability, and operability of Meshy's AI model serving stack, along with core engineering infrastructure. The team operates conventional production infrastructure (CI/CD, build systems, deployment, runtime environments) and develops a model-serving platform that connects the models built by our Research Team to product-facing backend systems. The position is systems-heavy, production-oriented, and focused on turning experimental model artifacts into robust, observable, and cost-efficient services.
Job Responsibilities
- 0-1 to 1-N Infrastructure: Build and scale AI inference infrastructure from the ground up, including inference serving, scheduling, orchestration, and auto-scaling.
- Resource Management: Design and optimize CPU/GPU resource management systems to maximize utilization and cost-efficiency.
- GPU Virtualization: Drive the production-level implementation of GPU scheduling and multiplexing technologies (MIG, MPS, Virtualization).
- Performance Optimization: Optimize the inference pipeline (throughput, latency, and stability) to support high-concurrency and complex business scenarios.
- System Reliability & Governance: Contribute to system stability, disaster recovery design, and cloud cost management/governance.
- AI-Native Evolution: Explore AI-native infrastructure and automated O&M (Operations & Maintenance) to support rapid business growth.
Qualifications
- Experience: 3–5 years of experience in Backend or Infrastructure engineering (Cloud-native or AI platform experience is highly preferred).
- Technical Proficiency: Strong mastery of Go or Python with solid engineering foundations and clean coding habits.
- Core Fundamentals: Deep understanding of Linux internals, networking, and distributed systems.
- Cloud Native: Proven hands-on experience with Kubernetes, Docker, and microservices architecture.
- Domain Expertise: Prior project experience in inference systems, task scheduling, or resource management.
- Soft Skills: Highly self-driven, quick learner, and eager to take on significant ownership in a fast-paced startup environment.
Nice to Have
- Experience with GPU platforms or customizing Kubernetes schedulers.
- Familiarity with Ray, model serving frameworks, or distributed inference architectures.
- Practical experience with GPU multiplexing technologies such as MIG, MPS, or vGPU.
- Engineering background in SRE, observability, or cloud cost optimization.
- Active contributions to open-source projects or significant technical influence in the community.