Remote
AI Infrastructure & Experience Engineer - OSI Engineering, Inc.
Software Engineer
About the role
Lead the deployment and optimization of large language models and multimodal models on local inference hardware. You will use CUDA, PyTorch, and TensorRT, together with custom kernels and quantization strategies, to achieve low latency and high throughput.
Key Responsibilities
- Deploy and fine‑tune multiple large language models (LLMs) and generative multimodal models on local inference hardware.
- Improve inference performance, measured by metrics such as time‑to‑first‑token (TTFT) and tokens per second, through model quantization, caching, and architecture‑specific tuning.
- Develop and maintain custom CUDA kernels to maximize GPU utilization and reduce inference latency.
- Collaborate with research and product teams to integrate new model architectures and evaluate their impact on user experience.
- Monitor and troubleshoot production inference pipelines, ensuring reliability and scalability.
Requirements
- Strong experience with Python, PyTorch, and CUDA programming.
- Proficiency in model deployment tools such as TensorRT and ONNX Runtime.
- Hands‑on knowledge of model quantization techniques and performance profiling.
- Experience with large language models and multimodal AI systems.
- Excellent problem‑solving skills and a passion for building high‑performance AI infrastructure.