Remote

AI Infrastructure & Experience Engineer - OSI Engineering, Inc.

Software Engineer

Lead the deployment and optimization of large language and multimodal models on local inference hardware, leveraging CUDA, PyTorch, and TensorRT to achieve low latency and high throughput while implementing custom kernels and quantization strategies.

About the role

Key Responsibilities

Deploy and fine‑tune multiple large language models (LLMs) and generative multimodal models on local inference hardware.
Improve time‑to‑first‑token (TTFT) and tokens‑per‑second throughput through model quantization, caching, and architecture‑specific tuning.
Develop and maintain custom CUDA kernels to maximize GPU utilization and reduce inference latency.
Collaborate with research and product teams to integrate new model architectures and evaluate their impact on user experience.
Monitor and troubleshoot production inference pipelines, ensuring reliability and scalability.

Requirements

Strong experience with Python, PyTorch, and CUDA programming.
Proficiency in model deployment tools such as TensorRT and ONNX Runtime.
Hands‑on knowledge of model quantization techniques and performance profiling.
Experience with large language models and multimodal AI systems.
Excellent problem‑solving skills and a passion for building high‑performance AI infrastructure.

Skills

Python, CUDA, PyTorch

Company: OSI Engineering, Inc.

Department: Engineering

Location: Mountain View, United States

Experience: 3+ years

Tenure: Full-time

Level: Mid-Level

Posted June 23, 2026