Senior AI Inference Engineer - Model Optimization & Deployment - Zoox

Job Title
Senior AI Inference Engineer - Model Optimization & Deployment
Job Location
Foster City, CA
Job Description
The Perception team is pioneering the development of a multi-modality foundation model to drive the next generation of autonomous system intelligence.

As a Model Optimization & Deployment Engineer, you will focus on bringing highly efficient, production-ready large-scale models to our on-vehicle stack. We are looking for experts with hands-on experience in compressing, accelerating, and deploying complex models (LLMs, VLMs, or FMs) on power- and thermal-constrained vehicle SoCs. You will optimize ML models, write custom CUDA kernels, and build highly concurrent inference code to ensure real-time, deterministic execution on edge devices.
In this role, you will:
  • Optimize large-scale models (Multi-Modal Sensor Fusion models, LLMs, VLMs) using advanced quantization (PTQ, QAT), pruning, mixed-precision inference frameworks, and parameter-efficient fine-tuning (LoRA, QLoRA).
  • Architect and implement model conversion and compilation pipelines using TensorRT for edge deployment.
  • Perform rigorous parity checking, accuracy recovery, and latency benchmarking between PyTorch reference models and compiled edge binaries.
  • Develop and optimize custom ML ops and TensorRT plugins with efficient CUDA kernels to minimize latency and maximize memory-bandwidth utilization on AI accelerators.
  • Write production-level, low-latency, memory-safe C++ and CUDA code for real-time inference on vehicle systems.
Qualifications:
  • Deep expertise in model quantization (PTQ, QAT) and mixed-precision inference frameworks (INT8, FP8, FP4, BF16/FP16).
  • Proven experience optimizing large-scale models (Multi-Modal Sensor Fusion models, LLMs, VLMs/VLAs) using efficient attention mechanisms (e.g., FlashAttention, Linear Attention), KV-cache optimization (e.g., PagedAttention), and speculative decoding.
  • Extensive experience with model conversion/compilation pipelines (e.g., ONNX, TensorRT, torch.compile) and with rigorous latency benchmarking and model-quality parity evaluation.
  • Proficiency in low-level programming for AI accelerators, specifically developing and optimizing custom ML ops and TensorRT plugins with efficient CUDA kernel implementations.
  • Production-level C++ (14/17/20) and Python programming skills, with experience developing concurrent, memory-safe, real-time inference code for edge devices.
Bonus Qualifications:
  • Familiarity with SOTA autonomous driving perception algorithms (temporal 3D object detection, BEV, 3D Occupancy Networks) and multi-modal sensor processing (Vision, LiDAR, Radar).
  • Experience with distributed training pipelines and model/tensor parallelism (PyTorch Distributed, Ray, DeepSpeed, Megatron-LM) and runtime efficiency optimization for GPU clusters.
  • Experience with end-to-end autonomous driving paradigms (VLM/VLA models, Foundation models) and edge deployment technologies (e.g., TensorRT-LLM).
Zoox Headquarters Location
Foster City, CA
Zoox Company Size
Between 2,000 and 5,000 employees
Zoox Founded Year
2014
Zoox Total Amount Raised
$1,005,000,000
Zoox Funding Rounds
  • Convertible Note: $200,000,000 USD
  • Series B: $465,000,000 USD
  • Series A: $50,000,000 USD
  • Series A: $250,000,000 USD
  • Seed: $40,000,000 USD