Machine Learning Engineer - Multi-Modality Foundation Model - Zoox

Job Title
Machine Learning Engineer - Multi-Modality Foundation Model
Job Location
Foster City, CA
Job Description
The Perception team is pioneering the development of a multi-modality foundation model to drive the next generation of autonomous system intelligence. As a Multi-modality Foundation Model Engineer, you will focus on building highly efficient, production-ready multi-modality models. We are looking for experts who have hands-on experience building multi-modality foundation models—whether that involves AV-centric modalities (Vision, LiDAR, Radar) or broader domains (Vision, Language, Text, Audio). You will design, train, and deploy these models using Knowledge Distillation (KD) to transfer capabilities from large-scale proprietary teacher models to efficient student models capable of real-time, on-vehicle inference.
In this role, you will:
  • Build, pre-train, and evaluate large-scale multi-modality foundation models from the ground up, successfully aligning diverse data streams (e.g., Vision, LiDAR, Radar, Language, Audio).

  • Define and execute the ML roadmap for deploying these multi-modality representations to the vehicle.

  • Architect and implement Knowledge Distillation pipelines to compress large-capacity multi-modal teacher models into highly efficient, production-ready student models.

  • Build high-quality training and evaluation datasets, applying advanced data-centric techniques to maximize cross-modal representation learning and student model convergence.

  • Collaborate with downstream perception teams to integrate and validate the performance, robustness, and latency of your models in on-board production systems.

Qualifications:
  • MS or PhD in Computer Science, Machine Learning, or a related technical field with demonstrated professional experience.

  • Deep, proven expertise in building and training large-scale multi-modality foundation models (e.g., Vision-Language Models (VLMs), Vision-Audio-Text, or Vision-LiDAR-Radar architectures).

  • Strong understanding of cross-modal alignment, multi-modal attention mechanisms, and large-scale pre-training techniques.

  • Proven experience in Knowledge Distillation (KD), model compression, and training highly efficient student models for production environments.

  • Proficiency in ML frameworks (e.g., PyTorch) and experience building large-scale ML training and evaluation pipelines.

Bonus Qualifications:
  • Experience in the Autonomous Driving or robotics industry.

  • Experience with model deployment, optimization, and hardware constraints (e.g., C++ for inference, TensorRT, quantization, pruning).

  • Publications in top-tier conferences (CVPR, ICCV, NeurIPS, ICLR, ACL) related to multi-modality foundation models, cross-modal learning, or model compression.

Zoox Headquarters Location
Foster City, CA
Zoox Company Size
Between 2,000 and 5,000 employees
Zoox Founded Year
2014
Zoox Total Amount Raised
$1,005,000,000
Zoox Funding Rounds
  • Convertible Note: $200,000,000 USD
  • Series B: $465,000,000 USD
  • Series A: $50,000,000 USD
  • Series A: $250,000,000 USD
  • Seed: $40,000,000 USD