AR DRIVER
Realtime 3DOF in-car head tracking for an AR driving experience. Collaboration with the Mercedes Carlsbad Design Studio.
An in-car AR system for a driver wearing XReal AR glasses, registering virtual content to the world as the head moves. The interesting part is the sensor stack — no single tracker is robust enough in a moving car under varied lighting, so head pose is fused from a Zed stereo camera, an inertial unit, and MediaPipe face landmarks simultaneously, then handed to Unity to render the AR scene on-glasses.
Stack
- XReal AR glasses
  - Display surface: birdbath-optic AR glasses that presented the Unity render to the driver. Tethered to a host machine; the glasses are dumb optics, the work happens upstream.
- Stereolabs Zed camera
  - Stereo vision for head tracking and depth, mounted in-cabin facing the driver. Gave translation and rotation at ~60 Hz with low jitter but high latency relative to the IMU.
- IMU / accelerometers
  - Low-latency rotational deltas at hundreds of Hz. Filled the gap between Zed frames and absorbed the perceptual latency that would otherwise show as AR content sliding behind head motion.
- MediaPipe
  - Face landmarks as a third sanity-check signal on head pose and a fallback when the Zed lost lock under glare or rapid motion.
- Unity
  - Scene composition. The fused 3DOF pose drove the camera; AR content was registered against world-anchored points (windshield-fixed and cabin-fixed) so visual elements stayed locked while the driver moved their head.
- Sensor fusion (complementary filter)
  - IMU high-pass + Zed low-pass for rotation, with MediaPipe as a redundancy check. Drift correction triggered when the slow signal disagreed with the fast one beyond a threshold; sketched below.
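A minimal sketch of that filter, in Python rather than the project's Unity-side code. The blend constant, the snap threshold, and per-axis yaw/pitch/roll blending are illustrative assumptions, not the project's actual values.

```python
import numpy as np

def wrap_deg(a):
    """Wrap angles/differences into [-180, 180) so blending behaves across the seam."""
    return (np.asarray(a, dtype=float) + 180.0) % 360.0 - 180.0

def fuse_rotation(prev, gyro_rate, zed_abs, dt, alpha=0.98, snap_deg=5.0):
    """One complementary-filter step on a [yaw, pitch, roll] pose in degrees.

    prev      -- previous fused pose (deg)
    gyro_rate -- IMU angular rate (deg/s): the fast, high-pass path
    zed_abs   -- latest absolute Zed rotation (deg): the slow, low-pass path
    alpha     -- weight on the integrated IMU path (near 1 = trust the IMU short-term)
    snap_deg  -- disagreement threshold that triggers a hard drift correction
    """
    integrated = wrap_deg(np.asarray(prev, dtype=float) + np.asarray(gyro_rate, dtype=float) * dt)
    err = wrap_deg(integrated - np.asarray(zed_abs, dtype=float))  # fast vs slow disagreement
    if np.max(np.abs(err)) > snap_deg:
        return wrap_deg(zed_abs)            # too far apart: re-anchor on the absolute sensor
    return wrap_deg(np.asarray(zed_abs) + alpha * err)  # low-pass Zed + high-pass IMU

# Example: a 400 Hz IMU update arriving between 60 Hz Zed frames.
pose = np.zeros(3)
pose = fuse_rotation(pose, gyro_rate=[12.0, 0.0, 0.0], zed_abs=[0.1, 0.0, 0.0], dt=1 / 400)
print(pose)
```

A quaternion version with spherical interpolation is the usual upgrade if pitch and roll get large; per-axis blending is enough here to show the high-pass/low-pass split.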
Process
3DOF, not 6DOF, on purpose. The brief was driver-facing AR — yaw, pitch, roll only; translation was constrained by the seat. 6DOF was technically achievable from the Zed but added latency and failure modes we didn't need. Cutting to 3DOF let the IMU carry the fast path and simplified everything downstream.
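To make that simplification concrete: with translation pinned to the seat, registering a cabin-fixed point needs only the fused rotation plus a constant head-position offset measured once. A sketch under assumed axes (x forward, z up) with illustrative numbers; none of these names or conventions come from the project.

```python
import numpy as np

def head_rotation(yaw, pitch, roll):
    """Head orientation from yaw/pitch/roll in degrees, applied Z (yaw), then Y (pitch),
    then X (roll). The real axis order depends on how the IMU and Zed were mounted."""
    y, p, r = np.radians([yaw, pitch, roll])
    Rz = np.array([[np.cos(y), -np.sin(y), 0.0], [np.sin(y), np.cos(y), 0.0], [0.0, 0.0, 1.0]])
    Ry = np.array([[np.cos(p), 0.0, np.sin(p)], [0.0, 1.0, 0.0], [-np.sin(p), 0.0, np.cos(p)]])
    Rx = np.array([[1.0, 0.0, 0.0], [0.0, np.cos(r), -np.sin(r)], [0.0, np.sin(r), np.cos(r)]])
    return Rz @ Ry @ Rx

def cabin_point_in_view(anchor_cabin, head_pos_cabin, yaw, pitch, roll):
    """Cabin-fixed anchor -> glasses view space. Under 3DOF the head position is a constant
    measured once for the seat, not a tracked quantity, so only rotation varies per frame."""
    offset = np.asarray(anchor_cabin, dtype=float) - np.asarray(head_pos_cabin, dtype=float)
    return head_rotation(yaw, pitch, roll).T @ offset  # undo the head rotation

# A windshield-fixed marker ~0.8 m ahead of the driver's eyes (illustrative numbers).
print(cabin_point_in_view([0.8, 0.0, 1.2], [0.0, 0.0, 1.2], yaw=15.0, pitch=0.0, roll=0.0))
```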
Why three sensors instead of one. Any single source failed somewhere. The Zed gave great absolute pose but had ~50 ms end-to-end latency — perceptible as AR content "swimming" when the driver turned their head fast. The IMU was fast but drifted within seconds. MediaPipe was good for landmark-level rotation but occasionally jumped frames or lost the face entirely under glare from the windshield. Fusion was the only path to "feels locked": IMU drove the high-frequency response, Zed corrected drift, MediaPipe acted as a tiebreaker / fallback. A complementary filter on rotation was enough — no full Kalman implementation needed for the 3DOF case.
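One plausible way to express the tiebreaker/fallback role, sketched as a per-frame choice of which absolute rotation corrects the fast IMU path. The thresholds and the exact policy are assumptions for illustration, not the project's logic.

```python
import numpy as np

def pick_correction(zed, mediapipe, imu_integrated, agree_deg=8.0):
    """Choose which absolute rotation (if any) corrects the fast IMU path this frame.

    zed, mediapipe -- absolute [yaw, pitch, roll] estimates in degrees, or None when
                      that tracker has no lock; imu_integrated -- current fast-path pose.
    Returns (correction_or_None, label) so the choice can be logged.
    """
    def far(a, b):
        return np.max(np.abs(np.asarray(a, dtype=float) - np.asarray(b, dtype=float))) > agree_deg

    if zed is not None and mediapipe is not None:
        # Both trackers up: if the Zed pose suddenly jumps away from the IMU but the
        # landmark pose does not, treat the Zed frame as an outlier and coast on the IMU.
        if far(zed, imu_integrated) and not far(mediapipe, imu_integrated):
            return None, "zed outlier, coasting on IMU"
        return zed, "zed"
    if zed is not None:
        return zed, "zed"
    if mediapipe is not None:
        return mediapipe, "mediapipe fallback"
    return None, "imu only"

# Typical frame: all three agree, so the Zed stays the drift-correction source.
print(pick_correction([10.0, 1.0, 0.0], [9.5, 0.8, 0.2], [9.8, 1.1, 0.1]))
```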
The hard part was calibration, not fusion. Every sensor reported in a different reference frame: the Zed in its own camera origin, the IMU in glasses-local axes, MediaPipe in image space. Aligning them required a one-shot calibration where the driver looked at three known points and the system solved for the rotation between frames. Once calibrated for that vehicle + glasses combination, the fused pose stayed consistent for the session.
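That three-point alignment can be posed as Wahba's problem: paired direction vectors in two frames, best-fit rotation via SVD (the Kabsch construction). The write-up doesn't say which solver was used, so treat this as one reasonable way to do the solve, with illustrative data.

```python
import numpy as np

def solve_frame_rotation(dirs_sensor, dirs_reference):
    """Best-fit rotation R with R @ sensor_dir ~= reference_dir for each paired sighting
    (Wahba's problem, solved with the Kabsch/SVD construction). Inputs are (N, 3) arrays
    of unit vectors toward the same N known points, N >= 3 and not coplanar."""
    B = np.asarray(dirs_reference, dtype=float).T @ np.asarray(dirs_sensor, dtype=float)
    U, _, Vt = np.linalg.svd(B)
    d = np.sign(np.linalg.det(U @ Vt))  # guard against a reflection solution
    return U @ np.diag([1.0, 1.0, d]) @ Vt

# Self-check with a known 20-degree yaw offset between the two frames.
theta = np.radians(20.0)
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0, 0.0, 1.0]])
sightings = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.5, 0.5, 0.7]])
sightings /= np.linalg.norm(sightings, axis=1, keepdims=True)  # unit direction vectors
R_est = solve_frame_rotation(sightings, sightings @ R_true.T)  # sensor dirs, reference dirs
print(np.allclose(R_est, R_true))  # True
```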
Carlsbad Design Studio collaboration. The Mercedes Carlsbad Design Studio drove the experience and content side — what should AR mean for a driver, what reads through tinted glass, what UI doesn't compete with the road. The engineering side was the head-tracking pipeline that made any of those experiences feel locked-in rather than approximate.
What it shipped to. Internal prototyping rig — the kind of thing that goes into the design studio's stack of "experiences we've validated are achievable" rather than out to a vehicle program. The tracking pipeline is the part that transfers; the AR content is studio-specific.