Google Unveils LiteRT, a New On-Device AI Inference Framework

Published on March 17, 2026 | Translated from Spanish

Google has announced LiteRT, a universal inference framework designed to overcome TensorFlow Lite's limitations with current models. Its goal is to standardize on-device AI execution, prioritizing speed and energy efficiency. LiteRT promises a unified workflow that automatically leverages specialized hardware such as NPUs, while maintaining compatibility with the .tflite format and offering direct support for PyTorch and JAX.

*Image: A smartphone with a glowing core, connected to an NPU chip and AI symbols, over a background of code and circuits.*

Technical Pillars and Extended Multiplatform Support 🤖

LiteRT rests on four pillars: higher inference speed, a unified acceleration flow, robust support for open generative models, and integration with popular frameworks. It extends GPU acceleration to iOS, macOS, Windows, Linux, and the Web, achieving, according to Google, 1.4 times the performance of its previous GPU delegate. To reduce real-world latency, it implements asynchronous execution and zero-copy techniques, minimizing the overhead of moving data between the CPU and accelerators.
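The zero-copy idea mentioned above is not specific to LiteRT, and the article does not show its buffer-sharing API; the general principle can be sketched in plain NumPy (a hedged, framework-agnostic illustration, with shapes chosen arbitrarily): instead of duplicating an incoming buffer for the accelerator, the runtime builds a tensor *view* over the same memory.

```python
import numpy as np

# A "camera frame" arriving as a raw byte buffer (e.g. from a sensor API):
# a float32 RGB frame, 224x224, initially all zeros.
raw = bytearray(4 * 224 * 224 * 3)

# Copying approach: duplicate the data before inference.
# This costs time and memory on every single frame.
copied = np.frombuffer(raw, dtype=np.float32).copy()

# Zero-copy approach: an ndarray view over the SAME memory.
# No bytes move; the tensor aliases the source buffer.
view = np.frombuffer(raw, dtype=np.float32).reshape(224, 224, 3)

# A write through the original buffer is visible in the view
# (shared memory), but not in the copy.
raw[0:4] = np.float32(1.0).tobytes()
print(view[0, 0, 0])  # 1.0 -- the view sees the update
print(copied[0])      # 0.0 -- the copy does not
```

In a runtime like LiteRT the same trick lets a GPU or NPU read input tensors directly from where the producer (camera, decoder) wrote them, which is where the latency savings come from.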

Goodbye to the excuse of "it runs slowly on my device" 😅

With LiteRT, the classic developer justification when a model crawls on a phone may have its days numbered. Now, if the app responds at a snail's pace, we can no longer comfortably blame the inference delegate. Google is taking away a well-loved scapegoat, forcing us to find new and creative excuses, like "the user has too many cats open in the background." The pressure is on.