Have you ever opened a photo on your phone and thought: “I wish I could just ask AI what this is, without that photo going anywhere”? Or wished you had a smart assistant that still worked when your signal dropped somewhere mid-flight?
That’s the itch Google AI Edge Gallery is trying to scratch.
Released in early April 2026, the app lets you run real, capable AI models, including the newly launched Gemma 4, entirely on your phone, with no internet connection, no server costs, and no data leaving your device.
Think of it as Ollama or LM Studio (if you’ve heard of those), but built specifically for mobile hardware.
Since this post isn’t a hype piece (I’m not really into those), I’ll lay out the real limitations and the expectations to set before you download it hoping for a full ChatGPT replacement.
If you go in informed, there’s a lot here worth exploring, whether you’re a developer curious about where AI is heading or just someone who’d like a private, offline AI assistant on their phone.
By the end of this, you’ll know:
- What the app actually does
- How the tech under the hood makes it possible (in a digestible way)
- What each feature is good (and not so great) for
- How to get up and running without the usual setup frustration
So What Is Google AI Edge Gallery?
At its core, Google AI Edge Gallery is an experimental, open-source playground for running Large Language Models (LLMs) directly on your device’s hardware (CPU, GPU, or NPU) instead of sending your queries off to a remote server.
The tagline is “100% on-device privacy,” and that’s not marketing fluff.
Once a model is downloaded, the app runs entirely offline. Your prompts, images, and voice recordings never touch a server.
The best analogies are desktop tools you might already know:
- Ollama or LM Studio are popular apps that let you download and run AI models locally on your laptop. Google AI Edge Gallery is the same concept, but optimized from the ground up for mobile hardware.
- Compare that to apps like the standard ChatGPT app or the Gemini app, which are cloud-dependent. Every response is generated on a remote server and piped back to you over the internet.
The difference matters more than it might seem. We’ll get into exactly why in a bit.
Note ⚠️
The app is currently in active development with Google explicitly calling it experimental. That means you should expect some rough edges. This isn’t a finished product; it’s a capable sandbox with ambitious goals.
What “On-Device AI” Actually Means (And Why It’s a Big Deal)
Imagine downloading a small “brain” to your phone. Once it’s there, all thinking happens locally. Your device does all the processing, so there’s no middleman.
Contrast that with the usual cloud AI setup, where: you type something → your message travels to a data center → a massive computer processes it → the answer travels back to you.
That round-trip is why there’s always some latency, why you need a connection, and why your data passes through someone else’s infrastructure.
On-device AI eliminates that dependency entirely.
Why This Matters in Practice
The privacy angle is the most obvious. If you’re summarizing a sensitive document, processing medical notes, or writing something personal, the usual cloud model means that content passes through (and is likely stored by) a third-party service.
With on-device AI, it literally never leaves your phone.
But privacy isn’t the only reason this is interesting. Think about latency.
When AI runs locally, the response speed is limited only by your device’s processor, not by server load, internet speed, or geographic distance.
On a modern phone, that translates to responses that feel snappy rather than laggy ⚡
And then there’s cost. Cloud AI isn’t free. Most serious usage requires a subscription or pays per API call.
With local AI, once you download a model to your device, running it is completely free.
Finally, there’s the offline factor. No signal? Airplane mode? Rural area? Doesn’t matter. The model is already on your device.
Tip 👀
On-device AI isn’t here to replace cloud AI. The two have different strengths. Cloud models like GPT-4 or Gemini Ultra will still outperform local models on complex tasks since they have vastly more compute. Think of on-device AI as filling in the gaps: private tasks, offline scenarios, cost-sensitive workflows, and fast responses on simple queries.
The Tech Under the Hood (Non-Scary Version)
You don’t need to understand this to use the app, but knowing the basics will help you understand why it works on your phone.
Because, hey, running a billion-parameter AI model on a mobile device is quite a challenge 💁♀️
LiteRT-LM and MediaPipe: The Runtime Engines
The app runs on two key technologies built by Google:
LiteRT-LM, built on LiteRT (the new name for TensorFlow Lite), is the inference engine that actually executes the AI model on your device.
Think of it as the translator between the model’s math and your phone’s processor.
It’s specifically designed for low-power, resource-constrained hardware, and it supports acceleration across CPU, GPU, and NPUs (Neural Processing Units), depending on your device’s chipset.
MediaPipe sits on top of LiteRT as a higher-level pipeline framework.
When you tap “Audio Scribe,” for example, you’re not just hitting a raw model: MediaPipe handles all the audio preprocessing, text tokenization, and result formatting before the model ever sees your input.
It’s what makes each feature feel like a product rather than a raw API.
Quantization: How Giant Models Fit on a Phone
Modern AI models are huge. Gemma 4’s full-size versions need server-level hardware.
So how does a 2-billion-parameter model run in under 1.5GB of RAM on a phone?
The answer: Quantization.
A standard AI model stores its weights as 32-bit floating point numbers that are highly precise but very memory-hungry.
Quantization compresses those weights down to 4-bit or even 2-bit integers, which cuts memory usage by roughly 4–8x with only a modest drop in output quality for most tasks.
If that still sounds confusing, think of it like photo compression. A raw camera photo might be 25MB, every pixel captured in full detail. A compressed JPEG of the same photo might be 2MB and still look nearly identical on your screen.
You gave up a little data at the edges in exchange for something that’s now practical to store and share.
Quantization does the same thing to an AI model’s internal numbers: good-enough precision on a much smaller footprint.
Gemma 4 E2B with 4-bit quantization fits in roughly 1.5GB of RAM and that’s what makes it viable on a modern flagship phone. Without this compression, you’d need gigabytes of dedicated RAM that no mobile device can allocate to a single app.
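To make the idea concrete, here’s a toy sketch of symmetric int8 quantization in Python. This is not the exact scheme the app uses (real quantizers typically work per-channel and at 4-bit or lower), but the memory math is the same:

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric per-tensor quantization: float32 -> int8 plus one scale factor."""
    scale = np.abs(weights).max() / 127.0  # map the largest weight to +/-127
    q = np.round(weights / scale).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float32 weights at inference time."""
    return q.astype(np.float32) * scale

# A toy "layer" of 1,000 weights
w = np.random.randn(1000).astype(np.float32)
q, scale = quantize_int8(w)

print(f"float32 size: {w.nbytes} bytes")  # 4000 bytes
print(f"int8 size:    {q.nbytes} bytes")  # 1000 bytes (4x smaller)
print(f"max error:    {np.abs(w - dequantize(q, scale)).max():.5f}")
```

Dropping from int8 down to 4-bit roughly doubles the savings again, which is how a 2-billion-parameter model ends up fitting in about 1.5GB instead of 8GB.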
Hardware Acceleration: Why Your Phone’s Chip Matters
Bear in mind that not all phones will perform equally. The app leverages whatever hardware acceleration your chipset supports:
- Qualcomm and MediaTek NPUs (Neural Processing Units, dedicated AI chips found in flagship Android devices) give the biggest boost
- Apple’s Neural Engine (its NPU) on iPhone 15 Pro and newer gives strong performance on iOS
- GPU acceleration is available on most modern devices via ML Drift
- CPU fallback always works, but is the slowest option
The difference between devices is real. On a Pixel 8 Pro or iPhone 15 Pro, Gemma 4 in quantized form runs with response latency in the single-digit seconds for typical prompts.
On a mid-range device without a dedicated NPU, you might be waiting noticeably longer.
For iOS, Apple recommends iPhone 15 Pro or newer for the best experience since the A17 Pro chip’s Neural Engine makes a meaningful difference.
The Real Trade-offs: Honest Pros and Cons
Let’s not sugarcoat this. On-device AI is exciting, but it comes with real limitations worth understanding before you go in.
The Pros
1. Privacy, for real
Not “we promise we protect your data” privacy, but actually no-data-ever-leaves-the-device privacy. That’s a meaningful distinction.
2. No latency from network round-trips
Responses are generated at the speed of your hardware. Once the model is warm and loaded, there’s no server round-trip adding delays.
3. Completely free to run
Download the model once, use it forever with no API costs or subscription fees.
4. Works offline
Airplane mode, bad signal, underground (’cause why not): it doesn’t matter. The model runs locally regardless.
The Cons
1. You need a reasonably modern device
The app recommends at least 4GB of RAM and a recent flagship chipset for smooth performance. Budget phones or older devices will either struggle or not work at all.
Tip: If you download the app and a model but your screen freezes (or your phone downright crashes) while chatting, that’s a definitive sign your device is no bueno 🙅‍♀️
2. Models eat storage
Gemma 4 E2B takes around 1.5GB. E4B is closer to 3–4GB. If you want multiple models for different features, that adds up fast.
Storage is, honestly, one of my biggest struggles with the app.
3. Battery drain is real
Local AI inference is computationally intensive. Running long sessions will drain your battery faster than typical app usage.
So keep this in mind if you want offline local AI available but also need your phone to stay usable for emergencies when there’s no charging around.
4. Local models aren’t as capable as cloud giants
A 2-billion-parameter model running on your phone is impressive for what it is, but it won’t match something like a cloud GPT or Gemini model on complex reasoning tasks. Calibrate expectations accordingly.
5. No real chat history, yet
From my testing, it appears that the app doesn’t currently save actual conversation history across sessions.
What looks like “history” in the interface is just the last message you sent, not a recoverable chat log.
If you accidentally tap into a new chat, that conversation is gone.
It’s clearly being built as a sandbox and testing environment rather than a daily driver, so this will hopefully change. But for now, don’t rely on it to remember anything 💁♀️
Feature Breakdown: What You Can Actually Do
Okay, with pros/cons out of the way, here’s where things get interesting. Google AI Edge Gallery isn’t just “AI chat on your phone.”
It has six distinct feature modes, each powered by different models, each designed for different tasks.
Let’s walk through all of them:
AI Chat with Thinking Mode
The core feature is a conversational AI that you chat with like any other chatbot. What makes it interesting is the Thinking Mode.

Toggle it on, and you’ll see the model’s step-by-step reasoning process before it gives you the final answer, similar to how Claude’s extended thinking or ChatGPT’s reasoning mode works. It shows the model breaking down the problem, considering approaches, and working through its logic.

This is useful for two reasons:
- You can verify why the model reached a conclusion
- It’s educational because watching how a model reasons through a problem is a great way to learn how to prompt better
Best for: Q&A, writing help, reasoning tasks, explanations
Worth knowing: The model’s knowledge has a training cutoff. It won’t know about events after that date unless you use Agent Skills (more on that below).
Agent Skills: The Standout Feature
This is the thing that makes the app more than just an “offline chatbot.”
Agent Skills transforms the model from a conversationalist into something that can take actions and complete multi-step tasks.
As of the April 2026 release, powered by Gemma 4, the built-in skills include:
- Wikipedia querying: the model can pull from Wikipedia in real-time to ground its answers in current, factual information
- Interactive visualizations: transform text, data, or a video summary into interactive charts or flashcard sets
- Interactive maps: display location-based information visually
- Community skills: load custom skills from a URL or browse contributions on the GitHub Discussions page
The Wikipedia skill is significant because it directly addresses the training cutoff limitation. Instead of relying purely on what was baked in during training, the agent can pull live information to supplement its answers.
One thing I noticed: simple, single-fact queries worked fine, whereas more complex questions involving comparisons or more nuanced requests would produce an error.

Note 👇
Using Agent Skills like Wikipedia requires an internet connection. The skill reaches out to Wikipedia’s servers to fetch data. This doesn’t contradict the “offline” claim for the model itself (your prompts still never leave the device), but the skill is pulling external data. If you’re fully offline, those skills won’t work. Pure AI Chat and Prompt Lab, though, run completely without a connection.
What about full web browsing?
The short answer is: not yet, and by design.
Unlike cloud AI providers, there’s no general internet browsing capability built in since the model itself is offline.
Agent Skills are modular extensions that can be loaded on top of the model, but each skill is purpose-built for a specific task. The format these skills need to follow is based on the LiteRT function-calling API. You can find examples and a guide in the GitHub repository.
If you’re familiar with how tool calling works in something like Claude or OpenAI’s function calling, it’s a similar concept, only scoped to what runs locally.
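If a sketch helps, here’s a hypothetical Python illustration of that tool-calling pattern: the model emits a structured call naming a skill, and the app parses it and dispatches to local code. The skill name, schema, and format below are invented for illustration, not the actual LiteRT function-calling format (see the GitHub repo for that):

```python
import json

# Hypothetical skill registry -- names and schema invented for illustration.
SKILLS = {
    "wikipedia_search": {
        "description": "Look up a topic on Wikipedia and return a summary.",
        "parameters": {"query": "string"},
    },
}

def run_skill(model_output: str) -> str:
    """Parse the model's structured call and dispatch it to local code."""
    call = json.loads(model_output)  # structured output emitted by the model
    if call["name"] == "wikipedia_search":
        # In the real app, this is where the network request would happen --
        # the only step that needs a connection.
        return f"[summary for: {call['arguments']['query']}]"
    raise ValueError(f"unknown skill: {call['name']}")

# Given the skill descriptions, the model emits something like:
model_output = '{"name": "wikipedia_search", "arguments": {"query": "Gemma"}}'
print(run_skill(model_output))  # -> [summary for: Gemma]
```

The key point: the model never touches the network itself. It only produces a structured request, and the skill code decides what (if anything) leaves the device.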
Tip 🤓
If you’re still scratching your head, saying “I don’t get it”, then think of the Wikipedia skill as a controlled bridge to external facts. It does not introduce the complexities of full web browsing that compromise the app’s core mission of privacy, offline reliability, and resource efficiency.
Best for: Research, studying (the flashcard generation is solid), fact-grounding, developers testing agentic AI behavior on-device
Ask Image: Multimodal Visual AI
Point your camera at something or upload an image from your gallery, and the model will analyze it by:
- describing what it sees
- identifying objects
- answering questions about the image
- helping you understand what you’re looking at
This uses Gemma 4’s multimodal capabilities, so the same model that handles text also handles image input. You can ask questions like “What plant is this?” or “What does this sign say?” or hand it a photo of a math problem on a whiteboard and ask it to solve it.

And, because I know everyone’s dying to know the result, here is the same prompt across three image generators:



Perfect likeness? No. But maybe that just needs iteration 😅
Best for: Object identification, photo descriptions, solving visual problems (instructions on a label, handwritten notes, signage in an unfamiliar language), accessibility use cases
Worth knowing: Image understanding has limits.
Complex scenes, ambiguous objects, or very small text in images will challenge any model at this size.
As with the other features, you’ll need to download the appropriate vision-capable model and, don’t worry, the app will tell you if your current model doesn’t support image input!
Audio Scribe: On-Device Transcription and Translation
Record a voice note or import an audio file, and Audio Scribe will transcribe it to text in real time entirely offline. It also supports translation, which is extremely useful for multilingual workflows.
Here’s how the translation piece works:
- Speak (or play audio) in one language
- Ask Audio Scribe to translate the transcription into your target language
So if you’re in a country where you don’t speak the local language and want to understand what someone said, you can record it, transcribe it, and then prompt the translation all offline.

A quick translation sanity check:
“Roasted chicken, without salt, in small pieces or crumbled, should be hot, not hot, with a little caper.”
It definitely lost me toward the end with that contradiction (“hot, not hot”), and I’m not giving my dog capers 🤦‍♀️
To be fair, a lot depends on how clear and accurate the recorded audio was, since that’s what gets transcribed.
Under the hood, this uses Gemma 3n with audio support (added in September 2025), covering over 140 languages.
Best for: Voice notes, travel translation, accessibility, transcribing short recordings offline
Real talk: The audio processing window is around 30 seconds per prompt, so this isn’t built for transcribing a full hour-long meeting in one shot. It works best on shorter recordings or clips.
Don’t expect Whisper-level accuracy either, but for a free, fully offline tool that also translates, it punches above its weight for most casual use.
Prompt Lab: Developer Sandbox
Prompt Lab is a dedicated workspace for testing and experimenting with prompts. It’s single-turn (one prompt, one response), but it gives you granular control over model parameters:
- Temperature: controls creativity/randomness
  - Higher = more varied outputs
  - Lower = more predictable
- Top-K: controls how many token options the model considers at each step
I like to think of temperature like a dial between “reliable and focused” on the low end and “creative and unpredictable” on the high end.
If you’re writing a legal summary, you probably want it low. If you’re brainstorming ideas, crank it up.
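Here’s a toy Python sketch of what Temperature and Top-K actually do during sampling. Real inference engines work on tensors of logits over a full vocabulary, but the mechanics are the same:

```python
import math
import random

def sample(logits: dict[str, float], temperature: float = 1.0, top_k: int = 50) -> str:
    """Pick the next token: keep only the top_k candidates, then sample from a
    softmax sharpened (low temperature) or flattened (high temperature)."""
    candidates = sorted(logits.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    if temperature < 1e-4:  # temperature -> 0: always pick the single best token
        return candidates[0][0]
    weights = [math.exp(score / temperature) for _, score in candidates]
    return random.choices([tok for tok, _ in candidates], weights=weights)[0]

# Toy scores for three possible next tokens
logits = {"the": 4.0, "a": 3.5, "banana": 0.1}
print(sample(logits, temperature=0.0))           # always "the"
print(sample(logits, temperature=2.0, top_k=2))  # "the" or "a", never "banana"
```

Top-K trims the pool of options first; temperature then decides how adventurous the pick is within that pool. That’s why a high temperature with a small Top-K still stays fairly on-topic.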
This is where developers and power users will spend a lot of time. It’s great for understanding how the model behaves, A/B testing prompt variations, and learning the effects of different parameter settings.
For a regular user, beyond a bit of curiosity, I don’t see you venturing much into this feature.
Best for: Developers building apps, anyone learning how LLMs respond to different inputs, and rapid prompt iteration
Mobile Actions: AI That Actually Does Things
Mobile Actions runs on FunctionGemma 270m, which is a tiny, fine-tuned model specifically designed for function calling (executing predefined actions, not just generating text).
You give it natural language commands, and it translates them into real device actions.
What kind of actions? Things like:
- “Show me the San Francisco airport on a map”
- “Create a calendar event for 2:30 PM tomorrow for cooking class”
- “Turn on the flashlight”
This is scoped to OS-level controls and app intents so it can trigger built-in device functions, open apps to specific screens, and interact with calendar and map features.
What it can’t do is take over an app and type for you. So “compose a Gmail message to my boss” would open Gmail, but it won’t write and send the message on your behalf.
Think of it as a voice-controlled shortcut launcher with natural language understanding, not a full automation agent.
That limitation is worth knowing, so you don’t expect more than what’s there right now. But as a proof of concept for what’s possible, it’s compelling!
This is a 270-million parameter model running locally, translating plain English into structured function calls in milliseconds 🤯
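To illustrate why that scoping matters, here’s a hypothetical Python sketch of the constraint: the model can only select from a fixed, typed menu of actions, and anything outside it gets rejected. The action names and schema are invented for illustration, not FunctionGemma’s real format:

```python
# Hypothetical action schema -- illustrates the *scoped* nature of Mobile
# Actions: the model picks from a fixed menu with typed arguments.
ALLOWED_ACTIONS = {
    "show_map":       {"location": str},
    "create_event":   {"title": str, "time": str},
    "set_flashlight": {"on": bool},
}

def validate_call(name: str, args: dict) -> bool:
    """Reject anything outside the whitelist or with wrong argument types."""
    schema = ALLOWED_ACTIONS.get(name)
    if schema is None:
        return False
    return (set(args) == set(schema)
            and all(isinstance(args[k], t) for k, t in schema.items()))

# "Turn on the flashlight" -> the model emits:
print(validate_call("set_flashlight", {"on": True}))      # True
# "Compose and send an email" has no matching action:
print(validate_call("send_email", {"to": "boss@x.com"}))  # False
```

That whitelist is the whole trick: the model’s freedom is in understanding your phrasing, not in deciding what it’s allowed to do.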
Tiny Garden: An Experiment Worth Poking At
Tiny Garden is a simple, natural-language-driven gardening game where you plant and harvest a virtual garden by talking to it. It also runs on FunctionGemma 270m.
On the surface, it looks like just a fun distraction. But what it’s actually demonstrating is that a very small on-device model can interpret natural language and map it to a structured sequence of actions entirely without a server.
That’s not a trivial thing.
The same mechanism that makes Tiny Garden work is what you’d use to build an on-device AI assistant for a game, a productivity app, or any interactive experience where you want language control without cloud dependency.
It’s not a feature you’ll use daily, but if you’re a developer, take a few minutes with it. It’s a quick mental model for what on-device function calling looks like in practice.
Model Management and Benchmarking
When I first opened the Model Management section, I didn’t expect to spend much time there. It looked like a plain list of downloads, but it turned out to be more useful than it appeared.
The model library lets you browse available models, see their size on disk and minimum hardware requirements before you download, and switch between them depending on what task you’re working on.
Gemma 4 E2B is the fastest option; E4B is slower but smarter. There’s also Phi-4-mini, Qwen2.5, and others for comparison testing.
The benchmarking tool is where developers will want to spend time. Hit the benchmark button on any model, and you’ll get real tokens-per-second numbers on your actual device, not theoretical specs from a data sheet.
It’s useful if you’re evaluating whether to build something on LiteRT-LM since you can validate performance against real hardware before committing to the architecture.
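Tokens-per-second itself is a simple measurement. Here’s a minimal Python sketch, with a fake stand-in for the model so it runs anywhere:

```python
import time

def tokens_per_second(generate, prompt: str) -> float:
    """Time one generation call and report throughput. `generate` is any
    function returning a list of tokens -- here, a stand-in for a real model."""
    start = time.perf_counter()
    tokens = generate(prompt)
    elapsed = time.perf_counter() - start
    return len(tokens) / elapsed

# Fake "model" so the sketch is runnable without any inference engine
def fake_generate(prompt: str) -> list[str]:
    time.sleep(0.1)          # pretend inference takes 100 ms
    return ["tok"] * 50      # pretend we produced 50 tokens

rate = tokens_per_second(fake_generate, "benchmark me")
print(f"{rate:.0f} tokens/sec")
```

The app’s benchmark does the equivalent against the real model on your real chip, which is exactly the number a spec sheet can’t give you.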
The models in Google’s curated list can be downloaded directly from within the app, with no Hugging Face login required.
If you want to load a custom or third-party model, though, you’ll need a Hugging Face account to access it.
Real Ways to Use This Day-to-Day
Features are great on paper, but let’s talk about actual use cases where this is meaningfully better than alternatives.
Travelers
Audio Scribe with translation support, plus offline AI chat, is genuinely useful in areas without a data connection.
On a long flight, in a rural region, or anywhere the signal is spotty, you can still transcribe voice notes and get AI assistance.
And if you’re somewhere you don’t speak the language (or, like me, in a household that revolves around three different languages), speak into Audio Scribe and ask it to translate into English. No SIM needed.
Students (on mobile)
Use Agent Skills to turn a short set of notes you’ve typed or pasted on your phone into a flashcard set offline, for free, with no data leaving your device.
This is a solid study tool for people who prefer mobile over laptop, and especially useful for sensitive material (medical, legal) where you’d rather not paste content into a cloud service.
Professionals with sensitive documents on mobile
Need to quickly summarize or rewrite something confidential while away from your desk? Paste it into AI Chat or Prompt Lab, and the content stays on the device.
No third-party service touches it.
But please remember that the current version (as of my writing and testing) doesn’t store anything in history, so don’t paste in something you can’t afford to lose!
Developers building mobile apps
This is probably the most powerful use case. Use the app as a live benchmark and sandbox.
Test how Gemma 4, Phi-4, Qwen, or your own custom models actually perform on mobile hardware before you build them into your app via LiteRT-LM. Real-device benchmarks beat spec sheets every time.
What You Should Know Before You Download
A few honest caveats, because I want you to go in with the right expectations.
It’s alpha software. Google is explicit about this. Expect occasional crashes, incomplete features, and things changing between updates. The April 2026 release brought a lot of improvements, but this is still fundamentally an experimental project.
Each feature may need its own model. When you first open the app, you’re not downloading one model that does everything. Audio Scribe needs an audio-capable model. Ask Image needs a vision-capable model. Keep this in mind and plan your storage accordingly!
Training cutoffs are real. Gemma 4’s training data has a cutoff so it won’t know about things that happened after that point. The Wikipedia Agent Skill addresses this for factual queries, but it requires the internet.
Is This the Only Way to Run AI Locally on Your Phone?
Not quite. Google’s own AICore (baked into newer Android devices) gives developers API-level access to Gemma without a separate app.
Apple’s on-device models via Core ML run inside iOS apps. And there are third-party apps like Private LLM and PocketPal that take similar approaches on iOS and Android.
Google AI Edge Gallery is the most developer-focused sandbox of the bunch, but you’ve got options.
Getting Started: Step-by-Step
Alright, let’s actually set it up:
Step 1: Check Your Device
Before downloading, verify you have:
- At least 4GB of RAM
- At least 8GB of free storage (more if you plan to download multiple models)
- Android 12 or later, or iOS 17 or later
Flagship devices from the last 2–3 years work best: Pixel 8+, Samsung Galaxy S23+, iPhone 15 Pro or newer.
Mid-range devices may work, but will be slower.
Step 2: Download the App
- Android: Google Play Store — search “Google AI Edge Gallery” (Open Beta)
- iOS: App Store — “Google AI Edge Gallery”
- GitHub: github.com/google-ai-edge/gallery for the source or direct APK
Step 3: Sign In with Google
You’ll need to sign in with a Google account to use the app. But you might be asking: If the AI is running locally and my data never leaves the device, why do I need to log in at all?
Fair question. The sign-in is for app-level access management, things like downloading models, syncing app settings, and tracking open beta enrollment.
The model inference itself (i.e., the part where your prompts are processed) still runs entirely on-device. Your prompts, images, and audio don’t go to Google.
Think of it like signing into the Play Store to download an app, where logging in to access the store doesn’t mean the app you download is watching everything you do.
That said, it’s worth noting that, like any app with account sign-in, some usage metadata may be collected.
For the most privacy-sensitive use cases, reviewing Google’s privacy policy for the app is worth the few minutes.
Step 4: Download Your First Model
Once in the app, go to Models and start with Gemma 4 E2B. It’s Google’s own model, optimized specifically for mobile, and it’s the best balance of speed and capability for most tasks.
The download is around 1.5GB.
Tip: Download over WiFi, not mobile data. These are large files!
Step 5: Try Prompt Lab First
Before diving into other features, head to Prompt Lab. It’s the simplest feature and gives you a clean feel for how the model responds.
Try a few prompts at different temperature settings. This is your “getting to know you” phase with the model.
Step 6: Explore Agent Skills
Once you’re comfortable with how the model responds, try Agent Skills with the Wikipedia tool.
Ask it something factual that requires current information to see what sets this app apart from a plain chatbot. (Just make sure you’re connected to the internet for this one.)
It’s a Wrap
Google AI Edge Gallery isn’t just another AI app. It’s a working demonstration that AI doesn’t have to live in the cloud to be useful.
The fact that you can download it today, for free, and run real models on your own hardware is something worth paying attention to.
Right now, the models are smaller, and the capabilities are limited compared to cloud giants. But the direction things are going is what matters.
These models are getting better fast, the hardware improves every year, and the tooling (LiteRT-LM especially) is maturing quickly.
What feels like an experiment today is the foundation of how mobile AI will work a few years from now.
Go download it and poke around. If you hit something surprising or find a use case you didn’t expect, drop it in the comments. I’d love to hear what you discover!
Bye friends 👋