Why we put the LLM on the phone, not in the cloud

Most AI products treat the cloud LLM as the only option. When we built iForgetalot, we treated it as the fallback. Here’s what that changes — and what it doesn’t.


The economics break differently

If every chat turn in your app hits a frontier API, your gross margin scales inversely with engagement. The more your users love the product, the more it costs you. That’s a structural problem that no amount of prompt optimization fixes.

When the median request runs locally — on the user’s own phone, with weights they already downloaded — that per-token cost drops to zero. The frontier API becomes a premium fallback, not a per-token tax. For an indie productivity app that wants a free tier, that’s the difference between a viable business and a money pit.
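To make the margin point concrete, here's a back-of-envelope cost model. Every number in it (turns per day, tokens per turn, price per million tokens, fallback rate) is an illustrative assumption, not our actual usage or pricing data.

```ts
// Back-of-envelope inference cost per user. All numbers are hypothetical.
type UsageProfile = {
  turnsPerDay: number;       // chat turns per active user per day
  tokensPerTurn: number;     // prompt + completion tokens per turn
  cloudFallbackRate: number; // fraction of turns routed to the paid API (0..1)
};

function monthlyCostPerUserUSD(
  usage: UsageProfile,
  cloudPricePerMillionTokensUSD: number,
): number {
  const tokensPerMonth = usage.turnsPerDay * usage.tokensPerTurn * 30;
  const cloudTokens = tokensPerMonth * usage.cloudFallbackRate;
  return (cloudTokens / 1_000_000) * cloudPricePerMillionTokensUSD;
}

// All-cloud: every turn is billed.
monthlyCostPerUserUSD({ turnsPerDay: 20, tokensPerTurn: 1500, cloudFallbackRate: 1.0 }, 5); // ≈ $4.50/user/month
// Local-first: only ~10% of turns hit the paid API.
monthlyCostPerUserUSD({ turnsPerDay: 20, tokensPerTurn: 1500, cloudFallbackRate: 0.1 }, 5); // ≈ $0.45/user/month
```

The absolute numbers matter less than the shape: in the all-cloud version the bill grows with every extra turn, while in the local-first version only the fallback slice does.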


Latency is now the user’s hardware

Cloud inference has a fixed latency floor (network round-trip, queue time, model load) before the first token even starts. On-device, the only latency is your prompt size and the user's silicon. On a modern iPhone, a quantized 3B-parameter model produces roughly 30 tokens per second once warmed up, and the first token comes back in under 200 ms. That's competitive with the best cloud APIs for a coaching-style conversation, and noticeably faster on slow networks.
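If you want to verify numbers like these on your own devices, a thin timing wrapper is enough. This is a sketch: `generate` is a placeholder for whatever streaming on-device runtime you use, not a real library API.

```ts
// Measures time-to-first-token (TTFT) and decode throughput around any
// streaming inference call. `generate` only needs to invoke onToken once
// per emitted token.
type StreamingGenerate = (
  prompt: string,
  onToken: (token: string) => void,
) => Promise<void>;

async function measureInference(generate: StreamingGenerate, prompt: string) {
  const start = Date.now();
  const stats = { firstTokenAt: -1, tokenCount: 0 };

  await generate(prompt, () => {
    if (stats.firstTokenAt < 0) stats.firstTokenAt = Date.now();
    stats.tokenCount += 1;
  });

  const end = Date.now();
  const ttftMs = stats.firstTokenAt >= 0 ? stats.firstTokenAt - start : null;
  const decodeSeconds = (end - stats.firstTokenAt) / 1000;
  const tokensPerSecond =
    stats.firstTokenAt >= 0 && decodeSeconds > 0
      ? stats.tokenCount / decodeSeconds
      : null;

  return { ttftMs, tokensPerSecond };
}
```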


Privacy stops being a liability

iForgetalot’s value lives in personal data: your goals, your tasks, your photos, your conversations with the coach. The default cloud-LLM design ships all of that off the device to be processed by a third party. Even with the best vendor policies, that’s a data-exposure surface and a compliance question.

On-device inference reframes the question. The model reads your data on your own phone, in a process you control. The weights are public and stored locally; there's no vendor-side log of your queries because no request ever leaves the device. The cloud only sees the requests you explicitly route through the fallback proxy.


When we still fall back to Claude

Local-first isn’t local-only. We route to Claude when (see the routing sketch after this list):

  • The JS heap is above 70%. Running a quantized model alongside a streaming UI, Sentry session replay, and a typical React Native bundle puts real pressure on Hermes. Before each inference we check the heap; if it’s too tight, we release the context and route to the cloud instead of risking a crash.
  • The user hasn’t downloaded a model yet. On first launch, the local stack isn’t ready. Cloud fallback keeps the experience functional during the download.
  • The task needs frontier-grade reasoning. Quiz generation, complex multi-step planning, or long-form summarization sometimes warrant a stronger model. Those workflows opt into cloud explicitly.
  • The user picked it. Power users can flip a setting to prefer cloud for everything.
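Here's roughly how those checks compose. Every helper name in this sketch (`heapUsageRatio`, `localModelReady`, and so on) is hypothetical; in the real app they're wired to Hermes heap stats, the model download manager, and user settings.

```ts
// Decide per request whether to run on-device or fall back to the cloud.
type Route = 'local' | 'cloud';

interface RoutingInputs {
  heapUsageRatio: number;     // current JS heap usage, 0..1
  localModelReady: boolean;   // has the user downloaded a model?
  taskNeedsFrontier: boolean; // quiz generation, long-form summarization, etc.
  userPrefersCloud: boolean;  // explicit setting
}

const HEAP_PRESSURE_THRESHOLD = 0.7;

function chooseRoute(inputs: RoutingInputs): Route {
  if (inputs.userPrefersCloud) return 'cloud';
  if (!inputs.localModelReady) return 'cloud';
  if (inputs.taskNeedsFrontier) return 'cloud';
  if (inputs.heapUsageRatio > HEAP_PRESSURE_THRESHOLD) return 'cloud';
  return 'local';
}
```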

What this isn’t

Local-first isn’t a magic wand. The model is smaller, so it’s worse at long-context reasoning. Quantization costs you some quality. Inference burns battery (we cap concurrent inferences and release the context after 30 seconds idle). And every device has different memory headroom, so you have to plan for the lowest-spec phone you support, not the highest.
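The battery and memory guardrails mentioned above are simple to express. This is a sketch under assumptions: `loadContext` and `releaseContext` stand in for whatever your on-device runtime exposes, and the single-flight rule reflects our cap of one inference at a time.

```ts
// One inference in flight at a time; release the model context after 30 s idle.
const IDLE_RELEASE_MS = 30_000;

class InferenceGuard {
  private busy = false;
  private idleTimer: ReturnType<typeof setTimeout> | null = null;

  constructor(
    private loadContext: () => Promise<void>,    // placeholder: load/warm the weights
    private releaseContext: () => Promise<void>, // placeholder: free the weights
  ) {}

  async run<T>(task: () => Promise<T>): Promise<T> {
    if (this.busy) throw new Error('inference already in flight');
    this.busy = true;
    if (this.idleTimer) clearTimeout(this.idleTimer);
    try {
      await this.loadContext(); // assumed to be a no-op if already loaded
      return await task();
    } finally {
      this.busy = false;
      // Free the memory if nothing else runs for a while.
      this.idleTimer = setTimeout(() => void this.releaseContext(), IDLE_RELEASE_MS);
    }
  }
}
```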

But the tradeoffs go the right way for the right kind of product. If your app’s value is in personal-data conversations that happen frequently — coaching, productivity, journaling, learning — on-device is the right default.


If you’re weighing local vs. cloud for a product you’re building, we do this exact analysis as part of our Architecture Review engagement.
