The 70% heap rule: when to bail out of local inference

You can run a frontier-quality model on a phone. You just have to know when not to.


The bug we shipped

Build 49 of iForgetalot crashed on TestFlight devices about thirty seconds into a long coaching conversation. The Sentry stack trace pointed at Hermes — the React Native JavaScript engine — running out of heap while a streaming LLM response was being appended token-by-token to a state variable. Memory pressure also pulled Sentry’s session replay into the picture; it was eating its own chunk of the heap while recording UI changes for diagnostics. Three things competing for the same megabytes, and the loser was the user.

The fix wasn’t to remove any of them. The fix was to teach the app to notice.


The rule

Before any local inference call, the app reads the JS heap state via HermesInternal.getRuntimeProperties(). Two numbers come back: js_heapSize (currently used) and js_heapSizeLimit (the cap before Hermes aborts the process).
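
As a rough sketch of that read (the HermesInternal typing here is an assumption, since the API is undocumented; the property names are the ones above):

// Undocumented global; the shape declared here is an assumption.
declare const HermesInternal:
  | { getRuntimeProperties(): Record<string, unknown> }
  | undefined;

// Returns the two heap numbers described above, or null if they are not
// available (non-Hermes engine, or a build that does not expose them).
function readHeapState(): { heapUsed: number; heapLimit: number } | null {
  if (!HermesInternal) return null;
  const props = HermesInternal.getRuntimeProperties();
  const heapUsed = Number(props['js_heapSize']);
  const heapLimit = Number(props['js_heapSizeLimit']);
  if (!Number.isFinite(heapUsed) || !Number.isFinite(heapLimit) || heapLimit <= 0) {
    return null;
  }
  return { heapUsed, heapLimit };
}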

If the ratio is above 70 percent, we don’t start local inference. We release the llama context to free its memory, log a warning, and throw a catchable error that the router converts into a fallback to the Claude API proxy. The user never sees a crash; they see a slightly slower response that one time, and a heap that’s now back below threshold for next time.

const heapUsage = heapUsed / heapLimit;
if (heapUsage > 0.7) {
  await releaseContext();
  throw new Error('JS heap usage too high for local inference. Falling back.');
}

Five lines. Build 49’s crash never came back.
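
For context, the routing side (catch the error, fall back to the cloud proxy) could look roughly like this. routeMessage, runLocalInference, and callClaudeProxy are hypothetical names standing in for the app’s own code:

// Hypothetical stand-ins for the app's local-inference and proxy calls.
declare function runLocalInference(prompt: string): Promise<string>;
declare function callClaudeProxy(prompt: string): Promise<string>;

async function routeMessage(prompt: string): Promise<string> {
  try {
    // The local path runs the 70% heap check first and throws if it fails.
    return await runLocalInference(prompt);
  } catch (err) {
    console.warn('Local inference unavailable, falling back to cloud:', err);
    return callClaudeProxy(prompt);
  }
}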


Why 70%, not 90%

By the time Hermes reports 90% heap usage, you’ve usually already crashed. The garbage collector runs lazily, and a streaming LLM response is an allocation pattern designed to surprise it — many small string concatenations per second. The actual peak during a long response can be 1.5–2x the steady-state reading. 70% gives the GC time to catch up before you hit the wall.

You can tune the threshold per device class if you measure. We didn’t. 70% works across every iPhone from the SE on up and every Android device with more than 6 GB of RAM. Below that, we don’t ship local inference at all.


What also helped

  • Idle-release the model. After 30 seconds without inference, we release the llama context entirely. The weights go away; the heap goes back to baseline. The next inference reloads them (mmap makes this fast). There’s a sketch of the idle timer after this list.
  • Cap streaming token append rate in React. Don’t call setState on every token if you’re getting 30/sec — batch updates per animation frame (sketched after this list).
  • Don’t run Sentry session replay during inference. Or, conditionally disable it when heap pressure is detected.
  • Measure on the worst-supported device, not the best. The crash never reproduced on the developer’s iPhone 16. It reproduced reliably on the iPhone SE 3.
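
A sketch of the idle-release timer from the first bullet. releaseContext() is the same call used in the heap check above; wrapping every inference call like this is an assumption about how the app is structured:

// Release the llama context after 30 seconds with no inference.
const IDLE_RELEASE_MS = 30_000;
let idleTimer: ReturnType<typeof setTimeout> | null = null;

declare function releaseContext(): Promise<void>;

function scheduleIdleRelease(): void {
  if (idleTimer) clearTimeout(idleTimer);
  idleTimer = setTimeout(() => {
    // No inference for 30 s: drop the weights so the heap returns to baseline.
    void releaseContext();
  }, IDLE_RELEASE_MS);
}

// Wrap each inference call; the countdown restarts when the call finishes.
async function withIdleRelease<T>(run: () => Promise<T>): Promise<T> {
  if (idleTimer) clearTimeout(idleTimer);
  try {
    return await run();
  } finally {
    scheduleIdleRelease();
  }
}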
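
And a sketch of the per-frame batching from the second bullet: buffer incoming tokens and flush them to state once per animation frame instead of on every token. setText is a stand-in for whatever state setter the chat screen actually uses:

// Commit streamed tokens to React state at most once per frame.
declare function setText(update: (prev: string) => string): void;

let pendingTokens: string[] = [];
let flushScheduled = false;

function onToken(token: string): void {
  pendingTokens.push(token);
  if (flushScheduled) return;
  flushScheduled = true;
  requestAnimationFrame(() => {
    const chunk = pendingTokens.join('');
    pendingTokens = [];
    flushScheduled = false;
    // One state update per frame instead of one per token.
    setText(prev => prev + chunk);
  });
}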

The bigger lesson

Shipping AI on consumer hardware means you don’t control the runtime. Some users have the latest pro-class device; some are on a five-year-old phone. Memory is the constraint that bites first. The instinct is to do less — smaller model, fewer features. The better answer is to know how much you’re using and back off gracefully when you’re approaching the limit. The cloud is your overflow buffer; use it.


Memory-aware fallback is one of the patterns we ship with every Agentic System Build. If you’re getting OOM crashes on a quantized model in production, we can almost certainly help.
