Virtual AI Assistant (iForgetalot)
A mobile productivity coach that runs an LLM on the phone — and only falls back to Claude when it has to.
Status: Private TestFlight beta, iOS + Android · Stack: React Native (Expo), SQLite, llama.cpp via JSI, AWS Lambda + DynamoDB, Stripe Connect, WebRTC · Role: Architecture, full-stack engineering, infrastructure.
The problem
Most productivity apps assume you’ll do the cognitive work yourself: triage the inbox, categorize the task, decide what’s important today. That’s a lot of friction. The goal of iForgetalot was simpler: tell your phone what you need to do, and let the system organize, prioritize, and coach you through it.
That sounds like a thin chatbot wrapper — until you sit down to build it. You need an agent that can read a photo of a receipt, understand it, and create an expense task. You need it to remember context across conversations without blowing up the token budget. You need it to take actions on the user’s data without being a code interpreter. You need it to work without an internet connection. And you need it to do all of this at a price the user is willing to pay, which rules out routing every word through a frontier API.
The architecture
On-device LLM as the primary brain
iForgetalot ships with llama.cpp via the llama.rn React Native binding. The user picks a quantized GGUF model — Qwen, Gemma, DeepSeek-R1, Phi-3 — which downloads on first use to the app’s documents directory. Native C++ loads the weights via mmap, runs inference on Metal (iOS) or CPU (Android), and streams tokens back to JavaScript through JSI — a direct C++→JS function call per token, no IPC overhead. The user watches their response type out at the speed the model produces it.
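Per-token JSI callbacks arrive faster than React should re-render, so some batching layer between the native callback and the UI is implied. A minimal sketch of that idea, assuming hypothetical names (`TokenBatcher`, `flushToUi`) rather than the app's real internals:

```typescript
// Hypothetical sketch: batch per-token JSI callbacks so React state
// updates at a steady cadence instead of once per token.
class TokenBatcher {
  private buffer = "";
  private pending = false;

  constructor(
    private flushToUi: (chunk: string) => void,
    // e.g. requestAnimationFrame or InteractionManager in React Native
    private scheduleFlush: (cb: () => void) => void,
  ) {}

  // Called from the native token callback; hot path, so do almost nothing here.
  onToken(token: string): void {
    this.buffer += token;
    if (!this.pending) {
      this.pending = true;
      this.scheduleFlush(() => this.flush());
    }
  }

  private flush(): void {
    this.pending = false;
    const chunk = this.buffer;
    this.buffer = "";
    if (chunk) this.flushToUi(chunk);
  }
}
```

The point of the pattern: the JSI callback only appends to a string, and UI work happens on the scheduler's cadence, so token throughput never fights the render loop.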
Multimodal: photos to structured tasks
A separate photo model (with a CLIP projector / mmproj file) handles vision. Take a photo of a receipt, and the agent extracts vendor, line items, and total into a structured expense task — categorized, dated, ready to file. The same path handles whiteboards, notes, anything where the text or layout carries meaning.
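Vision-model output is still model output, so it has to be validated before it becomes a task. A sketch of that validation step, where the field names and schema (`vendor`, `total`, `lineItems`) are illustrative assumptions, not the app's real schema:

```typescript
// Illustrative only: the real field names and task schema are assumptions.
interface ExpenseTask {
  vendor: string;
  total: number;
  date: string; // ISO yyyy-mm-dd
  lineItems: { label: string; amount: number }[];
}

// Vision models return JSON-ish text; never trust it blindly.
function parseReceiptExtraction(raw: string): ExpenseTask | null {
  let data: any;
  try {
    data = JSON.parse(raw);
  } catch {
    return null; // model emitted non-JSON: caller can retry or fall back
  }
  if (typeof data.vendor !== "string" || typeof data.total !== "number") return null;
  const items = Array.isArray(data.lineItems) ? data.lineItems : [];
  const lineItems = items
    .filter((i: any) => typeof i.label === "string" && typeof i.amount === "number")
    .map((i: any) => ({ label: i.label, amount: i.amount }));
  const date =
    typeof data.date === "string" && /^\d{4}-\d{2}-\d{2}$/.test(data.date)
      ? data.date
      : new Date().toISOString().slice(0, 10); // fall back to today
  return { vendor: data.vendor, total: data.total, date, lineItems };
}
```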
Agent as router, not monolith
Rather than one giant system prompt that tries to handle every conversation, iForgetalot routes each user turn to one of ten intents: task_action, goal_inquiry, progress_review, coaching, task_breakdown, task_creation, feed_discovery, quiz, casual_chat, create_item. Each intent gets a focused prompt module composed from a small set of reusable parts (core identity, action tags, progress rules, etc.). The result: smaller prompts, faster local inference, fewer hallucinations.
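The composition step can be sketched as a lookup from intent to a recipe of shared modules. The module names and their contents below are placeholders for illustration; only the intent names come from the source:

```typescript
// Hypothetical module texts: the real prompt parts are internal to the app.
const promptModules: Record<string, string> = {
  coreIdentity: "You are iForgetalot, a concise productivity coach.",
  actionTags: 'When you act on data, emit action tags like [CREATE_TASK ...].',
  progressRules: "Reference the user's streaks and milestones when coaching.",
};

// Each intent composes only the modules it needs, keeping prompts small.
const intentRecipes: Record<string, string[]> = {
  task_creation: ["coreIdentity", "actionTags"],
  coaching: ["coreIdentity", "actionTags", "progressRules"],
  casual_chat: ["coreIdentity"],
};

function buildSystemPrompt(intent: string): string {
  const recipe = intentRecipes[intent] ?? ["coreIdentity"];
  return recipe.map((name) => promptModules[name]).join("\n\n");
}
```

A `casual_chat` turn gets a prompt a fraction of the size of a `coaching` turn, which is exactly where the local-inference speedup comes from.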
Action tags: tool use without a code interpreter
The agent doesn’t get arbitrary code execution. It emits narrow, auditable tags in its response — [MARK_DONE id:"..." type:"step"], [CREATE_TASK title:"..." category:"..."], [SET_REMINDER datetime:"..."], [NAVIGATE target:"goal" id:"..."] — and a deterministic dispatcher parses them, validates the parameters, and applies the mutation. The user sees only the natural language; the side-effects come from a closed set of safe operations.
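The parse-validate-dispatch loop can be sketched with a regex over the tag grammar shown above. The grammar and allowlist here match the examples in the text; the validation rules are illustrative:

```typescript
interface ActionTag {
  name: string;
  params: Record<string, string>;
}

// Parse narrow, auditable tags like [MARK_DONE id:"t1" type:"step"].
// A sketch of the idea, not the production grammar.
function parseActionTags(text: string): ActionTag[] {
  const tagRe = /\[([A-Z_]+)((?:\s+\w+:"[^"]*")*)\]/g;
  const paramRe = /(\w+):"([^"]*)"/g;
  const tags: ActionTag[] = [];
  for (const m of text.matchAll(tagRe)) {
    const params: Record<string, string> = {};
    for (const p of m[2].matchAll(paramRe)) params[p[1]] = p[2];
    tags.push({ name: m[1], params });
  }
  return tags;
}

// Dispatcher: a closed set of operations, each validated before applying.
const allowed = new Set(["MARK_DONE", "CREATE_TASK", "SET_REMINDER", "NAVIGATE"]);

function dispatch(tag: ActionTag): string {
  if (!allowed.has(tag.name)) return `rejected:${tag.name}`;
  if (tag.name === "MARK_DONE" && !tag.params.id) return "rejected:missing id";
  // ...apply the mutation to local SQLite here...
  return `applied:${tag.name}`;
}
```

Because the tags are stripped from the displayed text and the dispatcher is deterministic, every side-effect is traceable to one parsed tag in the logs.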
Memory-aware fallback to Claude
Local inference is the default. But running a 4-bit quantized model + a streaming response + React Native’s JS heap puts real pressure on memory. Before each inference, the app checks the Hermes JS heap usage. If it’s above 70%, the local context is released and the request is routed to a Claude proxy on AWS instead. The user never sees a crash; they just see a slightly slower response that one time.
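The routing decision itself is a small pure function. The 70% threshold comes from the text; the heap-stats shape and function names are assumptions (Hermes exposes heap info through its native instrumentation, not this exact interface):

```typescript
// Assumed shape: not the actual Hermes instrumentation API.
interface HeapStats {
  usedBytes: number;
  limitBytes: number;
}

type Route = { target: "local" } | { target: "cloud"; reason: string };

function chooseInferenceRoute(heap: HeapStats, threshold = 0.7): Route {
  const usage = heap.usedBytes / heap.limitBytes;
  if (usage > threshold) {
    // Caller releases the local llama context, then proxies to Claude.
    return { target: "cloud", reason: `heap at ${(usage * 100).toFixed(0)}%` };
  }
  return { target: "local" };
}
```

Keeping the decision in one pure function means the crash-avoidance behavior is unit-testable without a device in the loop.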
Adaptive context budgeting across 16 workflows
Different agent workflows have different context needs. Classifying a task title fits in 1k tokens; coaching a user through a stuck goal might need 12k. The system tracks per-workflow budgets, trims chat history pair-by-pair to fit, and only truncates the system prompt (with a visible “context trimmed” note) when the user’s data wouldn’t otherwise fit. Token usage stays predictable.
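The pair-by-pair trim can be sketched as follows. The char-per-token estimate is a stand-in; the real app presumably counts with the model's own tokenizer:

```typescript
interface ChatTurn {
  role: "user" | "assistant";
  text: string;
}

// Crude estimate (~4 chars/token): a stand-in for the model's tokenizer.
const estimateTokens = (s: string) => Math.ceil(s.length / 4);

// Drop the oldest user/assistant pair until the history fits the budget,
// always keeping at least the most recent exchange.
function trimHistoryToBudget(history: ChatTurn[], budgetTokens: number): ChatTurn[] {
  let trimmed = history.slice();
  const cost = (turns: ChatTurn[]) =>
    turns.reduce((sum, t) => sum + estimateTokens(t.text), 0);
  while (trimmed.length > 2 && cost(trimmed) > budgetTokens) {
    trimmed = trimmed.slice(2); // remove the oldest pair, keep turns aligned
  }
  return trimmed;
}
```

Trimming in pairs matters: dropping a lone user turn would leave the model looking at an assistant reply with no question, which degrades small local models badly.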
The infrastructure
- AWS Lambda freemium proxy for cloud LLM access, with DynamoDB-backed quota management per device.
- Stripe Connect Standard marketplace for coach payouts — 30/40/30 milestone escrow with refund tiers and 1099-NEC tax form generation.
- DIY 1:1 WebRTC video calling between users and coaches: custom signaling lambda, VoIP push, CallKit on iOS, full background-state coverage.
- WebSocket signaling channel for real-time presence, chat, and call signals — with self-healing decryption and channel-id refactoring for partner invites.
- Software config management: every lambda deploy stamps its git tree hash; every client build embeds the expected hash; a preflight check fails the build if the deployed lambda and the client expect different versions. No more shipping a client against a stale API.
- CloudWatch alarms, canaries, and access logs, with preflight scripts that catch regressions before the slow native build runs.
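The version-skew gate from the config-management bullet reduces to a comparison of two stamps. A sketch with illustrative names (the real preflight scripts are not shown in the source):

```typescript
// Illustrative names: the real deploy stamps and scripts are internal.
interface VersionStamp {
  gitTreeHash: string;
}

// Run before the slow native build: fail fast on client/server skew.
function preflightVersionCheck(
  deployedLambda: VersionStamp,
  clientExpected: VersionStamp,
): void {
  if (deployedLambda.gitTreeHash !== clientExpected.gitTreeHash) {
    throw new Error(
      `Version skew: lambda=${deployedLambda.gitTreeHash}, ` +
        `client expects=${clientExpected.gitTreeHash}; refusing to build`,
    );
  }
}
```

Using the git *tree* hash rather than the commit hash means two commits with identical content (e.g. a rebase) still match, so the gate only trips on real API drift.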
The lessons
- The model is the easy part. The hard parts are memory management, fallback orchestration, action-tag parsing, version agreement between client and server, and operational visibility.
- Local-first changes the economics. When the median request never hits an API, your unit economics stop scaling with usage. The frontier model becomes a premium fallback, not a per-token tax.
- Tool use is a UX problem. Action tags work because they’re narrow, predictable, and visible in your debugging logs. Letting the model call arbitrary functions sounds powerful and is debugging hell.
- Observability is what separates demos from products. Knowing what the agent saw, what intent it routed to, which model version replied, and what mutation it triggered — that’s the difference between a system you can debug and a system you can’t.
Want something like this?
The iForgetalot stack is the basis for our AI Coach / Vertical Agent in a Box offering — white-label deployments tailored to your domain. Or, if you’d rather build it custom, that’s what our Agentic System Builds engagement is for.
