Running large language models directly on a phone felt like science fiction just two years ago. But here we are in 2026, and it's genuinely practical. Libraries like react-native-executorch and Callstack's react-native-ai let you ship a fully offline chatbot, speech recognizer, or image classifier inside your React Native app — no API keys, no cloud costs, and complete data privacy for your users.
That last point matters more than most people realize.
This guide walks you through both libraries end-to-end: installation, model selection, building a chat UI, tool calling, and the tradeoffs you should weigh before picking one over the other. I've spent a good chunk of time testing both, so I'll share some honest takes along the way.
Why On-Device AI Matters for Mobile Apps
Sending every prompt to a cloud API comes with real costs. Each request adds latency, racks up token charges, and ships potentially sensitive user data off-device. On-device inference eliminates all three problems at once, and adds a fourth benefit for free:
- Privacy — Prompts and responses never leave the phone. This is critical for health, finance, journaling, and enterprise apps that handle personal data.
- Latency — Token generation starts immediately. No round-trip to a server, no cold-start waiting for a function to spin up.
- Cost — You pay zero per token. The user's own hardware does the work.
- Offline support — The model works without a network connection, so your AI features don't break in spotty coverage areas.
The tradeoff? On-device models are smaller and less capable than frontier cloud models. A 3-billion-parameter model running on a phone won't match GPT-4 or Claude Opus on complex reasoning tasks. But for summarization, translation, quick Q&A, image descriptions, and tool-augmented workflows, the quality is more than enough — and it's improving fast.
The On-Device AI Landscape in 2026
Three major libraries dominate the React Native on-device AI space right now:
| Library | Maintained By | Engine | Vercel AI SDK | Key Strength |
|---|---|---|---|---|
| react-native-executorch | Software Mansion | Meta ExecuTorch | No | Broad model support (LLM, vision, speech, OCR) |
| react-native-ai | Callstack | llama.rn / MLC / Apple | Yes | Drop-in Vercel AI SDK compatibility |
| llama.rn (standalone) | Community | llama.cpp | Via react-native-ai | Low-level control, GGUF models |
Each takes a different approach. ExecuTorch gives you the widest model coverage and a React-hooks-first API. React Native AI gives you Vercel AI SDK compatibility so you can swap between cloud and local models with a one-line change. And llama.rn, the llama.cpp binding that powers React Native AI's Llama provider, is an option on its own when you need low-level control over GGUF models.
So, let's build with both.
Option 1: react-native-executorch with Expo
Prerequisites
- Expo SDK 54 or later
- New Architecture enabled (required — the library doesn't support the old bridge)
- A physical device or an emulator with at least 4 GB of RAM allocated
Installation
Install the core package and the Expo resource fetcher:
npx expo install react-native-executorch @react-native-executorch/expo-resource-fetcher expo-file-system expo-asset
Then build the native layer. Since ExecuTorch includes native code, you need a development build — Expo Go won't cut it here:
# iOS
npx expo run:ios
# Android
npx expo run:android
Initialize ExecuTorch
Before using any hooks, call initExecutorch once at the top of your app. This registers the resource fetcher that downloads and caches model binaries:
// app/_layout.tsx
import { initExecutorch } from 'react-native-executorch';
import { ExpoResourceFetcher } from '@react-native-executorch/expo-resource-fetcher';
initExecutorch({
resourceFetcher: ExpoResourceFetcher,
});
Building a Chat Screen with useLLM
The useLLM hook is the main interface for text generation. It handles model loading, token streaming, conversation history, and cleanup — basically everything you'd otherwise wire up yourself:
import React from 'react';
import {
View,
Text,
TextInput,
FlatList,
Pressable,
ActivityIndicator,
StyleSheet,
} from 'react-native';
import { useLLM, QWEN3_0_6B } from 'react-native-executorch';
export default function ChatScreen() {
const [input, setInput] = React.useState('');
const llm = useLLM({ model: QWEN3_0_6B });
  React.useEffect(() => {
    if (!llm.isReady) return; // wait until the model has finished loading
    llm.configure({
      chatConfig: {
        systemPrompt: 'You are a concise, helpful mobile assistant.',
      },
      generationConfig: {
        temperature: 0.7,
        topP: 0.9,
      },
    });
  }, [llm.isReady]);
const handleSend = () => {
if (!input.trim() || llm.isGenerating) return;
llm.sendMessage(input.trim());
setInput('');
};
if (!llm.isReady) {
return (
<View style={styles.center}>
<ActivityIndicator size="large" />
<Text>Loading model... {Math.round(llm.downloadProgress * 100)}%</Text>
</View>
);
}
return (
<View style={styles.container}>
<FlatList
data={llm.messageHistory}
keyExtractor={(_, i) => String(i)}
renderItem={({ item }) => (
<View style={[
styles.bubble,
item.role === 'user' ? styles.userBubble : styles.aiBubble,
]}>
<Text>{item.content}</Text>
</View>
)}
/>
{llm.isGenerating && (
<Text style={styles.streaming}>{llm.response}</Text>
)}
<View style={styles.inputRow}>
<TextInput
style={styles.input}
value={input}
onChangeText={setInput}
placeholder="Ask anything..."
/>
<Pressable onPress={handleSend} style={styles.sendButton}>
<Text style={styles.sendText}>Send</Text>
</Pressable>
</View>
</View>
);
}
const styles = StyleSheet.create({
container: { flex: 1, padding: 16 },
center: { flex: 1, justifyContent: 'center', alignItems: 'center' },
bubble: { padding: 12, borderRadius: 12, marginVertical: 4, maxWidth: '80%' },
userBubble: { alignSelf: 'flex-end', backgroundColor: '#DCF8C6' },
aiBubble: { alignSelf: 'flex-start', backgroundColor: '#E8E8E8' },
streaming: { padding: 12, color: '#666', fontStyle: 'italic' },
inputRow: { flexDirection: 'row', alignItems: 'center', marginTop: 8 },
input: { flex: 1, borderWidth: 1, borderColor: '#ccc', borderRadius: 8, padding: 10 },
sendButton: { marginLeft: 8, backgroundColor: '#007AFF', borderRadius: 8, padding: 12 },
sendText: { color: '#fff', fontWeight: '600' },
});
The sendMessage method appends the user's message to messageHistory, runs inference, and streams the assistant's reply into the same array. The response property gives you the in-progress text so you can show a live typing indicator. Honestly, it's a pretty slick developer experience.
Model Selection Guide
Choosing the right model depends on your target devices and the complexity of what you need:
| Model | Parameters | RAM Needed | Best For |
|---|---|---|---|
| SmolLM2 135M | 135M | ~0.5 GB | Simple text completion, autocomplete |
| Qwen3 0.6B | 600M | ~1 GB | Quick Q&A, summarization on low-end devices |
| Llama 3.2 1B | 1B | ~2 GB | General chat, solid quality/speed balance |
| Qwen3 4B (quantized) | 4B | ~3 GB | Best quality for flagship phones |
| LFM2.5-VL 1.6B | 1.6B | ~2.5 GB | Vision-language tasks (image descriptions) |
A good rule of thumb: keep the model under half your target device's total RAM. A 4B quantized model runs well on phones with 8 GB or more, but it'll crash on a 4 GB device. Don't ask me how I found that out.
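That rule of thumb is easy to encode. Below is a minimal sketch of a model picker; the catalog and RAM figures mirror the table above, and in a real app you'd read total device RAM from something like react-native-device-info. The helper names here are illustrative, not part of either library:

```typescript
// Illustrative catalog; RAM figures mirror the selection table above.
interface ModelOption {
  name: string;
  ramNeededGb: number;
}

// Sorted largest-first so the first fit is the best fit.
const MODELS: ModelOption[] = [
  { name: 'Qwen3 4B (quantized)', ramNeededGb: 3 },
  { name: 'Llama 3.2 1B', ramNeededGb: 2 },
  { name: 'Qwen3 0.6B', ramNeededGb: 1 },
  { name: 'SmolLM2 135M', ramNeededGb: 0.5 },
];

// Rule of thumb: keep the model under half the device's total RAM.
function pickModel(deviceRamGb: number): ModelOption | null {
  const budget = deviceRamGb / 2;
  return MODELS.find((m) => m.ramNeededGb <= budget) ?? null;
}
```

On a 4 GB device this picks Llama 3.2 1B; on an 8 GB flagship it picks the quantized 4B model.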
Adding Tool Calling
Tool calling lets the model invoke functions you define — checking the weather, toggling a flashlight, querying a local database. ExecuTorch supports this with models from the Hammer 2.1 and Qwen 3 families:
import { useLLM, QWEN3_0_6B } from 'react-native-executorch';
const weatherTool = {
name: 'get_weather',
description: 'Get the current weather for a given city',
parameters: {
type: 'dict',
properties: {
city: { type: 'string', description: 'City name' },
},
required: ['city'],
},
};
function ToolCallingChat() {
const llm = useLLM({ model: QWEN3_0_6B });
React.useEffect(() => {
if (!llm.isReady) return;
llm.configure({
chatConfig: {
systemPrompt: 'You are a helpful assistant with access to tools.',
},
toolsConfig: {
tools: [weatherTool],
executeToolCallback: async (call) => {
if (call.toolName === 'get_weather') {
// Replace with a real API call or local data lookup
return JSON.stringify({
city: call.parameters.city,
temp: '22°C',
condition: 'Sunny',
});
}
return null;
},
displayToolCalls: false,
},
});
}, [llm.isReady]);
// ... same chat UI as above
}
When the user asks "What's the weather in Tokyo?", the model generates a structured tool call instead of hallucinating an answer. Your callback executes the function, and the result gets fed back into the conversation for the model to format into a natural response. It's a surprisingly clean pattern for extending what a small local model can do.
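Under the hood, that round trip amounts to a small dispatch step. The sketch below is purely conceptual — the types and function are hypothetical, not the library's API — but it shows the shape of what the library does for you:

```typescript
// Hypothetical shapes illustrating the tool-calling round trip.
interface ToolCall {
  toolName: string;
  parameters: Record<string, string>;
}

type ToolExecutor = (call: ToolCall) => Promise<string | null>;

// The model emits a structured call, we execute it, and the JSON
// result re-enters the conversation as a tool message for the model
// to phrase as a natural-language answer.
async function resolveToolCall(
  call: ToolCall,
  execute: ToolExecutor
): Promise<{ role: 'tool'; content: string }> {
  const result = await execute(call);
  return { role: 'tool', content: result ?? 'Tool returned no result.' };
}
```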
Option 2: Callstack's react-native-ai with Vercel AI SDK
If your project already uses the Vercel AI SDK — or you want the flexibility to swap between a cloud provider and a local model — Callstack's react-native-ai is probably the better fit. It exposes the same generateText and streamText functions you already know.
Installation
The library offers three providers. Pick the one that matches your use case:
# Llama provider (GGUF models from HuggingFace, iOS + Android)
npm install @react-native-ai/llama llama.rn react-native-blob-util ai
# Apple provider (iOS 26+ with Apple Intelligence)
npm install @react-native-ai/apple ai
# MLC provider (compiled models, iOS + Android)
npm install @react-native-ai/mlc ai
Generating Text with the Llama Provider
The Llama provider downloads GGUF-format models from HuggingFace and runs them via llama.rn (a React Native binding for llama.cpp):
import { llama } from '@react-native-ai/llama';
import { generateText, streamText } from 'ai';
// Model ID follows HuggingFace format: owner/repo/filename.gguf
const model = llama.languageModel(
'ggml-org/SmolLM3-3B-GGUF/SmolLM3-Q4_K_M.gguf'
);
async function runChat() {
// Step 1: Download the model (runs once, cached afterward)
await model.download((progress) => {
console.log(`Download: ${progress.percentage}%`);
});
// Step 2: Load into memory
await model.prepare();
// Step 3: Generate a response
const { text } = await generateText({
model,
messages: [
{ role: 'system', content: 'You are a helpful assistant.' },
{ role: 'user', content: 'Summarize the benefits of on-device AI in three bullets.' },
],
});
console.log(text);
// Step 4: Free memory when done
await model.unload();
}
Streaming Responses
For a chat UI where you want tokens appearing one by one, use streamText instead:
import { llama } from '@react-native-ai/llama';
import { streamText } from 'ai';
const model = llama.languageModel(
'bartowski/Llama-3.2-3B-Instruct-GGUF/Llama-3.2-3B-Instruct-Q4_K_M.gguf'
);
async function streamChat(userMessage: string) {
await model.prepare();
const result = streamText({
model,
messages: [
{ role: 'system', content: 'You are a concise assistant.' },
{ role: 'user', content: userMessage },
],
});
  for await (const chunk of result.textStream) {
    console.log(chunk); // React Native has no process.stdout; in a real UI, append each chunk to state
  }
await model.unload();
}
Using Apple Foundation Models (iOS 26+)
On devices running iOS 26 with Apple Intelligence, you can use Apple's built-in on-device model with zero download time. This is kind of a game-changer for iOS apps:
import { apple } from '@react-native-ai/apple';
import { generateText, embed } from 'ai';
// Text generation — no model download needed
const { text } = await generateText({
model: apple(),
prompt: 'Explain the key differences between React Native and Flutter.',
});
// Embeddings (available from iOS 17+)
const { embedding } = await embed({
model: apple.textEmbeddingModel(),
value: 'React Native on-device AI',
});
This is the fastest path to on-device AI on iOS because the model is already there. No download, no storage cost, no RAM management headaches. The tradeoff is that it only works on recent iPhones and iPads with Apple Intelligence support.
Switching Between Cloud and On-Device
One of the most powerful patterns with react-native-ai is seamlessly falling back between cloud and local inference. Since the API follows the standard Vercel AI SDK interface, switching is just a model swap:
import { llama } from '@react-native-ai/llama';
import { apple } from '@react-native-ai/apple';
import { openai } from '@ai-sdk/openai'; // cloud provider
import { generateText } from 'ai';
import { Platform } from 'react-native';
import NetInfo from '@react-native-community/netinfo';
async function getModel() {
const networkState = await NetInfo.fetch();
if (networkState.isConnected) {
// Use cloud when online for best quality
return openai('gpt-4o-mini');
}
if (Platform.OS === 'ios') {
// Use Apple Intelligence on iOS (no download)
return apple();
}
// Fall back to local GGUF model
const local = llama.languageModel(
'ggml-org/SmolLM3-3B-GGUF/SmolLM3-Q4_K_M.gguf'
);
await local.prepare();
return local;
}
// Usage stays identical regardless of provider
const model = await getModel();
const { text } = await generateText({
model,
prompt: 'What is the capital of France?',
});
Choosing Between ExecuTorch and React Native AI
Both libraries are production-ready, but they serve different architectural preferences. Here's how they stack up:
| Criteria | react-native-executorch | react-native-ai |
|---|---|---|
| API style | React hooks (useLLM, useWhisper) | Vercel AI SDK (generateText, streamText) |
| Model formats | .pte (ExecuTorch) | .gguf (Llama), MLC compiled, Apple built-in |
| Beyond LLMs | Whisper (ASR), CLIP (vision), OCR | Embeddings, TTS, transcription (via Apple) |
| Cloud fallback | Not built-in | Native — swap any Vercel AI SDK provider |
| Managed conversation | Built-in messageHistory with context strategies | You manage state yourself |
| Tool calling | Built-in with configure() | Via Vercel AI SDK tools parameter |
| Hardware acceleration | CoreML, Vulkan, mobile GPU backends | llama.cpp optimizations, Metal/GPU on Apple |
Go with ExecuTorch if you want a self-contained React hooks API, need non-LLM models (speech-to-text, image classification, OCR), or prefer Meta's ecosystem with hardware-optimized backends.
Go with React Native AI if your codebase already uses the Vercel AI SDK, you want to switch between cloud and local models transparently, or you'd like to leverage Apple Intelligence on iOS without downloading anything.
Personally, I think the Vercel AI SDK compatibility of react-native-ai is its killer feature. Being able to prototype with a cloud model and then swap to local for production (or vice versa) without changing your UI code is incredibly convenient.
Performance Tips and Gotchas
Memory Management
Running an LLM on a phone is memory-intensive. Here are practical steps to avoid crashes:
- Always unload models when leaving a screen. With react-native-ai, call model.unload(). With ExecuTorch, interrupt generation before unmounting — failing to do so will crash the app.
- Monitor memory using Xcode Instruments (iOS) or Android Studio Profiler. Watch for memory spikes during model loading.
- Use quantized models whenever possible. A 4-bit quantized 3B model uses roughly a quarter of the memory of its 16-bit counterpart with minimal quality loss.
- Set context window limits in ExecuTorch using SlidingWindowContextStrategy to prevent conversation history from consuming unbounded memory.
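To make that last point concrete: a sliding-window strategy simply keeps the system prompt plus the most recent turns. Here's a rough sketch of the idea (not ExecuTorch's actual implementation; the library applies this for you when you pick the strategy):

```typescript
interface ChatMessage {
  role: 'system' | 'user' | 'assistant';
  content: string;
}

// Keep the system prompt, drop the oldest turns beyond the window.
function applySlidingWindow(
  history: ChatMessage[],
  maxTurns: number
): ChatMessage[] {
  const system = history.filter((m) => m.role === 'system');
  const turns = history.filter((m) => m.role !== 'system');
  return [...system, ...turns.slice(-maxTurns)];
}
```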
Battery and Thermal Considerations
AI inference is computationally heavy. During extended generation sessions, devices will get warm and battery drain accelerates. A few things worth keeping in mind:
- Limit maximum generation length (256 tokens is usually enough for quick answers).
- Show users a clear indicator when the model is working so they understand what's happening.
- Avoid running inference in the background — your users' battery life will thank you.
Model Download Strategy
Models range from 100 MB to 3+ GB. Please don't download them on first app launch. Instead:
- Let users opt in to the AI feature and trigger the download explicitly.
- Show download progress and allow cancellation.
- Cache models on disk so subsequent launches are instant.
- For the Apple provider, there's no download — the model ships with the OS.
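A small guard like the one below keeps the opt-in flow honest about disk space. The numbers and helper are illustrative; on a real device you'd read free space with something like expo-file-system's getFreeDiskStorageAsync:

```typescript
const GB = 1024 ** 3;

// Only start a download if it leaves comfortable headroom on disk.
function canDownloadModel(
  modelSizeBytes: number,
  freeDiskBytes: number,
  headroomBytes: number = 1 * GB
): boolean {
  return freeDiskBytes - modelSizeBytes >= headroomBytes;
}
```

A 2 GB model downloads fine with 4 GB free, but gets refused at 2.5 GB free because it would leave under 1 GB of headroom.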
A Real-World Architecture: Offline AI Assistant
Here's how you might architect an offline-capable AI assistant that works across platforms:
// ai-provider.ts
import { Platform } from 'react-native';
import NetInfo from '@react-native-community/netinfo';
type AIProvider = 'apple' | 'local-llama' | 'cloud';
export async function selectProvider(): Promise<AIProvider> {
const { isConnected } = await NetInfo.fetch();
// Prefer cloud when online for best quality
if (isConnected) return 'cloud';
// Use Apple Intelligence on iOS (zero download)
if (Platform.OS === 'ios') return 'apple';
// Fall back to downloaded local model
return 'local-llama';
}
export function getModelConfig(provider: AIProvider) {
switch (provider) {
case 'cloud':
return { provider: 'cloud', modelId: 'gpt-4o-mini' };
case 'apple':
return { provider: 'apple', modelId: 'default' };
case 'local-llama':
return {
provider: 'local-llama',
modelId: 'ggml-org/SmolLM3-3B-GGUF/SmolLM3-Q4_K_M.gguf',
};
}
}
This pattern keeps your UI components completely decoupled from the inference backend. Your chat screen calls generateText with whatever model the provider function returns. Users get the best available experience — cloud quality when online, instant local inference when offline.
Frequently Asked Questions
Can I run on-device AI in Expo Go?
No. Both react-native-executorch and react-native-ai include native code that Expo Go doesn't bundle. You'll need a development build created with npx expo run:ios or npx expo run:android, or use EAS Build to create a custom development client.
What's the minimum device requirement for on-device LLMs?
For a small model like SmolLM2 (135M parameters), almost any device from the last five years will work. For 1B–3B models, you need at least 4 GB of total device RAM. For 4B quantized models, target 8 GB or higher. iPhones from the iPhone 12 onward handle 1B–3B models comfortably.
Does react-native-executorch work with the old React Native architecture?
Nope. ExecuTorch requires the New Architecture (Fabric and TurboModules). Since React Native 0.76 made the New Architecture the default and Expo SDK 55+ always enables it, most new projects already meet this requirement. If you're on an older version, you'll need to migrate first.
How large are the model downloads?
It varies quite a bit. SmolLM2 135M is around 100 MB, Llama 3.2 1B is about 1 GB, and a 4B quantized model can be 2–3 GB. Apple Foundation Models require no download at all since they ship with the operating system. Plan your app's storage budget accordingly and always let users opt in before downloading.
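Those sizes fall out of simple arithmetic: bytes ≈ parameter count × bits per weight ÷ 8, plus metadata overhead. A back-of-the-envelope estimator (rough by design; real GGUF files mix precisions and add headers):

```typescript
// Rough download size: params × bits-per-weight / 8 bytes, in GiB.
function estimateModelSizeGb(
  paramCount: number,
  bitsPerWeight: number
): number {
  const bytes = (paramCount * bitsPerWeight) / 8;
  return bytes / 1024 ** 3;
}
```

A 4-billion-parameter model at 4 bits comes out just under 2 GB before overhead, consistent with the ranges above.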
Can I use on-device AI for real-time speech recognition?
Yes! react-native-executorch provides a useWhisper hook powered by OpenAI's Whisper model running locally via ExecuTorch. React Native AI offers transcription through Apple's SpeechAnalyzer on iOS 26+. Both approaches keep audio data entirely on-device.