React Native On-Device AI: Run LLMs Locally with ExecuTorch and React Native AI

Learn how to run LLMs and AI models directly on mobile devices using react-native-executorch and Callstack's react-native-ai. Includes working code examples for building offline AI features in React Native apps.

Running large language models directly on a phone felt like science fiction just two years ago. But here we are in 2026, and it's genuinely practical. Libraries like react-native-executorch and Callstack's react-native-ai let you ship a fully offline chatbot, speech recognizer, or image classifier inside your React Native app — no API keys, no cloud costs, and complete data privacy for your users.

That last point matters more than most people realize.

This guide walks you through both libraries end-to-end: installation, model selection, building a chat UI, tool calling, and the tradeoffs you should weigh before picking one over the other. I've spent a good chunk of time testing both, so I'll share some honest takes along the way.

Why On-Device AI Matters for Mobile Apps

Sending every prompt to a cloud API comes with real costs. Each request adds latency, racks up token charges, and ships potentially sensitive user data off-device. On-device inference eliminates all three problems at once:

  • Privacy — Prompts and responses never leave the phone. This is critical for health, finance, journaling, and enterprise apps that handle personal data.
  • Latency — Token generation starts immediately. No round-trip to a server, no cold-start waiting for a function to spin up.
  • Cost — You pay zero per token. The user's own hardware does the work.
  • Offline support — The model works without a network connection, so your AI features don't break in spotty coverage areas.

The tradeoff? On-device models are smaller and less capable than frontier cloud models. A 3-billion-parameter model running on a phone won't match GPT-4 or Claude Opus on complex reasoning tasks. But for summarization, translation, quick Q&A, image descriptions, and tool-augmented workflows, the quality is more than enough — and it's improving fast.

The On-Device AI Landscape in 2026

Three major libraries dominate the React Native on-device AI space right now:

| Library | Maintained By | Engine | Vercel AI SDK | Key Strength |
| --- | --- | --- | --- | --- |
| react-native-executorch | Software Mansion | Meta ExecuTorch | No | Broad model support (LLM, vision, speech, OCR) |
| react-native-ai | Callstack | llama.rn / MLC / Apple | Yes | Drop-in Vercel AI SDK compatibility |
| llama.rn (standalone) | Community | llama.cpp | Via react-native-ai | Low-level control, GGUF models |
Each takes a different approach. ExecuTorch gives you the widest model coverage and a React-hooks-first API. React Native AI gives you Vercel AI SDK compatibility so you can swap between cloud and local models with a one-line change.

So, let's build with both.

Option 1: react-native-executorch with Expo

Prerequisites

  • Expo SDK 54 or later
  • New Architecture enabled (required — the library doesn't support the old bridge)
  • A physical device or an emulator with at least 4 GB of RAM allocated

Installation

Install the core package and the Expo resource fetcher:

npx expo install react-native-executorch @react-native-executorch/expo-resource-fetcher expo-file-system expo-asset

Then build the native layer. Since ExecuTorch includes native code, you need a development build — Expo Go won't cut it here:

# iOS
npx expo run:ios

# Android
npx expo run:android

Initialize ExecuTorch

Before using any hooks, call initExecutorch once at the top of your app. This registers the resource fetcher that downloads and caches model binaries:

// app/_layout.tsx
import { initExecutorch } from 'react-native-executorch';
import { ExpoResourceFetcher } from '@react-native-executorch/expo-resource-fetcher';

initExecutorch({
  resourceFetcher: ExpoResourceFetcher,
});

Building a Chat Screen with useLLM

The useLLM hook is the main interface for text generation. It handles model loading, token streaming, conversation history, and cleanup — basically everything you'd otherwise wire up yourself:

import React from 'react';
import {
  View,
  Text,
  TextInput,
  FlatList,
  Pressable,
  ActivityIndicator,
  StyleSheet,
} from 'react-native';
import { useLLM, QWEN3_0_6B, Message } from 'react-native-executorch';

export default function ChatScreen() {
  const [input, setInput] = React.useState('');
  const llm = useLLM({ model: QWEN3_0_6B });

  React.useEffect(() => {
    if (!llm.isReady) return; // configure only once the model has loaded
    llm.configure({
      chatConfig: {
        systemPrompt: 'You are a concise, helpful mobile assistant.',
      },
      generationConfig: {
        temperature: 0.7,
        topp: 0.9,
      },
    });
  }, [llm.isReady]);

  const handleSend = () => {
    if (!input.trim() || llm.isGenerating) return;
    llm.sendMessage(input.trim());
    setInput('');
  };

  if (!llm.isReady) {
    return (
      <View style={styles.center}>
        <ActivityIndicator size="large" />
        <Text>Loading model... {Math.round(llm.downloadProgress * 100)}%</Text>
      </View>
    );
  }

  return (
    <View style={styles.container}>
      <FlatList
        data={llm.messageHistory}
        keyExtractor={(_, i) => String(i)}
        renderItem={({ item }) => (
          <View style={[
            styles.bubble,
            item.role === 'user' ? styles.userBubble : styles.aiBubble,
          ]}>
            <Text>{item.content}</Text>
          </View>
        )}
      />
      {llm.isGenerating && (
        <Text style={styles.streaming}>{llm.response}</Text>
      )}
      <View style={styles.inputRow}>
        <TextInput
          style={styles.input}
          value={input}
          onChangeText={setInput}
          placeholder="Ask anything..."
        />
        <Pressable onPress={handleSend} style={styles.sendButton}>
          <Text style={styles.sendText}>Send</Text>
        </Pressable>
      </View>
    </View>
  );
}

const styles = StyleSheet.create({
  container: { flex: 1, padding: 16 },
  center: { flex: 1, justifyContent: 'center', alignItems: 'center' },
  bubble: { padding: 12, borderRadius: 12, marginVertical: 4, maxWidth: '80%' },
  userBubble: { alignSelf: 'flex-end', backgroundColor: '#DCF8C6' },
  aiBubble: { alignSelf: 'flex-start', backgroundColor: '#E8E8E8' },
  streaming: { padding: 12, color: '#666', fontStyle: 'italic' },
  inputRow: { flexDirection: 'row', alignItems: 'center', marginTop: 8 },
  input: { flex: 1, borderWidth: 1, borderColor: '#ccc', borderRadius: 8, padding: 10 },
  sendButton: { marginLeft: 8, backgroundColor: '#007AFF', borderRadius: 8, padding: 12 },
  sendText: { color: '#fff', fontWeight: '600' },
});

The sendMessage method appends the user's message to messageHistory, runs inference, and streams the assistant's reply into the same array. The response property gives you the in-progress text so you can show a live typing indicator. Honestly, it's a pretty slick developer experience.
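One small wrinkle: the snippet above renders the in-progress reply as a separate italic line below the list. If you'd rather show it as a normal chat bubble, merge `response` into the data you hand to `FlatList`. A minimal sketch (the `ChatMessage` type here is a stand-in for the library's `Message`):

```typescript
// Merge the streaming response into the rendered message list so the
// in-progress assistant reply shows up as a regular chat bubble.
type ChatMessage = { role: 'user' | 'assistant' | 'system'; content: string };

function displayMessages(
  history: ChatMessage[],
  response: string,
  isGenerating: boolean
): ChatMessage[] {
  // While generating, append the partial reply as a temporary assistant bubble.
  if (isGenerating && response.length > 0) {
    return [...history, { role: 'assistant', content: response }];
  }
  return history;
}
```

Pass `displayMessages(llm.messageHistory, llm.response, llm.isGenerating)` as the `data` prop and you can drop the separate streaming `<Text>` entirely.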

Model Selection Guide

Choosing the right model depends on your target devices and the complexity of what you need:

| Model | Parameters | RAM Needed | Best For |
| --- | --- | --- | --- |
| SmolLM2 135M | 135M | ~0.5 GB | Simple text completion, autocomplete |
| Qwen3 0.6B | 600M | ~1 GB | Quick Q&A, summarization on low-end devices |
| Llama 3.2 1B | 1B | ~2 GB | General chat, solid quality/speed balance |
| Qwen3 4B (quantized) | 4B | ~3 GB | Best quality for flagship phones |
| LFM2.5-VL 1.6B | 1.6B | ~2.5 GB | Vision-language tasks (image descriptions) |

A good rule of thumb: keep the model under half your target device's total RAM. A 4B quantized model runs well on phones with 8 GB or more, but it'll crash on a 4 GB device. Don't ask me how I found that out.
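That rule of thumb is easy to encode. Here's a hypothetical helper that picks the largest model fitting the half-of-RAM budget; the names and sizes mirror the table above, and in a real app the total-RAM figure would come from a device-info library:

```typescript
// Hypothetical helper: pick the largest model whose RAM requirement stays
// under half the device's total RAM (the rule of thumb above).
type ModelChoice = { name: string; ramGb: number };

// Sorted smallest to largest; sizes taken from the table above.
const MODELS: ModelChoice[] = [
  { name: 'SmolLM2 135M', ramGb: 0.5 },
  { name: 'Qwen3 0.6B', ramGb: 1 },
  { name: 'Llama 3.2 1B', ramGb: 2 },
  { name: 'Qwen3 4B (quantized)', ramGb: 3 },
];

function pickModel(totalRamGb: number): ModelChoice {
  const budget = totalRamGb / 2; // keep the model under half of total RAM
  const viable = MODELS.filter((m) => m.ramGb <= budget);
  // Fall back to the smallest model if nothing fits the budget.
  return viable.length > 0 ? viable[viable.length - 1] : MODELS[0];
}
```

On an 8 GB flagship this returns the quantized 4B model; on a 4 GB device it settles for Llama 3.2 1B, which is exactly the crash-avoiding behavior you want.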

Adding Tool Calling

Tool calling lets the model invoke functions you define — checking the weather, toggling a flashlight, querying a local database. ExecuTorch supports this with models from the Hammer 2.1 and Qwen 3 families:

import React from 'react';
import { useLLM, QWEN3_0_6B } from 'react-native-executorch';

const weatherTool = {
  name: 'get_weather',
  description: 'Get the current weather for a given city',
  parameters: {
    type: 'dict',
    properties: {
      city: { type: 'string', description: 'City name' },
    },
    required: ['city'],
  },
};

function ToolCallingChat() {
  const llm = useLLM({ model: QWEN3_0_6B });

  React.useEffect(() => {
    if (!llm.isReady) return;
    llm.configure({
      chatConfig: {
        systemPrompt: 'You are a helpful assistant with access to tools.',
      },
      toolsConfig: {
        tools: [weatherTool],
        executeToolCallback: async (call) => {
          if (call.toolName === 'get_weather') {
            // Replace with a real API call or local data lookup
            return JSON.stringify({
              city: call.parameters.city,
              temp: '22°C',
              condition: 'Sunny',
            });
          }
          return null;
        },
        displayToolCalls: false,
      },
    });
  }, [llm.isReady]);

  // ... same chat UI as above
}

When the user asks "What's the weather in Tokyo?", the model generates a structured tool call instead of hallucinating an answer. Your callback executes the function, and the result gets fed back into the conversation for the model to format into a natural response. It's a surprisingly clean pattern for extending what a small local model can do.
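To make that flow concrete, here's a standalone sketch of the dispatch step your callback performs. The `ToolCall` shape mirrors the `executeToolCallback` argument above; the handler map is a hypothetical way to scale beyond a single `if` once you register several tools:

```typescript
// Route a parsed tool call (same shape as the callback argument above) to a
// handler; the stringified result is what gets fed back to the model.
type ToolCall = { toolName: string; parameters: Record<string, string> };
type ToolHandler = (params: Record<string, string>) => Promise<string>;

const handlers: Record<string, ToolHandler> = {
  get_weather: async (params) =>
    // Stand-in data; a real handler would hit an API or a local store.
    JSON.stringify({ city: params.city, temp: '22°C', condition: 'Sunny' }),
};

async function executeToolCall(call: ToolCall): Promise<string | null> {
  const handler = handlers[call.toolName];
  if (!handler) return null; // unknown tool: let the model answer normally
  return handler(call.parameters);
}
```

Registering a new tool then becomes one entry in `handlers` plus one schema in the `tools` array, with no changes to the dispatch logic.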

Option 2: Callstack's react-native-ai with Vercel AI SDK

If your project already uses the Vercel AI SDK — or you want the flexibility to swap between a cloud provider and a local model — Callstack's react-native-ai is probably the better fit. It exposes the same generateText and streamText functions you already know.

Installation

The library offers three providers. Pick the one that matches your use case:

# Llama provider (GGUF models from HuggingFace, iOS + Android)
npm install @react-native-ai/llama llama.rn react-native-blob-util ai

# Apple provider (iOS 26+ with Apple Intelligence)
npm install @react-native-ai/apple ai

# MLC provider (compiled models, iOS + Android)
npm install @react-native-ai/mlc ai

Generating Text with the Llama Provider

The Llama provider downloads GGUF-format models from HuggingFace and runs them via llama.rn (a React Native binding for llama.cpp):

import { llama } from '@react-native-ai/llama';
import { generateText, streamText } from 'ai';

// Model ID follows HuggingFace format: owner/repo/filename.gguf
const model = llama.languageModel(
  'ggml-org/SmolLM3-3B-GGUF/SmolLM3-Q4_K_M.gguf'
);

async function runChat() {
  // Step 1: Download the model (runs once, cached afterward)
  await model.download((progress) => {
    console.log(`Download: ${progress.percentage}%`);
  });

  // Step 2: Load into memory
  await model.prepare();

  // Step 3: Generate a response
  const { text } = await generateText({
    model,
    messages: [
      { role: 'system', content: 'You are a helpful assistant.' },
      { role: 'user', content: 'Summarize the benefits of on-device AI in three bullets.' },
    ],
  });

  console.log(text);

  // Step 4: Free memory when done
  await model.unload();
}

Streaming Responses

For a chat UI where you want tokens appearing one by one, use streamText instead:

import { llama } from '@react-native-ai/llama';
import { streamText } from 'ai';

const model = llama.languageModel(
  'bartowski/Llama-3.2-3B-Instruct-GGUF/Llama-3.2-3B-Instruct-Q4_K_M.gguf'
);

async function streamChat(userMessage: string) {
  await model.prepare();

  const result = streamText({
    model,
    messages: [
      { role: 'system', content: 'You are a concise assistant.' },
      { role: 'user', content: userMessage },
    ],
  });

  let fullText = '';
  for await (const chunk of result.textStream) {
    fullText += chunk; // in a real UI, push each chunk into React state instead
  }

  await model.unload();
  return fullText;
}

Using Apple Foundation Models (iOS 26+)

On devices running iOS 26 with Apple Intelligence, you can use Apple's built-in on-device model with zero download time. This is kind of a game-changer for iOS apps:

import { apple } from '@react-native-ai/apple';
import { generateText, embed } from 'ai';

// Text generation — no model download needed
const { text } = await generateText({
  model: apple(),
  prompt: 'Explain the key differences between React Native and Flutter.',
});

// Embeddings (available from iOS 17+)
const { embedding } = await embed({
  model: apple.textEmbeddingModel(),
  value: 'React Native on-device AI',
});

This is the fastest path to on-device AI on iOS because the model is already there. No download, no storage cost, no RAM management headaches. The tradeoff is that it only works on recent iPhones and iPads with Apple Intelligence support.

Switching Between Cloud and On-Device

One of the most powerful patterns with react-native-ai is seamlessly falling back between cloud and local inference. Since the API follows the standard Vercel AI SDK interface, switching is just a model swap:

import { llama } from '@react-native-ai/llama';
import { apple } from '@react-native-ai/apple';
import { openai } from '@ai-sdk/openai'; // cloud provider
import { generateText } from 'ai';
import { Platform } from 'react-native';
import NetInfo from '@react-native-community/netinfo';

async function getModel() {
  const networkState = await NetInfo.fetch();

  if (networkState.isConnected) {
    // Use cloud when online for best quality
    return openai('gpt-4o-mini');
  }

  if (Platform.OS === 'ios') {
    // Use Apple Intelligence on iOS (no download)
    return apple();
  }

  // Fall back to local GGUF model
  const local = llama.languageModel(
    'ggml-org/SmolLM3-3B-GGUF/SmolLM3-Q4_K_M.gguf'
  );
  await local.prepare();
  return local;
}

// Usage stays identical regardless of provider
const model = await getModel();
const { text } = await generateText({
  model,
  prompt: 'What is the capital of France?',
});

Choosing Between ExecuTorch and React Native AI

Both libraries are production-ready, but they serve different architectural preferences. Here's how they stack up:

| Criteria | react-native-executorch | react-native-ai |
| --- | --- | --- |
| API style | React hooks (useLLM, useWhisper) | Vercel AI SDK (generateText, streamText) |
| Model formats | .pte (ExecuTorch) | .gguf (Llama), MLC compiled, Apple built-in |
| Beyond LLMs | Whisper (ASR), CLIP (vision), OCR | Embeddings, TTS, transcription (via Apple) |
| Cloud fallback | Not built-in | Native — swap any Vercel AI SDK provider |
| Managed conversation | Built-in messageHistory with context strategies | You manage state yourself |
| Tool calling | Built-in with configure() | Via Vercel AI SDK tools parameter |
| Hardware acceleration | CoreML, Vulkan, mobile GPU backends | llama.cpp optimizations, Metal/GPU on Apple |

Go with ExecuTorch if you want a self-contained React hooks API, need non-LLM models (speech-to-text, image classification, OCR), or prefer Meta's ecosystem with hardware-optimized backends.

Go with React Native AI if your codebase already uses the Vercel AI SDK, you want to switch between cloud and local models transparently, or you'd like to leverage Apple Intelligence on iOS without downloading anything.

Personally, I think the Vercel AI SDK compatibility of react-native-ai is its killer feature. Being able to prototype with a cloud model and then swap to local for production (or vice versa) without changing your UI code is incredibly convenient.

Performance Tips and Gotchas

Memory Management

Running an LLM on a phone is memory-intensive. Here are practical steps to avoid crashes:

  • Always unload models when leaving a screen. With react-native-ai, call model.unload(). With ExecuTorch, interrupt generation before unmounting — failing to do so will crash the app.
  • Monitor memory using Xcode Instruments (iOS) or Android Studio Profiler. Watch for memory spikes during model loading.
  • Use quantized models whenever possible. A 4-bit quantized 3B model uses roughly half the memory of its full-precision version with minimal quality loss.
  • Set context window limits in ExecuTorch using SlidingWindowContextStrategy to prevent conversation history from consuming unbounded memory.
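If you're curious what a sliding-window strategy does under the hood, the core idea fits in a few lines. This is an illustration of the concept, not the library's actual implementation:

```typescript
// Sliding-window context, sketched: keep the system prompt plus only the
// most recent turns, so the prompt fed to the model (and the memory the KV
// cache needs) stays bounded no matter how long the conversation runs.
type Msg = { role: 'system' | 'user' | 'assistant'; content: string };

function slidingWindow(history: Msg[], maxTurns: number): Msg[] {
  const system = history.filter((m) => m.role === 'system');
  const turns = history.filter((m) => m.role !== 'system');
  // Keep only the last `maxTurns` non-system messages.
  return [...system, ...turns.slice(-maxTurns)];
}
```

The tradeoff is that the model forgets anything outside the window, which is usually acceptable for assistant-style chat and far better than an out-of-memory crash.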

Battery and Thermal Considerations

AI inference is computationally heavy. During extended generation sessions, devices get warm and battery drain accelerates. A few things worth keeping in mind:

  • Limit maximum generation length (256 tokens is usually enough for quick answers).
  • Show users a clear indicator when the model is working so they understand what's happening.
  • Avoid running inference in the background — your users' battery life will thank you.

Model Download Strategy

Models range from 100 MB to 3+ GB. Please don't download them on first app launch. Instead:

  • Let users opt in to the AI feature and trigger the download explicitly.
  • Show download progress and allow cancellation.
  • Cache models on disk so subsequent launches are instant.
  • For the Apple provider, there's no download — the model ships with the OS.
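Putting the first three points together, an opt-in download flow might look like this sketch. The `DownloadableModel` type mirrors the `download(onProgress)` shape the Llama provider exposes earlier in this guide; everything else is a hypothetical wrapper you'd wire to your own UI state:

```typescript
// Opt-in model download: never fetch gigabytes of weights without explicit
// user consent, and surface progress so the UI can show it (and offer cancel).
type DownloadableModel = {
  download: (onProgress: (p: { percentage: number }) => void) => Promise<void>;
};

async function downloadWithConsent(
  model: DownloadableModel,
  userOptedIn: boolean,
  onProgress: (percentage: number) => void
): Promise<boolean> {
  if (!userOptedIn) return false; // no silent multi-GB downloads
  await model.download((p) => onProgress(p.percentage));
  return true; // cached on disk by the provider, so next launch skips this
}
```

In the app you'd pass the real model object and feed `onProgress` into a progress bar; the boolean return lets the calling screen decide whether to enable the AI feature.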

A Real-World Architecture: Offline AI Assistant

Here's how you might architect an offline-capable AI assistant that works across platforms:

// ai-provider.ts
import { Platform } from 'react-native';
import NetInfo from '@react-native-community/netinfo';

type AIProvider = 'apple' | 'local-llama' | 'cloud';

export async function selectProvider(): Promise<AIProvider> {
  const { isConnected } = await NetInfo.fetch();

  // Prefer cloud when online for best quality
  if (isConnected) return 'cloud';

  // Use Apple Intelligence on iOS (zero download)
  if (Platform.OS === 'ios') return 'apple';

  // Fall back to downloaded local model
  return 'local-llama';
}

export function getModelConfig(provider: AIProvider) {
  switch (provider) {
    case 'cloud':
      return { provider: 'cloud', modelId: 'gpt-4o-mini' };
    case 'apple':
      return { provider: 'apple', modelId: 'default' };
    case 'local-llama':
      return {
        provider: 'local-llama',
        modelId: 'ggml-org/SmolLM3-3B-GGUF/SmolLM3-Q4_K_M.gguf',
      };
  }
}

This pattern keeps your UI components completely decoupled from the inference backend. Your chat screen calls generateText with whatever model the provider function returns. Users get the best available experience — cloud quality when online, instant local inference when offline.

Frequently Asked Questions

Can I run on-device AI in Expo Go?

No. Both react-native-executorch and react-native-ai include native code that Expo Go doesn't bundle. You'll need a development build created with npx expo run:ios or npx expo run:android, or use EAS Build to create a custom development client.

What's the minimum device requirement for on-device LLMs?

For a small model like SmolLM2 (135M parameters), almost any device from the last five years will work. For 1B–3B models, you need at least 4 GB of total device RAM. For 4B quantized models, target 8 GB or higher. iPhones from the iPhone 12 onward handle 1B–3B models comfortably.

Does react-native-executorch work with the old React Native architecture?

Nope. ExecuTorch requires the New Architecture (Fabric and TurboModules). Since React Native 0.76 made the New Architecture the default and Expo SDK 55+ always enables it, most new projects already meet this requirement. If you're on an older version, you'll need to migrate first.

How large are the model downloads?

It varies quite a bit. SmolLM2 135M is around 100 MB, Llama 3.2 1B is about 1 GB, and a 4B quantized model can be 2–3 GB. Apple Foundation Models require no download at all since they ship with the operating system. Plan your app's storage budget accordingly and always let users opt in before downloading.

Can I use on-device AI for real-time speech recognition?

Yes! react-native-executorch provides a useWhisper hook powered by OpenAI's Whisper model running locally via ExecuTorch. React Native AI offers transcription through Apple's SpeechAnalyzer on iOS 26+. Both approaches keep audio data entirely on-device.
