Llama 2: Why Local Inference in C Matters for Node Devs

Felipe Hlibco

Two weeks ago Meta released Llama 2 with a commercial license. That alone was significant: for the first time, a state-of-the-art large language model came with a license that lets you legally ship it in a product. But the thing that got me out of my chair was what Andrej Karpathy did with it eight days later.

He wrote Llama 2 inference in ~500 lines of pure C. No libraries. No frameworks. No PyTorch, no CUDA, no nothing. Just C and math. The repo is called llama2.c, and on an M1 MacBook Air it generates tokens at interactive speeds with small checkpoints; the full 7B model runs too, just much more slowly.

Why should Node developers care about C code?

The Dependency Wall #

If you’ve tried running any LLM locally, you know the drill. Install Python 3.10 (not 3.11, that breaks things). Set up a virtual environment. Install PyTorch — the right version, for your hardware, with or without CUDA. Download model weights. Convert them to the right format. Pray.

Georgi Gerganov’s llama.cpp project (started back in March) proved that quantized inference on consumer hardware was possible and practical. But it’s still a substantial C++ codebase with its own build system and model format.

Karpathy stripped it down to the absolute minimum. Five hundred lines. One file. Compile with gcc and run. That’s it.

For the Node ecosystem, this matters enormously. A single C file with zero external dependencies is trivially wrappable via N-API. You write a native addon, compile it as part of npm install, and suddenly your Node application has local LLM inference without spawning Python subprocesses or managing separate services.

Three Paths for Node #

I see three realistic integration paths here:

N-API native addon. Take llama2.c, wrap the inference function in a Node native addon using node-addon-api. The model loads into the same process; inference calls are just function calls. Latency is minimal. The downside is you need a C compiler at install time, but node-gyp already handles that for plenty of popular packages.
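To make the same-process idea concrete, here's a sketch of what the JavaScript surface might look like. Everything below is hypothetical: the function names, the model filename, and the stub bodies are illustrative, not llama2.c's actual API. In a real addon, `loadModel` and `generate` would be implemented in C and exposed via node-addon-api.

```javascript
// Hypothetical interface for an N-API llama2 addon.
// In a real addon this line would be:
//   const addon = require('./build/Release/llama2');
// A stub stands in here so the shape of the API is concrete.
const addon = {
  loadModel(path) {
    // real implementation: mmap the weights file in C
    return { path };
  },
  generate(model, prompt, opts = {}) {
    // real implementation: run the transformer forward loop in C,
    // sampling up to opts.steps tokens
    return prompt + ' ...';
  },
};

const model = addon.loadModel('llama2_7b.bin'); // assumed filename
const out = addon.generate(model, 'Once upon a time', { steps: 64 });
```

The key property is that `generate` is a plain synchronous function call into native code: no subprocess, no socket, no serialization boundary.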

WebAssembly. Compile llama2.c to Wasm using Emscripten. Now it runs in any JavaScript runtime — Node, Deno, even the browser. Performance takes a hit (Wasm is typically 1.5-2x slower than native C), but the portability is unbeatable. No native compilation step; just ship a .wasm file.

Edge runtime deployment. Cloudflare Workers and similar runtimes support Wasm. A quantized small model running in Wasm at the edge means per-request inference without round-tripping to a GPU cluster. The 7B model is too large for most edge environments today, but smaller fine-tuned models (or aggressive quantization) could fit.
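For a feel of what quantization buys you, here's a sketch of symmetric int8 quantization, the basic trick that shrinks weights roughly 4x versus float32. The function names are illustrative, not llama2.c's; real schemes quantize per-group and handle edge cases this toy version skips (e.g. an all-zero weight vector).

```javascript
// Symmetric int8 quantization: scale so the largest |weight| maps to 127.
function quantize(weights) {
  const scale = Math.max(...weights.map(Math.abs)) / 127;
  const q = Int8Array.from(weights, (x) => Math.round(x / scale));
  return { q, scale }; // 1 byte per weight + one scale per group
}

function dequantize({ q, scale }) {
  return Array.from(q, (v) => v * scale);
}

const w = [0.5, -1.0, 0.25, 0.125];
const packed = quantize(w);
const restored = dequantize(packed);
// each restored value differs from the original by at most scale / 2
```

The storage win is what matters at the edge: one byte per weight instead of four, at the cost of a bounded rounding error per weight.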

Why This Isn’t Just Academic #

The practical unlock here is about privacy, latency, and cost. Every API call to GPT-4 sends your user’s data to a third-party server. It adds 200-500ms of network latency. It costs money per token. For a lot of production use cases — code completion in an IDE, local document summarization, on-device chat in a mobile app — those constraints are dealbreakers.

Local inference eliminates all three. The data never leaves the device. Latency is just compute time. And after you download the model weights, inference is free.
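The cost point is easy to make concrete with back-of-envelope math. The numbers below use GPT-4 8k pricing as published in mid-2023 ($0.03 per 1K prompt tokens, $0.06 per 1K completion tokens); check current rates before relying on them, and the workload figures are made up for illustration.

```javascript
// Rough monthly API cost for a token-metered endpoint.
// Pricing assumptions: GPT-4 8k, mid-2023 published rates.
function monthlyApiCost({ requests, promptTokens, completionTokens }) {
  const perRequest =
    (promptTokens / 1000) * 0.03 + (completionTokens / 1000) * 0.06;
  return requests * perRequest;
}

// Hypothetical IDE-completion workload: 100k requests/month,
// ~500 prompt tokens and ~100 completion tokens each.
const cost = monthlyApiCost({
  requests: 100_000,
  promptTokens: 500,
  completionTokens: 100,
});
console.log(cost); // ~2100 dollars per month
```

The local-inference version of that bill is the one-time model download plus electricity.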

Llama 2’s commercial license makes this legal. Karpathy’s llama2.c makes it simple. The Node ecosystem just needs someone to build the bridge.

I’ve been tinkering with the N-API path over the past few days. It’s early, and the 7B model needs ~13GB of RAM without quantization (which is a problem). But it works. And working is the important part; optimization comes later.
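The memory math is worth spelling out. Llama 2 7B has about 6.74B parameters, so the footprint is just parameters times bytes per parameter: the ~13GB figure corresponds to 16-bit weights, float32 is roughly double that, and int8 quantization halves the 16-bit number again.

```javascript
// Weight-memory footprint for Llama 2 7B at different precisions.
const params = 6.74e9; // approximate parameter count of the 7B checkpoint
const gib = (bytesPerParam) => (params * bytesPerParam) / 2 ** 30;

console.log(gib(4).toFixed(1)); // float32: ~25.1 GiB
console.log(gib(2).toFixed(1)); // fp16:   ~12.6 GiB
console.log(gib(1).toFixed(1)); // int8:   ~6.3 GiB
```

And that's weights only; the KV cache and activations add more on top, growing with context length.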

If you’re a Node developer who’s been watching the LLM space from the sidelines because the tooling felt too Python-centric, this is your on-ramp.