Curated Local AI

A curated, searchable list of tools for running AI models locally -- inference engines, UIs, quantization, fine-tuning, RAG, and more.

Inference Engines

Ollama

Get up and running with large language models locally. Simple CLI with model management.

macOS · Linux · Windows · MIT
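Beyond the CLI, Ollama serves a local REST API (by default on `localhost:11434`). A minimal sketch of a non-streaming chat request using only the standard library, assuming the server is running and a model such as `llama3` has already been pulled:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # Ollama's default local endpoint

def build_chat_request(model: str, prompt: str) -> dict:
    """Build the JSON payload for Ollama's /api/chat endpoint."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # return a single JSON object instead of a stream
    }

def chat(model: str, prompt: str) -> str:
    """Send one chat turn to a locally running Ollama server."""
    payload = build_chat_request(model, prompt)
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["message"]["content"]

# usage (requires a running Ollama server):
#   reply = chat("llama3", "Why is the sky blue?")
```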

llama.cpp

LLM inference in C/C++. The foundational project for GGUF-based local inference.

Cross-platform · MIT

vLLM

High-throughput and memory-efficient inference and serving engine for LLMs.

Linux · Apache-2.0

TensorRT-LLM

NVIDIA's library for optimizing LLM inference on NVIDIA GPUs.

NVIDIA · Apache-2.0

MLX

Array framework for machine learning on Apple silicon, by Apple.

Apple Silicon · MIT

MLC LLM

Universal LLM deployment engine with ML compilation.

Cross-platform · Apache-2.0

ExLlamaV2

Fast inference library for running LLMs locally on modern consumer GPUs.

Linux · Windows · MIT

LocalAI

Drop-in replacement REST API compatible with OpenAI's. No GPU required.

Cross-platform · MIT

llamafile

Distribute and run LLMs with a single file. By Mozilla.

Cross-platform · Apache-2.0

candle

Minimalist ML framework for Rust with a focus on performance.

Cross-platform · MIT

llama-cpp-python

Python bindings for llama.cpp with OpenAI-compatible API server.

Cross-platform · MIT

LMDeploy

Toolkit for compressing, deploying, and serving LLMs with high throughput.

Linux · Apache-2.0

SGLang

Fast serving framework for large language and vision models.

Linux · Apache-2.0

Desktop & Web UIs

Open WebUI

Feature-rich, self-hosted web UI for LLMs. Supports Ollama and OpenAI-compatible APIs.

Self-hosted · MIT

LM Studio

Desktop app to discover, download, and run local LLMs.

Desktop · Free

GPT4All

Open-source large language model chatbot ecosystem. Run models entirely offline.

Desktop · MIT

Jan

Open-source alternative to ChatGPT that runs 100% offline.

Desktop · AGPL-3.0

Text Generation WebUI

Gradio-based web UI for running large language models.

Self-hosted · AGPL-3.0

KoboldCpp

Easy-to-use AI text generation with GGUF support.

Desktop · AGPL-3.0

SillyTavern

LLM frontend for power users with advanced character features.

Self-hosted · AGPL-3.0

AnythingLLM

All-in-one desktop and Docker AI app with built-in RAG and agents.

Desktop · MIT

Chatbox

Desktop client for ChatGPT, Claude, and local models.

Desktop · GPL-3.0

LibreChat

Enhanced ChatGPT clone supporting many AI providers.

Self-hosted · MIT

Lobe Chat

Modern-design ChatGPT/LLM UI with Ollama support.

Self-hosted · MIT

Model Hubs & Formats

Hugging Face Hub

The largest open-source ML model repository.

Model Hub

GGUF Format

Binary format for fast loading and saving of models used by llama.cpp.

Format
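Every GGUF file opens with a fixed header: the 4-byte magic `GGUF`, a uint32 version, then uint64 tensor and metadata-key counts, all little-endian. A small sketch parsing those fixed fields (metadata parsing beyond the header is omitted):

```python
import struct

GGUF_MAGIC = b"GGUF"  # first four bytes of every GGUF file

def read_gguf_header(data: bytes) -> dict:
    """Parse the fixed GGUF header: magic, version, tensor count, metadata KV count."""
    if data[:4] != GGUF_MAGIC:
        raise ValueError("not a GGUF file")
    version, tensor_count, metadata_kv_count = struct.unpack_from("<IQQ", data, 4)
    return {
        "version": version,
        "tensors": tensor_count,
        "metadata_kvs": metadata_kv_count,
    }

# Synthetic header illustrating the layout: version 3, 291 tensors, 24 metadata keys.
header = GGUF_MAGIC + struct.pack("<IQQ", 3, 291, 24)
```

In practice you would read the first 28 bytes of a `.gguf` file instead of building the header by hand.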

GGML

Tensor library for ML. Foundation behind GGUF and llama.cpp.

Library

Ollama Library

Curated models optimized and packaged for Ollama.

Model Library

TheBloke

GGUF, GPTQ, and AWQ versions of popular models.

Quantized Models

SafeTensors

Safe way to store and distribute tensors by Hugging Face.

Format

Quantization Tools

GPTQ

Accurate post-training quantization for generative pre-trained transformers.

4-bit

AutoGPTQ

Easy-to-use LLM quantization package with user-friendly APIs.

4-bit

AWQ

Activation-aware weight quantization for efficient LLM compression.

4-bit

bitsandbytes

Lightweight CUDA wrapper for 8-bit and 4-bit quantization.

4/8-bit
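The payoff of 4- and 8-bit quantization is weight memory. A back-of-envelope estimator (weights only; KV cache, activations, and framework overhead come on top):

```python
def weight_gb(n_params: float, bits: int) -> float:
    """Approximate weight memory in GB: parameters x (bits / 8) bytes each."""
    return n_params * bits / 8 / 1e9

# A 7B-parameter model: ~14 GB at fp16, ~7 GB at 8-bit, ~3.5 GB at 4-bit.
sizes = {bits: weight_gb(7e9, bits) for bits in (16, 8, 4)}
```

This is why a 7B model that won't fit on an 8 GB GPU at fp16 runs comfortably at 4-bit.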

QLoRA

Efficient finetuning of quantized LLMs.

Fine-tune

HQQ

Half-Quadratic Quantization -- fast and accurate.

Post-training

QuIP#

Extreme LLM compression with incoherence processing.

2-bit

AQLM

State-of-the-art additive quantization for 2-bit compression.

2-bit

Fine-tuning

Unsloth

Finetune LLMs 2-5x faster with 80% less memory.

LoRA · QLoRA

Axolotl

Streamlined tool for fine-tuning LLMs with many methods.

Full · LoRA

LLaMA-Factory

Unified framework for fine-tuning 100+ LLMs with a web UI.

Full · LoRA

PEFT

Parameter-Efficient Fine-Tuning by Hugging Face.

LoRA · Adapters
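The LoRA method underlying most of these tools approximates the weight update of a d×k layer with two low-rank factors B (d×r) and A (r×k), so only r·(d+k) parameters train instead of d·k. A quick calculator showing the reduction:

```python
def lora_params(d: int, k: int, r: int) -> int:
    """Trainable parameters LoRA adds to one d x k weight: B is d x r, A is r x k."""
    return r * (d + k)

def full_params(d: int, k: int) -> int:
    """Parameters updated by full fine-tuning of the same weight."""
    return d * k

# A 4096 x 4096 attention projection with rank r = 8:
full = full_params(4096, 4096)     # 16,777,216
lora = lora_params(4096, 4096, 8)  # 65,536
reduction = full // lora           # 256x fewer trainable parameters
```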

TRL

Transformer Reinforcement Learning: RLHF, DPO, PPO.

RLHF · DPO

torchtune

PyTorch-native library for LLM fine-tuning.

Full · LoRA

LitGPT

Pretrain, finetune, and deploy 20+ LLMs. By Lightning AI.

Full · LoRA

Local RAG

PrivateGPT

Interact with your documents using LLMs, 100% privately.

Ollama

Quivr

Your second brain, powered by generative AI.

Multiple

LocalGPT

Chat with your documents on your local device.

HuggingFace

Khoj

Personal AI assistant for notes, documents, and images.

Multiple

RAGFlow

Open-source RAG engine based on deep document understanding.

Multiple

kotaemon

Clean, customizable RAG UI for chatting with your documents.

Multiple

Haystack

LLM orchestration for RAG, agents, and search pipelines.

Multiple
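All of the tools above share one core loop: embed document chunks, retrieve the nearest ones for a query, and stuff them into the prompt. A dependency-free sketch of the retrieval step, using bag-of-words cosine similarity as a stand-in for a real embedding model:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; real RAG systems use a neural embedding model."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, chunks: list[str], k: int = 1) -> list[str]:
    """Return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

chunks = [
    "GGUF is the model format used by llama.cpp.",
    "Piper is a fast local text-to-speech engine.",
]
top = retrieve("what format does llama.cpp use?", chunks)
```

The retrieved chunks would then be prepended to the user's question in the final LLM prompt.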

Voice & Speech

whisper.cpp

High-performance inference of OpenAI's Whisper in C/C++.

STT
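whisper.cpp expects its audio input as 16 kHz, mono, 16-bit PCM WAV. A stdlib-only sketch that writes float samples in that format (here a synthetic test tone standing in for real audio):

```python
import math
import struct
import wave

SAMPLE_RATE = 16_000  # whisper.cpp expects 16 kHz mono 16-bit PCM

def write_wav(path: str, samples: list[float]) -> None:
    """Write float samples in [-1, 1] as a 16 kHz mono 16-bit PCM WAV file."""
    with wave.open(path, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)  # 2 bytes = 16-bit samples
        w.setframerate(SAMPLE_RATE)
        frames = b"".join(
            struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767)) for s in samples
        )
        w.writeframes(frames)

# One second of a 440 Hz test tone:
tone = [0.5 * math.sin(2 * math.pi * 440 * t / SAMPLE_RATE) for t in range(SAMPLE_RATE)]
write_wav("test.wav", tone)
```

For real recordings, resample to 16 kHz first (e.g. with ffmpeg) before handing the file to whisper.cpp.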

Piper

Fast, local neural text-to-speech. Optimized for Raspberry Pi.

TTS

Coqui TTS

Deep learning toolkit for text-to-speech with many voices.

TTS

Bark

Text-prompted generative audio model by Suno.

Audio

faster-whisper

Whisper reimplementation using CTranslate2, up to 4x faster.

STT

Vosk

Offline speech recognition supporting 20+ languages.

STT

StyleTTS 2

Human-level TTS through style diffusion.

TTS

Image Generation

ComfyUI

Modular Stable Diffusion GUI with node-based workflow.

PyTorch
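Stable Diffusion UIs like this one operate in a latent space: the VAE downscales each spatial dimension by 8, and SD 1.x/SDXL use 4 latent channels (newer models such as FLUX differ), which is why image dimensions must be multiples of 8. A small shape calculator under those assumptions:

```python
VAE_SCALE = 8        # Stable Diffusion's VAE downscales each spatial dim by 8
LATENT_CHANNELS = 4  # SD 1.x / SDXL latent channels; other models may differ

def latent_shape(width: int, height: int) -> tuple[int, int, int]:
    """Latent tensor shape (C, H/8, W/8) for a given image size."""
    if width % VAE_SCALE or height % VAE_SCALE:
        raise ValueError("image dimensions should be multiples of 8")
    return (LATENT_CHANNELS, height // VAE_SCALE, width // VAE_SCALE)

shape = latent_shape(512, 512)  # (4, 64, 64)
```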

Automatic1111

The most popular Stable Diffusion web UI.

PyTorch

Fooocus

Offline, open source Midjourney-like experience.

PyTorch

InvokeAI

Professional creative engine with polished node-based UI.

PyTorch

Forge

Optimized A1111 fork with better memory management.

PyTorch

FLUX.1

State-of-the-art text-to-image model by Black Forest Labs.

PyTorch

kohya-ss

Training scripts for LoRA, DreamBooth, and textual inversion.

PyTorch

Hardware Guides

GPU Benchmarks

Comprehensive GPU benchmarks for local LLM inference.

Benchmarks
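When sizing a GPU, weights are only part of the story: the KV cache grows linearly with context length, at roughly 2 (K and V) × layers × KV heads × head dim × bytes per element per token. A rough estimator, using Llama-2-7B's published dimensions as the example:

```python
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context: int, bytes_per_el: int = 2) -> float:
    """Approximate KV-cache size in GB for a dense-attention transformer."""
    return 2 * layers * kv_heads * head_dim * context * bytes_per_el / 1e9

# Llama-2-7B (32 layers, 32 KV heads, head_dim 128) at a 4096-token fp16 context:
size = kv_cache_gb(32, 32, 128, 4096)  # ~2.1 GB on top of the weights
```

Models using grouped-query attention shrink this by reducing `kv_heads`, which is why newer architectures handle long contexts more cheaply.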

MLX Benchmarks

Apple Silicon performance examples and benchmarks.

Apple Silicon

AMD ROCm

AMD's open-source GPU computing platform for ML.

AMD

RTX AI Toolkit

NVIDIA's toolkit for AI on RTX GPUs.

NVIDIA

Want to add a tool?

Submit a pull request or open an issue on GitHub.