AirLLM is a memory-optimized inference framework that runs 70B+ parameter LLMs on a single 4GB GPU without quantization. It decomposes a model into layer-wise shards that are loaded and unloaded dynamically during inference, so only a small part of the model needs to be resident at a time. It supports the Llama, Mistral, Qwen, and ChatGLM architectures, offers optional 4-bit/8-bit compression for roughly 3x speed gains, and runs cross-platform, including on macOS, making massive models accessible on consumer hardware.
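The layer-wise sharding idea can be sketched in plain Python. This is a hypothetical toy, not AirLLM's actual internals: each "shard" here is a scalar weight and the layer computation is a simple scaling, but it shows the pattern of loading one layer, applying it, and freeing it before the next, so peak memory holds only one layer at a time.

```python
# Toy sketch of layer-wise sharded inference (hypothetical, not AirLLM's
# real implementation): load a shard, apply it, free it, repeat.

def make_layer_loader(layer_weights):
    """Simulate fetching one layer's shard from disk on demand."""
    def load(i):
        return layer_weights[i]  # in practice: read the layer's weight file
    return load

def sharded_forward(x, num_layers, load_layer):
    peak_resident = 0  # track how many layers are in memory at once
    resident = 0
    for i in range(num_layers):
        w = load_layer(i)          # load shard i into memory
        resident += 1
        peak_resident = max(peak_resident, resident)
        x = [w * v for v in x]     # stand-in for the layer computation
        del w                      # unload the shard before the next layer
        resident -= 1
    return x, peak_resident

weights = [2, 3, 0.5]              # toy per-layer scale factors
out, peak = sharded_forward([1.0, 2.0], 3, make_layer_loader(weights))
print(out, peak)                   # peak is 1: one layer resident at a time
```

The trade-off is that every forward pass re-reads each shard, so throughput is bounded by storage bandwidth rather than GPU memory.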
git clone https://github.com/lyogavin/airllm.git
from airllm import AutoModel

# Run a 70B model on a 4GB GPU; layers load and unload on demand
model = AutoModel.from_pretrained("meta-llama/Llama-2-70b-hf")
input_text = ["Explain quantum computing"]
# generate() takes token ids, not raw text, so tokenize first
input_tokens = model.tokenizer(input_text, return_tensors="pt",
                               truncation=True, max_length=128)
output = model.generate(input_tokens["input_ids"].cuda(),
                        max_new_tokens=200, use_cache=True,
                        return_dict_in_generate=True)
print(model.tokenizer.decode(output.sequences[0]))