AirLLM is a memory-optimized inference framework that runs 70B+ parameter LLMs on a single 4GB GPU without quantization. It decomposes a model into layer-wise shards that are loaded and unloaded dynamically during inference, so only a small part of the model needs to be resident at a time. It supports the Llama, Mistral, Qwen, and ChatGLM architectures, offers optional 4-bit/8-bit compression for roughly 3x speed gains, and runs cross-platform, including on macOS, making massive models accessible on consumer hardware.
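The layer-wise sharding idea can be sketched in plain Python. This is a hypothetical toy, not AirLLM's actual internals: each "shard" here is a scalar weight and the layer computation is a simple scaling, but it shows the pattern of loading one layer, applying it, and freeing it before the next, so peak memory holds only one layer at a time.

```python
# Toy sketch of layer-wise sharded inference (hypothetical, not AirLLM's
# real implementation): load a shard, apply it, free it, repeat.

def make_layer_loader(layer_weights):
    """Simulate fetching one layer's shard from disk on demand."""
    def load(i):
        return layer_weights[i]  # in practice: read the layer's weight file
    return load

def sharded_forward(x, num_layers, load_layer):
    peak_resident = 0  # track how many layers are in memory at once
    resident = 0
    for i in range(num_layers):
        w = load_layer(i)          # load shard i into memory
        resident += 1
        peak_resident = max(peak_resident, resident)
        x = [w * v for v in x]     # stand-in for the layer computation
        del w                      # unload the shard before the next layer
        resident -= 1
    return x, peak_resident

weights = [2, 3, 0.5]              # toy per-layer scale factors
out, peak = sharded_forward([1.0, 2.0], 3, make_layer_loader(weights))
print(out, peak)                   # peak is 1: one layer resident at a time
```

The trade-off is that every forward pass re-reads each shard, so throughput is bounded by storage bandwidth rather than GPU memory.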
git clone https://github.com/lyogavin/airllm.git
from airllm import AutoModel

# Run a 70B model on a 4GB GPU; layers load and unload on demand
model = AutoModel.from_pretrained("meta-llama/Llama-2-70b-hf")
input_text = ["Explain quantum computing"]
# generate() takes token ids, not raw text, so tokenize first
input_tokens = model.tokenizer(input_text, return_tensors="pt",
                               truncation=True, max_length=128)
output = model.generate(input_tokens["input_ids"].cuda(),
                        max_new_tokens=200, use_cache=True,
                        return_dict_in_generate=True)
print(model.tokenizer.decode(output.sequences[0]))