AirLLM enables inference with 70B-parameter large language models on a single 4GB GPU. It uses layer-by-layer computation and other memory-efficient techniques to run massive models on consumer hardware that would normally require an expensive multi-GPU setup, making state-of-the-art models accessible to researchers and developers with limited GPU resources.
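The layer-by-layer idea can be sketched in plain Python. This is a toy illustration, not AirLLM's actual code: the function name and the "layers" here are hypothetical, standing in for transformer layers whose weights AirLLM streams from disk to the GPU one at a time so that peak memory stays around the size of a single layer.

```python
import gc

def run_layer_by_layer(layers_on_disk, hidden):
    """Sketch of layer-by-layer inference: only one layer's weights
    are resident at a time, so peak memory is roughly one layer."""
    for load_layer in layers_on_disk:
        layer = load_layer()    # load this layer's weights (disk -> GPU in AirLLM)
        hidden = layer(hidden)  # run this layer's forward pass on the activations
        del layer               # free the weights before loading the next layer
        gc.collect()
    return hidden

# Toy example: each "layer" just doubles the activation.
layers = [lambda: (lambda h: h * 2) for _ in range(4)]
print(run_layer_by_layer(layers, 1))  # -> 16
```

The trade-off is speed: weights are re-read from disk on every forward pass, so this is far slower than keeping the whole model in GPU memory.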
git clone https://github.com/0xSojalSec/airllm.git
from airllm import AutoModel

# Load a 70B model on a 4GB GPU; weights are streamed layer by layer
model = AutoModel.from_pretrained(
    "meta-llama/Llama-2-70b-hf"
)

# Tokenize the prompt, then generate
input_tokens = model.tokenizer(
    ["Explain quantum computing"],
    return_tensors="pt",
    truncation=True,
    max_length=128,
)
generation_output = model.generate(
    input_tokens["input_ids"].cuda(),
    max_new_tokens=200,
    return_dict_in_generate=True,
)

print(model.tokenizer.decode(generation_output.sequences[0]))