Hugging Face with llamacpp Usage

2026-05-20
5 min read

Hugging Face Hub is the default marketplace for open-weight models. For local inference with llama.cpp, you rarely need the full PyTorch checkpoint—you usually want a curated GGUF build, a clear model lineage, and a short list of organizations worth following.

This note covers four workflows:

  1. Subscribe to key organizations so new releases surface in your feed.
  2. Read the model tree and tags before you download anything.
  3. Pick the right GGUF files (text weights vs. vision projector).
  4. Download with huggingface-cli and serve with llama-server.

Following organizations on the Hub

The Hub groups models by publisher (organization/model-name). Following an org is the fastest way to notice new base models, quantizations, and community repacks.

Subscribe to these organizations first—they cover most models you will run locally:

Organization Typical models Why follow
google Gemma, Gemma 3/4, T5, SigLIP Google open-weight releases
meta-llama Llama 2/3/3.x Meta Llama family
Qwen Qwen2, Qwen2.5, Qwen3, QwQ Strong multilingual OSS LLMs
deepseek-ai DeepSeek-V2/V3, R1 distillations Reasoning and MoE checkpoints
mistralai Mistral, Mixtral, Codestral Efficient European OSS models
microsoft Phi-3/4, Florence, Phi-vision Small models and vision stacks
stabilityai Stable Diffusion, SDXL Image generation
openai whisper, CLIP (legacy OSS) ASR and embedding baselines
nvidia Nemotron, NIM-related weights Enterprise-oriented releases
unsloth Fast fine-tunes and GGUF repacks Community-optimized quantizations
bartowski Llama/Mistral GGUF quantizations Widely used Q4_K_M builds
lmstudio-community GGUF bundles for desktop tools Pre-quantized desktop-friendly sets

On any org page, click Follow (top right). Your home feed then shows new repos, updated quantizations, and trending forks without manual search.

Model tree and lineage

When a checkpoint is a fine-tune, LoRA merge, adapter, or quantized derivative, authors can declare a base model in the model card metadata. The Hub renders this as a model tree: upstream base at the root, derivatives as children.

img

What to look for on the tree:

  • Base model — the original full-precision checkpoint (e.g. google/gemma-2-9b-it).
  • Adapters / fine-tunes — task-specific heads or instruction-tuned forks.
  • Quantizations — GGUF/GPTQ/AWQ builds; often the tag you care about for llama.cpp.
  1. Quantization nodes

    The quantizations tag (and linked GGUF repos) is where local-inference users spend most of their time. A single base model may have dozens of community quantizations (Q4_K_M, Q5_K_S, IQ4_XS, etc.). Prefer publishers you trust (official org quant, unsloth, bartowski) and check:

    • file size vs. your VRAM/RAM budget;
    • whether the build is text-only or includes a separate mmproj for vision;
    • commit date and download count (stale or empty repos are common).

    If the tree shows no quantizations, search the Hub for gguf plus the base model name—quantizers often publish under their personal org.

Reading model tags on the model card

Tags are shorthand for task, modality, framework, and license. They appear as colored pills below the model name.

Domain Tag Meaning Example
NLP text-generation Causal LM, chat, completion Llama, Gemma, Qwen
  text-classification Single-label or multi-label classification BERT-style heads
  fill-mask Masked language modeling BERT, RoBERTa
  token-classification NER, POS tagging  
  question-answering Extractive QA  
  summarization Abstractive / extractive summary  
  translation Machine translation NLLB, M2M100
  feature-extraction Embeddings sentence-transformers
Computer image-classification Image classification ViT, ConvNeXt
vision object-detection Bounding-box detection DETR, YOLO exports
  image-segmentation Semantic / instance segmentation  
  image-to-image Style transfer, enhancement  
  depth-estimation Monocular depth  
Multimodal image-text-to-text Image + text in, text out Gemma 3/4, LLaVA
  text-to-image Text in, image out Stable Diffusion
  visual-question-answering VQA  
  document-question-answering Doc QA over PDFs / layouts  
Audio automatic-speech-recognition Speech-to-text Whisper
  text-to-speech TTS  
  audio-classification Audio event / scene labels  

Other tags worth noting:

  • Licenseapache-2.0, mit, llama2, gemma, etc.; governs commercial use.
  • Librarytransformers, gguf, safetensors; tells you which runtime to expect.
  • Size — parameter count band (7B, 70B); sanity-check against your hardware.

For llama.cpp, confirm the repo actually ships .gguf (or follow a linked quantization repo) before you clone multi-gigabyte Safetensors by mistake.

GGUF format: text weights vs. vision projector

GGUF is the native container for llama.cpp. Multimodal models are often split into two files:

File role Typical name pattern Contents Quantization strategy
Main (LLM) *-Q4_K_M.gguf Transformer language model Aggressive (Q4/Q5) for VRAM
Projector (vision) mmproj-*.gguf, *-mmproj-* Vision encoder → LLM hidden states Prefer F16/BF16; avoid heavy quant
  1. Why two quantization strategies?

    The language model dominates memory and compute, so aggressive quantization (e.g. Q4_K_M) is usually acceptable with modest quality loss.

    The vision projector is smaller but more sensitive: aggressive quant can collapse alignment between image patches and token embeddings. Community practice:

    • Main LLM: Q4_K_M or Q5_K_M for a balance of size and quality.
    • mmproj: F16 or BF16 when available; only fall back to mild quant if RAM is tight.

    Download both files from the same publisher and revision when possible—mixed sources sometimes break template or image-size assumptions.

Downloading models with the Hugging Face CLI

Install the CLI and point it at a mirror if huggingface.co is slow or blocked.

pip install -U "huggingface_hub[cli]"

# macOS / Linux — mirror (optional, common in CN)
export HF_ENDPOINT=https://hf-mirror.com

# Windows PowerShell
$env:HF_ENDPOINT = "https://hf-mirror.com"

# Download only GGUF shards into a fixed directory
hf download google/gemma-3-27b-it \
  --local-dir ./models/gemma-3-27b-it \
  --include "*.gguf"

Useful flags:

  • --include / --exclude — glob filters (*.gguf, *mmproj*).
  • --local-dir-use-symlinks False — on older hub versions, avoid symlink issues on Windows.
  • HF_TOKEN — for gated models (Llama, Gemma license gates); run huggingface-cli login once.

Example: fetch both LLM and projector for a vision model:

hf download unsloth/gemma-3-27b-it-GGUF \
  --local-dir ./models/gemma-3-27b-it \
  --include "*Q4_K_M.gguf" "*mmproj*f16*"

Verify file sizes after download; incomplete transfers are common on flaky networks.

Serving with llama.cpp

Point llama-server at the text GGUF and pass the projector separately. Text-only models omit --mmproj.

./llama-server \
  -m /path/to/text_model-Q4_K_M.gguf \
  --mmproj /path/to/mmproj-f16.gguf \
  --port 8080 \
  -c 32768 \
  -ngl 99
  • -m — main language-model weights.
  • --mmproj — vision projector (multimodal only).
  • -c — context length; lower if you hit OOM.
  • -ngl — layers offloaded to GPU (99 ≈ all layers on CUDA/Metal).

For a full Gemma + Opencode setup (context sizing, nohup, provider JSON), see Running LlamaCpp with Gemma-4-26B Model.

#Ai