
What is UVLM?
UVLM is an open-source framework that provides a unified interface for loading, configuring, and benchmarking multiple Vision-Language Model (VLM) architectures on custom image analysis tasks. It abstracts the substantial architectural differences between VLM families behind a single inference function, enabling researchers and practitioners to compare models using identical prompts and evaluation protocols — without writing model-specific code.
VLMs can interpret images and respond to arbitrary natural language queries about their content — counting objects, classifying scenes, estimating measurements, detecting features. But each model family (LLaVA-NeXT, Qwen2.5-VL, and others) requires its own processor classes, tokenization logic, generation configuration, and output parsing. UVLM solves this by routing every inference call through a single backend-agnostic function, regardless of the underlying model.
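The routing idea can be sketched in a few lines. This is an illustrative sketch, not UVLM's actual code: the registry, the `register_backend` decorator, and the stub inference functions are all hypothetical names standing in for the real model-specific pipelines.

```python
# Hypothetical sketch of backend-agnostic dispatch: each model family
# registers one inference callable, and run_inference() routes every
# call through the same signature regardless of the backend.
from typing import Callable, Dict

# Registry mapping a backend name to its inference function.
# Real backends would wrap LLaVA-NeXT or Qwen2.5-VL processors;
# here they are stubs standing in for model-specific pipelines.
BACKENDS: Dict[str, Callable[[str, str], str]] = {}

def register_backend(name: str):
    """Decorator that adds an inference function to the registry."""
    def wrap(fn: Callable[[str, str], str]) -> Callable[[str, str], str]:
        BACKENDS[name] = fn
        return fn
    return wrap

@register_backend("llava_next")
def _llava_infer(image_path: str, prompt: str) -> str:
    # Placeholder for LLaVA-NeXT-specific preprocessing and generation.
    return f"[llava_next] {prompt} on {image_path}"

@register_backend("qwen2_5_vl")
def _qwen_infer(image_path: str, prompt: str) -> str:
    # Placeholder for Qwen2.5-VL-specific preprocessing and generation.
    return f"[qwen2_5_vl] {prompt} on {image_path}"

def run_inference(backend: str, image_path: str, prompt: str) -> str:
    """Single entry point: identical prompts, identical call shape."""
    if backend not in BACKENDS:
        raise ValueError(f"Unknown backend: {backend}")
    return BACKENDS[backend](image_path, prompt)
```

Because every model family is reached through the same `run_inference` signature, benchmarking code never branches on the model type.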
UVLM is implemented as a Google Colab notebook and runs entirely in the browser. No local installation, no infrastructure, no GPU ownership required. Models up to 34B parameters can be loaded on a free-tier T4 GPU with 4-bit quantization.

How It Works
UVLM is organised into three sequential blocks, each handling a distinct stage of the workflow:
Step 1: Load model. Select from 11 checkpoints (7 LLaVA-NeXT + 4 Qwen2.5-VL, from 3B to 110B parameters). Choose precision mode (FP16, 8-bit, or 4-bit quantization). The block auto-detects the model backend and loads the appropriate processor.
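The auto-detection in Step 1 could plausibly work as below. This is a sketch under assumptions: the function names, the string matching, and the kwarg table are illustrative, and the checkpoint ids in the comments are examples rather than the notebook's actual menu entries.

```python
# Sketch of backend auto-detection and precision-mode mapping.
def detect_backend(checkpoint: str) -> str:
    """Guess the model family from a Hugging Face checkpoint id,
    e.g. "llava-hf/llava-v1.6-34b-hf" or "Qwen/Qwen2.5-VL-7B-Instruct"."""
    name = checkpoint.lower()
    if "qwen2.5-vl" in name or "qwen2_5" in name:
        return "qwen2_5_vl"
    if "llava" in name:
        return "llava_next"
    raise ValueError(f"Unrecognized checkpoint: {checkpoint}")

def precision_kwargs(mode: str) -> dict:
    """Translate a precision mode into hypothetical loader kwargs
    (FP16, or 8-/4-bit quantization via BitsAndBytes)."""
    table = {
        "fp16": {"torch_dtype": "float16"},
        "8bit": {"load_in_8bit": True},
        "4bit": {"load_in_4bit": True},
    }
    if mode not in table:
        raise ValueError(f"Unknown precision mode: {mode}")
    return table[mode]
```

Keying the backend off the checkpoint id keeps the user-facing workflow to a single dropdown: the notebook can pick the matching processor class without any user input.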
Step 2: Define tasks. Configure up to 10 analysis tasks through a widget-based form. Each task has a column name, prompt (role + task + theory + format), task type (numeric, category, boolean, text), and optional consensus validation. A max-token slider goes up to 1,500 for custom reasoning strategies.
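A task definition and the role + task + theory + format assembly described above might look like the following. All field and function names here are assumptions for illustration; the actual widget form may store tasks differently.

```python
# Illustrative sketch of a task record, the modular prompt builder,
# and the type-specific answer parsers.
from dataclasses import dataclass

@dataclass
class Task:
    column: str              # CSV column name for this task's answers
    role: str                # e.g. "You are an image analyst."
    task: str                # the question to ask about each image
    theory: str = ""         # optional domain background for the model
    fmt: str = ""            # output-format instruction
    task_type: str = "text"  # numeric | category | boolean | text

def build_prompt(t: Task) -> str:
    """Concatenate role + task + theory + format into one prompt."""
    parts = [t.role, t.task, t.theory, t.fmt]
    return " ".join(p for p in parts if p)

def parse_answer(raw: str, task_type: str):
    """Coerce a raw model answer into the task's declared type."""
    raw = raw.strip()
    if task_type == "numeric":
        try:
            return float(raw.split()[0].rstrip(".,"))
        except (ValueError, IndexError):
            return None  # NA: unparseable answer
    if task_type == "boolean":
        low = raw.lower()
        if low.startswith(("yes", "true")):
            return True
        if low.startswith(("no", "false")):
            return False
        return None      # NA: neither affirmative nor negative
    return raw           # category and free text pass through
```

Returning None for unparseable answers gives downstream steps, such as consensus validation, a uniform NA marker to filter on.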
Step 3: Run analysis. Point to a Google Drive image folder. The engine processes all images sequentially, writing results to CSV as it goes. Supports resume mode, schema upgrading (add tasks between runs), checkpoint saves every 5 images, and built-in truncation detection.
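The resume and checkpoint behaviour of Step 3 reduces to a loop like the one below. This is a minimal sketch, assuming one CSV row per image keyed by filename; the function names, file layout, and the `analyze` callback are hypothetical.

```python
# Minimal sketch of sequential processing with resume mode and a
# checkpoint write every 5 images.
import csv
import os
from typing import Callable, Dict, List

def run_analysis(image_paths: List[str],
                 analyze: Callable[[str], Dict[str, str]],
                 out_csv: str,
                 checkpoint_every: int = 5) -> None:
    # Resume mode: collect filenames already written to the CSV
    # so re-running the cell skips completed images.
    done = set()
    if os.path.exists(out_csv):
        with open(out_csv, newline="") as f:
            done = {row["image"] for row in csv.DictReader(f)}

    rows = []
    for i, path in enumerate(p for p in image_paths
                             if os.path.basename(p) not in done):
        result = analyze(path)        # one dict of task answers
        result["image"] = os.path.basename(path)
        rows.append(result)
        if (i + 1) % checkpoint_every == 0:
            _flush(rows, out_csv)     # checkpoint save
            rows = []
    _flush(rows, out_csv)             # final partial batch

def _flush(rows: List[Dict[str, str]], out_csv: str) -> None:
    """Append buffered rows, writing the header on first creation."""
    if not rows:
        return
    new_file = not os.path.exists(out_csv)
    with open(out_csv, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=sorted(rows[0]))
        if new_file:
            writer.writeheader()
        writer.writerows(rows)
```

Flushing in small batches means an interrupted Colab session loses at most the last few images, and schema changes between runs only require re-deriving the header from the new task set.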

Key Features
- Dual-backend abstraction. Automatically routes to the correct inference pipeline based on the loaded model family.
- Multi-task prompt builder. Modular prompts: role + task + theory + format. Four response types with type-specific parsers.
- Consensus validation. Majority voting across 2–5 repeated inferences per task. NA-aware filtering. Agreement ratio tracking.
- Flexible reasoning. Token budgets up to 1,500 allow user-defined chain-of-thought prompts; a built-in CoT reference mode fixed at 1,024 tokens supports standardised benchmarking.
- Truncation detection. Exact token counting from the model output tensor. Per-task _truncated CSV columns with console warnings.
- Quantization support. FP16, 8-bit, and 4-bit precision via BitsAndBytes. Consumer-grade GPUs (T4, L4) for models up to 34B.
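The consensus-validation feature above (majority voting across 2–5 repeated inferences, NA-aware filtering, agreement ratio tracking) can be sketched as follows. The function name and tie-breaking behaviour are assumptions; UVLM's actual implementation may differ.

```python
# Sketch of NA-aware majority voting across repeated inferences.
from collections import Counter
from typing import Any, List, Optional, Tuple

def consensus(answers: List[Optional[Any]]) -> Tuple[Optional[Any], float]:
    """Return (majority answer, agreement ratio) over 2-5 repeats.

    None entries (failed parses) are filtered out before voting;
    the agreement ratio is the winner's share of ALL attempts, so
    NA runs lower the reported confidence.
    """
    valid = [a for a in answers if a is not None]
    if not valid:
        return None, 0.0  # every repeat failed to parse
    winner, count = Counter(valid).most_common(1)[0]
    return winner, count / len(answers)
```

For example, three repeats that parse to 3, 3, and NA would yield answer 3 with an agreement ratio of 2/3, flagging that one inference failed even though the vote is decisive.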
Get Started
GitHub: https://github.com/perezjoan/UVLM
Run in Colab: Open the notebook directly from the GitHub repository. Select a GPU runtime, load a model, define your tasks, and run.
Citations & Publications
If you use UVLM in your work, please cite:
Perez, J. & Fusco, G. (2026). UVLM: A Universal Vision-Language Model Loader for Reproducible Multimodal Benchmarking. arXiv:2603.13893
