
What is UVLM?
UVLM is an open-source framework that provides a unified interface for loading, configuring, and benchmarking multiple Vision-Language Model (VLM) architectures on custom image analysis tasks. It abstracts the substantial architectural differences between VLM families behind a single inference function, enabling researchers and practitioners to compare models using identical prompts and evaluation protocols — without writing model-specific code.
VLMs can interpret images and respond to arbitrary natural language queries about their content — counting objects, classifying scenes, estimating measurements, detecting features. But each model family (LLaVA-NeXT, Qwen2.5-VL, and others) requires its own processor classes, tokenization logic, generation configuration, and output parsing. UVLM solves this by routing every inference call through a single backend-agnostic function, regardless of the underlying model.
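The routing idea can be sketched in a few lines. This is an illustrative sketch, not UVLM's actual code: the registry, the `register_backend` decorator, and the stub inference functions are all hypothetical names standing in for the real model-specific pipelines.

```python
# Hypothetical sketch of backend-agnostic dispatch: each model family
# registers one inference callable, and run_inference() routes every
# call through the same signature regardless of the backend.
from typing import Callable, Dict

# Registry mapping a backend name to its inference function.
# Real backends would wrap LLaVA-NeXT or Qwen2.5-VL processors;
# here they are stubs standing in for model-specific pipelines.
BACKENDS: Dict[str, Callable[[str, str], str]] = {}

def register_backend(name: str):
    """Decorator that adds an inference function to the registry."""
    def wrap(fn: Callable[[str, str], str]) -> Callable[[str, str], str]:
        BACKENDS[name] = fn
        return fn
    return wrap

@register_backend("llava_next")
def _llava_infer(image_path: str, prompt: str) -> str:
    # Placeholder for LLaVA-NeXT-specific preprocessing and generation.
    return f"[llava_next] {prompt} on {image_path}"

@register_backend("qwen2_5_vl")
def _qwen_infer(image_path: str, prompt: str) -> str:
    # Placeholder for Qwen2.5-VL-specific preprocessing and generation.
    return f"[qwen2_5_vl] {prompt} on {image_path}"

def run_inference(backend: str, image_path: str, prompt: str) -> str:
    """Single entry point: identical prompts, identical call shape."""
    if backend not in BACKENDS:
        raise ValueError(f"Unknown backend: {backend}")
    return BACKENDS[backend](image_path, prompt)
```

Because every model family is reached through the same `run_inference` signature, benchmarking code never branches on the model type.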
UVLM is implemented as a Google Colab notebook and runs entirely in the browser. No local installation, no infrastructure, no GPU ownership required. Models up to 34B parameters can be loaded on a free-tier T4 GPU with 4-bit quantization.

How It Works
UVLM is organised into three sequential blocks, each handling a distinct stage of the workflow:
Step 1: Load model. Select from 11 checkpoints (7 LLaVA-NeXT + 4 Qwen2.5-VL, from 3B to 110B parameters). Choose precision mode (FP16, 8-bit, or 4-bit quantization). The block auto-detects the model backend and loads the appropriate processor.
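The auto-detection in Step 1 could plausibly work as below. This is a sketch under assumptions: the function names, the string matching, and the kwarg table are illustrative, and the checkpoint ids in the comments are examples rather than the notebook's actual menu entries.

```python
# Sketch of backend auto-detection and precision-mode mapping.
def detect_backend(checkpoint: str) -> str:
    """Guess the model family from a Hugging Face checkpoint id,
    e.g. "llava-hf/llava-v1.6-34b-hf" or "Qwen/Qwen2.5-VL-7B-Instruct"."""
    name = checkpoint.lower()
    if "qwen2.5-vl" in name or "qwen2_5" in name:
        return "qwen2_5_vl"
    if "llava" in name:
        return "llava_next"
    raise ValueError(f"Unrecognized checkpoint: {checkpoint}")

def precision_kwargs(mode: str) -> dict:
    """Translate a precision mode into hypothetical loader kwargs
    (FP16, or 8-/4-bit quantization via BitsAndBytes)."""
    table = {
        "fp16": {"torch_dtype": "float16"},
        "8bit": {"load_in_8bit": True},
        "4bit": {"load_in_4bit": True},
    }
    if mode not in table:
        raise ValueError(f"Unknown precision mode: {mode}")
    return table[mode]
```

Keying the backend off the checkpoint id keeps the user-facing workflow to a single dropdown: the notebook can pick the matching processor class without any user input.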
Step 2: Define tasks. Configure up to 10 analysis tasks through a widget-based form. Each task has a column name, prompt (role + task + theory + format), task type (numeric, category, boolean, text), and optional consensus validation. A max-token slider goes up to 1,500 for custom reasoning strategies.
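A task definition and the role + task + theory + format assembly described above might look like the following. All field and function names here are assumptions for illustration; the actual widget form may store tasks differently.

```python
# Illustrative sketch of a task record, the modular prompt builder,
# and the type-specific answer parsers.
from dataclasses import dataclass

@dataclass
class Task:
    column: str              # CSV column name for this task's answers
    role: str                # e.g. "You are an image analyst."
    task: str                # the question to ask about each image
    theory: str = ""         # optional domain background for the model
    fmt: str = ""            # output-format instruction
    task_type: str = "text"  # numeric | category | boolean | text

def build_prompt(t: Task) -> str:
    """Concatenate role + task + theory + format into one prompt."""
    parts = [t.role, t.task, t.theory, t.fmt]
    return " ".join(p for p in parts if p)

def parse_answer(raw: str, task_type: str):
    """Coerce a raw model answer into the task's declared type."""
    raw = raw.strip()
    if task_type == "numeric":
        try:
            return float(raw.split()[0].rstrip(".,"))
        except (ValueError, IndexError):
            return None  # NA: unparseable answer
    if task_type == "boolean":
        low = raw.lower()
        if low.startswith(("yes", "true")):
            return True
        if low.startswith(("no", "false")):
            return False
        return None      # NA: neither affirmative nor negative
    return raw           # category and free text pass through
```

Returning None for unparseable answers gives downstream steps, such as consensus validation, a uniform NA marker to filter on.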
Step 3: Run analysis. Point to a Google Drive image folder. The engine processes all images sequentially, writing results to CSV as it goes. Supports resume mode, schema upgrading (add tasks between runs), checkpoint saves every 5 images, and built-in truncation detection.
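The resume and checkpoint behaviour of Step 3 reduces to a loop like the one below. This is a minimal sketch, assuming one CSV row per image keyed by filename; the function names, file layout, and the `analyze` callback are hypothetical.

```python
# Minimal sketch of sequential processing with resume mode and a
# checkpoint write every 5 images.
import csv
import os
from typing import Callable, Dict, List

def run_analysis(image_paths: List[str],
                 analyze: Callable[[str], Dict[str, str]],
                 out_csv: str,
                 checkpoint_every: int = 5) -> None:
    # Resume mode: collect filenames already written to the CSV
    # so re-running the cell skips completed images.
    done = set()
    if os.path.exists(out_csv):
        with open(out_csv, newline="") as f:
            done = {row["image"] for row in csv.DictReader(f)}

    rows = []
    for i, path in enumerate(p for p in image_paths
                             if os.path.basename(p) not in done):
        result = analyze(path)        # one dict of task answers
        result["image"] = os.path.basename(path)
        rows.append(result)
        if (i + 1) % checkpoint_every == 0:
            _flush(rows, out_csv)     # checkpoint save
            rows = []
    _flush(rows, out_csv)             # final partial batch

def _flush(rows: List[Dict[str, str]], out_csv: str) -> None:
    """Append buffered rows, writing the header on first creation."""
    if not rows:
        return
    new_file = not os.path.exists(out_csv)
    with open(out_csv, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=sorted(rows[0]))
        if new_file:
            writer.writeheader()
        writer.writerows(rows)
```

Flushing in small batches means an interrupted Colab session loses at most the last few images, and schema changes between runs only require re-deriving the header from the new task set.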

Key Features
- Dual-backend abstraction. Automatically routes to the correct inference pipeline based on the loaded model family.
- Multi-task prompt builder. Modular prompts: role + task + theory + format. Four response types with type-specific parsers.
- Consensus validation. Majority voting across 2–5 repeated inferences per task. NA-aware filtering. Agreement ratio tracking.
- Flexible reasoning. Token budgets up to 1,500 allow user-defined chain-of-thought prompts; a built-in CoT reference mode fixed at 1,024 tokens supports standardised benchmarking.
- Truncation detection. Exact token counting from the model output tensor. Per-task _truncated CSV columns with console warnings.
- Quantization support. FP16, 8-bit, and 4-bit precision via BitsAndBytes. Consumer-grade GPUs (T4, L4) for models up to 34B.
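The consensus-validation feature above (majority voting across 2–5 repeated inferences, NA-aware filtering, agreement ratio tracking) can be sketched as follows. The function name and tie-breaking behaviour are assumptions; UVLM's actual implementation may differ.

```python
# Sketch of NA-aware majority voting across repeated inferences.
from collections import Counter
from typing import Any, List, Optional, Tuple

def consensus(answers: List[Optional[Any]]) -> Tuple[Optional[Any], float]:
    """Return (majority answer, agreement ratio) over 2-5 repeats.

    None entries (failed parses) are filtered out before voting;
    the agreement ratio is the winner's share of ALL attempts, so
    NA runs lower the reported confidence.
    """
    valid = [a for a in answers if a is not None]
    if not valid:
        return None, 0.0  # every repeat failed to parse
    winner, count = Counter(valid).most_common(1)[0]
    return winner, count / len(answers)
```

For example, three repeats that parse to 3, 3, and NA would yield answer 3 with an agreement ratio of 2/3, flagging that one inference failed even though the vote is decisive.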
Get Started
GitHub: https://github.com/perezjoan/UVLM
Run in Colab: Open the notebook directly from the GitHub repository. Select a GPU runtime, load a model, define your tasks, and run.
Citations & Publications
If you use UVLM in your work, please cite:
Perez, J. & Fusco, G. (2026). UVLM: A Universal Vision-Language Model Loader for Reproducible Multimodal Benchmarking. arXiv:2603.13893
