What is UVLM?

UVLM is an open-source framework that provides a unified interface for loading, configuring, and benchmarking multiple Vision-Language Model (VLM) architectures on custom image analysis tasks. It abstracts the substantial architectural differences between VLM families behind a single inference function, enabling researchers and practitioners to compare models using identical prompts and evaluation protocols — without writing model-specific code.

VLMs can interpret images and respond to arbitrary natural language queries about their content — counting objects, classifying scenes, estimating measurements, detecting features. But each model family (LLaVA-NeXT, Qwen2.5-VL, and others) requires its own processor classes, tokenization logic, generation configuration, and output parsing. UVLM solves this by routing every inference call through a single backend-agnostic function, regardless of the underlying model.
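The routing idea can be sketched in a few lines. This is an illustrative sketch, not UVLM's actual code: the function names (`detect_backend`, `infer`) and the substring matching are assumptions about how such dispatch could work.

```python
def detect_backend(model_id: str) -> str:
    """Route a checkpoint name to its model family (illustrative heuristic)."""
    name = model_id.lower()
    if "llava" in name:
        return "llava-next"
    if "qwen" in name:
        return "qwen2.5-vl"
    raise ValueError(f"Unsupported model family: {model_id}")

def infer(model_id, image, prompt, backends):
    """Single entry point: dispatch to the backend-specific pipeline.

    `backends` maps a family name to a callable that runs the
    family-specific processing and generation.
    """
    return backends[detect_backend(model_id)](image, prompt)
```

Callers only ever see `infer`; the family-specific processors and parsers live behind the `backends` mapping.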

UVLM is implemented as a Google Colab notebook and runs entirely in the browser. No local installation, no infrastructure, no GPU ownership required. Models up to 34B parameters can be loaded on a free-tier T4 GPU with 4-bit quantization.

How It Works

UVLM is organised into three sequential blocks, each handling a distinct stage of the workflow:

Step 1: Load model. Select from 11 checkpoints (7 LLaVA-NeXT + 4 Qwen2.5-VL, from 3B to 110B parameters). Choose precision mode (FP16, 8-bit, or 4-bit quantization). The block auto-detects the model backend and loads the appropriate processor.
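A precision selector of this kind might map the UI choice to `from_pretrained` keyword arguments. The mapping below is a dependency-free sketch: the real notebook would pass a `BitsAndBytesConfig` object from the `bitsandbytes` integration in `transformers`, and the exact keys used here are assumptions.

```python
def precision_kwargs(mode: str) -> dict:
    """Map a precision choice ('fp16' | '8bit' | '4bit') to loading kwargs.

    Sketch only: plain dicts stand in for the BitsAndBytesConfig object
    that a real transformers-based loader would construct.
    """
    if mode == "fp16":
        return {"torch_dtype": "float16"}
    if mode == "8bit":
        return {"load_in_8bit": True}
    if mode == "4bit":
        return {"load_in_4bit": True, "bnb_4bit_compute_dtype": "float16"}
    raise ValueError(f"Unknown precision mode: {mode}")
```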

Step 2: Define tasks. Configure up to 10 analysis tasks through a widget-based form. Each task specifies a column name, a modular prompt (role + task + theory + format), a response type (numeric, category, boolean, text), and optional consensus validation. A max-token slider (up to 1,500) leaves headroom for custom reasoning strategies.
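A task definition along these lines could be modelled as a small record. The field names below are illustrative, not UVLM's exact schema; the prompt is assembled by concatenating the non-empty modular parts.

```python
from dataclasses import dataclass

@dataclass
class Task:
    """One analysis task (field names are an assumption, not UVLM's schema)."""
    column: str               # CSV column the answer is written to
    role: str                 # "You are a ..." framing
    task: str                 # the actual question about the image
    theory: str = ""          # optional domain guidance
    fmt: str = ""             # output-format instruction
    task_type: str = "text"   # numeric | category | boolean | text
    consensus_runs: int = 1   # 1 = single pass, 2-5 = majority voting
    max_tokens: int = 256     # per-task generation budget (slider up to 1,500)

    def build_prompt(self) -> str:
        """Join the non-empty prompt parts: role + task + theory + format."""
        parts = [self.role, self.task, self.theory, self.fmt]
        return " ".join(p for p in parts if p)
```

For example, a numeric counting task would set `task_type="numeric"` and a format instruction such as "Answer with an integer."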

Step 3: Run analysis. Point to a Google Drive image folder. The engine processes all images sequentially, writing results to CSV as it goes. Supports resume mode, schema upgrading (add tasks between runs), checkpoint saves every 5 images, and built-in truncation detection.
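The resume and checkpoint logic can be sketched as two small helpers. Both are assumptions about how such a loop could work: the `filename` column and the helper names are illustrative, not taken from the UVLM code.

```python
import csv
import os

def pending_images(image_files, results_csv, key="filename"):
    """Resume mode: skip images already recorded in the results CSV.

    `key` is the column holding the image name (an assumed schema).
    """
    done = set()
    if os.path.exists(results_csv):
        with open(results_csv, newline="") as f:
            done = {row[key] for row in csv.DictReader(f)}
    return [img for img in image_files if img not in done]

def should_checkpoint(index: int, every: int = 5) -> bool:
    """Flush results to disk after every `every` processed images."""
    return (index + 1) % every == 0
```

Schema upgrading then amounts to adding new task columns to the CSV header while leaving existing rows' values intact.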

Key Features

  • Dual-backend abstraction. Automatically routes to the correct inference pipeline based on the loaded model family.
  • Multi-task prompt builder. Modular prompts: role + task + theory + format. Four response types with type-specific parsers.
  • Consensus validation. Majority voting across 2–5 repeated inferences per task. NA-aware filtering. Agreement ratio tracking.
  • Flexible reasoning. Token budget up to 1,500 for user-defined chain-of-thought prompts. Built-in CoT reference mode at 1,024 tokens for standardised benchmarking.
  • Truncation detection. Exact token counting from the model output tensor. Per-task _truncated CSV columns with console warnings.
  • Quantization support. FP16, 8-bit, and 4-bit precision via BitsAndBytes. Consumer-grade GPUs (T4, L4) for models up to 34B.
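The consensus-validation feature above can be sketched as NA-aware majority voting. The NA handling and the agreement-ratio definition (winning votes over total runs) are assumptions about UVLM's behaviour, not a transcription of its code.

```python
from collections import Counter

def consensus(answers, na_values=(None, "", "NA")):
    """Majority vote across repeated inferences for one task.

    Filters out NA responses before voting; returns (winner, agreement_ratio),
    where the ratio is winning votes divided by total runs.
    """
    valid = [a for a in answers if a not in na_values]
    if not valid:
        return None, 0.0
    winner, count = Counter(valid).most_common(1)[0]
    return winner, count / len(answers)
```

With three runs answering "3", "3", "NA", the winner is "3" with an agreement ratio of 2/3, which a pipeline could log alongside the value.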

Get Started

GitHub: https://github.com/perezjoan/UVLM

Run in Colab: Open the notebook directly from the GitHub repository. Select a GPU runtime, load a model, define your tasks, and run.

Citations & Publications

If you use UVLM in your work, please cite:

Perez, J. & Fusco, G. (2026). UVLM: A Universal Vision-Language Model Loader for Reproducible Multimodal Benchmarking. arXiv:2603.13893

Read the publication