
Highlights
- SAGAI v2.0 merges the previous four-module notebook architecture into a single unified Google Colab notebook (SAGAI.ipynb) organized in six sequential blocks.
- The inline LLaVA-only inference code is replaced by the UVLM package (Universal Vision-Language Model Loader), installed automatically from GitHub, providing access to 11 VLM checkpoints across two model families.
- New capabilities include a multi-task prompt builder, consensus validation with majority voting, chain-of-thought reasoning, truncation detection, interactive Folium maps, view-direction filtering, and support for loading an existing study area polygon.
Introduction
SAGAI (Streetscape Analysis with Generative Artificial Intelligence) is an open-source workflow for scoring and mapping street-level urban environments using vision-language models and open geospatial data. Since its initial release, SAGAI has been structured as a set of independent Colab notebooks, one per pipeline stage, each relying on its own dependencies and documentation.
SAGAI v2.0 is a major release that consolidates the entire pipeline into a single notebook and replaces the custom inference code with the UVLM package. Where previous versions were tied to a single LLaVA checkpoint with handwritten inference logic, SAGAI v2.0 delegates all vision-language model loading, prompting, and evaluation to UVLM’s unified interface. This makes the scoring engine model-agnostic: users can select from 11 VLM checkpoints spanning the LLaVA-NeXT and Qwen2.5-VL families, compare their performance on identical tasks, and benefit from features such as consensus validation, reasoning traces, and truncation diagnostics, all within the same notebook.
Beyond the inference engine, v2.0 introduces structural and functional changes across the entire pipeline: a unified six-block architecture, interactive HTML mapping via Folium, view-direction filtering for aggregation, and the ability to load an existing polygon as a study area boundary instead of defining a bounding box manually.
This post details the architectural changes, the UVLM integration, and the new features introduced in SAGAI v2.0.
1. From Four Notebooks to One: The Unified Architecture
Previous SAGAI releases were organized as four independent Colab notebooks — one for street sampling, one for image retrieval, one for VLM inference, and one for aggregation and mapping — each accompanied by a separate NOTICE file documenting its dependencies and usage. This modular design was useful for development but introduced friction in practice: users had to manage file paths between notebooks, track four separate environments, and consult multiple documentation files.
SAGAI v2.0 merges all four stages into a single notebook (SAGAI.ipynb) structured as six sequential blocks. The pipeline flows from study area definition through street sampling, image downloading, VLM scoring, and mapping, with all intermediate data passed directly between blocks in the same runtime session. The separate per-module NOTICE files and the standalone requirements file (requirements_sagai_module_3_v1-0.txt) have been removed — dependency management is now handled automatically by the UVLM package installation.

2. Study Area Definition: Bounding Box or Existing Polygon
In previous versions, the study area was defined exclusively by a bounding box in WGS84 coordinates. SAGAI v2.0 retains this option and adds two more: drawing a polygon by hand, or loading an existing one, for example a GeoPackage representing a neighborhood, municipality, or custom boundary. When a polygon is provided, the street sampling step extracts the OpenStreetMap network within that geometry rather than a rectangular extent. This makes it straightforward to work with irregular administrative boundaries or user-defined study zones without manually computing bounding coordinates.
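To see why a polygon boundary can matter, the sketch below compares a bounding-box test with a ray-casting point-in-polygon test on an L-shaped area. It is only an illustration of the geometric idea, not SAGAI's sampling code; the function names and coordinates are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y), e.g. (lon, lat)


def in_bbox(p: Point, polygon: List[Point]) -> bool:
    """True if p lies inside the rectangular extent of the polygon."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return min(xs) <= p[0] <= max(xs) and min(ys) <= p[1] <= max(ys)


def in_polygon(p: Point, polygon: List[Point]) -> bool:
    """Ray casting: count edge crossings of a ray going right from p."""
    x, y = p
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through p can cross
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# Hypothetical L-shaped study area and four candidate sampling points
l_shape = [(0, 0), (2, 0), (2, 1), (1, 1), (1, 2), (0, 2)]
candidates = [(0.5, 0.5), (1.5, 0.5), (0.5, 1.5), (1.5, 1.5)]

kept_bbox = [p for p in candidates if in_bbox(p, l_shape)]
kept_poly = [p for p in candidates if in_polygon(p, l_shape)]
```

Here the bounding box keeps all four points, while the polygon test drops (1.5, 1.5), which falls in the notch of the L.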
3. UVLM Integration: From Single-Model Inference to Multi-Model Benchmarking
The most significant change in SAGAI v2.0 is the replacement of the inline inference code with the UVLM package. In previous versions, Blocks 3 through 5 contained custom code for loading a single LLaVA checkpoint, constructing prompts, running inference, and parsing outputs. This logic was tightly coupled to one model architecture and required manual maintenance when Hugging Face APIs or model formats changed.
SAGAI v2.0 installs UVLM directly from its GitHub repository at the start of the notebook. All model loading, prompt formatting, inference execution, response parsing, and batch processing are delegated to UVLM’s API. The inline inference code has been entirely removed.
Through UVLM, SAGAI v2.0 supports 11 VLM checkpoints across two model families:
- LLaVA-NeXT — Mistral 7B, Vicuna 7B, Vicuna 13B, 34B, LLaMA3 8B, 72B, 110B
- Qwen2.5-VL — 3B Instruct, 7B Instruct, 32B Instruct, 72B Instruct
UVLM’s dual-backend abstraction automatically detects the model family and routes inference to the correct pipeline — LlavaNextProcessor for LLaVA models, AutoProcessor with process_vision_info for Qwen models — so users switch between architectures by changing a single model selection, with no modification to the rest of the notebook.
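The routing idea can be sketched as a small dispatch on the model identifier. This is a hypothetical illustration, not UVLM's actual code: `detect_family`, `BACKENDS`, and the substring rule are assumptions, and the two model IDs are used only as example strings.

```python
def detect_family(model_id: str) -> str:
    """Guess the model family from its identifier (assumed substring rule)."""
    name = model_id.lower()
    if "llava" in name:
        return "llava_next"
    if "qwen" in name:
        return "qwen2_5_vl"
    raise ValueError(f"Unsupported model: {model_id}")


# Family -> preprocessing path named in the post (descriptive labels only)
BACKENDS = {
    "llava_next": "LlavaNextProcessor",
    "qwen2_5_vl": "AutoProcessor + process_vision_info",
}


def describe_backend(model_id: str) -> str:
    """Return the label of the pipeline a model would be routed to."""
    return BACKENDS[detect_family(model_id)]


llava_route = describe_backend("llava-hf/llava-v1.6-mistral-7b-hf")
qwen_route = describe_backend("Qwen/Qwen2.5-VL-7B-Instruct")
```

With this kind of dispatch, the rest of a pipeline only needs the model identifier; everything family-specific stays behind one function.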
Precision is handled through UVLM’s built-in support for 4-bit and 8-bit quantization via BitsAndBytes, alongside FP16. Models up to 34B parameters can run on a single Colab GPU (T4 or A100) with 4-bit quantization.
4. Multi-Task Prompt Builder
UVLM provides a widget-based prompt builder that SAGAI v2.0 exposes directly in the notebook. Users can define up to 10 analysis tasks per run, each with its own prompt, response type (numeric, category, boolean, or text), and label. This replaces the previous approach of selecting from a small set of hardcoded tasks (T1, T2, T3) or manually editing prompt strings in the code.
Tasks are configured interactively before execution and applied uniformly across all images in the batch. Each task produces its own column in the output CSV file.
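As a rough picture of what a task definition carries, here is a minimal sketch using a dataclass. The `Task` class, `validate_tasks`, and the duplicate-label check are assumptions made for illustration; only the four response types and the 10-task limit come from the description above.

```python
from dataclasses import dataclass
from typing import List

RESPONSE_TYPES = {"numeric", "category", "boolean", "text"}
MAX_TASKS = 10  # limit per run stated in the post


@dataclass(frozen=True)
class Task:
    label: str          # becomes the output CSV column name
    prompt: str         # text sent to the model with each image
    response_type: str  # one of RESPONSE_TYPES

    def __post_init__(self) -> None:
        if self.response_type not in RESPONSE_TYPES:
            raise ValueError(f"Unknown response type: {self.response_type}")


def validate_tasks(tasks: List[Task]) -> List[str]:
    """Check the task list and return the CSV columns it would produce."""
    if not 1 <= len(tasks) <= MAX_TASKS:
        raise ValueError(f"Expected 1 to {MAX_TASKS} tasks, got {len(tasks)}")
    labels = [t.label for t in tasks]
    if len(set(labels)) != len(labels):
        raise ValueError("Task labels must be unique")  # assumed rule
    return labels


tasks = [
    Task("greenery", "Rate visible vegetation from 1 to 5.", "numeric"),
    Task("sidewalk", "Is a sidewalk visible? Answer yes or no.", "boolean"),
]
columns = validate_tasks(tasks)
```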

5. Consensus Validation
SAGAI v2.0 inherits UVLM’s consensus validation mechanism. Each analysis task can be run 2 to 5 times per image, and the final score is determined by majority voting across the repeated inferences. NA values from failed parses are filtered before voting. An agreement ratio is recorded alongside the final score, providing a built-in measure of prediction reliability without any external validation step.
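The voting step can be sketched in a few lines. This is a minimal illustration, not UVLM's implementation: the `consensus` function, the NA markers, the tie-breaking rule (first value seen wins), and the definition of the agreement ratio as winning votes divided by valid votes are all assumptions.

```python
from collections import Counter
from typing import Any, Optional, Sequence, Tuple


def consensus(
    votes: Sequence[Any], na_values: Tuple[Any, ...] = (None, "NA")
) -> Tuple[Optional[Any], float]:
    """Majority vote over repeated inferences for one image and one task."""
    # Drop failed parses before voting
    valid = [v for v in votes if v not in na_values]
    if not valid:
        return None, 0.0
    # most_common keeps first-seen order among ties (assumed tie-break)
    winner, count = Counter(valid).most_common(1)[0]
    return winner, count / len(valid)


score, agreement = consensus([3, 3, "NA", 4, 3])  # -> (3, 0.75)
```

An agreement ratio of 1.0 means every valid run returned the same answer; values near 0.5 or below suggest the prompt or the image is ambiguous for the model.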
6. Chain-of-Thought Reasoning and Truncation Detection
UVLM supports two approaches to chain-of-thought (CoT) reasoning, both available in SAGAI v2.0. Users can write task prompts that explicitly request step-by-step reasoning and adjust the token budget (up to 1,500 tokens) to allow the model sufficient generation space. Alternatively, a built-in CoT reference mode can be enabled per task, which triggers a standardized reasoning template with a fixed 1,024-token budget. In both cases, the reasoning trace is stored in a dedicated column in the output CSV for inspection.
Truncation detection is performed automatically after every inference call. The exact number of generated tokens is compared against the token limit, and truncated responses are flagged in per-task CSV columns. This allows users to identify tasks where the token budget is insufficient without post-hoc analysis.
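The check itself reduces to a comparison between two token counts. The sketch below is a hypothetical illustration: the function names and the `<label>_truncated` column naming are assumptions, not UVLM's actual output schema.

```python
from typing import Dict


def is_truncated(n_generated_tokens: int, max_new_tokens: int) -> bool:
    """A response that used the whole budget was most likely cut off."""
    return n_generated_tokens >= max_new_tokens


def truncation_flags(
    generated: Dict[str, int], limits: Dict[str, int]
) -> Dict[str, bool]:
    """Build one flag per task, keyed by an assumed column name."""
    return {
        f"{label}_truncated": is_truncated(generated[label], limits[label])
        for label in generated
    }


flags = truncation_flags(
    generated={"greenery": 12, "reasoning": 1024},
    limits={"greenery": 64, "reasoning": 1024},
)
```

In this example only the reasoning task is flagged, which would point to raising its token budget rather than changing the short numeric task.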
7. Interactive Mapping with Folium
Previous SAGAI versions generated static thematic maps using Matplotlib. SAGAI v2.0 replaces these with interactive HTML maps built with Folium. Point-level and street-segment-level scores are rendered as interactive layers that can be panned, zoomed, and queried directly in the browser. This is particularly useful for exploratory analysis and for sharing results with collaborators who do not use GIS software.
8. View-Direction Filtering for Aggregation
Google Street View images are typically downloaded in multiple compass directions at each sampling point (e.g., front, back, left, right). In previous versions, all views were aggregated together when computing point- or street-level scores. SAGAI v2.0 introduces a view filter that allows users to select which directions to include in the aggregation — for example, scoring only left-side and right-side views to focus on building facades, or only front views to capture the pedestrian perspective along the street axis. This filter is applied at the aggregation stage and does not affect the scoring step itself.
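A minimal sketch of the filter, assuming each scored image is a record with a point ID, a view label, and a score, and that point-level scores are simple means. The field names, the view labels, and the choice of the mean are assumptions for illustration.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Set


def aggregate_by_point(rows: Iterable[dict], views: Set[str]) -> Dict[str, float]:
    """Average scores per sampling point, keeping only the selected views."""
    grouped: Dict[str, List[float]] = defaultdict(list)
    for row in rows:
        if row["view"] in views:  # the filter acts here, after scoring
            grouped[row["point_id"]].append(row["score"])
    return {pid: mean(scores) for pid, scores in grouped.items()}


rows = [
    {"point_id": "p1", "view": "front", "score": 1},
    {"point_id": "p1", "view": "left", "score": 4},
    {"point_id": "p1", "view": "right", "score": 2},
    {"point_id": "p2", "view": "left", "score": 5},
    {"point_id": "p2", "view": "right", "score": 3},
]

facades = aggregate_by_point(rows, {"left", "right"})
street_axis = aggregate_by_point(rows, {"front"})
```

Because the filter works on already-scored rows, switching from facade views to front views only requires re-running the aggregation, not the inference.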
9. Resume-Safe Batch Processing
The batch execution engine inherited from UVLM provides resume-safe processing with checkpoint saving every 3 images. If a Colab session is interrupted — due to a timeout, a runtime reset, or a connectivity issue — the notebook can be re-executed and will automatically skip already-processed images. New tasks added between runs trigger automatic CSV schema upgrading, so the output file grows incrementally without losing previous results.
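The resume logic can be sketched with the standard `csv` module. This is not UVLM's engine: `run_batch`, `score_fn`, the `image` key column, and the write-to-temp-then-rename save are assumptions chosen to illustrate skipping finished work, periodic checkpoints, and adding columns for new tasks.

```python
import csv
import os
from typing import Callable, Dict, List


def load_results(path: str) -> Dict[str, dict]:
    """Read an existing output CSV, keyed by image name."""
    if not os.path.exists(path):
        return {}
    with open(path, newline="") as f:
        return {row["image"]: row for row in csv.DictReader(f)}


def save_results(path: str, results: Dict[str, dict], columns: List[str]) -> None:
    """Write to a temp file, then swap it in, so a crash cannot corrupt the CSV."""
    tmp = path + ".tmp"
    with open(tmp, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["image"] + columns, restval="")
        writer.writeheader()
        writer.writerows(results.values())
    os.replace(tmp, path)


def run_batch(images: List[str], tasks: List[str],
              score_fn: Callable[[str, str], object],
              path: str, checkpoint_every: int = 3) -> Dict[str, dict]:
    results = load_results(path)
    # Schema upgrade: keep existing columns and append any new task labels
    old_cols = [c for row in results.values() for c in row if c != "image"]
    columns = list(dict.fromkeys(old_cols + list(tasks)))
    processed = 0
    for image in images:
        row = results.setdefault(image, {"image": image})
        missing = [t for t in tasks if not row.get(t)]
        if not missing:
            continue  # already fully processed in an earlier run
        for task in missing:
            row[task] = str(score_fn(image, task))
        processed += 1
        if processed % checkpoint_every == 0:
            save_results(path, results, columns)
    save_results(path, results, columns)
    return results
```

Calling `run_batch` twice with the same arguments does no new scoring on the second call; calling it again with an extra task only scores that task and adds its column to the same file.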
10. References and Links
- SAGAI v2.0 on GitHub: https://github.com/perezjoan/SAGAI
- UVLM on GitHub: https://github.com/perezjoan/UVLM
- Perez, J. and Fusco, G. (2025). Streetscape Analysis with Generative AI (SAGAI): Vision-Language Assessment and Mapping of Urban Scenes. Geomatica, 77(2), 100063. https://www.sciencedirect.com/science/article/pii/S1195103625000199
- Perez, J. and Fusco, G. (2026). UVLM: A Universal Vision-Language Model Loader for Reproducible Multimodal Benchmarking. arXiv:2603.13893. https://arxiv.org/abs/2603.13893