Highlights
  • Module 3 v2.0 is the refactored inference engine of SAGAI v1.1, designed for stable and reproducible vision–language analysis of streetscape images.
  • The new architecture relies exclusively on Hugging Face–native LLaVA models and APIs, removing dependencies on research codebases.
  • Multimodal prompting, image–text alignment, and inference are handled through standardized Transformers workflows, ensuring long-term compatibility.

Introduction

Module 3 is the inference core of the SAGAI (Streetscape Analysis with Generative Artificial Intelligence) framework. Its role is to transform large collections of street-level images into structured, quantitative outputs using vision–language models (VLMs), enabling systematic streetscape analysis and subsequent geospatial aggregation.

With SAGAI v1.1, Module 3 has been released in a new major version (Module 3 v2.0) that introduces a fully standardized and maintenance-safe inference architecture. This update reflects both the maturation of multimodal model ecosystems and the need for long-term reproducibility in large-scale urban analysis pipelines.

Earlier iterations of Module 3 were developed during a period of rapid evolution in both LLaVA research codebases and execution environments such as Google Colab. As multimodal models transitioned toward Transformers-native implementations distributed via Hugging Face, assumptions embedded in earlier hybrid workflows became increasingly difficult to sustain.

Module 3 v2.0 addresses this evolution by aligning the entire inference pipeline with official Hugging Face multimodal APIs. Model loading, prompt formatting, image–text fusion, and generation are now handled through maintained and versioned components, ensuring compatibility across environments, models, and future updates.

This document details the architectural context motivating the update, the design choices behind the refactored inference engine, and the rationale for releasing Module 3 v2.0 as a long-term, stable component of SAGAI v1.1.

1. Architectural Context of Module 3 in the Previous Version (SAGAI v1.0)

The initial implementation of Module 3 (SAGAI v1.0) relied on a hybrid architecture that mixed two incompatible sources of LLaVA code, combined with a rapidly evolving execution environment in Google Colab. This design choice made the pipeline fragile and ultimately unsustainable.

First, the pipeline simultaneously depended on the LLaVA GitHub repository (haotian-liu/LLaVA) and on Hugging Face–hosted model checkpoints. The GitHub repository is a research-oriented codebase under active development. Its internal APIs, class structures, and utilities evolve rapidly and are not version-locked. Constructors, module paths, and helper functions may change or disappear without notice, and the repository is not designed to maintain backward compatibility across releases.

At the same time, pretrained model weights were downloaded from Hugging Face. These checkpoints follow the Transformers-native multimodal format, using Hugging Face–specific configuration files, processors, and model classes (e.g., LlavaNextForConditionalGeneration, AutoProcessor, and chat templates). This architecture is fundamentally different from the internal design assumed by the GitHub LLaVA code, which relies on custom token insertion, internal vision tower management, and non-Transformers abstractions.

As a result, the pipeline operated in a structural mismatch: GitHub code expected architectural fields, model attributes, and tokenizer behavior that were not present in Hugging Face checkpoints, while Hugging Face checkpoints expected model wrappers and configuration logic that the GitHub code did not provide.

This fragility was exposed when Google Colab upgraded its backend environment in early 2025. Major changes included Python 3.12, NumPy ≥ 2.0 (introducing ABI-breaking changes for compiled extensions), newer PyTorch releases (≥ 2.2), and updated system libraries. These updates caused widespread failures in binary dependencies and research codebases that were not aligned with the new runtime.

In practice, this led to errors such as NumPy ABI incompatibilities, PyTorch extension failures, missing or renamed modules, and import errors in LLaVA GitHub utilities. Because the pipeline depended on both unstable research code and binary-sensitive extensions, even minor environment updates were sufficient to break execution.

2. Refactoring of the Inference Engine in SAGAI v1.1

Module 3 has been fully refactored to remove any dependency on the original LLaVA GitHub repository. The inference pipeline now relies exclusively on Hugging Face–native LLaVA models and APIs, ensuring long-term stability and compatibility with evolving software environments.

In the previous architecture, the script depended on cloning the LLaVA GitHub repository, installing it in editable mode, and importing internal modules (llava.*). Prompts were manually assembled using LLaVA-specific multimodal tokens (e.g., <im_start>, <image>), custom separators, and internal utilities. Image tokens and embeddings were explicitly inserted into the prompt, tightly coupling the forward pass to a specific implementation of the LLaVA codebase. As a result, updates to Google Colab, PyTorch, NumPy, or the LLaVA repository frequently introduced breaking changes.

The current implementation removes all such dependencies. Prompt formatting and multimodal input construction are now handled entirely through Hugging Face abstractions. Prompts are formatted using processor.apply_chat_template(), while images and text are combined using processor(images=…, text=…). Image embedding alignment, multimodal token placement, and chat formatting are fully managed by the Hugging Face processor and model configuration. Inference is performed using the standard model.generate() API, without any custom token handling or internal utilities.
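
As an illustration of this workflow, the sketch below assembles a chat-style prompt and runs generation through the standard Transformers classes. The model identifier, image path, and prompt wording are placeholders for illustration, not the exact values used by SAGAI.

```python
# Minimal sketch of the Hugging Face-native inference path described above.
# model_id, image path, and prompt text are illustrative placeholders.
import torch
from PIL import Image
from transformers import LlavaNextForConditionalGeneration, LlavaNextProcessor

model_id = "llava-hf/llava-v1.6-mistral-7b-hf"  # any HF LLaVA card with a chat template
processor = LlavaNextProcessor.from_pretrained(model_id)
model = LlavaNextForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("street_view.jpg")  # placeholder street-level image

# Chat-style prompt; the processor inserts the multimodal image tokens itself.
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Rate the sidewalk quality from 0 to 2."},
        ],
    }
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

# Fuse text and image into model-ready tensors, then generate.
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=64)
answer = processor.decode(output_ids[0], skip_special_tokens=True)
```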

This refactoring makes the SAGAI inference engine model-agnostic within the Hugging Face LLaVA ecosystem. The same forward pass is compatible with LLaVA-NeXT (v1.6), LLaVA-Interleave, LLaVA-OneVision, and future Hugging Face LLaVA releases that expose a processor and chat template. Switching between models or architectures requires only changing the model_id, with no modification to prompt logic or inference code.

To ensure reliable downstream analysis, Module 3 also includes a dedicated numeric output stabilization step. After decoding the model response, any prompt echoes or metadata—including residual [INST] … [/INST] segments—are removed. The final output is parsed using a simple regular expression to retain only numeric values (e.g., 0, 1, 2, 1.5). This guarantees clean, machine-readable outputs and a stable CSV format across all supported models.
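
The following sketch shows one possible implementation of this stabilization step; the actual regular expression used by Module 3 may differ.

```python
# Hedged sketch of the numeric output stabilization step: strip prompt echoes
# such as residual [INST] ... [/INST] blocks, then keep only the numeric score.
import re

def extract_score(raw_response: str) -> str | None:
    """Return the first numeric value (e.g. 0, 1, 2, 1.5) found in the response."""
    # Drop any echoed instruction block, e.g. "[INST] ... [/INST]".
    cleaned = re.sub(r"\[INST\].*?\[/INST\]", "", raw_response, flags=re.DOTALL)
    # Retain only the first integer or decimal value that remains.
    match = re.search(r"\d+(?:\.\d+)?", cleaned)
    return match.group(0) if match else None

print(extract_score("[INST] Rate the sidewalk ... [/INST] The score is 1.5"))  # -> 1.5
```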

Model loading has been simplified and standardized using official Hugging Face APIs. Both the processor and the model are instantiated directly from Hugging Face model cards via from_pretrained, with optional 4-bit quantization enabled through load_in_4bit=True. This eliminates the need for manual vision-tower initialization, deprecated classes, or custom C++ operators, and avoids common incompatibilities related to PyTorch, CUDA, or NumPy upgrades in Google Colab. Official Hugging Face code paths ensure that pretrained weights are always matched with the correct implementation.
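
A loading sketch under these conventions might look as follows; the model card is illustrative, and recent Transformers releases may prefer passing a BitsAndBytesConfig via quantization_config rather than the load_in_4bit shortcut.

```python
# Sketch of the standardized loading path: processor and model come directly
# from a Hugging Face model card, with optional 4-bit quantization.
import torch
from transformers import AutoProcessor, LlavaNextForConditionalGeneration

model_id = "llava-hf/llava-v1.6-mistral-7b-hf"  # illustrative model card

processor = AutoProcessor.from_pretrained(model_id)
model = LlavaNextForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
    load_in_4bit=True,  # optional BitsAndBytes 4-bit quantization
)
```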

Optional authentication using a Hugging Face access token is supported to avoid rate limits and improve download reliability when working with large checkpoints, though public models remain accessible without authentication.
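
A hedged sketch of this optional authentication is shown below; the environment variable name and token handling are illustrative choices, not requirements of the module.

```python
# Optional authentication sketch: register a Hugging Face access token so that
# large checkpoint downloads are less likely to hit rate limits.
import os
from huggingface_hub import login
from transformers import AutoProcessor

hf_token = os.environ.get("HF_TOKEN")  # illustrative: token exported beforehand
if hf_token:
    login(token=hf_token)  # registers the token for subsequent Hub downloads

# The token can also be passed directly to from_pretrained; public models
# remain accessible with token=None.
processor = AutoProcessor.from_pretrained(
    "llava-hf/llava-v1.6-mistral-7b-hf", token=hf_token
)
```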

Overall, this refactoring significantly improves robustness, reproducibility, and maintainability, while enabling systematic experimentation across multiple LLaVA variants and quantization settings within a unified inference framework.

3. Rationale for a Long-Term, Stable Release

The refactored inference system in Module 3 is designed as a long-term, maintenance-safe release. This is achieved by aligning the entire pipeline with Hugging Face’s officially supported multimodal APIs and model distribution mechanisms.

First, the new architecture is robust to Google Colab environment updates. All critical dependencies—Python (≥3.12), NumPy (≥2.0), PyTorch (2.x), CUDA wheels, and BitsAndBytes quantization—are now managed through Hugging Face Transformers and its dependency resolution. Because the model code, processor logic, and quantization pathways are maintained upstream, updates to Colab or its underlying libraries no longer break the inference pipeline. As long as Hugging Face continues to support the model card, the code remains functional without manual intervention.

Second, the system relies exclusively on official Hugging Face–maintained components. Core classes such as LlavaNextForConditionalGeneration, LlavaNextProcessor, chat templates, and multimodal preprocessing logic are all part of the Transformers library. These components are actively maintained, versioned, and tested by Hugging Face, providing a level of stability and backward compatibility that is not guaranteed when relying on research repositories or development branches.

Third, the new setup significantly improves reproducibility. Each run explicitly references a fixed Hugging Face model checkpoint via the model_id, ensuring that the same weights, architecture, and prompt template are used across sessions and machines. In addition, generation parameters (sampling strategy, temperature, nucleus sampling, and output length) are explicitly defined, enabling consistent and repeatable results across runs.
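
For example, a run configuration along these lines pins both the checkpoint reference and the decoding behaviour; the values shown are illustrative rather than the defaults shipped with Module 3.

```python
# Reproducibility sketch: a fixed model card plus explicit generation settings.
from transformers import set_seed

MODEL_ID = "llava-hf/llava-v1.6-mistral-7b-hf"  # fixed checkpoint reference

set_seed(42)  # fixes the sampling RNG across runs (illustrative seed)

GENERATION_CONFIG = {
    "do_sample": True,      # sampling strategy
    "temperature": 0.2,     # low temperature for near-deterministic scoring
    "top_p": 0.9,           # nucleus sampling threshold
    "max_new_tokens": 32,   # bound on output length
}

# Usage: output_ids = model.generate(**inputs, **GENERATION_CONFIG)
```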

Fourth, the architecture is easy to extend and experiment with. Switching between different LLaVA variants now requires changing a single configuration line (model_id). The same inference code supports LLaVA 1.5 models, LLaVA-NeXT (v1.6), Interleave models, OneVision models, and larger checkpoints (e.g., 13B or 34B), including variants based on Mistral, Vicuna, Qwen, or Yi backbones. No changes to prompt construction or forward-pass logic are required.

Finally, the multimodal pipeline is now cleanly abstracted and internally consistent. Hugging Face handles all low-level details, including image preprocessing, chat formatting, positional embeddings, image sequence length management, and attention masking. This eliminates a large class of subtle bugs related to tensor alignment and multimodal token placement, while ensuring that the vision and language components remain synchronized across model updates.

4. References and links
