23 September 2025, by Musa Samet Koca
When AI explains how it sees: language models as visual explainers
An innovative research approach for better comprehensibility of AI systems
The explainability of artificial intelligence (AI) has been a central issue in the development of trustworthy systems for years. Classic techniques such as attribution methods only highlight certain pixels or features, and non-technical users often need experts to verify or interpret the results. A new research approach goes one step further: it lets large language models (LLMs) such as ChatGPT become explainers for visual AI systems themselves – in natural language and with a comprehensible structure.
The concept is called Language Model as Visual Explainer (LVX). It combines vision models (image recognition) with language models to explain visual decisions in a structured way and without additional training.
The key point: why simply observing AI systems without understanding them is not enough
Deep learning models in image recognition are highly accurate, but often opaque.
A model classifies an image as a ‘dog,’ but why exactly? Did it recognise the ears? The body shape? Or was it perhaps just the background?
Traditional explanation methods, such as feature attribution, provide technical insights but are rarely understandable. Language-based explanations would be more intuitive, but usually require manual annotations, which are time-consuming and error-prone.
LVX proposes an elegant middle ground: it combines the precision of visual models with the world knowledge and language understanding of LLMs to generate a natural-language tree structure from the vision model's features. This tree serves as a reference for the model's decision logic.
How LVX works under the hood
The LVX approach can be divided into three phases: construction, refinement and application.
Construction: From concept to explanation
First, an LLM such as GPT-4 is asked for the typical attributes of an object class (‘What makes a dog?’). The answer: ‘A dog has four legs, a wet nose, bushy fur...’.
For each of these attributes, matching images are collected via image search or a text-to-image API (e.g. Bing Image Search or Stable Diffusion). These images are then analysed by the vision model, and the resulting embeddings serve as prototypes in a hierarchical decision tree.
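To make the construction phase concrete, here is a minimal Python sketch of the idea. The helpers query_llm, search_images and embed are placeholders for whatever LLM API, image source and frozen vision backbone are actually used – the paper describes this step conceptually, not as a fixed API.

```python
# Construction phase (sketch): ask an LLM for class attributes, collect
# example images per attribute and store the vision model's embeddings
# as prototypes. All helper names below are illustrative placeholders.

import numpy as np

def query_llm(prompt: str) -> list[str]:
    """Placeholder: call an LLM (e.g. GPT-4) and parse its answer
    into a list of attribute phrases."""
    ...

def search_images(query: str, n: int = 10) -> list:
    """Placeholder: fetch images from an image search engine or a
    text-to-image model for the given attribute description."""
    ...

def embed(images) -> np.ndarray:
    """Placeholder: run the frozen vision model and return one
    feature vector per image (shape: [n_images, dim])."""
    ...

def build_class_tree(class_name: str) -> dict:
    """Build an initial attribute tree for one class: the class is the
    root, each attribute is a child with a prototype embedding averaged
    over the collected images."""
    attributes = query_llm(f"List visual attributes that characterise a {class_name}.")
    tree = {"class": class_name, "children": []}
    for attribute in attributes:
        images = search_images(f"{class_name}, {attribute}")
        prototype = embed(images).mean(axis=0)  # one prototype per attribute
        tree["children"].append({"attribute": attribute, "prototype": prototype})
    return tree
```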
Refinement: What the model really recognises
The tree is then adjusted using real training data. If the model regularly recognises a certain attribute (e.g. ‘long ears’), the node is expanded. If another attribute is never activated, it is removed. This creates an individual decision tree for each class, representing the internal structure of the model.
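Continuing the sketch, the refinement step could look roughly like this: real training images of the class are embedded with the same vision model, matched to their nearest attribute prototype, and attributes that are almost never activated are pruned. The min_hits threshold is an illustrative assumption, not a value from the paper.

```python
# Refinement phase (sketch): check which attribute prototypes are actually
# activated by real training images of the class and prune the rest.
# Frequently activated attributes could likewise be expanded by querying
# the LLM again for sub-attributes (not shown here).

import numpy as np

def refine_class_tree(tree: dict, train_features: np.ndarray,
                      min_hits: int = 5) -> dict:
    """Keep only attribute nodes that real training data activates.

    train_features: embeddings of real training images of this class,
                    produced by the same frozen vision model ([n, dim]).
    min_hits:       illustrative threshold for keeping an attribute node.
    """
    prototypes = np.stack([child["prototype"] for child in tree["children"]])
    # Assign every training image to its nearest attribute prototype.
    distances = np.linalg.norm(
        train_features[:, None, :] - prototypes[None, :, :], axis=-1
    )
    nearest = distances.argmin(axis=1)
    hits = np.bincount(nearest, minlength=len(tree["children"]))

    # Drop attributes the model (almost) never uses; keep the rest.
    tree["children"] = [
        child for child, count in zip(tree["children"], hits) if count >= min_hits
    ]
    return tree
```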
Application: explanations in use
When a new image is fed into the model, its feature vector is routed through the tree, mirroring a decision-making process. The path from the root (e.g., ‘dog’) to the leaf (e.g., ‘short, brown fur’) provides a precise, understandable explanation of the model's decision.
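At inference time, the explanation is essentially the path of best-matching nodes. A simplified, single-level version of that routing, reusing the illustrative tree structure from the sketches above, might look like this:

```python
# Application phase (sketch): route a new image's feature vector to the
# nearest attribute prototype and return the path as a textual explanation.

import numpy as np

def explain(tree: dict, feature: np.ndarray) -> str:
    """Return a natural-language path from the class root to the
    best-matching attribute for one image embedding."""
    prototypes = np.stack([child["prototype"] for child in tree["children"]])
    best = int(np.linalg.norm(prototypes - feature, axis=1).argmin())
    attribute = tree["children"][best]["attribute"]
    return f"Classified as '{tree['class']}' because the image matches: {attribute}"

# Example usage (with the placeholder embed() from the construction sketch):
# feature = embed([new_image])[0]
# print(explain(dog_tree, feature))
```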
Why this is important: Explainability, performance and diagnosis
LVX offers several advantages:
- Plausibility: The decision trees generated are very consistent with human perception, as they describe visual features in a hierarchical structure that is intuitively understandable. This makes model decisions easier to understand.
- Accuracy: The explanations reflect the actual decision logic of the model because they are derived directly from the internal representations and are not merely constructed retrospectively. This prevents explanations from appearing ‘embellished’.
- Stability: Even with slightly altered inputs (e.g. noisy or partially cropped images), the explanations remain consistent. LVX does not randomly produce different explanations, but robustly maintains the same decision structure.
What makes LVX particularly exciting is that the explanations can also be used for model calibration: the tree assignments serve as pseudo labels, which improves the model's discriminative power. In experiments, accuracy increased measurably compared to classical methods.
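One plausible way to picture this calibration step in code is an auxiliary loss that uses the nearest attribute prototype as a pseudo label; the PyTorch sketch below is an assumption about a possible implementation, not the authors' exact loss.

```python
# Calibration idea (sketch): use the tree-assigned attribute prototype as a
# pseudo label and pull the image embedding towards it with an auxiliary loss.
# One plausible formulation, not necessarily the loss used in the paper.

import torch
import torch.nn.functional as F

def calibration_loss(features: torch.Tensor, prototypes: torch.Tensor) -> torch.Tensor:
    """features:   image embeddings from the vision model, shape [batch, dim]
    prototypes: attribute prototypes of the class tree, shape [n_attributes, dim]"""
    # Nearest prototype per image acts as the pseudo label.
    distances = torch.cdist(features, prototypes)   # [batch, n_attributes]
    pseudo_labels = distances.argmin(dim=1)          # [batch]
    # Treat negative distances as logits and apply cross-entropy,
    # sharpening the assignment to the chosen prototype.
    return F.cross_entropy(-distances, pseudo_labels)

# total_loss = classification_loss + lambda_calib * calibration_loss(feats, protos)
```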
Better understanding of error classifications
A model confuses a white shark with an orca? No problem. LVX shows that both creatures share ‘black dorsal fins’ and ‘white spots’, while the difference lies in the ‘missing tail fin’ feature. Developers can use such clues to retrain the vision model in a targeted way.
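Such a diagnosis can be read straight off the class trees, for example by comparing their attribute sets (again using the illustrative tree structure from the sketches above):

```python
# Diagnosis (sketch): compare the attribute trees of two classes that the
# model confuses to see which features they share and which separate them.

def compare_trees(tree_a: dict, tree_b: dict) -> dict:
    attrs_a = {child["attribute"] for child in tree_a["children"]}
    attrs_b = {child["attribute"] for child in tree_b["children"]}
    return {
        "shared": sorted(attrs_a & attrs_b),              # e.g. 'black dorsal fin'
        "only_" + tree_a["class"]: sorted(attrs_a - attrs_b),
        "only_" + tree_b["class"]: sorted(attrs_b - attrs_a),
    }

# compare_trees(white_shark_tree, orca_tree) would reveal the shared fin
# attributes and the distinguishing ones to target when retraining.
```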
Conclusion: Greater transparency thanks to AI for AI
LVX is more than just another tool for explanations. It represents a paradigm shift: the combination of language models and vision models allows neural networks to explain themselves in a form, structure and language that are also intuitive for humans.
For companies that rely on AI, this method offers new opportunities for:
- Transparent decisions in sensitive areas (e.g. autonomous systems)
- Optimisation and debugging in vision models
The study on the LVX method is an impressive step towards trustworthy and transparent AI and shows how far we can get when we not only train models, but also let them have their say.
We support you!
Would you like to not only use AI systems, but also understand them? We support you in developing explainable and trustworthy AI solutions – from selecting the right models to implementing innovative approaches such as Language Model as Visual Explainer.