23 September 2025, by Musa Samet Koca
When AI explains how it sees: language models as visual explainers
An innovative research approach for better comprehensibility of AI systems
The explainability of artificial intelligence (AI) has been a central issue in the development of trustworthy systems for years. Classic techniques such as attribution methods only highlight certain pixels or features, and non-technical users often need experts to verify or interpret the results. A new research approach goes one step further: it lets large language models (LLMs) such as ChatGPT become explainers for visual AI systems themselves – in natural language and with a comprehensible structure.
The concept is called Language Model as Visual Explainer (LVX). It combines vision models (image recognition) with language models to explain visual decisions in a structured way and without additional training.
The key point: why simply observing AI systems without understanding them is not enough
Deep learning models in image recognition are highly accurate, but often opaque.
A model classifies an image as a ‘dog,’ but why exactly? Did it recognise the ears? The body shape? Or was it perhaps just the background?
Traditional explanation methods, such as feature attribution, provide technical insights but are rarely understandable. Language-based explanations would be more intuitive, but usually require manual annotations, which are time-consuming and error-prone.
LVX proposes an elegant middle ground: it combines the precision of visual models with the world knowledge and language understanding of LLMs to generate a natural-language tree structure from the vision model's features. This tree serves as a reference for the model's decision logic.
How LVX works under the hood
The LVX approach can be divided into three phases: construction, refinement and application.
Construction: From concept to explanation
First, an LLM such as GPT-4 is asked for the typical attributes of an object class (‘What makes a dog?’). The answer: ‘A dog has four legs, a wet nose, bushy fur...’.
For each of these attributes, matching images are collected via image search or a text-to-image API (e.g. Bing Image Search or Stable Diffusion). These images are then analysed by the vision model, and the resulting embeddings serve as prototypes in a hierarchical decision tree.
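To make the construction phase concrete, here is a minimal Python sketch of the idea. The helpers query_llm, search_images and embed are placeholders for whatever LLM API, image source and frozen vision backbone are actually used – the paper describes this step conceptually, not as a fixed API.

```python
# Construction phase (sketch): ask an LLM for class attributes, collect
# example images per attribute and store the vision model's embeddings
# as prototypes. All helper names below are illustrative placeholders.

import numpy as np

def query_llm(prompt: str) -> list[str]:
    """Placeholder: call an LLM (e.g. GPT-4) and parse its answer
    into a list of attribute phrases."""
    ...

def search_images(query: str, n: int = 10) -> list:
    """Placeholder: fetch images from an image search engine or a
    text-to-image model for the given attribute description."""
    ...

def embed(images) -> np.ndarray:
    """Placeholder: run the frozen vision model and return one
    feature vector per image (shape: [n_images, dim])."""
    ...

def build_class_tree(class_name: str) -> dict:
    """Build an initial attribute tree for one class: the class is the
    root, each attribute is a child with a prototype embedding averaged
    over the collected images."""
    attributes = query_llm(f"List visual attributes that characterise a {class_name}.")
    tree = {"class": class_name, "children": []}
    for attribute in attributes:
        images = search_images(f"{class_name}, {attribute}")
        prototype = embed(images).mean(axis=0)  # one prototype per attribute
        tree["children"].append({"attribute": attribute, "prototype": prototype})
    return tree
```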
Refinement: What the model really recognises
The tree is then adjusted using real training data. If the model regularly recognises a certain attribute (e.g. ‘long ears’), the node is expanded. If another attribute is never activated, it is removed. This creates an individual decision tree for each class, representing the internal structure of the model.
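Continuing the sketch, the refinement step could look roughly like this: real training images of the class are embedded with the same vision model, matched to their nearest attribute prototype, and attributes that are almost never activated are pruned. The min_hits threshold is an illustrative assumption, not a value from the paper.

```python
# Refinement phase (sketch): check which attribute prototypes are actually
# activated by real training images of the class and prune the rest.
# Frequently activated attributes could likewise be expanded by querying
# the LLM again for sub-attributes (not shown here).

import numpy as np

def refine_class_tree(tree: dict, train_features: np.ndarray,
                      min_hits: int = 5) -> dict:
    """Keep only attribute nodes that real training data activates.

    train_features: embeddings of real training images of this class,
                    produced by the same frozen vision model ([n, dim]).
    min_hits:       illustrative threshold for keeping an attribute node.
    """
    prototypes = np.stack([child["prototype"] for child in tree["children"]])
    # Assign every training image to its nearest attribute prototype.
    distances = np.linalg.norm(
        train_features[:, None, :] - prototypes[None, :, :], axis=-1
    )
    nearest = distances.argmin(axis=1)
    hits = np.bincount(nearest, minlength=len(tree["children"]))

    # Drop attributes the model (almost) never uses; keep the rest.
    tree["children"] = [
        child for child, count in zip(tree["children"], hits) if count >= min_hits
    ]
    return tree
```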
Application: explanations in use
When a new image is fed into the model, its feature vector is routed through the tree, mirroring a decision-making process. The path from the root (e.g., ‘dog’) to the leaf (e.g., ‘short, brown fur’) provides a precise, understandable explanation of the model's decision.
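At inference time, the explanation is essentially the path of best-matching nodes. A simplified, single-level version of that routing, reusing the illustrative tree structure from the sketches above, might look like this:

```python
# Application phase (sketch): route a new image's feature vector to the
# nearest attribute prototype and return the path as a textual explanation.

import numpy as np

def explain(tree: dict, feature: np.ndarray) -> str:
    """Return a natural-language path from the class root to the
    best-matching attribute for one image embedding."""
    prototypes = np.stack([child["prototype"] for child in tree["children"]])
    best = int(np.linalg.norm(prototypes - feature, axis=1).argmin())
    attribute = tree["children"][best]["attribute"]
    return f"Classified as '{tree['class']}' because the image matches: {attribute}"

# Example usage (with the placeholder embed() from the construction sketch):
# feature = embed([new_image])[0]
# print(explain(dog_tree, feature))
```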
Why this is important: Explainability, performance and diagnosis
LVX offers several advantages:
- Plausibility: The decision trees generated are very consistent with human perception, as they describe visual features in a hierarchical structure that is intuitively understandable. This makes model decisions easier to understand.
- Accuracy: The explanations reflect the actual decision logic of the model because they are derived directly from the internal representations and are not merely constructed retrospectively. This prevents explanations from appearing ‘embellished’.
- Stability: Even with slightly altered inputs (e.g. noisy or partially cropped images), the explanations remain consistent. LVX does not randomly produce different explanations, but robustly maintains the same decision structure.
What makes LVX particularly exciting is that the explanations can also be used for model calibration: the tree assignments serve as pseudo labels, which improves the model's discriminative power. In experiments, accuracy increased measurably compared to classical methods.
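One plausible way to picture this calibration step in code is an auxiliary loss that uses the nearest attribute prototype as a pseudo label; the PyTorch sketch below is an assumption about a possible implementation, not the authors' exact loss.

```python
# Calibration idea (sketch): use the tree-assigned attribute prototype as a
# pseudo label and pull the image embedding towards it with an auxiliary loss.
# One plausible formulation, not necessarily the loss used in the paper.

import torch
import torch.nn.functional as F

def calibration_loss(features: torch.Tensor, prototypes: torch.Tensor) -> torch.Tensor:
    """features:   image embeddings from the vision model, shape [batch, dim]
    prototypes: attribute prototypes of the class tree, shape [n_attributes, dim]"""
    # Nearest prototype per image acts as the pseudo label.
    distances = torch.cdist(features, prototypes)   # [batch, n_attributes]
    pseudo_labels = distances.argmin(dim=1)          # [batch]
    # Treat negative distances as logits and apply cross-entropy,
    # sharpening the assignment to the chosen prototype.
    return F.cross_entropy(-distances, pseudo_labels)

# total_loss = classification_loss + lambda_calib * calibration_loss(feats, protos)
```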
Better understanding of error classifications
A model confuses a white shark with an orca? No problem. LVX shows that both creatures share ‘black dorsal fins’ and ‘white spots’, while the difference lies in the ‘missing tail fin’ feature. Developers can use such clues to retrain the vision model in a targeted way.
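Such a diagnosis can be read straight off the class trees, for example by comparing their attribute sets (again using the illustrative tree structure from the sketches above):

```python
# Diagnosis (sketch): compare the attribute trees of two classes that the
# model confuses to see which features they share and which separate them.

def compare_trees(tree_a: dict, tree_b: dict) -> dict:
    attrs_a = {child["attribute"] for child in tree_a["children"]}
    attrs_b = {child["attribute"] for child in tree_b["children"]}
    return {
        "shared": sorted(attrs_a & attrs_b),              # e.g. 'black dorsal fin'
        "only_" + tree_a["class"]: sorted(attrs_a - attrs_b),
        "only_" + tree_b["class"]: sorted(attrs_b - attrs_a),
    }

# compare_trees(white_shark_tree, orca_tree) would reveal the shared fin
# attributes and the distinguishing ones to target when retraining.
```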
Conclusion: Greater transparency thanks to AI for AI
LVX is more than just another tool for explanations. It represents a paradigm shift: the combination of language models and vision models allows neural networks to explain themselves in a form, structure and language that are also intuitive for humans.
For companies that rely on AI, this method offers new opportunities for:
- Transparent decisions in sensitive areas (e.g. autonomous systems)
- Optimisation and debugging in vision models
The study on the LVX method is an impressive step towards trustworthy and transparent AI and shows how far we can get when we not only train models, but also let them have their say.
We support you!
Would you like to not only use AI systems, but also understand them? We support you in developing explainable and trustworthy AI solutions – from selecting the right models to implementing innovative approaches such as Language Model as Visual Explainer.