10 September 2024, by Oleg Smolanko
GenAI and its applications
Generative Artificial Intelligence (GenAI) has revolutionised the technology experience. This became clear with the publication of the Transformer architecture and its attention mechanism in 'Attention Is All You Need', which marked a decisive turning point. At its core, GenAI enables machine-assisted creativity by allowing systems to generate content on their own, be it text-to-text, image-to-text/text-to-image or video-to-text/text-to-video. But what exactly is behind this exciting field of AI? As mentioned, GenAI is concerned with developing models that are able to generate new data and content. You can find a detailed definition, as well as insights into the distinctions between AI, deep learning and machine learning, in the article 'Generative AI what?'.
The importance of GenAI
But why is GenAI so important? The answer lies in the ability of these systems to generate new ideas, develop solutions and drive innovative approaches across a wide range of industries. An even more important reason for the great popularity of GenAI is the multimodal capability of these models: a language model that could previously only answer textual questions can now also respond to visual data. Given an input image, such a model can perform tasks like object recognition or segmentation; given a video, it can recognise concepts from the visual signal using only general knowledge.
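As a small, hedged illustration of this capability, the following Python sketch asks a multimodal model to locate objects named in free text, using the Hugging Face transformers library; the pipeline task and the 'google/owlvit-base-patch32' checkpoint are illustrative choices, not something this post prescribes.

```python
from transformers import pipeline

# Zero-shot object detection: the model locates objects described in
# free-form text, relying only on its general pre-trained knowledge.
detector = pipeline("zero-shot-object-detection",
                    model="google/owlvit-base-patch32")

# "street.jpg" is a placeholder for any local image file or URL.
results = detector("street.jpg",
                   candidate_labels=["car", "bicycle", "traffic light"])

for r in results:
    # Each hit has a text label, a confidence score and a bounding box.
    print(f"{r['label']}: {r['score']:.2f} at {r['box']}")
```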
There are already applications of GenAI in many areas, be it in medicine, where models help with complex diagnoses, or in speech and image processing. But how far can these models go, and which use cases can they cover? This blog post addresses this question in detail and provides an overview of the use cases of GenAI, in particular the handling of visual data and the possible applications in computer vision.
Essentially, GenAI enables the generation of new content based on what a model has learned from data. In contrast to conventional AI approaches, which analyse and classify existing data, GenAI produces new, original content: these models learn patterns and structures from existing data and can independently generate similar but novel content.
One of the most important application areas of GenAI is machine vision. A comprehensive definition of computer vision, and an explanation of how this discipline fits within deep learning, can be found in the blog post 'Computer vision for deep learning - a brief introduction'.
These models use multimodal approaches to process both visual and linguistic information. As a result, they can not only recognise objects in an image but also describe the context and the relationships between those objects, which makes it possible to analyse visual content much more precisely.
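As a rough sketch of this image-plus-language ability, the example below generates a free-text description of an image with a captioning model; the BLIP checkpoint named here is an assumption chosen purely for illustration.

```python
from transformers import pipeline

# Image captioning: the model describes the scene in natural language,
# including the objects it sees and how they relate to one another.
captioner = pipeline("image-to-text",
                     model="Salesforce/blip-image-captioning-base")

# "photo.jpg" is a placeholder for any local image file or URL.
caption = captioner("photo.jpg")[0]["generated_text"]
print(caption)  # e.g. "a dog lying on a sofa next to a laptop"
```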
Computer vision and GenAI: an overview
The multimodal capability of these models is used particularly effectively in computer vision, where it enhances, and in some cases completely takes over, established applications. These models are also known as 'Visual LLMs' and can be divided into three categories based on their input data. There are models that:
1. work exclusively with static images,
2. can process both static images and videos without sound, and
3. can additionally work with audio data.
An exemplary model in this context is Video-LLaMA, a multimodal language model that can understand both the visual and the audio content of a video. A practical demonstration of this model answering various visual comprehension questions is shown in the attached image.
Depending on the input format (video or image), Video-LLaMA, which can handle both, offers a range of possibilities, from the interpretation of visual data to the capture of temporal dynamics in videos.
In addition to Video-LLaMA, there are models that work exclusively with static images, such as MiniGPT-4, the LLaVA model series (including LLaVA-Plus and LLaVA-Med) and BLIP-2. As can be seen in the image, these models can handle various machine vision tasks: they are able to perform object recognition and object identification and to generate different content based on input images, including recipes, memes, advertising texts or literary texts. The range of objects covered by recognition and identification is also constantly being expanded. As mentioned in the introduction, GenAI is also used in medicine. One example in this category is the LLaVA-Med model, which is trained on visual biomedical data and can answer questions in the form of conversations or detailed descriptions.
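To show what working with such a static-image model can look like in code, here is a hedged sketch that prompts BLIP-2 (mentioned above) with an image and a question via the Hugging Face transformers classes; the 'Salesforce/blip2-opt-2.7b' checkpoint and the example file name are assumptions made for illustration.

```python
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b"
).to(device)

# "meal.jpg" is a placeholder image; the prompt follows BLIP-2's
# question-answering prompt format.
image = Image.open("meal.jpg")
prompt = "Question: Which ingredients can you see? Answer:"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(device)
generated = model.generate(**inputs, max_new_tokens=50)
print(processor.decode(generated[0], skip_special_tokens=True))
```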
However, there are also models that can process both static images and video data without sound, such as Video-ChatGPT and VideoChat. Some examples of these models are shown in the figure below. These models also support answering questions based on image and video input.
As the examples show, these models are versatile and can handle a wide range of tasks. From video understanding and conversational tasks to interpreting memes or recipes, they demonstrate their impressive capabilities with only visual input. Some models are even capable of converting a handwritten draft into a complete website.
GenAI and its future: where is it heading?
But we are only at the beginning of this era. How far these models can still go remains an open question, and their potential so far leaves room for speculation about the innovative heights they may yet reach. A look at the rapid evolution of these models in computer vision, on the other hand, makes clear that progress is being made at an astonishing pace: from traditional computer vision models such as the AlexNet and ResNet architectures, via the Vision Transformer (ViT), to today's Visual LLMs.

The Vision Transformer is an innovative architecture for machine vision based on the Transformer modelling approach. In contrast to CNNs, Vision Transformers do not use fixed hierarchies of feature-extraction blocks. Instead, they treat the image as a sequence of patches, which allows the model to capture both global and local information efficiently.

The Transformer model itself has broader applications and was introduced in 2017 by Vaswani et al. It has proven groundbreaking for processing sequences in applications such as machine translation, text generation and other natural language processing tasks. Unlike earlier architectures, the Transformer uses neither the 'recurrent' operator of Recurrent Neural Networks (RNNs) nor the 'convolution' operator of Convolutional Neural Networks (CNNs). Instead, it relies on the attention mechanism, which gives the model access to all parts of the input sequence simultaneously and significantly improves efficiency and parallelisation compared to RNNs. Transformers can process both sequential and non-sequential data, such as images in Vision Transformer architectures.
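To make the patch idea concrete, here is a minimal PyTorch sketch of the patch embedding at the input of a Vision Transformer; the dimensions (224-pixel images, 16-pixel patches, 768-dimensional embeddings) follow the common ViT-Base configuration, but the module is only an illustrative sketch, not a full ViT implementation.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Splits an image into fixed-size patches and projects each patch
    to an embedding vector, turning the image into a token sequence."""

    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A convolution whose kernel and stride both equal the patch size
        # extracts non-overlapping patches and embeds them in one step.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):
        x = self.proj(x)                     # (B, embed_dim, H/ps, W/ps)
        return x.flatten(2).transpose(1, 2)  # (B, num_patches, embed_dim)

tokens = PatchEmbedding()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 196, 768]): 196 patch tokens
```

These patch tokens are then fed through standard Transformer encoder blocks, whose attention mechanism lets every patch attend to every other patch and thus capture global context in a single layer.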
In view of this development of Visual LLMs, the question arises: do we still need traditional models in computer vision at all? There is no blanket answer. In fact, recent state-of-the-art models have been ViT-based models that perform some machine vision tasks without requiring specifically trained CNNs. An example is the ViLT model, which, as shown in the paper, can perform object recognition, among other tasks.
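Here is a hedged sketch of how ViLT can be used for visual question answering with the Hugging Face transformers classes; the 'dandelin/vilt-b32-finetuned-vqa' checkpoint and the example inputs are illustrative assumptions.

```python
from PIL import Image
from transformers import ViltProcessor, ViltForQuestionAnswering

processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-finetuned-vqa")
model = ViltForQuestionAnswering.from_pretrained("dandelin/vilt-b32-finetuned-vqa")

# "kitchen.jpg" is a placeholder for any image file.
image = Image.open("kitchen.jpg")
question = "How many cats are in the picture?"

# ViLT feeds image patches and text tokens into a single Transformer,
# without a separate CNN feature extractor.
encoding = processor(image, question, return_tensors="pt")
logits = model(**encoding).logits
print(model.config.id2label[logits.argmax(-1).item()])
```

Notably, the image enters the model as raw patch embeddings, which is exactly what allows ViLT to dispense with a separately trained CNN backbone.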
On the other hand, it is worth taking a closer look at the architectures of all these Visual LLMs: ViT-based components are an integral part of them, so these established models still play an important role in the ongoing development of Visual LLMs. Experiments with alternative architectures have so far only been carried out to a limited extent.
However, a significant, well-motivated change to the architecture could substantially increase the performance of these Visual LLMs. The ability not only to understand visual data but also to place it in a linguistic context has pushed the boundaries of what is possible, and the momentum of these advances suggests that many more remarkable developments lie ahead.
Conclusion: GenAI as a trailblazer for the future of AI
GenAI and multimodal models have shown impressive versatility, which manifests itself in areas such as machine vision and speech processing. These models offer creative solutions and opportunities for innovation. How they will develop remains open, and there is room for speculation about the innovative heights they may yet reach. The rapid development from traditional computer vision models to Visual LLMs suggests that further progress can be expected. GenAI is not just about technology, but a multifaceted journey through creativity and innovation.
Would you like to find out more about exciting topics from the world of adesso? Then take a look at our previous blog posts.