Visual language models might soon use LLMs to improve prompt learning

Soon, VLMs might learn how to recognize and use the data from our prompts to generate better visuals





Image of a visual language model infographic generated with DALL-E 3 in Copilot

AI can create visual content from our prompts, but the results are not always accurate, especially with free visual language models (VLMs). Push a free VLM toward intricate details and it will often fail to produce high-quality output, so there is a real need for VLMs that can generate better content. Sora, for example, is excellent at creating visuals, and a Chinese firm already wants to use it.

How will the LLMs improve the visual language models?

According to a Microsoft Research blog post, researchers are exploring how large language models (LLMs) can generate structured graphs for visual language models. To do this, they ask the LLM questions, restructure the answers, and then build structured graphs from them. The process needs some organization, since each graph has to capture the entity, its attributes, and the relationships between them.

To understand the process better, think of a specific animal and ask the AI questions about it to gather descriptions. Once you have that extra information, ask the AI to restructure and categorize it.
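As a rough illustration of that flow, here is a minimal Python sketch, not the researchers' actual code, that queries an LLM about an entity and reorganizes the answers into a small entity-attribute-relationship record (the query_llm helper is a hypothetical stand-in for a real API call):

```python
def query_llm(question: str) -> str:
    # Stand-in for a real LLM API call; returns a canned answer so the sketch runs.
    return f"(LLM answer to: {question})"

def build_structured_graph(entity: str) -> dict:
    # Step 1: collect free-form descriptions by asking targeted questions.
    questions = [
        f"What does a {entity} look like?",
        f"What distinctive parts does a {entity} have?",
        f"Where is a {entity} usually found?",
    ]
    descriptions = [query_llm(q) for q in questions]

    # Step 2: ask the LLM to reorganize those descriptions into a structured form.
    relationships = query_llm(
        f"Turn these descriptions of a {entity} into attributes "
        f"and the relationships between them: {descriptions}"
    )

    # Step 3: keep everything as a simple entity / attributes / relationships record.
    return {"entity": entity, "attributes": descriptions, "relationships": relationships}

print(build_structured_graph("red panda"))
```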

After getting the results, the researchers applied Hierarchical Prompt Tuning (HPT), a framework that organizes the content into levels. With it, visual language models learn to discern different kinds of data in a prompt, such as specific details, categories, and overall themes, which improves their ability to understand and process varied queries.
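To make the idea of levels concrete, here is our own rough sketch, not taken from the paper, of how prompt knowledge might be split into detail-, category-, and theme-level pieces before any tuning happens:

```python
def hierarchical_prompt(entity: str, attributes: list[str], theme: str) -> dict:
    # Detail level: one short phrase per attribute.
    # Category level: the entity itself.
    # Theme level: the broader domain the image belongs to.
    return {
        "detail_level": [f"a {entity} with {attr}" for attr in attributes],
        "category_level": f"a photo of a {entity}",
        "theme_level": f"an image in the domain of {theme}",
    }

prompt = hierarchical_prompt(
    "red panda",
    ["reddish-brown fur", "a ringed tail", "white facial markings"],
    "wildlife photography",
)
print(prompt)
```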

When the last step is over, the visual language model should be able to generate more accurate images from your prompts. It also works the other way around: the next time you need to analyze an image, you could have the VLM produce a description of it.

In a nutshell, the main goal of the research is to use an LLM to teach a visual language model to understand the details in a prompt so it can generate more accurate, realistic pictures. The second goal is to teach the VLM to identify the elements in a picture and describe them.

If you want to learn more about the research, check their GitHub page.

What are your thoughts? Are you excited about this research? Let us know in the comments.
