Vision to Text

Vision to Text#

Screenshot

Download the code from github. Make sure to have the models mistral and llava installed with Ollama.

The Interface allows uploading and rearranging images, which will be analyzed by the LLaVA vision model.

The image description will be used as input for the 7B Mistral model to (optional: modify the description) generate a text based on the image description.

The set prompts are basic examples and can be modified. Of course it’s also possible to change the models.