Code Samples#

Preparations#

To run the small Gradio examples, we need to set up a conda environment and install the required libraries.

# 1. Create a new conda environment with Python 3.10
conda create -n gradio-env python=3.10 -y

# 2. Activate the environment
conda activate gradio-env

# 3. Install PyTorch via pip
#    (for a CPU-only build, add: --index-url https://download.pytorch.org/whl/cpu)
pip install torch torchvision torchaudio

# 4. Install the remaining libraries
pip install gradio transformers pillow
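
To confirm that everything was installed correctly, you can run a small check script (optional; the file name check_setup.py is just an example):

# check_setup.py -- optional sanity check for the gradio-env environment
import torch
import gradio
import transformers
import PIL

print("torch:", torch.__version__)
print("gradio:", gradio.__version__)
print("transformers:", transformers.__version__)
print("pillow:", PIL.__version__)
print("CUDA available:", torch.cuda.is_available())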

Captioning#

Image captioning is the task of automatically generating a natural-language description for a given image. It sits at the intersection of computer vision (to “see” and interpret the picture) and natural language processing (to “speak” or “write” about what is seen).
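
If you want to try captioning before building the web interface, the short sketch below uses the Transformers image-to-text pipeline with the same BLIP model; the file name cat.jpg is only a placeholder for any local image:

from transformers import pipeline

# Load BLIP as an image-to-text pipeline (downloads the model on first use)
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")

# "cat.jpg" is a placeholder -- point this at any image file on disk
result = captioner("cat.jpg")
print(result[0]["generated_text"])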

Create a new file called captioning.py and copy the code below.

Then in a new terminal run the following two commands:

conda activate gradio-env

python captioning.py

import gradio as gr
from PIL import Image
import torch
from transformers import BlipProcessor, BlipForConditionalGeneration

# 1. Load BLIP processor and model
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
model.eval()

# 2. Define a captioning function
def generate_caption(image: Image.Image) -> str:
    """
    Takes a PIL Image from Gradio’s camera input and returns a text caption.
    """
    # Preprocess the image
    inputs = processor(images=image, return_tensors="pt")

    # Generate caption (you can adjust num_beams, max_length, etc. as needed)
    with torch.no_grad():
        out = model.generate(
            **inputs,
            num_beams=5,
            max_length=20
        )
    caption = processor.decode(out[0], skip_special_tokens=True)
    return caption

# 3. Set up Gradio interface with a webcam (camera) input
camera_input = gr.Image(type="pil")

iface = gr.Interface(
    fn=generate_caption,
    inputs=camera_input,
    outputs="text",
    title="BLIP Image Captioning (Webcam)",
    description="Point your webcam at something, click 'Submit', and BLIP will describe what it sees."
)

# 4. Launch the app
if __name__ == "__main__":
    iface.launch()
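
By default, launch() serves the app locally at http://127.0.0.1:7860. If you want to reach the demo from another device on your network, or share a temporary public link, launch() accepts the corresponding options (the same applies to the other examples below):

# Serve on all network interfaces instead of localhost only
iface.launch(server_name="0.0.0.0", server_port=7860)

# Or create a temporary public link via Gradio's share tunnel
iface.launch(share=True)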

Visual Question Answering#

Visual Question Answering (VQA) is a multimodal task that combines computer vision and natural language processing to answer questions about images. Given an image and a free‐form question (“What color is the cat?”; “Who is holding an umbrella?”; “How many people are in the kitchen?”), a VQA system must return a concise, correct answer in natural language (often a single word or short phrase).
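
As with captioning, you can try the model outside of Gradio first. The sketch below uses the Transformers visual-question-answering pipeline with the same BLIP VQA checkpoint; the image path kitchen.jpg is only a placeholder:

from transformers import pipeline

# Load BLIP VQA as a visual-question-answering pipeline
vqa = pipeline("visual-question-answering", model="Salesforce/blip-vqa-base")

# "kitchen.jpg" is a placeholder -- use any local image file
result = vqa(image="kitchen.jpg", question="How many people are in the kitchen?")
print(result[0]["answer"])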

Create a new file called vqa.py and copy the code below.

Then in a new terminal run the following two commands:

conda activate gradio-env

python vqa.py

import gradio as gr
from PIL import Image
import torch
from transformers import BlipProcessor, BlipForQuestionAnswering

# 1. Load BLIP VQA processor and model
processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")
model.eval()

# 2. Define a VQA‐style function
def answer_question(image: Image.Image, question: str) -> str:
    """
    Takes a PIL Image and a text question, 
    returns BLIP’s predicted answer.
    """
    # Preprocess both image and text
    inputs = processor(images=image, text=question, return_tensors="pt")

    # Generate answer (you can tune num_beams, max_length, etc.)
    with torch.no_grad():
        out = model.generate(
            **inputs,
            num_beams=5,
            max_length=10
        )
    answer = processor.decode(out[0], skip_special_tokens=True)
    return answer

# 3. Set up Gradio interface with a camera input + text box
camera_input = gr.Image(type="pil", label="Webcam Image")
text_input = gr.Textbox(lines=2, placeholder="Ask a question about the image…", label="Question")

iface = gr.Interface(
    fn=answer_question,
    inputs=[camera_input, text_input],
    outputs="text",
    title="BLIP VQA (Webcam)",
    description="Point your webcam at something, type a question, click 'Submit', and BLIP will answer."
)

# 4. Launch the app
if __name__ == "__main__":
    iface.launch()

Large Multimodal Models#

Large multimodal models combine the generative capabilities of large language models with vision models.

We use LLaVA together with Ollama.

First install Ollama. See the section Ollama.
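
Once Ollama and the Python client are installed (the commands are listed below), you can optionally confirm that the Ollama server is running and that the llava model responds. This is a standalone check, not part of llava.py; the image path test.png is only a placeholder:

import ollama

# List the locally available models -- requires the Ollama server to be running
print(ollama.list())

# Ask LLaVA about a local image ("test.png" is a placeholder path)
response = ollama.chat(
    model="llava",
    messages=[{"role": "user", "content": "What is in this picture?", "images": ["test.png"]}]
)
print(response.message.content)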

Create a new file called llava.py and copy the code below.

Then in a new terminal run the following commands:

ollama pull llava:latest

conda activate gradio-env

pip install ollama

python llava.py

import gradio as gr
from PIL import Image
import tempfile
import ollama

def llava_via_ollama_python(image: Image.Image, question: str) -> str:
    # 1. Save the incoming PIL Image to a temporary PNG on disk
    with tempfile.NamedTemporaryFile(suffix=".png", delete=False) as tmp:
        image.save(tmp.name)
        image_path = tmp.name
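
    # Note: delete=False keeps the file on disk after the "with" block so that
    # Ollama can read it; the file is not removed afterwards, so a long-running
    # app may want to clean it up (e.g. os.remove(image_path)) once the call returns.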

    # 2. Call ollama.chat with model="llava", 
    #    putting the temp‐file path in the "images" list.
    #
    #    According to the Ollama Python docs, you can pass a path-like string
    #    in the "images" field, and the model will load that file as input.
    response = ollama.chat(
        model="llava",
        messages=[{
            "role": "user",
            "content": question,
            "images": [image_path]
        }]
    )

    # 3. Extract LLaVA’s answer from the response object
    #    The .message.content field contains the text output.
    return response.message.content.strip()

# 4. Build a Gradio interface: one Image input + one Textbox, output plain text.
iface = gr.Interface(
    fn=llava_via_ollama_python,
    inputs=[
        gr.Image(type="pil", label="Image"),
        gr.Textbox(lines=2, placeholder="Ask a question about the image…", label="Question")
    ],
    outputs="text",
    title="LLaVA via Ollama (Python SDK)",
    description=(
        "Upload or capture an image and type a question. "
        "Under the hood, this saves the image to a temp file and calls:\n\n"
        "`ollama.chat(model='llava', messages=[{ 'role':'user', 'content':<question>, 'images':[<temp_path>] }])`"
    )
)

if __name__ == "__main__":
    iface.launch()