Code Samples#
Preparations#
To run the small Gradio examples below, we need to set up a conda
environment and install the required libraries.
# 1. Create a new conda environment with Python 3.10
conda create -n gradio-env python=3.10 -y
# 2. Activate the environment
conda activate gradio-env
# 3. Install Torch (CPU‐only) via pip
pip install torch torchvision torchaudio
# 4. Install the remaining libraries
pip install gradio transformers pillow
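Optionally, you can verify that everything was installed into the environment with a short sanity check (a minimal sketch; the file name check.py is just a suggestion):
# check.py: print the versions of the installed libraries
import torch
import gradio
import transformers
import PIL

print("torch:", torch.__version__)
print("gradio:", gradio.__version__)
print("transformers:", transformers.__version__)
print("pillow:", PIL.__version__)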
Captioning#
Image captioning is the task of automatically generating a natural-language description for a given image. It sits at the intersection of computer vision (to “see” and interpret the picture) and natural language processing (to “speak” or “write” about what is seen).
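Before wrapping the model in a web UI, it can help to run the bare captioning pipeline once. The following is a minimal sketch that assumes the libraries from the Preparations section; example.jpg is a placeholder name, substitute any image on disk.
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Load the same BLIP checkpoint used in the Gradio app below
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

# example.jpg is a placeholder; point it at any image file
image = Image.open("example.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")
out = model.generate(**inputs, max_length=20)
print(processor.decode(out[0], skip_special_tokens=True))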
Create a new file called captioning.py
and copy the code below.
Then in a new terminal run the following two commands:
conda activate gradio-env
python captioning.py
import gradio as gr
from PIL import Image
import torch
from transformers import BlipProcessor, BlipForConditionalGeneration
# 1. Load BLIP processor and model
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
model.eval()
# 2. Define a captioning function
def generate_caption(image: Image.Image) -> str:
"""
Takes a PIL Image from Gradio’s camera input and returns a text caption.
"""
# Preprocess the image
inputs = processor(images=image, return_tensors="pt")
# Generate caption (you can adjust num_beams, max_length, etc. as needed)
with torch.no_grad():
out = model.generate(
**inputs,
num_beams=5,
max_length=20
)
caption = processor.decode(out[0], skip_special_tokens=True)
return caption
# 3. Set up Gradio interface with a webcam (camera) input
camera_input = gr.Image(type="pil")
iface = gr.Interface(
    fn=generate_caption,
    inputs=camera_input,
    outputs="text",
    title="BLIP Image Captioning (Webcam)",
    description="Point your webcam at something, click 'Submit', and BLIP will describe what it sees."
)
# 4. Launch the app
if __name__ == "__main__":
    iface.launch()
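By default gr.Image accepts file uploads as well as webcam captures. If you want the component to offer only the webcam, recent Gradio releases (4.x) accept a sources argument; older 3.x releases used source="webcam" instead, so adjust to the version you installed.
# Webcam-only input component (Gradio 4.x style; check your installed version)
camera_input = gr.Image(type="pil", sources=["webcam"], label="Webcam")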
Visual Question Answering#
Visual Question Answering (VQA) is a multimodal task that combines computer vision and natural language processing to answer questions about images. Given an image and a free‐form question (“What color is the cat?”; “Who is holding an umbrella?”; “How many people are in the kitchen?”), a VQA system must return a concise, correct answer in natural language (often a single word or short phrase).
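As with captioning, the model can be tried outside of Gradio first. A minimal sketch, again assuming a placeholder image file example.jpg and an example question:
from PIL import Image
from transformers import BlipProcessor, BlipForQuestionAnswering

processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")

# example.jpg and the question are placeholders
image = Image.open("example.jpg").convert("RGB")
inputs = processor(images=image, text="What color is the cat?", return_tensors="pt")
out = model.generate(**inputs, max_length=10)
print(processor.decode(out[0], skip_special_tokens=True))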
Create a new file called vqa.py
and copy the code below.
Then in a new terminal run the following two commands:
conda activate gradio-env
python vqa.py
import gradio as gr
from PIL import Image
import torch
from transformers import BlipProcessor, BlipForQuestionAnswering
# 1. Load BLIP VQA processor and model
processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")
model.eval()
# 2. Define a VQA‐style function
def answer_question(image: Image.Image, question: str) -> str:
"""
Takes a PIL Image and a text question,
returns BLIP’s predicted answer.
"""
# Preprocess both image and text
inputs = processor(images=image, text=question, return_tensors="pt")
# Generate answer (you can tune num_beams, max_length, etc.)
with torch.no_grad():
out = model.generate(
**inputs,
num_beams=5,
max_length=10
)
answer = processor.decode(out[0], skip_special_tokens=True)
return answer
# 3. Set up Gradio interface with a camera input + text box
camera_input = gr.Image(type="pil", label="Webcam Image")
text_input = gr.Textbox(lines=2, placeholder="Ask a question about the image…", label="Question")
iface = gr.Interface(
    fn=answer_question,
    inputs=[camera_input, text_input],
    outputs="text",
    title="BLIP VQA (Webcam)",
    description="Point your webcam at something, type a question, click 'Submit', and BLIP will answer."
)
# 4. Launch the app
if __name__ == "__main__":
    iface.launch()
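The Preparations section installs a CPU build of Torch, which is sufficient for these demos. If you happen to have a CUDA-enabled PyTorch install, both BLIP scripts can be moved to the GPU with a few extra lines (a sketch, not part of the scripts above):
# After loading the model, move it to the GPU if one is available
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

# ...and inside the function, move the preprocessed tensors to the same device
inputs = processor(images=image, text=question, return_tensors="pt").to(device)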
Large Multimodal Models#
Larger multimodal models combine the generative capabilities of large language models with vision models.
We use LLaVA together with Ollama.
First install Ollama; see section Ollama.
Create a new file called llava.py
and copy the code below.
Then in a new terminal run the following commands:
ollama pull llava:latest
conda activate gradio-env
pip install ollama
python llava.py
import gradio as gr
from PIL import Image
import tempfile
import ollama
def llava_via_ollama_python(image: Image.Image, question: str) -> str:
    # 1. Save the incoming PIL Image to a temporary PNG on disk
    with tempfile.NamedTemporaryFile(suffix=".png", delete=False) as tmp:
        image.save(tmp.name)
        image_path = tmp.name
    # 2. Call ollama.chat with model="llava",
    #    putting the temp-file path in the "images" list.
    #    The Ollama Python client accepts a path-like string in the "images"
    #    field, and the model will load that file as input.
    response = ollama.chat(
        model="llava",
        messages=[{
            "role": "user",
            "content": question,
            "images": [image_path]
        }]
    )
    # 3. Extract LLaVA’s answer from the response object.
    #    The .message.content field contains the text output.
    return response.message.content.strip()
# 4. Build a Gradio interface: one Image input + one Textbox, output plain text.
iface = gr.Interface(
    fn=llava_via_ollama_python,
    inputs=[
        gr.Image(type="pil", label="Image"),
        gr.Textbox(lines=2, placeholder="Ask a question about the image…", label="Question")
    ],
    outputs="text",
    title="LLaVA via Ollama (Python SDK)",
    description=(
        "Upload or capture an image and type a question. "
        "Under the hood, this saves the image to a temp file and calls:\n\n"
        "`ollama.chat(model='llava', messages=[{ 'role':'user', 'content':<question>, 'images':[<temp_path>] }])`"
    )
)
if __name__ == "__main__":
    iface.launch()
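If the app fails, the usual causes are that the Ollama server is not running or that the llava model has not been pulled yet. A quick check from Python (assuming the ollama package installed above):
import ollama

# Ask the local Ollama server which models it knows about; this raises a
# connection error if the server is not reachable.
print(ollama.list())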