Gemini API Multimodal Developer Guide: Image, Video, and Document Analysis with Code Examples
Why Gemini’s Multimodal API Changes Application Development
Most LLM APIs started as text-in, text-out systems. Image understanding was bolted on later. Gemini was built multimodal from the ground up — it processes text, images, video, audio, and documents natively within the same model. This means you do not need separate OCR services for documents, computer vision models for images, or speech-to-text for audio. One API call to Gemini handles all of it.
For developers, this collapses the infrastructure stack. A receipt scanning application that previously needed Tesseract OCR + a classification model + a text extraction pipeline becomes a single Gemini API call with a photo of the receipt. A video summarization tool that needed frame extraction + image captioning + text summarization becomes one API call with the video file.
The Gemini 2.5 Pro model supports up to 1 million tokens of context, which means you can process hundreds of images, full-length videos, or massive document sets in a single request. This guide covers practical patterns for building multimodal applications.
Getting Started
SDK Installation
# Python
pip install google-generativeai

# Node.js
npm install @google/generative-ai
Basic Setup
import os
import google.generativeai as genai

# Read the key from an environment variable rather than hardcoding it
genai.configure(api_key=os.environ["GEMINI_API_KEY"])
model = genai.GenerativeModel("gemini-2.5-pro")
Your First Multimodal Call
import PIL.Image

image = PIL.Image.open("product_photo.jpg")

response = model.generate_content([
    "Describe this product photo. What is the product, what material "
    "is it made of, what color options can you identify, and estimate "
    "the price range based on the apparent quality.",
    image,
])

print(response.text)
Image Analysis Patterns
Product Image Analysis
import json

def analyze_product_image(image_path):
    image = PIL.Image.open(image_path)
    response = model.generate_content([
        """Analyze this product image and return a JSON object with:
        {
          "product_type": "category of product",
          "material": "primary material",
          "colors": ["list of colors visible"],
          "condition": "new/used/damaged",
          "brand_visible": "brand name if visible, null if not",
          "estimated_price_range": "$X - $Y",
          "quality_indicators": ["list of quality signals"],
          "suggested_tags": ["e-commerce search tags"]
        }
        Return only the JSON, with no surrounding text or code fences.""",
        image,
    ])
    return json.loads(response.text)
Multi-Image Comparison
def compare_images(image_paths, comparison_criteria):
    images = [PIL.Image.open(p) for p in image_paths]
    prompt = f"""Compare these {len(images)} images based on:
    {comparison_criteria}

    For each image, rate each criterion on a 1-10 scale.
    Return as a comparison table in markdown format.
    Include an overall recommendation."""
    response = model.generate_content([prompt] + images)
    return response.text

# Usage
result = compare_images(
    ["design_a.png", "design_b.png", "design_c.png"],
    "Visual appeal, readability, brand consistency, mobile-friendliness",
)
Screenshot Analysis for QA
def qa_screenshot(screenshot_path, expected_behavior):
    image = PIL.Image.open(screenshot_path)
    response = model.generate_content([
        f"""You are a QA engineer reviewing this screenshot.

        Expected behavior: {expected_behavior}

        Analyze the screenshot and report:
        1. Does it match the expected behavior? (PASS/FAIL)
        2. Visual issues: misalignment, overlapping elements, cut-off text
        3. Content issues: typos, placeholder text, incorrect data
        4. Accessibility: contrast issues, missing labels, small text
        5. Responsive design: does the layout look correct for this viewport?

        Return as a structured QA report.""",
        image,
    ])
    return response.text
Document Processing
Receipt and Invoice Extraction
import json

def extract_receipt(image_path):
    image = PIL.Image.open(image_path)
    response = model.generate_content([
        """Extract all information from this receipt/invoice and return
        as JSON:
        {
          "vendor_name": "",
          "vendor_address": "",
          "date": "YYYY-MM-DD",
          "items": [
            {"name": "", "quantity": 1, "unit_price": 0.00, "total": 0.00}
          ],
          "subtotal": 0.00,
          "tax": 0.00,
          "total": 0.00,
          "payment_method": "",
          "currency": ""
        }
        If any field is not visible or unclear, set it to null.
        For items, extract every line item visible.
        Return only the JSON, with no surrounding text or code fences.""",
        image,
    ])
    return json.loads(response.text)
PDF Document Analysis
def analyze_pdf(pdf_path, questions):
    # Upload the file
    uploaded_file = genai.upload_file(pdf_path)
    prompt = f"""Analyze this PDF document and answer the following questions:
    {chr(10).join(f'{i+1}. {q}' for i, q in enumerate(questions))}

    For each answer:
    - Cite the specific page number and section
    - Quote the relevant text
    - If the document does not contain the answer, say so explicitly"""
    response = model.generate_content([prompt, uploaded_file])
    return response.text

# Usage
answers = analyze_pdf(
    "contract.pdf",
    [
        "What is the contract start and end date?",
        "What are the payment terms?",
        "Are there any automatic renewal clauses?",
        "What are the termination conditions?",
    ],
)
Form Data Extraction
import json

def extract_form(form_image_path):
    image = PIL.Image.open(form_image_path)
    response = model.generate_content([
        """This is a filled-out form. Extract all field labels and
        their corresponding values. Return as a JSON object where
        keys are field labels and values are the filled-in data.
        For checkboxes, return true/false.
        For handwritten text, do your best to read it and flag
        any uncertain readings with [uncertain: best guess].

        Also identify:
        - Any required fields left blank
        - Any fields with potentially invalid data
        - Signatures (note as "signature present" or "signature missing")""",
        image,
    ])
    return json.loads(response.text)
Video Analysis
Video Summarization
import time

def summarize_video(video_path, focus=None):
    uploaded_video = genai.upload_file(video_path)

    # Wait for server-side processing to finish
    while uploaded_video.state.name == "PROCESSING":
        time.sleep(5)
        uploaded_video = genai.get_file(uploaded_video.name)

    prompt = (
        "Summarize this video. Include: main topics covered, "
        "key points made, any data or statistics mentioned, "
        "and the overall tone/purpose of the video."
    )
    if focus:
        prompt += f" Focus especially on: {focus}"

    response = model.generate_content([prompt, uploaded_video])
    return response.text
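The same polling loop recurs in the video and document examples in this guide, so it is worth factoring out. The sketch below is not part of the SDK: `wait_for_active` is a name introduced here, and the lookup function is injected (pass `genai.get_file`) so the helper stays independent of the API and easy to stub in tests.

```python
import time

def wait_for_active(file, get_file, poll_seconds=5, timeout=600):
    """Poll an uploaded file until it leaves the PROCESSING state.

    `get_file` is the lookup callable (pass genai.get_file); injecting
    it keeps this helper decoupled from the SDK.
    """
    waited = 0
    while file.state.name == "PROCESSING":
        if waited >= timeout:
            raise TimeoutError(f"{file.name} still processing after {timeout}s")
        time.sleep(poll_seconds)
        waited += poll_seconds
        file = get_file(file.name)
    if file.state.name != "ACTIVE":
        raise RuntimeError(f"Upload ended in state {file.state.name}")
    return file
```

With this in place, the wait in `summarize_video` collapses to `uploaded_video = wait_for_active(uploaded_video, genai.get_file)`.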
Video Content Moderation
import time

def moderate_video(video_path):
    uploaded_video = genai.upload_file(video_path)

    # Wait for server-side processing to finish
    while uploaded_video.state.name == "PROCESSING":
        time.sleep(5)
        uploaded_video = genai.get_file(uploaded_video.name)

    response = model.generate_content([
        """Review this video for content moderation. Check for:
        1. Explicit or violent content
        2. Hate speech or discriminatory language
        3. Dangerous activities or misinformation
        4. Copyright-protected music or content
        5. Personal information exposure (faces, addresses, phone numbers)

        Return a moderation report with:
        - Overall safety rating: SAFE / REVIEW_NEEDED / BLOCKED
        - Timestamps of any flagged content
        - Category of each flag
        - Confidence level (HIGH/MEDIUM/LOW)
        - Recommended action""",
        uploaded_video,
    ])
    return response.text
Mixed-Media Pipelines
Multi-Document Comparison
import time

def compare_documents(doc_paths, comparison_focus):
    files = [genai.upload_file(p) for p in doc_paths]

    # Re-fetch each file until processing finishes. Write the fresh handle
    # back into the list: reassigning the loop variable alone would leave
    # stale (possibly still-PROCESSING) handles in `files`.
    for i, f in enumerate(files):
        while f.state.name == "PROCESSING":
            time.sleep(2)
            f = genai.get_file(f.name)
        files[i] = f

    prompt = f"""Compare these {len(files)} documents.
    Focus on: {comparison_focus}

    For each document, extract the key positions and data points
    related to the focus area. Then create a comparison matrix
    showing where the documents agree and disagree.
    Highlight any contradictions between documents."""
    response = model.generate_content([prompt] + files)
    return response.text
Image + Text Product Listing Generator
def generate_listing(product_images, product_info):
    images = [PIL.Image.open(p) for p in product_images]
    prompt = f"""Based on these product photos and the following info:
    {product_info}

    Generate a complete e-commerce listing:
    1. Product title (SEO-optimized, under 80 characters)
    2. Bullet points (5 key features from the images)
    3. Product description (200 words, highlight visible features)
    4. Suggested categories and tags
    5. Suggested price range based on apparent quality

    Base your description on what you can actually see in the images.
    Do not make claims about features that are not visible."""
    response = model.generate_content([prompt] + images)
    return response.text
Production Optimization
Structured Output with JSON Mode
response = model.generate_content(
    [prompt, image],
    generation_config=genai.GenerationConfig(
        response_mime_type="application/json",
        response_schema={
            "type": "object",
            "properties": {
                "product_name": {"type": "string"},
                "price": {"type": "number"},
                "category": {"type": "string"},
            },
        },
    ),
)
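Because `response_mime_type` is set to `application/json`, `response.text` is a raw JSON string conforming to the schema, so it parses directly with no markdown fences to strip. The payload below is illustrative, showing the shape a real `response.text` would take:

```python
import json

# Illustrative payload in the shape the schema above enforces;
# a real response.text would be parsed the same way.
sample_response_text = '{"product_name": "Ceramic Mug", "price": 12.5, "category": "Kitchenware"}'
listing = json.loads(sample_response_text)
print(listing["product_name"], listing["price"])  # → Ceramic Mug 12.5
```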
Batch Processing
import asyncio
import PIL.Image

async def batch_analyze_images(image_paths, prompt, max_concurrent=5):
    # Bound concurrency so we stay under API rate limits
    semaphore = asyncio.Semaphore(max_concurrent)

    async def process_one(path):
        async with semaphore:
            image = PIL.Image.open(path)
            response = await model.generate_content_async([prompt, image])
            return {"path": path, "result": response.text}

    tasks = [process_one(p) for p in image_paths]
    return await asyncio.gather(*tasks)
Cost Management
| Input Type | Token Cost | Optimization |
|---|---|---|
| Text | ~1 token per 4 chars | Keep prompts concise |
| Image | ~258 tokens per image | Resize to needed resolution |
| Video | ~258 tokens per frame | Reduce FPS, trim unnecessary segments |
| PDF | ~258 tokens per page | Extract only relevant pages |
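The figures in this table can be turned into a rough pre-flight estimate. The function below is a sketch (`estimate_request_tokens` is a name introduced here, not an SDK call): all figures are approximations, and actual billing depends on the model and current pricing.

```python
def estimate_request_tokens(prompt_chars=0, images=0, video_seconds=0,
                            pdf_pages=0, frames_per_second=1):
    """Rough input-token estimate from the per-modality figures above.

    Approximations: ~4 chars per text token; ~258 tokens per image,
    per sampled video frame, and per PDF page.
    """
    tokens = prompt_chars // 4
    tokens += images * 258
    tokens += int(video_seconds * frames_per_second) * 258
    tokens += pdf_pages * 258
    return tokens

# A 400-character prompt with three images:
print(estimate_request_tokens(prompt_chars=400, images=3))  # → 874
```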
Frequently Asked Questions
What image formats does Gemini accept?
PNG, JPEG, GIF, WebP, and HEIC. For best results, use JPEG or PNG. Maximum image size is 20 MB per image.
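A lightweight pre-flight check against these constraints can catch bad inputs before you spend an API call. This sketch checks only extension and file size (`check_image` and `ACCEPTED_EXTS` are names introduced here for illustration):

```python
import os

# Accepted formats and the 20 MB cap come from the FAQ answer above.
ACCEPTED_EXTS = {".png", ".jpg", ".jpeg", ".gif", ".webp", ".heic"}
MAX_BYTES = 20 * 1024 * 1024

def check_image(path):
    """Raise ValueError if an image is an unsupported format or too large."""
    ext = os.path.splitext(path)[1].lower()
    if ext not in ACCEPTED_EXTS:
        raise ValueError(f"Unsupported format: {ext}")
    if os.path.getsize(path) > MAX_BYTES:
        raise ValueError(f"Image exceeds 20 MB: {path}")
    return True
```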
Can Gemini process video in real-time?
Not in real-time. Video must be uploaded and processed before analysis. Processing time depends on video length — typically 1-3 minutes for a 5-minute video.
How accurate is document extraction compared to dedicated OCR?
For clearly printed text, Gemini’s accuracy is comparable to Tesseract and Google Cloud Vision. For handwritten text, it is often more accurate because it uses contextual understanding, not just character recognition.
Can I process multiple images in one API call?
Yes. You can include multiple images in a single request. The 1M token context window can handle hundreds of images. Each image costs approximately 258 tokens.
Is the Gemini API available in all regions?
Check Google’s current API availability. Some features and models may have geographic restrictions. The API is broadly available in most markets.
How does Gemini multimodal compare to GPT-4o Vision?
Both are strong multimodal models. Gemini’s advantages: larger context window (1M vs 128K tokens), native video support, and competitive pricing. GPT-4o’s advantages: slightly better at complex reasoning about images, larger developer ecosystem.