Gemini API Multimodal Developer Guide: Image, Video, and Document Analysis with Code Examples
Why Gemini’s Multimodal API Changes Application Development
Most LLM APIs started as text-in, text-out systems. Image understanding was bolted on later. Gemini was built multimodal from the ground up — it processes text, images, video, audio, and documents natively within the same model. This means you do not need separate OCR services for documents, computer vision models for images, or speech-to-text for audio. One API call to Gemini handles all of it.
For developers, this collapses the infrastructure stack. A receipt scanning application that previously needed Tesseract OCR + a classification model + a text extraction pipeline becomes a single Gemini API call with a photo of the receipt. A video summarization tool that needed frame extraction + image captioning + text summarization becomes one API call with the video file.
The Gemini 2.5 Pro model supports up to 1 million tokens of context, which means you can process hundreds of images, full-length videos, or massive document sets in a single request. This guide covers practical patterns for building multimodal applications.
Getting Started
SDK Installation
# Python
pip install google-generativeai

# Node.js
npm install @google/generative-ai
Basic Setup
import os
import google.generativeai as genai

# Read the key from an environment variable rather than hardcoding it
genai.configure(api_key=os.environ["GEMINI_API_KEY"])
model = genai.GenerativeModel("gemini-2.5-pro")
Your First Multimodal Call
import PIL.Image

image = PIL.Image.open("product_photo.jpg")

response = model.generate_content([
    "Describe this product photo. What is the product, what material "
    "is it made of, what color options can you identify, and estimate "
    "the price range based on the apparent quality.",
    image,
])

print(response.text)
Image Analysis Patterns
Product Image Analysis
import json

def analyze_product_image(image_path):
    image = PIL.Image.open(image_path)
    response = model.generate_content([
        """Analyze this product image and return a JSON object with:
        {
          "product_type": "category of product",
          "material": "primary material",
          "colors": ["list of colors visible"],
          "condition": "new/used/damaged",
          "brand_visible": "brand name if visible, null if not",
          "estimated_price_range": "$X - $Y",
          "quality_indicators": ["list of quality signals"],
          "suggested_tags": ["e-commerce search tags"]
        }
        Return only the JSON, with no surrounding text or code fences.""",
        image,
    ])
    return json.loads(response.text)
Multi-Image Comparison
def compare_images(image_paths, comparison_criteria):
    images = [PIL.Image.open(p) for p in image_paths]
    prompt = f"""Compare these {len(images)} images based on:
    {comparison_criteria}

    For each image, rate each criterion on a 1-10 scale.
    Return as a comparison table in markdown format.
    Include an overall recommendation."""
    response = model.generate_content([prompt] + images)
    return response.text

# Usage
result = compare_images(
    ["design_a.png", "design_b.png", "design_c.png"],
    "Visual appeal, readability, brand consistency, mobile-friendliness",
)
Screenshot Analysis for QA
def qa_screenshot(screenshot_path, expected_behavior):
    image = PIL.Image.open(screenshot_path)
    response = model.generate_content([
        f"""You are a QA engineer reviewing this screenshot.

        Expected behavior: {expected_behavior}

        Analyze the screenshot and report:
        1. Does it match the expected behavior? (PASS/FAIL)
        2. Visual issues: misalignment, overlapping elements, cut-off text
        3. Content issues: typos, placeholder text, incorrect data
        4. Accessibility: contrast issues, missing labels, small text
        5. Responsive design: does the layout look correct for this viewport?

        Return as a structured QA report.""",
        image,
    ])
    return response.text
Document Processing
Receipt and Invoice Extraction
import json

def extract_receipt(image_path):
    image = PIL.Image.open(image_path)
    response = model.generate_content([
        """Extract all information from this receipt/invoice and return
        as JSON:
        {
          "vendor_name": "",
          "vendor_address": "",
          "date": "YYYY-MM-DD",
          "items": [
            {"name": "", "quantity": 1, "unit_price": 0.00, "total": 0.00}
          ],
          "subtotal": 0.00,
          "tax": 0.00,
          "total": 0.00,
          "payment_method": "",
          "currency": ""
        }
        If any field is not visible or unclear, set it to null.
        For items, extract every line item visible.
        Return only the JSON, with no surrounding text or code fences.""",
        image,
    ])
    return json.loads(response.text)
PDF Document Analysis
def analyze_pdf(pdf_path, questions):
    # Upload the file
    uploaded_file = genai.upload_file(pdf_path)
    prompt = f"""Analyze this PDF document and answer the following questions:
    {chr(10).join(f'{i+1}. {q}' for i, q in enumerate(questions))}

    For each answer:
    - Cite the specific page number and section
    - Quote the relevant text
    - If the document does not contain the answer, say so explicitly"""
    response = model.generate_content([prompt, uploaded_file])
    return response.text

# Usage
answers = analyze_pdf(
    "contract.pdf",
    [
        "What is the contract start and end date?",
        "What are the payment terms?",
        "Are there any automatic renewal clauses?",
        "What are the termination conditions?",
    ],
)
Form Data Extraction
import json

def extract_form(form_image_path):
    image = PIL.Image.open(form_image_path)
    response = model.generate_content([
        """This is a filled-out form. Extract all field labels and
        their corresponding values. Return as a JSON object where
        keys are field labels and values are the filled-in data.
        For checkboxes, return true/false.
        For handwritten text, do your best to read it and flag
        any uncertain readings with [uncertain: best guess].

        Also identify:
        - Any required fields left blank
        - Any fields with potentially invalid data
        - Signatures (note as "signature present" or "signature missing")""",
        image,
    ])
    return json.loads(response.text)
Video Analysis
Video Summarization
import time

def summarize_video(video_path, focus=None):
    uploaded_video = genai.upload_file(video_path)

    # Wait for server-side processing to finish
    while uploaded_video.state.name == "PROCESSING":
        time.sleep(5)
        uploaded_video = genai.get_file(uploaded_video.name)

    prompt = (
        "Summarize this video. Include: main topics covered, "
        "key points made, any data or statistics mentioned, "
        "and the overall tone/purpose of the video."
    )
    if focus:
        prompt += f" Focus especially on: {focus}"

    response = model.generate_content([prompt, uploaded_video])
    return response.text
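The same polling loop recurs in the video and document examples in this guide, so it is worth factoring out. The sketch below is not part of the SDK: `wait_for_active` is a name introduced here, and the lookup function is injected (pass `genai.get_file`) so the helper stays independent of the API and easy to stub in tests.

```python
import time

def wait_for_active(file, get_file, poll_seconds=5, timeout=600):
    """Poll an uploaded file until it leaves the PROCESSING state.

    `get_file` is the lookup callable (pass genai.get_file); injecting
    it keeps this helper decoupled from the SDK.
    """
    waited = 0
    while file.state.name == "PROCESSING":
        if waited >= timeout:
            raise TimeoutError(f"{file.name} still processing after {timeout}s")
        time.sleep(poll_seconds)
        waited += poll_seconds
        file = get_file(file.name)
    if file.state.name != "ACTIVE":
        raise RuntimeError(f"Upload ended in state {file.state.name}")
    return file
```

With this in place, the wait in `summarize_video` collapses to `uploaded_video = wait_for_active(uploaded_video, genai.get_file)`.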
Video Content Moderation
import time

def moderate_video(video_path):
    uploaded_video = genai.upload_file(video_path)

    # Wait for server-side processing to finish
    while uploaded_video.state.name == "PROCESSING":
        time.sleep(5)
        uploaded_video = genai.get_file(uploaded_video.name)

    response = model.generate_content([
        """Review this video for content moderation. Check for:
        1. Explicit or violent content
        2. Hate speech or discriminatory language
        3. Dangerous activities or misinformation
        4. Copyright-protected music or content
        5. Personal information exposure (faces, addresses, phone numbers)

        Return a moderation report with:
        - Overall safety rating: SAFE / REVIEW_NEEDED / BLOCKED
        - Timestamps of any flagged content
        - Category of each flag
        - Confidence level (HIGH/MEDIUM/LOW)
        - Recommended action""",
        uploaded_video,
    ])
    return response.text
Mixed-Media Pipelines
Multi-Document Comparison
import time

def compare_documents(doc_paths, comparison_focus):
    files = [genai.upload_file(p) for p in doc_paths]

    # Re-fetch each file until processing finishes. Write the fresh handle
    # back into the list: reassigning the loop variable alone would leave
    # stale (possibly still-PROCESSING) handles in `files`.
    for i, f in enumerate(files):
        while f.state.name == "PROCESSING":
            time.sleep(2)
            f = genai.get_file(f.name)
        files[i] = f

    prompt = f"""Compare these {len(files)} documents.
    Focus on: {comparison_focus}

    For each document, extract the key positions and data points
    related to the focus area. Then create a comparison matrix
    showing where the documents agree and disagree.
    Highlight any contradictions between documents."""
    response = model.generate_content([prompt] + files)
    return response.text
Image + Text Product Listing Generator
def generate_listing(product_images, product_info):
    images = [PIL.Image.open(p) for p in product_images]
    prompt = f"""Based on these product photos and the following info:
    {product_info}

    Generate a complete e-commerce listing:
    1. Product title (SEO-optimized, under 80 characters)
    2. Bullet points (5 key features from the images)
    3. Product description (200 words, highlight visible features)
    4. Suggested categories and tags
    5. Suggested price range based on apparent quality

    Base your description on what you can actually see in the images.
    Do not make claims about features that are not visible."""
    response = model.generate_content([prompt] + images)
    return response.text
Production Optimization
Structured Output with JSON Mode
response = model.generate_content(
    [prompt, image],
    generation_config=genai.GenerationConfig(
        response_mime_type="application/json",
        response_schema={
            "type": "object",
            "properties": {
                "product_name": {"type": "string"},
                "price": {"type": "number"},
                "category": {"type": "string"},
            },
        },
    ),
)
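Because `response_mime_type` is set to `application/json`, `response.text` is a raw JSON string conforming to the schema, so it parses directly with no markdown fences to strip. The payload below is illustrative, showing the shape a real `response.text` would take:

```python
import json

# Illustrative payload in the shape the schema above enforces;
# a real response.text would be parsed the same way.
sample_response_text = '{"product_name": "Ceramic Mug", "price": 12.5, "category": "Kitchenware"}'
listing = json.loads(sample_response_text)
print(listing["product_name"], listing["price"])  # → Ceramic Mug 12.5
```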
Batch Processing
import asyncio
import PIL.Image

async def batch_analyze_images(image_paths, prompt, max_concurrent=5):
    # Bound concurrency so we stay under API rate limits
    semaphore = asyncio.Semaphore(max_concurrent)

    async def process_one(path):
        async with semaphore:
            image = PIL.Image.open(path)
            response = await model.generate_content_async([prompt, image])
            return {"path": path, "result": response.text}

    tasks = [process_one(p) for p in image_paths]
    return await asyncio.gather(*tasks)
Cost Management
| Input Type | Token Cost | Optimization |
|---|---|---|
| Text | ~1 token per 4 chars | Keep prompts concise |
| Image | ~258 tokens per image | Resize to needed resolution |
| Video | ~258 tokens per frame | Reduce FPS, trim unnecessary segments |
| PDF | ~258 tokens per page | Extract only relevant pages |
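The figures in this table can be turned into a rough pre-flight estimate. The function below is a sketch (`estimate_request_tokens` is a name introduced here, not an SDK call): all figures are approximations, and actual billing depends on the model and current pricing.

```python
def estimate_request_tokens(prompt_chars=0, images=0, video_seconds=0,
                            pdf_pages=0, frames_per_second=1):
    """Rough input-token estimate from the per-modality figures above.

    Approximations: ~4 chars per text token; ~258 tokens per image,
    per sampled video frame, and per PDF page.
    """
    tokens = prompt_chars // 4
    tokens += images * 258
    tokens += int(video_seconds * frames_per_second) * 258
    tokens += pdf_pages * 258
    return tokens

# A 400-character prompt with three images:
print(estimate_request_tokens(prompt_chars=400, images=3))  # → 874
```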
Frequently Asked Questions
What image formats does Gemini accept?
PNG, JPEG, GIF, WebP, and HEIC. For best results, use JPEG or PNG. Maximum image size is 20 MB per image.
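A lightweight pre-flight check against these constraints can catch bad inputs before you spend an API call. This sketch checks only extension and file size (`check_image` and `ACCEPTED_EXTS` are names introduced here for illustration):

```python
import os

# Accepted formats and the 20 MB cap come from the FAQ answer above.
ACCEPTED_EXTS = {".png", ".jpg", ".jpeg", ".gif", ".webp", ".heic"}
MAX_BYTES = 20 * 1024 * 1024

def check_image(path):
    """Raise ValueError if an image is an unsupported format or too large."""
    ext = os.path.splitext(path)[1].lower()
    if ext not in ACCEPTED_EXTS:
        raise ValueError(f"Unsupported format: {ext}")
    if os.path.getsize(path) > MAX_BYTES:
        raise ValueError(f"Image exceeds 20 MB: {path}")
    return True
```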
Can Gemini process video in real-time?
Not in real-time. Video must be uploaded and processed before analysis. Processing time depends on video length — typically 1-3 minutes for a 5-minute video.
How accurate is document extraction compared to dedicated OCR?
For clearly printed text, Gemini’s accuracy is comparable to Tesseract and Google Cloud Vision. For handwritten text, it is often more accurate because it uses contextual understanding, not just character recognition.
Can I process multiple images in one API call?
Yes. You can include multiple images in a single request. The 1M token context window can handle hundreds of images. Each image costs approximately 258 tokens.
Is the Gemini API available in all regions?
Check Google’s current API availability. Some features and models may have geographic restrictions. The API is broadly available in most markets.
How does Gemini multimodal compare to GPT-4o Vision?
Both are strong multimodal models. Gemini’s advantages: larger context window (1M vs 128K tokens), native video support, and competitive pricing. GPT-4o’s advantages: slightly better at complex reasoning about images, larger developer ecosystem.