Gemini Best Practices for Multimodal Prompting: Image, Video, and Document Analysis That Gets Results
Why Gemini’s Multimodal Capability Changes What AI Can Do
Most AI interactions are text-in, text-out. You describe a problem in words, the AI responds in words. But many real-world tasks start with visual information: a screenshot of a bug, a photo of a whiteboard, a PDF of a contract, a video of a product demo. Translating these into text descriptions before asking AI to analyze them loses information and adds effort.
Gemini processes images, videos, PDFs, and audio natively — in the same context as text. You can upload a screenshot and ask “What is wrong with this UI?” Upload a financial report PDF and ask “Summarize the key risks.” Upload a video and ask “What are the main takeaways from this presentation?” The AI sees what you see.
This multimodal capability is more than a convenience: it enables workflows that were previously impossible or impractical, such as analyzing hundreds of product photos for quality issues, extracting data from scanned documents, or reviewing video content for compliance. These tasks used to require specialized tools or manual labor. Gemini can now handle a first pass with a prompt, though critical outputs still need verification.
This guide covers the prompting patterns that produce reliable results across modalities.
Image Analysis Best Practices
The Specific Question Rule
The single most important rule for image analysis: ask a specific question, not a vague one.
VAGUE (produces generic description):
"What do you see in this image?"
→ "This is a screenshot of a web application dashboard showing charts and data tables."

SPECIFIC (produces actionable analysis):
"This is a screenshot of our analytics dashboard. Identify:
1. Any visual inconsistencies (misaligned elements, truncated text)
2. Data visualization issues (misleading axes, wrong chart types)
3. Accessibility problems (low contrast, missing labels)
4. Suggestions for improving the information hierarchy"
The specific question gets a specific, useful answer because it tells Gemini what to look for.
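If you send prompts like this from code, the same rule can be enforced by construction. The sketch below is a minimal illustration, not part of any Gemini SDK: `build_image_prompt` is a hypothetical helper that refuses to build a prompt without at least one specific thing to look for.

```python
def build_image_prompt(context: str, checks: list[str]) -> str:
    """Compose a specific image-analysis prompt from context plus a checklist."""
    if not checks:
        # An empty checklist would produce the vague prompt this rule warns against.
        raise ValueError("Provide at least one specific thing to look for.")
    lines = [context.strip(), "Identify:"]
    lines += [f"{i}. {check}" for i, check in enumerate(checks, start=1)]
    return "\n".join(lines)


prompt = build_image_prompt(
    "This is a screenshot of our analytics dashboard.",
    [
        "Any visual inconsistencies (misaligned elements, truncated text)",
        "Data visualization issues (misleading axes, wrong chart types)",
        "Accessibility problems (low contrast, missing labels)",
        "Suggestions for improving the information hierarchy",
    ],
)
print(prompt)
```

The returned string is sent alongside the image; how you attach the image depends on the client library you use.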
Image Analysis Patterns by Use Case
UI/UX Review:
"Review this screenshot of our [page/feature]. Evaluate:
1. Visual hierarchy: is the most important information prominent?
2. Consistency: do similar elements look and behave similarly?
3. Whitespace: is the layout balanced or cramped?
4. Typography: are font sizes and weights used hierarchically?
5. Call to action: is the primary CTA obvious?
Provide specific, actionable feedback, not general principles."
Product Photo Quality Check:
"This is a product photo for our e-commerce listing. Assess:
1. Lighting: is the product well-lit without harsh shadows?
2. Background: is it clean and appropriate for the product?
3. Focus: is the product sharp where it matters?
4. Color accuracy: do the colors look natural and appealing?
5. Composition: is the product positioned effectively?
Rate each dimension 1-5 and suggest specific improvements."
Document Data Extraction:
"Extract all structured data from this receipt/invoice image:
- Vendor name
- Date
- Line items (description, quantity, unit price, total)
- Subtotal, tax, total
- Payment method
Output as JSON."
Competitive Analysis from Screenshots:
"Here are screenshots of our competitor's [page type]. Compare to our version (described below):
[describe your page]
Identify:
1. Features they have that we don't
2. UI patterns they use that we should consider
3. Areas where our design is stronger
4. Their messaging approach vs. ours"
Multiple Images in One Prompt
Gemini can analyze multiple images simultaneously:
"I'm uploading 5 screenshots of our onboarding flow (screens 1-5 in order). Analyze the complete flow:
1. Is the progression logical?
2. Are there too many steps?
3. Where might users drop off?
4. Is the visual design consistent across screens?
5. What is missing that users would need?"
This is more effective than analyzing each screen individually because Gemini can assess the flow and consistency across screens.
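When sending several screens programmatically, order matters, so it helps to assemble the request in one place. The sketch below uses only the standard library and builds a request body in the `contents` / `parts` / `inline_data` shape used by the Gemini REST API as I understand it; verify the field names against current documentation. `build_flow_request` and the "Screen N:" labels are assumptions of this sketch, not something the API requires.

```python
import base64


def build_flow_request(prompt: str, screens: list[tuple[str, bytes]]) -> dict:
    """Build one request body holding every screen, in order, plus the prompt.

    screens is a list of (mime_type, raw_bytes) pairs in flow order.
    """
    parts = []
    for n, (mime_type, raw) in enumerate(screens, start=1):
        # Labelling each image is an assumption of this sketch; it gives the
        # model an explicit name for every screen it is asked to compare.
        parts.append({"text": f"Screen {n}:"})
        parts.append({
            "inline_data": {
                "mime_type": mime_type,
                "data": base64.b64encode(raw).decode("ascii"),
            }
        })
    parts.append({"text": prompt})
    return {"contents": [{"parts": parts}]}


# Placeholder bytes stand in for real PNG files read from disk.
fake_screens = [("image/png", f"screen-{i}".encode()) for i in range(1, 6)]
request = build_flow_request(
    "These are screens 1-5 of our onboarding flow, in order. "
    "Is the progression logical, and is the design consistent across screens?",
    fake_screens,
)
print(len(request["contents"][0]["parts"]))  # five labels + five images + one prompt
```

Keeping all screens in a single request is what lets the model reason about the flow as a whole rather than one screen at a time.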
Document Analysis Best Practices
PDF Processing
Gemini’s large context window handles most documents:
"I'm uploading a 30-page quarterly report. Analyze and provide:
1. Executive summary (5 bullet points)
2. Key financial metrics with quarter-over-quarter trends
3. Risks and challenges mentioned
4. Strategic initiatives discussed
5. Any claims that seem unsupported by the data presented
6. Questions a board member should ask based on this report"
Multi-Document Comparison
"I'm uploading two versions of our Terms of Service:
- Version A: current (effective January 2026)
- Version B: proposed revision
Compare them and identify:
1. Every substantive change (not formatting/grammar)
2. New clauses added in Version B
3. Clauses removed or weakened in Version B
4. Changes that could affect user rights
5. Changes that could affect our liability
Organize by section for easy review."
Table and Chart Extraction
"This PDF contains several charts and tables. For each chart or table:
1. Extract the data into a structured format (JSON or CSV)
2. Describe what the chart shows
3. Note any trends or anomalies visible in the data
4. Identify if the visualization accurately represents the data (or if it could be misleading)"
Video Analysis Best Practices
Video Content Review
"I'm uploading a 5-minute product demo video. Analyze:
1. Key features demonstrated (timestamp + description)
2. Pacing: too fast, too slow, or just right?
3. Clarity: are the demonstrations easy to follow?
4. Missing content: what features should be included but aren't?
5. Production quality: audio, visuals, transitions
6. Recommended improvements for the next version"
Presentation Analysis
"This is a recorded presentation. Provide:
1. Slide-by-slide summary (what each slide covers)
2. Key claims made by the presenter
3. Quality of evidence for each claim
4. Presentation structure assessment (logical flow?)
5. Action items or decisions requested
6. Questions that should be asked"
Video Compliance Check
"Review this marketing video for compliance:
1. Are all claims substantiated? (flag any unsupported claims)
2. Are disclaimers present where needed?
3. Is pricing displayed accurately and clearly?
4. Are there any potentially misleading representations?
5. Does the video comply with [specific regulation/guideline]?"
Combining Modalities (Text + Image + Document)
The Multi-Source Analysis Pattern
"I'm providing three inputs:
1. [Image]: screenshot of our current dashboard
2. [PDF]: our design system documentation
3. [Text]: user feedback from the last 10 support tickets about the dashboard
Analyze all three together:
- Does the current dashboard follow our design system?
- Do the user complaints align with what you see in the screenshot?
- What specific changes would address the user feedback while staying within our design system guidelines?"
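This combined analysis is more valuable than analyzing each input separately because Gemini can cross-check the sources against each other: the screenshot against the design system, and the user complaints against what is actually on screen.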
Design-to-Implementation Review
"Inputs:
1. [Image]: Figma design mockup
2. [Image]: screenshot of the implemented page
Compare the design to the implementation:
1. Layout differences (spacing, alignment, proportions)
2. Typography differences (font, size, weight, color)
3. Color differences (exact color codes if visible)
4. Missing elements (in design but not implemented)
5. Extra elements (implemented but not in design)
6. Interaction states not visible (hover, focus, active)
List every discrepancy with specific pixel-level detail."
Common Multimodal Prompting Mistakes
Mistake 1: Uploading Low-Quality Images
Blurry, small, or heavily compressed images produce poor analysis. Use the highest resolution available. For screenshots, use native resolution (no scaling).
Mistake 2: Not Providing Context
BAD: [upload image] "Analyze this."
(Gemini does not know what you care about)

GOOD: [upload image] "This is a screenshot of our checkout page. We are seeing a 40% cart abandonment rate. Identify potential UX issues that could cause users to abandon."
Context tells Gemini what matters. Without it, analysis is generic.
Mistake 3: Too Many Images Without Structure
BAD: [upload 20 images] "What do you think?"

GOOD: [upload 20 images] "These are 20 product photos for our spring catalog. For each photo (numbered 1-20):
- Rate quality 1-5
- Note the primary issue (if quality < 4)
- Suggest a specific fix
Output as a numbered list matching the image order."
Mistake 4: Expecting OCR-Level Accuracy from Photos
Gemini reads text in images well but is not an OCR engine. For critical text extraction (legal documents, financial data), verify extracted numbers against the source. Use dedicated OCR tools for high-volume document processing.
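For invoice-style data, part of that verification can be automated as an arithmetic cross-check before a human looks at the source. The sketch below is illustrative only: the JSON key names (`line_items`, `quantity`, `unit_price`, `total`, `subtotal`, `tax`) are assumptions, since the model chooses key names unless your prompt fixes them, and `parse_json_reply` and `check_invoice` are hypothetical helpers.

```python
import json
from decimal import Decimal

FENCE = "`" * 3  # Markdown fence marker that models sometimes wrap JSON in


def parse_json_reply(reply: str) -> dict:
    """Parse a model reply as JSON, tolerating a surrounding Markdown fence."""
    text = reply.strip()
    if text.startswith(FENCE):
        _, _, text = text.partition("\n")   # drop the opening fence line
        text = text.rsplit(FENCE, 1)[0]     # drop the closing fence
    # Decimal avoids float rounding noise when comparing money amounts.
    return json.loads(text, parse_float=Decimal)


def check_invoice(data: dict) -> list[str]:
    """Return arithmetic inconsistencies; an empty list means the totals agree."""
    found = []
    items = data.get("line_items", [])
    for n, item in enumerate(items, start=1):
        if item["quantity"] * item["unit_price"] != item["total"]:
            found.append(f"line item {n}: quantity x unit price != total")
    if sum(item["total"] for item in items) != data["subtotal"]:
        found.append("line item totals do not add up to subtotal")
    if data["subtotal"] + data["tax"] != data["total"]:
        found.append("subtotal + tax != total")
    return found


# A made-up reply in which the grand total was misread (27.98 instead of 26.98).
reply = FENCE + """json
{"vendor": "Acme", "line_items": [
  {"description": "Widget", "quantity": 2, "unit_price": 9.99, "total": 19.98},
  {"description": "Shipping", "quantity": 1, "unit_price": 5.00, "total": 5.00}],
 "subtotal": 24.98, "tax": 2.00, "total": 27.98}
""" + FENCE

problems = check_invoice(parse_json_reply(reply))
print(problems)
```

A check like this catches misread digits that break the arithmetic. It cannot catch a consistent but wrong extraction, so spot checks against the source are still needed.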
Production Patterns
Automated Image Quality Pipeline
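For each product photo uploaded to the CMS:
1. Send to Gemini with the quality check prompt
2. If rating >= 4: approve automatically
3. If rating 2-3: flag for human review with specific issues
4. If rating 1: reject with improvement suggestions
Here "rating" means a single overall score per photo. Because the quality check prompt rates five dimensions separately, decide up front how to combine them (for example, use the lowest dimension score). This can replace manual quality review for a large share of photos (around 80% in this example); the exact share depends on your catalog and thresholds.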
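The routing thresholds above can be sketched as a small pure function, which keeps them testable independently of any model call. Taking the lowest per-dimension score as the overall rating is an assumption of this sketch, not something the source specifies; `overall_rating` and `route_photo` are hypothetical names.

```python
def overall_rating(scores: dict[str, int]) -> int:
    """Collapse per-dimension scores (1-5) into one rating.

    Using the minimum is an assumption: one bad dimension fails the photo.
    """
    return min(scores.values())


def route_photo(rating: int) -> str:
    """Map an overall 1-5 rating to a pipeline action."""
    if not 1 <= rating <= 5:
        raise ValueError(f"rating must be between 1 and 5, got {rating}")
    if rating >= 4:
        return "approve"
    if rating >= 2:
        return "human_review"
    return "reject"


scores = {"lighting": 5, "background": 4, "focus": 3, "color": 4, "composition": 5}
action = route_photo(overall_rating(scores))
print(action)
```

With the minimum rule, the soft focus score of 3 sends this otherwise strong photo to human review; an average would have approved it, so the choice of rule is worth making deliberately.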
Document Intake Automation
For each document submitted (invoices, contracts, reports):
1. Send to Gemini for classification (invoice/contract/report/other)
2. Extract structured data based on document type
3. Validate extracted data against expected patterns
4. Route to the appropriate team with extracted summary
This replaces manual document sorting and initial review.
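A minimal sketch of the four intake steps follows, with the model calls injected as plain functions so the control flow can be tested without network access. Everything here is illustrative: the team names in `ROUTES`, the `manual_triage` fallback for unknown types and failed validation, and the stub functions are assumptions, not part of the original workflow.

```python
from typing import Callable

# Hypothetical team names; replace with your own routing targets.
ROUTES = {"invoice": "accounts_payable", "contract": "legal", "report": "management"}


def process_document(
    doc: bytes,
    classify: Callable[[bytes], str],
    extract: Callable[[bytes, str], dict],
    validate: Callable[[str, dict], list[str]],
) -> dict:
    """Classify, extract, validate, then route one document."""
    doc_type = classify(doc)                      # step 1: classification
    if doc_type not in ROUTES:
        return {"type": "other", "route": "manual_triage", "data": None, "issues": []}
    data = extract(doc, doc_type)                 # step 2: type-specific extraction
    issues = validate(doc_type, data)             # step 3: pattern validation
    # Step 4: route. Sending failed validations to a person is an assumption.
    route = "manual_triage" if issues else ROUTES[doc_type]
    return {"type": doc_type, "route": route, "data": data, "issues": issues}


# Stubs standing in for real Gemini calls.
def fake_classify(doc: bytes) -> str:
    return "invoice" if b"INVOICE" in doc else "other"


def fake_extract(doc: bytes, doc_type: str) -> dict:
    return {"vendor": "Acme", "total": 26.98}


def fake_validate(doc_type: str, data: dict) -> list[str]:
    return [] if data.get("total", 0) > 0 else ["total missing or not positive"]


result = process_document(b"INVOICE #1", fake_classify, fake_extract, fake_validate)
print(result["route"])
```

In production, `classify` and `extract` would wrap model calls, while `validate` is best kept as ordinary deterministic code.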
Video Content Moderation
For each video uploaded to the platform:
1. Send to Gemini for content analysis
2. Check for: policy violations, inappropriate content, copyright concerns, quality standards
3. Auto-approve if all checks pass
4. Flag for human review if any concern is identified
This scales content moderation without proportionally scaling the moderation team.
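The approve-or-flag rule can be sketched as a function over the per-check findings. The check names and the `moderation_decision` helper are hypothetical. Treating a missing check as a reason to flag is a design choice of this sketch (failing closed), not something the source prescribes.

```python
CHECKS = (
    "policy_violations",
    "inappropriate_content",
    "copyright_concerns",
    "quality_standards",
)


def moderation_decision(findings: dict[str, bool]) -> tuple[str, list[str]]:
    """Decide on a video given findings, where True means a concern was found."""
    # Fail closed: a check the model did not report on counts as unresolved.
    reasons = [f"{check}: not evaluated" for check in CHECKS if check not in findings]
    reasons += [f"{check}: concern identified" for check in CHECKS if findings.get(check)]
    if reasons:
        return "human_review", reasons
    return "auto_approve", []


decision, reasons = moderation_decision({
    "policy_violations": False,
    "inappropriate_content": False,
    "copyright_concerns": True,
    "quality_standards": False,
})
print(decision, reasons)
```

Passing the reasons along with the decision gives the human reviewer a starting point instead of a bare flag.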
Frequently Asked Questions
What image formats does Gemini support?
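JPEG, PNG, WebP, HEIC, and HEIF are the formats documented for the Gemini API. GIF may be accepted in some Gemini products, but confirm against current documentation before relying on it. For best results, use PNG for screenshots (lossless) and JPEG for photographs. Avoid heavily compressed images.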
How large can uploaded files be?
Limits vary by model and change between releases, so check current Gemini documentation. Generally: images up to 20MB, PDFs up to 100+ pages (within token limits), and videos from a few minutes up to around an hour on long-context models. For very large documents, consider splitting or summarizing sections.
Is multimodal prompting more expensive than text-only?
Yes. Images and videos consume more tokens than equivalent text. A single high-resolution image may use 1,000-3,000 tokens. For cost-sensitive applications, consider whether the visual input adds enough value to justify the token cost.
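To judge whether the visual input is worth it, a back-of-the-envelope estimate is often enough. The sketch below takes the 1,000-3,000 tokens-per-image range from the answer above; the price is a made-up placeholder, not a real quote, so substitute the current rate for your model. `estimate_image_cost` is a hypothetical helper.

```python
def estimate_image_cost(
    n_images: int, tokens_per_image: int, usd_per_million_tokens: float
) -> float:
    """Estimate input-token cost in USD for a batch of images."""
    return n_images * tokens_per_image * usd_per_million_tokens / 1_000_000


PRICE = 0.30  # hypothetical USD per million input tokens, not a real quote

low = estimate_image_cost(10_000, 1_000, PRICE)
high = estimate_image_cost(10_000, 3_000, PRICE)
print(f"10,000 images: ${low:.2f} to ${high:.2f} in image input tokens")
```

The estimate covers image input only; prompt text and the model's output tokens are billed on top of it.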
Can Gemini generate images from multimodal analysis?
Gemini can analyze images but its image generation capabilities are separate. For analysis tasks, upload images and receive text analysis. For generation, use Gemini’s image generation features or dedicated tools like Midjourney.
How do I handle sensitive images (medical, legal, personal)?
Review Google’s data handling policies for Gemini before uploading sensitive content. For healthcare and legal applications, ensure compliance with relevant regulations (HIPAA, attorney-client privilege). Consider de-identifying sensitive information before upload.
Can Gemini analyze handwritten content?
Yes, with varying accuracy. Clear handwriting on clean backgrounds produces good results. Messy handwriting, unusual scripts, or degraded documents produce inconsistent results. Always verify extracted text from handwritten sources.