ChatGPT Advanced Data Analysis (Code Interpreter) Complete Guide: Upload, Analyze, Visualize
Why Advanced Data Analysis Changes the Way You Work with Data
Data analysis has traditionally required a combination of programming knowledge, statistical understanding, and familiarity with specialized tools. For many professionals, this meant either learning Python and libraries like pandas from scratch or relying on spreadsheet software that quickly hits its limits with larger datasets. ChatGPT Advanced Data Analysis bridges this gap by letting you perform sophisticated data operations using plain English instructions, all within the ChatGPT interface.
Whether you are a marketing analyst trying to make sense of campaign performance data, a finance professional building quarterly reports, or a researcher exploring survey results, Advanced Data Analysis transforms ChatGPT from a text-based assistant into a full-featured data workbench. You upload your files, describe what you want in natural language, and ChatGPT writes and executes Python code behind the scenes to deliver cleaned datasets, statistical summaries, and publication-ready visualizations.
This guide walks you through every aspect of the feature, from initial setup to advanced techniques, with practical examples you can use immediately.
What Is ChatGPT Advanced Data Analysis?
ChatGPT Advanced Data Analysis, originally launched under the name Code Interpreter, is a built-in capability of ChatGPT that allows the model to write and execute Python code in a sandboxed environment. Unlike standard ChatGPT responses that only generate text, Advanced Data Analysis can process uploaded files, run computations, generate charts, and produce downloadable output files.
The feature runs a full Python runtime in an isolated Jupyter-like environment on OpenAI’s servers. When you ask a question about your data, ChatGPT generates Python code, executes it in real time, and returns the results directly in the conversation. You can see the code it wrote, inspect intermediate outputs, and request modifications, all through natural language.
Key capabilities include:
- File ingestion: Upload CSV, Excel, JSON, SQLite, Parquet, and many other file formats directly into the conversation.
- Data manipulation: Clean, filter, merge, pivot, and reshape datasets using pandas under the hood.
- Visualization: Generate bar charts, line graphs, scatter plots, heatmaps, histograms, and more using matplotlib and seaborn.
- Statistical analysis: Run correlation tests, regression models, hypothesis tests, and descriptive statistics.
- Machine learning: Train basic models using scikit-learn for classification, clustering, and prediction tasks.
- File generation: Export results as CSV, Excel, PNG, PDF, and other formats for download.
Because the environment is sandboxed, your data stays within the session and, subject to OpenAI’s data usage policies for your plan, is not used to train OpenAI’s models.
How to Access and Enable Advanced Data Analysis
Advanced Data Analysis is available to users on the following ChatGPT plans:
- ChatGPT Plus ($20/month): Full access to Advanced Data Analysis with GPT-4o.
- ChatGPT Team ($25/user/month): Full access with additional workspace features and higher usage limits.
- ChatGPT Enterprise: Full access with enterprise-grade security, longer context windows, and no usage caps.
- ChatGPT Free: Limited access to file uploads and basic code execution with GPT-4o mini, subject to daily usage limits.
To start using Advanced Data Analysis:
- Log in to ChatGPT at chatgpt.com (formerly chat.openai.com).
- Start a new conversation or open an existing one.
- Select GPT-4o as your model (the default for Plus subscribers).
- Click the attachment (paperclip) icon in the message input area.
- Upload your data file and type your analysis request.
There is no separate toggle or plugin to enable. As of 2026, Advanced Data Analysis is natively integrated into GPT-4o conversations whenever you upload a file or request code execution.
Supported File Formats and Size Limits
Advanced Data Analysis supports a wide range of file formats. Understanding the limits helps you prepare your data before uploading.
Commonly supported file formats:
| Category | Formats |
|---|---|
| Tabular data | CSV, TSV, Excel (.xlsx, .xls), Parquet, Feather |
| Structured data | JSON, XML, SQLite (.db, .sqlite) |
| Text | TXT, Markdown, LaTeX |
| Images | PNG, JPG, GIF, SVG, WebP |
| Documents | PDF (text extraction only) |
| Archives | ZIP (auto-extracted) |
| Code | Python (.py), Jupyter Notebooks (.ipynb) |
Size limits and session constraints:
- Maximum file size per upload: approximately 512 MB per file.
- Total session storage: the sandboxed environment has limited disk space, typically around 1 GB total across all uploaded and generated files.
- Session timeout: the code execution environment resets after a period of inactivity (roughly 10 to 15 minutes without interaction). When this happens, uploaded files and variables are lost, and you must re-upload.
- Execution timeout: individual code executions are limited to approximately 120 seconds. Long-running computations may be interrupted.
- Row and column limits: there is no hard row limit, but extremely large datasets (millions of rows) may cause memory errors. For best performance, keep datasets under 500,000 rows or pre-filter before uploading.
Tip: If your file exceeds size limits, consider compressing it into a ZIP archive, splitting it into smaller chunks, or pre-filtering to include only the columns and rows you need.
Step 1: Uploading and Exploring Your Dataset
The first step in any analysis is getting your data into the environment and understanding its structure. Upload your file by clicking the paperclip icon or dragging the file into the chat window.
Once uploaded, start with exploratory prompts to understand your dataset before diving into analysis.
Prompt: Load the uploaded CSV file and show me the first 10 rows, the column names
and data types, the shape of the dataset, and a summary of missing values per column.
ChatGPT will typically run code similar to df.head(10), df.dtypes, df.shape, and df.isnull().sum(), presenting the results in a readable format.
Prompt: Give me descriptive statistics for all numerical columns, including mean,
median, standard deviation, min, max, and the 25th/75th percentiles.
This triggers a df.describe() call along with additional median calculations, giving you a statistical snapshot of your data.
For more targeted exploration:
Prompt: Show me the unique values and their counts for the "Region" and "Category"
columns, sorted by frequency.
Best practices for the exploration phase:
- Always inspect the first few rows to verify that the file was parsed correctly (correct delimiters, headers, encoding).
- Check data types early. Dates stored as strings, numbers stored as text, and mixed-type columns cause problems downstream.
- Identify missing values immediately so you can decide on a handling strategy before analysis begins.
- Look at the shape of the data to confirm that all expected rows and columns are present.
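Behind the scenes, the exploration prompts above translate into a few lines of pandas. A minimal sketch of what that generated code typically looks like (the inline sample stands in for an uploaded file, and the column names are hypothetical):

```python
import io
import pandas as pd

# Inline sample standing in for an uploaded CSV (hypothetical columns)
csv_data = io.StringIO(
    "Region,Revenue,Orders\n"
    "East,1200,10\n"
    "West,,8\n"
    "East,950,7\n"
)
df = pd.read_csv(csv_data)

print(df.head(10))        # first rows: verify parsing, delimiters, headers
print(df.dtypes)          # column data types
print(df.shape)           # (rows, columns)
print(df.isnull().sum())  # missing values per column
print(df.describe())      # descriptive statistics for numeric columns
```

Asking ChatGPT to "show the code" after an exploration step will reveal essentially this pattern, which is a good way to learn pandas incrementally.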
Step 2: Data Cleaning and Transformation
Real-world data is rarely clean. Advanced Data Analysis excels at handling the tedious but critical work of data preparation.
Handling missing values:
Prompt: For numerical columns, fill missing values with the column median. For
categorical columns, fill missing values with the mode. Show me a before-and-after
count of missing values.
Fixing data types:
Prompt: Convert the "Date" column to datetime format, the "Revenue" column to
numeric (removing any dollar signs and commas), and the "Zip Code" column to string
type. Show any rows where conversion failed.
Removing duplicates:
Prompt: Identify and remove duplicate rows based on the "Order_ID" column, keeping
the first occurrence. Tell me how many duplicates were removed.
Creating derived columns:
Prompt: Create a new column called "Profit_Margin" calculated as (Revenue - Cost) /
Revenue * 100. Also create a "Quarter" column extracted from the "Date" column and a
"Year" column.
Filtering and subsetting:
Prompt: Filter the dataset to include only rows where Region is "North America" and
Revenue is greater than 1000. Save this as a new dataframe and show me the shape.
Merging datasets: If you upload multiple files, you can join them:
Prompt: I uploaded two files: sales_data.csv and customer_info.csv. Merge them on
the "Customer_ID" column using a left join. Show me any Customer_IDs that appear in
sales but not in customer_info.
Advanced Data Analysis handles all of these operations by generating and executing pandas code. You can ask to see the code at any time to learn from it or verify the logic.
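The cleaning prompts above map onto standard pandas idioms. A sketch under illustrative assumptions (the inline data, column names, and currency format are all hypothetical):

```python
import io
import pandas as pd

# Hypothetical messy data standing in for an uploaded file
raw = io.StringIO(
    "Order_ID,Date,Revenue,Region\n"
    "1,2025-01-15,\"$1,200\",East\n"
    "2,2025-02-03,$950,\n"
    "2,2025-02-03,$950,\n"
    "3,2025-03-20,$400,West\n"
)
df = pd.read_csv(raw)

# Fix data types: parse dates, strip dollar signs and thousands separators
df["Date"] = pd.to_datetime(df["Date"])
df["Revenue"] = df["Revenue"].str.replace(r"[$,]", "", regex=True).astype(float)

# Fill missing categorical values with the mode
df["Region"] = df["Region"].fillna(df["Region"].mode()[0])

# Remove duplicates on the key column, keeping the first occurrence
before = len(df)
df = df.drop_duplicates(subset="Order_ID", keep="first")
print(f"Removed {before - len(df)} duplicate rows")

# Derived column extracted from the date
df["Quarter"] = df["Date"].dt.quarter
```

For merges, the equivalent pattern is `df.merge(other, on="Customer_ID", how="left", indicator=True)`, where the `_merge` column reveals rows that matched only one side.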
Step 3: Exploratory Data Analysis with Natural Language
Once your data is clean, exploratory data analysis (EDA) helps you uncover patterns, trends, and anomalies. The power of Advanced Data Analysis lies in letting you direct this process conversationally.
Distribution analysis:
Prompt: Show me the distribution of the "Revenue" column with a histogram. Use 30
bins, add a vertical line at the mean and median, and include a KDE overlay. Label
the chart clearly.
Group-level comparisons:
Prompt: Calculate the average Revenue, total Orders, and median Profit_Margin for
each Region. Display the results as a formatted table sorted by average Revenue
descending.
Correlation exploration:
Prompt: Create a correlation matrix for all numerical columns and display it as a
heatmap with annotation. Highlight any correlations above 0.7 or below -0.7.
Time-series patterns:
Prompt: Plot monthly total Revenue as a line chart for the past 24 months. Add a 3-
month rolling average overlay. Highlight any months where Revenue dropped more than
20% compared to the previous month.
Cross-tabulation:
Prompt: Create a pivot table showing average Revenue by Region (rows) and Product
Category (columns). Include row and column totals.
You can chain these explorations together in a single conversation. Each response builds on the state of the previous code execution, so variables, dataframes, and transformations persist within the session.
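The group-level and cross-tabulation prompts above typically compile down to `groupby` and `pivot_table` calls. A minimal sketch with a hypothetical sales table:

```python
import pandas as pd

# Hypothetical sales table
df = pd.DataFrame({
    "Region": ["East", "East", "West", "West"],
    "Category": ["A", "B", "A", "B"],
    "Revenue": [100.0, 200.0, 150.0, 50.0],
})

# Group-level comparison: average Revenue per Region, sorted descending
by_region = df.groupby("Region")["Revenue"].mean().sort_values(ascending=False)

# Cross-tabulation: average Revenue by Region (rows) and Category (columns),
# with row and column totals added via margins=True
pivot = pd.pivot_table(
    df, values="Revenue", index="Region", columns="Category",
    aggfunc="mean", margins=True,
)
print(by_region)
print(pivot)
```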
Step 4: Building Charts and Visualizations
Advanced Data Analysis has access to several Python visualization libraries, making it capable of producing a wide range of chart types.
Available libraries:
- matplotlib: The foundational plotting library. Handles virtually any chart type with fine-grained control over axes, labels, colors, and layout.
- seaborn: Built on matplotlib, it provides higher-level statistical visualization functions with better default aesthetics.
- plotly (in some environments): Interactive charts that support hover tooltips, zoom, and pan.
Common chart types and when to use them:
| Chart Type | Best For | Example Prompt |
|---|---|---|
| Bar chart | Comparing categories | "Bar chart of total sales by region" |
| Line chart | Trends over time | "Monthly revenue trend for 2025" |
| Scatter plot | Relationships between two variables | "Scatter plot of price vs. quantity sold" |
| Histogram | Distribution of a single variable | "Distribution of customer ages" |
| Box plot | Spread and outliers by group | "Box plot of salary by department" |
| Heatmap | Correlations or matrices | "Correlation heatmap of all numeric columns" |
| Pie/donut chart | Proportions of a whole | "Market share breakdown by brand" |
| Stacked bar | Composition across categories | "Revenue breakdown by product per quarter" |
Creating publication-ready charts:
Prompt: Create a grouped bar chart showing total Revenue and total Profit by Region.
Use a professional color palette, add value labels on top of each bar, set the
figure size to 12x6 inches, include a legend, and add a descriptive title. Use a
white background with light gridlines.
Multi-panel layouts:
Prompt: Create a 2x2 grid of charts: (1) histogram of Revenue, (2) scatter plot of
Revenue vs. Marketing Spend, (3) bar chart of average Revenue by Category, and (4)
line chart of monthly Revenue trend. Use a consistent color scheme across all four
panels and add a main title.
All generated charts can be downloaded as PNG or PDF files. Simply ask ChatGPT to save the figure, and it will provide a download link.
Step 5: Statistical Analysis and Modeling
Advanced Data Analysis supports a full range of statistical methods through scipy, statsmodels, and scikit-learn.
Descriptive statistics:
Beyond basic describe() output, you can request skewness, kurtosis, confidence intervals, and percentile breakdowns for any column.
Hypothesis testing:
Prompt: Perform an independent samples t-test to determine whether there is a
statistically significant difference in average Revenue between Region "East" and
Region "West". Report the t-statistic, p-value, and effect size (Cohen's d). Use a
significance level of 0.05.
Regression analysis:
Prompt: Run a multiple linear regression with Revenue as the dependent variable and
Marketing_Spend, Headcount, and Store_Size as independent variables. Show the
regression summary including R-squared, coefficients, p-values, and confidence
intervals. Check for multicollinearity using VIF scores.
Classification with scikit-learn:
Prompt: Build a Random Forest classifier to predict whether a customer will churn
(Churn column = 1 or 0). Use an 80/20 train-test split, show the classification
report (precision, recall, F1 score), and plot a confusion matrix and feature
importance chart.
Clustering:
Prompt: Perform K-Means clustering on the numerical columns (excluding IDs). Use the
elbow method to suggest the optimal number of clusters. Then assign cluster labels
and create a scatter plot colored by cluster using the two principal components from
PCA.
Available libraries in the environment:
- pandas: Data manipulation and analysis.
- numpy: Numerical computing.
- scipy: Statistical tests, distributions, optimization.
- scikit-learn: Machine learning models, preprocessing, evaluation metrics.
- statsmodels: Regression analysis, time-series models, statistical tests.
- matplotlib and seaborn: Visualization.
Note that deep learning frameworks like TensorFlow and PyTorch are not available in the sandboxed environment. For neural network tasks, you will need a separate development environment.
Step 6: Exporting Results and Generating Reports
One of the most practical aspects of Advanced Data Analysis is its ability to produce downloadable files.
Exporting cleaned data:
Prompt: Save the cleaned and transformed dataframe as a new CSV file called
"cleaned_sales_data.csv" and provide a download link.
Prompt: Export the pivot table results to an Excel file with two sheets: "Summary" for
the pivot table and "Detail" for the underlying filtered data. Format the header row
as bold.
Exporting visualizations:
Prompt: Save all four charts from the previous analysis as high-resolution PNG files
(300 DPI) and also create a single PDF with all charts on separate pages.
Generating summary reports:
Prompt: Create a comprehensive analysis report as a markdown file that includes:
- Executive summary of key findings
- Data quality overview (missing values, duplicates found)
- Top 5 insights with supporting statistics
- All generated charts (embedded as images)
- Recommendations based on the analysis
Save it as analysis_report.md.
You can also ask ChatGPT to generate PowerPoint-style summaries or HTML reports, though the formatting options for these are more limited within the sandbox.
Downloading files: After ChatGPT generates a file, a download link appears in the conversation. Click it to save the file to your local machine. These links expire after the session ends, so download everything you need before the session times out.
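The markdown report request above boils down to assembling strings and writing a file in the sandbox. A minimal sketch, assuming a hypothetical summary dataframe and chart filename (the table is built by hand so no extra formatting library is needed):

```python
import pandas as pd

# Hypothetical summary data to embed in the report
summary = pd.DataFrame({
    "Region": ["East", "West"],
    "Total_Revenue": [120000, 95000],
})

# Build a markdown table row by row
table_lines = ["| Region | Total_Revenue |", "|---|---|"]
for row in summary.itertuples(index=False):
    table_lines.append(f"| {row.Region} | {row.Total_Revenue} |")

report = "\n".join([
    "# Sales Analysis Report",
    "",
    "## Key Findings",
    "- East leads total revenue.",
    "",
    "## Regional Summary",
    *table_lines,
    "",
    "![Revenue chart](revenue_by_region.png)",  # hypothetical chart file
])

with open("analysis_report.md", "w") as f:
    f.write(report)
```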
Advanced Data Analysis Prompt Templates
Here are ready-to-use prompt templates for common analysis scenarios. Copy and adapt them for your own datasets.
Template 1: Sales Performance Dashboard
Prompt: Analyze the uploaded sales data file. Perform the following:
1. Show total revenue, total orders, and average order value by month for the past 12
months.
2. Identify the top 10 products by revenue and the bottom 10 by units sold.
3. Create a line chart of monthly revenue with a trendline.
4. Create a bar chart of revenue by sales channel.
5. Calculate month-over-month growth rates and highlight any months with negative
growth.
6. Export the monthly summary as a CSV file.
Template 2: Survey Results Analysis
Prompt: Analyze the uploaded survey data (Likert scale responses). For each question:
1. Calculate the mean, median, and standard deviation of responses.
2. Show the distribution of responses (1-5) as a stacked horizontal bar chart.
3. Identify questions with the highest and lowest average scores.
4. Run a Cronbach's alpha reliability test across all question items.
5. Perform a factor analysis to identify underlying dimensions.
6. Export a summary table as an Excel file.
Template 3: Financial Data Cleanup
Prompt: Clean the uploaded financial transactions file:
1. Parse all date columns into proper datetime format.
2. Convert currency columns to numeric, removing symbols and commas.
3. Flag and remove duplicate transaction IDs.
4. Identify outlier transactions (beyond 3 standard deviations from the mean).
5. Categorize transactions by amount ranges: micro (<$10), small ($10-100), medium
($100-1000), large (>$1000).
6. Save the cleaned file and a data quality report.
Template 4: A/B Test Analysis
Prompt: Analyze the A/B test results in the uploaded file. The file has columns for
user_id, variant (A or B), converted (0 or 1), and revenue.
1. Calculate conversion rate and average revenue per user for each variant.
2. Run a chi-square test for conversion rate significance.
3. Run a Mann-Whitney U test for revenue difference significance.
4. Calculate the 95% confidence interval for the difference in conversion rates.
5. Determine the required sample size to detect the observed effect at 80% power.
6. Provide a clear recommendation on which variant to ship.
Template 5: Customer Segmentation
Prompt: Perform RFM (Recency, Frequency, Monetary) analysis on the uploaded
transaction data:
1. Calculate Recency (days since last purchase), Frequency (total purchases), and
Monetary (total spend) for each customer.
2. Score each dimension on a 1-5 scale using quintiles.
3. Create customer segments (Champions, Loyal, At Risk, Lost, etc.) based on combined
RFM scores.
4. Show the size and average metrics for each segment.
5. Visualize segments with a treemap or grouped bar chart.
6. Export the customer-level RFM table with segment labels as a CSV.
Limitations and Workarounds
While Advanced Data Analysis is remarkably capable, it has constraints you should be aware of.
No internet access: The sandboxed environment cannot fetch data from URLs, APIs, or databases during execution. All data must be uploaded as files. Workaround: Download data locally first, then upload it to ChatGPT.
Session volatility: The Python environment resets after roughly 10 to 15 minutes of inactivity. All variables, dataframes, and generated files are lost. Workaround: Save important intermediate results as files early. Download them before stepping away. When resuming, re-upload the saved files.
Memory constraints: The environment has limited RAM (typically a few gigabytes). Very large datasets can cause out-of-memory errors. Workaround: Pre-filter or sample your data before uploading. Use chunked processing for large files. Ask ChatGPT to process data in batches.
No persistent storage: Files and state do not persist between conversations. Each new conversation starts fresh. Workaround: Keep a local folder with your working files and re-upload as needed.
Limited library availability: While the environment includes most popular data science libraries, some specialized packages (e.g., TensorFlow, PyTorch, geopandas, certain NLP libraries) may not be installed. Workaround: Ask ChatGPT whether a specific library is available before building your analysis around it. For unavailable libraries, ask for alternative approaches using the available stack.
Execution time limits: Code that runs longer than approximately 120 seconds is terminated. Workaround: Break complex operations into smaller steps. Use more efficient algorithms or sample the data for initial exploration before running full analyses.
Chart interactivity: Charts generated with matplotlib and seaborn are static images. You cannot hover, zoom, or filter them interactively. Workaround: Request multiple versions of a chart with different zoom levels or filters. For interactive needs, ask ChatGPT to generate plotly HTML files you can open locally.
File format limitations for output: While many input formats are supported, output options are sometimes limited by available libraries. Complex Excel formatting (conditional formatting, merged cells, charts within spreadsheets) may not be fully supported. Workaround: Export data in a clean CSV or simple Excel format, then apply formatting in your local spreadsheet application.
Advanced Data Analysis vs. Manual Python: When to Use Which
Understanding when Advanced Data Analysis is the right tool and when you should use a local Python environment helps you work more efficiently.
Choose Advanced Data Analysis when:
- You need quick, one-off analysis of a dataset without setting up a local environment.
- You are not a programmer but need to perform data analysis beyond what spreadsheets offer.
- You want to prototype an analysis approach before building a production pipeline.
- The dataset is small to medium sized (under 500,000 rows, under 100 MB).
- You need to generate a few charts or summary statistics for a report.
- You are exploring an unfamiliar dataset and want to iterate quickly with natural language.
Choose a local Python environment when:
- You need to process very large datasets (millions of rows, multiple gigabytes).
- Your workflow requires internet access (API calls, database connections, web scraping).
- You need deep learning or GPU-accelerated computing.
- The analysis is part of a production pipeline that needs to run on a schedule.
- You require specialized libraries not available in the sandbox.
- You need persistent state across sessions and version-controlled notebooks.
- Data sensitivity policies prohibit uploading files to third-party services.
A hybrid approach often works best: Use Advanced Data Analysis for exploration and prototyping, then review the generated code and adapt it for your local environment when building the production version. You can ask ChatGPT to show all code it generated during the session, clean it up, and present it as a single script or notebook for you to download.
Frequently Asked Questions
Is my data safe when I upload it to Advanced Data Analysis?
Uploaded files are processed in an isolated sandbox environment. According to OpenAI’s policies, data from ChatGPT Plus, Team, and Enterprise conversations is not used for model training by default. Enterprise and Team plans offer additional data protection guarantees. However, always review your organization’s data policies before uploading sensitive or regulated data. Avoid uploading files containing personally identifiable information (PII), protected health information (PHI), or trade secrets unless your plan’s terms explicitly permit it.
Can I use Advanced Data Analysis with GPT-3.5?
No. Advanced Data Analysis with full file upload and code execution capabilities requires GPT-4o or later models. GPT-3.5 Turbo does not support file uploads or sandboxed code execution within ChatGPT.
What happens when the session times out?
When the code execution environment resets due to inactivity, all uploaded files, variables, and generated outputs are cleared. You will need to re-upload your files and re-run any setup code. ChatGPT retains the conversation text, so you can reference previous messages, but the Python state is gone. This typically happens after 10 to 15 minutes of no interaction with the code environment.
Can I install additional Python packages?
The sandboxed environment does not support installing new packages via pip. You are limited to the pre-installed libraries, which include pandas, numpy, scipy, scikit-learn, statsmodels, matplotlib, seaborn, Pillow, openpyxl, and several others. If you need a library that is not available, ask ChatGPT to suggest an alternative approach using the available packages.
How do I get the code ChatGPT generated?
ChatGPT shows the code it executes in expandable code blocks within the conversation. Click on the code block to view the full script. You can also ask explicitly: “Show me all the Python code you ran in this session as a single consolidated script” and ChatGPT will assemble it for you.
Can Advanced Data Analysis handle real-time or streaming data?
No. The sandbox has no internet access and cannot connect to live data streams, databases, or APIs. It works exclusively with files uploaded by the user. For real-time data needs, extract a snapshot of your data, save it as a file, and upload that snapshot.
What is the difference between Advanced Data Analysis and ChatGPT plugins or GPTs?
Advanced Data Analysis is a native capability built into GPT-4o that runs Python code in a sandboxed environment. Custom GPTs and plugins are separate extensions that can call external APIs and services. Some custom GPTs may incorporate code execution capabilities similar to Advanced Data Analysis, but the native feature is available to all Plus, Team, and Enterprise users without additional configuration.
Can I analyze images or PDFs with Advanced Data Analysis?
Yes, with caveats. Images can be processed using Python libraries like Pillow for pixel-level manipulation, resizing, and basic analysis. PDFs can be parsed for text extraction using built-in libraries, but complex PDF layouts with tables and images may not extract cleanly. For best results with PDFs, convert tables to CSV before uploading.
How accurate are the statistical results?
The statistical computations are executed by well-established Python libraries (scipy, statsmodels, scikit-learn), so the mathematical results are reliable. However, the appropriateness of a given statistical test depends on whether the assumptions (normality, independence, sample size) are met. Always validate that ChatGPT selected the right test for your data and ask it to verify assumptions when running hypothesis tests or building models.