Introduction¶
Do higher film budgets lead to more box office revenue? Let's find out if there's a relationship using the movie budgets and financial performance data that I've scraped from the-numbers.com on May 1st, 2018.
Import Statements¶
Notebook Presentation¶
Read the Data¶
Explore and Clean the Data¶
Dataset Structure and Quality¶
Inspect the shape, null values, duplicates, and column types before any transformations.
(5391, 6)
| Rank | Release_Date | Movie_Title | USD_Production_Budget | USD_Worldwide_Gross | USD_Domestic_Gross | |
|---|---|---|---|---|---|---|
| 0 | 5293 | 8/2/1915 | The Birth of a Nation | $110,000 | $11,000,000 | $10,000,000 |
| 1 | 5140 | 5/9/1916 | Intolerance | $385,907 | $0 | $0 |
| 2 | 5230 | 12/24/1916 | 20,000 Leagues Under the Sea | $200,000 | $8,000,000 | $8,000,000 |
| 3 | 5299 | 9/17/1920 | Over the Hill to the Poorhouse | $100,000 | $3,000,000 | $3,000,000 |
| 4 | 5222 | 1/1/1925 | The Big Parade | $245,000 | $22,000,000 | $11,000,000 |
| Rank | Release_Date | Movie_Title | USD_Production_Budget | USD_Worldwide_Gross | USD_Domestic_Gross | |
|---|---|---|---|---|---|---|
| 5386 | 2950 | 10/8/2018 | Meg | $15,000,000 | $0 | $0 |
| 5387 | 126 | 12/18/2018 | Aquaman | $160,000,000 | $0 | $0 |
| 5388 | 96 | 12/31/2020 | Singularity | $175,000,000 | $0 | $0 |
| 5389 | 1119 | 12/31/2020 | Hannibal the Conqueror | $50,000,000 | $0 | $0 |
| 5390 | 2517 | 12/31/2020 | Story of Bonnie and Clyde, The | $20,000,000 | $0 | $0 |
<class 'pandas.core.frame.DataFrame'> RangeIndex: 5391 entries, 0 to 5390 Data columns (total 6 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Rank 5391 non-null int64 1 Release_Date 5391 non-null object 2 Movie_Title 5391 non-null object 3 USD_Production_Budget 5391 non-null object 4 USD_Worldwide_Gross 5391 non-null object 5 USD_Domestic_Gross 5391 non-null object dtypes: int64(1), object(5) memory usage: 252.8+ KB
| Rank | Release_Date | Movie_Title | USD_Production_Budget | USD_Worldwide_Gross | USD_Domestic_Gross | |
|---|---|---|---|---|---|---|
| 737 | 2079 | 9/23/1994 | The Shawshank Redemption | $25,000,000 | $28,307,092 | $28,241,469 |
| 3884 | 4228 | 4/22/2011 | Stake Land | $4,000,000 | $679,482 | $33,245 |
| 2365 | 5002 | 1/7/2005 | Undead | $750,000 | $229,250 | $41,196 |
| 1023 | 2432 | 11/14/1997 | The Man Who Knew Too Little | $20,000,000 | $13,801,755 | $13,801,755 |
| 1734 | 1463 | 12/21/2001 | Joe Somebody | $38,000,000 | $24,515,990 | $22,770,864 |
Data Type Conversions¶
Converting Currency Columns to Numeric¶
Strip $ signs and commas from the three monetary columns and convert to float so they can be used in calculations and plots.
| Rank | Release_Date | Movie_Title | USD_Production_Budget | USD_Worldwide_Gross | USD_Domestic_Gross | |
|---|---|---|---|---|---|---|
| 0 | 5293 | 8/2/1915 | The Birth of a Nation | 110000 | 11000000 | 10000000 |
| 1 | 5140 | 5/9/1916 | Intolerance | 385907 | 0 | 0 |
| 2 | 5230 | 12/24/1916 | 20,000 Leagues Under the Sea | 200000 | 8000000 | 8000000 |
| 3 | 5299 | 9/17/1920 | Over the Hill to the Poorhouse | 100000 | 3000000 | 3000000 |
| 4 | 5222 | 1/1/1925 | The Big Parade | 245000 | 22000000 | 11000000 |
Parsing Release Dates¶
Convert the Release_Date column from object strings to Pandas datetime so year and decade can be extracted later.
| Rank | Release_Date | Movie_Title | USD_Production_Budget | USD_Worldwide_Gross | USD_Domestic_Gross | |
|---|---|---|---|---|---|---|
| 0 | 5293 | 1915-08-02 | The Birth of a Nation | 110000 | 11000000 | 10000000 |
| 1 | 5140 | 1916-05-09 | Intolerance | 385907 | 0 | 0 |
| 2 | 5230 | 1916-12-24 | 20,000 Leagues Under the Sea | 200000 | 8000000 | 8000000 |
| 3 | 5299 | 1920-09-17 | Over the Hill to the Poorhouse | 100000 | 3000000 | 3000000 |
| 4 | 5222 | 1925-01-01 | The Big Parade | 245000 | 22000000 | 11000000 |
Descriptive Statistics¶
Descriptive Statistics¶
Summary statistics reveal budget and revenue distribution across the dataset: ranges, medians, and the spread between the bottom and top quartiles.
| Rank | Release_Date | USD_Production_Budget | USD_Worldwide_Gross | USD_Domestic_Gross | |
|---|---|---|---|---|---|
| count | 5,391.00 | 5391 | 5,391.00 | 5,391.00 | 5,391.00 |
| mean | 2,696.00 | 2003-09-19 15:02:02.203672704 | 31,113,737.58 | 88,855,421.96 | 41,235,519.44 |
| min | 1.00 | 1915-08-02 00:00:00 | 1,100.00 | 0.00 | 0.00 |
| 25% | 1,348.50 | 1999-12-02 12:00:00 | 5,000,000.00 | 3,865,206.00 | 1,330,901.50 |
| 50% | 2,696.00 | 2006-06-23 00:00:00 | 17,000,000.00 | 27,450,453.00 | 17,192,205.00 |
| 75% | 4,043.50 | 2011-11-23 00:00:00 | 40,000,000.00 | 96,454,455.00 | 52,343,687.00 |
| max | 5,391.00 | 2020-12-31 00:00:00 | 425,000,000.00 | 2,783,918,982.00 | 936,662,225.00 |
| std | 1,556.39 | NaN | 40,523,796.88 | 168,457,757.00 | 66,029,346.27 |
| Rank | Release_Date | Movie_Title | USD_Production_Budget | USD_Worldwide_Gross | USD_Domestic_Gross | |
|---|---|---|---|---|---|---|
| 2427 | 5391 | 2005-05-08 | My Date With Drew | 1100 | 181041 | 181041 |
| Rank | Release_Date | Movie_Title | USD_Production_Budget | USD_Worldwide_Gross | USD_Domestic_Gross | |
|---|---|---|---|---|---|---|
| 3529 | 1 | 2009-12-18 | Avatar | 425000000 | 2783918982 | 760507625 |
Investigating the Zero Revenue Films¶
Films with Zero Domestic Gross¶
Identify films reporting $0 in domestic (US) revenue — either unreleased, streaming-only, or internationally distributed titles.
Number of movies with zero domestic gross: 512
| Rank | Release_Date | Movie_Title | USD_Production_Budget | USD_Worldwide_Gross | USD_Domestic_Gross | |
|---|---|---|---|---|---|---|
| 5388 | 96 | 2020-12-31 | Singularity | 175000000 | 0 | 0 |
| 5387 | 126 | 2018-12-18 | Aquaman | 160000000 | 0 | 0 |
| 5384 | 321 | 2018-09-03 | A Wrinkle in Time | 103000000 | 0 | 0 |
| 5385 | 366 | 2018-10-08 | Amusement Park | 100000000 | 0 | 0 |
| 5090 | 556 | 2015-12-31 | Don Gato, el inicio de la pandilla | 80000000 | 4547660 | 0 |
| 4294 | 566 | 2012-12-31 | Astérix et Obélix: Au service de Sa Majesté | 77600000 | 60680125 | 0 |
| 5058 | 880 | 2015-11-12 | The Ridiculous 6 | 60000000 | 0 | 0 |
| 5338 | 879 | 2017-04-08 | The Dark Tower | 60000000 | 0 | 0 |
| 5389 | 1119 | 2020-12-31 | Hannibal the Conqueror | 50000000 | 0 | 0 |
| 4295 | 1230 | 2012-12-31 | Foodfight! | 45000000 | 73706 | 0 |
Films with Zero Worldwide Gross¶
Narrow the view further to films reporting no revenue globally, and rank them by production budget to spot the most expensive zero-earners.
Number of movies with zero worldwide gross: 357
| Rank | Release_Date | Movie_Title | USD_Production_Budget | USD_Worldwide_Gross | USD_Domestic_Gross | |
|---|---|---|---|---|---|---|
| 5388 | 96 | 2020-12-31 | Singularity | 175000000 | 0 | 0 |
| 5387 | 126 | 2018-12-18 | Aquaman | 160000000 | 0 | 0 |
| 5384 | 321 | 2018-09-03 | A Wrinkle in Time | 103000000 | 0 | 0 |
| 5385 | 366 | 2018-10-08 | Amusement Park | 100000000 | 0 | 0 |
| 5058 | 880 | 2015-11-12 | The Ridiculous 6 | 60000000 | 0 | 0 |
| 5338 | 879 | 2017-04-08 | The Dark Tower | 60000000 | 0 | 0 |
| 5389 | 1119 | 2020-12-31 | Hannibal the Conqueror | 50000000 | 0 | 0 |
| 5092 | 1435 | 2015-12-31 | The Crow | 40000000 | 0 | 0 |
| 3300 | 1631 | 2008-12-31 | Black Water Transit | 35000000 | 0 | 0 |
| 5045 | 1656 | 2015-10-30 | Freaks of Nature | 33000000 | 0 | 0 |
Filtering on Multiple Conditions¶
Practice¶
Number of international releases: 155
| Rank | Release_Date | Movie_Title | USD_Production_Budget | USD_Worldwide_Gross | USD_Domestic_Gross | |
|---|---|---|---|---|---|---|
| 71 | 4310 | 1956-02-16 | Carousel | 3380000 | 3220 | 0 |
| 1579 | 5087 | 2001-02-11 | Everything Put Together | 500000 | 7890 | 0 |
| 1744 | 3695 | 2001-12-31 | The Hole | 7500000 | 10834406 | 0 |
| 2155 | 4236 | 2003-12-31 | Nothing | 4000000 | 63180 | 0 |
| 2203 | 2513 | 2004-03-31 | The Touch | 20000000 | 5918742 | 0 |
Filtering with .query()¶
Reproduce the international-releases filter using the .query() API as an alternative syntax.
Number of international releases (using query): 155
Removing Unreleased Films¶
Films with a release date on or after the data-collection date (May 1, 2018) had no opportunity to earn revenue. Drop them to create data_clean.
| Rank | Release_Date | Movie_Title | USD_Production_Budget | USD_Worldwide_Gross | USD_Domestic_Gross | |
|---|---|---|---|---|---|---|
| 5384 | 321 | 2018-09-03 | A Wrinkle in Time | 103000000 | 0 | 0 |
| 5385 | 366 | 2018-10-08 | Amusement Park | 100000000 | 0 | 0 |
| 5386 | 2950 | 2018-10-08 | Meg | 15000000 | 0 | 0 |
| 5387 | 126 | 2018-12-18 | Aquaman | 160000000 | 0 | 0 |
| 5388 | 96 | 2020-12-31 | Singularity | 175000000 | 0 | 0 |
| 5389 | 1119 | 2020-12-31 | Hannibal the Conqueror | 50000000 | 0 | 0 |
| 5390 | 2517 | 2020-12-31 | Story of Bonnie and Clyde, The | 20000000 | 0 | 0 |
| Rank | Release_Date | Movie_Title | USD_Production_Budget | USD_Worldwide_Gross | USD_Domestic_Gross | |
|---|---|---|---|---|---|---|
| 0 | 5293 | 1915-08-02 | The Birth of a Nation | 110000 | 11000000 | 10000000 |
| 1 | 5140 | 1916-05-09 | Intolerance | 385907 | 0 | 0 |
| 2 | 5230 | 1916-12-24 | 20,000 Leagues Under the Sea | 200000 | 8000000 | 8000000 |
| 3 | 5299 | 1920-09-17 | Over the Hill to the Poorhouse | 100000 | 3000000 | 3000000 |
| 4 | 5222 | 1925-01-01 | The Big Parade | 245000 | 22000000 | 11000000 |
Films that Failed to Recoup Their Budget¶
Calculate what proportion of films never recovered their production costs from worldwide revenue.
37.28% of movies lost money.
Seaborn for Data Viz: Bubble Charts¶
Budget vs Revenue Over Time¶
Plot every film as a point: release year on the x-axis, production budget on the y-axis, with point size and colour encoding worldwide gross.
Grouping Releases by Decade¶
Derive a Decade column using floor division so films can be grouped and compared across eras.
| Rank | Release_Date | Movie_Title | USD_Production_Budget | USD_Worldwide_Gross | USD_Domestic_Gross | Decade | |
|---|---|---|---|---|---|---|---|
| 0 | 5293 | 1915-08-02 | The Birth of a Nation | 110000 | 11000000 | 10000000 | 1910 |
| 1 | 5140 | 1916-05-09 | Intolerance | 385907 | 0 | 0 | 1910 |
| 2 | 5230 | 1916-12-24 | 20,000 Leagues Under the Sea | 200000 | 8000000 | 8000000 | 1910 |
| 3 | 5299 | 1920-09-17 | Over the Hill to the Poorhouse | 100000 | 3000000 | 3000000 | 1920 |
| 4 | 5222 | 1925-01-01 | The Big Parade | 245000 | 22000000 | 11000000 | 1920 |
Splitting the Dataset by Era¶
Separate pre-1970 films from modern films to compare how the budget-to-revenue relationship differs across eras.
Number of old films: 153 Number of new films: 5231
| Rank | Release_Date | Movie_Title | USD_Production_Budget | USD_Worldwide_Gross | USD_Domestic_Gross | Decade | |
|---|---|---|---|---|---|---|---|
| 153 | 2159 | 1970-01-01 | Waterloo | 25000000 | 0 | 0 | 1970 |
| 154 | 2270 | 1970-01-01 | Darling Lili | 22000000 | 5000000 | 5000000 | 1970 |
| 155 | 3136 | 1970-01-01 | Patton | 12000000 | 62500000 | 62500000 | 1970 |
| 156 | 3277 | 1970-01-01 | The Molly Maguires | 11000000 | 2200000 | 2200000 | 1970 |
| 157 | 4265 | 1970-01-01 | M*A*S*H | 3500000 | 81600000 | 81600000 | 1970 |
Seaborn Regression Plots¶
Regression Line: Films from 1970 Onwards¶
Fit and visualise the linear relationship between budget and worldwide revenue for the modern film era.
Run Your Own Regression with scikit-learn¶
$$ REV \hat ENUE = \theta _0 + \theta _1 BUDGET$$
LinearRegression()In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
Parameters
| fit_intercept | True | |
| copy_X | True | |
| tol | 1e-06 | |
| n_jobs | None | |
| positive | False |
array([-8650768.00661096])
array([[3.12259592]])
The model explains 55.77% of the variance
Regression for Pre-1970 Films¶
Repeat the regression on old films to compare slope, intercept, and explanatory power against the modern era model.
LinearRegression()In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
Parameters
| fit_intercept | True | |
| copy_X | True | |
| tol | 1e-06 | |
| n_jobs | None | |
| positive | False |
The slope coefficient is 1.6477131440107315 THe intercept is 22821538.635080386
Estimating Revenue from Budget¶
With slope and intercept in hand, the model equation can generate a point prediction for any given budget.
$$ REV\hat{ENUE} = \theta_0 + \theta_1 \cdot BUDGET $$
The estimated revenue of a $350 million budget is $599,521,139
Key Findings¶
- Production budget explains 55.77% of the variance in worldwide revenue for films released from 1970 onwards (R² = 0.5577).
- Every $1 of production budget is associated with ~$3.12 in worldwide gross (slope = 3.1226) for the modern era.
- 37.28% of all released films failed to recoup their production costs from worldwide box-office revenue.
- 357 films reported zero worldwide gross, including several with budgets above $60M — many were unreleased or streaming-only at the time of data collection.
- 155 international releases earned worldwide revenue while recording zero domestic (US) gross.
- The highest-budget film in the dataset is Avatar ($425M budget), which grossed $2.78B worldwide — more than 6× its production cost.
- The lowest-budget film is My Date With Drew ($1,100 budget, 2005), which grossed $181,041 — a return of over 160×.
- Pre-1970 films show a weaker and shallower budget-to-revenue relationship (slope ≈ 1.65) compared to the modern era, reflecting structural differences in distribution and market scale.