Setup and Context¶
Introduction¶
Welcome to Boston, Massachusetts, in the 1970s! Imagine you're working for a real estate development company. Your company wants to value any residential project before construction starts. You are tasked with building a model that can provide a price estimate based on a home's characteristics, such as:
- The number of rooms
- The distance to employment centres
- How rich or poor the area is
- How many students there are per teacher in local schools, etc.
To accomplish your task you will:
- Analyse and explore the Boston house price data
- Split your data for training and testing
- Run a Multivariable Regression
- Evaluate your model's coefficients and residuals
- Use data transformation to improve your model's performance
- Use your model to estimate a property price
Upgrade plotly (only Google Colab Notebook)¶
Google Colab may not be running the latest version of plotly. If you're working in Google Colab, uncomment the line below, run the cell, and restart your notebook server.
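The cell in question looks something like this (the `%pip` magic is the notebook-friendly way to upgrade a package):

```python
# In Google Colab only: uncomment the next line, run the cell,
# then restart the runtime before continuing.
# %pip install --upgrade plotly
```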
Import Statements¶
Notebook Presentation¶
Load the Data¶
The first column in the .csv file just has the row numbers, so it will be used as the index.
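Here's a sketch of how `index_col` handles that first column, using a tiny inline stand-in for the real file (the actual filename isn't shown here, and the values below are hypothetical):

```python
import pandas as pd
from io import StringIO

# Two-row stand-in for the real .csv file: the first, unnamed column
# holds the row numbers.
csv_text = ",CRIM,ZN,PRICE\n0,0.01,18.0,24.0\n1,0.03,0.0,21.6\n"

# index_col=0 tells pandas to use that first column as the index
# rather than treating it as a feature.
data = pd.read_csv(StringIO(csv_text), index_col=0)
print(data.columns.tolist())  # the row-number column is not listed
```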
Understand the Boston House Price Dataset¶
Characteristics:
:Number of Instances: 506
:Number of Attributes: 13 numeric/categorical predictive. The Median Value (attribute 14) is the target.
:Attribute Information (in order):
1. CRIM per capita crime rate by town
2. ZN proportion of residential land zoned for lots over 25,000 sq.ft.
3. INDUS proportion of non-retail business acres per town
4. CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
5. NOX nitric oxides concentration (parts per 10 million)
6. RM average number of rooms per dwelling
7. AGE proportion of owner-occupied units built prior to 1940
8. DIS weighted distances to five Boston employment centres
9. RAD index of accessibility to radial highways
10. TAX full-value property-tax rate per $10,000
11. PTRATIO pupil-teacher ratio by town
12. B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
13. LSTAT % lower status of the population
14. PRICE Median value of owner-occupied homes in $1000's
:Missing Attribute Values: None
:Creator: Harrison, D. and Rubinfeld, D.L.
This is a copy of the UCI ML housing dataset, taken from the StatLib library maintained at Carnegie Mellon University. You can find the original research paper here.
Preliminary Data Exploration¶
(506, 14)
Index(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX',
'PTRATIO', 'B', 'LSTAT', 'PRICE'],
dtype='object')
| CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | PRICE | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.01 | 18.00 | 2.31 | 0.00 | 0.54 | 6.58 | 65.20 | 4.09 | 1.00 | 296.00 | 15.30 | 396.90 | 4.98 | 24.00 |
| 1 | 0.03 | 0.00 | 7.07 | 0.00 | 0.47 | 6.42 | 78.90 | 4.97 | 2.00 | 242.00 | 17.80 | 396.90 | 9.14 | 21.60 |
| 2 | 0.03 | 0.00 | 7.07 | 0.00 | 0.47 | 7.18 | 61.10 | 4.97 | 2.00 | 242.00 | 17.80 | 392.83 | 4.03 | 34.70 |
| 3 | 0.03 | 0.00 | 2.18 | 0.00 | 0.46 | 7.00 | 45.80 | 6.06 | 3.00 | 222.00 | 18.70 | 394.63 | 2.94 | 33.40 |
| 4 | 0.07 | 0.00 | 2.18 | 0.00 | 0.46 | 7.15 | 54.20 | 6.06 | 3.00 | 222.00 | 18.70 | 396.90 | 5.33 | 36.20 |
| CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | PRICE | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 501 | 0.06 | 0.00 | 11.93 | 0.00 | 0.57 | 6.59 | 69.10 | 2.48 | 1.00 | 273.00 | 21.00 | 391.99 | 9.67 | 22.40 |
| 502 | 0.05 | 0.00 | 11.93 | 0.00 | 0.57 | 6.12 | 76.70 | 2.29 | 1.00 | 273.00 | 21.00 | 396.90 | 9.08 | 20.60 |
| 503 | 0.06 | 0.00 | 11.93 | 0.00 | 0.57 | 6.98 | 91.00 | 2.17 | 1.00 | 273.00 | 21.00 | 396.90 | 5.64 | 23.90 |
| 504 | 0.11 | 0.00 | 11.93 | 0.00 | 0.57 | 6.79 | 89.30 | 2.39 | 1.00 | 273.00 | 21.00 | 393.45 | 6.48 | 22.00 |
| 505 | 0.05 | 0.00 | 11.93 | 0.00 | 0.57 | 6.03 | 80.80 | 2.50 | 1.00 | 273.00 | 21.00 | 396.90 | 7.88 | 11.90 |
CRIM       506
ZN         506
INDUS      506
CHAS       506
NOX        506
RM         506
AGE        506
DIS        506
RAD        506
TAX        506
PTRATIO    506
B          506
LSTAT      506
PRICE      506
dtype: int64
Data Cleaning - Check for Missing Values and Duplicates¶
<class 'pandas.core.frame.DataFrame'>
Index: 506 entries, 0 to 505
Data columns (total 14 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   CRIM     506 non-null    float64
 1   ZN       506 non-null    float64
 2   INDUS    506 non-null    float64
 3   CHAS     506 non-null    float64
 4   NOX      506 non-null    float64
 5   RM       506 non-null    float64
 6   AGE      506 non-null    float64
 7   DIS      506 non-null    float64
 8   RAD      506 non-null    float64
 9   TAX      506 non-null    float64
 10  PTRATIO  506 non-null    float64
 11  B        506 non-null    float64
 12  LSTAT    506 non-null    float64
 13  PRICE    506 non-null    float64
dtypes: float64(14)
memory usage: 59.3 KB
Any NaN values? False
Any duplicates? False
There are no null (i.e., NaN) values. Fantastic!
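Both checks are one-liners in pandas. A self-contained sketch on a small stand-in frame (hypothetical values):

```python
import pandas as pd

# Small stand-in for the Boston data, just to illustrate the checks
data = pd.DataFrame({"RM": [6.58, 6.42, 7.18],
                     "PRICE": [24.0, 21.6, 34.7]})

# .isna() flags missing cells; .duplicated() flags repeated rows
print(f"Any NaN values? {data.isna().values.any()}")
print(f"Any duplicates? {data.duplicated().values.any()}")
```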
Descriptive Statistics¶
| CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | PRICE | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 506.00 | 506.00 | 506.00 | 506.00 | 506.00 | 506.00 | 506.00 | 506.00 | 506.00 | 506.00 | 506.00 | 506.00 | 506.00 | 506.00 |
| mean | 3.61 | 11.36 | 11.14 | 0.07 | 0.55 | 6.28 | 68.57 | 3.80 | 9.55 | 408.24 | 18.46 | 356.67 | 12.65 | 22.53 |
| std | 8.60 | 23.32 | 6.86 | 0.25 | 0.12 | 0.70 | 28.15 | 2.11 | 8.71 | 168.54 | 2.16 | 91.29 | 7.14 | 9.20 |
| min | 0.01 | 0.00 | 0.46 | 0.00 | 0.39 | 3.56 | 2.90 | 1.13 | 1.00 | 187.00 | 12.60 | 0.32 | 1.73 | 5.00 |
| 25% | 0.08 | 0.00 | 5.19 | 0.00 | 0.45 | 5.89 | 45.02 | 2.10 | 4.00 | 279.00 | 17.40 | 375.38 | 6.95 | 17.02 |
| 50% | 0.26 | 0.00 | 9.69 | 0.00 | 0.54 | 6.21 | 77.50 | 3.21 | 5.00 | 330.00 | 19.05 | 391.44 | 11.36 | 21.20 |
| 75% | 3.68 | 12.50 | 18.10 | 0.00 | 0.62 | 6.62 | 94.07 | 5.19 | 24.00 | 666.00 | 20.20 | 396.23 | 16.96 | 25.00 |
| max | 88.98 | 100.00 | 27.74 | 1.00 | 0.87 | 8.78 | 100.00 | 12.13 | 24.00 | 711.00 | 22.00 | 396.90 | 37.97 | 50.00 |
CHAS shows whether the home is next to the Charles River or not. As such, it only has the value 0 or 1. This kind of feature is also known as a dummy variable.
The average price of a Boston home in the 1970s was 22.53 or $22,530. We've experienced a lot of inflation and house price appreciation since then!
Feature Distributions¶
House Prices 💰¶
Note there is a spike in the number of homes at the very right tail, at the $50,000 mark. 🤔
Distance to Employment - Length of Commute 🚗¶
Most homes are about 3.8 miles away from work. There are fewer and fewer homes the further out we go.
Number of Rooms¶
Access to Highways 🛣¶
RAD is an index of accessibility to roads. Better access to a highway is represented by a higher number. There's a big gap in the values of the index.
Next to the River? ⛵️¶
Challenge
Create a bar chart with plotly for CHAS to show how many more homes are away from the river versus next to it. The bar chart should look something like this:
You can make your life easier by providing a list of values for the x-axis (e.g., x=['No', 'Yes'])
We see that out of the total number of 506 homes, only 35 are located next to the Charles River.
Understand the Relationships in the Data¶
Pairwise Relationships¶
We see that we get back a grid. You might have to zoom in or squint a bit, but there are scatterplots between all the columns in our dataset. And down the diagonal in the middle, we get histograms for all our columns.
Detailed Pairwise Analysis¶
Distance from Employment vs. Pollution¶
Challenge:
Compare DIS (Distance from employment) with NOX (Nitric Oxide Pollution) using Seaborn's .jointplot(). Does pollution go up or down as the distance increases?
We see that pollution goes down as we go further and further out of town. This makes intuitive sense. However, even at the same distance of 2 miles to employment centres, we can get very different levels of pollution. By the same token, DIS of 9 miles and 12 miles have very similar levels of pollution.
Proportion of Non-Retail Industry 🏭🏭🏭 versus Pollution¶
Challenge:
Compare INDUS (the proportion of non-retail industry i.e., factories) with NOX (Nitric Oxide Pollution) using Seaborn's .jointplot(). Does pollution go up or down as there is a higher proportion of industry?
% of Lower Income Population vs Average Number of Rooms¶
Challenge
Compare LSTAT (proportion of lower-income population) with RM (number of rooms) using Seaborn's .jointplot(). How does the number of rooms per dwelling vary with the poverty of area? Do homes have more or fewer rooms when LSTAT is low?
In the top left corner we see that for all the homes with 8 or more rooms, LSTAT is well below 10%.
% of Lower Income Population versus Home Price¶
Challenge
Compare LSTAT with PRICE using Seaborn's .jointplot(). How does the proportion of the lower-income population in an area affect home prices?
Number of Rooms versus Home Value¶
Challenge
Compare RM (number of rooms) with PRICE using Seaborn's .jointplot(). You can probably guess how the number of rooms affects home prices. 😊
Again, we see those homes at the $50,000 mark all lined up at the top of the chart. Perhaps there was some sort of cap or maximum value imposed during data collection.
Split Training & Test Dataset¶
We can't use all 506 entries in our dataset to train our model. The reason is that we want to evaluate our model on data that it hasn't seen yet (i.e., out-of-sample data). That way we can get a better idea of its performance in the real world.
Challenge
- Import the `train_test_split()` function from sklearn
- Create 4 subsets: X_train, X_test, y_train, y_test
- Split the training and testing data roughly 80/20
- To get the same random split every time you run your notebook, use `random_state=10`. This helps us get the same results every time and avoids confusion while we're learning.
Hint: Remember, your target is your home PRICE, and your features are all the other columns you'll use to predict the price.
Training data is 79.8% of the total data. Test data makes up the remaining 20.2%.
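A self-contained sketch of the split, using a synthetic 506-row stand-in for the Boston data (hypothetical values; the real notebook splits the loaded DataFrame):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic 506-row stand-in for the Boston data
rng = np.random.default_rng(10)
data = pd.DataFrame({
    "RM": rng.normal(6.3, 0.7, 506),
    "LSTAT": rng.uniform(1.7, 38.0, 506),
    "PRICE": rng.normal(22.5, 9.2, 506),
})

target = data["PRICE"]                     # what we predict
features = data.drop("PRICE", axis=1)      # everything else

# test_size=0.2 reserves ~20% of rows for out-of-sample evaluation
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.2, random_state=10
)
print(f"Training data is {len(X_train) / len(features):.1%} of the total.")
```

Because sklearn rounds the test set up to a whole number of rows (102 of 506), the split comes out at 79.8%/20.2% rather than exactly 80/20.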
Multivariable Regression¶
In a previous lesson, we had a linear model with only a single feature (our movie budgets). This time we have a total of 13 features. Therefore, our Linear Regression model will have the following form:
$$ \widehat{PRICE} = \theta_0 + \theta_1 RM + \theta_2 NOX + \theta_3 DIS + \theta_4 CHAS + \ldots + \theta_{13} LSTAT $$
Model v1 — Linear Regression on Raw Prices¶
Training data r-squared: 0.75
0.75 is a very high r-squared!
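The fit itself is only a few lines. Here's a self-contained sketch on synthetic stand-in data with a built-in linear relationship (the real notebook fits on the Boston `X_train`/`y_train`):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in with a known linear relationship plus noise
rng = np.random.default_rng(10)
rm = rng.normal(6.3, 0.7, 506)
lstat = rng.uniform(1.7, 38.0, 506)
price = 3.1 * rm - 0.58 * lstat + 15 + rng.normal(0, 2, 506)
features = pd.DataFrame({"RM": rm, "LSTAT": lstat})

X_train, X_test, y_train, y_test = train_test_split(
    features, price, test_size=0.2, random_state=10
)

# Fit the multivariable regression and report in-sample r-squared
regr = LinearRegression()
regr.fit(X_train, y_train)
rsquared = regr.score(X_train, y_train)
print(f"Training data r-squared: {rsquared:.2}")
```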
Coefficient Interpretation¶
| Coefficient | |
|---|---|
| CRIM | -0.13 |
| ZN | 0.06 |
| INDUS | -0.01 |
| CHAS | 1.97 |
| NOX | -16.27 |
| RM | 3.11 |
| AGE | 0.02 |
| DIS | -1.48 |
| RAD | 0.30 |
| TAX | -0.01 |
| PTRATIO | -0.82 |
| B | 0.01 |
| LSTAT | -0.58 |
The price premium for having an extra room is $3108.5
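The coefficient table above is just the fitted `coef_` array lined up with the feature names, and the room premium is the RM coefficient scaled from $1000s to dollars. A sketch on stand-in data (here the relationship is noise-free, so the recovered coefficients are exact):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical stand-in: price rises by 3.11 (i.e., $3,110) per extra room
rng = np.random.default_rng(10)
X_train = pd.DataFrame({"RM": rng.normal(6.3, 0.7, 400),
                        "LSTAT": rng.uniform(1.7, 38.0, 400)})
y_train = 3.11 * X_train["RM"] - 0.58 * X_train["LSTAT"] + 15

regr = LinearRegression().fit(X_train, y_train)

# Line up each coefficient with its feature name
coef_df = pd.DataFrame(data=regr.coef_, index=X_train.columns,
                       columns=["Coefficient"])
print(coef_df)

# Prices are in $1000s, so multiply the RM coefficient by 1000
premium = coef_df.loc["RM", "Coefficient"] * 1000
print(f"The price premium for having an extra room is ${premium:.1f}")
```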
Analyse the Estimated Values & Regression Residuals¶
The next step is to evaluate our regression. How good our regression is depends not only on the r-squared. It also depends on the residuals - the difference between the model's predictions ($\hat y_i$) and the true values ($y_i$) inside y_train.
```python
predicted_values = regr.predict(X_train)
residuals = (y_train - predicted_values)
```
Challenge: Create two scatter plots.
The first plot should be actual values (y_train) against the predicted values:
The cyan line in the middle shows y_train against y_train. If the predictions had been 100% accurate then all the dots would be on this line. The further away the dots are from the line, the worse the prediction was. That makes the distance to the cyan line, you guessed it, our residuals 😊
The second plot should be the residuals against the predicted prices. Here's what we're looking for:
Residual Diagnostics — Model v1¶
We see that the residuals have a skewness of 1.46. There could be some room for improvement here.
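The skew figure comes straight from pandas. A sketch using synthetic right-tailed residuals (a gamma sample shifted to mean zero, standing in for model v1's residuals):

```python
import numpy as np
import pandas as pd

# Hypothetical residuals with a long right tail, like model v1's
rng = np.random.default_rng(10)
residuals = pd.Series(rng.gamma(shape=2.0, scale=1.0, size=10_000) - 2.0)

# pandas computes the sample skewness directly; 0 would be symmetric
print(f"Residuals skew: {residuals.skew():.2f}")
print(f"Residuals mean: {residuals.mean():.2f}")
```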
Log Transformation of Target¶
The log prices have a skew that's closer to zero. This makes them a good candidate for use in our linear model. Perhaps using log prices will improve our regression's r-squared and our model's residuals.
How does the log transformation work?¶
Using a log transformation does not affect every price equally. Large prices are affected more than smaller prices in the dataset. Here's how the prices are "compressed" by the log transformation:
We can see this when we plot the actual prices against the (transformed) log prices.
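The compression is easy to verify with a few sample prices (in $1000s): after logging, the gap from $22.53k to $50k shrinks to roughly the same size as the gap from $5k to $10k.

```python
import numpy as np

prices = np.array([5.0, 10.0, 22.53, 50.0])  # in $1000s
log_prices = np.log(prices)

# Gaps between large prices shrink far more than gaps between small ones
print(np.round(np.diff(prices), 2))      # [ 5.   12.53 27.47]
print(np.round(np.diff(log_prices), 2))  # [0.69 0.81 0.8 ]
```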
Model v2 — Regression on Log Prices¶
Training data r-squared: 0.79
This time we got an r-squared of 0.79 compared to 0.75. This looks like a promising improvement.
Coefficient Interpretation — Log Model¶
| coef | |
|---|---|
| CRIM | -0.01 |
| ZN | 0.00 |
| INDUS | 0.00 |
| CHAS | 0.08 |
| NOX | -0.70 |
| RM | 0.07 |
| AGE | 0.00 |
| DIS | -0.05 |
| RAD | 0.01 |
| TAX | -0.00 |
| PTRATIO | -0.03 |
| B | 0.00 |
| LSTAT | -0.03 |
So how can we interpret the coefficients? The key thing we look for is still the sign: CHAS has a coefficient greater than zero, so being close to the river is associated with higher property prices.
More students per teacher - a higher PTRATIO - is a clear negative. Smaller class sizes are indicative of higher-quality education, so PTRATIO has a negative coefficient.
Residual Comparison — Raw vs Log Models¶
It's hard to see a difference here just by eye. The predicted values seem slightly closer to the cyan line, but eyeballing the charts is not terribly helpful in this case.
Residual Diagnostics — Log Model¶
Our new regression residuals have a skew of 0.09 compared to a skew of 1.46. The mean is still around 0. From both a residuals perspective and an r-squared perspective we have improved our model with the data transformation.
Out-of-Sample Performance¶
Original Model Test Data r-squared: 0.67
Log Model Test Data r-squared: 0.74
By definition, the model has not been optimised for the testing data. Therefore performance will be worse than on the training data. However, our r-squared still remains high, so we have built a useful model.
Predict a Property's Value using the Regression Coefficients¶
Our preferred model now has an equation that looks like this:
$$ \log(\widehat{PRICE}) = \theta_0 + \theta_1 RM + \theta_2 NOX + \theta_3 DIS + \theta_4 CHAS + \ldots + \theta_{13} LSTAT $$
The average property has the mean value for all its characteristics:
| CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3.61 | 11.36 | 11.14 | 0.07 | 0.55 | 6.28 | 68.57 | 3.80 | 9.55 | 408.24 | 18.46 | 356.67 | 12.65 |
Valuation — Average Property¶
The log price estimate is $3.03
The property is estimated to be worth $20703.2
A property with an average value for all the features has a value of $20,700.
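Reversing the transformation is the key step: the model predicts a log price, so we apply `np.exp()` before scaling from $1000s to dollars. A self-contained sketch on hypothetical stand-in data (noise-free here, so the prediction at the feature means exactly equals the mean log price):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical stand-in fit on log prices
rng = np.random.default_rng(10)
features = pd.DataFrame({"RM": rng.normal(6.3, 0.7, 506),
                         "LSTAT": rng.uniform(1.7, 38.0, 506)})
log_price = 0.07 * features["RM"] - 0.03 * features["LSTAT"] + 2.9

regr = LinearRegression().fit(features, log_price)

# Price an "average" property: every feature set to its mean value
average_vals = features.mean().to_frame().T
log_estimate = regr.predict(average_vals)[0]

# Undo the log transform, then scale from $1000s to dollars
dollar_estimate = np.exp(log_estimate) * 1000
print(f"The log price estimate is {log_estimate:.3}")
print(f"The property is estimated to be worth ${dollar_estimate:.6}")
```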
Valuation — Custom Property¶
The log price estimate is $3.25
The property is estimated to be worth $25792.0
Key Findings¶
The multivariable linear regression on 506 Boston census tracts (1970s data) reveals the following:
Model performance
- Raw model: training r² = 0.75, test r² = 0.67
- Log-price model: training r² = 0.79, test r² = 0.74
- Applying a log transformation to the price target reduced residual skew from 1.46 to 0.09
Feature effects (log model coefficients)
- Each additional room (RM) adds approximately $3,109 in property value (raw model coefficient: 3.11)
- River proximity (CHAS): positive coefficient (+0.08) — bordering the Charles River raises estimated price
- Pollution (NOX): strongest negative predictor (−0.70) — higher nitric oxide concentration strongly depresses prices
- Poverty (LSTAT): negative coefficient (−0.03) — higher proportion of lower-status population reduces prices
- Higher pupil-to-teacher ratio (PTRATIO): negative coefficient (−0.03) — larger class sizes correlate with lower prices
Valuation outputs
- Average property (all features at mean values): estimated at $20,703
- 8-room river-front property, low poverty, 5 miles from employment: estimated at $25,792
Dataset
- 506 observations, 13 features, no missing values, no duplicates
- Only 35 of 506 properties border the Charles River