Setup and Context¶
Introduction¶
Welcome to Boston, Massachusetts, in the 1970s! Imagine you're working for a real estate development company. Your company wants to value any residential project before construction starts. You are tasked with building a model that can provide a price estimate based on a home's characteristics, such as:
- The number of rooms
- The distance to employment centres
- How rich or poor the area is
- How many students there are per teacher in local schools, etc.
To accomplish your task you will:
- Analyse and explore the Boston house price data
- Split your data for training and testing
- Run a Multivariable Regression
- Evaluate your model's coefficients and residuals
- Use data transformation to improve your model's performance
- Use your model to estimate a property price
Upgrade plotly (only Google Colab Notebook)¶
Google Colab may not be running the latest version of plotly. If you're working in Google Colab, uncomment the line below, run the cell, and restart your notebook server.
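The cell in question looks something like this (the `%pip` magic is the notebook-friendly way to upgrade a package):

```python
# In Google Colab only: uncomment the next line, run the cell,
# then restart the runtime before continuing.
# %pip install --upgrade plotly
```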
Import Statements¶
Notebook Presentation¶
Load the Data¶
The first column in the .csv file just has the row numbers, so it will be used as the index.
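Here's a sketch of how `index_col` handles that first column, using a tiny inline stand-in for the real file (the actual filename isn't shown here, and the values below are hypothetical):

```python
import pandas as pd
from io import StringIO

# Two-row stand-in for the real .csv file: the first, unnamed column
# holds the row numbers.
csv_text = ",CRIM,ZN,PRICE\n0,0.01,18.0,24.0\n1,0.03,0.0,21.6\n"

# index_col=0 tells pandas to use that first column as the index
# rather than treating it as a feature.
data = pd.read_csv(StringIO(csv_text), index_col=0)
print(data.columns.tolist())  # the row-number column is not listed
```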
Understand the Boston House Price Dataset¶
Characteristics:
:Number of Instances: 506
:Number of Attributes: 13 numeric/categorical predictive. The Median Value (attribute 14) is the target.
:Attribute Information (in order):
1. CRIM per capita crime rate by town
2. ZN proportion of residential land zoned for lots over 25,000 sq.ft.
3. INDUS proportion of non-retail business acres per town
4. CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
5. NOX nitric oxides concentration (parts per 10 million)
6. RM average number of rooms per dwelling
7. AGE proportion of owner-occupied units built prior to 1940
8. DIS weighted distances to five Boston employment centres
9. RAD index of accessibility to radial highways
10. TAX full-value property-tax rate per $10,000
11. PTRATIO pupil-teacher ratio by town
12. B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
13. LSTAT % lower status of the population
14. PRICE Median value of owner-occupied homes in $1000's
:Missing Attribute Values: None
:Creator: Harrison, D. and Rubinfeld, D.L.
This is a copy of the UCI ML housing dataset, taken from the StatLib library maintained at Carnegie Mellon University. You can find the original research paper here.
Preliminary Data Exploration¶
(506, 14)
Index(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX',
'PTRATIO', 'B', 'LSTAT', 'PRICE'],
dtype='object')
| CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | PRICE | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.01 | 18.00 | 2.31 | 0.00 | 0.54 | 6.58 | 65.20 | 4.09 | 1.00 | 296.00 | 15.30 | 396.90 | 4.98 | 24.00 |
| 1 | 0.03 | 0.00 | 7.07 | 0.00 | 0.47 | 6.42 | 78.90 | 4.97 | 2.00 | 242.00 | 17.80 | 396.90 | 9.14 | 21.60 |
| 2 | 0.03 | 0.00 | 7.07 | 0.00 | 0.47 | 7.18 | 61.10 | 4.97 | 2.00 | 242.00 | 17.80 | 392.83 | 4.03 | 34.70 |
| 3 | 0.03 | 0.00 | 2.18 | 0.00 | 0.46 | 7.00 | 45.80 | 6.06 | 3.00 | 222.00 | 18.70 | 394.63 | 2.94 | 33.40 |
| 4 | 0.07 | 0.00 | 2.18 | 0.00 | 0.46 | 7.15 | 54.20 | 6.06 | 3.00 | 222.00 | 18.70 | 396.90 | 5.33 | 36.20 |
| CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | PRICE | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 501 | 0.06 | 0.00 | 11.93 | 0.00 | 0.57 | 6.59 | 69.10 | 2.48 | 1.00 | 273.00 | 21.00 | 391.99 | 9.67 | 22.40 |
| 502 | 0.05 | 0.00 | 11.93 | 0.00 | 0.57 | 6.12 | 76.70 | 2.29 | 1.00 | 273.00 | 21.00 | 396.90 | 9.08 | 20.60 |
| 503 | 0.06 | 0.00 | 11.93 | 0.00 | 0.57 | 6.98 | 91.00 | 2.17 | 1.00 | 273.00 | 21.00 | 396.90 | 5.64 | 23.90 |
| 504 | 0.11 | 0.00 | 11.93 | 0.00 | 0.57 | 6.79 | 89.30 | 2.39 | 1.00 | 273.00 | 21.00 | 393.45 | 6.48 | 22.00 |
| 505 | 0.05 | 0.00 | 11.93 | 0.00 | 0.57 | 6.03 | 80.80 | 2.50 | 1.00 | 273.00 | 21.00 | 396.90 | 7.88 | 11.90 |
CRIM       506
ZN         506
INDUS      506
CHAS       506
NOX        506
RM         506
AGE        506
DIS        506
RAD        506
TAX        506
PTRATIO    506
B          506
LSTAT      506
PRICE      506
dtype: int64
Data Cleaning - Check for Missing Values and Duplicates¶
<class 'pandas.core.frame.DataFrame'>
Index: 506 entries, 0 to 505
Data columns (total 14 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   CRIM     506 non-null    float64
 1   ZN       506 non-null    float64
 2   INDUS    506 non-null    float64
 3   CHAS     506 non-null    float64
 4   NOX      506 non-null    float64
 5   RM       506 non-null    float64
 6   AGE      506 non-null    float64
 7   DIS      506 non-null    float64
 8   RAD      506 non-null    float64
 9   TAX      506 non-null    float64
 10  PTRATIO  506 non-null    float64
 11  B        506 non-null    float64
 12  LSTAT    506 non-null    float64
 13  PRICE    506 non-null    float64
dtypes: float64(14)
memory usage: 59.3 KB
Any NaN values? False
Any duplicates? False
There are no null (i.e., NaN) values. Fantastic!
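Both checks are one-liners in pandas. A self-contained sketch on a small stand-in frame (hypothetical values):

```python
import pandas as pd

# Small stand-in for the Boston data, just to illustrate the checks
data = pd.DataFrame({"RM": [6.58, 6.42, 7.18],
                     "PRICE": [24.0, 21.6, 34.7]})

# .isna() flags missing cells; .duplicated() flags repeated rows
print(f"Any NaN values? {data.isna().values.any()}")
print(f"Any duplicates? {data.duplicated().values.any()}")
```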
Descriptive Statistics¶
| CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | PRICE | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 506.00 | 506.00 | 506.00 | 506.00 | 506.00 | 506.00 | 506.00 | 506.00 | 506.00 | 506.00 | 506.00 | 506.00 | 506.00 | 506.00 |
| mean | 3.61 | 11.36 | 11.14 | 0.07 | 0.55 | 6.28 | 68.57 | 3.80 | 9.55 | 408.24 | 18.46 | 356.67 | 12.65 | 22.53 |
| std | 8.60 | 23.32 | 6.86 | 0.25 | 0.12 | 0.70 | 28.15 | 2.11 | 8.71 | 168.54 | 2.16 | 91.29 | 7.14 | 9.20 |
| min | 0.01 | 0.00 | 0.46 | 0.00 | 0.39 | 3.56 | 2.90 | 1.13 | 1.00 | 187.00 | 12.60 | 0.32 | 1.73 | 5.00 |
| 25% | 0.08 | 0.00 | 5.19 | 0.00 | 0.45 | 5.89 | 45.02 | 2.10 | 4.00 | 279.00 | 17.40 | 375.38 | 6.95 | 17.02 |
| 50% | 0.26 | 0.00 | 9.69 | 0.00 | 0.54 | 6.21 | 77.50 | 3.21 | 5.00 | 330.00 | 19.05 | 391.44 | 11.36 | 21.20 |
| 75% | 3.68 | 12.50 | 18.10 | 0.00 | 0.62 | 6.62 | 94.07 | 5.19 | 24.00 | 666.00 | 20.20 | 396.23 | 16.96 | 25.00 |
| max | 88.98 | 100.00 | 27.74 | 1.00 | 0.87 | 8.78 | 100.00 | 12.13 | 24.00 | 711.00 | 22.00 | 396.90 | 37.97 | 50.00 |
CHAS shows whether the home is next to the Charles River or not. As such, it only has the value 0 or 1. This kind of feature is also known as a dummy variable.
The average price of a Boston home in the 1970s was 22.53 or $22,530. We've experienced a lot of inflation and house price appreciation since then!
Feature Distributions¶
House Prices 💰¶
Note there is a spike in the number of homes at the very right tail, at the $50,000 mark. 🤔
Distance to Employment - Length of Commute 🚗¶
Most homes are about 3.8 miles away from work. There are fewer and fewer homes the further out we go.
Number of Rooms¶
Access to Highways 🛣¶
RAD is an index of accessibility to roads. Better access to a highway is represented by a higher number. There's a big gap in the values of the index.
Next to the River? ⛵️¶
Challenge
Create a bar chart with plotly for CHAS to show how many more homes are away from the river versus next to it. The bar chart should look something like this:
You can make your life easier by providing a list of values for the x-axis (e.g., x=['No', 'Yes'])
We see that out of the total number of 506 homes, only 35 are located next to the Charles River.
Understand the Relationships in the Data¶
Pairwise Relationships¶
We see that we get back a grid. You might have to zoom in or squint a bit, but there are scatterplots between all the columns in our dataset. And down the diagonal in the middle, we get histograms for all our columns.
Detailed Pairwise Analysis¶
Distance from Employment vs. Pollution¶
Challenge:
Compare DIS (Distance from employment) with NOX (Nitric Oxide Pollution) using Seaborn's .jointplot(). Does pollution go up or down as the distance increases?
We see that pollution goes down as we go further and further out of town. This makes intuitive sense. However, even at the same distance of 2 miles to employment centres, we can get very different levels of pollution. By the same token, DIS of 9 miles and 12 miles have very similar levels of pollution.
Proportion of Non-Retail Industry 🏭🏭🏭 versus Pollution¶
Challenge:
Compare INDUS (the proportion of non-retail industry i.e., factories) with NOX (Nitric Oxide Pollution) using Seaborn's .jointplot(). Does pollution go up or down as there is a higher proportion of industry?
% of Lower Income Population vs Average Number of Rooms¶
Challenge
Compare LSTAT (proportion of lower-income population) with RM (number of rooms) using Seaborn's .jointplot(). How does the number of rooms per dwelling vary with the poverty of area? Do homes have more or fewer rooms when LSTAT is low?
In the top left corner we see that for all the homes with 8 or more rooms, LSTAT is well below 10%.
% of Lower Income Population versus Home Price¶
Challenge
Compare LSTAT with PRICE using Seaborn's .jointplot(). How does the proportion of the lower-income population in an area affect home prices?
Number of Rooms versus Home Value¶
Challenge
Compare RM (number of rooms) with PRICE using Seaborn's .jointplot(). You can probably guess how the number of rooms affects home prices. 😊
Again, we see those homes at the $50,000 mark all lined up at the top of the chart. Perhaps there was some sort of cap or maximum value imposed during data collection.
Split Training & Test Dataset¶
We can't use all 506 entries in our dataset to train our model. The reason is that we want to evaluate our model on data that it hasn't seen yet (i.e., out-of-sample data). That way we can get a better idea of its performance in the real world.
Challenge
- Import the `train_test_split()` function from sklearn
- Create 4 subsets: X_train, X_test, y_train, y_test
- Split the training and testing data roughly 80/20
- To get the same random split every time you run your notebook, use `random_state=10`. This helps us get the same results every time and avoids confusion while we're learning.
Hint: Remember, your target is your home PRICE, and your features are all the other columns you'll use to predict the price.
Training data is 79.8% of the total data. Test data makes up the remaining 20.2%.
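A self-contained sketch of the split, using a synthetic 506-row stand-in for the Boston data (hypothetical values; the real notebook splits the loaded DataFrame):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic 506-row stand-in for the Boston data
rng = np.random.default_rng(10)
data = pd.DataFrame({
    "RM": rng.normal(6.3, 0.7, 506),
    "LSTAT": rng.uniform(1.7, 38.0, 506),
    "PRICE": rng.normal(22.5, 9.2, 506),
})

target = data["PRICE"]                     # what we predict
features = data.drop("PRICE", axis=1)      # everything else

# test_size=0.2 reserves ~20% of rows for out-of-sample evaluation
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.2, random_state=10
)
print(f"Training data is {len(X_train) / len(features):.1%} of the total.")
```

Because sklearn rounds the test set up to a whole number of rows (102 of 506), the split comes out at 79.8%/20.2% rather than exactly 80/20.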
Multivariable Regression¶
In a previous lesson, we had a linear model with only a single feature (our movie budgets). This time we have a total of 13 features. Therefore, our Linear Regression model will have the following form:
$$ \widehat{PRICE} = \theta_0 + \theta_1 RM + \theta_2 NOX + \theta_3 DIS + \theta_4 CHAS + \ldots + \theta_{13} LSTAT $$
Model v1 — Linear Regression on Raw Prices¶
Training data r-squared: 0.75
0.75 is a very high r-squared!
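The fit itself is only a few lines. Here's a self-contained sketch on synthetic stand-in data with a built-in linear relationship (the real notebook fits on the Boston `X_train`/`y_train`):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in with a known linear relationship plus noise
rng = np.random.default_rng(10)
rm = rng.normal(6.3, 0.7, 506)
lstat = rng.uniform(1.7, 38.0, 506)
price = 3.1 * rm - 0.58 * lstat + 15 + rng.normal(0, 2, 506)
features = pd.DataFrame({"RM": rm, "LSTAT": lstat})

X_train, X_test, y_train, y_test = train_test_split(
    features, price, test_size=0.2, random_state=10
)

# Fit the multivariable regression and report in-sample r-squared
regr = LinearRegression()
regr.fit(X_train, y_train)
rsquared = regr.score(X_train, y_train)
print(f"Training data r-squared: {rsquared:.2}")
```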
Coefficient Interpretation¶
| Coefficient | |
|---|---|
| CRIM | -0.13 |
| ZN | 0.06 |
| INDUS | -0.01 |
| CHAS | 1.97 |
| NOX | -16.27 |
| RM | 3.11 |
| AGE | 0.02 |
| DIS | -1.48 |
| RAD | 0.30 |
| TAX | -0.01 |
| PTRATIO | -0.82 |
| B | 0.01 |
| LSTAT | -0.58 |
The price premium for having an extra room is $3108.5
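The coefficient table above is just the fitted `coef_` array lined up with the feature names, and the room premium is the RM coefficient scaled from $1000s to dollars. A sketch on stand-in data (here the relationship is noise-free, so the recovered coefficients are exact):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical stand-in: price rises by 3.11 (i.e., $3,110) per extra room
rng = np.random.default_rng(10)
X_train = pd.DataFrame({"RM": rng.normal(6.3, 0.7, 400),
                        "LSTAT": rng.uniform(1.7, 38.0, 400)})
y_train = 3.11 * X_train["RM"] - 0.58 * X_train["LSTAT"] + 15

regr = LinearRegression().fit(X_train, y_train)

# Line up each coefficient with its feature name
coef_df = pd.DataFrame(data=regr.coef_, index=X_train.columns,
                       columns=["Coefficient"])
print(coef_df)

# Prices are in $1000s, so multiply the RM coefficient by 1000
premium = coef_df.loc["RM", "Coefficient"] * 1000
print(f"The price premium for having an extra room is ${premium:.1f}")
```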
Analyse the Estimated Values & Regression Residuals¶
The next step is to evaluate our regression. How good our regression is depends not only on the r-squared. It also depends on the residuals - the difference between the model's predictions ($\hat y_i$) and the true values ($y_i$) inside y_train.
```python
predicted_values = regr.predict(X_train)
residuals = (y_train - predicted_values)
```
Challenge: Create two scatter plots.
The first plot should be actual values (y_train) against the predicted values:
The cyan line in the middle shows y_train against y_train. If the predictions had been 100% accurate then all the dots would be on this line. The further away the dots are from the line, the worse the prediction was. That makes the distance to the cyan line, you guessed it, our residuals 😊
The second plot should be the residuals against the predicted prices. Here's what we're looking for:
Residual Diagnostics — Model v1¶
We see that the residuals have a skewness of 1.46. There could be some room for improvement here.
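The skew figure comes straight from pandas. A sketch using synthetic right-tailed residuals (a gamma sample shifted to mean zero, standing in for model v1's residuals):

```python
import numpy as np
import pandas as pd

# Hypothetical residuals with a long right tail, like model v1's
rng = np.random.default_rng(10)
residuals = pd.Series(rng.gamma(shape=2.0, scale=1.0, size=10_000) - 2.0)

# pandas computes the sample skewness directly; 0 would be symmetric
print(f"Residuals skew: {residuals.skew():.2f}")
print(f"Residuals mean: {residuals.mean():.2f}")
```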
Log Transformation of Target¶
The log prices have a skew that's closer to zero. This makes them a good candidate for use in our linear model. Perhaps using log prices will improve our regression's r-squared and our model's residuals.
How does the log transformation work?¶
Using a log transformation does not affect every price equally. Large prices are affected more than smaller prices in the dataset. Here's how the prices are "compressed" by the log transformation:
We can see this when we plot the actual prices against the (transformed) log prices.
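The compression is easy to verify with a few sample prices (in $1000s): after logging, the gap from $22.53k to $50k shrinks to roughly the same size as the gap from $5k to $10k.

```python
import numpy as np

prices = np.array([5.0, 10.0, 22.53, 50.0])  # in $1000s
log_prices = np.log(prices)

# Gaps between large prices shrink far more than gaps between small ones
print(np.round(np.diff(prices), 2))      # [ 5.   12.53 27.47]
print(np.round(np.diff(log_prices), 2))  # [0.69 0.81 0.8 ]
```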
Model v2 — Regression on Log Prices¶
Training data r-squared: 0.79
This time we got an r-squared of 0.79 compared to 0.75. This looks like a promising improvement.
Coefficient Interpretation — Log Model¶
| coef | |
|---|---|
| CRIM | -0.01 |
| ZN | 0.00 |
| INDUS | 0.00 |
| CHAS | 0.08 |
| NOX | -0.70 |
| RM | 0.07 |
| AGE | 0.00 |
| DIS | -0.05 |
| RAD | 0.01 |
| TAX | -0.00 |
| PTRATIO | -0.03 |
| B | 0.00 |
| LSTAT | -0.03 |
So how can we interpret the coefficients? The key thing we look for is still the sign: CHAS has a coefficient greater than zero, so being close to the river is associated with higher property prices.
More students per teacher - a higher PTRATIO - is a clear negative. Smaller class sizes are indicative of higher-quality education, so PTRATIO has a negative coefficient.
Residual Comparison — Raw vs Log Models¶
It's hard to see a difference here just by eye. The predicted values seem slightly closer to the cyan line, but eyeballing the charts is not terribly helpful in this case.
Residual Diagnostics — Log Model¶
Our new regression residuals have a skew of 0.09 compared to a skew of 1.46. The mean is still around 0. From both a residuals perspective and an r-squared perspective we have improved our model with the data transformation.
Out-of-Sample Performance¶
Original Model Test Data r-squared: 0.67
Log Model Test Data r-squared: 0.74
By definition, the model has not been optimised for the testing data. Therefore performance will be worse than on the training data. However, our r-squared still remains high, so we have built a useful model.
Predict a Property's Value using the Regression Coefficients¶
Our preferred model now has an equation that looks like this:
$$ \log(\widehat{PRICE}) = \theta_0 + \theta_1 RM + \theta_2 NOX + \theta_3 DIS + \theta_4 CHAS + \ldots + \theta_{13} LSTAT $$
The average property has the mean value for all its characteristics:
| CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3.61 | 11.36 | 11.14 | 0.07 | 0.55 | 6.28 | 68.57 | 3.80 | 9.55 | 408.24 | 18.46 | 356.67 | 12.65 |
Valuation — Average Property¶
The log price estimate is $3.03
The property is estimated to be worth $20703.2
A property with an average value for all the features has a value of $20,700.
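Reversing the transformation is the key step: the model predicts a log price, so we apply `np.exp()` before scaling from $1000s to dollars. A self-contained sketch on hypothetical stand-in data (noise-free here, so the prediction at the feature means exactly equals the mean log price):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical stand-in fit on log prices
rng = np.random.default_rng(10)
features = pd.DataFrame({"RM": rng.normal(6.3, 0.7, 506),
                         "LSTAT": rng.uniform(1.7, 38.0, 506)})
log_price = 0.07 * features["RM"] - 0.03 * features["LSTAT"] + 2.9

regr = LinearRegression().fit(features, log_price)

# Price an "average" property: every feature set to its mean value
average_vals = features.mean().to_frame().T
log_estimate = regr.predict(average_vals)[0]

# Undo the log transform, then scale from $1000s to dollars
dollar_estimate = np.exp(log_estimate) * 1000
print(f"The log price estimate is {log_estimate:.3}")
print(f"The property is estimated to be worth ${dollar_estimate:.6}")
```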
Valuation — Custom Property¶
The log price estimate is $3.25
The property is estimated to be worth $25792.0
Key Findings¶
The multivariable linear regression on 506 Boston census tracts (1970s data) reveals the following:
Model performance
- Raw model: training r² = 0.75, test r² = 0.67
- Log-price model: training r² = 0.79, test r² = 0.74
- Applying a log transformation to the price target reduced residual skew from 1.46 to 0.09
Feature effects (log model coefficients)
- Each additional room (RM) adds approximately $3,109 in property value (raw model coefficient: 3.11)
- River proximity (CHAS): positive coefficient (+0.08) — bordering the Charles River raises estimated price
- Pollution (NOX): strongest negative predictor (−0.70) — higher nitric oxide concentration strongly depresses prices
- Poverty (LSTAT): negative coefficient (−0.03) — higher proportion of lower-status population reduces prices
- Higher pupil-to-teacher ratio (PTRATIO): negative coefficient (−0.03) — larger class sizes correlate with lower prices
Valuation outputs
- Average property (all features at mean values): estimated at $20,703
- 8-room river-front property, low poverty, 5 miles from employment: estimated at $25,792
Dataset
- 506 observations, 13 features, no missing values, no duplicates
- Only 35 of 506 properties border the Charles River