Zillow Time-Series Model
New England zip codes and future predictions
I was asked to forecast real estate prices of various zip codes using data from the small Zillow dataset. I will be acting as a consultant for a fictional real-estate investment firm and need to build a time series model to justify my findings. The firm has asked me to determine:
Among New England's medium-sized villages, which zip code offers the best five-year return on a home purchase?
Import the necessary libraries:
# Below are the libraries I will use.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from matplotlib.pylab import rcParams
import itertools
from statsmodels.tsa.arima_model import ARMA
import statsmodels.api as sm
np.random.seed(123)
Load the data from the small Zillow data set:
df = pd.read_csv("zillow_data.csv")
df.head()
You can see that the data is in a 'wide' format with 272 columns, most of them holding the values at each monthly date. First, I want to rename the column "RegionName" to what it actually contains: "Zipcode". Then drop unnecessary columns.
# Change RegionName to Zipcode, drop unnecessary columns
df = df.rename(columns={'RegionName': 'Zipcode'})
df = df.drop(['RegionID', 'SizeRank'], axis = 1)
df.info()
Now I want to get the data down to only New England. It turns out that all the New England zip codes begin with '0', so once they are read in as integers they are at most four digits long.
# There should be six New England states!
NE_zips = df.loc[df['Zipcode'] < 9999]
NE_zips.State.unique()
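Because the zip codes were read in as integers, the leading zeros are gone. As a side note, here is a minimal sketch for restoring the familiar five-digit form for display (purely illustrative):
# Zero-pad the integer zip codes back to five digits for display
zip_labels = NE_zips['Zipcode'].astype(str).str.zfill(5)
zip_labels.head()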
# I viewed this map of the New England states.
from IPython.display import Image
from IPython.core.display import HTML
map_url = ("https://newengland.com/wp-content/uploads/new-england-states-map.jpg")
Image(url=map_url, width=500, height=400)
# Whoops, New Jersey is not a part of New England!
NE_zips = NE_zips[NE_zips['State'] != 'NJ'].copy()
print(NE_zips.State.unique())
NE_zips.info()
I still have a lot of filtering to do, so I will now use population to narrow down my list.
To stick to the business plan, I will cross-reference only the zip codes for villages of roughly 10,000 to 16,000 people. Anything smaller may be too small or too touristy, or may lack enough of a town center to provide necessities such as restaurants, groceries, and gas. Anything larger won't have the quaint New England feel, is likely too close to a large metro area, or may simply be a suburb.
# From "The US Decennial Census of Population and Housing"
population = pd.read_csv('population_by_zip_2010.csv')
population.head()
I need to pull out only the totals, which are mixed in with the age- and gender-breakdown rows, making it somewhat confusing to find the total population for each zip code. It turns out that the rows where 'minimum_age', 'maximum_age', and 'gender' are all NaN hold the total populations.
# Keep only the total-population rows (age and gender fields are all NaN)
totals = population[(population['minimum_age'].isnull()) &
                    (population['maximum_age'].isnull()) &
                    (population['gender'].isnull())]
Now that I have only the total population for each zip code I can simplify it down, first by keeping only ‘population’ and ‘zipcode’ columns and then by looking at only the New England zip codes.
# Simplify new data frame to get rid of everything else
totals = totals[['population', 'zipcode']]
# Reduce it down to just New England again (four-digit codes below 07000)
totals = totals.loc[(totals['zipcode'] > 999) & (totals['zipcode'] < 7000)]
Now I can merge this data frame with my original.
# Merge the population totals with my first NE dataframe
ne_pops = NE_zips.merge(totals, left_on='Zipcode', right_on='zipcode')
From here I can look at the demographics:
# Look at the demographics
ne_pops.population.describe()
And I can see that the population band I want to look at sits just above the median population for these zip codes.
ne_pops.population.hist()
plt.title('Populations Per Zip Code')
plt.xlabel('Populations')
plt.ylabel('Number of Zip Codes')
plt.show()
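To put a number on that claim, here is a quick check (not in the original notebook) of the median and the share of zip codes in the target band:
# Median population and the share of zip codes between 10,000 and 16,000
print(ne_pops.population.median())
in_band = ne_pops.population.between(10000, 16000)
print('{:.1%} of NE zip codes fall in the target band'.format(in_band.mean()))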
To narrow down my search, I had to spend quite a bit of time cross referencing villages around New England.
# Villages with populations between 10,000 and 16,000
ne_villages = ne_pops.loc[(ne_pops['population'] > 10000) & (ne_pops['population'] < 16000)]
ne_villages.head()
To view where these villages are located, I decided to cross reference them on a map. To do that, I needed to get the latitude and longitude from another dataset.
# https://www.unitedstateszipcodes.org/zip-code-database/
latlong_codes = pd.read_csv("zip_code_database.csv")
latlong_codes.head()
# I just want the zipcodes and the lat/lon
latlong_codes = latlong_codes.filter(['zip', 'latitude', 'longitude'], axis=1)
# Get rid of short codes and anything not in New England
ne_codes = latlong_codes[(latlong_codes['zip'] > 999) & (latlong_codes['zip'] < 7000)]
Merge the lat/lon data frame with my New England villages.
# Merge the lat and lon with my first NE dataframe
ne_village_map = ne_villages.merge(ne_codes, left_on='Zipcode', right_on='zip')
Then put them on a map:
# Find out where these zip codes are located
import folium
NE = folium.Map(location=[42.9956, -71.4548], zoom_start=7, width='50%')
for lat, lon in zip(ne_village_map['latitude'], ne_village_map['longitude']):
    folium.CircleMarker([lat, lon], radius=3).add_to(NE)
NE
Now I want to look at the percentage change in value over one, three, and five years for each zip code; a sketch of how the summary frame might be built follows.
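The notebook does not show how ne_vil_averages was built, so here is a minimal sketch of what it presumably contains; the month-column names ('2018-04', '2017-04', and so on) are assumptions about the Zillow file's layout:
# Hypothetical reconstruction of ne_vil_averages: percent change in value
# over the last 1, 3, and 5 years of a series assumed to end at '2018-04'
ne_vil_averages = ne_villages[['Zipcode']].copy()
for label, month in [('1_year', '2017-04'), ('3_year', '2015-04'), ('5_year', '2013-04')]:
    ne_vil_averages[label] = ((ne_villages['2018-04'] - ne_villages[month])
                              / ne_villages[month] * 100)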
# Overlaid bar graph showing total '5 year', '3 year', and '1 year' percentage changes
ax = ne_vil_averages.plot(x="Zipcode", y='5_year', kind="bar", figsize=(16,6))
ne_vil_averages.plot(x="Zipcode", y='3_year', kind="bar", ax=ax, color="C2")
ne_vil_averages.plot(x="Zipcode", y='1_year', kind="bar", ax=ax, color="C3")
plt.title('Total Percent of Change Per Zipcode', size=24)
plt.legend(loc='upper left', title='Period', title_fontsize=18, fontsize=15)
plt.xlabel('Zipcodes', size=20)
plt.xticks(rotation='horizontal', size=14)
plt.ylabel('Percentage', size=20)
From here I am going to take the six zip codes with the best recent percent change and put them into a new data frame:
# Keep only these six zip codes in a new df
final_six = ne_villages.loc[ne_villages['Zipcode'].isin(
    [1082, 2882, 3755, 4105, 5602, 6382])].sort_values(by='Zipcode')
final_six
I am going to create a function to change from the wide format to the long format.
# Melting the data from wide to long format, indexed by month
def melt_data(df):
    melted = pd.melt(df, id_vars=['Zipcode', 'City', 'State', 'Metro',
                                  'CountyName', 'population'], var_name='time')
    melted['time'] = pd.to_datetime(melted['time'], infer_datetime_format=True)
    melted = melted.dropna(subset=['value'])
    # Group by time, resample to month-start frequency, and backfill NaN's
    melted = melted.groupby('time').aggregate({'value': 'mean'})
    melted = melted['value'].resample('MS').mean()
    melted = melted.fillna(melted.bfill())
    return melted
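As an example of how I call it, one zip code at a time (narragansett is just my illustrative name for the 02882 series; the duplicate lowercase 'zipcode' column from the population merge is dropped so only the monthly date columns get melted):
# Example: melt one zip code's rows into a monthly series
narragansett = melt_data(final_six[final_six['Zipcode'] == 2882]
                         .drop(columns='zipcode'))
narragansett.tail()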
Next I have to check the correlograms in order to examine the p, d, and q orders.
# Look at the correlograms
def corr_graphs(melted):
    model = ' '
    color = 'g'
    fig, ax = plt.subplots(figsize=(12, 3))
    plot_acf(melted, ax=ax, lags=22, color=color,
             title=model + ' Autocorrelation');
    fig, ax = plt.subplots(figsize=(12, 3))
    plot_pacf(melted, ax=ax, lags=22, color=color,
              title=model + ' Partial Autocorrelation');
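For example, using the melted series from above:
# Inspect the ACF/PACF for the 02882 series
corr_graphs(narragansett)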
Run the model
# This is the function that will choose the best 'pdq' and run the model
import warnings
warnings.filterwarnings('ignore')

def run_model(melted):
    # Define the p, d and q parameters to take any value between 0 and 2
    p = d = q = range(0, 2)
    # Generate all different combinations of p, d and q triplets
    pdq = list(itertools.product(p, d, q))
    # Generate all different combinations of seasonal p, d and q triplets
    pdqs = [(x[0], x[1], x[2], 12) for x in list(itertools.product(p, d, q))]
    # Run a grid search over pdq and seasonal pdqs and record each AIC value
    ans = []
    for comb in pdq:
        for combs in pdqs:
            try:
                mod = sm.tsa.statespace.SARIMAX(melted, order=comb,
                                                seasonal_order=combs,
                                                enforce_stationarity=False,
                                                enforce_invertibility=False)
                output = mod.fit()
                ans.append([comb, combs, output.aic])
            except:
                continue
    # Find the parameters with the minimal AIC value
    ans_df = pd.DataFrame(ans, columns=['pdq', 'pdqs', 'aic'])
    print("Minimum AIC Value and Optimal Params ", '\n',
          ans_df.loc[ans_df['aic'].idxmin()], '\n')
    # Plug the optimal parameter values into a new SARIMAX model
    SARIMA_MODEL = sm.tsa.statespace.SARIMAX(melted,
                                             order=ans_df.loc[ans_df['aic'].idxmin()]['pdq'],
                                             seasonal_order=ans_df.loc[ans_df['aic'].idxmin()]['pdqs'],
                                             enforce_stationarity=False,
                                             enforce_invertibility=False)
    # Fit the model and print results
    output = SARIMA_MODEL.fit()
    print("SARIMA Model ", '\n', output.summary().tables[1])
    # Call plot_diagnostics() on the results calculated above
    output.plot_diagnostics(figsize=(12, 6))
    plt.show()
    return output
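Fitting the model for the example series from above:
# Grid-search the (p, d, q) orders and fit a SARIMAX model for 02882
output = run_model(narragansett)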
And view the zip code with the best result
# This function will get the confidence interval for the predictions and graphs
def get_predictions(melted, output):
    # Get predictions starting from 2016-04-01 and calculate confidence intervals
    pred = output.get_prediction(start=pd.to_datetime('2016-04-01'), dynamic=False)
    pred_conf = pred.conf_int()
    # Plot real vs. predicted values along with the confidence interval
    rcParams['figure.figsize'] = (12, 3)
    # Plot observed values
    ax = melted['2010':].plot(label='observed')
    # Plot predicted values
    pred.predicted_mean.plot(ax=ax, label='One-step ahead Forecast', alpha=0.9)
    # Plot the range for confidence intervals
    ax.fill_between(pred_conf.index,
                    pred_conf.iloc[:, 0],
                    pred_conf.iloc[:, 1], color='g', alpha=0.5)
    # Set axes labels
    ax.set_title('One-Step Forecast')
    ax.set_xlabel('Date')
    ax.set_ylabel('Values')
    plt.legend()
    plt.show()
    # Get the forecast for the future (60 steps = 5 years)
    prediction = output.get_forecast(steps=60)
    # Get confidence intervals of the forecasts
    pred_conf = prediction.conf_int()
    # Plot future predictions with confidence intervals
    ax = melted.plot(label='observed', figsize=(12, 3))
    prediction.predicted_mean.plot(ax=ax, label='Forecast')
    ax.fill_between(pred_conf.index,
                    pred_conf.iloc[:, 0],
                    pred_conf.iloc[:, 1], color='y', alpha=0.25)
    ax.set_title('Future Predictions', size=18)
    ax.set_xlabel('Date', size=16)
    ax.set_ylabel('Values', size=16)
    plt.legend(loc='upper left')
    plt.show()
    # Get the real and predicted values
    melted_forecasted = pred.predicted_mean
    melted_truth = melted['2016-04-01':]
    # Compute the mean squared error
    mse = ((melted_forecasted - melted_truth) ** 2).mean()
    print('The Mean Squared Error of our forecasts is {}'.format(round(mse, 2)), '\n')
    # Compute the root mean squared error
    rmse = np.sqrt(mse)
    print('The RMSE of our forecasts is {}'.format(round(rmse, 2)), '\n')
    # Compute the normalized RMSE using the max/min of the forecast interval
    nrmse = rmse * 100 / (pred_conf['upper value'].max() - pred_conf['lower value'].min())
    print('The Max/Min Normalized RMSE is {}'.format(round(nrmse, 2)), '\n')
    # Compute the mean absolute percentage error
    mape = np.mean(np.abs((melted_truth - melted_forecasted) / melted_truth))
    print('The Mean Absolute Percentage Error is {}%'.format(round(mape * 100, 2)), '\n')
    # Compute the mean of the interval bounds as a point estimate
    pred_conf['mean value'] = (pred_conf['lower value'] + pred_conf['upper value']) / 2
    # Add helpful data for analysis
    print('The Current Value is: $', melted.iloc[-1], '\n')
    print('The Five Year predictions values are: ', '\n',
          round(pred_conf.tail(1), 2), '\n')
    # Plot just the prediction intervals
    pred_conf.plot(figsize=(10, 2))
    plt.title('Predicted Forecast', size=18)
    plt.legend(bbox_to_anchor=(1, 1), loc='upper left', ncol=1)
    plt.xlabel('Years', size=15)
    plt.xticks(rotation='horizontal')
    plt.ylabel('Value', size=15)
    plt.show()
    # Average change in predicted value over the forecast horizon
    start_pred = pred_conf['mean value'].iloc[0]
    end_pred = pred_conf['mean value'].iloc[-1]
    change_pred = (end_pred - start_pred) / start_pred
    print('The Five Year average change in value:', round(change_pred, 4) * 100, '%')
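And running it for the example series:
# Forecast five years ahead for 02882 and report the error metrics
get_predictions(narragansett, output)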
The predictions were:
The Five Year predictions values are:
             lower value  upper value  mean value
2023-04-01     355691.02    865677.64   610684.33

The RMSE of our forecasts is 1189.54
This is the graph for the five year predictions for my top zip code.
Using the mean of the prediction interval at the start and end of the five-year horizon, I determined the percent change in value.
The Current Value is: $ 447900.00
The Five Year average change in value: 34.93 %
The Rhode Island zip code of 02882 showed the best change in mean predicted value at 34.93%, even though it had one of the larger RMSE scores at 1189.54. Considering that house prices in this region were nearly double those in some of the other regions, that RMSE is still small relative to the price level, and the prediction remains the best of the group.
For investors looking to buy a property in a quaint New England village, where the towns are neither too small nor suburbs of a sprawling capital or large city, I have narrowed the field of possibilities and created a model that can forecast, with limited certainty, the region most likely to show growth.
Using the predicted results from my SARIMAX time series model and the calculated RMSE scores, I would recommend the Rhode Island zip code of 02882 as the region with the best prospect for investment growth over the next five years. While there certainly may be other zip codes around the country, and even in New England, with better growth, only towns with populations of roughly 10,000 to 16,000 were considered.