# Zillow Time-Series Model

## New England zip codes and future predictions

I was asked to forecast real estate prices of various zip codes using data from the small Zillow dataset. I will be acting as a consultant for a fictional real-estate investment firm and need to build a time series model to justify my findings. The firm has asked me to determine:

*What is the best zip code of New England’s medium sized villages to purchase a home for the best five year return?*

Import the necessary libraries:

# Below are the libraries I will use.

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

%matplotlib inlinefrom statsmodels.graphics.tsaplots import plot_acf, plot_pacf

from matplotlib.pylab import rcParamsimport itertools

from statsmodels.tsa.arima_model import ARMA

import statsmodels.api as smnp.random.seed(123)

Load the data from the small Zillow data set:

`df = pd.read_csv("zillow_data.csv")`

df.head()

You can see that the data is in a ‘long’ format with 272 columns, most of them for the values at each monthly date. First, I want to change the column “RegionName” to what they actually are, “Zipcodes”! Then drop unnecessary columns.

`# Change RegionName to Zipcode, drop unnecessary columns`

df = df.rename(columns={'RegionName': 'Zipcode'})

df = df.drop(['RegionID', 'SizeRank'], axis = 1)

df.info()

Now I want to get the data down to only New England. Turns out that all the New England states have zip codes that begin with ‘0’ — making them only four digits long.

`# There should be six New England states!`

NE_zips = df.loc[df['Zipcode'] < 9999]

NE_zips.State.unique()

`# I viewed this map of the New England states.`

from IPython.display import Image

from IPython.core.display import HTML

basin_url = ("https://newengland.com/wp-content/uploads/new-england-states-map.jpg")

Image(url=basin_url, width=500, height=400)

`# Whooops, New Jersey is not a part of New England!`

NE_zips.drop(NE_zips[NE_zips['State'] == 'NJ'].index, inplace=True)

print(NE_zips.State.unique())

NE_zips.info()

So I still have a lot of sorting to do. Therefore I will now use population as a filter to narrow down my list.

Again, in order to stick to the business plan, we will be cross referencing only the zip codes for villages between 10 and 15 thousand people — below that may be too small, too touristy, or may not have enough of a town center to provide adequate opportunities for necessities such as restaurants, groceries, or gas. Towns larger than that won’t have the quaint New England feel, too near a large metro area, or may be a suburb.

`# From "The US Decennial Census of Population and Housing"`

population = pd.read_csv('population_by_zip_2010.csv')

population.head()

I need to sort out only the totals, which are integrated into the columns, making it somewhat confusing to know the total population for each zip code. But I was able to figure out that all of the columns where the ‘age’ and ‘gender’ is ‘Nan’ that is where they hid the total populations.

`# totals = population[(population['minimum_age'].isnull()) & `

(population['maximum_age'].isnull()) &

(population['gender'].isnull())]

Now that I have only the total population for each zip code I can simplify it down, first by keeping only ‘population’ and ‘zipcode’ columns and then by looking at only the New England zip codes.

`# Simplify new data frame to get rid of everything else`

totals = totals[['population', 'zipcode']]

# Reduce it down to just New England again

totals = totals.loc[(totals['zipcode'] < 6999) & (totals['zipcode'] > 999)]

Now I can merge this data frame with my original.

`# Merge the population totals with my first NE dataframe`

ne_pops = NE_zips.merge(totals, left_on='Zipcode', right_on='zipcode')

From here I can look at the demographics

`# Look at the demographics`

ne_pops.population.describe()

And I can see the percent of villages I want to look at is just above the median population for zip codes.

`ne_pops.population.hist()`

plt.title('Populations Per Zip Code')

plt.xlabel('Populations')

plt.ylabel('Number of Zip Codes')

plt.show()

To narrow down my search, I had to spend quite a bit of time cross referencing villages around New England.

`# Villages around 10,000 people`

ne_villages = ne_pops.loc[(ne_pops['population'] > 10000) & (ne_pops['population'] < 16000)]

ne_villages.head()

To view where these villages are located, I decided to cross reference them on a map. To do that, I needed to get the latitude and longitude from another dataset.

# https://www.unitedstateszipcodes.org/zip-code-database/latlong_codes = pd.read_csv("zip_code_database.csv")

latlong_codes.head()

`# I just want the zipcodes and the lat/lon`

latlong_codes = latlong_codes.filter(['zip', 'latitude', 'longitude'], axis=1)

# Get rid of short codes and anything not in New England

ne_codes = latlong_codes[(latlong_codes.zip_codes > 999) & (latlong_codes.zip_codes < 6999)]

Merge the lat/lon data frame with my New England villages.

`# Merge the lat and lon with my first NE dataframe`

ne_village_map = ne_villages.merge(ne_codes, left_on='Zipcode', right_on='zip_codes')

Then put them on a map:

`# Find out where these zip codes are located`

import folium

NE = folium.Map([42.9956, -71.4548],zoom_start=7, width='50%')

#location

for lat,lon in zip(ne_village_map['latitude'],ne_village_map['longitude']):

folium.CircleMarker([lat, lon], radius=3).add_to(NE)

NE

Now I want to look at percentages…

`# Stacked bar graph showing total '5 year', '3 year', and '1 year' percentage changes`

ax = ne_vil_averages.plot(x="Zipcode", y='5_year', kind="bar", figsize=(16,6))

ne_vil_averages.plot(x="Zipcode", y='3_year', kind="bar", ax=ax, color="C2")

ne_vil_averages.plot(x="Zipcode", y='1_year', kind="bar", ax=ax, color="C3")

plt.title('Total Percent of Change Per Zipcode', size=24)

plt.legend(loc='upper left', title='Period', title_fontsize=18, fontsize=15)

plt.xlabel('Zipcodes', size=20)

plt.xticks(rotation='horizontal', size=14)

plt.ylabel('Percentage', size=20)

From here I am going to take the top six from the recent percent of change and put into a new data frame:

`# I need only the six rows in a new df`

final_six = ne_villages.loc[(ne_villages['Zipcode'] == 1082) +

(ne_villages['Zipcode'] == 2882) +

(ne_villages['Zipcode'] == 3755) +

(ne_villages['Zipcode'] == 4105) +

(ne_villages['Zipcode'] == 5602) +

(ne_villages['Zipcode'] == 6382)].sort_values(by='Zipcode')

final_six

I am going to create a function to change from the wide format to the long format.

`# Melting the data to long format`

def melt_data(df):

melted = pd.melt(df, id_vars=['Zipcode', 'City', 'State', 'Metro',

'CountyName', 'population'], var_name='time')

melted['time'] = pd.to_datetime(melted['time'], infer_datetime_format=True)

melted = melted.dropna(subset=['value'])

# Grouping by time and value, resampling, and filling NaN's

melted = melted.groupby('time').aggregate({'value':'mean'})

melted = melted['value'].resample('MS').mean()

melted = melted.fillna(melted.bfill())

return melted

Next I have to check the correlograms in order to examen the p-d-q’s.

`# Look at the correlograms`

def corr_graphs(melted):

model =' '

color = 'g'

fig, ax = plt.subplots(figsize=(12,3))

plot_acf(melted, ax=ax, lags=22, color= color, title=model +

' Autocorrelation');

fig, ax = plt.subplots(figsize=(12,3))

plot_pacf(melted, ax=ax, lags=22, color=color, title=model +

' Partial Autocorrelation');

Run the model

# This is the function that will choose the best 'pdq' and run the model

import warnings

warnings.filterwarnings('ignore')def run_model(melted):

# Define the p, d and q parameters to take any value between 0 and 2

p = d = q = range(0, 2)

# Generate all different combinations of p, q and q triplets

pdq = list(itertools.product(p, d, q))

# Generate all different combinations of seasonal p, q and q triplets

pdqs = [(x[0], x[1], x[2], 12) for x in list(itertools.product(p, d, q))]# Run a grid with pdq and seasonal pdq parameters calculated above and get the best AIC value

ans = []

for comb in pdq:

for combs in pdqs:

try:

mod = sm.tsa.statespace.SARIMAX(melted, order=comb,

seasonal_order=combs,

enforce_stationarity=False,

enforce_invertibility=False)

output = mod.fit()

ans.append([comb, combs, output.aic])

except:

continue

# Find the parameters with minimal AIC value

ans_df = pd.DataFrame(ans, columns=['pdq', 'pdqs', 'aic'])

print("Minimum AIC Value and Optimal Params ", '\n',

ans_df.loc[ans_df['aic'].idxmin()], '\n')# Plug the optimal parameter values into a new SARIMAX model

SARIMA_MODEL = sm.tsa.statespace.SARIMAX(melted,

order=ans_df.loc[ans_df['aic'].idxmin()]['pdq'],

seasonal_order=ans_df.loc[ans_df['aic'].

idxmin()]['pdqs'],

enforce_stationarity=False,

enforce_invertibility=False)

# Fit the model and print results

output = SARIMA_MODEL.fit()

print("SARIMA Model ", '\n', output.summary().tables[1])

# Call plot_diagnostics() on the results calculated above

output.plot_diagnostics(figsize=(12, 6))

plt.show()

return output

And view the zip code with the best result

# This function will get the confidence interval for the predictions and graphs

def get_predictions(melted, output):

# Get predictions starting from 2016-04-01 and calculate confidence intervals

pred = output.get_prediction(start=pd.to_datetime('2016-04-01'), dynamic=False)

pred_conf = pred.conf_int()

# Plot real vs predicted values along with confidence interval

rcParams['figure.figsize'] = (12, 3)

# Plot observed values

ax = melted['2010':].plot(label='observed')

# Plot predicted values

pred.predicted_mean.plot(ax=ax, label='One-step ahead Forecast', alpha=0.9)

# Plot the range for confidence intervals

ax.fill_between(pred_conf.index,

pred_conf.iloc[:, 0],

pred_conf.iloc[:, 1], color='g', alpha=0.5)# Set axes labels

ax.set_title('One-Step Forecast')

ax.set_xlabel('Date')

ax.set_ylabel('Values')

plt.legend()

plt.show()# Get forecast for future (60 steps = 5 years)

prediction = output.get_forecast(steps=60)

# Get confidence intervals of forecasts

pred_conf = prediction.conf_int()# Plot future predictions with confidence intervals

ax = melted.plot(label='observed', figsize=(12, 3))

prediction.predicted_mean.plot(ax=ax, label='Forecast')

ax.fill_between(pred_conf.index,

pred_conf.iloc[:, 0],

pred_conf.iloc[:, 1], color='y', alpha=0.25)

ax.set_title('Future Predictions', size = 18)

ax.set_xlabel('Date', size=16)

ax.set_ylabel('Values', size=16)

plt.legend(loc='upper left')

plt.show()

# Get the real and predicted values

melted_forecasted = pred.predicted_mean

melted_truth = melted['2016-04-01':]

# Compute the mean square error

mse = ((melted_forecasted - melted_truth) ** 2).mean()

print('The Mean Squared Error of our forecasts is {}'.format(round(mse, 2)),

'\n')

# Compute the root mean square error

rmse = np.sqrt(mse)

print('The RMSE of our forecasts is {}'.format(round(rmse, 2), ), '\n')

# Compute the normalized root mean square error using max/min

nrmse = rmse*100/(pred_conf['upper value'].max() - pred_conf['lower value'].min())

print('The Max/Min Normalized RMSE is {}'.format(round(nrmse, 2)), '\n')

# Compute the mean_absolute_percentage_error

mape = np.mean(np.abs((melted_truth - melted_forecasted) / melted_truth))

print('The Mean Absolute Percentage Error is {}%'.format(round(mape * 100, 2)),'\n')

# Compute average column

pred_conf['mean value'] = pred_conf.apply(lambda row: (row['lower value'] +

row['upper value'])/2,

axis=1)

# Add helpful data for analysis

print('The Current Value is: $', melted[-1],'\n')

print('The Five Year predictions values are: ', '\n',

(round(pred_conf.tail(1), 2)), '\n')

# Plot just prediction

pred_conf.plot(figsize=(10,2))

plt.title('Predicted Forecast', size=18)

plt.legend(bbox_to_anchor=(1, 1), loc='upper left', ncol=1)

plt.xlabel('Years', size=15)

plt.xticks(rotation='horizontal')

plt.ylabel('Value', size=15)

# Average change in predicted value

start_pred = pred_conf['mean value'][0]

end_pred = pred_conf['mean value'][59]

change_pred = (end_pred - start_pred)/start_pred

print('The Five Year average change in value:', (round(change_pred, 4)*100), '%')

The predictions were:

The Five Year predictions values are:

lower value upper value mean value

2023-04-01 355691.02 865677.64 610684.33The RMSE of our forecasts is 1189.54

This is the graph for the five year predictions for my top zip code.

Using the five year predicted mean and the current value I determined the change in value percent.

`The Current Value is: $ 447900.00`

The Five Year average change in value: 34.93 %

The Rhode Island zip code of 02882 showed the best change in value for the mean of predictions at 34.93% even though it had one of the larger RMSE scores of 1189.54. Yet considering the price of houses for this region were nearly double that of some of the other regions it means that this prediction is still the best.

For investors looking to buy a property in a quaint New England village, where there town sizes are not too small, nor are the towns considered a suburb of a sprawling capital/large city, I have narrowed the field of possibilities and created a model that can forecast with limited certainly the region that is likely to show growth.

Using my predicted results from my SARIMAX Time Series model and calculated RMSE scores I would recommend the Rhode Island zip code of 02882 as the region with best possibility for investment growth over the next five years. While there certainly are other zip codes around the country and even in New England that may have better growth, only towns with a population of around 10,000 were considered.