I took a Data Science class in my MBA program, and I was recently re-reading our final project write-up. The assignment was to take a complex data set, do some analysis on it, and show the results. My group decided to focus on NYC real estate. We boldly set out to use subjectively important variables to predict the prices of NYC apartments.
Our thesis at the outset was that deals could be found in unexpected places and in neighborhoods one might not have previously considered. After we realized that data at the granularity this project required wasn’t readily available, we pivoted to a more reasonable thought exercise: might lagging annual or quarterly Yelp reviews be a predictor of real estate listing prices? That is, if local businesses saw an uptrend in reviews (more reviews or better reviews) in a given year or quarter, would we see an increase in the listing prices of local real estate, and if so, could we use this knowledge to forecast prices?
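To make the "lagging reviews" idea concrete, here is a minimal sketch of a one-period lagged regression: last quarter's average review score is paired with this quarter's listing price, and a straight line is fit through the pairs. This is an illustration only, not the project's actual code. The function names (`lagged_pairs`, `fit_line`) are hypothetical and the numbers are made up purely to show the mechanics.

```python
from statistics import mean


def lagged_pairs(predictor, target, lag=1):
    """Pair predictor[t - lag] with target[t] for every t where both exist."""
    if len(predictor) != len(target):
        raise ValueError("series must have the same length")
    if lag < 1 or lag >= len(target):
        raise ValueError("lag must be between 1 and len(series) - 1")
    return list(zip(predictor[:-lag], target[lag:]))


def fit_line(pairs):
    """Ordinary least squares fit of y = slope * x + intercept."""
    xs, ys = zip(*pairs)
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("predictor has no variation")
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in pairs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


# Made-up quarterly values, for illustration only (not project data).
avg_review_score = [3.0, 3.2, 3.1, 3.6, 3.8]
avg_listing_price = [500.0, 510.0, 520.0, 515.0, 540.0]  # in $ thousands

slope, intercept = fit_line(lagged_pairs(avg_review_score, avg_listing_price))
print(f"price[t] ~ {slope:.1f} * reviews[t-1] + {intercept:.1f}")
```

A positive slope here would only say that prices tended to be higher in quarters following higher review scores. It says nothing on its own about whether reviews add forecasting power beyond what prices already tell you.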
Project Outcomes
The project entailed marrying data from Yelp, Trulia, Zillow, Google, and NYC Open Data, organizing it all, running various regression analyses on it, and then visualizing the results. We generated some cool charts. For example, this graph shows average prices for various property sizes from Trulia plotted against the average reviews from Yelp for pharmacies, restaurants, and bars.
In Q2 of 2012, bar and pharmacy reviews improve noticeably. It appears that rankings improve after prices for the larger property types increase and, though it’s hard to tell, the uptrends in Yelp reviews are followed by a slight increase in the price of 1-, 2- and 3-bedroom units.
Ultimately, our conclusion was that the best predictor of next year’s prices is this year’s prices! In other words, data science isn’t easy. It was an interesting exercise if nothing else. Full writeup is available here, and all the code is up on GitHub.