Predicting Airbnb Data from Demographics
University of Michigan Master of Applied Data Science
Capstone Project 2021-12-19
Thomas Easley and Cameron Lyons
Overview
Motivation
The goal of this project is to examine how well demographic information, and changes in demographics, predict Airbnb rental prices. Specifically, we asked which factor has greater determinative power on a normalized metric of rental prices in a given area: current demographics or changes in demographics. We first examined which demographic variables are statistically correlated with prices and their variance in an area, and which may be confounding variables. We then constructed supervised models and measured how much predictive power each set of variables provides.
Data Sources
We used the 29 US cities featured in the Inside Airbnb dataset for information on Airbnb rentals.
We also used the available data from the US Census Bureau, primarily the 2020 and 2010 datasets. These datasets are all available for download in various csv, zipped, and shapefile formats. Specifically, we used geospatial data from the TIGER/Line shapefiles and data from the Census Redistricting Data sets, and the accompanying H1, P1, P2, P3, and P4 datasets for both 2010 and 2020 as well as the P5 table only available in the 2020 dataset.
Methods and Evaluation
The first step in the implementation of the project was to co-locate each listing in its census tract using the Python geopandas package. The census tract boundaries differ between the 2010 and 2020 censuses, so a point may be co-located in a different tract depending on the year. Once the individual listings had been co-located in their respective tracts for each year, queries were made via the US Census Bureau API to pull down demographic data for each of the tracts in their respective years.
Once the listing data had been joined with the demographic data, we ran a naïve statistical analysis to examine which variables had some predictive value. This informed our choice of primary variables for the supervised learning models.
Once a subset of the demographic variables had been identified as significantly correlated with Airbnb listing information, we constructed supervised learning models to examine the predictive power of current demographic variables and of changes in demographic variables.
Data Engineering
Airbnb Data
The Inside Airbnb dataset is read in a straightforward manner: each link to a .gz file hosted on Inside Airbnb’s website is read as a byte stream and turned into a pandas dataframe, as in the code below:
# filename is the URL of one .gz file; g_df is the list collecting one dataframe per file
url = urllib.request.urlopen(filename)
with gzip.open(BytesIO(url.read()), 'rb') as l_file:
    l_df = pd.read_csv(l_file)
g_df.append(l_df)
From here the only additional step is combining the dataframes and then converting them into a geopandas dataframe so we can use it as a geospatial object with a defined CRS:
g_df = pd.concat(g_df)
g_df = gpd.GeoDataFrame(
    g_df,
    geometry=gpd.points_from_xy(
        g_df['longitude'],
        g_df['latitude']),
    crs='EPSG:4326'
)
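The two snippets above can be exercised end to end without touching the network. The following is a minimal sketch, assuming a tiny in-memory gzip file stands in for a downloaded Inside Airbnb file; the sample rows and the `read_gzipped_csv` helper are hypothetical and not part of the project code.

```python
import gzip
from io import BytesIO

import geopandas as gpd
import pandas as pd


def read_gzipped_csv(raw_bytes):
    """Decompress gzip-compressed CSV bytes and parse them into a dataframe."""
    with gzip.open(BytesIO(raw_bytes), "rb") as l_file:
        return pd.read_csv(l_file)


# Hypothetical stand-in for the bytes returned by urllib.request.urlopen(...).read()
sample_csv = b"id,latitude,longitude,price\n1,42.28,-83.74,120\n2,42.33,-83.05,95\n"
raw_bytes = gzip.compress(sample_csv)

# One dataframe per file, then a single combined dataframe
frames = [read_gzipped_csv(raw_bytes)]
listings = pd.concat(frames, ignore_index=True)

# Longitude is the x coordinate and latitude is the y coordinate;
# EPSG:4326 is the WGS 84 latitude/longitude CRS
listings_gdf = gpd.GeoDataFrame(
    listings,
    geometry=gpd.points_from_xy(listings["longitude"], listings["latitude"]),
    crs="EPSG:4326",
)
```

Passing `ignore_index=True` to `pd.concat` is a choice made for this sketch; it avoids repeated index values when many city files are stacked.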
A visualization of the final geographic data in QGIS can be viewed below:
Census Geospatial Data
Census Geospatial data, in this case Census Tract data, was read from Census Bureau TIGER/Line FTP archives for each of the years in question. Specifically, these datasets are provided as ESRI shapefiles which can be directly read into geopandas. These geopandas dataframes are then concatenated in the same manner as a pandas dataframe and output as a combined set of shapefiles.
A visualization of the census tracts for both years can be seen below:
Spatial Join and Census API Pull
The combined Airbnb and Census geospatial datasets are then joined using the sjoin function available in geopandas, once all the dataframes have been converted to a common CRS. This step is important because of the limits on pulling data via the Census API: we only want to pull data for the tracts we are interested in, or we could end up downloading a large volume of data for tracts that are not part of the Airbnb dataset.
census_df = gpd.read_file(census_file)
airbnb_df = gpd.read_file(airbnb_file)
# Both layers must share a CRS before the spatial join
census_df = census_df.to_crs('EPSG:4326')
airbnb_df = airbnb_df.to_crs('EPSG:4326')
used_tl = gpd.sjoin(census_df, airbnb_df)
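To make the effect of the spatial join concrete, here is a small sketch on toy geometries. The square "tracts", the `GEOID` values, and the listing points are all made up for illustration, and the listings are placed on the left side of the join here (the excerpt above puts the tracts on the left), which yields one row per listing.

```python
import geopandas as gpd
from shapely.geometry import Point, box

# Two hypothetical square "tracts" side by side, with made-up GEOID values
tracts = gpd.GeoDataFrame(
    {"GEOID": ["A", "B"]},
    geometry=[box(0, 0, 1, 1), box(1, 0, 2, 1)],
    crs="EPSG:4326",
)

# Three hypothetical listings: two inside tract A, one outside both tracts
listings = gpd.GeoDataFrame(
    {"listing_id": [1, 2, 3]},
    geometry=[Point(0.2, 0.5), Point(0.7, 0.1), Point(5.0, 5.0)],
    crs="EPSG:4326",
)

# The default is an inner join on intersecting geometries, so the listing
# outside every tract is dropped and tract B never appears in the result
joined = gpd.sjoin(listings, tracts)

# Only these tracts would need to be requested from the Census API
needed_tracts = sorted(joined["GEOID"].unique())
```

Tract B contains no listings, so it is absent from `needed_tracts`; this is the filtering that keeps the later API pull small.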
Once the join is complete, the set of unique Census Tract identifiers can be used to pull only the required data from the H1 and P1 through P5 tables that are part of the Census Redistricting Data set:
data_step = ",".join(DATA_2020[last_n:next_n])
census_url = f'https://api.census.gov/data/2020/dec/pl?get={data_step}&for=tract:{l_tracts}&in=state:{l_statefp}%20county:{l_county}&key={API_KEY}'
try:
    response = requests.get(census_url)
    data = response.json()
    # The first row of the response holds the column names
    i_df = pd.DataFrame(data[1:], columns=data[0])
    if len(iter_df) == 0:
        iter_df = i_df
    else:
        ...  # the excerpt ends here; the rest of this branch and the except clause are not shown
In practice the program uses a series of nested loops, because the Census Bureau restricts each request to 50 variables and more than 50 data fields are needed. The fields are requested in batches (the {data_step} in the code above), and each batch is requested across multiple tracts (the {l_tracts}, {l_statefp}, and {l_county} codes in the code above).
Separate queries are also used for the 2010 and 2020 Censuses, as the two have slightly different names for the fields requested through {data_step} above.
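The batching described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the project's code: `chunk` and `build_url` are hypothetical helpers, the variable names are placeholders, the state, county, and tract codes are illustrative, and the tracts are assumed to be passed as a comma-separated list. No request is actually sent.

```python
def chunk(items, size):
    """Split a list into consecutive pieces of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def build_url(year, variables, state, county, tracts, api_key):
    """Assemble one request URL in the same shape as the excerpt above."""
    return (
        f"https://api.census.gov/data/{year}/dec/pl"
        f"?get={','.join(variables)}"
        f"&for=tract:{','.join(tracts)}"
        f"&in=state:{state}%20county:{county}"
        f"&key={api_key}"
    )


# Placeholder variable names; the real lists differ between 2010 and 2020
variables = [f"VAR_{n:03d}" for n in range(1, 121)]

# One URL per batch of at most 50 variables for a single state and county;
# an outer loop over (state, county) pairs would repeat this for each area
urls = [
    build_url(2020, batch, "26", "161", ["400100", "400200"], "YOUR_KEY")
    for batch in chunk(variables, 50)
]
```

With 120 placeholder variables this produces three URLs (50, 50, and 20 variables). The per-batch responses share the same geography columns, so they can be combined on those columns afterwards.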
Once all of the data has been requested, it can be combined with the Airbnb dataset fairly easily using a .merge call in pandas between the resulting dataframe for each Census Tract and the Airbnb data. Because of the spatial join performed earlier, the Airbnb data already carries the Census Tract information to merge on.
The full diagram of the data engineering workflow can be seen below and is implemented via DVC in the GitHub repository linked at the end of this paper:
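A minimal sketch of that merge follows, assuming the listings already carry `state`, `county`, and `tract` columns from the spatial join. The column names and all values are illustrative; `P1_001N` is used here as an example census field.

```python
import pandas as pd

# Hypothetical listings that already carry tract identifiers
listings = pd.DataFrame({
    "listing_id": [1, 2, 3],
    "state": ["26", "26", "26"],
    "county": ["161", "161", "163"],
    "tract": ["400100", "400100", "500100"],
    "price": [120.0, 95.0, 150.0],
})

# Hypothetical census pull: one row per tract, identifiers kept as strings
census = pd.DataFrame({
    "state": ["26", "26"],
    "county": ["161", "163"],
    "tract": ["400100", "500100"],
    "P1_001N": [3200, 4100],
})

# A left merge keeps every listing; validate raises an error if a tract
# appears more than once on the census side
merged = listings.merge(
    census,
    on=["state", "county", "tract"],
    how="left",
    validate="many_to_one",
)
```

Keeping the identifiers as strings on both sides preserves leading zeros, which would otherwise be lost if the codes were parsed as integers.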
Modeling
We wanted to include information from the listing photos, as well as the descriptions and neighborhood overview, in our models. These likely have a strong causal impact on rental price, with better photos and a better description inducing demand and allowing the host to charge more. We included information on the photos by training a simple CNN to predict the rental price from the first photo of each listing. We also applied a TF-IDF vectorizer to each text field and trained a model on each set of resulting vectors to predict rental price.
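The text side of this step can be sketched with scikit-learn. This is an illustration only: the source does not say which model was trained on the TF-IDF vectors, so the ridge regression here is an assumption, and the descriptions and prices are invented.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

# Invented listing descriptions and nightly prices
descriptions = [
    "spacious loft downtown with skyline view",
    "cozy private room near campus",
    "luxury penthouse with rooftop pool",
    "small basement studio, shared bath",
    "bright apartment close to the river",
    "quiet room in family home",
]
prices = [180.0, 60.0, 400.0, 45.0, 130.0, 55.0]

# TF-IDF turns each description into a sparse vector; the regressor
# (an assumed choice) maps that vector to a price
text_model = make_pipeline(TfidfVectorizer(), Ridge(alpha=1.0))
text_model.fit(descriptions, prices)

# One predicted price per description, usable as a feature downstream
text_pred = text_model.predict(descriptions)
```

If such predictions are fed into a later model, out-of-fold predictions (for example via `sklearn.model_selection.cross_val_predict`) would avoid leaking the target into the feature.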
We also divided the total number of people in each racial group by the total population to get the percentage of each racial group. We also calculated the difference in equivalent fields between 2010 and 2020 for the census data fields.
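These two derived feature types can be sketched in pandas as follows. The column names and counts are illustrative placeholders, not the actual census field names.

```python
import pandas as pd

# Hypothetical per-tract counts for one group in both census years
tracts = pd.DataFrame({
    "tract": ["400100", "500100"],
    "total_2010": [1000, 2000],
    "group_a_2010": [250, 500],
    "total_2020": [1200, 2000],
    "group_a_2020": [360, 400],
})

for year in (2010, 2020):
    # Share of the tract's population belonging to the group
    tracts[f"group_a_pct_{year}"] = (
        tracts[f"group_a_{year}"] / tracts[f"total_{year}"]
    )

# Change between the two censuses, for raw counts and for shares
tracts["total_delta"] = tracts["total_2020"] - tracts["total_2010"]
tracts["group_a_pct_delta"] = (
    tracts["group_a_pct_2020"] - tracts["group_a_pct_2010"]
)
```

Tracts with a total population of zero would produce undefined shares, so they would need to be handled before modeling. Because tract boundaries differ between the two censuses, the deltas are only meaningful once each listing has been matched to its tract in each year.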
For the level 1 model we selected gradient boosted trees, as they performed well, make feature importances easy to obtain, and can capture nonlinear relationships. We trained models on three feature sets: the full set of features; the features without the 2010 data and the changes in demographics; and the features without the 2020 census data.
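The comparison across feature sets can be sketched as below. This is not the project's model: it assumes scikit-learn's `GradientBoostingRegressor` as the gradient boosted tree implementation, and the data, column names, and target are synthetic, so the scores say nothing about the real results.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 400

# Synthetic stand-ins for census shares and a photo-model prediction
X = pd.DataFrame({
    "pct_2020": rng.uniform(0, 1, n),
    "pct_2010": rng.uniform(0, 1, n),
    "photo_pred": rng.normal(100, 20, n),
})
X["pct_delta"] = X["pct_2020"] - X["pct_2010"]

# Synthetic target driven by the photo prediction and the 2020 share
y = 0.8 * X["photo_pred"] + 30 * X["pct_2020"] + rng.normal(0, 5, n)

feature_sets = {
    "full": ["pct_2020", "pct_2010", "pct_delta", "photo_pred"],
    "without_2010_and_changes": ["pct_2020", "photo_pred"],
    "without_2020": ["pct_2010", "pct_delta", "photo_pred"],
}

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scores, importances = {}, {}
for name, cols in feature_sets.items():
    model = GradientBoostingRegressor(random_state=0)
    model.fit(X_train[cols], y_train)
    # R² on held-out rows, plus the per-feature importances
    scores[name] = r2_score(y_test, model.predict(X_test[cols]))
    importances[name] = dict(zip(cols, model.feature_importances_))
```

Scoring each feature set on the same held-out split keeps the comparison between the sets like for like.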
Discussion
Conclusions from Results
We were able to get strong R² scores across the models we trained. However, those that included 2020 census data were significantly more predictive than the models that used only the change in demographics.
The photo model is more predictive of pricing than the text-based models. The layout of the webpage and app likely contributes to this, as the first photo is prominently displayed and is likely the first thing a potential customer sees when searching for a place to book. The description of the place is more predictive than the other text information: it is larger and is displayed before the other information. The models trained on host and neighborhood descriptions were not as predictive. These sections are smaller and placed below the reviews, which many may believe offer better information on the quality of the host.
For the demographic information, unfortunately and unsurprisingly the most important features were those related to the number of African American people in the neighborhoods, which were associated with lower prices for Airbnb rentals.
The combined modeling workflow can be visualized below:
Further Work
We could try using modern transfer learning methods to improve the predictive power of our level 1 models. We could also look at making these models more interpretable. This could be used to advise hosts on how to take photos and write descriptions that would be more appealing to potential customers. However, there are potential ethical pitfalls of a more interpretable model should any racial demographic information be shown to have a substantially negative impact on Airbnb listing price. These learnings may also transfer to other domains.
It would also be interesting to look at this data over longer time frames, as people’s perceptions may be slow to change and therefore lag behind reality. It may also take a longer period of sustained demographic change before the change registers in public perception and then affects pricing.
Training models on the remaining photos could also improve our results: the first photo is certainly important for making a first impression, but the remaining photos may contain something that significantly affects what many potential customers would be willing to spend on the rental.
Statement of Work
Thomas Easley worked on the data engineering. Cameron Lyons worked on the modeling.
Source Code
All source code can be found at: https://github.com/tgeasley/capstone_dvc