Before you read this report, there are a few things you should know:
- This project uses two datasets: Airbnb’s Seattle dataset and the Boston host and reviews dataset.
- In this report, we use data science methods to explore three questions across the two datasets, with Python as our tool.
You can get all the code and the debugging process from here.
Step 0. Introduction
“If your friend is fresh out of college and wants to work in Seattle or Boston, housing is the first consideration. Rent or buy? The vast majority of people rent, because they are not well off when they first enter the workforce. So how do you find the best value for money? If you’re a programmer, how do you help your friend, and what advice do you give?”
Nowadays, with the rapid development of the Internet, our lives are inseparable from our mobile phones and the network. Through apps and the web, we can get rental information right away. The first thing that comes to my mind is Airbnb. This article will take you through data science methods to explore the mysteries of the data.
Step 1. Explore and Preprocess the Dataset
First of all, I start from the dataset and collect historical records as my reference. Luckily, the Airbnb datasets are available on the Internet, so the first step, getting enough data, is easy. If you need them, you can find them through the GitHub link.
Then we need to explore the data, process it, and discover its mysteries. We looked at Airbnb rental records from 2016 to 2017 and found some interesting phenomena.
As you can see from the plot above, the number of available houses is above 60K in most months, but January shows a decline, to below 10K.
Surprisingly, there is a sharp drop in January, as if no one cared about the rental market. I guess most people are preparing for the New Year, and few are willing to rent out their houses.
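A monthly count like the one described above could be computed roughly as follows. This is only a minimal sketch, not the code behind the plot: it assumes a calendar table with `listing_id`, `date`, and `available` columns, where `available` holds "t" or "f", and the sample rows are made up for illustration.

```python
import pandas as pd

# Made-up stand-in for the calendar table. The column names and the
# "t"/"f" flags are assumptions about the data layout.
calendar = pd.DataFrame({
    "listing_id": [1, 2, 3, 1, 2, 3],
    "date": ["2016-01-04", "2016-01-04", "2016-01-05",
             "2016-02-01", "2016-02-01", "2016-02-02"],
    "available": ["f", "t", "f", "t", "t", "f"],
})

def monthly_available(calendar: pd.DataFrame) -> pd.Series:
    """Count the rows marked as available in each calendar month."""
    month = pd.to_datetime(calendar["date"]).dt.to_period("M")
    is_available = calendar["available"].eq("t")
    # Summing booleans counts the True values per month.
    return is_available.groupby(month).sum()

counts = monthly_available(calendar)
print(counts)
```

Plotting `counts` as a bar chart would then show which months have unusually few available listings.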
Step 2. Start the Main Research
To find the relationships between features, we first need to understand the existing features, so I select a subset of the dataframe. I keep a few features such as ID, city, price, security_deposit, review_scores_rating, and so on:
In this picture, you can see that the price column is not numeric. To do calculations on the price, we need to convert its type. First, delete the ‘$’, ‘.’, and ‘ ’ characters from the column. Then write a function that does the conversion. After that, the data looks like this:
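The conversion step might look like the sketch below. It is not the original function: the name `price_to_float` is hypothetical, and it assumes prices are stored as strings such as "$1,250.00". Unlike the description above, it strips only the dollar sign, thousands separators, and whitespace, and keeps the decimal point so that the cents are not merged into the dollar amount.

```python
import pandas as pd

def price_to_float(series: pd.Series) -> pd.Series:
    """Convert strings such as "$1,250.00" to floats such as 1250.0."""
    # Remove "$", "," and whitespace; the decimal point is kept on purpose.
    cleaned = series.astype(str).str.replace(r"[$,\s]", "", regex=True)
    # Anything that still is not a number (e.g. a missing value) becomes NaN.
    return pd.to_numeric(cleaned, errors="coerce")

# Made-up sample rows for illustration.
listings = pd.DataFrame({"price": ["$85.00", "$1,250.00", " $40.00", None]})
listings["price"] = price_to_float(listings["price"])
print(listings)
```

The same helper could be applied to `security_deposit`, assuming that column uses the same string format.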
With the above preprocessing, we can answer the following three questions.
Step 3. Questions and Answers
Q1. Is there any relationship between ‘price’ and ‘review_scores_rating’?
The plots of price against review scores rating show that as the price of a house gets higher, its review scores rating tends to get higher as well. You will also find that for houses priced below $400, most scores fall between 70 and 90, which is already very high. So you get what you pay for: if you want a high-quality house, you had better spend more money, and it’s worth it.
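One way to check this reading in numbers, rather than by eye, is to group the listings into price bands and compare the ratings in each band. The sketch below assumes numeric `price` and `review_scores_rating` columns; the band edges and the sample values are made up for illustration and are not results from the dataset.

```python
import pandas as pd

def rating_by_price_band(df: pd.DataFrame, edges: list) -> pd.DataFrame:
    """Count listings and take the median rating within each price band."""
    bands = pd.cut(df["price"], bins=edges)
    return (
        df.groupby(bands, observed=True)["review_scores_rating"]
        .agg(["count", "median"])
    )

# Made-up sample rows, not real Airbnb data.
sample = pd.DataFrame({
    "price": [60, 95, 150, 220, 380, 450, 900],
    "review_scores_rating": [80, 85, 90, 88, 95, 97, 99],
})

summary = rating_by_price_band(sample, edges=[0, 100, 200, 400, 1000])
print(summary)

# A single-number view of the same relationship (Pearson correlation).
print(sample["price"].corr(sample["review_scores_rating"]))
```

If the median rating rises from band to band on the real data, that would support the pattern described above; a flat table would suggest the scatter plot is being over-read.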
Q2. Is there any pattern between the other scores and price?
For this question, I write a loop that plots each feature against price, in order to show the relationship between the different features and price.
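A plotting loop of this kind could be sketched as follows. This is an illustration under assumptions, not the original loop: the helper name `plot_against_price` is hypothetical, the columns are assumed to be numeric already, and the sample values are made up.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen rendering; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

def plot_against_price(df: pd.DataFrame, features: list):
    """Draw one scatter plot of price against each listed feature."""
    fig, axes = plt.subplots(
        1, len(features), figsize=(4 * len(features), 3.5), squeeze=False
    )
    for ax, feature in zip(axes[0], features):
        ax.scatter(df["price"], df[feature], s=10, alpha=0.5)
        ax.set_xlabel("price")
        ax.set_ylabel(feature)
        ax.set_title(f"price / {feature}")
    fig.tight_layout()
    return fig

# Made-up sample rows, not real Airbnb data.
sample = pd.DataFrame({
    "price": [60, 95, 150, 220, 380],
    "security_deposit": [100, 0, 250, 100, 500],
    "review_scores_rating": [80, 85, 90, 88, 95],
})

fig = plot_against_price(sample, ["security_deposit", "review_scores_rating"])
fig.savefig("price_vs_features.png")
```

Putting all panels in one figure with a shared x variable makes it easier to compare which features move with price and which do not.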
In the plots, every feature is compared against price. Only the price/security_deposit plot looks different from the others, and in that plot security_deposit does not seem to have any relationship with price. Perhaps the hosts of expensive houses do not care about the security deposit, because they target higher-end customers.
Q3. Is there any pattern between the location and the price?
For this question, I concatenate the two datasets into one dataframe and apply one-hot encoding to the city feature. Note that some city names, though written differently, refer to the same place, so we need to replace these names with a uniform one before we can carry out the one-hot encoding.
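The combine, normalize, and encode steps could look roughly like this. It is a minimal sketch under assumptions: the alias table and the sample rows are hypothetical, and the real spelling variants have to be read off the `city` column of the actual data.

```python
import pandas as pd

# Hypothetical alias table: lowercase spelling variant -> uniform name.
CITY_ALIASES = {
    "seattle": "Seattle",
    "seattle, wa": "Seattle",
    "boston": "Boston",
}

def combine_and_encode(frames: list, aliases: dict) -> pd.DataFrame:
    """Stack the datasets, unify city names, then one-hot encode the city."""
    combined = pd.concat(frames, ignore_index=True)
    stripped = combined["city"].str.strip()
    # Look up the cleaned, lowercased name; keep the original if unknown.
    combined["city"] = stripped.str.lower().map(aliases).fillna(stripped)
    dummies = pd.get_dummies(combined["city"], prefix="city")
    return pd.concat([combined, dummies], axis=1)

# Made-up sample rows, not real Airbnb data.
seattle = pd.DataFrame({"city": ["Seattle", "seattle ", "Seattle, WA"],
                        "price": [120.0, 95.0, 300.0]})
boston = pd.DataFrame({"city": ["Boston", "boston", "Boston "],
                       "price": [150.0, 80.0, 400.0]})

encoded = combine_and_encode([seattle, boston], CITY_ALIASES)
print(encoded)
```

Normalizing before encoding matters because `get_dummies` creates one column per distinct string, so "seattle " and "Seattle" would otherwise become two separate classes.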
Finally, we find that there are many expensive houses in class 3 and class 5. One of them is Seattle, and the other is the center of Boston. Both have convenient transportation and a high cost of living. Of course, if you live in the city center, you need more money to cover living expenses and rent.
Step 4. Conclusions
In this report, I don’t use any ML models, because they are not necessary. Through simple exploration of the data, we can find some relationships between different features. To make the report better, I think we could use a linear regression model to predict house rents for the next year. We could also use a classification model to analyze customers’ spending habits, and then recommend the most suitable and most cost-effective house to each customer.