ImmoEliza-API
An API that forecasts house and apartment prices from parameters such as postal code, number of rooms, and surface area.
- Website (AI Dev): https://immoeliza-real-estate.herokuapp.com/
- GitHub page: https://kaiyungtan.github.io/ImmoEliza-API/
- Website (Web Dev): under construction
- GitHub repo (Web Dev): https://github.com/VieiraHenrique/testPhp (under construction)
Background
This project is a collaboration between BeCode AI and the BeCode Web Dev team.
The AI developers create the API, and the web developers build an interface on top of it for the client, “ImmoEliza”.
The two teams have to stay in sync throughout the project, in particular to agree on how the input form should be constructed.
Team members:
- AI Dev: Adam
- Web Dev: Valentin, Henrique, Nathanaël
Mission objectives
- Be able to create a prediction model
- Be able to deploy a model
- Be able to work in a team
- Be able to build an API
The Mission
You need to create an API that will make price forecasts on houses or apartments according to certain parameters (postal code, number of rooms, surface area, etc.). This API will be used by the web developers to create an interface for the ImmoEliza agency.
Use case examples:
- Find out the predicted price of a house or apartment based on your selection criteria.
- Compare the asking price posted on a real estate website (e.g. Immoweb) with the predicted price.
- Compare the price per square meter of the property with the average price per square meter of the city.
Must-have features
- The API must be functional
- Your model must be functional
Additional features
- Provide the average house/apartment price per square meter for all cities in Belgium
- Show the difference (%) between the price per square meter of the house/apartment and the average price per square meter for its city
Machine Learning Process Overview
Business Understanding
- Getting a good estimate of the price of a house or apartment is hard, even for the most seasoned real estate agents, because it involves many variables. The owner can set the price based on many factors. The property itself is of course the main one, but the location and the facilities around the property can have a large impact on the price. Inflation or personal reasons can also play a part.
- In this project, we use different machine learning algorithms to learn from the features of a property in order to predict its price. The models take into account neither the facilities around the property nor external factors such as the effect of COVID-19 on housing prices.
- The prediction is based on the price posted on the website, not on the actual transaction price. With this in mind, we can assume that most owners list at a somewhat higher price to leave room for negotiation.
- According to STATBEL:
- The observed annual inflation rate for house prices was 4.5 % in the second quarter of 2020, compared to 3.5 % in the previous quarter.
- The average inflation rate over the last four quarters was 4.3 %.
- The house price index went up by 1.4 % in the second quarter of 2020 compared to the previous quarter.
- The house price index can be broken down into new houses and existing houses. In the second quarter of 2020, annual inflation amounted to 5.3 % for new houses and 4.3 % for existing houses.
- Note: The house price index measures the price evolution under the assumption that the characteristics of the properties sold remain unchanged.
Data Understanding
- The real estate dataset was scraped from Immoweb, probably the biggest real estate website in Belgium, in mid-September 2020 during a previous BeCode Data Collecting Challenge. It contains more than 50,000 properties, including houses and apartments.
- Initially the dataset had 52077 rows and 20 columns; after data cleaning it was reduced to 40395 rows (observations) and 18 columns.
- To add geographical information, a postal codes dataset from https://data.gov.be/ was merged with the real estate dataset during a previous BeCode real estate data analysis.
Data Preparation
- After further data cleaning, the dataset was reduced to 24040 rows (observations) with 19 columns (belgium_real_estate_2020_rev1_19.11.2020.csv).
- The dataset was then split into two separate datasets, df_house for houses and df_apartment for apartments (Belgium_Real_Estate_2020):
- df_house (10254 rows, 19 columns) (belgium_houses_20.11.2020.csv)
- df_apartment (13207 rows, 18 columns) (belgium_apartments_20.11.2020.csv)
- To compare the predicted price per square meter with the average price per square meter of a city, a separate dataset (Price_Sqm) was prepared with only 7 columns: city_name, postal_code, price_sqm, region, province, longitude, lattitude.
- Two folium maps were created to show the average price per square meter in each city, one for houses and one for apartments.
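The per-city averages in the Price_Sqm dataset can be derived with a group-by. The snippet below is a minimal sketch rather than the project's actual code: the sample rows are made up, and it assumes the average is taken over each listing's own price per square meter.

```python
import pandas as pd

# Made-up sample listings; the real data comes from the cleaned CSV files.
listings = pd.DataFrame(
    {
        "city_name": ["Averbode", "Averbode", "Anderlecht"],
        "postal_code": ["3271", "3271", "1070"],
        "price": [230000.0, 250000.0, 320000.0],
        "house_area": [226, 200, 130],
    }
)

# Price per square meter for each listing.
listings["price_sqm"] = listings["price"] / listings["house_area"]

# Average price per square meter for each city (assumed aggregation).
price_sqm = listings.groupby(
    ["postal_code", "city_name"], as_index=False
)["price_sqm"].mean()
price_sqm["price_sqm"] = price_sqm["price_sqm"].round(2)

print(price_sqm)
```

The resulting table has one row per city and could be joined with region, province, and coordinates before being drawn on a map.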
Features of the dataset:
- postal_code (str): Postal code of the city.
- city_name (str): Name of the city in Belgium.
- number_of_rooms (int): The number of rooms of the property.
- house_area (int): The area (m2) of the house (floors).
- fully_equipped_kitchen (str): yes/no
- open_fire (str): yes/no
- terrace (str): yes/no
- garden (str): yes/no
- number_of_facades (int): The number of facades (0 to 4).
- swimming_pool (str): yes/no
- state_of_the_building (str): as new/good/just renovated/to renovate/unknown
- construction_year (int): The year the property was built.
- surface_of_the_land (int): The area (m2) of the land. (for house only)
Target of the dataset:
- price (float) : Price (€) of the property.
Modeling
- The objective of machine learning is not a model that does well on training data, but one that satisfies the business need and can be deployed on live data.
- A machine learning model is a file that has been trained to recognize certain types of patterns. You train a model over a set of data, providing it with an algorithm that it can use to reason over and learn from those data.
- The following libraries were used in this project:
Libraries
- from sklearn.model_selection import train_test_split
- from sklearn.preprocessing import StandardScaler, OneHotEncoder
- from sklearn.compose import ColumnTransformer
- from sklearn.pipeline import Pipeline
- from sklearn.linear_model import LinearRegression
- from sklearn.linear_model import Lasso, Ridge, ElasticNet
- from sklearn.tree import DecisionTreeRegressor
- from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
- from xgboost import XGBRegressor
- from sklearn.metrics import mean_squared_error, r2_score
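The imports above suggest a preprocessing-plus-regressor pipeline. The following is a minimal sketch of how these pieces might fit together, not the project's actual training code: the toy rows are invented, only a subset of the listed features is used, and the split into numeric and categorical columns is an assumption. The Ridge alpha of 0.7 is the value reported in the Evaluation section.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Invented toy data using a subset of the feature names listed above.
df = pd.DataFrame(
    {
        "postal_code": ["3271", "1070", "1000", "9000"] * 10,
        "number_of_rooms": [4, 3, 2, 5] * 10,
        "house_area": [226, 130, 90, 300] * 10,
        "garden": ["yes", "no", "no", "yes"] * 10,
        "price": [230000.0, 320000.0, 250000.0, 450000.0] * 10,
    }
)
X = df.drop(columns="price")
y = df["price"]

# Assumed split of columns by type.
numeric_features = ["number_of_rooms", "house_area"]
categorical_features = ["postal_code", "garden"]

# Scale numeric columns and one-hot encode categorical ones.
preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), numeric_features),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features),
    ]
)

model = Pipeline(
    steps=[("preprocess", preprocessor), ("regressor", Ridge(alpha=0.7))]
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model.fit(X_train, y_train)

print("train R2:", r2_score(y_train, model.predict(X_train)))
print("test R2:", r2_score(y_test, model.predict(X_test)))
```

Keeping the preprocessing inside the pipeline means the same transformations are applied automatically when the fitted model later predicts on a new listing.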
- The following diagram shows the process of modeling.
- Randomized search cross-validation was used to find the parameter settings that gave the best results on the hold-out data.
- The following libraries were used:
- from sklearn.model_selection import RandomizedSearchCV
- from sklearn.metrics import make_scorer
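A randomized search with these two imports could look like the sketch below. The data is synthetic, and the candidate alpha values, the number of iterations, and the choice of R² as the scorer are illustrative assumptions, not the settings used in the project.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import make_scorer, r2_score
from sklearn.model_selection import RandomizedSearchCV

# Synthetic regression data standing in for the prepared features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([3.0, -2.0, 0.5, 0.0, 1.0]) + rng.normal(scale=0.1, size=200)

# Candidate values to sample from (illustrative only).
param_distributions = {"alpha": [0.1, 0.3, 0.5, 0.7, 1.0, 1.5, 2.0]}

search = RandomizedSearchCV(
    estimator=Ridge(),
    param_distributions=param_distributions,
    n_iter=5,                       # number of sampled parameter settings
    scoring=make_scorer(r2_score),  # assumed scoring metric
    cv=5,                           # 5-fold cross-validation
    random_state=42,
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

The same pattern applies to the tree-based models by swapping the estimator and listing parameters such as n_estimators, max_depth, and learning_rate.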
Evaluation
- After evaluating 8 different models on the test set, along with several randomized search cross-validation runs also scored on the test set:
- A Ridge model (alpha=0.7) was selected for house price prediction, with a train score of 0.77 and a test score of 0.73 (reported as accuracy; presumably the R² score).
- An XGBoost model (n_estimators=700, max_depth=4, learning_rate=0.3) was selected for apartment price prediction, with a train score of 0.88 and a test score of 0.77.
- Plot of predicted vs. actual prices with the regression line overlaid, for the Ridge model.
- Plot of predicted vs. actual prices with the regression line overlaid, for the XGBoost model.
- The models were saved using the joblib library.
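Saving and reloading a fitted model with joblib takes two calls. This is a minimal sketch under stated assumptions: the tiny Ridge model stands in for the project's trained models, and the file name is hypothetical.

```python
import os
import tempfile

import joblib
from sklearn.linear_model import Ridge

# Tiny stand-in model; the project saves its fitted Ridge and XGBoost models.
model = Ridge(alpha=0.7).fit([[0.0], [1.0], [2.0]], [0.0, 1.0, 2.0])

# Hypothetical file name, written to a temporary directory for this example.
model_path = os.path.join(tempfile.mkdtemp(), "model_house.joblib")
joblib.dump(model, model_path)

# Later, for example inside the web app, load the model back and predict.
loaded_model = joblib.load(model_path)
print(loaded_model.predict([[3.0]]))
```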
- To test the model on new, unseen data, a new Immoweb listing was chosen:
- https://www.immoweb.be/en/classified/house/for-sale/averbode/3271/9040949?searchId=5fb6439b8044e
House for sale
- €230,000
- 4 bedrooms, 226 m²
- Bredestraat 70 3271 — Averbode
- Construction year 1930
- Building condition To renovate
- Facades 3
- Kitchen type Installed
- Surface of the plot 398 m²
- Garden surface 150 m²
- Terrace surface 25 m²
- Create the X_new features for the prediction:
- X_new = {'postal_code': '3271', 'number_of_rooms': 4, 'house_area': 226, 'fully_equipped_kitchen': 'yes', 'open_fire': 'no', 'terrace': 'yes', 'garden': 'yes', 'number_of_facades': 3, 'swimming_pool': 'no', 'state_of_the_building': 'to renovate', 'construction_year': 1930, 'surface_of_the_land': 398}
- The predicted price for the house is € 228964.0, a difference of -0.45 % compared to the posted asking price.
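The reported difference can be checked with a one-line formula. The function below is an illustrative helper, not code from the project; it assumes the difference is measured relative to the asking price.

```python
def difference_pct(predicted_price: float, asking_price: float) -> float:
    """Return how far the prediction is from the asking price, in percent."""
    return (predicted_price - asking_price) / asking_price * 100


# House example: predicted 228,964 vs. asking 230,000.
print(round(difference_pct(228964.0, 230000.0), 2))  # -0.45
```

A negative value means the model predicts a price below the asking price.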
Deployment on Heroku
- Create a web app using Flask as the framework
- Create a virtual environment called myenv
- pip install all libraries (flask, numpy, etc.)
- pip freeze --local > requirements.txt to create the list of libraries installed in the myenv environment
- Create a Flask app named app.py
- Add routes for the API
- Create the layout template
- Create a Procfile containing: web: gunicorn app:app
- pip install gunicorn and update requirements.txt
- Commit the code to GitHub
- To deploy on Heroku:
- Create an account on Heroku
- Link the GitHub repository to Heroku
- Using the terminal:
- heroku create
- git push heroku HEAD:master
- heroku ps:scale web=1
- heroku open
- Deployment successful.
- To test the deployed model on new, unseen data, an example from Immoweb was chosen:
- https://www.immoweb.be/en/classified/apartment/for-sale/anderlecht/1070/9042073?searchId=5fb749cc3354c
Apartment for sale
- €320,000
- 3 bedrooms | 130 m²
- 1070 — Anderlecht
- Construction year 2017
- Building condition As new
- Facades 2
- Kitchen type USA hyper equipped
- Terrace surface 14 m²
- Prediction with postal code (apartment): https://immoeliza-real-estate.herokuapp.com/apartment_postal_code
- Result: the predicted price for the apartment is € 324,034, a difference of 1.3 % compared to the posted asking price.
API for web dev
- Two API routes were created for the web developers to access the API:
- https://immoeliza-real-estate.herokuapp.com/predict_house_tojson2
- https://immoeliza-real-estate.herokuapp.com/predict_apartment_tojson2
- Each route returns a JSON response with 4 key-value pairs:
- "1. predicted price": str(output)
- "2. predicted price_sqm": str(pricem2)
- "3. {city_name} average price_sqm": str(price_sqm)
- "4. difference(%)": str(difference_pct)
- The API was tested with Postman.
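The four key-value pairs can be assembled as in the sketch below. Only the key names come from the list above; the function name `build_response`, the rounding, the sign convention of the difference, and the city average used in the example are assumptions for illustration.

```python
import json


def build_response(predicted_price, area, city_name, city_price_sqm):
    """Assemble the four key-value pairs as a dict of strings."""
    pricem2 = predicted_price / area
    # Assumed convention: difference relative to the city average.
    difference_pct = (pricem2 - city_price_sqm) / city_price_sqm * 100
    return {
        "1. predicted price": str(round(predicted_price)),
        "2. predicted price_sqm": str(round(pricem2, 2)),
        f"3. {city_name} average price_sqm": str(round(city_price_sqm, 2)),
        "4. difference(%)": str(round(difference_pct, 2)),
    }


# Predicted price and area from the apartment example; the city average is made up.
payload = build_response(324034.0, 130, "Anderlecht", 2500.0)
print(json.dumps(payload, indent=2, ensure_ascii=False))
```

In a Flask route, a dict like this would typically be returned through `jsonify`, which is what a client such as Postman or the web developers' form would receive.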
Challenges
- Time spent on training the models
- Deployment on Heroku
Limitations
- The model only predicts prices for general houses or apartments. It does not cover property subtypes such as villa, town house, mansion, country house, or other exceptional properties.
- When the unseen data belongs to one of these subtypes, the predicted price will deviate more from the actual price.
Further Development
- To obtain a more recent dataset from Immoweb or other property websites
- To include other features:
- Amenities: cellar, attic, parking
- View of the property
- Energy class
- Number of floors (for apartment)
- Elevator (for apartment)
- To explore other machine learning algorithms, e.g. CatBoost, LightGBM
- To propose related property on the website based on the inputs and predicted price
- To predict rental prices of houses or apartments
- To deploy the model on a virtual server in the cloud:
- Amazon Web Services (AWS) EC2 Instance
- Google Cloud platform
- Azure Cloud
- To include a prediction range, e.g. 10 % lower or higher than the predicted price
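The last item could start from something as simple as the sketch below. The function name and the fixed 10 % margin are illustrative assumptions; a fixed margin is not a statistical prediction interval, only a band around the point estimate.

```python
def prediction_range(predicted_price: float, margin_pct: float = 10.0) -> tuple:
    """Return (lower, upper) bounds at +/- margin_pct around the prediction."""
    delta = predicted_price * margin_pct / 100
    return predicted_price - delta, predicted_price + delta


# House example from the Evaluation section.
low, high = prediction_range(228964.0)
print(f"€ {low:,.0f} to € {high:,.0f}")
```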