Property Analysis for JHB
This notebook will contain analysis for the housing market of Johannesburg. (As of late 2018)
import pandas as pd import numpy as np import matplotlib.pyplot as plt plt.style.use("ggplot") # Setting the style to ggplot (My personal pref) %matplotlib inline plt.rcParams['figure.figsize'] = [16, 8]
# Data Import csv_path = r"C:\Users\mdjco\Desktop\Py\data_csv.csv" df = pd.read_csv(csv_path)
Peaking our data
If we have a look at our data, we'll see that some values are missing, and that the data type of the price column is not correct.
Because the data was scraped from a website, some values might be missing - some might be incorrect.
print('Top 5 Rows of our data') print(df.head()) print('\nDFrame info') print(df.info())
Top 5 Rows of our data id hyperlink \ 0 105991866 /for-sale/lenasia/johannesburg/gauteng/4699/10... 1 106821708 /for-sale/lenasia/johannesburg/gauteng/4699/10... 2 106822297 /for-sale/lenasia-south/johannesburg/gauteng/3... 3 106863873 /for-sale/glenvista/johannesburg/gauteng/5317/... 4 106863817 /for-sale/mid-ennerdale/johannesburg/gauteng/3... area name \ 0 lenasia 4 Bedroom House for sale in Lenasia - Johannes... 1 lenasia 3 Bedroom House for sale in Lenasia - Johannes... 2 lenasia-south 4 Bedroom House for sale in Lenasia South - Jo... 3 glenvista 4 Bedroom House for sale in Glenvista - Johann... 4 mid-ennerdale House for sale in Mid Ennerdale - Johannesburg price bed bath 0 None 4.0 3.0 1 None 3.0 4.0 2 None 4.0 2.0 3 2199000.0000 4.0 3.0 4 650000.0000 NaN NaN DFrame info <class 'pandas.core.frame.DataFrame'> RangeIndex: 5297 entries, 0 to 5296 Data columns (total 7 columns): id 5297 non-null int64 hyperlink 5297 non-null object area 5277 non-null object name 5297 non-null object price 5297 non-null object bed 5173 non-null float64 bath 5173 non-null float64 dtypes: float64(2), int64(1), object(4) memory usage: 289.8+ KB None
Cleaning the Data
- We'll have to set our price to the correct dtype
- We'll set our id to the index
- We'll Make sure that our data doesn't contain duplicate houses (Each id should be unique)
- We'll drop null values, just to make sure that empty values don't scew the results
# Setting price to correct type: df["price"] = pd.to_numeric(df["price"], errors="coerce") # Invalid parses, will be treated as NaN # Dropping anything that is Nan, as well as duplicates df = df.drop_duplicates(subset=["id"], keep=False).dropna() # Setting the id as the index df = df.set_index('id')
<class 'pandas.core.frame.DataFrame'> Int64Index: 4887 entries, 106863873 to 106680615 Data columns (total 6 columns): hyperlink 4887 non-null object area 4887 non-null object name 4887 non-null object price 4887 non-null float64 bed 4887 non-null float64 bath 4887 non-null float64 dtypes: float64(3), object(3) memory usage: 267.3+ KB
|106863873||/for-sale/glenvista/johannesburg/gauteng/5317/...||glenvista||4 Bedroom House for sale in Glenvista - Johann...||2199000.0||4.0||3.0|
|106863540||/for-sale/bassonia/johannesburg/gauteng/5246/1...||bassonia||3 Bedroom House for sale in Bassonia - Johanne...||8700000.0||3.0||1.0|
|106863472||/for-sale/melrose/johannesburg/gauteng/5837/10...||melrose||3 Bedroom House for sale in Melrose - Johannes...||9450000.0||3.0||3.5|
|106863377||/for-sale/lenasia-south/johannesburg/gauteng/3...||lenasia-south||3 Bedroom House for sale in Lenasia South - Jo...||1095000.0||3.0||2.0|
|106863300||/for-sale/cyrildene/johannesburg/gauteng/5922/...||cyrildene||3 Bedroom House for sale in Cyrildene - Johann...||2399000.0||3.0||2.0|
Making Observations based on our Data
How many Houses are for sale?
(Limiting this to areas that have more than 50 houses for sale, As there are a lot of areas...)
# Value count of the houses in the area s = pd.value_counts(df["area"]) # We filter s so that there aren't that many names (Areas) s = s[s > 50] s.plot.bar(title="Areas with more than 50 Houses for Sale:")
<matplotlib.axes._subplots.AxesSubplot at 0x21c7ce4e198>
Where can you find the most expensive houses?
s_avg = df.groupby('area')['price'].mean() s_avg_greater_than5mil = s_avg[s_avg > 5000000] # Areas that have a house average value over R 5 000 000 s_avg_greater_than5mil.plot.bar(title='Avg. Price Per Area over R 5 000 000 ')
<matplotlib.axes._subplots.AxesSubplot at 0x21c7cd41da0>
Does the amount of rooms correlate to the most expensive houses?
Lets Check if the amount of rooms, on average, that a house in a certain area has, correlates with the price in that area.
s = df.groupby('area')['bed'].mean() s_greaterthan5 = s[s >= 5] s_greaterthan5.plot.bar(title="Bedrooms per area (More than 5 Rooms)")
<matplotlib.axes._subplots.AxesSubplot at 0x21c7cdb9470>
Well that's interesting?
We see some of the higher values areas... but there's one that stands out... What's going on highlands ... (That's a less than good area btw ...)
# Lets take a look at what we have for highlands then df_highlands = df.loc[df['area'] == 'highlands'] df_highlands.head()
|105243191||/for-sale/highlands/johannesburg/gauteng/5924/...||highlands||18 Bedroom House for sale in Highlands - Johan...||1399999.0||18.0||13.0|
We can see we only have one House/flat/investment property thing there... well let's make sure that this is correct still...
import IPython house_url = df_highlands.iloc['hyperlink'] # We know there is only one house url = 'https://www.property24.com' + house_url #iframe = '<iframe src=' + url + ' width=900 height=350></iframe>' #IPython.display.HTML(iframe) # Note, this may change # Not rendering the iframe, creates a weird issue with the notebook. Feel free to look at the URL destination though. print(url)
room_cost_per_area = np.divide(s_avg, s) room_cost_per_area.plot.bar(title="Average price for a single room in Area")
<matplotlib.axes._subplots.AxesSubplot at 0x21c7cfb5f28>
# The houses where the room price is over 2 million room_costOver2Mil = room_cost_per_area[room_cost_per_area > 2000000] room_costOver2Mil.plot.bar(title="Average price for a single room in Area, Over 2 Million")
<matplotlib.axes._subplots.AxesSubplot at 0x21c7dc1cef0>
# The houses where the room price is under R250 000 room_costUnderMil = room_cost_per_area[room_cost_per_area < 250000] room_costUnderMil.plot.bar(title="Average price for a single room in Area, Under R250 000")
<matplotlib.axes._subplots.AxesSubplot at 0x21c7dc97470>