Building a web scraper tool to analyze condo prices in Toronto

An attempt at taking a data-driven approach towards one of the most crucial decisions of one’s life: Buying a home

Karan Singh
Towards Data Science


Condos have always fascinated me, and in the city I live in, they are perhaps the first (and most realistic) option that comes to mind for many first-time home-buyers. Unless you have been living under a rock, you will be aware that house prices in Toronto have jumped manyfold over the last decade, and even more so in the last 5 years. More recently, it is the condo market that has been on fire. So if you, like me, are looking for a condo to buy, you are in the right place.

Photo by Sandro Schuh on Unsplash

Being a highly data-driven person, I built a web-scraper tool that helps me analyze the condo prices in Toronto. My go-to website is Condos.ca. It has a good user interface and provides market intelligence (which I thought would be useful in validating my results). At the time of writing this article, it has listings spanning over 80 web pages, and I shall extract data from the first 50 pages as described later in the article.

Image from Condos.ca homepage

The objective of this undertaking was 2-fold:

  • To scrape essential data on relevant parameters from the website to build a benchmark database
  • To conduct market research by performing some exploratory data analysis (EDA) on the database, such as average price per bedroom, average maintenance costs, average condo size, etc.
Data Capture (image src: Condos.ca)

I extracted the information displayed on every listing, such as the price, the street address, the number of bedrooms and bathrooms, whether it has parking or not, the size range, and the maintenance fees. (Note: Many other parameters affect condo prices, such as the age of the building, property tax, floor number, images, etc., but I have left these out for simplicity.)

It’s worth mentioning here that I had limited to no experience with HTML before performing this exercise. But herein lies the beauty of web scraping: you don’t need an advanced understanding of HTML. I simply learned how to extract the required values from the waterfall of tags within the HTML code. The rest is all Python! Here is a useful resource on how to scrape websites.

So let’s get started!

We begin by importing the necessary modules.

from bs4 import BeautifulSoup # For HTML parsing
from time import sleep # To prevent overwhelming the server between connections
from random import random # For randomized pauses between requests (used with sleep later)
import pandas as pd # For converting results to a dataframe and bar chart plots
# For visualizations
import matplotlib.pyplot as plt
import plotly.offline as py
import plotly.graph_objs as go
%matplotlib inline

As soon as I made the request to scrape the website, I ran into an error. This is because many websites try to block users from scraping data, and scraping may even be illegal depending on what one plans to do with the data (I had, however, obtained permission from condos.ca). To navigate this issue, I used Selenium, a tool that lets you programmatically interact with a browser the way a real user would. Although Selenium is primarily used to test web applications, it can be used for any task where you need browser automation.

from selenium import webdriver
from selenium.webdriver.common.keys import Keys

driver = webdriver.Chrome() # Launch a Chrome browser (requires ChromeDriver on the PATH)
driver.get("https://condos.ca")

Working with filters

Running the previous code opens up a new browser, loads the website, and lets you interact with it just as a real user would. For example, instead of manually clicking on the website to select filters such as the number of bedrooms or home type, or to provide a price range, Selenium does that with just a couple of lines of code. It also gives the user the ability to select multiple filters.
For example, to get 2-bedroom options, I use the following code to click on the button:

two_bed = driver.find_element_by_css_selector('insert_css_path')
two_bed.click()

Similarly, to get all the results with Gym, I simply use the following code:

Gym = driver.find_element_by_css_selector('insert_css_path')
Gym.click()

Defining a function to iterate over multiple pages

Because I want to be able to do this analysis for other cities, I define a function that creates a beautiful soup object using parameters of ‘city’, ‘mode’, and ‘page no’. Here, the ‘mode’ parameter takes in ‘Sale’ or ‘Rent’ as values, giving the user the ability to analyze rental prices as well!

def get_page(city, mode, page):
    url = f'https://condos.ca/{city}/condos-for-{mode}?mode={mode}&page={page}'
    driver.get(url)
    page_source = driver.page_source
    soup = BeautifulSoup(page_source, 'lxml')
    return soup

The function uses BeautifulSoup and returns an object called soup for a given webpage. It also loads up the requested webpage. (Later, I shall iterate over all the webpages to extract the soup object for each page.)

Now that we have the soup object, we can extract some useful information, such as the total number of listings and the number of listings per page, by parsing the webpage HTML. It’s not as difficult as it sounds! We use soup.find() to obtain the relevant tags. A useful approach is to start by extracting the data on the first page. If you can do that successfully, the rest is simply iterating the process over all the pages!

# Defining the soup object for page 1
soup = get_page('toronto', 'Sale', 1)

Extracting some relevant information on listings from the first page.

  • Total listings in Toronto:
# The total number of condo listings in Toronto
soup.find('div', class_='sc-AxjAm dWkXrE').find('span', class_='_5FYo1').get_text() # no. of listings: 3560
  • The number of listings on the first page:
len(soup.find_all('div', class_='_3O0GU')) # 43

Now that we are a little comfortable with this, we can be a bit more ambitious and extract all the prices on page 1.

prices = []
for tag in soup.find_all('div', class_='_2PKdn'):
    prices.append(tag.get_text())
prices[:5]
['$450,000',
'$649,900',
'$399,999',
'$599,900',
'$780,000']

To make things simpler, I defined a variable called condo_container that holds all the relevant data (price, location, size, etc.) for all the listings on a page.

condo_container = soup.find_all('div','_3SMhd')

Now, all we have to do is extract price and other data from this condo_container. See the example below:

# Obtaining the location of all listings on page 1
Location_list = []
for i in range(len(condo_container)):
    for tag in condo_container[i].find('span', class_='_1Gfb3'):
        Location_list.append(tag)
Location_list[:5]
['1001 - 19 Four Winds Dr',
'306 - 2 Aberfoyle Cres',
'911 - 100 Echo Pt',
'524 - 120 Dallimore Circ',
'1121 - 386 Yonge St']

Rinse and repeat the above process for all the variables, and we get all the lists that we need to construct a data-frame (see the sample code below). The process gets a little tricky when trying to extract parameters such as bathrooms, size, and parking owing to the HTML structure, but with a little bit of effort it can be done! A sketch of how such a field might be parsed is shown below. (I am not sharing the complete code on purpose, so as to avoid the reproduction of code.)
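Purely as an illustration (and not the code I actually used), here is a minimal sketch of how a combined info string could be split into bedroom, bathroom, and parking counts. The text format '2 Beds | 2 Baths | 1 Parking' and the helper parse_info are assumptions:

# Hypothetical sketch: parse a combined listing string into separate fields.
# The real tag classes and text format on condos.ca may differ.
import re

def parse_info(info_text):
    # e.g. info_text = '2 Beds | 2 Baths | 1 Parking'
    beds = re.search(r'(\d+)\s*Bed', info_text)
    baths = re.search(r'(\d+)\s*Bath', info_text)
    parking = re.search(r'(\d+)\s*Parking', info_text)
    return {
        'Bedrooms': int(beds.group(1)) if beds else None,
        'Bathrooms': int(baths.group(1)) if baths else None,
        'Parking': int(parking.group(1)) if parking else 0,
    }

parse_info('2 Beds | 2 Baths | 1 Parking')
# {'Bedrooms': 2, 'Bathrooms': 2, 'Parking': 1}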

Final Dataset

Now that we have all the lists, we simply append them to a dictionary called data, defined below. Some of the tags may look a little confusing, but that is because they have been converted from string type to integer wherever required.

data = {'Prices': [],
        'Location': [],
        'Date_listed': [],
        'Bedrooms': [],
        'Bathrooms': [],
        'Maint_Fees': [],
        'Size': [],
        'Parking': []
        }
final_list = []

for page in range(50):
    soup = get_page('toronto', 'Sale', page)
    condo_container = soup.find_all('div', '_3SMhd')
    sleep(random()) # Random pause so the server is not overwhelmed
    print(page)
    for i in range(len(condo_container)):
        listing = []
        price_tag = condo_container[i].find('div', class_='_2PKdn').get_text()
        formatted_tag = int(price_tag.split('$')[1].replace(',', ''))
        data['Prices'].append(formatted_tag)

        location_tag = condo_container[i].find('span', class_='_1Gfb3').get_text()
        data['Location'].append(location_tag)

        # maint_tag is extracted from its own tag in the same way (selector not shown here)
        if maint_tag != '':
            Maintenance_Fees = int(maint_tag.split('$')[1].replace(',', ''))
            data['Maint_Fees'].append(Maintenance_Fees)
        else:
            data['Maint_Fees'].append('error')

        for info_tag in condo_container[i].find('div', class_='_3FIJA'):
            listing.append(info_tag)

        final_list.append(listing)

Once we have the dictionary ready, we convert it into a pandas data-frame for further processing and EDA. The resulting data-frame has 2031 rows.
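The conversion itself is a one-liner (a minimal sketch; the variable name condos is assumed here so that it matches the plotting code further below):

# Convert the dictionary of lists assembled above into a DataFrame
condos = pd.DataFrame(data)
condos.head()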

Pandas DataFrame (Image by Author)

A quick look at the dataset tells us that Bedrooms, Bathrooms, Maint_Fees, and Size are object-type variables, because they were stored as strings in the HTML.

These variables were cleaned and converted to integer or float types. Moreover, I created a variable, Avg_Size, from the Size variable. On further inspection of the data, I found error values, which I replaced with the mean of the respective columns. I also engineered a Street Address variable from the Location variable, in case I want to perform some kind of location analysis later on. The dataset was treated for outliers, which were skewing my averages (expensive condos can get very expensive!). Missing values were imputed with the average or the most frequently occurring value in their respective columns. A sketch of a few of these cleaning steps follows below.
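Purely as an illustration, here is a minimal sketch of a few of these cleaning steps. The exact string formats are assumptions (the '600-699 sqft' size format is hypothetical, and the 'unit - street' location format is inferred from the sample output earlier):

# Hypothetical cleaning sketch -- the actual string formats on condos.ca may differ
condos['Maint_Fees'] = pd.to_numeric(condos['Maint_Fees'], errors='coerce')
condos['Maint_Fees'] = condos['Maint_Fees'].fillna(condos['Maint_Fees'].mean())

# Derive Avg_Size from a size range string such as '600-699 sqft' (assumed format)
def avg_size(size_range):
    try:
        low, high = size_range.replace('sqft', '').split('-')
        return (float(low) + float(high)) / 2
    except (ValueError, AttributeError):
        return None

condos['Avg_Size'] = condos['Size'].apply(avg_size)
condos['Avg_Size'] = condos['Avg_Size'].fillna(condos['Avg_Size'].mean())

# Engineer a Street Address from a Location string such as '1001 - 19 Four Winds Dr'
condos['Street_Address'] = condos['Location'].str.split(' - ').str[-1]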

Now that our dataset looks nice and clean, we can go ahead with some basic exploratory analysis!

Analysis / Results

I was curious how the average price and size varied by the number of bedrooms, which I think is the first thing on any buyer's mind! So I created some plots using Plotly (see the sample code below).

price_by_bed = condos.groupby('Bedrooms')['Prices'].mean()

data = go.Bar(
    x=price_by_bed.index,
    y=price_by_bed.values,
)
layout = go.Layout(
    title='Average Condo Price',
    xaxis_title="No. of Beds",
    yaxis_title="Value in $M"
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig)
Average Price by beds (Image by Author)
Average Size by Beds (Image by author)
Summary of Price and Size by beds (Image by author)
Average Price per sq/ft in Toronto (Image Src: Condos.ca)

The average price per square foot (taking only 1-, 2-, and 3-bedroom units into the subset) came out to $897, a little higher than the quoted average of $827/sq. ft on the day of my analysis. (Note: There has been a gradual decline in average prices since the onset of COVID, so the values shown here may differ from current values.)
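A minimal sketch of that calculation, assuming the Avg_Size column engineered earlier:

# Average price per square foot for 1-3 bedroom units (sketch; column names assumed as above)
subset = condos[condos['Bedrooms'].isin([1, 2, 3])]
price_per_sqft = (subset['Prices'] / subset['Avg_Size']).mean()
print(f'Average price per sq. ft: ${price_per_sqft:,.0f}')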

I was also able to analyze average maintenance fees by the number of bedrooms. An interesting insight was that maintenance fees can make or break your investment in a condo, since they could account for almost 25% of your monthly mortgage payment! (Something to keep in mind rather than just focusing on that hefty price tag.) A rough back-of-the-envelope check of that 25% figure follows below the chart.

Average Maintenance Fee by Condo Size (Image by author)
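As a rough back-of-the-envelope check of that 25% figure (all numbers here are illustrative assumptions, not values from the dataset, and the standard monthly amortization formula is used, ignoring Canadian semi-annual compounding):

# Illustrative assumptions: $650k condo, 20% down, 3% rate, 25-year amortization
principal = 650_000 * 0.80
monthly_rate = 0.03 / 12
n_payments = 25 * 12

mortgage_payment = principal * monthly_rate * (1 + monthly_rate) ** n_payments \
    / ((1 + monthly_rate) ** n_payments - 1)
maintenance_fee = 600 # assumed monthly maintenance fee

print(round(mortgage_payment)) # ~2466 per month
print(round(100 * maintenance_fee / mortgage_payment)) # ~24% of the mortgage payment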

Below, I analyzed the number of listings by average size and found that most condos on sale fall into the 600–699 sq. ft category (a sketch of the binning follows the chart).

Number of listings by Size (Image by Author)
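That distribution can be reproduced with a simple binning of the Avg_Size column (a sketch; the 100-sq.-ft bucket width is assumed to match the chart):

# Bin average sizes into 100-sq.-ft buckets and count listings per bucket (sketch)
bins = range(0, int(condos['Avg_Size'].max()) + 100, 100)
size_counts = pd.cut(condos['Avg_Size'], bins=bins).value_counts().sort_index()
print(size_counts)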

These were some of the interesting insights I derived from this web scraping exercise. I am sure that, armed with this knowledge, I now fall into the category of an ‘informed buyer’.

If you have any interesting points to share, I would love to hear your comments down below.

Thanks to the team at condos.ca for giving me permission to carry out this interesting and valuable exercise!

Disclaimer: Please note that the views, thoughts, and opinions expressed in the text belong solely to the author, and not necessarily to the author’s employer, organization, committee, or other group or individual. This is not investment advice and the author does not intend to use the data for any commercial purposes but for personal reasons only.
