The first part of this project presents how the data used in the project was scraped from the web. Unlike parts 2 and 3, this part is not extensively commented: only the main ideas behind how the code is structured are explained, and the code itself contains many comments describing the actions within each block. The overall structure could certainly be improved (the main function, for instance, could surely be made shorter); all comments are welcome. The "notes to self" written before coding started are included to show the thought process behind the scraping.
All the scraped data is public, not copyrighted, and intended for personal use within this project only. The following code makes many GET requests to the swimrankings.net website. For ethical reasons, a pause of varying length was inserted between requests, to avoid overloading the target server with a sudden burst of traffic. Data for major meets before 2015 is gated behind authentication, which makes it a "little bit less ethical" to scrape, so only data from 2015 to 2023 was collected.
Database models
We need to store the scraped data so that we can access it anytime we want. For that, we will create a database and corresponding mapped classes with SQLAlchemy ORM 2.0:
from sqlalchemy import Integer, String, ForeignKey
from sqlalchemy.orm import DeclarativeBase, mapped_column, relationship
class Base(DeclarativeBase):
pass
class MeetResult(Base):
__tablename__ = "meet_results"
id = mapped_column(Integer, primary_key=True)
swimmerId = mapped_column(ForeignKey("swimmers_info.id"), nullable=False)
majorMeet = mapped_column(String(20), nullable=False)
meetYear = mapped_column(String(5), nullable=False)
event = mapped_column(String(20), nullable=False)
soloTime = mapped_column(String(10), nullable=False)
soloReact = mapped_column(String(10), nullable=False)
soloRank = mapped_column(String(5))
relayEvent = mapped_column(String(20))
relayTime = mapped_column(String(10))
relayReact = mapped_column(String(10))
swimmer = relationship("Swimmer", back_populates="races")
class Swimmer(Base):
__tablename__ = "swimmers_info"
id = mapped_column(Integer, primary_key=True)
name = mapped_column(String(50), unique=True, nullable=False)
gender = mapped_column(String(10), nullable=False)
yearBirth = mapped_column(String(5), nullable=False)
country = mapped_column(String(20), nullable=False)
races = relationship("MeetResult", back_populates="swimmer")
Two mapped classes are created in order to keep the personal information of swimmers separate from the meet results. In the meet_results table, a column is created for every race-related variable we will collect data for; note the swimmerId column, a foreign key referencing the id column of the swimmers_info table, so that each meet result can be linked to its swimmer. A relationship() attribute is created in both classes, and they back-populate each other; this facilitates database operations that touch a swimmer and all their races at once, such as removing them from the database together at the end of the run. The last four columns of the meet_results table are nullable, since we will be populating them as we go.
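As a quick sanity check of the relationship, here is a minimal usage sketch (the swimmer and result data are made up for illustration): adding a result through its swimmer attribute is enough for the swimmer to be added too, and Swimmer.races is back-populated automatically.

from sqlalchemy import create_engine
from sqlalchemy.orm import Session

engine = create_engine("sqlite+pysqlite:///:memory:")
Base.metadata.create_all(engine)

with Session(engine) as session:
    # hypothetical swimmer and result, for illustration only:
    swimmer = Swimmer(name="Jane Doe", gender="2", yearBirth="1999", country="FRA")
    result = MeetResult(majorMeet="fukuoka_2023", meetYear="2023",
                        event="100m Freestyle", soloTime="53.20",
                        soloReact="0.65", swimmer=swimmer)
    session.add(result)  # the swimmer is added too, through the relationship
    session.commit()
    print(swimmer.races[0].event)  # back-populated: "100m Freestyle"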
Data scraping code
# imports:
from bs4 import BeautifulSoup
from sqlalchemy import create_engine, select, update, and_
from sqlalchemy.orm import sessionmaker
from sqlalchemy.exc import NoResultFound  # canonical location since SQLAlchemy 1.4
from models import Base, MeetResult, Swimmer # importing the database classes
from datetime import datetime
import requests
import time
import random as rd
import os
Defining the constants
# proxy logins:
USERNAME = os.environ['SMTPROXY_USR']
PASSWORD = os.environ['SMTPROXY_PWD']
PROXY = f"https://{USERNAME}:{PASSWORD}@fr.smartproxy.com:40009"
# Chrome user-agent for requests:
HEADER = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36"
}
# major meets and their "meetid" parameter for the url:
CHAMPIONSHIPS_DICT = {
"fukuoka_2023": 639635,
"budapest_2022": 632168,
"tokyo_2021": 626385,
"gwangju_2019": 616689,
"budapest_2017": 603267,
"rio_2016": 596227,
"kazan_2015": 589276,
}
# swimming events and their "value" parameter for the url:
SOLO_EVENTS_DICT = {
"100 Free": {"value": "2", "event": "100m Freestyle"},
"200 Free": {"value": "3", "event": "200m Freestyle"},
"100 Fly": {"value": "16", "event": "100m Butterfly"},
"100 Breast": {"value": "13", "event": "100m Breaststroke"},
"100 Back": {"value": "10", "event": "100m Backstroke"},
}
RELAY_EVENTS_DICT = {
"100 Breast lap": {"value": "63", "event": "100m Breaststroke"},
"100 Fly Lap": {"value": "65", "event": "100m Butterfly"},
"100 Free lap": {"value": "58", "event": "100m Freestyle"},
"200 Free lap": {"value": "59", "event": "200m Freestyle"},
}
# time formats of swim times:
TIME_FORMAT_LONG = "%M:%S.%f"
TIME_FORMAT_SHORT = "%S.%f"
The proxy location was rotated periodically, though not after every request, since per-request rotation would itself give away that the website was being hit by a bot behind a rotating proxy. Instead, the proxy was rotated every 20 to 30 minutes, to avoid the target server flagging us as a suspicious user.
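The rotation itself is not part of the scraping code below; a minimal sketch of the idea could look like this, assuming a small pool of proxy endpoints (the port numbers are purely illustrative and depend on the provider's configuration):

import time
import random as rd

PROXY_PORTS = [40008, 40009, 40010]  # hypothetical endpoint pool
last_rotation = time.monotonic()

def maybe_rotate_proxy(proxy):
    # swap to a random endpoint once 20 to 30 minutes have passed:
    global last_rotation
    if time.monotonic() - last_rotation > rd.randint(20, 30) * 60:
        proxy = f"https://{USERNAME}:{PASSWORD}@fr.smartproxy.com:{rd.choice(PROXY_PORTS)}"
        last_rotation = time.monotonic()
    return proxy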
Secondary functions
The function below gets the reaction time for a race performance. It takes as a parameter an HTML tag found by the main function (see next section) and returns the scraped reaction time as a string. Since it makes a GET request of its own, error handling is included.
Notes to self: Execute a GET request on the URL obtained, and find the reaction time in the page: find the <th class="date" colspan="4"> tag and fetch the reaction time contained in it. Ex: "Reaction time: + 0.50"
def get_reaction_time(race_result):
# get link to the page with race details:
reaction_time_link = "https://www.swimrankings.net/index.php" + race_result.find('td', class_="swimtime").a.get("href")
# hop over to the race details page (splits, reaction time...):
try:
# get reaction time:
reaction_response = requests.get(
reaction_time_link,
headers=HEADER,
proxies={
'http': PROXY,
'https': PROXY
})
print(reaction_response.status_code)
# start a timer to avoid overloading the server with requests:
time.sleep(rd.randint(6, 12))
reaction_soup = BeautifulSoup(reaction_response.text, 'html.parser')
try:
reaction_time = reaction_soup.find('th', class_="date", colspan="4").getText().split("+ ")[1]
except AttributeError:
return None
else:
return reaction_time
except requests.exceptions.ProxyError as proxy_error:
print(f"ProxyError: {proxy_error}")
except requests.exceptions.HTTPError as e:
print(f"HTTP error while fetching data from {reaction_time_link}: {e}")
except requests.exceptions.RequestException as e:
print(f"An error occurred while fetching data from {reaction_time_link}: {e}")
The function below creates a swimmer's profile in the swimmers_info database table; the main function calls it only when the swimmer does not already exist:
def create_new_swimmer(session, name, gender, year_birth, country):
try:
new_swimmer = Swimmer(name=name, gender=gender, yearBirth=year_birth, country=country)
except Exception as e:
print(f"Secondary exception: {e}")
else:
session.add(new_swimmer)
session.commit()
return new_swimmer
The function below creates a new row in the meet_results database table and fills in the individual race details:
def create_entry(session, swimmer, meet, event, swim_time, react_time, rank):
try:
solo_entry = MeetResult(
majorMeet=meet,
meetYear=meet.split('_')[1],
event=event,
soloTime=swim_time,
soloReact=react_time,
soloRank=rank,
swimmer=swimmer # relationship() attribute, back-populated by Swimmer.races
)
except Exception as e:
print(f"Secondary exception: {e}")
else:
session.add(solo_entry)
session.commit()
The function below updates an individual race entry when, while scraping the Semifinals results, a faster time than the one already recorded is found:
def update_solo(session, swimmer_id, meet, event, swim_time, react_time, rank):
try:
session.execute(
update(MeetResult)
.where(
and_(
MeetResult.swimmerId == swimmer_id,
MeetResult.majorMeet == meet,
MeetResult.event == event
)
)
.values(soloTime=swim_time, soloReact=react_time, soloRank=rank)
)
session.commit()
except Exception as e:
print(f"Secondary exception: {e}")
The function below does the same thing as update_solo, but for relay times:
def update_relay(session, swimmer_id, meet, event, relay_event, swim_time, react_time):
try:
session.execute(
update(MeetResult)
.where(
and_(
MeetResult.swimmerId == swimmer_id,
MeetResult.majorMeet == meet,
MeetResult.event == event
)
)
.values(relayEvent=relay_event, relayTime=swim_time, relayReact=react_time)
)
session.commit()
except Exception as e:
print(f"Secondary exception: {e}")
Main function
All the scraping and the population of our database is done within the main function. The code navigates the website much as a normal human user would, making pauses of random length, which of course made the whole scraping process take a long time. These measures were necessary because the target website proved particularly wary of unusual behaviour, and only cooperated when the scraping was done close to a real human pace. This is also why the code returns to the meet results page before visiting each next reaction-time page: hopping directly from one reaction-time page to the next (which we could do, since we fetched all the hrefs) would look like suspicious behaviour. The order in which the swimming events are scraped is also randomized, by shuffling the events dictionaries on every championship iteration. The scraped data is added to the database gradually, in the style known as "commit as you go". Lastly, we loop over all swimmers in the swimmers_info table: if a swimmer has no relay time in any of their performance rows, they are dropped from the database, since they will not be of any use to our study.
Notes to self:
- The backstroke leg always comes first in a relay, so there is no "100 Backstroke lap" page.
- Loop through Finals and Semifinals: the event title lives in <th class="event" colspan="4">, with text "Men/Women, *distance*m *event*, *Final/Semifinal/Swim Off Semifinal*, Open".
- In <form>, look for every <table class="meetResult"> whose <th class="event"> contains 'Final', 'Semifinal', or 'Swim Off Semifinal', then loop through every <tr class="meetResult0"> and <tr class="meetResult1">.
- If an athlete's name appears both in the Finals & a relay, or the Semifinals & a relay: find the name inside a <td width="150" class="name"> tag, then find the <td width="50" class="swimtime"> tag within the same <tr>, find the anchor inside that <td> and fetch the href from the <a class="time"> tag; the text inside the <a> tag is the swim time. Go to the href link and get the reaction time. Store the swimmer's information (Name, YoB, Country) in swimmers_info, then fetch the existing perf for the solo event, look for relay matches in <table class="meetResult"> whose <th class="event"> contains ('4 x' AND Final) OR ('4 x' AND Prelim), and add the relay race details to the perf row.
- Finally, drop the swimmer if no relay match was found.
def scrape_to_database(session):
url = 'https://www.swimrankings.net/'
requests.get(url, headers=HEADER, proxies={'http': PROXY, 'https': PROXY})
time.sleep(rd.randint(14, 24))
for meet, meetId in CHAMPIONSHIPS_DICT.items(): # loop over each championship in the dict
if "tokyo" in meet or "rio" in meet: # url for olympics
url = 'https://www.swimrankings.net/index.php?page=meetSelect&selectPage=BYTYPE&meetType=1'
else: # url for worlds
url = 'https://www.swimrankings.net/index.php?page=meetSelect&selectPage=BYTYPE&meetType=2'
requests.get(url, headers=HEADER, proxies={'http': PROXY, 'https': PROXY})
time.sleep(rd.randint(8, 16))
# go to meet:
url = f"https://www.swimrankings.net/index.php?page=meetDetail&meetId={meetId}"
requests.get(url, headers=HEADER, proxies={'http': PROXY, 'https': PROXY})
time.sleep(rd.randint(14, 21))
for gender in ["1", "2"]: # first scrape data for men, then women
# shuffle the solo and relay events dictionaries:
solo_keys = list(SOLO_EVENTS_DICT.keys())
relay_keys = list(RELAY_EVENTS_DICT.keys())
rd.shuffle(solo_keys)
rd.shuffle(relay_keys)
solo_dict = {key: SOLO_EVENTS_DICT[key] for key in solo_keys}
relay_dict = {key: RELAY_EVENTS_DICT[key] for key in relay_keys}
            events_dict = solo_dict | relay_dict  # merged, shuffled events
            for page in events_dict:  # iterating over every swimming event
                value = events_dict[page]["value"]
                event = events_dict[page]["event"]
# get meet results & error handling:
request_url = f'https://www.swimrankings.net/index.php?page=meetDetail&meetId={meetId}&gender={gender}&styleId={value}'
try:
html = requests.get(
request_url,
headers=HEADER,
proxies={
'http': PROXY,
'https': PROXY
}
).text
time.sleep(rd.randint(11, 16))
except requests.exceptions.HTTPError as e:
print(f"HTTP error while fetching data from {request_url}: {e}")
continue
except requests.exceptions.RequestException as e:
print(f"An error occurred while fetching data from {request_url}: {e}")
continue
soup = BeautifulSoup(html, 'html.parser')
# find all tables with class 'meetResult' which contain the race results:
tables = soup.find_all('table', class_='meetResult')
for table in tables:
# fetch race results:
results = table.find_all(name='tr', class_="meetResult0") + table.find_all(name='tr', class_="meetResult1")
# find the <th> element with the event title in it:
race = table.find('th', class_='event')
# check if "Final" is in the text of the <th> element (race)
if any(rnd in race.getText() for rnd in ["Final", "Semifinal", "Swim Off Semifinal"]) and "4 x " not in race.getText():
for result in results: # loop through Finals/Semis race results:
# random anti-ban pause to better simulate real human activity:
if rd.randint(1, 8) == 1:
time.sleep(rd.randint(7, 14))
# first, find the name of the athlete and their perf:
name = result.find('td', width="150", class_="name").getText()
swim_time = result.find('td', class_="swimtime").getText()
rank = result.find('td', class_="meetPlace").getText().split('.')[0]
# the swimmers that raced in the Finals will already have an entry:
try: # is this swimmer already in the database?
swimmer = session.execute(
select(Swimmer)
.filter_by(name=name)
).scalar()
if swimmer is None:
raise NoResultFound()
except NoResultFound: # swimmer not in database:
# fetch all the other info on the swimmer:
y_o_b = result.find('td', class_="yob").getText()
country = result.find('td', width="190", class_="name").getText()
# create the new swimmer's profile:
new_swimmer = create_new_swimmer(session, name, gender, y_o_b, country)
# get reaction time:
reaction_time = get_reaction_time(result)
try:
requests.get(request_url, headers=HEADER, proxies={'http': PROXY, 'https': PROXY})
except requests.exceptions.ProxyError as proxy_error:
print(f"ProxyError: {proxy_error}")
else: # fill out the entry:
create_entry(session, new_swimmer, meet, event, swim_time, reaction_time, rank)
time.sleep(rd.randint(8, 12))
else: # if swimmer already in database:
swimmer_id = swimmer.id
try: # have we already recorded a perf for this swimmer and event?
existing_perf = session.execute(
select(MeetResult)
.filter_by(swimmerId=swimmer_id, majorMeet=meet, event=event)
).scalar()
if existing_perf is None:
raise NoResultFound()
except NoResultFound: # if not, log the perf:
reaction_time = get_reaction_time(result)
create_entry(session, swimmer, meet, event, swim_time, reaction_time, rank)
else:
# if a swimmer's Semifinals time is better than their Finals time, replace with the best time:
if len(existing_perf.soloTime) > 5:
previous_time = datetime.strptime(existing_perf.soloTime, TIME_FORMAT_LONG)
else:
previous_time = datetime.strptime(existing_perf.soloTime, TIME_FORMAT_SHORT)
if len(swim_time) > 5:
current_time = datetime.strptime(swim_time, TIME_FORMAT_LONG)
else:
current_time = datetime.strptime(swim_time, TIME_FORMAT_SHORT)
if current_time < previous_time: # if faster time, replace
reaction_time = get_reaction_time(result)
try:
requests.get(request_url, headers=HEADER, proxies={'http': PROXY, 'https': PROXY})
except requests.exceptions.ProxyError as proxy_error:
print(f"ProxyError: {proxy_error}")
else: # update the swimmer's solo entry with the new best time:
update_solo(session, swimmer_id, meet, event, swim_time, reaction_time, rank)
time.sleep(rd.randint(5, 9))
# now onto the relays:
elif "4 x " in race.getText():
for result in results: # loop through all relay splits:
# anti-ban pause:
if rd.randint(1, 8) == 1:
time.sleep(rd.randint(7, 13))
name = result.find('td', width="150", class_="name").getText()
swim_time = result.find('td', class_="swimtime").getText()
# only include swimmers who swam the solo event:
try: # get swimmer
swimmer = session.execute(
select(Swimmer)
.filter_by(name=name)
).scalar()
if swimmer is None:
raise NoResultFound()
existing_perf = session.execute(
select(MeetResult)
.filter_by(swimmerId=swimmer.id, majorMeet=meet, event=event)
).scalar()
if existing_perf is None:
raise NoResultFound()
except NoResultFound:
pass
else:
swimmer_id = swimmer.id
# get the name of the relay event:
relay_event = race.getText().split('- ')[1].split(', Open')[0]
if existing_perf.relayEvent is not None: # check for already existing time
if len(existing_perf.relayTime) > 5:
previous_time = datetime.strptime(existing_perf.relayTime, TIME_FORMAT_LONG)
else:
previous_time = datetime.strptime(existing_perf.relayTime, TIME_FORMAT_SHORT)
if len(swim_time) > 5:
current_time = datetime.strptime(swim_time, TIME_FORMAT_LONG)
else:
current_time = datetime.strptime(swim_time, TIME_FORMAT_SHORT)
if current_time < previous_time: # if faster time, replace
reaction_time = get_reaction_time(result)
try:
requests.get(request_url, headers=HEADER, proxies={'http': PROXY, 'https': PROXY})
except requests.exceptions.ProxyError as proxy_error:
print(f"ProxyError: {proxy_error}")
else: # update the swimmer's entry with the new best time:
update_relay(session, swimmer_id, meet, event, relay_event, swim_time, reaction_time)
time.sleep(rd.randint(8, 13))
else:
reaction_time = get_reaction_time(result)
try:
requests.get(request_url, headers=HEADER, proxies={'http': PROXY, 'https': PROXY})
except requests.exceptions.ProxyError as proxy_error:
print(f"ProxyError: {proxy_error}")
else:
# create a new relay entry:
update_relay(session, swimmer_id, meet, event, relay_event, swim_time, reaction_time)
time.sleep(rd.randint(5, 9))
# anti-ban pause:
time.sleep(rd.randint(11, 24))
url = 'https://www.swimrankings.net/index.php?page=home'
requests.get(url, headers=HEADER, proxies={'http': PROXY, 'https': PROXY})
time.sleep(rd.randint(31, 75))
# removing races and possibly swimmers with no relay times, making use of the relationship() attributes of the db classes:
all_swimmers = session.execute(select(Swimmer)).scalars()
for swimmer in all_swimmers:
nb_races = len(swimmer.races)
k = 0 # race counter
for race in swimmer.races:
if race.relayEvent is None:
session.delete(race)
k += 1
if k == nb_races: # if all races were deleted (swimmer didn't swim any relay), swimmer is out of the database:
session.delete(swimmer)
session.commit()
Finally, the database tables are created and the entire code is run:
engine = create_engine("sqlite+pysqlite:///major_meets_2015-2023.db")
Session = sessionmaker(engine)
Base.metadata.create_all(engine)
with Session() as session:
scrape_to_database(session)
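Once the run is over, a quick check (illustrative only) confirms that the database was populated and that the relationship works as expected:

with Session() as session:
    swimmers = session.execute(select(Swimmer)).scalars().all()
    print(f"{len(swimmers)} swimmers stored")
    if swimmers:
        first = swimmers[0]
        print(first.name, first.country, len(first.races), "races")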