The first part of this project presents how the data used in the project was scraped from the web. Unlike parts 2 and 3, this part is not extensively commented: only the main ideas behind how the code is structured are explained, and the code itself contains many comments describing the actions within each block. The overall structure could certainly be improved (the main function, for instance, could surely be made shorter); all comments are welcome. The "notes to self" written before coding started are included to show the thought process behind the scraping.
All the scraped data is public, not copyrighted, and intended for personal use within this project only. The following code makes many GET requests to the swimrankings.net website. For ethical reasons, a pause of varying length was inserted between requests, to avoid overloading the target server with a sudden burst of traffic. Data for major meets before 2015 is gated behind authentication, which makes it a "little bit less ethical" to scrape, so only data from 2015 to 2023 was collected.
Database models
We need to store the scraped data so that we can access it anytime we want. For that, we will create a database and corresponding mapped classes with SQLAlchemy ORM 2.0:
from sqlalchemy import Integer, String, ForeignKey
from sqlalchemy.orm import DeclarativeBase, mapped_column, relationship
class Base(DeclarativeBase):
pass
class MeetResult(Base):
__tablename__ = "meet_results"
id = mapped_column(Integer, primary_key=True)
swimmerId = mapped_column(ForeignKey("swimmers_info.id"), nullable=False)
majorMeet = mapped_column(String(20), nullable=False)
meetYear = mapped_column(String(5), nullable=False)
event = mapped_column(String(20), nullable=False)
soloTime = mapped_column(String(10), nullable=False)
soloReact = mapped_column(String(10), nullable=False)
soloRank = mapped_column(String(5))
relayEvent = mapped_column(String(20))
relayTime = mapped_column(String(10))
relayReact = mapped_column(String(10))
swimmer = relationship("Swimmer", back_populates="races")
class Swimmer(Base):
__tablename__ = "swimmers_info"
id = mapped_column(Integer, primary_key=True)
name = mapped_column(String(50), unique=True, nullable=False)
gender = mapped_column(String(10), nullable=False)
yearBirth = mapped_column(String(5), nullable=False)
country = mapped_column(String(20), nullable=False)
races = relationship("MeetResult", back_populates="swimmer")
Two mapped classes are created in order to keep the personal information of swimmers separate from the meet results. In the meet_results table, a column is created for every race-related variable we will collect data for; note the swimmerId column, a foreign key referencing the id column of the swimmers_info table, so that each meet result can be linked to its swimmer. A relationship() attribute is created in both classes, and they back-populate each other; this facilitates database operations that touch a swimmer and all their races at once, such as removing them from the database together at the end of the run. The last four columns of the meet_results table are nullable, since we will be populating them as we go.
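As a quick sanity check of the relationship, here is a minimal usage sketch (the swimmer and result data are made up for illustration): adding a result through its swimmer attribute is enough for the swimmer to be added too, and Swimmer.races is back-populated automatically.

from sqlalchemy import create_engine
from sqlalchemy.orm import Session

engine = create_engine("sqlite+pysqlite:///:memory:")
Base.metadata.create_all(engine)

with Session(engine) as session:
    # hypothetical swimmer and result, for illustration only:
    swimmer = Swimmer(name="Jane Doe", gender="2", yearBirth="1999", country="FRA")
    result = MeetResult(majorMeet="fukuoka_2023", meetYear="2023",
                        event="100m Freestyle", soloTime="53.20",
                        soloReact="0.65", swimmer=swimmer)
    session.add(result)  # the swimmer is added too, through the relationship
    session.commit()
    print(swimmer.races[0].event)  # back-populated: "100m Freestyle"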
Data scraping code
# imports:
from bs4 import BeautifulSoup
from sqlalchemy import create_engine, select, update, and_
from sqlalchemy.orm import sessionmaker
from sqlalchemy.exc import NoResultFound  # canonical location since SQLAlchemy 1.4
from models import Base, MeetResult, Swimmer # importing the database classes
from datetime import datetime
import requests
import time
import random as rd
import os
Defining the constants
# proxy logins:
USERNAME = os.environ['SMTPROXY_USR']
PASSWORD = os.environ['SMTPROXY_PWD']
PROXY = f"https://{USERNAME}:{PASSWORD}@fr.smartproxy.com:40009"
# Chrome user-agent for requests:
HEADER = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36"
}
# major meets and their "meetid" parameter for the url:
CHAMPIONSHIPS_DICT = {
"fukuoka_2023": 639635,
"budapest_2022": 632168,
"tokyo_2021": 626385,
"gwangju_2019": 616689,
"budapest_2017": 603267,
"rio_2016": 596227,
"kazan_2015": 589276,
}
# swimming events and their "value" parameter for the url:
SOLO_EVENTS_DICT = {
"100 Free": {"value": "2", "event": "100m Freestyle"},
"200 Free": {"value": "3", "event": "200m Freestyle"},
"100 Fly": {"value": "16", "event": "100m Butterfly"},
"100 Breast": {"value": "13", "event": "100m Breaststroke"},
"100 Back": {"value": "10", "event": "100m Backstroke"},
}
RELAY_EVENTS_DICT = {
"100 Breast lap": {"value": "63", "event": "100m Breaststroke"},
"100 Fly Lap": {"value": "65", "event": "100m Butterfly"},
"100 Free lap": {"value": "58", "event": "100m Freestyle"},
"200 Free lap": {"value": "59", "event": "200m Freestyle"},
}
# time formats of swim times:
TIME_FORMAT_LONG = "%M:%S.%f"
TIME_FORMAT_SHORT = "%S.%f"
The proxy location was rotated periodically, though not after every request, since per-request rotation would itself give away that the website was being hit by a bot behind a rotating proxy. Instead, the proxy was rotated every 20 to 30 minutes, to avoid the target server flagging us as a suspicious user.
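The rotation itself is not part of the scraping code below; a minimal sketch of the idea could look like this, assuming a small pool of proxy endpoints (the port numbers are purely illustrative and depend on the provider's configuration):

import time
import random as rd

PROXY_PORTS = [40008, 40009, 40010]  # hypothetical endpoint pool
last_rotation = time.monotonic()

def maybe_rotate_proxy(proxy):
    # swap to a random endpoint once 20 to 30 minutes have passed:
    global last_rotation
    if time.monotonic() - last_rotation > rd.randint(20, 30) * 60:
        proxy = f"https://{USERNAME}:{PASSWORD}@fr.smartproxy.com:{rd.choice(PROXY_PORTS)}"
        last_rotation = time.monotonic()
    return proxy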
Secondary functions
The function below gets the reaction time for a race performance. It takes as a parameter an HTML tag found by the main function (see next section) and returns the scraped reaction time as a string. Since it makes a GET request of its own, error handling is included.
Notes to self: Execute a GET request on the URL obtained, and find the reaction time in the page: find the <th class="date" colspan="4"> tag and fetch the reaction time contained in it. Ex: "Reaction time: + 0.50"
def get_reaction_time(race_result):
# get link to the page with race details:
reaction_time_link = "https://www.swimrankings.net/index.php" + race_result.find('td', class_="swimtime").a.get("href")
# hop over to the race details page (splits, reaction time...):
try:
# get reaction time:
reaction_response = requests.get(
reaction_time_link,
headers=HEADER,
proxies={
'http': PROXY,
'https': PROXY
})
print(reaction_response.status_code)
# start a timer to avoid overloading the server with requests:
time.sleep(rd.randint(6, 12))
reaction_soup = BeautifulSoup(reaction_response.text, 'html.parser')
try:
reaction_time = reaction_soup.find('th', class_="date", colspan="4").getText().split("+ ")[1]
except AttributeError:
return None
else:
return reaction_time
except requests.exceptions.ProxyError as proxy_error:
print(f"ProxyError: {proxy_error}")
except requests.exceptions.HTTPError as e:
print(f"HTTP error while fetching data from {reaction_time_link}: {e}")
except requests.exceptions.RequestException as e:
print(f"An error occurred while fetching data from {reaction_time_link}: {e}")
The function below creates a swimmer's profile in the swimmers_info database table; the main function calls it only when the swimmer does not already exist:
def create_new_swimmer(session, name, gender, year_birth, country):
try:
new_swimmer = Swimmer(name=name, gender=gender, yearBirth=year_birth, country=country)
except Exception as e:
print(f"Secondary exception: {e}")
else:
session.add(new_swimmer)
session.commit()
return new_swimmer
The function below creates a new row in the meet_results database table and fills in the individual race details:
def create_entry(session, swimmer, meet, event, swim_time, react_time, rank):
try:
solo_entry = MeetResult(
majorMeet=meet,
meetYear=meet.split('_')[1],
event=event,
soloTime=swim_time,
soloReact=react_time,
soloRank=rank,
swimmer=swimmer # relationship() attribute, back-populated by Swimmer.races
)
except Exception as e:
print(f"Secondary exception: {e}")
else:
session.add(solo_entry)
session.commit()
The function below updates an individual race entry when, while scraping the Semifinals results, a faster time than the one already recorded is found:
def update_solo(session, swimmer_id, meet, event, swim_time, react_time, rank):
try:
session.execute(
update(MeetResult)
.where(
and_(
MeetResult.swimmerId == swimmer_id,
MeetResult.majorMeet == meet,
MeetResult.event == event
)
)
.values(soloTime=swim_time, soloReact=react_time, soloRank=rank)
)
session.commit()
except Exception as e:
print(f"Secondary exception: {e}")
The function below does the same thing as update_solo, but for relay times:
def update_relay(session, swimmer_id, meet, event, relay_event, swim_time, react_time):
try:
session.execute(
update(MeetResult)
.where(
and_(
MeetResult.swimmerId == swimmer_id,
MeetResult.majorMeet == meet,
MeetResult.event == event
)
)
.values(relayEvent=relay_event, relayTime=swim_time, relayReact=react_time)
)
session.commit()
except Exception as e:
print(f"Secondary exception: {e}")
Main function
All the scraping and the population of our database is done within the main function. The code navigates the website much as a normal human user would, making pauses of random length, which of course made the whole scraping process take a long time. These measures were necessary because the target website proved particularly wary of unusual behaviour, and only cooperated when the scraping was done close to a real human pace. This is also why the code returns to the meet results page before visiting each next reaction-time page: hopping directly from one reaction-time page to the next (which we could do, since we fetched all the hrefs) would look like suspicious behaviour. The order in which the swimming events are scraped is also randomized, by shuffling the events dictionaries on every championship iteration. The scraped data is added to the database gradually, in the style known as "commit as you go". Lastly, we loop over all swimmers in the swimmers_info table: if a swimmer has no relay time in any of their performance rows, they are dropped from the database, since they will not be of any use to our study.
Notes to self:
- The backstroke leg always comes first in a relay, so there is no "100 Backstroke lap" page.
- Loop through Finals and Semifinals: the event title lives in <th class="event" colspan="4">, with text "Men/Women, *distance*m *event*, *Final/Semifinal/Swim Off Semifinal*, Open".
- In <form>, look for every <table class="meetResult"> whose <th class="event"> contains 'Final', 'Semifinal', or 'Swim Off Semifinal', then loop through every <tr class="meetResult0"> and <tr class="meetResult1">.
- If an athlete's name appears both in the Finals & a relay, or the Semifinals & a relay: find the name inside a <td width="150" class="name"> tag, then find the <td width="50" class="swimtime"> tag within the same <tr>, find the anchor inside that <td> and fetch the href from the <a class="time"> tag; the text inside the <a> tag is the swim time. Go to the href link and get the reaction time. Store the swimmer's information (Name, YoB, Country) in swimmers_info, then fetch the existing perf for the solo event, look for relay matches in <table class="meetResult"> whose <th class="event"> contains ('4 x' AND Final) OR ('4 x' AND Prelim), and add the relay race details to the perf row.
- Finally, drop the swimmer if no relay match was found.
def scrape_to_database(session):
url = 'https://www.swimrankings.net/'
requests.get(url, headers=HEADER, proxies={'http': PROXY, 'https': PROXY})
time.sleep(rd.randint(14, 24))
for meet, meetId in CHAMPIONSHIPS_DICT.items(): # loop over each championship in the dict
if "tokyo" in meet or "rio" in meet: # url for olympics
url = 'https://www.swimrankings.net/index.php?page=meetSelect&selectPage=BYTYPE&meetType=1'
else: # url for worlds
url = 'https://www.swimrankings.net/index.php?page=meetSelect&selectPage=BYTYPE&meetType=2'
requests.get(url, headers=HEADER, proxies={'http': PROXY, 'https': PROXY})
time.sleep(rd.randint(8, 16))
# go to meet:
url = f"https://www.swimrankings.net/index.php?page=meetDetail&meetId={meetId}"
requests.get(url, headers=HEADER, proxies={'http': PROXY, 'https': PROXY})
time.sleep(rd.randint(14, 21))
for gender in ["1", "2"]: # first scrape data for men, then women
# shuffle the solo and relay events dictionaries:
solo_keys = list(SOLO_EVENTS_DICT.keys())
relay_keys = list(RELAY_EVENTS_DICT.keys())
rd.shuffle(solo_keys)
rd.shuffle(relay_keys)
solo_dict = {key: SOLO_EVENTS_DICT[key] for key in solo_keys}
relay_dict = {key: RELAY_EVENTS_DICT[key] for key in relay_keys}
            events_dict = solo_dict | relay_dict  # merged, shuffled events
            for page in events_dict:  # iterating over every swimming event
                value = events_dict[page]["value"]
                event = events_dict[page]["event"]
# get meet results & error handling:
request_url = f'https://www.swimrankings.net/index.php?page=meetDetail&meetId={meetId}&gender={gender}&styleId={value}'
try:
html = requests.get(
request_url,
headers=HEADER,
proxies={
'http': PROXY,
'https': PROXY
}
).text
time.sleep(rd.randint(11, 16))
except requests.exceptions.HTTPError as e:
print(f"HTTP error while fetching data from {request_url}: {e}")
continue
except requests.exceptions.RequestException as e:
print(f"An error occurred while fetching data from {request_url}: {e}")
continue
soup = BeautifulSoup(html, 'html.parser')
# find all tables with class 'meetResult' which contain the race results:
tables = soup.find_all('table', class_='meetResult')
for table in tables:
# fetch race results:
results = table.find_all(name='tr', class_="meetResult0") + table.find_all(name='tr', class_="meetResult1")
# find the <th> element with the event title in it:
race = table.find('th', class_='event')
# check if "Final" is in the text of the <th> element (race)
if any(rnd in race.getText() for rnd in ["Final", "Semifinal", "Swim Off Semifinal"]) and "4 x " not in race.getText():
for result in results: # loop through Finals/Semis race results:
# random anti-ban pause to better simulate real human activity:
if rd.randint(1, 8) == 1:
time.sleep(rd.randint(7, 14))
# first, find the name of the athlete and their perf:
name = result.find('td', width="150", class_="name").getText()
swim_time = result.find('td', class_="swimtime").getText()
rank = result.find('td', class_="meetPlace").getText().split('.')[0]
# the swimmers that raced in the Finals will already have an entry:
try: # is this swimmer already in the database?
swimmer = session.execute(
select(Swimmer)
.filter_by(name=name)
).scalar()
if swimmer is None:
raise NoResultFound()
except NoResultFound: # swimmer not in database:
# fetch all the other info on the swimmer:
y_o_b = result.find('td', class_="yob").getText()
country = result.find('td', width="190", class_="name").getText()
# create the new swimmer's profile:
new_swimmer = create_new_swimmer(session, name, gender, y_o_b, country)
# get reaction time:
reaction_time = get_reaction_time(result)
try:
requests.get(request_url, headers=HEADER, proxies={'http': PROXY, 'https': PROXY})
except requests.exceptions.ProxyError as proxy_error:
print(f"ProxyError: {proxy_error}")
else: # fill out the entry:
create_entry(session, new_swimmer, meet, event, swim_time, reaction_time, rank)
time.sleep(rd.randint(8, 12))
else: # if swimmer already in database:
swimmer_id = swimmer.id
try: # have we already recorded a perf for this swimmer and event?
existing_perf = session.execute(
select(MeetResult)
.filter_by(swimmerId=swimmer_id, majorMeet=meet, event=event)
).scalar()
if existing_perf is None:
raise NoResultFound()
except NoResultFound: # if not, log the perf:
reaction_time = get_reaction_time(result)
create_entry(session, swimmer, meet, event, swim_time, reaction_time, rank)
else:
# if a swimmer's Semifinals time is better than their Finals time, replace with the best time:
if len(existing_perf.soloTime) > 5:
previous_time = datetime.strptime(existing_perf.soloTime, TIME_FORMAT_LONG)
else:
previous_time = datetime.strptime(existing_perf.soloTime, TIME_FORMAT_SHORT)
if len(swim_time) > 5:
current_time = datetime.strptime(swim_time, TIME_FORMAT_LONG)
else:
current_time = datetime.strptime(swim_time, TIME_FORMAT_SHORT)
if current_time < previous_time: # if faster time, replace
reaction_time = get_reaction_time(result)
try:
requests.get(request_url, headers=HEADER, proxies={'http': PROXY, 'https': PROXY})
except requests.exceptions.ProxyError as proxy_error:
print(f"ProxyError: {proxy_error}")
else: # update the swimmer's solo entry with the new best time:
update_solo(session, swimmer_id, meet, event, swim_time, reaction_time, rank)
time.sleep(rd.randint(5, 9))
# now onto the relays:
elif "4 x " in race.getText():
for result in results: # loop through all relay splits:
# anti-ban pause:
if rd.randint(1, 8) == 1:
time.sleep(rd.randint(7, 13))
name = result.find('td', width="150", class_="name").getText()
swim_time = result.find('td', class_="swimtime").getText()
# only include swimmers who swam the solo event:
try: # get swimmer
swimmer = session.execute(
select(Swimmer)
.filter_by(name=name)
).scalar()
if swimmer is None:
raise NoResultFound()
existing_perf = session.execute(
select(MeetResult)
.filter_by(swimmerId=swimmer.id, majorMeet=meet, event=event)
).scalar()
if existing_perf is None:
raise NoResultFound()
except NoResultFound:
pass
else:
swimmer_id = swimmer.id
# get the name of the relay event:
relay_event = race.getText().split('- ')[1].split(', Open')[0]
if existing_perf.relayEvent is not None: # check for already existing time
if len(existing_perf.relayTime) > 5:
previous_time = datetime.strptime(existing_perf.relayTime, TIME_FORMAT_LONG)
else:
previous_time = datetime.strptime(existing_perf.relayTime, TIME_FORMAT_SHORT)
if len(swim_time) > 5:
current_time = datetime.strptime(swim_time, TIME_FORMAT_LONG)
else:
current_time = datetime.strptime(swim_time, TIME_FORMAT_SHORT)
if current_time < previous_time: # if faster time, replace
reaction_time = get_reaction_time(result)
try:
requests.get(request_url, headers=HEADER, proxies={'http': PROXY, 'https': PROXY})
except requests.exceptions.ProxyError as proxy_error:
print(f"ProxyError: {proxy_error}")
else: # update the swimmer's entry with the new best time:
update_relay(session, swimmer_id, meet, event, relay_event, swim_time, reaction_time)
time.sleep(rd.randint(8, 13))
else:
reaction_time = get_reaction_time(result)
try:
requests.get(request_url, headers=HEADER, proxies={'http': PROXY, 'https': PROXY})
except requests.exceptions.ProxyError as proxy_error:
print(f"ProxyError: {proxy_error}")
else:
# create a new relay entry:
update_relay(session, swimmer_id, meet, event, relay_event, swim_time, reaction_time)
time.sleep(rd.randint(5, 9))
# anti-ban pause:
time.sleep(rd.randint(11, 24))
url = 'https://www.swimrankings.net/index.php?page=home'
requests.get(url, headers=HEADER, proxies={'http': PROXY, 'https': PROXY})
time.sleep(rd.randint(31, 75))
# removing races and possibly swimmers with no relay times, making use of the relationship() attributes of the db classes:
all_swimmers = session.execute(select(Swimmer)).scalars()
for swimmer in all_swimmers:
nb_races = len(swimmer.races)
k = 0 # race counter
for race in swimmer.races:
if race.relayEvent is None:
session.delete(race)
k += 1
if k == nb_races: # if all races were deleted (swimmer didn't swim any relay), swimmer is out of the database:
session.delete(swimmer)
session.commit()
Finally, the database tables are created and the entire code is run:
engine = create_engine("sqlite+pysqlite:///major_meets_2015-2023.db")
Session = sessionmaker(engine)
Base.metadata.create_all(engine)
with Session() as session:
scrape_to_database(session)
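Once the run is over, a quick check (illustrative only) confirms that the database was populated and that the relationship works as expected:

with Session() as session:
    swimmers = session.execute(select(Swimmer)).scalars().all()
    print(f"{len(swimmers)} swimmers stored")
    if swimmers:
        first = swimmers[0]
        print(first.name, first.country, len(first.races), "races")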