Overview
I find tennis to be one of the most fascinating sports to watch, especially because of the mental strength required of players, who have no one to lean on but themselves. They play a year-long season with few breaks and face tournament elimination in every single match.
The goal of this project is to quantify the mental strength of men’s and women’s (ATP and WTA) tennis players. The process involves simulating the match win probability at every point of every Grand Slam match, and determining the probability swing, or importance, of each point. This makes it possible to measure how consistently players perform regardless of point importance, and how effectively they rise to the occasion in high-leverage situations.
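To make the idea of point importance concrete, here is a minimal sketch at the level of a single game rather than a full match. It is not the project's simulation: it uses an exact recursion instead of simulated outcomes, and it assumes the server wins every point independently with a fixed probability `p`. The function names `game_win_prob` and `point_importance` are hypothetical, and importance is taken here to mean the gap in win probability between winning and losing the current point.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def game_win_prob(p: float, a: int = 0, b: int = 0) -> float:
    """Probability that the server wins the game from a points to b points.

    Assumes each point is won by the server independently with probability p.
    """
    if a >= 4 and a - b >= 2:
        return 1.0  # server has already won the game
    if b >= 4 and b - a >= 2:
        return 0.0  # returner has already won the game
    if a >= 3 and b >= 3:
        # Deuce, advantage in, or advantage out: use the closed form for deuce.
        deuce = p * p / (p * p + (1 - p) * (1 - p))
        if a == b:
            return deuce
        return p + (1 - p) * deuce if a > b else p * deuce
    # Otherwise, condition on the outcome of the next point.
    return p * game_win_prob(p, a + 1, b) + (1 - p) * game_win_prob(p, a, b + 1)


def point_importance(p: float, a: int, b: int) -> float:
    """Win-probability swing of the next point: P(win | won) - P(win | lost)."""
    return game_win_prob(p, a + 1, b) - game_win_prob(p, a, b + 1)


if __name__ == "__main__":
    print(point_importance(0.5, 3, 3))  # deuce: 0.5
    print(point_importance(0.5, 3, 0))  # 40-0:  0.125
```

With evenly matched players, a point at deuce swings the game win probability by 0.5, while a point at 40-0 swings it by only 0.125. The same comparison, applied to match win probability instead of game win probability, is what separates high-leverage points from routine ones.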
The code repository, with additional detail on the methodology of the end-to-end data pipeline, is available on GitHub.
Application
Interactive Filters
- Year Range
- Tournament (Australian Open, French Open, Wimbledon, US Open)
- Player Status (Active vs. Inactive/Retired)
- Minimum Points Played (defaults to 400 per year, to filter out small sample sizes)
Visualize Results
- Player Consistency vs. Clutchness
- Breakdowns of the Most and Least Consistent and Clutch Players
- Player Performance in High-Pressure Situations (with comparison to their overall baseline)
- Most Unlikely Wins (with match state and probability at winner’s low point)
- Highest Leverage Points (with probability swing and point winner)
- Additional details on simulation and data pipeline methodology
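As a rough illustration of comparing high-pressure performance to a player's overall baseline, the sketch below computes the difference between a player's win rate on high-importance points and their win rate on all points. The `Point` record, the `clutch_delta` name, and the `threshold` of 0.1 are all assumptions made for this example. The project's actual clutchness metric and cutoff are defined in the repository and may differ.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Point:
    player: str        # player whose perspective the point is recorded from
    won: bool          # whether that player won the point
    importance: float  # win-probability swing attached to the point


def clutch_delta(points: List[Point], player: str,
                 threshold: float = 0.1) -> Optional[float]:
    """High-pressure win rate minus overall win rate for one player.

    Returns None when the player has no points, or no points at or above
    the (hypothetical) importance threshold.
    """
    own = [pt for pt in points if pt.player == player]
    high = [pt for pt in own if pt.importance >= threshold]
    if not own or not high:
        return None
    baseline = sum(pt.won for pt in own) / len(own)
    pressure = sum(pt.won for pt in high) / len(high)
    return pressure - baseline


if __name__ == "__main__":
    sample = [
        Point("A", True, 0.02),
        Point("A", False, 0.03),
        Point("A", True, 0.15),
        Point("A", True, 0.20),
    ]
    print(clutch_delta(sample, "A"))  # 1.0 - 0.75 = 0.25
```

A positive value means the player wins a larger share of high-importance points than of points overall. A value near zero suggests performance that does not depend on point importance.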
Tech Stack
Programming & ETL:
- Python (data extraction, match state reconstruction, simulation scripts)
- PySpark (running large-scale point-by-point match simulations)
- SQL / DuckDB queries (transforming and analyzing simulation outputs)
Data Warehousing / Storage:
- DuckDB (storing match simulation outputs and transformed datasets)
- CSVs (raw point-by-point match data from GitHub)
Visualization / Dashboard:
- Streamlit (interactive dashboard showing player performance, match probabilities, swing points, and visualizations by tournament, year, and ATP/WTA tour)
DevOps / Automation:
- Git (version control)
Architecture Diagram
Database Diagram
Future Improvements
- Integrate an API that provides point-by-point match data from tournaments beyond the Grand Slams
- Incorporate data orchestration with Airflow/Dagster
- Migrate DuckDB transformations from Python to dbt