floodlight.io.datasets
Note
We cannot guarantee data availability for public data sets, unfortunately. Data from published articles (e.g. the EIGD) should be permanently available and stay static. Public provider data from StatsBomb is available on GitHub, but unversioned and with dynamically changing content. You can find methods to query the current list of games, and we also state the last date that we found the data to be available.
As public data sets for proprietary sports data are fairly rare, the standard way of accessing data is still via provider raw data files. To load these, we have more than ten parsers for different provider formats in the IO submodule!
- class floodlight.io.datasets.EIGDDataset(dataset_dir_name='eigd_dataset')[source]
This dataset loads the EIGD-H data from the paper “A Unified Taxonomy and Multimodal Dataset for Events in Invasion Games”. [1]
Upon instantiation, the class checks whether the data already exists in the repository’s root .data folder and, if not, downloads the files (~120MB) to this folder.
- Parameters
dataset_dir_name (str, optional) – Name of subdirectory where the dataset is stored within the root .data directory. Defaults to ‘eigd_dataset’.
Notes
The dataset contains a total of 25 short samples of spatiotemporal data for both teams and the ball from the German Men’s Handball Bundesliga (HBL). For more information, visit the official project repository. Data for one sample can be queried by calling the get() method and specifying the match and segment. The following matches and segments are available:
matches = ['48dcd3', 'ad969d', 'e0e547', 'e8a35a', 'ec7a6a']
segments = {
    '48dcd3': ['00-06-00', '00-15-00', '00-25-00', '01-05-00', '01-10-00'],
    'ad969d': ['00-00-30', '00-15-00', '00-43-00', '01-11-00', '01-35-00'],
    'e0e547': ['00-00-00', '00-08-00', '00-15-00', '00-50-00', '01-00-00'],
    'e8a35a': ['00-02-00', '00-07-00', '00-14-00', '01-05-00', '01-14-00'],
    'ec7a6a': ['00-30-00', '00-53-00', '01-19-00', '01-30-00', '01-40-00'],
}
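As a small sketch of how the listing above might be used, the following enumerates every valid (match, segment) pair. The `segments` dictionary is copied from the listing; the name `samples` is introduced here for illustration only. The loop that would actually load data is left as a comment so the snippet runs without downloading anything.

```python
# Match and segment identifiers copied from the listing above.
segments = {
    "48dcd3": ["00-06-00", "00-15-00", "00-25-00", "01-05-00", "01-10-00"],
    "ad969d": ["00-00-30", "00-15-00", "00-43-00", "01-11-00", "01-35-00"],
    "e0e547": ["00-00-00", "00-08-00", "00-15-00", "00-50-00", "01-00-00"],
    "e8a35a": ["00-02-00", "00-07-00", "00-14-00", "01-05-00", "01-14-00"],
    "ec7a6a": ["00-30-00", "00-53-00", "01-19-00", "01-30-00", "01-40-00"],
}

# Flatten the mapping into (match, segment) pairs, one per sample.
samples = [
    (match, segment)
    for match, match_segments in segments.items()
    for segment in match_segments
]

print(len(samples))  # 25

# With the dataset available, each pair could be passed to get():
# for match, segment in samples:
#     teamA, teamB, ball = dataset.get(match_name=match, segment=segment)
```

The count of 25 matches the number of samples stated in the notes.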
Examples
>>> from floodlight.io.datasets import EIGDDataset
>>> dataset = EIGDDataset()
>>> # get one sample
>>> teamA, teamB, ball = dataset.get(match_name="48dcd3", segment="00-06-00")
>>> # get the corresponding pitch
>>> pitch = dataset.get_pitch()
References
- get(match_name='48dcd3', segment='00-06-00')[source]
Get one sample from the EIGD dataset.
- Parameters
match_name (str, optional) – Match name, check Notes section for valid arguments. Defaults to the first match (“48dcd3”).
segment (str, optional) – Segment identifier, check Notes section for valid arguments. Defaults to the first segment (“00-06-00”).
- Returns
sample – Returns three XY objects of the form (teamA, teamB, ball) for the requested sample.
- Return type
- class floodlight.io.datasets.StatsBombOpenDataset(dataset_dir_name='statsbomb_dataset')[source]
This dataset loads the StatsBomb open data provided by the official data repository.
Due to the size of the full dataset (~5GB), only metadata (~2MB) are downloaded to the repository’s root .data folder upon instantiation, while the other data are only downloaded on demand. All downloaded files stay on disk unless manually removed.
- Parameters
dataset_dir_name (str, optional) – Name of subdirectory where the dataset is stored within the root .data directory. Defaults to ‘statsbomb_dataset’.
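The download-on-demand behaviour described above can be illustrated with a minimal, generic sketch. This is not floodlight's implementation: `ensure_file`, `fake_fetch`, and the file path are hypothetical names chosen for illustration, and a temporary directory stands in for the real .data folder.

```python
import tempfile
from pathlib import Path


def ensure_file(path: Path, fetch) -> bytes:
    """Return the bytes stored at `path`, calling `fetch()` only if the file is missing."""
    if not path.exists():
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(fetch())
    return path.read_bytes()


calls = []


def fake_fetch() -> bytes:
    # Stand-in for a network download; records how often it is called.
    calls.append(1)
    return b'{"match_id": 1}'


with tempfile.TemporaryDirectory() as root:
    # Hypothetical file location inside a stand-in .data directory.
    target = Path(root) / ".data" / "statsbomb_dataset" / "example.json"
    first = ensure_file(target, fake_fetch)   # file missing: fetch is called
    second = ensure_file(target, fake_fetch)  # file present: read from disk

print(len(calls))  # 1
```

The second call reads from disk rather than fetching again, which mirrors the statement that downloaded files stay on disk until removed by hand.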
Notes
The dataset contains results, lineups, event data, and (partly) StatsBomb360 data for a variety of matches from a total of eight different competitions (Women’s World Cup, FIFA World Cup, UEFA Euro, Champions League, FA Women’s Super League, NWSL, Premier League, and La Liga). The Champions League data, for example, contains all Finals from 2003/2004 to 2018/2019. The La Liga data contains every one of the 520 matches ever played by Lionel Messi for FC Barcelona. The UEFA Euro data contains 51 matches for which StatsBomb360 data is available. As the data is constantly updated, we provide an overview of the stats here but refer to the official repository for up-to-date information (last checked 20.08.2022):
number_of_matches = {
    "Champions League": {
        '1999/2000': 0,
        '2003/2004': 1,
        '2004/2005': 1,
        '2006/2007': 1,
        '2008/2009': 1,
        '2009/2010': 1,
        '2010/2011': 1,
        '2011/2012': 1,
        '2012/2013': 1,
        '2013/2014': 1,
        '2014/2015': 1,
        '2015/2016': 1,
        '2016/2017': 1,
        '2017/2018': 1,
        '2018/2019': 1,
    },
    "FA Women's Super League": {
        '2018/2019': 108,
        '2019/2020': 87,
        '2020/2021': 131,
    },
    "FIFA World Cup": {
        '2018': 64,
    },
    "La Liga": {
        '2004/2005': 7,
        '2005/2006': 17,
        '2006/2007': 26,
        '2007/2008': 28,
        '2008/2009': 31,
        '2009/2010': 35,
        '2010/2011': 33,
        '2011/2012': 37,
        '2012/2013': 32,
        '2013/2014': 31,
        '2014/2015': 38,
        '2015/2016': 33,
        '2016/2017': 34,
        '2017/2018': 36,
        '2018/2019': 34,
        '2019/2020': 33,
        '2020/2021': 35,
    },
    "NWSL": {
        '2018': 36,
    },
    "Premier League": {
        '2003/2004': 33,
    },
    "UEFA Euro": {
        '2020': 51,
    },
    "Women's World Cup": {
        '2019': 52,
    },
}
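As a quick consistency check on the overview above, the La Liga season counts can be summed and compared with the 520 matches mentioned in the notes. The dictionary below copies only the La Liga entries; the names `la_liga` and `total` are introduced here for illustration.

```python
# La Liga match counts per season, copied from the overview above.
la_liga = {
    "2004/2005": 7,
    "2005/2006": 17,
    "2006/2007": 26,
    "2007/2008": 28,
    "2008/2009": 31,
    "2009/2010": 35,
    "2010/2011": 33,
    "2011/2012": 37,
    "2012/2013": 32,
    "2013/2014": 31,
    "2014/2015": 38,
    "2015/2016": 33,
    "2016/2017": 34,
    "2017/2018": 36,
    "2018/2019": 34,
    "2019/2020": 33,
    "2020/2021": 35,
}

# Sum the per-season counts to get the total number of La Liga matches.
total = sum(la_liga.values())
print(total)  # 520
```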
Examples
>>> from floodlight.io.datasets import StatsBombOpenDataset
>>> dataset = StatsBombOpenDataset()
>>> # get one sample of event data with StatsBomb360 data
>>> events, teamsheets = dataset.get("UEFA Euro", "2020", "England vs. Germany")
>>> # get the corresponding pitch
>>> pitch = dataset.get_pitch()
>>> # get a summary of available matches in the dataset
>>> matches = dataset.available_matches
>>> # extract every La Liga Clásico played in Camp Nou by Lionel Messi
>>> clasicos = matches[matches["match_name"] == "Barcelona vs. Real Madrid"]
>>> # print outcomes
>>> for _, match in clasicos.iterrows():
...     print(f"Season {match['season_name']} - Barcelona {match['score']} Real")
>>> # read events to list
>>> clasico_events = []
>>> for _, clasico in clasicos.iterrows():
...     data = dataset.get("La Liga", clasico["season_name"], clasico["match_name"])
...     clasico_events.append(data)
- property available_matches: pandas.core.frame.DataFrame
Creates and returns a DataFrame with information for all available matches from the metadata that is downloaded upon instantiation.
- Returns
summary – Table where the rows contain meta information of individual games such as competition_name, season_name, and match_name (in the format Home vs. Away), location of the match (stadium and country), sex of the players (female or male), the StatsBomb360_status, and the final score.
- Return type
pd.DataFrame
- get(competition_name='La Liga', season_name='2020/2021', match_name=None, teamsheet_home=None, teamsheet_away=None)[source]
Get events and teamsheets from one match of the StatsBomb open dataset.
If StatsBomb360 data are available, they are stored in the qualifier column of the Events object. If the files are not contained in the repository’s root .data folder, they are downloaded to the folder and will be stored until removed by hand.
- Parameters
competition_name (str, optional) – Competition name for which the match is played, check Notes section for possible competitions. Defaults to “La Liga”.
season_name (str, optional) – Season name during which the match is played. For league matches use the format YYYY/YYYY and for international cup matches the format YYYY. Check Notes for available seasons of every competition. Defaults to “2020/2021”.
match_name (str, optional) – Match name relating to the available matches in the chosen competition and season. If equal to None (default), the first available match of the given competition and season is chosen.
teamsheet_home (Teamsheet, optional) – Teamsheet-object for the home team used to create link dictionaries of the form links[pID] = team. If given as None (default), teamsheet is extracted from the data.
teamsheet_away (Teamsheet, optional) – Teamsheet-object for the away team. If given as None (default), teamsheet is extracted from data.
- Returns
data_objects – Tuple of (nested) floodlight core objects with shape (events_objects, teamsheets). events_objects is a nested dictionary containing Events objects for each team and segment of the form events_objects[segment][team] = Events. For a typical league match with two halves and teams this dictionary looks like: {'HT1': {'Home': Events, 'Away': Events}, 'HT2': {'Home': Events, 'Away': Events}}. teamsheets is a dictionary containing Teamsheet objects for each team of the form teamsheets[team] = Teamsheet.
- Return type
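To make the nested layout of events_objects concrete, here is a small sketch of how such a dictionary could be traversed. Plain strings stand in for the real Events objects so that the snippet runs without any data; the names `flat` and `home_events` are introduced here for illustration only.

```python
# Stand-in strings replace real Events objects; the keys follow the
# events_objects[segment][team] layout described above.
events_objects = {
    "HT1": {"Home": "Events(HT1, Home)", "Away": "Events(HT1, Away)"},
    "HT2": {"Home": "Events(HT2, Home)", "Away": "Events(HT2, Away)"},
}

# Flatten to a single dictionary keyed by (segment, team).
flat = {
    (segment, team): events
    for segment, teams in events_objects.items()
    for team, events in teams.items()
}

# Collect the home team's objects across all segments, in segment order.
home_events = [events_objects[segment]["Home"] for segment in events_objects]

print(sorted(flat))
print(home_events)
```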
- static get_pitch()[source]
Returns a Pitch-object corresponding to the StatsBomb Dataset.
- Return type
- get_teamsheets(competition_name='La Liga', season_name='2020/2021', match_name=None)[source]
Returns a dictionary with Teamsheet-objects for both teams (“Home” and “Away”) from one match of the StatsBomb open dataset.
- Parameters
competition_name (str, optional) – Competition name for which the match is played, check Notes section for possible competitions. Defaults to “La Liga”.
season_name (str, optional) – Season name during which the match is played. For league matches use the format YYYY/YYYY and for international cup matches the format YYYY. Check Notes for available seasons of every competition. Defaults to “2020/2021”.
match_name (str, optional) – Match name relating to the available matches in the chosen competition and season. If equal to None (default), the first available match of the given competition and season is chosen.
- Returns
teamsheets – Teamsheet-objects for both teams (“Home” and “Away”) of the given match.
- Return type
Dict[str, Teamsheet]
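The two season_name formats described in the parameters (YYYY/YYYY and YYYY) can be told apart with a simple check before querying. The helper below is a hypothetical illustration and not part of floodlight; it only inspects the string format and does not verify that the season exists in the dataset.

```python
import re

# Patterns for the two documented season name formats.
_SPLIT_YEAR = re.compile(r"\d{4}/\d{4}")
_SINGLE_YEAR = re.compile(r"\d{4}")


def season_format(season_name: str) -> str:
    """Return 'YYYY/YYYY' or 'YYYY' depending on the format of `season_name`.

    Hypothetical helper for illustration; raises ValueError for any other format.
    """
    if _SPLIT_YEAR.fullmatch(season_name):
        return "YYYY/YYYY"
    if _SINGLE_YEAR.fullmatch(season_name):
        return "YYYY"
    raise ValueError(f"Unexpected season name format: {season_name!r}")


print(season_format("2020/2021"))  # YYYY/YYYY
print(season_format("2018"))       # YYYY
```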
- class floodlight.io.datasets.ToyDataset[source]
This dataset loads synthetic data for a (very) short artificial football match.
The data can be used for testing or trying out features. They come shipped with the package and are stored in the repository’s root .data folder.
Examples
>>> from floodlight.io.datasets import ToyDataset
>>> dataset = ToyDataset()
>>> # get one sample
>>> (
...     xy_home,
...     xy_away,
...     xy_ball,
...     events_home,
...     events_away,
...     possession,
...     ballstatus,
... ) = dataset.get(segment="HT1")
>>> # get the corresponding pitch
>>> pitch = dataset.get_pitch()
- get(segment='HT1')[source]
Get data objects for one segment from the toy dataset.
- Parameters
segment ({‘HT1’, ‘HT2’}, optional) – Segment identifier for the first (“HT1”, default) or the second (“HT2”) half.
- Returns
toy_dataset – Returns seven core objects of the form (xy_home, xy_away, xy_ball, events_home, events_away, possession, ballstatus) for the requested segment.
- Return type