floodlight.io.datasets

Note

Unfortunately, we cannot guarantee data availability for public data sets. Data from published articles (e.g. the EIGD) should remain permanently available and static. Public provider data from StatsBomb is available on GitHub, but it is unversioned and its content changes over time. The dataset classes provide methods to query the current list of games, and we also state the last date on which we found the data to be available.

As public data sets for proprietary sports data are fairly rare, the standard way of accessing data is still via provider raw data files. To load these, the IO submodule contains more than ten parsers for different provider formats.

class floodlight.io.datasets.EIGDDataset(dataset_dir_name='eigd_dataset')[source]

This dataset loads the EIGD-H data from the paper “A Unified Taxonomy and Multimodal Dataset for Events in Invasion Games” [1].

Upon instantiation, the class checks whether the data already exist in the repository’s root .data folder and, if they do not, downloads the files (~120 MB) to this folder.

Parameters

dataset_dir_name (str, optional) – Name of subdirectory where the dataset is stored within the root .data directory. Defaults to ‘eigd_dataset’.

Notes

The dataset contains a total of 25 short samples of spatiotemporal data for both teams and the ball from the German Men’s Handball Bundesliga (HBL). For more information, visit the official project repository. Data for one sample can be queried by calling the get() method with the match and segment. The following matches and segments are available:

matches = ['48dcd3', 'ad969d', 'e0e547', 'e8a35a', 'ec7a6a']
segments = {
    '48dcd3': ['00-06-00', '00-15-00', '00-25-00', '01-05-00', '01-10-00'],
    'ad969d': ['00-00-30', '00-15-00', '00-43-00', '01-11-00', '01-35-00'],
    'e0e547': ['00-00-00', '00-08-00', '00-15-00', '00-50-00', '01-00-00'],
    'e8a35a': ['00-02-00', '00-07-00', '00-14-00', '01-05-00', '01-14-00'],
    'ec7a6a': ['00-30-00', '00-53-00', '01-19-00', '01-30-00', '01-40-00'],
}
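
Because get() expects a valid combination of match and segment, it can help to enumerate or check the pairs before requesting data. The following is a minimal sketch that relies only on the listing above; the helper names iter_samples and is_valid_sample are hypothetical and not part of floodlight.

```python
from typing import Dict, Iterator, List, Tuple

# Matches and segments as listed above.
SEGMENTS: Dict[str, List[str]] = {
    "48dcd3": ["00-06-00", "00-15-00", "00-25-00", "01-05-00", "01-10-00"],
    "ad969d": ["00-00-30", "00-15-00", "00-43-00", "01-11-00", "01-35-00"],
    "e0e547": ["00-00-00", "00-08-00", "00-15-00", "00-50-00", "01-00-00"],
    "e8a35a": ["00-02-00", "00-07-00", "00-14-00", "01-05-00", "01-14-00"],
    "ec7a6a": ["00-30-00", "00-53-00", "01-19-00", "01-30-00", "01-40-00"],
}


def iter_samples(
    segments: Dict[str, List[str]] = SEGMENTS,
) -> Iterator[Tuple[str, str]]:
    """Yield every (match_name, segment) pair in listing order."""
    for match_name, match_segments in segments.items():
        for segment in match_segments:
            yield match_name, segment


def is_valid_sample(match_name: str, segment: str) -> bool:
    """Check whether a (match_name, segment) pair appears in the listing."""
    return segment in SEGMENTS.get(match_name, [])


# Hypothetical helper output: all pairs that could be passed to get().
all_samples = list(iter_samples())
```

Each pair could then be passed on as dataset.get(match_name=..., segment=...).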

Examples

>>> from floodlight.io.datasets import EIGDDataset
>>> dataset = EIGDDataset()
# get one sample
>>> teamA, teamB, ball = dataset.get(match_name="48dcd3", segment="00-06-00")
# get the corresponding pitch
>>> pitch = dataset.get_pitch()

References

[1] Biermann, H., Theiner, J., Bassek, M., Raabe, D., Memmert, D., & Ewerth, R. (2021, October). A Unified Taxonomy and Multimodal Dataset for Events in Invasion Games. In Proceedings of the 4th International Workshop on Multimedia Content Analysis in Sports (pp. 1-10).

get(match_name='48dcd3', segment='00-06-00')[source]

Get one sample from the EIGD dataset.

Parameters
  • match_name (str, optional) – Match name, check Notes section for valid arguments. Defaults to the first match (“48dcd3”).

  • segment (str, optional) – Segment identifier, check Notes section for valid arguments. Defaults to the first segment (“00-06-00”).

Returns

sample – Returns three XY objects of the form (teamA, teamB, ball) for the requested sample.

Return type

Tuple[XY, XY, XY]

static get_pitch()[source]

Returns a Pitch object corresponding to the EIGD-data.

Return type

Pitch

class floodlight.io.datasets.StatsBombOpenDataset(dataset_dir_name='statsbomb_dataset')[source]

This dataset loads the StatsBomb open data provided by the official data repository.

Due to the size of the full dataset (~5 GB), only metadata (~2 MB) are downloaded to the repository’s root .data folder upon instantiation, while all other data are downloaded on demand. All downloaded files stay on disk unless they are removed manually.
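
The download-on-demand behaviour follows a common caching pattern: use the file on disk if it is present, otherwise fetch and store it. The sketch below illustrates that pattern only and is not floodlight's implementation; ensure_file, the fetch callable, and the file name metadata.json are hypothetical names chosen for illustration.

```python
import tempfile
from pathlib import Path
from typing import Callable


def ensure_file(path: Path, fetch: Callable[[], bytes]) -> bool:
    """Make sure ``path`` exists, calling ``fetch`` only if it is missing.

    Returns True if the file had to be fetched, False if it was already on disk.
    """
    if path.exists():
        return False
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(fetch())
    return True


# Demo with a temporary directory and a fake fetch function (no network access).
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "statsbomb_dataset" / "metadata.json"  # hypothetical name
    first_call_fetched = ensure_file(target, lambda: b"[]")
    second_call_fetched = ensure_file(target, lambda: b"[]")
```

In the demo, only the first call triggers the fetch; the second call finds the file on disk and returns immediately.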

Parameters

dataset_dir_name (str, optional) – Name of subdirectory where the dataset is stored within the root .data directory. Defaults to ‘statsbomb_dataset’.

Notes

The dataset contains results, lineups, event data, and (partly) StatsBomb360 data for a variety of matches from a total of eight different competitions (Women’s World Cup, FIFA World Cup, UEFA Euro, Champions League, FA Women’s Super League, NWSL, Premier League, and La Liga). The Champions League data, for example, contains finals from 2003/2004 to 2018/2019. The La Liga data contains all 520 matches played by Lionel Messi for FC Barcelona. The UEFA Euro data contains 51 matches for which StatsBomb360 data is available. As the data is constantly updated, we provide an overview of the numbers here but refer to the official repository for up-to-date information (last checked 20.08.2022):

number_of_matches = {
    "Champions League": {
        '1999/2000' : 0, '2003/2004' : 1, '2004/2005' : 1, '2006/2007' : 1,
        '2008/2009' : 1, '2009/2010' : 1, '2010/2011' : 1, '2011/2012' : 1,
        '2012/2013' : 1, '2013/2014' : 1, '2014/2015' : 1, '2015/2016' : 1,
        '2016/2017' : 1, '2017/2018' : 1, '2018/2019' : 1,
        },
    "FA Women's Super League": {
        '2018/2019' : 108, '2019/2020' : 87, '2020/2021' : 131,
        },
    "FIFA World Cup": {
        '2018' : 64,
        },
    "La Liga": {
        '2004/2005': 7, '2005/2006' : 17, '2006/2007' : 26, '2007/2008' : 28,
        '2008/2009' : 31, '2009/2010' : 35, '2010/2011' : 33, '2011/2012' : 37,
        '2012/2013' : 32, '2013/2014' : 31, '2014/2015' : 38, '2015/2016' : 33,
        '2016/2017' : 34, '2017/2018' : 36, '2018/2019' : 34, '2019/2020' : 33,
        '2020/2021' : 35,
        },
    "NWSL": {
        '2018' : 36,
        },
    "Premier League": {
        '2003/2004' : 33,
        },
    "UEFA Euro" : {
        '2020' : 51,
        },
    "Women's World Cup": {
        '2019' : 52,
        },
}
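
The overview above can be aggregated into per-competition totals. The following is a minimal sketch that uses only the numbers listed above, so it reflects the snapshot from the last check rather than the current state of the official repository; the names matches_per_competition and total_matches are introduced here for illustration.

```python
# Snapshot of the overview above (last checked 20.08.2022).
number_of_matches = {
    "Champions League": {
        "1999/2000": 0, "2003/2004": 1, "2004/2005": 1, "2006/2007": 1,
        "2008/2009": 1, "2009/2010": 1, "2010/2011": 1, "2011/2012": 1,
        "2012/2013": 1, "2013/2014": 1, "2014/2015": 1, "2015/2016": 1,
        "2016/2017": 1, "2017/2018": 1, "2018/2019": 1,
    },
    "FA Women's Super League": {
        "2018/2019": 108, "2019/2020": 87, "2020/2021": 131,
    },
    "FIFA World Cup": {"2018": 64},
    "La Liga": {
        "2004/2005": 7, "2005/2006": 17, "2006/2007": 26, "2007/2008": 28,
        "2008/2009": 31, "2009/2010": 35, "2010/2011": 33, "2011/2012": 37,
        "2012/2013": 32, "2013/2014": 31, "2014/2015": 38, "2015/2016": 33,
        "2016/2017": 34, "2017/2018": 36, "2018/2019": 34, "2019/2020": 33,
        "2020/2021": 35,
    },
    "NWSL": {"2018": 36},
    "Premier League": {"2003/2004": 33},
    "UEFA Euro": {"2020": 51},
    "Women's World Cup": {"2019": 52},
}

# Sum the season counts for each competition.
matches_per_competition = {
    competition: sum(seasons.values())
    for competition, seasons in number_of_matches.items()
}

# Total number of matches across all competitions in the snapshot.
total_matches = sum(matches_per_competition.values())
```

The La Liga total computed this way is 520, which agrees with the number of matches stated above.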

Examples

>>> from floodlight.io.datasets import StatsBombOpenDataset
>>> dataset = StatsBombOpenDataset()
# get one sample of event data with StatsBomb360 data
>>> events, teamsheets = dataset.get("UEFA Euro", "2020", "England vs. Germany")
# get the corresponding pitch
>>> pitch = dataset.get_pitch()
# get a summary of available matches in the dataset
>>> matches = dataset.available_matches
# extract every La Liga Clásico played in Camp Nou by Lionel Messi
>>> clasicos = matches[matches["match_name"] == "Barcelona vs. Real Madrid"]
# print outcomes
>>> for _, match in clasicos.iterrows():
...     print(f"Season {match['season_name']} - Barcelona {match['score']} Real")
# read events to list
>>> clasico_events = []
>>> for _, clasico in clasicos.iterrows():
...     data = dataset.get("La Liga", clasico["season_name"], clasico["match_name"])
...     clasico_events.append(data)

property available_matches: pandas.core.frame.DataFrame

Creates and returns a DataFrame with information for all available matches from the metadata that is downloaded upon instantiation.

Returns

summary – Table where the rows contain meta information of individual games such as competition_name, season_name, and match_name (in the format Home vs. Away), location of the match (stadium and country), sex of the players (female or male), the StatsBomb360_status and the final score.

Return type

pd.DataFrame

get(competition_name='La Liga', season_name='2020/2021', match_name=None, teamsheet_home=None, teamsheet_away=None)[source]

Get events and teamsheets from one match of the StatsBomb open dataset.

If StatsBomb360 data are available, they are stored in the qualifier column of the Events object. If the files are not yet contained in the repository’s root .data folder, they are downloaded to that folder and stored until removed by hand.

Parameters
  • competition_name (str, optional) – Competition name for which the match is played, check Notes section for possible competitions. Defaults to “La Liga”.

  • season_name (str, optional) – Season name during which the match is played. For league matches use the format YYYY/YYYY and for international cup matches the format YYYY. Check Notes for available seasons of every competition. Defaults to “2020/2021”.

  • match_name (str, optional) – Match name relating to the available matches in the chosen competition and season. If equal to None (default), the first available match of the given competition and season is chosen.

  • teamsheet_home (Teamsheet, optional) – Teamsheet-object for the home team used to create link dictionaries of the form links[pID] = team. If given as None (default), teamsheet is extracted from the data.

  • teamsheet_away (Teamsheet, optional) – Teamsheet-object for the away team. If given as None (default), teamsheet is extracted from data.
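
The two season_name formats described above can be checked before calling get(). This is a minimal sketch, assuming that a YYYY/YYYY season always spans two consecutive years (true for every season listed in the Notes, but not stated by the library); the function season_format is hypothetical and not part of floodlight.

```python
import re
from typing import Optional

# Two-year seasons ("YYYY/YYYY") and single-year seasons ("YYYY").
_TWO_YEAR = re.compile(r"(\d{4})/(\d{4})")
_SINGLE_YEAR = re.compile(r"\d{4}")


def season_format(season_name: str) -> Optional[str]:
    """Return "YYYY/YYYY" or "YYYY" for a well-formed season name, else None."""
    two_year = _TWO_YEAR.fullmatch(season_name)
    if two_year:
        start, end = int(two_year.group(1)), int(two_year.group(2))
        # Assumption: the second year directly follows the first.
        return "YYYY/YYYY" if end == start + 1 else None
    if _SINGLE_YEAR.fullmatch(season_name):
        return "YYYY"
    return None
```

A well-formed name is not necessarily an available season; the Notes section or available_matches remains the reference for that.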

Returns

data_objects – Tuple of (nested) floodlight core objects with shape (events_objects, teamsheets).

events_objects is a nested dictionary containing Events objects for each team and segment of the form events_objects[segment][team] = Events. For a typical league match with two halves and teams this dictionary looks like: {'HT1': {'Home': Events, 'Away': Events}, 'HT2': {'Home': Events, 'Away': Events}}.

teamsheets is a dictionary containing Teamsheet objects for each team of the form teamsheets[team] = Teamsheet.

Return type

Tuple[Dict[str, Dict[str, Events]], Dict[str, Teamsheet]]
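
The nested structure events_objects[segment][team] can be walked with two loops. The sketch below uses placeholder strings in place of real Events objects so that it runs without downloading data; the helper iter_events is hypothetical and not part of floodlight.

```python
from typing import Any, Dict, Iterator, Tuple


def iter_events(
    events_objects: Dict[str, Dict[str, Any]],
) -> Iterator[Tuple[str, str, Any]]:
    """Yield (segment, team, events) triples from the nested dictionary."""
    for segment, teams in events_objects.items():
        for team, events in teams.items():
            yield segment, team, events


# Placeholder strings stand in for the Events objects of a two-half match.
example = {
    "HT1": {"Home": "events_ht1_home", "Away": "events_ht1_away"},
    "HT2": {"Home": "events_ht2_home", "Away": "events_ht2_away"},
}

# Flat list of (segment, team, events) triples in dictionary order.
flat = list(iter_events(example))
```

With real data, the first element returned by get() would take the place of example.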

static get_pitch()[source]

Returns a Pitch-object corresponding to the StatsBomb Dataset.

Return type

Pitch

get_teamsheets(competition_name='La Liga', season_name='2020/2021', match_name=None)[source]

Returns a dictionary with Teamsheet-objects for both teams (“Home” and “Away”) from one match of the StatsBomb open dataset.

Parameters
  • competition_name (str, optional) – Competition name for which the match is played, check Notes section for possible competitions. Defaults to “La Liga”.

  • season_name (str, optional) – Season name during which the match is played. For league matches use the format YYYY/YYYY and for international cup matches the format YYYY. Check Notes for available seasons of every competition. Defaults to “2020/2021”.

  • match_name (str, optional) – Match name relating to the available matches in the chosen competition and season. If equal to None (default), the first available match of the given competition and season is chosen.

Returns

teamsheets – Teamsheet-objects for both teams (“Home” and “Away”) of the given match.

Return type

Dict[str, Teamsheet]

class floodlight.io.datasets.ToyDataset[source]

This dataset loads synthetic data for a (very) short artificial football match.

The data can be used for testing or trying out features. They come shipped with the package and are stored in the repository’s root .data-folder.

Examples

>>> from floodlight.io.datasets import ToyDataset
>>> dataset = ToyDataset()
# get one sample
>>> (
...     xy_home,
...     xy_away,
...     xy_ball,
...     events_home,
...     events_away,
...     possession,
...     ballstatus,
... ) = dataset.get(segment="HT1")
# get the corresponding pitch
>>> pitch = dataset.get_pitch()

get(segment='HT1')[source]

Get data objects for one segment from the toy dataset.

Parameters

segment ({‘HT1’, ‘HT2’}, optional) – Segment identifier for the first (“HT1”, default) or the second (“HT2”) half.

Returns

toy_dataset – Returns seven core objects of the form (xy_home, xy_away, xy_ball, events_home, events_away, possession, ballstatus) for the requested segment.

Return type

Tuple[XY, XY, XY, Events, Events, Code, Code]
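
Since the returned tuple has seven positions, it may be convenient to give them names. The following is a minimal sketch of one way to do that with a NamedTuple; the class ToySample is hypothetical and not part of floodlight, and placeholder strings stand in for the core objects so that the example runs on its own.

```python
from typing import Any, NamedTuple


class ToySample(NamedTuple):
    """Named view of the seven objects, in the order documented above."""

    xy_home: Any
    xy_away: Any
    xy_ball: Any
    events_home: Any
    events_away: Any
    possession: Any
    ballstatus: Any


# Placeholder tuple standing in for the result of dataset.get(segment="HT1").
raw = (
    "xy_home", "xy_away", "xy_ball",
    "events_home", "events_away",
    "possession", "ballstatus",
)

# Wrap the positional tuple to allow access by field name.
sample = ToySample(*raw)
```

The wrapped sample still unpacks like the original tuple, so existing code that relies on positional unpacking keeps working.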

static get_pitch()[source]

Returns a Pitch object corresponding to the Toy Dataset.

Return type

Pitch