Unfortunately, we cannot guarantee data availability for public data sets. Data from published articles (e.g. the EIGD) should remain permanently available and static. Public provider data from StatsBomb is available on GitHub, but it is unversioned and its content changes dynamically. For such data, we provide methods to query the current list of games and state the date on which we last found the data to be available.
As public data sets for proprietary sports data are fairly rare, the standard way of accessing data is still via raw data files from providers. To load these, the IO submodule contains more than ten parsers for different provider formats.
This dataset loads the EIGD-H data from the paper “A Unified Taxonomy and Multimodal
Dataset for Events in Invasion Games”. [1]
Upon instantiation, the class checks if the data already exists in the repository’s
root .data-folder, and will download the files (~120MB) to this folder if not.
Parameters:
dataset_dir_name (str, optional) – Name of subdirectory where the dataset is stored within the root .data
directory. Defaults to ‘eigd_dataset’.
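The check-then-download behaviour described above follows a common caching pattern. The following is a minimal sketch of that pattern using only the standard library. The names `ensure_dataset` and `download` are hypothetical and not part of floodlight, and the actual class may organize its files differently.

```python
from pathlib import Path
from typing import Callable


def ensure_dataset(
    root: Path, dataset_dir_name: str, download: Callable[[Path], None]
) -> Path:
    """Return the dataset directory, downloading into it only if it is missing or empty.

    Hypothetical helper for illustration; not floodlight's implementation.
    """
    target = root / dataset_dir_name
    if target.is_dir() and any(target.iterdir()):
        return target  # data already on disk, skip the download
    target.mkdir(parents=True, exist_ok=True)
    download(target)  # caller-supplied function that writes the files
    return target
```

Passing the download step in as a callable keeps the sketch free of network access, so the caching logic can be tried out with a dummy function.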
Notes
The dataset contains a total of 25 short samples of spatiotemporal data for both
teams and the ball from the German Men’s Handball Bundesliga (HBL). For more
information, visit the
official project repository.
Data for one sample can be queried by calling the get() method and
specifying the match and segment. The following matches and segments are
available:
>>> from floodlight.io.datasets import EIGDDataset
>>> dataset = EIGDDataset()
>>> # get one sample
>>> teamA, teamB, ball = dataset.get(match_name="48dcd3", segment="00-06-00")
>>> # get the corresponding pitch
>>> pitch = dataset.get_pitch()
This dataset loads the data set accompanying the paper “An integrated dataset of
spatiotemporal and event data in elite soccer”. [2]
Upon instantiation, the class checks if the specified data already exists in the
repository’s root .data-folder, and will download the files to this folder if
not. The default setting is to load the first match from the dataset. However, any
individual match or the entire dataset (~2.4 GB) can be downloaded.
Parameters:
dataset_dir_name (str, optional) – Name of subdirectory where the dataset is stored within the root .data
directory. Defaults to ‘idsse_dataset’.
match_id (str, optional) – Match-ID of either one of the matches or ‘all’. Defaults to ‘J03WMX’. Setting it
to one of the matches will download the data of this individual match, if it
does not exist in the repository’s root .data-folder. Setting it to ‘all’
will download the data of all matches that do not exist in .data.
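The handling of `match_id` described above (a single match versus ‘all’, skipping anything already on disk) can be sketched as follows. This is an illustration under assumptions, not floodlight's code: the helper name `matches_to_download`, the layout of one subdirectory per match, and the placeholder ID `MATCH_B` are all hypothetical.

```python
from pathlib import Path
from typing import Iterable, List


def matches_to_download(
    match_id: str, available: Iterable[str], data_dir: Path
) -> List[str]:
    """Resolve a match_id argument to the matches that are still missing on disk.

    Assumes (hypothetically) one subdirectory per match, named after its ID.
    """
    available = list(available)
    if match_id == "all":
        requested = available
    elif match_id in available:
        requested = [match_id]
    else:
        raise ValueError(f"Unknown match_id: {match_id!r}")
    # Keep only the matches whose data directory does not exist yet.
    return [m for m in requested if not (data_dir / m).exists()]
```

With `J03WMX` already present in the data directory, requesting `"all"` from the available IDs `["J03WMX", "MATCH_B"]` would return only `["MATCH_B"]`.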
Notes
The dataset contains seven full matches of raw event and position data for both
teams and the ball from the German Men’s Bundesliga season 2022/23 first and second
division. A detailed description of the dataset as well as the collection process
can be found in the accompanying paper. Data for one match can be queried by calling
the get() method and specifying the match. The following matches
are available:
>>> from floodlight.io.datasets import IDSSEDataset
>>> dataset = IDSSEDataset(match_id="J03WMX")
>>> # get one sample
>>> events, xy, possession, ballstatus, teamsheets, pitch = dataset.get("J03WMX")
>>> # get the corresponding pitch
>>> pitch = dataset.get_pitch()
Get event and position data from the IDSSE dataset.
Parameters:
match_id (str, optional) – Match name, check Notes section for valid arguments.
Defaults to the first match “J03WMX”.
teamsheet_home (Teamsheet, optional) –
Teamsheet-object for the home team used to create link dictionaries of the
form links[pID] = team. If given as None (default), teamsheet is
extracted from the data.
teamsheet_away (Teamsheet, optional) –
Teamsheet-object for the away team used to create link dictionaries of the
form links[pID] = team. If given as None (default), teamsheet is
extracted from the data.
events (bool, optional) – Specifies whether the event data should be returned. Default is True. If
False, None is returned instead of the Events objects.
positions (bool, optional) – Specifies whether the position data should be returned. Default is True. If
False, None is returned instead of the XY objects, possession objects,
and ballstatus objects. This improves performance considerably if only
event data is required.
Returns:
Dict[str, Code], Dict[str, Code], Dict[str, Teamsheet], Pitch] – Returns a tuple of shape (events_objects, xy_objects, possession_objects,
ballstatus_objects, teamsheets_objects, pitch_object) as returned by the
floodlight.io.dfl.read_event_data_xml() and
floodlight.io.dfl.read_position_data_xml() functions for the requested
match. If either of the arguments events or positions is set to False,
None is returned instead of events_objects or of xy_objects,
possession_objects, and ballstatus_objects, respectively.
Due to the size of the full dataset (~5GB), only metadata (~2MB) are downloaded
to the repository’s root .data-folder upon instantiation while the other data
are only downloaded on demand. All downloaded files stay on disk if not manually
removed.
Parameters:
dataset_dir_name (str, optional) – Name of subdirectory where the dataset is stored within the root .data
directory. Defaults to ‘statsbomb_dataset’.
Notes
The dataset contains results, lineups, event data, and (partly) StatsBomb360 data for a variety
of matches from a total of eight different competitions (Women’s World Cup,
FIFA World Cup, UEFA Euro, Champions League, FA Women’s Super League, NWSL,
Premier League, and La Liga).
The Champions League data for example contains all Finals from 2003/2004 to
2018/2019.
The La Liga data contains every one of the 520 La Liga matches played by Lionel Messi
for FC Barcelona.
The UEFA Euro data contains 51 matches where StatsBomb360 data is available.
As the data is constantly updated, we provide an overview of the match counts here but
refer to the official repository for up-to-date information (last
checked 20.08.2022):
number_of_matches = {
    "Champions League": {
        '1999/2000': 0, '2003/2004': 1, '2004/2005': 1, '2006/2007': 1,
        '2008/2009': 1, '2009/2010': 1, '2010/2011': 1, '2011/2012': 1,
        '2012/2013': 1, '2013/2014': 1, '2014/2015': 1, '2015/2016': 1,
        '2016/2017': 1, '2017/2018': 1, '2018/2019': 1,
    },
    "FA Women's Super League": {
        '2018/2019': 108, '2019/2020': 87, '2020/2021': 131,
    },
    "FIFA World Cup": {'2018': 64},
    "La Liga": {
        '2004/2005': 7, '2005/2006': 17, '2006/2007': 26, '2007/2008': 28,
        '2008/2009': 31, '2009/2010': 35, '2010/2011': 33, '2011/2012': 37,
        '2012/2013': 32, '2013/2014': 31, '2014/2015': 38, '2015/2016': 33,
        '2016/2017': 34, '2017/2018': 36, '2018/2019': 34, '2019/2020': 33,
        '2020/2021': 35,
    },
    "NWSL": {'2018': 36},
    "Premier League": {'2003/2004': 33},
    "UEFA Euro": {'2020': 51},
    "Women's World Cup": {'2019': 52},
}
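An overview of this shape is easy to aggregate. The sketch below copies three of the competitions from the overview above (so the numbers reflect the stated check date, not the current repository) and sums the matches per competition using plain Python.

```python
# Excerpt of the overview above; counts reflect the stated check date.
number_of_matches = {
    "FA Women's Super League": {"2018/2019": 108, "2019/2020": 87, "2020/2021": 131},
    "La Liga": {
        "2004/2005": 7, "2005/2006": 17, "2006/2007": 26, "2007/2008": 28,
        "2008/2009": 31, "2009/2010": 35, "2010/2011": 33, "2011/2012": 37,
        "2012/2013": 32, "2013/2014": 31, "2014/2015": 38, "2015/2016": 33,
        "2016/2017": 34, "2017/2018": 36, "2018/2019": 34, "2019/2020": 33,
        "2020/2021": 35,
    },
    "UEFA Euro": {"2020": 51},
}

# Sum the per-season counts for every competition.
totals = {
    competition: sum(seasons.values())
    for competition, seasons in number_of_matches.items()
}

for competition, total in totals.items():
    print(f"{competition}: {total} matches")
```

The La Liga seasons add up to 520 matches, which agrees with the figure given in the Notes.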
Examples
>>> from floodlight.io.datasets import StatsBombOpenDataset
>>> dataset = StatsBombOpenDataset()
>>> # get one sample of event data with StatsBomb360 data
>>> events, teamsheets = dataset.get("UEFA Euro", "2020", "England vs. Germany")
>>> # get the corresponding pitch
>>> pitch = dataset.get_pitch()
>>> # get a summary of available matches in the dataset
>>> matches = dataset.available_matches
>>> # extract every La Liga Clásico played in Camp Nou by Lionel Messi
>>> clasicos = matches[matches["match_name"] == "Barcelona vs. Real Madrid"]
>>> # print outcomes
>>> for _, match in clasicos.iterrows():
...     print(f"Season {match['season_name']} - Barcelona {match['score']} Real")
>>> # read events to list
>>> clasico_events = []
>>> for _, clasico in clasicos.iterrows():
...     data = dataset.get("La Liga", clasico["season_name"], clasico["match_name"])
...     clasico_events.append(data)
Creates and returns a DataFrame with information for all available matches
from the metadata that is downloaded upon instantiation.
Returns:
summary – Table where the rows contain meta information of individual games such as
competition_name, season_name, and match_name (in the format
Home vs. Away), location of the match (stadium and country),
sex of the players (female or male), the StatsBomb360_status and
the final score.
Get events and teamsheets from one match of the StatsBomb open dataset.
If StatsBomb360 data are available, they are stored in the qualifier column
of the Events object.
If the files are not contained in the repository’s root .data folder, they
are downloaded to that folder and stay there until removed by hand.
Parameters:
competition_name (str, optional) – Competition name for which the match is played, check Notes section for
possible competitions. Defaults to “La Liga”.
season_name (str, optional) – Season name during which the match is played. For league matches use the
format YYYY/YYYY and for international cup matches the format YYYY.
Check Notes for available seasons of every competition.
Defaults to “2020/2021”.
match_name (str, optional) – Match name relating to the available matches in the chosen competition and
season. If equal to None (default), the first available match of the
given competition and season is chosen.
teamsheet_home (Teamsheet, optional) – Teamsheet-object for the home team used to create link dictionaries of the
form links[pID] = team. If given as None (default), teamsheet is extracted
from the data.
teamsheet_away (Teamsheet, optional) – Teamsheet-object for the away team. If given as None (default), teamsheet is
extracted from data.
Returns:
data_objects – Tuple of (nested) floodlight core objects with shape (events_objects,
teamsheets).
events_objects is a nested dictionary containing Events objects for
each team and segment of the form
events_objects[segment][team]=Events.
For a typical league match with two halves and teams this dictionary looks
like:
{'HT1':{'Home':Events,'Away':Events},'HT2':{'Home':Events,'Away':Events}}.
teamsheets is a dictionary containing Teamsheet objects for each
team of the form teamsheets[team]=Teamsheet.
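To illustrate the nested layout events_objects[segment][team], the following sketch walks over such a dictionary. Placeholder strings stand in for the Events objects, and the helper name `iter_events` is hypothetical and not part of floodlight.

```python
from typing import Any, Dict, Iterator, Tuple


def iter_events(
    events_objects: Dict[str, Dict[str, Any]]
) -> Iterator[Tuple[str, str, Any]]:
    """Yield (segment, team, events) triples from a nested events dictionary."""
    for segment, teams in events_objects.items():
        for team, events in teams.items():
            yield segment, team, events


# Placeholder strings stand in for floodlight Events objects.
example = {
    "HT1": {"Home": "events_ht1_home", "Away": "events_ht1_away"},
    "HT2": {"Home": "events_ht2_home", "Away": "events_ht2_away"},
}

for segment, team, events in iter_events(example):
    print(segment, team, events)
```

Iterating over the items rather than hard-coding 'HT1' and 'HT2' also covers matches with additional segments, should the data contain any.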
Returns a dictionary with Teamsheet-objects for both teams (“Home” and
“Away”) from one match of the StatsBomb open dataset.
Parameters:
competition_name (str, optional) – Competition name for which the match is played, check Notes section for
possible competitions. Defaults to “La Liga”.
season_name (str, optional) – Season name during which the match is played. For league matches use the
format YYYY/YYYY and for international cup matches the format YYYY.
Check Notes for available seasons of every competition.
Defaults to “2020/2021”.
match_name (str, optional) – Match name relating to the available matches in the chosen competition and
season. If equal to None (default), the first available match of the
given competition and season is chosen.
Returns:
teamsheets – Teamsheet-objects for both teams (“Home” and “Away”) of the given match.
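Since season_name accepts two formats (YYYY/YYYY and YYYY), it can be convenient to check a value before querying the dataset. The sketch below only validates the format with the standard library; `season_format` is a hypothetical helper, and it does not check whether a season is actually available for a competition.

```python
import re

_TWO_YEAR = re.compile(r"\d{4}/\d{4}")  # e.g. "2020/2021"
_ONE_YEAR = re.compile(r"\d{4}")        # e.g. "2018"


def season_format(season_name: str) -> str:
    """Return the format of a season name: 'YYYY/YYYY' or 'YYYY'.

    Raises ValueError if the name matches neither format.
    """
    if _TWO_YEAR.fullmatch(season_name):
        return "YYYY/YYYY"
    if _ONE_YEAR.fullmatch(season_name):
        return "YYYY"
    raise ValueError(f"Unexpected season name: {season_name!r}")
```

For example, `season_format("2020/2021")` returns `"YYYY/YYYY"`, while a shorthand such as `"20/21"` raises a `ValueError`.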
This dataset loads synthetic data for a (very) short artificial football match.
The data can be used for testing or trying out features. They come shipped with the
package and are stored in the repository’s root .data-folder.
Examples
>>> from floodlight.io.datasets import ToyDataset
>>> dataset = ToyDataset()
>>> # get one sample
>>> (
...     xy_home,
...     xy_away,
...     xy_ball,
...     events_home,
...     events_away,
...     possession,
...     ballstatus,
... ) = dataset.get(segment="HT1")
>>> # get the corresponding pitch
>>> pitch = dataset.get_pitch()
Get data objects for one segment from the toy dataset.
Parameters:
segment ({‘HT1’, ‘HT2’}, optional) – Segment identifier for the first (“HT1”, default) or the second (“HT2”)
half.
Returns:
toy_dataset – Returns seven core objects of the form (xy_home, xy_away, xy_ball,
events_home, events_away, possession, ballstatus) for the requested segment.