This is part of a series of articles inspired by questions from our readers. So thank you to, among others, Tom F, Steve G, Kristof P, Vaageesh T and Matt X, who all asked for a piece on data collection and the calculation of metrics such as xG.
“We should have won that game if you look at the expected goals.”
A common debate among football fans these days, but that wasn’t the case just a few years ago.
The proliferation of data in the public sphere has allowed fans to easily access football statistics with the click of a button. Whether that’s on social media or somewhere else on the internet, advanced metrics are now accessible to a wider audience and have democratised analysis to some extent.
But the average fan takes what they get from public sources. The next stratum of data requires paid access to bespoke packages from the handful of data companies out there, with club-specific data (and what those clubs do with it, more importantly) the most secretive bit of all.
StatsBomb, Opta and Deltatre have been at the forefront of providing data to media organisations, player agencies and clubs in the past decade, but there’s a lingering question that should be on the mind of all data consumers: how exactly is this done? How, in less than a day, is advanced data for myriad leagues recorded and up to date? Do these numbers magically appear after the games?
To find an answer, The Athletic headed to StatsBomb’s collection centre in Cairo.
Its location in the Egyptian capital might sound peculiar at first, but StatsBomb’s acquisition of ArqamFC — which is Arabic for NumbersFC — in 2019 provided the growing data company with a central hub that had developed a way to collect and analyse football matches in great detail. In StatsBomb’s own words, “they were the ideal partner”.
Another benefit of this location is the time zone. A data collection hub in Egypt gives StatsBomb a base within reasonable time zones of Europe, Asia and Africa. A single central hub is also something the company pushed for in its early days to maintain quality control.
“There’s a famous trade-off between having multiple data collection centres around the world, where people know the local players, and having one central hub. The problem in the first is that the quality control measures between the hubs are inconsistent,” says Hesham Abozekry, co-founder and head of data operations at StatsBomb.
“That’s why Ted Knutson (StatsBomb’s founder and CEO) pushed for centrality to maintain the quality of the data.”
On the other end of expected goals (xG) tables, pressure metrics and expected assists (xA), there’s a meticulous process that involves a human and a computer working in tandem to collect the raw data that powers everything. Computer vision is used to help the data collector tag events such as shots, passes and tackles, and insert their locations on the pitch.
“The error threshold rate of artificial intelligence is improving, but when you combine it with a human you can reach 99 per cent accuracy,” says Ali Elfakharany, co-founder and head of data product at StatsBomb. That’s why computer vision is used in live and post-match data collection processes in addition to the input of the data collector.
Post-match data collection aims to provide an in-depth report for the client within 12 hours of the game kicking off. If that client is a club, the coaching staff will have the report after the game and ready for their next meeting, even if it’s the next day.
For this option, two data collectors work on one game — one for each team. Each collector selects their assigned match, which already has metadata that includes the date and time, line-ups, initial formations, referee and managers, all meticulously added by a separate team.
Then, a computer algorithm cross-checks the events entered by ‘data collector A’ against those entered by his colleague to confirm that events are linked: for every aerial won, for instance, there should be a corresponding aerial lost.
If there is missing information, the collector goes back to check the sequence and fill in the event. Each instance is flagged to a member of StatsBomb’s quality assurance team to assess whether the collector needs further training.
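To make the linked-event check concrete, here is a minimal Python sketch of how such a validation could work. It is an illustration only, not StatsBomb's actual algorithm: the `Event` record, the `COUNTERPARTS` table, the `find_unlinked` function and the two-second tolerance are all assumptions made for the example.

```python
from dataclasses import dataclass

# Hypothetical pairing table: an event logged for one team and the
# counterpart we expect the other collector to have logged.
COUNTERPARTS = {"aerial_won": "aerial_lost", "aerial_lost": "aerial_won"}


@dataclass(frozen=True)
class Event:
    team: str
    kind: str
    second: int  # match clock in seconds


def find_unlinked(team_a, team_b, tolerance=2):
    """Return paired-type events with no counterpart within `tolerance` seconds."""
    unlinked = []
    for own, other in ((team_a, team_b), (team_b, team_a)):
        for ev in own:
            wanted = COUNTERPARTS.get(ev.kind)
            if wanted is None:
                continue  # this event type does not need a partner
            has_partner = any(
                o.kind == wanted and abs(o.second - ev.second) <= tolerance
                for o in other
            )
            if not has_partner:
                unlinked.append(ev)
    return unlinked


collector_a = [
    Event("A", "aerial_won", 100),
    Event("A", "pass", 105),
    Event("A", "aerial_won", 300),
]
collector_b = [Event("B", "aerial_lost", 101)]

# Events that would be sent back to a collector for review.
flagged = find_unlinked(collector_a, collector_b)
```

This sketch does not enforce one-to-one matching, so two aerials won in quick succession could both be satisfied by a single aerial lost. A production check would presumably be stricter.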
The next bit is to work on the locations of the events. Since there’s only one ball, the two collectors split the 90 minutes between them to tag the coordinates of the events. After that, it’s time for ‘Freeze Frame’.
In May 2018, StatsBomb began to offer precise positions of the defenders and the goalkeeper at each shot using a process they call Freeze Frame — a snapshot taken at the moment of the shot that captures the location of all players involved in the event. This allows StatsBomb to record the context around each shot and puts into the equation the pressure from defenders and the positioning of the goalkeeper.
The Freeze Frame process starts with the computer automatically producing a list of snapshots for all logged shots that were in the system after the first pass. The collector then tags every player in the frame before inserting information about the shot itself – the body part used, whether the shot came from open play, a free kick or a corner, and the technique of the shot. It could be a volley, half-volley, diving header, lob or even a backheel.
All of these are parameters entered by the collector that affect the xG calculation of each shot. After that, the end location of the shot is inserted by the collector. If it’s a shot on target, the data collector has to choose the precise position of the ball in the goal.
Other details are also taken into consideration. The goalkeeper’s body orientation is one; their action on the shot is another. They could be moving forward when the shot is taken or in a set position, and they could be diving to one side or just standing there.
The final part of Freeze Frame is adjusting the ‘shot impact height’, which calculates the height of the ball when struck, as a ball struck from a rest position on the ground shouldn’t be equal to a header when the forward is 7ft (more than 200cm) in the air. Put all of this in the mixer and you get the xG and expected goals on target (xGOT) values that are now familiar to many fans.
The overall process of offline data collection takes around five hours and a collector will work on a game and a half each day on average. The next shift picks up from where the previous one left off, meaning there is 24-hour coverage for more than 100 competitions worldwide.
When asked what the hardest part of data collection is, Amr Azzam, a StatsBomb data collector, replies that it’s player identification. That’s why data collectors use boots, hair styles and height to differentiate players.
High-profile games are always easier to collect because the quality of the football is higher, as is the quality of the video. “Premier League or Champions Leagues (games) are easier to collect as you know the players,” says Amr.
This is echoed by Elfakharany. “In general, player identification is our hardest problem,” he says. That’s why StatsBomb specifically created a team to prepare information about the players before collectors work on games. Before live games, each collector is sent a document that has information about the teams and players he will be working on.
Live collection requires five data collectors: a reviewer to check everything is correct, one person to collect all the main events, two to tag players and the location of events for each team, and an extra person to fill in information about each event.
The first four are self-explanatory and the last one works on details ranging from Freeze Frame to pass height and pass footedness. This provides clubs with a live stream of data they can check during the game to help their decision-making.
Two more features help data collectors during live collection. First, an algorithm automatically assigns the pass receiver to be the tagged player on the next event. Second, a player-customisation tool lets data collectors adjust players’ height, hairstyle and boot colour before the game, using the provided material, to help them during the tagging process.
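The first of those features can be pictured with a short Python sketch. It is a guess at the general idea rather than StatsBomb's tool: the dictionary fields (`type`, `player`, `receiver`), the `prefill_players` function and the player names are hypothetical.

```python
def prefill_players(events):
    """Copy each pass's receiver into the next event's empty player slot.

    Returns new dictionaries and leaves the input untouched.
    """
    filled = [dict(e) for e in events]
    for prev, nxt in zip(filled, filled[1:]):
        is_completed_pass = prev.get("type") == "pass" and prev.get("receiver")
        if is_completed_pass and nxt.get("player") is None:
            # Only fill blanks, so a collector's manual tag is never overwritten.
            nxt["player"] = prev["receiver"]
    return filled


raw_events = [
    {"type": "pass", "player": "Player 8", "receiver": "Player 10"},
    {"type": "carry", "player": None},
    {"type": "shot", "player": "Player 9"},
]

tagged = prefill_players(raw_events)
```

In this sketch the carry is attributed to the receiver of the previous pass, while the shot keeps the player the collector already tagged.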
Despite that, data collection can be difficult. Nantes’ yellow kit and green numbers make it harder to differentiate the players. Elland Road in the morning sun presents identification problems. Newcastle United’s black and white stripes don’t make life easier for the data collectors either. The hardest of them all? The Nordic leagues in the snow.
Abozekry and Elfakharany know the importance of data collectors and empowering them is part of StatsBomb’s culture. It’s normal that a data collector might be in the conversation loop with one of the biggest clubs in Europe. “The connection between the data collector and the end client gives the collector a sense of ownership and this drives him to produce the highest quality,” says Abozekry.
Working with major clubs is a prestigious honour that any data company seeks, but collecting data for lower leagues is just as important. When talking about that, Elfakharany explains that despite Europe being the final destination, promising talent is out there, all over the world. That’s why collecting data for lower leagues is crucial, even where camera angles aren’t the best and the stadiums aren’t equipped with the highest technology.
“The big teams now must go to the source of talent because if a smaller club spots the player before them, they will have to pay a premium,” he says.
An example that immediately pops up is Kaoru Mitoma. Now rocking the Premier League, he was signed by Brighton & Hove Albion from Kawasaki Frontale in Japan two years ago.
StatsBomb has been working with many Premier League, Major League Soccer and Ligue 1 teams, providing data and tools to assist those clubs throughout the year. But a data-oriented approach is also working down the football pyramid. Last season in League Two, two of the automatically promoted sides and two of the four teams that reached the play-offs were StatsBomb customers.
“It’s a league that we were told would likely never have the budget to use stats and data to run their football teams,” said Knutson last year. “The reality has turned out to be quite different because these teams figured out that they needed to reallocate budget resources — which are still lower than a year’s wages for one elite-level player — to a competitive edge they will never stop using.”
Moving forward, the company is looking to develop more skill-oriented metrics. “We want to continue developing better data points to help us move towards less output-driven metrics and more towards skill-oriented metrics,” says Elfakharany.
Instead of simply judging an attacker based on his expected numbers, they are looking to contextualise the numbers by also factoring in the player’s profile and skill set.
Before the day ends, Elfakharany tells me the secret to success.
“You are only as good as your data quality.”
(Top photos: Getty Images; design: Sean Reilly)