Machine learning model measures performance of MLB players

A team of researchers from Penn State College of Information Sciences and Technology has developed a machine learning model that can better measure the short- and long-term performance of baseball players and teams. The new method was measured against existing statistical analysis methods called sabermetrics.

The research was presented in a paper titled “Using Machine Learning to Describe Player Impact on Play in MLB.”

Relying on NLP and computer vision

The team’s approach relies on recent advances in natural language processing and computer vision, and it could have big implications for how a player’s impact on the game is measured.

Connor Heaton is a PhD candidate at the College of IST.

Heaton says the existing family of methods relies on the number of times a player or team completes a discrete event, such as hitting a home run. These methods do not take into account the context of each action.

“Think of a scenario where a player recorded a single in his last plate appearance,” Heaton said. “He could have hit a dribbler down the third base line, advanced a runner from first to second and beaten the throw to first, or hit a ball to deep left field and comfortably reached first base but not had the speed to push for a double. Describing both situations as resulting in ‘a single’ is accurate but doesn’t tell the whole story.”

The new model

Heaton’s model learns the meaning of in-game events from the impact they have on the game and the context in which they occur. It then treats the game as a sequence of events and generates numerical representations of each player’s impact on the game.

“We often talk about baseball in terms of ‘this player had two singles and a double yesterday’ or ‘he went one for four,’” Heaton said. “A lot of the ways we talk about the game are just summarizing events with a summary statistic. Our work tries to take a more holistic picture of the game and get a more nuanced, computational description of player impact on the game.”

The new method borrows sequential modeling techniques from NLP, where they help computers learn the meaning of different words. Heaton used them to teach his model the significance of different events in a baseball game, such as a batter hitting a single. The game was then modeled as a sequence of events.
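The analogy with words can be illustrated with a minimal sketch. It is not the researchers' code: the event labels and the helper names `build_vocab` and `encode` are hypothetical, chosen only to show how a game might be turned into a sequence of integer tokens that a sequence model could consume, the same way text is tokenized in NLP.

```python
# Hypothetical event labels for illustration only; the paper's actual
# event vocabulary is not reproduced here.
game_events = ["single", "strikeout", "walk", "single", "home_run", "strikeout"]


def build_vocab(events):
    """Assign each distinct event an integer id, in order of first appearance."""
    vocab = {}
    for event in events:
        if event not in vocab:
            vocab[event] = len(vocab)
    return vocab


def encode(events, vocab):
    """Turn a game into a sequence of token ids, like words in a sentence."""
    return [vocab[event] for event in events]


vocab = build_vocab(game_events)
encoded_game = encode(game_events, vocab)
print(vocab)
print(encoded_game)
```

In a real system, each id would typically be mapped to a learned vector, so that events occurring in similar contexts end up with similar representations.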

“The impact of this work is the framework that is offered for what I like to call ‘interrogating the game,’” Heaton said. “We consider it as a sequence within this whole computational scaffolding for modeling a game.”

The model is able to describe a player’s influence on the game in the short term, and when combined with traditional methods, it can predict the winner of a game with over 59% accuracy.

Model training

The researchers trained their model using data previously collected from systems installed at major league ballparks. These systems track detailed information for each pitch, including player positioning, base occupancy, and pitch speed. Two types of data were used. The first was pitch-by-pitch data, which helped analyze information such as pitch type. The second was season-by-season data, used to investigate position-specific information.

Each pitch in the collected dataset had three main identifying characteristics: the specific game, the at-bat number within the game, and the pitch number within the at-bat. This data allowed researchers to piece together the sequence of events that make up an MLB game.
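As a rough sketch of how three such identifiers can restore the order of play, the example below sorts unordered pitch records by game, at-bat, and pitch number, then groups them by game. The field names (`game_id`, `at_bat`, `pitch`, `result`) and the sample records are assumptions made for illustration, not the schema of the researchers' dataset.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical pitch records in arbitrary order; field names are illustrative.
pitches = [
    {"game_id": "G2", "at_bat": 1, "pitch": 1, "result": "ball"},
    {"game_id": "G1", "at_bat": 2, "pitch": 1, "result": "single"},
    {"game_id": "G1", "at_bat": 1, "pitch": 2, "result": "strikeout"},
    {"game_id": "G1", "at_bat": 1, "pitch": 1, "result": "called_strike"},
]


def sequences_by_game(records):
    """Order pitches by (game, at-bat, pitch) and collect results per game."""
    ordered = sorted(records, key=itemgetter("game_id", "at_bat", "pitch"))
    return {
        game_id: [record["result"] for record in group]
        for game_id, group in groupby(ordered, key=itemgetter("game_id"))
    }


game_sequences = sequences_by_game(pitches)
print(game_sequences)
```

Sorting on the full three-part key before grouping matters because `groupby` only merges adjacent records that share a key.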

To describe what happened, how it happened, and who was involved in each play, the team identified 325 possible changes in the state of the game that could occur when a pitch is thrown. These changes were then combined with the existing data, and the player records were imputed.

Prasenjit Mitra is a professor of information sciences and technology and a co-author of the paper.

“This work has the potential to significantly advance the state of the art in sabermetrics,” said Professor Mitra. “To the best of our knowledge, ours is the first to capture and represent a nuanced state of the game and use this information as context to evaluate individual events that are counted by traditional statistics – for example, by automatically building a model which includes key moments and milestones.”

Sherry J. Basler