Explore data from this open source collection of professional League of Legends matches.
Analysis by Alex Bisberg
%matplotlib inline
import pandas as pd
import re
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats
import warnings
warnings.filterwarnings('ignore')
I started by looking at the matches data set, as it seemed to be the most comprehensive set of general data.
matches = pd.read_csv('data/matches.csv',delimiter=',')
print(sorted(matches.columns))
This data set has most of the high level properties of a League of Legends game.
l = matches[['League','Year']].groupby('League').count().sort_values(by='Year',ascending=False)
l.columns = ['matches']
l.plot(kind='bar',rot=0,figsize=(15,4),title='Number of matches by league')
Many of the matches are from the LCK, although NA and EU are close behind. Even some relatively obscure leagues like CBLoL and tournaments like Worlds and MSI are included.
sy = matches[['Season','Year','League']].groupby(['Year','Season']).count()
sy.columns = ['matches']
sy.plot(kind='bar',rot=0,figsize=(15,4),title='Number of matches by season')
Matches date back to 2014, with more games from more recent seasons.
kills = pd.read_csv('data/kills.csv',delimiter=',')
kills[kills['Address']==matches['Address'][0]] \
[['Team','Killer','Victim','Time','x_pos','y_pos']][:5]
There was some interesting data in the kills.csv set. I was especially curious about the x_pos and y_pos of each kill. I thought it would be interesting to overlay the data on Summoner's Rift and see if something could be determined about kill location. It seemed like this might take a while and be slightly out of the scope of this project.
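As a rough starting point for the kill-location idea, the positions could be binned into a coarse grid before any map overlay is attempted. The sketch below is only an illustration: the sample rows are made up, the `kill_density` helper is hypothetical, and the grid bounds are taken from the data because the coordinate range of the real file is not established here.

```python
import numpy as np
import pandas as pd

# Hypothetical rows standing in for kills.csv; only the column names
# x_pos and y_pos come from the notebook, the values are invented.
sample_kills = pd.DataFrame({
    'x_pos': [1200, 1350, 7400, 7600, 7500, 13800],
    'y_pos': [1100, 1500, 7300, 7700, 7450, 13900],
})

def kill_density(kills, bins=3):
    """Count kills in a bins x bins grid spanning the observed coordinates."""
    # Coerce to numbers and drop rows without a usable position
    xy = kills[['x_pos', 'y_pos']].apply(pd.to_numeric, errors='coerce').dropna()
    counts, x_edges, y_edges = np.histogram2d(xy['x_pos'], xy['y_pos'], bins=bins)
    return counts, x_edges, y_edges

counts, x_edges, y_edges = kill_density(sample_kills)
print(counts)
```

The resulting array could then be drawn as a heat map over a map image, which would be the next step for the overlay.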
gold = pd.read_csv('data/gold.csv',delimiter=',')
cols = [c for c in gold.columns if c not in ['Address','Type']]
onegame = gold[(gold['Address']==matches['Address'][0]) & (gold['Type'] == 'golddiff') ]\
[cols].T.dropna()
onegame.columns = ['gold']
onegame.plot(kind='bar',figsize=(15,4))
The gold.csv data set kept track of all of the gold over time, which I thought could be useful information, but I would likely have to join this data set with matches to analyze how champions or groups of champions performed over time in certain situations. Once again, I thought this might be slightly outside the scope of this project.
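The join mentioned above could look roughly like the following. This is a minimal sketch on invented data: only Address, Type, 'golddiff', bResult and gamelength appear in the notebook, while the minute column names (`min_1`, `min_2`), the 'goldblue' type and all values are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical miniature versions of matches.csv and gold.csv
matches_small = pd.DataFrame({
    'Address': ['game_a', 'game_b'],
    'bResult': [1, 0],
    'gamelength': [35, 42],
})
gold_small = pd.DataFrame({
    'Address': ['game_a', 'game_a', 'game_b'],
    'Type': ['golddiff', 'goldblue', 'golddiff'],
    'min_1': [0, 2500, 0],        # assumed column name
    'min_2': [150, 2900, -200],   # assumed column name
})

# Keep one gold-difference row per game, then attach the match-level columns
golddiff = gold_small[gold_small['Type'] == 'golddiff']
joined = matches_small.merge(
    golddiff, on='Address', how='inner', validate='one_to_one')
print(joined[['Address', 'bResult', 'min_2']])
```

Passing `validate='one_to_one'` makes the merge raise if a game unexpectedly has more than one matching row on either side.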
monsters, structures, and bans seemed slightly less interesting to me, and at this point I figured I could find something interesting in the matches data set itself.
After looking through the columns of the main matches.csv data set, I noticed there was data on which champions played in the game (blue{}Champ, red{}Champ), the result of the game (bResult, rResult), and the game duration (gamelength). I thought about the common win rate vs. game time graphs displayed on sites like op.gg, but I wanted to ask a slightly deeper question: is there a significant difference in the mean game time for games that a champ wins vs. games that they lose? This could give pro teams insight into which champions to pick given a playstyle that would allow them to win early or stall out to the late game.
roles = ['Top','Jungle','Middle','ADC','Support']
blue = ['blue{}Champ'.format(r) for r in roles]
red = ['red{}Champ'.format(r) for r in roles]
champ_gt_array = []
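for i,m in matches.iterrows():
    for c in blue + red:
        # c is the column name (e.g. 'blueTopChamp') and m[c] is the champion name,
        # so the team has to be read from c rather than from m[c]
        champ_gt_array.append(
            [m[c],
             m['bResult'] if bool(re.match('blue',c)) else m['rResult'],
             m['gamelength']]
        )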
champ_frame = pd.DataFrame(data=champ_gt_array,columns=['champ','win','gamelength'])
champ_frame.describe()['gamelength']
I munged the data to get the champ names, whether they were on the winning team, and how long their game was. It looks like the mean game length is around 37 minutes, with a slight right skew. This makes sense, since in general it's easier for a League of Legends game to run long than to end very early.
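The same reshaping can also be sketched without a row loop, using `melt` to turn the ten champion columns into one row per champion per game. This is an alternative formulation, not the code used for the results here; the single match below is invented, and the `slot` column name is an assumption.

```python
import pandas as pd

roles = ['Top', 'Jungle', 'Middle', 'ADC', 'Support']
blue = ['blue{}Champ'.format(r) for r in roles]
red = ['red{}Champ'.format(r) for r in roles]

# Hypothetical single match; champion names are placeholders (B0..B4, R0..R4)
row = {'bResult': 1, 'rResult': 0, 'gamelength': 35}
row.update({c: 'B' + str(i) for i, c in enumerate(blue)})
row.update({c: 'R' + str(i) for i, c in enumerate(red)})
matches_small = pd.DataFrame([row])

# One row per (match, champion slot)
long = matches_small.melt(
    id_vars=['bResult', 'rResult', 'gamelength'],
    value_vars=blue + red,
    var_name='slot',
    value_name='champ',
)
# Blue-side slots take bResult, red-side slots take rResult
is_blue = long['slot'].str.startswith('blue')
long['win'] = long['bResult'].where(is_blue, long['rResult'])
champ_frame_small = long[['champ', 'win', 'gamelength']]
print(champ_frame_small)
```

On a full data set this avoids `iterrows`, which tends to be slow for tens of thousands of rows.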
wins = champ_frame[champ_frame['win'] == 1]
loses = champ_frame[champ_frame['win'] == 0]
w_gl= wins[['win','gamelength']].groupby('gamelength').count()
w_gl.columns = ['wins']
l_gl= loses[['win','gamelength']].groupby('gamelength').count()
l_gl.columns = ['loses']
fig, axs = plt.subplots(1,2,figsize=(15,5))
w_gl.plot(kind='bar',ax=axs[0],title='Wins by game length')
l_gl.plot(kind='bar',ax=axs[1],title='Losses by game length')
Given the roughly normal distribution of game length, it would probably be reasonable to compare the mean game lengths for different champions.
w_gl.describe()['wins']['25%'],l_gl.describe()['loses']['25%']
Next I wanted to make sure I had enough data on a champion for the test to return significant results, so I looked at the bottom quartile for number of games per champ and cut off the list there.
play_cnt = champ_frame[['win','champ']] \
.groupby(by='champ').count() \
.sort_values('win',ascending=False)
top_75 = play_cnt[play_cnt['win'] >= 50]
bot_25 = play_cnt[play_cnt['win'] < 50]
len(play_cnt),len(top_75)
108 champions still seems like a reasonable pool that yields a useful number of choices, notably those that are played more frequently in pro games.
def process_games(champs, wins, loses):
    # For each champion, compare game lengths of wins vs. losses with a t-test
    mean_data = []
    for champ in champs:
        cw = wins[wins['champ'] == champ]['gamelength']
        cl = loses[loses['champ'] == champ]['gamelength']
        mean_data.append([
            champ,
            cw.count()+cl.count(),
            cw.mean(),
            cl.mean(),
            abs(cw.mean()-cl.mean()),
            # stored as a boolean: True when the p-value is at most 0.05
            stats.ttest_ind(cw,cl)[1] <= 0.05
        ])
    return pd.DataFrame(
        data=mean_data,
        columns=['champ','tot_games','mean_win_gl','mean_loss_gl','abs_diff','pvalue'])
mean_frame = process_games(top_75.index,wins,loses)
mean_frame.sort_values(by='abs_diff',ascending=False)[:10]
These were some interesting results. I rejected the null hypothesis that the mean game lengths for wins and losses were equal if the p-value was at most 0.05. The champion with the largest significant absolute difference was Dr. Mundo, at around 5 min. The mean of his losses is below the mean game time, but his mean for wins is greater, at around 40 min. Draven is another standout. He is known for his huge damage output, which may manifest in the mid game, but most of his losses are below the mean game time, possibly signifying that when he got behind early he would lose.
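To make the decision rule concrete, here is a small sketch of the two-sample t-test on synthetic game lengths (the numbers are generated, not taken from the match data). The first call mirrors the default `stats.ttest_ind` usage in process_games, which assumes equal variances; the second shows Welch's variant via `equal_var=False`, which is not what this analysis uses but may be worth considering when win and loss samples differ in spread.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic game lengths in minutes, for illustration only
win_lengths = rng.normal(loc=40, scale=6, size=200)
loss_lengths = rng.normal(loc=35, scale=6, size=200)

# Null hypothesis: wins and losses have the same mean game length
t_stat, p_value = stats.ttest_ind(win_lengths, loss_lengths)

# Welch's t-test drops the equal-variance assumption
t_welch, p_welch = stats.ttest_ind(win_lengths, loss_lengths, equal_var=False)

# Reject the null hypothesis when the p-value is at most 0.05
significant = p_value <= 0.05
print(t_stat, p_value, significant)
```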
mean_frame[mean_frame.mean_win_gl > mean_frame.mean_loss_gl]['champ'].count(),\
mean_frame[mean_frame.mean_win_gl < mean_frame.mean_loss_gl]['champ'].count()
Many champions had longer mean game times for wins than for losses, so I decided to look at just the games that were substantially longer than the mean. I chose to zoom in on the top quartile, or those that were above 41 minutes long.
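A top-quartile cutoff like this can be computed directly rather than read off a summary table. The sketch below uses a handful of invented game lengths, so its cutoff will not match the 41 minutes found in the real data.

```python
import pandas as pd

# Hypothetical game lengths in minutes
gamelengths = pd.Series([25, 30, 33, 36, 38, 41, 45, 52], name='gamelength')

# 75th percentile (linear interpolation is the pandas default)
cutoff = gamelengths.quantile(0.75)

# Keep only games strictly longer than the cutoff
long_games = gamelengths[gamelengths > cutoff]
print(cutoff, list(long_games))
```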
x_wins = wins[wins.gamelength > 41]
x_loses = loses[loses.gamelength > 41]
x_mean_frame = process_games(top_75.index,x_wins,x_loses)
sv = x_mean_frame[x_mean_frame.pvalue == True].sort_values(by='abs_diff',ascending=False)
with sns.axes_style("white"):
    ax = plt.axes()
    g = sns.heatmap(
        sv[['mean_win_gl','mean_loss_gl']],
        annot=True,
        cmap='Greens',
        ax=ax)
    ax.set_title('Mean win and loss time for games over 41 min')
    g.set_yticklabels(sv['champ'],rotation=0)
This heat map visualizes the significant mean win and loss times for champs in games over 41 min. These four champions all fared worse in longer games. Anecdotally this makes sense as well, given the assassin playstyle of Fizz and Lee Sin and Olaf's lack of scaling into the late game. Kalista makes an interesting appearance here, which could be explained by her shorter range making her more vulnerable in late game team fights.
Overall it was possible to show that some champions do have a significantly longer mean game time when they win. This data could be used as the input to another, more complex model to try to determine ideal team compositions for certain strategies or situations. The main contributors to a lack of significance were a shortage of game data for the infrequently played champions and a regression to the mean game time for the most played champions. Variables I didn't account for could include changes in game objectives and in the champions themselves over time.
To dig deeper I would like to look at factors like gold difference or KDA. Gold difference could be used to show the potential for some champs to get ahead early or even come back from a deficit. KDA is a good metric since it's somewhat normalized for game time (assuming kills, deaths and assists are equally likely to occur as the game goes on). In addition, looking at multiple champions together could reveal more significant or interesting trends.
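For reference, KDA is commonly computed as (kills + assists) / deaths, with zero deaths treated as one to avoid dividing by zero. A minimal sketch under that assumption follows; the per-player table and its column names are hypothetical and would first have to be derived from the raw data.

```python
import pandas as pd

# Hypothetical per-player totals for one game
players = pd.DataFrame({
    'player': ['p1', 'p2', 'p3'],
    'kills': [5, 0, 2],
    'deaths': [2, 4, 0],
    'assists': [7, 8, 10],
})

def kda(df):
    """(kills + assists) / deaths, counting zero deaths as one."""
    deaths = df['deaths'].clip(lower=1)
    return (df['kills'] + df['assists']) / deaths

players['kda'] = kda(players)
print(players)
```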