GOAL:

Explore data from this open source collection of professional League of Legends matches.

Analysis by Alex Bisberg

In [1]:
%matplotlib inline
import pandas as pd
import re as re
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats
import warnings
warnings.filterwarnings('ignore')

Exploring the data

I started by looking at the matches data set, as it seemed to be the most comprehensive set of general data

In [2]:
matches = pd.read_csv('data/matches.csv',delimiter=',')
print(sorted(matches.columns))
['Address', 'League', 'Season', 'Type', 'Year', 'bBarons', 'bDragons', 'bHeralds', 'bInhibs', 'bKills', 'bResult', 'bTowers', 'blueADC', 'blueADCChamp', 'blueBans', 'blueJungle', 'blueJungleChamp', 'blueMiddle', 'blueMiddleChamp', 'blueSupport', 'blueSupportChamp', 'blueTeamTag', 'blueTop', 'blueTopChamp', 'gamelength', 'goldblue', 'goldblueADC', 'goldblueJungle', 'goldblueMiddle', 'goldblueSupport', 'goldblueTop', 'golddiff', 'goldred', 'goldredADC', 'goldredJungle', 'goldredMiddle', 'goldredSupport', 'goldredTop', 'rBarons', 'rDragons', 'rHeralds', 'rInhibs', 'rKills', 'rResult', 'rTowers', 'redADC', 'redADCChamp', 'redBans', 'redJungle', 'redJungleChamp', 'redMiddle', 'redMiddleChamp', 'redSupport', 'redSupportChamp', 'redTeamTag', 'redTop', 'redTopChamp']

This data set has most of the high level properties of a League of Legends game.

In [3]:
l = matches[['League','Year']].groupby('League').count().sort_values(by='Year',ascending=False)
l.columns = ['matches']
l.plot(kind='bar',rot=0,figsize=(15,4),title='Number of matches by league')
Out[3]:
<matplotlib.axes._subplots.AxesSubplot at 0x10c0952e8>

Many of the matches are from the LCK, although NA and EU are close behind. Even some relatively obscure leagues like CBLoL and tournaments like Worlds and MSI are included.

In [4]:
sy = matches[['Season','Year','League']].groupby(['Year','Season']).count()
sy.columns = ['matches']
sy.plot(kind='bar',rot=0,figsize=(15,4),title='Number of matches by season')
Out[4]:
<matplotlib.axes._subplots.AxesSubplot at 0x10c0799b0>

Matches date back to 2014 with more games from more recent seaons.

In [5]:
kills = pd.read_csv('data/kills.csv',delimiter=',')
kills[kills['Address']==matches['Address'][0]] \
  [['Team','Killer','Victim','Time','x_pos','y_pos']][:5]
Out[5]:
Team Killer Victim Time x_pos y_pos
0 bKills TSM Bjergsen C9 Hai 10.820 9229 8469
1 bKills TSM WildTurtle C9 LemonNation 16.286 13908 5960
2 bKills TSM Bjergsen C9 Hai 18.733 8281 7889
3 bKills TSM Dyrus C9 Meteos 18.880 8522 8895
4 bKills TSM Bjergsen C9 Balls 27.005 8902 7643

There was some interesting data in the kills.csv set. I was especially curious about the x_pos and y_pos of the kill. I thought it would be interesting to overlay the data on summoners rift and see if something could be determined about kill location. It seemed like this may take a while and be slightly out of the scope of this project.

In [6]:
gold = pd.read_csv('data/gold.csv',delimiter=',')
cols = [c for c in gold.columns if c not in ['Address','Type']]
onegame = gold[(gold['Address']==matches['Address'][0]) & (gold['Type'] == 'golddiff') ]\
  [cols].T.dropna()
onegame.columns = ['gold']
onegame.plot(kind='bar',figsize=(15,4))
Out[6]:
<matplotlib.axes._subplots.AxesSubplot at 0x1129b0dd8>

The gold.csv data set kept track of all of the gold over time which I thought could be useful information, but I would likely have to join this dataset with matches to try to analyze how champions or groups of champions performed over time in certain situations. Once again, I thought this may be slightly outside the scope of this project.

monsters, structures, and bans seemed slightly less interesting to me, and at this point I figure I could find something interesting in the matches data set itself.

Are some champions actually "early game" or "late game"

After looking through the columns of the main matches.csv data set some variables I noticed there was data on which champions played in the game (blue{}Champ,red{}Champ), the result of the game (bResult,rResult), the game duration (gamelength). I thought about some of the common winrate vs game time graph displayed on sites like op.gg , but I wanted to ask a slightly deeper question. Is there a significant difference in the mean game time for games that a champ wins vs. games that they lose. This could give pro teams insights into which champions to pick given a playstyle that would allow them to win early or stall out to late game.

In [7]:
roles = ['Top','Jungle','Middle','ADC','Support']
blue = ['blue{}Champ'.format(r) for r in roles]
red = ['red{}Champ'.format(r) for r in roles]
champ_gt_array = []
for i,m in matches.iterrows():
  for c in blue + red:
    champ_gt_array.append(
      [m[c],
       m['bResult'] if bool(re.match('blue',m[c])) else m['rResult'],
       m['gamelength']]
    )
champ_frame = pd.DataFrame(data=champ_gt_array,columns=['champ','win','gamelength'])
champ_frame.describe()['gamelength']
Out[7]:
count    70580.000000
mean        36.941910
std          7.923698
min         17.000000
25%         31.000000
50%         36.000000
75%         41.000000
max         81.000000
Name: gamelength, dtype: float64

I munged the data to get the champ names, whether they were on the winning team, and how long their game was. It looks like the mean game length is around 37 and slightly skewed right. Since in general it's easier for games to go longer in League of Legends this makes sense.

In [8]:
wins = champ_frame[champ_frame['win'] == 1]
loses = champ_frame[champ_frame['win'] == 0]

w_gl= wins[['win','gamelength']].groupby('gamelength').count()
w_gl.columns = ['wins']
l_gl= loses[['win','gamelength']].groupby('gamelength').count()
l_gl.columns = ['loses']

fig, axs = plt.subplots(1,2,figsize=(15,5))
w_gl.plot(kind='bar',ax=axs[0],title='Wins by game length')
l_gl.plot(kind='bar',ax=axs[1],title='Loses by game length')
Out[8]:
<matplotlib.axes._subplots.AxesSubplot at 0x111d93d30>

Given the relatively normal distribution of game length it would probably be reasonable to try to compare the means for different champions.

In [9]:
w_gl.describe()['wins']['25%'],l_gl.describe()['loses']['25%']
Out[9]:
(42.5, 50.0)

Next I wanted to make sure I had enough data on a champion for the test to return significant results, so I looked at the bottom quartile for number of games per champ and cut off the list there.

In [10]:
play_cnt = champ_frame[['win','champ']] \
  .groupby(by='champ').count() \
  .sort_values('win',ascending=False)

top_75 = play_cnt[play_cnt['win'] >= 50]
bot_25 = play_cnt[play_cnt['win'] < 50]

len(play_cnt),len(top_75)
Out[10]:
(138, 108)

108 champions still seems like a reasonable pool that would still yield a useful amount of choices, notably those that are played in pro games more frequently.

In [11]:
def process_games(champs, wins, loses):
  mean_data = []
  for champ in champs:
    cw = wins[wins['champ'] == champ]['gamelength']
    cl = loses[loses['champ'] == champ]['gamelength']
    mean_data.append([
      champ,
      cw.count()+cl.count(),
      cw.mean(),
      cl.mean(),
      abs(cw.mean()-cl.mean()),
      stats.ttest_ind(cw,cl)[1] <= 0.05
  ])

  return pd.DataFrame(
    data=mean_data,
    columns=['champ','tot_games','mean_win_gl','mean_loss_gl','abs_diff','pvalue'])


mean_frame = process_games(top_75.index,wins,loses)
mean_frame.sort_values(by='abs_diff',ascending=False)[:10]
Out[11]:
champ tot_games mean_win_gl mean_loss_gl abs_diff pvalue
96 DrMundo 91 40.891892 35.759259 5.132633 True
101 Draven 62 38.344828 33.454545 4.890282 True
104 Rammus 54 41.103448 36.880000 4.223448 False
103 Brand 54 39.115385 35.392857 3.722527 False
100 Diana 73 37.583333 33.945946 3.637387 True
86 Malphite 134 37.857143 34.901408 2.955734 True
87 Soraka 132 36.229508 33.394366 2.835142 True
98 AurelionSol 82 35.945946 38.755556 2.809610 False
97 Quinn 85 37.195652 34.641026 2.554627 False
82 Evelynn 162 39.864198 37.543210 2.320988 True

These were some interesting results. I rejected the null hypothesis that the means were different if the p value was more than 0.05. The champion that had the largest significant absolute difference was Dr. Mundo at around 5 min. The mean of his loses is below the mean time for games, but his mean for wins is greater at around 40 min. Draven is another standout. He is known for his huge damage output which may manifest in the mid game, but in most of his loses are below the mean game time, possibly singifying when he got behind early he would lose.

In [12]:
mean_frame[mean_frame.mean_win_gl > mean_frame.mean_loss_gl]['champ'].count(),\
mean_frame[mean_frame.mean_win_gl < mean_frame.mean_loss_gl]['champ'].count()
Out[12]:
(95, 13)

Many champions had longer mean game times for wins than for loses, so I decided to look at just the games that were significantly longer than the mean. I chose to zoom in on the top quartile, or those that were above 41 minutes long.

In [13]:
x_wins = wins[wins.gamelength > 41]
x_loses = loses[loses.gamelength > 41]

x_mean_frame = process_games(top_75.index,x_wins,x_loses)
sv = x_mean_frame.sort_values(by='abs_diff',ascending=False)[x_mean_frame.pvalue == True]
with sns.axes_style("white"):
  ax = plt.axes()
  g = sns.heatmap(
    sv[['mean_win_gl','mean_loss_gl']],
    annot=True,
    cmap='Greens',
    ax=ax)
  ax.set_title('Mean win and loss time for games over 41 min')
  g.set_yticklabels(sv['champ'],rotation=0)

This heat map visualizes the significant mean win and loss time for champs who had games over 41 min. These four champions all faired worse in longer games. Anecdotally this makes sense as well, given the assassain playstyle of Fizz and LeeSin and the lack of scaling of Olaf in the late game. Kalista makes an interesting appearance here which could be explained by her shorter range making her more susceptible in late game team fights.

Summary and Future Directions

Overall it was possible to show that some of champions do have a significantly longer mean game time when they win. This data could be used as the input to another, more complex model to try to determine ideal team compositions for certain strategies or situations. The main contributors to a lack of significance were a shortage of game data for the infrequently played champions and a regression to the mean game time for the most played champions. Vaiarbles I didn't account for could include changes in game objectives and champions themselves over time.

To dig deeper I would like to look at factors like gold difference or KDA. Gold difference could be used to show the potential for some champs to get ahead early or even comeback from a deficit. KDA is a good metric since it's somewhat normalized for game time (assuming kills, deaths and assists are equally as likely to occur as the game goes on). In addition looking at multiple champions together could reveal more significant or interesting trends.