GOAL:¶

Explore data from this open source collection of professional League of Legends matches.

Analysis by Alex Bisberg

%matplotlib inline
import pandas as pd
import re as re
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats
import warnings
warnings.filterwarnings('ignore')

Exploring the data¶

I started by looking at the matches data set, as it seemed to be the most comprehensive set of general data

matches = pd.read_csv('data/matches.csv',delimiter=',')
print(sorted(matches.columns))

['Address', 'League', 'Season', 'Type', 'Year', 'bBarons', 'bDragons', 'bHeralds', 'bInhibs', 'bKills', 'bResult', 'bTowers', 'blueADC', 'blueADCChamp', 'blueBans', 'blueJungle', 'blueJungleChamp', 'blueMiddle', 'blueMiddleChamp', 'blueSupport', 'blueSupportChamp', 'blueTeamTag', 'blueTop', 'blueTopChamp', 'gamelength', 'goldblue', 'goldblueADC', 'goldblueJungle', 'goldblueMiddle', 'goldblueSupport', 'goldblueTop', 'golddiff', 'goldred', 'goldredADC', 'goldredJungle', 'goldredMiddle', 'goldredSupport', 'goldredTop', 'rBarons', 'rDragons', 'rHeralds', 'rInhibs', 'rKills', 'rResult', 'rTowers', 'redADC', 'redADCChamp', 'redBans', 'redJungle', 'redJungleChamp', 'redMiddle', 'redMiddleChamp', 'redSupport', 'redSupportChamp', 'redTeamTag', 'redTop', 'redTopChamp']

This data set has most of the high level properties of a League of Legends game.

l = matches[['League','Year']].groupby('League').count().sort_values(by='Year',ascending=False)
l.columns = ['matches']
l.plot(kind='bar',rot=0,figsize=(15,4),title='Number of matches by league')

<matplotlib.axes._subplots.AxesSubplot at 0x10c0952e8>

Many of the matches are from the LCK, although NA and EU are close behind. Even some relatively obscure leagues like CBLoL and tournaments like Worlds and MSI are included.

sy = matches[['Season','Year','League']].groupby(['Year','Season']).count()
sy.columns = ['matches']
sy.plot(kind='bar',rot=0,figsize=(15,4),title='Number of matches by season')

<matplotlib.axes._subplots.AxesSubplot at 0x10c0799b0>

Matches date back to 2014 with more games from more recent seaons.

kills = pd.read_csv('data/kills.csv',delimiter=',')
kills[kills['Address']==matches['Address'][0]] \
  [['Team','Killer','Victim','Time','x_pos','y_pos']][:5]

There was some interesting data in the kills.csv set. I was especially curious about the x_pos and y_pos of the kill. I thought it would be interesting to overlay the data on summoners rift and see if something could be determined about kill location. It seemed like this may take a while and be slightly out of the scope of this project.

gold = pd.read_csv('data/gold.csv',delimiter=',')
cols = [c for c in gold.columns if c not in ['Address','Type']]
onegame = gold[(gold['Address']==matches['Address'][0]) & (gold['Type'] == 'golddiff') ]\
  [cols].T.dropna()
onegame.columns = ['gold']
onegame.plot(kind='bar',figsize=(15,4))

<matplotlib.axes._subplots.AxesSubplot at 0x1129b0dd8>

The gold.csv data set kept track of all of the gold over time which I thought could be useful information, but I would likely have to join this dataset with matches to try to analyze how champions or groups of champions performed over time in certain situations. Once again, I thought this may be slightly outside the scope of this project.

monsters, structures, and bans seemed slightly less interesting to me, and at this point I figure I could find something interesting in the matches data set itself.

Are some champions actually "early game" or "late game"¶

After looking through the columns of the main matches.csv data set some variables I noticed there was data on which champions played in the game (blue{}Champ,red{}Champ), the result of the game (bResult,rResult), the game duration (gamelength). I thought about some of the common winrate vs game time graph displayed on sites like op.gg , but I wanted to ask a slightly deeper question. Is there a significant difference in the mean game time for games that a champ wins vs. games that they lose. This could give pro teams insights into which champions to pick given a playstyle that would allow them to win early or stall out to late game.

roles = ['Top','Jungle','Middle','ADC','Support']
blue = ['blue{}Champ'.format(r) for r in roles]
red = ['red{}Champ'.format(r) for r in roles]
champ_gt_array = []
for i,m in matches.iterrows():
  for c in blue + red:
    champ_gt_array.append(
      [m[c],
       m['bResult'] if bool(re.match('blue',m[c])) else m['rResult'],
       m['gamelength']]
    )
champ_frame = pd.DataFrame(data=champ_gt_array,columns=['champ','win','gamelength'])
champ_frame.describe()['gamelength']

count    70580.000000
mean        36.941910
std          7.923698
min         17.000000
25%         31.000000
50%         36.000000
75%         41.000000
max         81.000000
Name: gamelength, dtype: float64

I munged the data to get the champ names, whether they were on the winning team, and how long their game was. It looks like the mean game length is around 37 and slightly skewed right. Since in general it's easier for games to go longer in League of Legends this makes sense.

wins = champ_frame[champ_frame['win'] == 1]
loses = champ_frame[champ_frame['win'] == 0]

w_gl= wins[['win','gamelength']].groupby('gamelength').count()
w_gl.columns = ['wins']
l_gl= loses[['win','gamelength']].groupby('gamelength').count()
l_gl.columns = ['loses']

fig, axs = plt.subplots(1,2,figsize=(15,5))
w_gl.plot(kind='bar',ax=axs[0],title='Wins by game length')
l_gl.plot(kind='bar',ax=axs[1],title='Loses by game length')

<matplotlib.axes._subplots.AxesSubplot at 0x111d93d30>

Given the relatively normal distribution of game length it would probably be reasonable to try to compare the means for different champions.

w_gl.describe()['wins']['25%'],l_gl.describe()['loses']['25%']

(42.5, 50.0)

Next I wanted to make sure I had enough data on a champion for the test to return significant results, so I looked at the bottom quartile for number of games per champ and cut off the list there.

play_cnt = champ_frame[['win','champ']] \
  .groupby(by='champ').count() \
  .sort_values('win',ascending=False)

top_75 = play_cnt[play_cnt['win'] >= 50]
bot_25 = play_cnt[play_cnt['win'] < 50]

len(play_cnt),len(top_75)

(138, 108)

108 champions still seems like a reasonable pool that would still yield a useful amount of choices, notably those that are played in pro games more frequently.

def process_games(champs, wins, loses):
  mean_data = []
  for champ in champs:
    cw = wins[wins['champ'] == champ]['gamelength']
    cl = loses[loses['champ'] == champ]['gamelength']
    mean_data.append([
      champ,
      cw.count()+cl.count(),
      cw.mean(),
      cl.mean(),
      abs(cw.mean()-cl.mean()),
      stats.ttest_ind(cw,cl)[1] <= 0.05
  ])

  return pd.DataFrame(
    data=mean_data,
    columns=['champ','tot_games','mean_win_gl','mean_loss_gl','abs_diff','pvalue'])


mean_frame = process_games(top_75.index,wins,loses)
mean_frame.sort_values(by='abs_diff',ascending=False)[:10]

These were some interesting results. I rejected the null hypothesis that the means were different if the p value was more than 0.05. The champion that had the largest significant absolute difference was Dr. Mundo at around 5 min. The mean of his loses is below the mean time for games, but his mean for wins is greater at around 40 min. Draven is another standout. He is known for his huge damage output which may manifest in the mid game, but in most of his loses are below the mean game time, possibly singifying when he got behind early he would lose.

mean_frame[mean_frame.mean_win_gl > mean_frame.mean_loss_gl]['champ'].count(),\
mean_frame[mean_frame.mean_win_gl < mean_frame.mean_loss_gl]['champ'].count()

(95, 13)

Many champions had longer mean game times for wins than for loses, so I decided to look at just the games that were significantly longer than the mean. I chose to zoom in on the top quartile, or those that were above 41 minutes long.

x_wins = wins[wins.gamelength > 41]
x_loses = loses[loses.gamelength > 41]

x_mean_frame = process_games(top_75.index,x_wins,x_loses)
sv = x_mean_frame.sort_values(by='abs_diff',ascending=False)[x_mean_frame.pvalue == True]
with sns.axes_style("white"):
  ax = plt.axes()
  g = sns.heatmap(
    sv[['mean_win_gl','mean_loss_gl']],
    annot=True,
    cmap='Greens',
    ax=ax)
  ax.set_title('Mean win and loss time for games over 41 min')
  g.set_yticklabels(sv['champ'],rotation=0)

This heat map visualizes the significant mean win and loss time for champs who had games over 41 min. These four champions all faired worse in longer games. Anecdotally this makes sense as well, given the assassain playstyle of Fizz and LeeSin and the lack of scaling of Olaf in the late game. Kalista makes an interesting appearance here which could be explained by her shorter range making her more susceptible in late game team fights.

Summary and Future Directions¶

Overall it was possible to show that some of champions do have a significantly longer mean game time when they win. This data could be used as the input to another, more complex model to try to determine ideal team compositions for certain strategies or situations. The main contributors to a lack of significance were a shortage of game data for the infrequently played champions and a regression to the mean game time for the most played champions. Vaiarbles I didn't account for could include changes in game objectives and champions themselves over time.

To dig deeper I would like to look at factors like gold difference or KDA. Gold difference could be used to show the potential for some champs to get ahead early or even comeback from a deficit. KDA is a good metric since it's somewhat normalized for game time (assuming kills, deaths and assists are equally as likely to occur as the game goes on). In addition looking at multiple champions together could reveal more significant or interesting trends.

	Team	Killer	Victim	Time	x_pos	y_pos
0	bKills	TSM Bjergsen	C9 Hai	10.820	9229	8469
1	bKills	TSM WildTurtle	C9 LemonNation	16.286	13908	5960
2	bKills	TSM Bjergsen	C9 Hai	18.733	8281	7889
3	bKills	TSM Dyrus	C9 Meteos	18.880	8522	8895
4	bKills	TSM Bjergsen	C9 Balls	27.005	8902	7643

	champ	tot_games	mean_win_gl	mean_loss_gl	abs_diff	pvalue
96	DrMundo	91	40.891892	35.759259	5.132633	True
101	Draven	62	38.344828	33.454545	4.890282	True
104	Rammus	54	41.103448	36.880000	4.223448	False
103	Brand	54	39.115385	35.392857	3.722527	False
100	Diana	73	37.583333	33.945946	3.637387	True
86	Malphite	134	37.857143	34.901408	2.955734	True
87	Soraka	132	36.229508	33.394366	2.835142	True
98	AurelionSol	82	35.945946	38.755556	2.809610	False
97	Quinn	85	37.195652	34.641026	2.554627	False
82	Evelynn	162	39.864198	37.543210	2.320988	True