IPython HPC Training, Mar 16, 2016

Materials

The rendered HTML of the notebook is here

Run IPython notebook on HPC (already included in the upload-materials.zip file)

title: The game story space of professional sports: Australian Rules Football

author: Dilan Patrick Kiley, Vermont Complex Systems Center, etc.

Download upload-materials.zip and unzip it on your local file system.
Copy the zip file to the HPC cluster. From the local command line:
```
scp upload-materials.zip username@philip.hpc.lsu.edu:~/
```
or use data transfer software such as FileZilla.
On the local computer, cd to the upload-materials/ directory, edit the ‘setalias’ script, and change the username to yours.
Source the script:
```
. setalias
```
Run the sshcluster alias:
```
sshcluster
```
Now you are on the remote computer. If you copied the zip file to the remote computer earlier, unzip it first. cd to the upload-materials/ directory, edit the ‘run-setup’ script, and change the job running time from 30 minutes to 90 minutes. Source the setup script:
```
. run-setup
```
Run the ipynbhpc-philip script:
```
./ipynbhpc-philip
```
The IPython notebook server should now be running on philip, with the ssh tunnels set up.
On your local computer, open a web browser and go to localhost:7999. You should see the remote directory’s tree structure.
When you are done, go back to the terminal where you started the job (it should be on philip1) and press Ctrl+C to stop the server.

Pandas data processing functions:
1. DataFrame()
2. pivot_table(), groupby()
3. plot()
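4. sort() (deprecated in pandas 0.17 and later removed; use sort_values() or sort_index() in current versions)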
5. add()
6. concat()
7. merge()
8. copy()
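9. ix, iloc (indexers used with square brackets, e.g. df.iloc[0], not called as functions; ix was removed in pandas 1.0, use loc or iloc instead)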
Pandas DataFrame/Series attributes: empty, index, columns, dtype(s), shape, size, data, values...
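
As a quick illustration of the functions listed above, here is a minimal sketch on toy data. The frame contents and the column names (team, round, score, state) are made up for the example and do not come from the workshop notebooks; sort_values() and loc stand in for the older sort() and ix. plot() is left out so the snippet runs without a plotting backend.

```python
import pandas as pd

# Toy data: values and column names are illustrative assumptions only.
scores = pd.DataFrame({
    "team": ["A", "A", "B", "B"],
    "round": [1, 2, 1, 2],
    "score": [80, 95, 70, 102],
})

# pivot_table(): one row per team, one column per round
by_round = scores.pivot_table(index="team", columns="round", values="score")

# groupby(): total score per team
totals = scores.groupby("team")["score"].sum()

# sort_values(): current replacement for the older sort()
ranked = scores.sort_values("score", ascending=False)

# add(): element-wise addition, aligned on the index labels
bonus = pd.Series({"A": 5, "B": 10})
adjusted = totals.add(bonus)

# concat(): stack two frames vertically
extra = pd.DataFrame({"team": ["C"], "round": [1], "score": [88]})
combined = pd.concat([scores, extra], ignore_index=True)

# merge(): join two frames on a shared column
homes = pd.DataFrame({"team": ["A", "B"], "state": ["VIC", "NSW"]})
merged = scores.merge(homes, on="team", how="left")

# copy(): changes to the copy do not touch the original
backup = scores.copy()
backup.loc[0, "score"] = 0

# iloc selects by position, loc by label or boolean mask
first_row = scores.iloc[0]
team_a = scores.loc[scores["team"] == "A"]

# A few of the attributes mentioned above
print(scores.shape, scores.size, scores.empty)
print(list(scores.columns))
```

In a notebook, displaying by_round, ranked, or merged in separate cells is an easy way to see what each call returns.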

IPython notebook: readformdata.ipynb
Code description: given a year, mine the stats data for that year
1. find all games and save their stats links
2. for each game, follow its stats link and download the progression data
Data saved:
1. allgames_[year].txt
2. allrounds_[year].txt
3. allstats_[year].txt

IPython notebook: AFL_stats_analysis.ipynb
Data used: allstats_[year].txt
Clean up the match stats and save them into allstats_ex_finals_[year].csv and allstats_finals_[year].csv
Load into a Pandas DataFrame with columns = [‘player1’, ‘score1’, ‘time’, ‘player2’, ‘score2’, ‘team1’, ‘team2’, ‘round’, ‘game’]
Plot multiple games as subplots in a grid, using the matplotlib subplot2grid() function
You can choose to plot all games in a round, all games in which a given team appears as team1, all finals games, etc.
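
The grid layout can be sketched as follows. This is only an illustration, not the notebook's code: the events frame is made up, it uses just four of the columns listed above, and it assumes score1 and score2 are running totals for each side, which the source does not state. The Agg backend is chosen so the sketch also runs on a node without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, no display needed
import matplotlib.pyplot as plt
import pandas as pd

# Made-up scoring events; score1/score2 are ASSUMED to be running totals.
events = pd.DataFrame({
    "game":   [1, 1, 1, 2, 2, 2, 3, 3],
    "time":   [5, 20, 40, 10, 30, 50, 15, 45],
    "score1": [6, 6, 12, 1, 7, 7, 6, 12],
    "score2": [0, 6, 6, 0, 0, 6, 1, 1],
})

games = sorted(events["game"].unique())
ncols = 2
nrows = (len(games) + ncols - 1) // ncols  # ceiling division

fig = plt.figure(figsize=(4 * ncols, 3 * nrows))
axes = []
for i, game in enumerate(games):
    # Place panel i at (row, col) of an nrows x ncols grid
    ax = plt.subplot2grid((nrows, ncols), (i // ncols, i % ncols), fig=fig)
    g = events[events["game"] == game]
    # Score difference over time, drawn as a step function
    ax.step(g["time"], g["score1"] - g["score2"], where="post")
    ax.axhline(0, color="grey", linewidth=0.5)
    ax.set_title("game {}".format(game))
    axes.append(ax)

fig.tight_layout()
```

To plot one round or one team instead, filter events before the loop, for example on the ‘round’ or ‘team1’ column of the real data. In a notebook the figure displays inline; on a batch node, fig.savefig() with a file name of your choice writes it to disk.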

AFL basics (from http://auspost.com.au/education/afl/students/afl-basics.html): Each AFL club has a list of players. Each week, 22 are selected for the game. Eighteen players are allowed on the ground at any one time and the other four sit on the bench while waiting for their turn to play.
Is there a relationship between top players and top teams? To answer this question, we want to construct a DataFrame with player, playerscore, team, and teamscore. The steps are:
1. Find the number of players who scored for each team
2. Plot the score distribution for the players in each team
3. List all players who scored and sum their scores over all games; construct a DataFrame with columns=[‘player’, ‘playerscore’] and sort the players by score
4. Work out each player’s team and add a ‘team’ column to the DataFrame
5. Construct another DataFrame with columns=[‘team’, ‘teamscore’, ‘rank’], where ‘teamscore’ is the team’s total score and ‘rank’ is based on the sorted total team score
6. Merge the two DataFrames with an outer join (see http://chrisalbon.com/python/pandas_join_merge_dataframe.html)
7. The new DataFrame has columns [‘player’, ‘playerscore’, ‘team’, ‘teamscore’, ‘rank’]
8. Plot playerscore and teamscore as a 2D scatter plot, a scatter matrix plot, and a parallel coordinates plot
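
Steps 3 to 7 can be sketched on a toy table as follows. The goals frame, the player and team labels, and the intermediate ‘points’ column are hypothetical; the real notebook derives these from the cleaned match stats.

```python
import pandas as pd

# Hypothetical per-event table: who scored, for which team, how many points.
goals = pd.DataFrame({
    "player": ["P1", "P2", "P1", "P3", "P4", "P3"],
    "team":   ["A",  "A",  "A",  "B",  "B",  "B"],
    "points": [6, 1, 6, 6, 1, 1],
})

# Steps 3-4: total score per player, keeping the team, sorted by score
players = (
    goals.groupby(["player", "team"], as_index=False)["points"].sum()
    .rename(columns={"points": "playerscore"})
    .sort_values("playerscore", ascending=False)
)

# Step 5: total score per team, with rank 1 for the highest total
teams = (
    goals.groupby("team", as_index=False)["points"].sum()
    .rename(columns={"points": "teamscore"})
)
teams["rank"] = teams["teamscore"].rank(ascending=False, method="min").astype(int)

# Steps 6-7: outer join on 'team', then order the columns as in step 7
combined = players.merge(teams, on="team", how="outer")
combined = combined[["player", "playerscore", "team", "teamscore", "rank"]]
print(combined)
```

For step 8, combined.plot.scatter(x="playerscore", y="teamscore") gives the 2D scatter plot, and pandas.plotting provides scatter_matrix() and parallel_coordinates() for the other two plots.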