IPython HPC Training, Mar 16, 2016

Materials

IPython notebook and data files

The rendered html of the notebook is here

Run IPython notebook on HPC (already included in the upload-materials.zip file)

Paper:

title: The game story space of professional sports: Australian Rules Football

author: Dilan Patrick Kiley et al., Vermont Complex Systems Center

download link: http://arxiv.org/pdf/1507.03886v1.pdf

Questions:

  • Which is the best/worst team?
  • Who is the best/worst player?
  • Pattern of games
  • Which game is most interesting/boring?
  • Number of plays vs. score

Get Started

  1. Download upload-materials.zip and unzip it on your local file system.

  2. Copy it to the HPC cluster. From the local command line:

    scp upload-materials.zip username@philip.hpc.lsu.edu:~/
    

    or use data transfer software such as FileZilla

  3. From your local computer, cd to the upload-materials/ directory, edit the ‘setalias’ script, and change the username to yours.

  4. Source the script:

    . setalias
    
  5. Run the sshcluster alias:

    sshcluster
    
  6. Now you are on the remote computer. If you copied the zip file to the remote computer earlier, unzip it first. cd to the upload-materials/ directory, edit the ‘run-setup’ script, and change the job running time from 30 minutes to 90 minutes. Then source the setup script:

    . run-setup
    
  7. Run the ipynbhpc-philip script:

    ./ipynbhpc-philip
    
  8. The IPython notebook server should now be running on philip, with the ssh tunnels set up.

  9. From your local computer, open a web browser and go to localhost:7999. You should see the remote directory’s tree structure.

  10. After you are done, go back to the terminal where you started the job (it should be on philip1) and press Ctrl+C to stop.

Introducing Pandas

  • Pandas data processing functions:
    1. DataFrame()
    2. pivot_table(), groupby()
    3. plot()
    4. sort() (sort_values() in pandas 0.17 and later)
    5. add()
    6. concat()
    7. merge()
    8. copy()
    9. ix, iloc (indexers, used with square brackets; ix was removed in pandas 1.0)
  • Pandas DataFrame/Series attributes: empty, index, columns, dtype(s), shape, size, data, values...
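
As a quick orientation, the sketch below exercises several of the functions listed above on a small made-up table. The team names, column names, and values are hypothetical and are not taken from the training data; the code assumes a recent pandas, where sort_values() is used in place of sort().

```python
import pandas as pd

# Hypothetical toy data: one row per scoring event.
events = pd.DataFrame({
    "team": ["TeamA", "TeamA", "TeamB", "TeamB", "TeamB"],
    "round": [1, 2, 1, 1, 2],
    "score": [6, 1, 6, 6, 1],
})

# groupby(): total score per team.
team_totals = events.groupby("team")["score"].sum()

# pivot_table(): teams as rows, rounds as columns, summed scores as values.
by_round = events.pivot_table(index="team", columns="round",
                              values="score", aggfunc="sum", fill_value=0)

# sort_values(): order teams by total score, highest first.
ranked = team_totals.sort_values(ascending=False)

# concat(): stack two DataFrames that share the same columns.
extra = pd.DataFrame({"team": ["TeamA"], "round": [3], "score": [6]})
all_events = pd.concat([events, extra], ignore_index=True)

# merge(): join on a shared key column (the 'group' labels are made up).
groups = pd.DataFrame({"team": ["TeamA", "TeamB"], "group": ["X", "Y"]})
with_group = all_events.merge(groups, on="team", how="left")

# iloc: position-based indexing; shape and columns are attributes.
first_row = with_group.iloc[0]
print(with_group.shape, list(with_group.columns))
print(ranked)
```

In a notebook, `ranked.plot(kind="bar")` would draw the totals, provided Matplotlib is installed.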

Step 1: Introduce the data mining code

  • IPython notebook: readformdata.ipynb
  • Code description: provide a year, mine stats data for that year
    1. find all games and save their stats links
    2. for each game, follow its stats link and download the progression data
  • Data saved:
    1. allgames_[year].txt
    2. allrounds_[year].txt
    3. allstats_[year].txt
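
The notebook itself is not reproduced here, so the following is only a rough sketch of the first mining step (collecting each game's stats link from a season page). The HTML layout, the "Match stats" link label, the URLs, and the StatsLinkParser class are all assumptions made for illustration; the real notebook may use a different site structure and different tools. To stay self-contained it parses an inline string instead of downloading a page.

```python
from html.parser import HTMLParser


class StatsLinkParser(HTMLParser):
    """Collect href values of <a> tags whose link text equals a label."""

    def __init__(self, label="Match stats"):
        super().__init__()
        self.label = label
        self.links = []
        self._href = None  # href of the <a> tag currently open, if any

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")

    def handle_data(self, data):
        if self._href and data.strip() == self.label:
            self.links.append(self._href)

    def handle_endtag(self, tag):
        if tag == "a":
            self._href = None


# Hypothetical season page; a real run would download this HTML first.
season_html = """
<table>
  <tr><td>Round 1</td><td><a href="/stats/2014/g1.html">Match stats</a></td></tr>
  <tr><td>Round 1</td><td><a href="/teams/a.html">Team A</a></td></tr>
  <tr><td>Round 2</td><td><a href="/stats/2014/g2.html">Match stats</a></td></tr>
</table>
"""

parser = StatsLinkParser()
parser.feed(season_html)
parser.close()

# Step 1: keep the stats links, one per line (held in memory here).
allgames_text = "\n".join(parser.links)
print(allgames_text)
```

The second step would loop over these links, fetch each page, and append the progression rows to the stats file.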

Step 2: Load and visualize all games data

  • IPython notebook : AFL_stats_analysis.ipynb
  • Data used: allrounds_[year].txt
  • Load into Pandas DataFrame using read_csv() function
  • Build team performance data into Pandas DataFrame
  • Plot using pandas plot() and Matplotlib ImageGrid()
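
A minimal sketch of the load-and-summarise part of this step follows. The layout of allrounds_[year].txt is not shown in these notes, so the comma-separated format and the column names (round, team1, score1, team2, score2) are assumptions, as are the sample rows. The code needs pandas 0.25 or later for the named aggregation syntax.

```python
import io
import pandas as pd

# Hypothetical file contents; the real allrounds_[year].txt may differ.
raw = io.StringIO(
    "round,team1,score1,team2,score2\n"
    "1,TeamA,90,TeamB,75\n"
    "1,TeamC,60,TeamD,60\n"
    "2,TeamB,101,TeamC,88\n"
    "2,TeamD,70,TeamA,95\n"
)
games = pd.read_csv(raw)

# Reshape to one row per (team, game) so both sides of a game are counted.
keep = ["round", "team", "scored", "conceded"]
side1 = games.rename(columns={"team1": "team", "score1": "scored",
                               "score2": "conceded"})[keep]
side2 = games.rename(columns={"team2": "team", "score2": "scored",
                               "score1": "conceded"})[keep]
per_team = pd.concat([side1, side2], ignore_index=True)
per_team["win"] = per_team["scored"] > per_team["conceded"]

# Team performance table: games played, wins, and total points scored.
performance = (per_team.groupby("team")
               .agg(games=("round", "count"),
                    wins=("win", "sum"),
                    total_scored=("scored", "sum"))
               .sort_values(["wins", "total_scored"], ascending=False))
print(performance)
```

With Matplotlib available, `performance["total_scored"].plot(kind="bar")` gives a quick bar chart of the result.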

Step 3: Load, process, and visualize all stats data

  • IPython notebook : AFL_stats_analysis.ipynb
  • Data used: allstats_[year].txt
  • Clean up match stats, save into allstats_ex_finals_[year].csv, allstats_finals_[year].csv
  • Load into Pandas DataFrame, columns = ['player1', 'score1', 'time', 'player2', 'score2', 'team1', 'team2', 'round', 'game']
  • Plot multiple games as subplots in a grid using the Matplotlib subplot2grid() function
  • You can choose to plot all games in a round, all games where a given team appears in team1, all finals games, etc.
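
The grid plotting described above might look roughly like the sketch below. The sample rows are invented, and two things are assumed rather than known from the notes: that score1 and score2 hold running totals for each side, and that time is a number that increases through the game. The plotted quantity (score margin over time) is one possible choice, not necessarily what the notebook draws.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

columns = ["player1", "score1", "time", "player2", "score2",
           "team1", "team2", "round", "game"]
# Hypothetical rows: score1/score2 are assumed to be running totals.
rows = [
    ["P1", 6, 2.0, "", 0, "TeamA", "TeamB", 1, 1],
    ["", 6, 9.5, "P7", 6, "TeamA", "TeamB", 1, 1],
    ["P2", 12, 15.0, "", 6, "TeamA", "TeamB", 1, 1],
    ["", 0, 4.0, "P9", 1, "TeamC", "TeamD", 1, 2],
    ["P5", 6, 11.0, "", 1, "TeamC", "TeamD", 1, 2],
    ["P3", 6, 3.0, "", 0, "TeamA", "TeamC", 2, 3],
]
stats = pd.DataFrame(rows, columns=columns)

# Select the games to show, here everything in round 1.
selected = stats[stats["round"] == 1]
game_ids = sorted(selected["game"].unique())

ncols = 2
nrows = -(-len(game_ids) // ncols)  # ceiling division
fig = plt.figure(figsize=(4 * ncols, 3 * nrows))
axes = []
for i, game_id in enumerate(game_ids):
    # One cell of the grid per game, filled row by row.
    ax = plt.subplot2grid((nrows, ncols), (i // ncols, i % ncols), fig=fig)
    game = selected[selected["game"] == game_id].sort_values("time")
    ax.step(game["time"], game["score1"] - game["score2"], where="post")
    ax.axhline(0, color="gray", linewidth=0.5)
    ax.set_title("{} vs {}".format(game["team1"].iloc[0],
                                   game["team2"].iloc[0]))
    axes.append(ax)
fig.tight_layout()
```

Changing the filter to `stats[stats["team1"] == "TeamA"]` would instead show every game with that team in the team1 column.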

Player performance study

  • Background (from http://auspost.com.au/education/afl/students/afl-basics.html): Each AFL club has a list of players. Each week, 22 are selected for the game. Eighteen players are allowed on the ground at any one time and the other four sit on the bench while waiting for their turn to play.
  • Is there a relationship between top players and top teams? To answer this question, we want to construct a DataFrame with player, playerscore, team, and teamscore. Below are the steps:
    1. Find the number of players who scored in each team
    2. Plot the score distribution for players in each team
    3. List all players who scored and sum their scores over all games; construct a DataFrame with columns=['player', 'playerscore'] and sort the players by score
    4. Figure out each player’s team, add a ‘team’ column to DataFrame
    5. Construct another DataFrame, columns=[ ‘team’, ‘totalteamscore’, ‘rank’]. ‘rank’ is based on sorted total team score
    6. Merge the two DataFrames with an outer join (see http://chrisalbon.com/python/pandas_join_merge_dataframe.html)
    7. The new DataFrame has columns [‘player’, ‘playerscore’, ‘team’, ‘teamscore’, ‘rank’]
    8. Plot playerscore and teamscore as a 2D scatter plot, a scatter matrix plot, and a parallel coordinates plot
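
The steps above can be sketched as follows. For brevity this assumes a simplified, hypothetical input table with one row per scoring event and the points for that event, which is not the format of the stats file itself; player names, teams, and points are invented. The team total column is named teamscore here to match the merged result in step 7.

```python
import pandas as pd

# Hypothetical scoring events: one row per score, with that event's points.
events = pd.DataFrame({
    "player": ["P1", "P2", "P1", "P3", "P4", "P3", "P5"],
    "team":   ["TeamA", "TeamA", "TeamA", "TeamB", "TeamB", "TeamB", "TeamC"],
    "points": [6, 1, 6, 6, 1, 1, 1],
})

# Step 1: number of distinct scoring players per team.
scorers_per_team = events.groupby("team")["player"].nunique()

# Steps 3-4: total score per player, sorted, keeping each player's team.
players = (events.groupby(["player", "team"], as_index=False)["points"].sum()
           .rename(columns={"points": "playerscore"})
           .sort_values("playerscore", ascending=False))

# Step 5: total score per team, ranked (1 = highest total).
teams = (events.groupby("team", as_index=False)["points"].sum()
         .rename(columns={"points": "teamscore"}))
teams["rank"] = (teams["teamscore"]
                 .rank(ascending=False, method="min").astype(int))

# Steps 6-7: outer join on the shared 'team' column, then order the columns.
merged = players.merge(teams, on="team", how="outer")
merged = merged[["player", "playerscore", "team", "teamscore", "rank"]]
print(merged)
```

For step 8, `merged.plot.scatter(x="teamscore", y="playerscore")` draws the 2D scatter plot, and pandas.plotting provides scatter_matrix() and parallel_coordinates() for the other two views; all three need Matplotlib.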