Time of Possession: technical aspects

This blog post summarizes the technical details of how the data and graphics in Analyzing Time of Possession in 7s were generated. Writing that post was an educational exercise in getting familiar with the Python statistical programming ecosystem; up until this point, most of the analysis work at Starting 7s had been conducted in the R programming language. The exercise was undertaken in the spirit of educational technologist Seymour Papert*, who famously said, “You can’t think seriously about thinking without thinking about thinking about something.” Likewise, you can’t seriously learn a new tool without using it to do something.

The remainder of this blog post describes the tools and techniques used to conduct the possession analysis in Python.

* Perhaps not coincidentally, Seymour Papert was South African and occasionally used rugby examples to animate his thought experiments.

Environment

In previous posts the data and graphics were generated with the R statistical programming language; for this post the work was done exclusively in Python. The environment was managed with Miniconda, and the relevant package versions are listed below:

(rugby7s)$ python --version
Python 2.7.12 :: Continuum Analytics, Inc.

(rugby7s)$ conda list -n rugby7s
# packages in environment at /Users/user/miniconda2/envs/rugby7s:
libpng                    1.6.22                        0 
matplotlib                1.5.3               np111py27_1 
mkl                       11.3.3                        0 
numpy                     1.11.2                   py27_0  
pandas                    0.19.1              np111py27_0 
python                    2.7.12                        1 
scipy                     0.18.1              np111py27_0

Data Acquisition

The source data was gathered from the match reports for all matches of the 2014-2015 season. The match reports were provided in unstructured PDF format; an example match report can be found here. A Python script was used to extract the time of possession data from the match reports. The code can be found here.

The script was divided into two functions. The first function converted the PDF to text, using the PDFMiner Python module, and removed the page header and footer for easier processing. The second function reformatted the possession data into a single string of comma-separated values and wrote it to a file. Because the PDF match reports were not consistently formatted, some conditional logic was needed to produce consistent text output.
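
For reference, a minimal sketch of the PDF-to-text step using PDFMiner's programmatic API might look like the following (Python 2, to match the environment above). This is illustrative only; the actual script linked above also strips the header and footer and parses the possession lines.

from cStringIO import StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage

def pdf_to_text(path):
    # Render every page of a match report PDF to plain text.
    rsrcmgr = PDFResourceManager()
    outfp = StringIO()
    device = TextConverter(rsrcmgr, outfp, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    with open(path, 'rb') as f:
        for page in PDFPage.get_pages(f):
            interpreter.process_page(page)
    device.close()
    text = outfp.getvalue()
    outfp.close()
    return text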

A Bash driver script iterated over all of the files, converting each from PDF to text, storing the output in files with a common naming convention, and combining the results into a single CSV document for analysis. The final CSV file had the following format:

Match,Event,T1,T1p,T1t,T2,T2p,T2t
1,Dubai,SOUTH AFRICA,36,3:29,PORTUGAL,0,1:53
2,Dubai,CANADA,19,3:17,WALES,14,3:51
3,Dubai,FIJI,54,3:38,FRANCE,7,2:21
4,Dubai,ARGENTINA,17,3:58,BRAZIL,5,2:26
5,Dubai,SCOTLAND,21,3:58,SAMOA,14,3:47
6,Dubai,NEW ZEALAND,36,2:38,JAPAN,0,3:36
7,Dubai,ENGLAND,19,4:04,USA,10,2:10
8,Dubai,AUSTRALIA,29,4:07,KENYA,12,3:11
9,Dubai,SOUTH AFRICA,24,1:58,CANADA,12,4:40

Each line of the CSV file included the match number, the event, and the name, points, and time of possession for each team. For all matches the team with the higher score was designated as team one, with the exception of ties, where the designation was arbitrary.
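
Loading the combined CSV into a pandas DataFrame is then a one-liner; the file name below is a placeholder, but the resulting fp DataFrame is the one referenced in the next section.

import pandas as pd

# Read the combined match data ('possession.csv' is a placeholder name).
fp = pd.read_csv('possession.csv')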

Data Transformation and Analysis

A helper function was added to convert the time of possession from min:sec format to seconds:

def to_seconds(mmss):
    # Convert a "min:sec" string (e.g. "3:29") into total seconds.
    seconds = 0
    for sect in mmss.split(':'):
        seconds = seconds * 60 + int(sect)
    return seconds

The team one and team two points and possession columns were then combined into single series for easier processing:

# convert min:sec columns to sec
fp[['T1t','T2t']] = fp[['T1t','T2t']].applymap(to_seconds)

# combine both teams to single data series
p = fp.T1p.append(fp.T2p)
t = fp.T1t.append(fp.T2t)

The analysis was done using pandas, NumPy, and SciPy. Pandas is a specialized Python library for data analysis. Most of the generic statistical functions were taken from NumPy, with the exception of the Pearson correlation coefficient, which came from the SciPy stats module.
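
As an illustration, the calls involved look roughly like the following (the exact statistics reported in the possession post may differ; p and t are the series built above):

import numpy as np
from scipy.stats import pearsonr

# Descriptive statistics for points and time of possession (seconds).
print np.mean(p), np.median(p), np.std(p)
print np.mean(t), np.median(t), np.std(t)

# Pearson correlation between time of possession and points scored.
r, p_value = pearsonr(t, p)
print r, p_value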

Plotting

The graphics were created with Matplotlib's pyplot interface using the ggplot style.
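
Enabling the ggplot style sheet that ships with Matplotlib takes a single call before plotting:

import matplotlib.pyplot as plt

# Use the ggplot-inspired style sheet bundled with Matplotlib (1.4 and later).
plt.style.use('ggplot')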

The Tufte-inspired dot-dash graphs (also called dash plots) were drawn with the etframes package, located on GitHub. In a dot-dash graph, dashes along each axis mark the marginal distribution of the data points.
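
The snippet below shows roughly how etframes fits in; the add_dot_dash_plot name and signature are an assumption based on the package's README, so check the source before copying. The p and t series are the ones built earlier.

import matplotlib.pyplot as plt
import etframes

# Scatter the data, then ask etframes to add the dot-dash axis marks
# (function name per the etframes README; treat as an assumption).
fig, ax = plt.subplots()
ax.scatter(t, p)
etframes.add_dot_dash_plot(ax, xs=t, ys=p)
plt.show()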

Rugby points are scored in increments of 3, 5, or 7, which caused many overlapping data points in the dot-dash graphs. A helper function found on Stack Overflow improved visibility by adding slight jitter to the data points:

import numpy as np
import matplotlib.pyplot as plt

def rand_jitter(arr):
    # Add Gaussian noise scaled to 1% of the data range (fixed seed for reproducibility).
    np.random.seed(1234)
    stdev = .01 * (max(arr) - min(arr))
    return arr + np.random.randn(len(arr)) * stdev

def jitter(x, y, s=20, c='b', marker='o', cmap=None, norm=None, vmin=None,
           vmax=None, alpha=None, linewidths=None, verts=None, hold=None, **kwargs):
    # Thin wrapper around plt.scatter that jitters both coordinates.
    return plt.scatter(rand_jitter(x), rand_jitter(y),
                       s=s, c=c, marker=marker, cmap=cmap, norm=norm,
                       vmin=vmin, vmax=vmax, alpha=alpha, linewidths=linewidths,
                       verts=verts, hold=hold, **kwargs)
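
A call to jitter then mirrors plt.scatter; for example, plotting possession time against points scored:

# Jittered scatter of time of possession (seconds) against points scored.
jitter(t, p, alpha=0.5)
plt.xlabel('Time of possession (seconds)')
plt.ylabel('Points scored')
plt.show()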

Summary

The time of possession blog post was written primarily as an exercise to get familiar with Python statistical analysis tools. Most of the previous work at Starting 7s was done in the R statistical programming environment using the RStudio IDE and published on RPubs. The R packages for analysis, data transformation, and plotting are more tightly integrated, so working in the R environment is more efficient.

However, the Python environment is extensible and flexible, and the Python community is large and very active, which means plenty of support is available online for problems such as extracting text from unstructured PDFs, adding jitter to data points, and finding template functions for Tufte-inspired graphics.

I hope this write-up is helpful to anyone conducting similar work. Please ping me at jliberman@utexas.edu with any suggestions, corrections, or comments.

