This blog post summarizes the technical details of how the data and graphics in Analyzing time of Possession in 7s were generated. Writing that blog post was an educational exercise to get familiar with the Python statistical programming ecosystem. Up until this point, most of the analysis work at Starting 7s was conducted in the R programming language. Inspired by the words of educational technologist Seymour Papert*, who famously said “You can’t think seriously about thinking without thinking about thinking about something,” this analysis and blog post were conducted in a similar spirit. You can’t seriously learn to use a new tool without learning to use the new tool to do something.
The remainder of this blog post describes the tools and techniques used to conduct the possession analysis in Python.
* Perhaps not coincidentally, Seymour Papert was South African and occasionally used rugby examples to animate his thought experiments.
In previous posts the data and graphics were generated using the R statistical programming language. For this post the work was done exclusively with Python. The environment was managed with Miniconda and the package versions are listed below:
(rugby7s)$ python --version
Python 2.7.12 :: Continuum Analytics, Inc.
(rugby7s)$ conda list -n rugby7s
# packages in environment at /Users/user/miniconda2/envs/rugby7s:
libpng                    1.6.22                        0
matplotlib                1.5.3               np111py27_1
mkl                       11.3.3                        0
numpy                     1.11.2                   py27_0
pandas                    0.19.1              np111py27_0
python                    2.7.12                        1
scipy                     0.18.1              np111py27_0
The source data was gathered from match reports of all matches from the 2014-2015 season. The match reports were provided in unstructured PDF format. An example match report can be found here. A Python script was used to extract the time of possession data from the match reports. The code can be found here.
The script was divided into two functions. The first function converted the PDF to text using the PDFMiner Python module and stripped the page header and footer for easier processing. The second function reformatted the possession data into a single string of comma-separated values and wrote it to a file. Because the PDF match reports were not consistently formatted, some conditional logic was needed to produce consistent text output.
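As a rough illustration of the second function's final step, here is a minimal sketch; the function name and field list are my own assumptions, since the post does not show the intermediate format the extraction produced:

```python
def format_record(match_no, event, t1, t1_pts, t1_time, t2, t2_pts, t2_time):
    # Hypothetical helper: join the fields extracted from one match report
    # into a single comma-separated record matching the final CSV layout.
    fields = [match_no, event, t1, t1_pts, t1_time, t2, t2_pts, t2_time]
    return ",".join(str(f) for f in fields)

print(format_record(1, "Dubai", "SOUTH AFRICA", 36, "3:29", "PORTUGAL", 0, "1:53"))
# 1,Dubai,SOUTH AFRICA,36,3:29,PORTUGAL,0,1:53
```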
A BASH driver script incremented over all of the files, converting them from PDF to text, storing the output in files with a common naming convention, and combining the results into a single CSV document for analysis. The final CSV file had the following format:
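The combining step the driver performed can also be sketched in Python; the glob pattern and file names here are hypothetical stand-ins for the common naming convention the driver used:

```python
import glob

HEADER = "Match,Event,T1,T1p,T1t,T2,T2p,T2t"

def combine_fragments(pattern="match_*.csv"):
    # Sketch of the driver's final step: concatenate the per-match
    # CSV fragments under a single header row.
    lines = [HEADER]
    for path in sorted(glob.glob(pattern)):
        with open(path) as f:
            lines.extend(line.rstrip("\n") for line in f if line.strip())
    return "\n".join(lines) + "\n"
```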
Match,Event,T1,T1p,T1t,T2,T2p,T2t
1,Dubai,SOUTH AFRICA,36,3:29,PORTUGAL,0,1:53
2,Dubai,CANADA,19,3:17,WALES,14,3:51
3,Dubai,FIJI,54,3:38,FRANCE,7,2:21
4,Dubai,ARGENTINA,17,3:58,BRAZIL,5,2:26
5,Dubai,SCOTLAND,21,3:58,SAMOA,14,3:47
6,Dubai,NEW ZEALAND,36,2:38,JAPAN,0,3:36
7,Dubai,ENGLAND,19,4:04,USA,10,2:10
8,Dubai,AUSTRALIA,29,4:07,KENYA,12,3:11
9,Dubai,SOUTH AFRICA,24,1:58,CANADA,12,4:40
Each line of the CSV file included the match number, the event, and the name, points, and time of possession for each team. For all matches the team with the higher score was designated as team one, with the exception of ties, where the designation was arbitrary.
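Because team one always holds the higher score, simple summaries like the winning margin fall out of the columns directly. A quick sketch using only the standard library, run against the first two sample rows above:

```python
import csv
import io

sample = """Match,Event,T1,T1p,T1t,T2,T2p,T2t
1,Dubai,SOUTH AFRICA,36,3:29,PORTUGAL,0,1:53
2,Dubai,CANADA,19,3:17,WALES,14,3:51
"""

rows = list(csv.DictReader(io.StringIO(sample)))
# T1 holds the higher-scoring team, so T1p - T2p is the winning margin.
margins = [int(r["T1p"]) - int(r["T2p"]) for r in rows]
print(margins)  # [36, 5]
```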
Data Transformation and Analysis
A helper function was added to convert the time of possession min:sec format to seconds:
def to_seconds(mmss):
    # Convert a "min:sec" string (e.g. "3:29") to total seconds.
    seconds = 0
    for sect in mmss.split(':'):
        seconds = seconds * 60 + int(sect)
    return seconds
The team one and team two points and possession columns were converted into single vectors for easier processing:
# convert min:sec columns to sec
fp[['T1t','T2t']] = fp[['T1t','T2t']].applymap(to_seconds)

# combine both teams to single data series
p = fp.T1p.append(fp.T2p)
t = fp.T1t.append(fp.T2t)
The analysis was done using Pandas, NumPy, and SciPy. Pandas is a specialized Python library for data analysis. Most of the generic statistical functions were imported from NumPy, with the exception of the Pearson correlation coefficient function, which was imported from the SciPy stats module.
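For intuition, the coefficient that scipy.stats.pearsonr returns (alongside a p-value) can be sketched in plain Python; the function name here is my own:

```python
from math import sqrt

def pearson_r(x, y):
    # Plain-Python sketch of the Pearson correlation coefficient:
    # covariance of x and y divided by the product of their spreads.
    n = len(x)
    mx = sum(x) / float(n)
    my = sum(y) / float(n)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))  # 1.0
```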
Rugby points are scored in increments of 3, 5, or 7. This caused many overlapping data points to appear in the dot-dash graphs. A helper function found on Stack Overflow improved visibility by adding slight jitter to the data points.
def rand_jitter(arr):
    np.random.seed(1234)
    stdev = .01*(max(arr)-min(arr))
    return arr + np.random.randn(len(arr)) * stdev

def jitter(x, y, s=20, c='b', marker='o', cmap=None, norm=None, vmin=None,
           vmax=None, alpha=None, linewidths=None, verts=None, hold=None,
           **kwargs):
    return plt.scatter(rand_jitter(x), rand_jitter(y), s=s, c=c, marker=marker,
                       cmap=cmap, norm=norm, vmin=vmin, vmax=vmax, alpha=alpha,
                       linewidths=linewidths, verts=verts, hold=hold, **kwargs)
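The same idea works without NumPy. A standard-library sketch (the function name is my own) makes the helper's behavior plain: Gaussian noise scaled to one percent of the data range is added to each point so that overlapping values separate:

```python
import random

def rand_jitter_py(values, scale=0.01, seed=1234):
    # Pure-Python version of the jitter helper above: add Gaussian noise
    # proportional to the data range so overlapping points spread apart.
    rng = random.Random(seed)
    stdev = scale * (max(values) - min(values))
    return [v + rng.gauss(0, stdev) for v in values]
```

When the data has no spread, the standard deviation is zero and the values pass through unchanged.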
The time of possession blog post was done primarily as an exercise to get familiar with Python statistical analysis tools. Most of the previous work at Starting 7s was done in the R statistical programming environment using the RStudio IDE and published on RPubs. The R modules for analysis, data transformation, and plotting are more tightly integrated, so working in the R environment is more efficient.
However, the Python environment is extensible and flexible. The Python community is large and very active which means there is plenty of support available online for finding solutions to problems such as extracting unstructured text from PDFs, adding jitter to data points, and finding template functions for Tufte-inspired graphics.
I hope this write-up is helpful to anyone conducting similar work. Please ping me at firstname.lastname@example.org with any suggestions, corrections, or comments.