STOR-i Computing Group (STORC)
STORC is a computing group organised by students for other students on the STOR-i course. It is a chance for students to share knowledge on common programs and techniques that can be useful when tackling academic research. The sessions also allow for new software to be introduced and questions to be asked in a friendly and understanding environment.
The idea for the group came from a specific problem faced by a large research group. In many situations different students will encounter the same computing problems and find different ways to solve their own problem. This can often be time consuming when it would have been easier to ask other students that have already faced such problems. By giving each student a basic grounding in common computing techniques the group aims to reduce the likelihood of such situations. An online forum available to STOR-i students also allows for collaboration and help to be found across years and subject areas.
The group aims to meet twice a term with sessions led by students and staff on a variety of computing issues. Information about sessions that have already happened can be found below. For more information please contact Hugo Winter, Sean Malory or Jack Baker.
Session: Good Coding Practice - Numerical Tests
Session lead: Jamie Fairbrother
An important aspect of modern research in stats and OR is the verification of new theory and methodologies through numerical testing. Such numerical tests will typically involve the following steps: implementation of new algorithms; downloading and processing real-world data, or the simulation of data to be used as test input; a performance comparison of new methods against existing methods; and finally the collation and presentation of results. This sequence of actions can make managing and running code for numerical tests cumbersome. Some common issues include specifying inputs, including correct dependencies, passing output between scripts, and labelling output tables and plots correctly. In addition, numerical experiments should be coded in such a way that they are flexible, scalable, and repeatable.
In this session we offer several tips on the organisation of code for numerical tests. Among other topics we will cover the use of command-line arguments in test scripts, file formats used to store test results, and automation of tests.
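As an illustration of the command-line arguments and result files mentioned above, here is a minimal sketch in Python. It is not taken from the session itself: the option names (--n-samples, --seed, --output), the placeholder experiment in run_test and the choice of CSV as the storage format are all assumptions made for the example.

```python
import argparse
import csv
import random
import sys


def parse_args(argv=None):
    """Read the experiment settings from the command line (hypothetical options)."""
    parser = argparse.ArgumentParser(description="Run one numerical test.")
    parser.add_argument("--n-samples", type=int, default=1000)
    parser.add_argument("--seed", type=int, default=0)
    parser.add_argument("--output", default=None,
                        help="CSV file for the results (default: print to screen)")
    return parser.parse_args(argv)


def run_test(n_samples, seed):
    """Placeholder experiment: the mean of n_samples uniform draws."""
    rng = random.Random(seed)  # a fixed seed makes the run repeatable
    return sum(rng.random() for _ in range(n_samples)) / n_samples


def write_result(handle, args, estimate):
    """Store the inputs next to the output so the table labels itself."""
    writer = csv.writer(handle)
    writer.writerow(["n_samples", "seed", "estimate"])
    writer.writerow([args.n_samples, args.seed, estimate])


def main(argv=None):
    args = parse_args(argv)
    estimate = run_test(args.n_samples, args.seed)
    if args.output is None:
        write_result(sys.stdout, args, estimate)
    else:
        with open(args.output, "w", newline="") as handle:
            write_result(handle, args, estimate)
    return estimate


if __name__ == "__main__":
    main()
```

Saved under a name of your choice, the script could then be run repeatedly with different settings, for example with --n-samples 5000 --seed 1 --output run1.csv, which is one way to automate a batch of tests from a shell loop.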
The session will end with an open discussion of strategies people have employed when writing their own tests.
Session: How to use the new STORM system
Session lead: Matthew Ludkin
This session will be based around the new STORM system.
STORM is the STOR-i high-performance computing cluster and can be very useful if you need to perform a lot of computation.
Session: Good Coding Practice - Comments
Session lead: Emma Simpson
The STORC good coding practice sessions aim to develop students’ coding skills to be more professional. The third session is on comments.
We will discuss commenting and structuring your code. Comments are essential for anyone referring back to your code – including yourself! But commenting too much makes your code cluttered. One approach is to use good naming to make your code as self-documenting as possible. Comments can then be used to explain why rather than how.
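A small sketch of the "why rather than how" idea, written in Python. The function and the scenario (timings from a shared machine) are invented for illustration and are not from the session.

```python
def trimmed_mean(values, trim_fraction=0.1):
    """Return the mean after dropping the smallest and largest values."""
    if not 0 <= trim_fraction < 0.5:
        raise ValueError("trim_fraction must be in [0, 0.5)")
    ordered = sorted(values)
    n_to_trim = int(len(ordered) * trim_fraction)
    # Why: a few extreme run times from a busy shared machine would
    # otherwise dominate the average, so both tails are discarded.
    kept = ordered[n_to_trim:len(ordered) - n_to_trim]
    return sum(kept) / len(kept)
```

The names (ordered, n_to_trim, kept) already say how the calculation proceeds, so the single comment is reserved for the reason the trimming is there at all.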
Session: Good Coding Practice - Functions
Session lead: Dan Waller
The STORC good coding practice sessions aim to develop students’ coding skills to be more professional. The second session is on functions.
Functions are incredibly powerful. Used well they can keep your code easy to read, avoid repetitive code (no more copy and paste), and make your programs easier to test. But in order to get these advantages you need to use them well, and this is the subject of this talk and discussion.
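To make the point about avoiding copy and paste concrete, here is a minimal Python sketch. The data and the function name standardise are made up for the example.

```python
from statistics import mean, stdev


def standardise(values):
    """Rescale values to have mean 0 and sample standard deviation 1."""
    centre = mean(values)
    spread = stdev(values)
    return [(value - centre) / spread for value in values]


heights_cm = [150.0, 160.0, 170.0, 180.0]
weights_kg = [55.0, 62.0, 70.0, 81.0]

# One function call per data set replaces two pasted copies of the arithmetic.
standardised_heights = standardise(heights_cm)
standardised_weights = standardise(weights_kg)
```

Because the calculation lives in one place, it can be checked once on a small input and then reused, rather than being verified separately in every script that needs it.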
Session: Good Coding Practice - Naming
Session lead: Matt Ludkin
The STORC good coding practice sessions aim to develop students’ coding skills to be more professional. The first session is on naming.
What's in a name? This talk will discuss the principles behind naming objects for writing clean and clear code. A number of guidelines will be presented to help you write code that is usable by others. A number of examples are given to highlight the concepts and explain why they are important. The talk will end with a group discussion and a collaborative application of ideas to a sample program.
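As a hedged illustration of the kind of guideline such a talk might cover, the Python sketch below shows the same computation twice, first with uninformative names and then with descriptive ones. Both functions are hypothetical and are not the sample program used in the session.

```python
# Harder to read: the names say nothing about what the values mean.
def f(x, n):
    return sum(x[-n:]) / n


# Easier to read: the same computation, with names that state the intent.
def mean_of_last_observations(observations, window_size):
    recent_observations = observations[-window_size:]
    return sum(recent_observations) / window_size
```

The second version needs no comment to explain what it returns, which is the sense in which good names make code usable by others.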
Session lead: Chris Jewells
In 1965, Gordon Moore observed that the number of transistors packed into integrated circuits was doubling every year, a rate he revised in 1975 to approximately every two years. “Moore’s Law” describes the colossal expansion of computing speed experienced throughout the late 20th century and first decade of the 21st century. In terms of scientific computing, this has enabled us to harness the power of complex algorithms such as gradient-based optimisation, Monte Carlo methods, and data visualisation. In recent years, however, engineering constraints at the atomic level have begun to limit this increase in circuit density, and the speed gains of traditional serial processors have begun to plateau. The remorseless increase in both algorithmic complexity and volume of data demanded by the modern world has therefore effected a computational paradigm shift: no longer can we rely on single processors getting faster. Instead, we must employ teams of processors working in parallel.
The purpose of this seminar is to describe a practical approach to parallel computing. I will introduce the requirements of parallel computing from an algorithm point of view, and then describe some common parallel computing approaches such as vectorised code, distributed and shared memory parallelism, and parallel co-processors (e.g. GPUs). My aim is to give a broad overview of parallel computing that will enable you to identify where parallel programming might help you, and what technique might provide the best returns in terms of performance versus development time.
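One algorithmic requirement for parallelism is that the work splits into pieces that do not depend on each other. The Python sketch below, which is an illustration rather than material from the seminar, splits a Monte Carlo estimate of pi into independent chunks and hands them to a pool of workers. A thread pool is used only to keep the example short and portable: in standard CPython, threads will not speed up pure-Python arithmetic like this, and concurrent.futures.ProcessPoolExecutor, which offers the same map interface, is the usual choice for CPU-bound work.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def count_hits(task):
    """Count random points in the unit square that land inside the quarter circle."""
    seed, n_points = task
    rng = random.Random(seed)  # each chunk has its own generator and seed
    hits = 0
    for _ in range(n_points):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits


def estimate_pi(n_chunks=4, points_per_chunk=50_000, max_workers=4):
    """Split the simulation into independent chunks and combine the counts."""
    tasks = [(seed, points_per_chunk) for seed in range(n_chunks)]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        hits_per_chunk = list(pool.map(count_hits, tasks))
    total_points = n_chunks * points_per_chunk
    return 4.0 * sum(hits_per_chunk) / total_points
```

Because every chunk carries its own seed and shares no state with the others, the result is the same however many workers are used, which makes the parallel version easy to check against a serial run.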
Useful R tools
Session lead: Kaylea Haynes
Following the recent useR! conference which I attended in Aalborg, I have pulled together some of the things that you can do in R. I will start by giving an introduction to “slidify”, an R package which allows you to create jazzy HTML slides. I will then use slidify to create my slides, giving you a very brief introduction to the “checkpoint” package, getting data into R using the “readr” package, and web scraping techniques using the “rvest” package. Once you have data in R, I will show how you can test your code at run time using “assertive” and during development using “testthat”. Finally I will show you how to create beautiful interactive graphics using the “ggvis” package.
STORM computing cluster tutorial
Session lead: Matt Ludkin
This talk will build on previous talks from Ben and Ivar on the use of the STOR-i computing cluster: STORM.
A live demo will be given alongside the slides to show how it works.
An introduction will be given on how to access and interact with STORM using the terminal in Linux, specifically logging in and copying files. (This is new material but only really needed if you experience problems with the methods in previous talks.)
Batch jobs will be discussed. Running jobs on STORM in batch mode can give a significant wall-clock speed-up, since many experiments can be run in parallel.
Each of the elements of a “bsub” file will be explained and a comprehensive example will be run.
Finally, some simple bash functions will be shown that can simplify the processes shown in the demonstration.
Version control using Git
Session lead: Jack Baker
Git is a version control system, which allows you to take snapshots of your code at different points in time. Git is powerful as it allows you to backtrack if you make mistakes when developing your code; to back up your code and work across multiple machines; and to develop experimental features or changes in a safe environment, away from your original code. It is an essential tool when you are developing a project with other people.
In this talk I will give a demonstration of Git and talk about its main uses and features. I will also talk about the key differences between GitHub and Bitbucket, two sites which work with Git to store your code on the web, and in which situations you might use each of them.
Creating interactive applications using Shiny
Session lead: Hugo Winter
Shiny is a package in R that allows the creation of interactive applications using R commands. These apps can then be run in the R terminal or uploaded to a server to be deployed online. Shiny has been very useful in the dissemination of research to my industrial partner; he likes having the freedom to experiment with different parameter values and see how changes in these values affect model output.
Behind every Shiny app there exist two R files, one that deals with the graphical interface and another that runs the code required for the visualisation. No additional programming languages are required other than R, with few new commands needed to create an app. Shiny applications can be created on both Windows and Linux operating systems, but require RStudio to be installed.
In this session I will introduce Shiny and give examples from my field of research. Students are encouraged to bring along their laptops and play about with the numerous demos that can be found at http://shiny.rstudio.com/.
Introduction to Julia
Session lead: Sean Malory
Julia is a powerful new language specifically designed for technical computing. It has an easy-to-read syntax, similar to Matlab's, and allows for productive programming without compromising performance. It uses a Just-In-Time compiler, and is designed in a way that allows it to reach speeds comparable to C. All this comes from a language which is as easy to write as R, Matlab or Python. In particular, it allows for fast looping, which makes it an ideal language for non-vectorisable (non-parallelisable) loops. Along with a rapidly expanding body of external packages, these factors make it a perfect language for STOR-i as a whole.
In this talk, we will introduce Julia. We will demonstrate its basic syntax and basic commands. We will describe Julia's control flow constructs and highlight the importance of type stability. We will explain key packages such as JuMP (Julia's Optimisation package), Winston (one of Julia's graphics packages), Distributions (Julia's take on probability distributions) and will briefly introduce other, more specialized, packages. Further, we will explain the approach one should take when programming in Julia and we will give simple tips on how to improve performance.
Towards the end of the talk, we will demonstrate snippets of Julia code that relate to certain areas of research currently being undertaken in STOR-i (including Optimisation, Changepoints, Clustering, Bayesian regression and Particle filtering). After the talk there will be time to ask questions on anything Julia related.
Unix Power Tools
Session lead: Nicos Pavlidis
Unix (and therefore Linux) is not an operating system as much as it is a way of thinking. At the heart of the Unix philosophy is the idea that the power of a system comes more from the relationships among programs than from the programs themselves. Knowledge of Unix and its utilities from the command line will enable you to do very powerful things with a few keystrokes. This is in stark contrast to most other systems.
This session will attempt to illustrate this through hands-on examples. Clearly it is impossible to provide a comprehensive introduction to all the possibilities offered by Unix. No course can ever aspire to do this. Instead the objective is to illustrate some possibilities and get you thinking of how you could apply such an approach to your everyday work, to boost productivity (i.e. get the computer to do the work for you, rather than the other way around). Unix teaches programming imperceptibly: it is a slow but steady extension of the work you do simply by interacting with the computer. If you stick with it, before long you will be able to step outside the bounds of the tools that have already been provided by the designers of the system and solve problems that don't quite fit the mould.
Creating and editing packages in R
Session lead: Kaylea Haynes
Being able to create packages in R is a useful skill, as it makes accessing your regularly used code much easier and makes it easy to share code with other people. In this talk I will present a way that I have found fairly straightforward to make and edit R packages. The talk will start with a basic R package; I will then go on to talk about including C code and adding R code to the package once it is made. This method is slightly different from what some of the older year groups will have been taught at a previous session on creating R packages, and from what was discussed at an earlier STORC session. The method I use is the same for both Windows and Linux, but for this session I will use Windows for a change to please the Windows users.
Make: the universal build tool
Session lead: Jamie Fairbrother
Make is a tool for managing dependencies in computer projects. When it is used to build a project, it detects which parts need rebuilding, and carries out all the necessary commands to achieve this. It is particularly useful for C/C++ projects, which may have complicated dependency structures and compile commands.
In this talk we present the basic usage of Make, show how it can be used to build different types of project, such as LaTeX documents and R packages, and how it can be used to provide project utilities such as automatic package archiving and directory cleaning.
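The way Make detects which parts need rebuilding can be sketched in a few lines. The Python function below is an illustration of the timestamp rule only, not Make itself and not code from the talk: a target is treated as out of date if it does not exist or if any of its dependencies was modified more recently. The name needs_rebuild and the file names in the usage note are hypothetical.

```python
import os


def needs_rebuild(target, dependencies):
    """Return True if target is missing or older than any of its dependencies."""
    if not os.path.exists(target):
        return True
    target_time = os.path.getmtime(target)
    return any(os.path.getmtime(dep) > target_time for dep in dependencies)
```

For example, needs_rebuild("main.o", ["main.c", "main.h"]) would return True after main.h has been edited, so only the object files that depend on the edited file are recompiled, which is what keeps rebuilds of large projects fast.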
Introduction to Linux and Emacs
Session lead: Jamie Fairbrother and Chris Nemeth
Computational requirements for researchers differ greatly from the requirements of standard computer users. While software such as Microsoft Word and Excel are invaluable tools for any secretary, the suite of products offered by Windows machines often falls short of the needs of researchers.
Linux-based systems are more compatible with the requirements of researchers, as these systems are more flexible and offer more resources to help researchers do what they always do: try things which haven't been done before. This is why 95% of the supercomputers in the world run some variant of Linux, including the STOR-i cluster.
In this talk we'll introduce some tips, tricks and shortcuts that are useful to any Linux user. We'll also give a quick introduction to Emacs, a popular text editor (probably the most popular) that makes it easier to write and edit text documents, including LaTeX, R, C and Python to name a few. Text editors can be an invaluable resource for any programmer and were strongly encouraged by Peter Frazier during his visit to STOR-i in January.
Speeding up your R code
Session lead: James Edwards
Particular characteristics of the programming language R mean that code written in a style appropriate for another language can be slow. This talk will look at some of these R-specific pitfalls and how to avoid them. Benchmarking and profiling of code to find where bottlenecks occur will also be discussed. Examples and a small case study will be used to illustrate these concepts.
High Performance Computing using STORM
Session lead: Ben Pickering
High performance computing (HPC) is an invaluable tool for researchers within the statistics and operational research community. HPC allows for the execution of computational activities which would otherwise require a prohibitively large amount of computation time or memory when performed on conventional machines. Such activities may include the testing of complex models, or the implementation of large-scale simulation studies. STORM is the HPC cluster which is available for use within STOR-i. This session will introduce the basic steps which will enable students to begin using STORM as part of their research, as well as highlight some more advanced features which may be of particular interest. The instructions given will be for a Linux-based operating system; Windows users are recommended to use Linux within a VirtualBox virtual machine.
Useful Links: http://www.vub.ac.be/BFUCC/LSF/
Version control with Subversion
Session lead: Jamie Fairbrother
Version control is a tool for keeping track of all changes to files in a project. It can be used to monitor the evolution of a project, undo unwanted changes, and, in the case of collaborative editing, manage conflicting changes. It is particularly useful in programming projects where, for example, you may want to undo changes which caused bugs in your code, or to run code which may rely on a previous version of your project. Version control systems can also be used as a tool for distributing software. This session will introduce the popular open source and cross-platform version control system Subversion. The session will take the form of a tutorial where those attending will set up their own Subversion repository, and put some code they're working on under version control.
Useful link: http://subversion.apache.org/
Advanced plotting with ggplot2
Session lead: Tom Flowerdew
ggplot2 is a graph-plotting package for R, which takes advantage of the ‘Grammar of Graphics’, a way of thinking about the process of creating informative images developed by Leland Wilkinson. The advantage of ggplot2 is that once the grammar is understood, creating graphs becomes an entirely logical process, and very complex outputs can be formed with little effort.
Useful link: http://www.ggplot2.org/
Mendeley: The easy way to read papers
Session lead: Tim Park
A large part of academic research is reading and later referencing journal articles and other resources; over the course of a PhD this can amount to hundreds of documents. It is therefore good practice to have a system for saving items you have read, making them easy to find later. Mendeley is a free programme for Windows or Linux which can automatically monitor .pdf files you have saved and organise them in terms of authors, journal, year or your own keywords. This makes it very easy to browse the papers you have saved and read them. Mendeley also has an online function so you can access your papers remotely or share them with others. Finally, Mendeley easily integrates with BibTeX and allows you to create a .bib file containing references for your papers.
Useful link: http://www.mendeley.com/
Drawing figures with TikZ
Session lead: Thomas Lugrin
In papers and posters, figures are what people look at first. Clear and informative figures are thus very important, and it is worth spending some time drawing them. TikZ is a LaTeX package which allows the creation of high-quality graphics, based on intuitive and powerful coding. A wide range of tools, such as shadings, fadings and decorations, can be used to generate better visualizations that help to communicate the message of your work.
Introductory session and R package development
Session lead: Hugo Winter
In this first STORC session we discuss how to develop a package in R. Many of the functions that we take for granted when coding in R will have come from a package that has been uploaded onto CRAN. Creating a package can also be useful for research students to keep a documented record of all code that they have generated. Well documented code also allows for easier collaboration between academics and industry partners. This session expands on a talk by Markus Gesmann at the Advanced R course earlier this year. Each student will be able to discuss their issues and offer their tips on how to avoid any pitfalls.
Useful link: http://lamages.blogspot.co.uk/2013/03/create-r-package-from-single-r-file.html