Why I Use R

R
Data Science
Author

Gordon Shotwell

Published

December 30, 2019

They said the war was over…

Over the last couple of years prominent members of both the R and Python communities have tried to move past the language wars and support both R and Python workflows. This makes sense intellectually; after all, R and Python are not all that different in the scheme of things, and so we should let people use whichever language they find more productive. This conversation manifests very differently in the workplace, however.
Most of the time when a Python data scientist hears that the language wars are over, they think “Well, great — if R and Python are equally effective, then we can all just standardize on Python.”

This comes up for me personally when coworkers tell me some version of “Hey, your work is great, you’re an excellent developer, but have you thought of switching over to Python/Scala/JavaScript so that you can really make a contribution?” Early in my career I took these suggestions seriously. I came to R from an Excel background, and for a long time I had internalized the feeling that serious engineers used Python, while analysts or researchers could use languages like R. Over time I’ve realized that the people making that statement often aren’t really informed. They rarely know anything about R, and often don’t write production-quality code themselves. In contrast, most of the very senior engineers I’ve met understand that all programming languages are basically just bundles of trade-offs, and so no single language is going to be globally superior to another. There really are no production languages – only production engineers.

The thing is, I don’t use R out of some blind brand loyalty but because I don’t like working hard. Every time I’m faced with a problem, I try to figure out how I can solve that problem in a stable way with the least amount of effort; for most of the problems I face, R is the right tool. This is partially an accident of training – I know R very well at this point, so it’s usually the most efficient way for me to solve a problem – but it’s also because of core language features that don’t really exist in Python.

Overall I think there are four main features of the core R language that are essential to my work. These are things that are present in R, that I haven’t found to be available or accessible in any other single language, and that make R the best choice for my work:

  1. Native data science structures
  2. Non-standard evaluation
  3. Packaging consensus (The glory of CRAN)
  4. Functional programming

1. Native data science structures

It’s relatively easy to do data science in R without any external libraries. You can read data from a CSV into a data frame, plot and clean that data, and analyse it using built-in statistical models. This is possible because R was built to do statistics, so it includes features like vectors and data frames and lets you invert a matrix or fit a linear model out of the box. Over time Python has added all of these capabilities through numpy, pandas, and scikit-learn, but in Python you generally need external dependencies to do data science work.
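To make that concrete, here’s a minimal sketch of a zero-dependency workflow, using the built-in mtcars data in place of a hypothetical CSV file:

df <- mtcars                          # in practice: read.csv("cars.csv")
fit <- lm(mpg ~ wt + cyl, data = df)  # built-in linear model
summary(fit)                          # coefficients, R-squared, and so on
plot(df$wt, df$mpg)                   # built-in plotting
abline(lm(mpg ~ wt, data = df))       # overlay a regression line

Nothing here is imported; every function ships with the language.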

I’m generally in favour of using external libraries, but it’s nice to have the option of avoiding them in certain circumstances. Data structures in the base language tend to be more stable than those provided by external dependencies. For instance, I can run base R code from 2010 and be reasonably sure that the code will behave the same way today as it did a decade ago, because maintainers of the core language are very conservative in introducing breaking changes.

This is probably not true of R or Python code that relies on libraries like dplyr or pandas, because both of those libraries prioritize feature improvement over stability. That isn’t a criticism of either library; it’s just to point out that there’s real value in a language that lets you do meaningful data science work without importing external dependencies.

One of the weird things that you probably won’t hear from the “R isn’t a real language” Pythonistas is that there’s a whole group of production Python engineers who don’t think the Python scientific computing stack is appropriate for production. These people make the same basic argument about pandas that pandas people make about R: it’s fine for research, but production code requires vanilla Python. Switching from R to Python often doesn’t significantly reduce deployment friction, because you still end up standing up some kind of microservice to isolate the data science dependencies from the rest of the Python code base.

2. Non-Standard Evaluation

R includes a strange and wonderful type of metaprogramming called non-standard evaluation (NSE), which allows you to access and manipulate the calling environment of a function. This lets you do things like use a variable name in a plot title, or evaluate a user-supplied expression in a different environment.
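As a quick sketch of the plot-title case (plot_with_title and monthly_sales are made-up names for illustration): substitute() captures the unevaluated argument, and deparse() turns it into text.

plot_with_title <- function(x) {
  var_name <- deparse(substitute(x))            # e.g. "monthly_sales"
  plot(x, main = paste("Values of", var_name))
}

monthly_sales <- rnorm(12)      # hypothetical data
plot_with_title(monthly_sales)  # plot titled "Values of monthly_sales"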

NSE reminds me of a niche and dangerous power tool: it shouldn’t be the first thing you reach for, and it’s very dangerous if you don’t know what you’re doing, but it allows you to solve problems that would otherwise be unsolvable.

{{% youtube "ox93snKVuaM" %}}

There are three main ways that I use NSE:

Separate user representations from programmatic representations

There’s often a tension between the most natural way for a user to represent a problem and the best way to organize that problem internally. Internally, it’s good if the inputs to a system are unambiguously specified, so that it’s crystal clear what the system should do and how it should be organized. The user of a function, by contrast, often doesn’t need or want to know about the implementation details, and instead wants to provide the inputs in the way that requires learning the fewest new things. Without NSE this tension is very hard to resolve, because what goes into a function is what the function has to use; NSE lets you capture and modify the expressions the user sends into the function, so you can translate them into another form.

For example, R lets you specify models with a formula interface like this: lm(mpg ~ cyl, data = mtcars). This is a natural way for statisticians to specify statistical models because they’re usually familiar with the syntax, but without NSE there’s no way to make that function work as written, because mpg and cyl are not objects in the calling environment. NSE allows lm to capture the formula mpg ~ cyl and evaluate it within the data environment. To accomplish the same thing under standard evaluation you’d need some sort of string manipulation, which is what you find in the Python version.
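Here’s a stripped-down sketch of the same mechanism outside lm() (with_data is a made-up helper; base R’s with() does essentially this):

with_data <- function(data, expr) {
  captured <- substitute(expr)  # capture the unevaluated expression, e.g. mean(mpg)
  eval(captured, envir = data)  # evaluate it with the data frame's columns in scope
}

with_data(mtcars, mean(mpg))  # mpg is not an object in the caller, but this works
[1] 20.09062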

Note that you give up a lot by using NSE, and in almost every context it’s the wrong tool. You might look at the R and Python formula interfaces and think that giving up referential transparency isn’t worth saving a few quotation marks, but there are lots of cases where it saves the day.

A more complex example is dplyr, which puts a consistent user-facing API in front of both data frame and database backends. The programmatic representation varies greatly across the different backends; using NSE, the package captures the expressions supplied by the user and translates them into representations that each backend understands.
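For instance, assuming the dbplyr and RSQLite packages are available, the same pipeline can be pointed at a database and inspected with show_query(); the exact SQL emitted depends on your dbplyr version:

library(dplyr)

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
mtcars_db <- copy_to(con, mtcars)  # a remote table instead of a data frame

mtcars_db %>%
  filter(cyl == 4) %>%
  summarise(avg_mpg = mean(mpg)) %>%
  show_query()
<SQL>
SELECT AVG(`mpg`) AS `avg_mpg`
FROM `mtcars`
WHERE (`cyl` = 4.0)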

Make code more concise

Recently I had a whole bunch of functions that all included a structure like this:

f1 <- function(df) {
  if (nrow(df) == 0) {
    return(
      data.frame(join_col = NA)
    )
  }
  ### Lots of time consuming code ....
}

What I wanted was a new function, returnIfEmpty, that causes its calling function to return the default data frame if it is passed a data frame without any rows. NSE lets me do that:

returnIfEmpty <- function(df) {
  if (nrow(df) == 0) {
    default <- data.frame(join_col = NA)
    assign("return_data", default, envir = parent.frame()) # Modify calling environment
    call <- rlang::expr(return(return_data)) # Capture expression
    rlang::eval_bare(call, env = parent.frame()) # Evaluate expression in calling environment
  }
}

f1 <- function(df) {
  returnIfEmpty(df)
  ### Lots of time consuming code ....
}

f1(data.frame())
  join_col
1       NA

In this example, returnIfEmpty first creates an object in its parent environment, then builds a call to return that object, and finally evaluates that call in the parent environment. This let me avoid a lot of code repetition, which I don’t think I could have done otherwise.

Learn the user’s language

One of the great things about accessing the user’s environment is that it holds a wealth of information that is particularly meaningful to that user. In particular, if you can use the names the user assigned to their variables in function output or error messages, your function becomes much easier for them to understand.

Consider this function:

regularError <- function(df) {
  if (!inherits(df, "data.frame")) {
    warning("df must be a dataframe")
  }
}

my_var <- 1:10
regularError(my_var)
Warning in regularError(my_var): df must be a dataframe

The user doesn’t know anything about the internal arguments of your function, so they probably don’t know what the df in the warning message is referring to. In order to understand the warning, they need to read the documentation, or worse, the source code. This leads to confusion, because you’re forcing the user to learn your language rather than telling them what’s wrong in their own terms. NSE allows us to make the error much friendlier:

fancyError <- function(df) {
  class <- class(df)                        # class of the object the user passed
  var_name <- as.character(substitute(df))  # the name the caller used for it
  if (!inherits(df, "data.frame")) {
    warning(glue::glue("'{var_name}' is of class '{class}' when it needs to be a dataframe"))
  }
}
fancyError(my_var)
Warning in fancyError(my_var): 'my_var' is of class 'integer' when it needs to
be a dataframe

This communicates what’s going wrong in terms that are much more meaningful to the user, because they assigned the name “my_var” to the object in the first place.

You might think that it’s too much effort to write a friendly error message, but in my work I find that details like this help build delightful products that are easy to use. It’s worth taking the time to communicate problems in the user’s language, and NSE is the best way I know to learn that language.

3. Packaging consensus (The glory of CRAN)

I started programming on a Saturday morning during law school. In hindsight this was a very important morning for me, because in many ways it shaped the course of my career for the next ten years. I probably had about 20 minutes to become interested and excited about the project, since I had lots of homework to do and programming wasn’t something that was on any of my to-do lists at the time. The resource I started with was “R Twotorials,” which taught you how to use R in two-minute lessons.

{{% youtube "5DZkQjPyzjs" %}}

R let me get up and running, installing packages, filtering data, and printing plots in under 20 minutes, which meant that I stayed interested in the language and eventually started using it professionally. I had actually started to learn Python at around the same time but just found it too difficult. I didn’t know how to open a terminal window, I didn’t want to spend any time on configuration, and I didn’t have any time to devote to setup. Python required me to spend more than 20 minutes on setup, and R didn’t, so I picked R.

The reason why this all worked was because of CRAN. CRAN has (maybe forcibly) created a strong consensus on how to package and distribute R code, which means that nine times out of ten an R package will install and run with no user configuration. Today I am fairly comfortable at the command line and futzing around getting computer programs to work on my machine, but I’m still completely unwilling to use an R package that requires me to do much more than install.packages("package_name").

This feature is important to me because I want my code to be usable and installable by people who, like my law school self, do not think of themselves as programmers and have no tolerance for command line bullshittery. My goal for all the products I develop is that they be installable and runnable with a single user action, and that that action not take place in the terminal. This means that when I import dependencies, I need to be confident that all of those dependencies can themselves be installed and set up without user intervention. CRAN achieves this by moving a lot of the setup and configuration pain from the end user to the package maintainer, and while this probably slows down the development and release of packages, it vastly improves the experience of the average user.

Python is, well, not like that. I’m not a great Python developer, but I am a professional computer programmer and I still feel like it’s even odds that installing some Python library is going to cost me an afternoon of torture and four broken keyboards. If there are any Python evangelists still reading this, they might have a response that begins with, “Well you just,” but remember the user that I care the most about only has 20 minutes of attention and no real programming skill, so the only thing they can “just” do is copy and paste one line of code into a console. If that doesn’t work, I’ve lost them, and they’ll spend another lonely year renewing their SPSS licenses.

4. Functional programming

R is a functional programming language, which means that the natural way to accomplish something in the language is to use functions. I really like this pattern of programming because breaking complicated jobs down into small functional bricks gives me confidence that the overall solution is correct. I can work on the small functions, verify that they’re correct through tests, and then know that combining those building blocks together won’t change their behaviour.
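As a toy sketch of what those functional bricks can look like in practice (the helper names are my own; airquality is one of R’s built-in data sets):

drop_missing <- function(df) df[complete.cases(df), ]  # a pure, testable brick
standardize  <- function(x) (x - mean(x)) / sd(x)      # another brick
add_z_score  <- function(df, col) {
  df[[paste0(col, "_z")]] <- standardize(df[[col]])
  df
}

result <- add_z_score(drop_missing(airquality), "Ozone")  # bricks composed
head(result$Ozone_z)

Each function can be verified on its own, and composing them doesn’t change what any one of them does.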

Functional programming is not the best paradigm for all problems. For example, React is a functional paradigm for building user interfaces, where you stack together functional components to build complicated web apps. An early issue with React was that, because components were designed to be independent of one another, it was very difficult to track application state. If a user logged in in one part of the application, you had to pass that information back up to the top level of the app and then back down to every component that needed to know about it. It was easy to miss a connection, which would leave the user logged into one part of the app but logged out of another.

This created a lot of difficult bugs where the application state was inconsistent across the product. React solved the problem by adding global state stores like Redux, which let people use pure functional components for most things but break that pattern when they need to set or access application state.

Functional programming is a great tool for data science problems because they are mostly stateless. When I’m building a statistical model, what I’m really doing is creating a mapping between some set of inputs and an output; in other words, a function. It’s usually more important that the mapping is clearly defined than that it takes into account user state. It is possible to do functional programming in Python, but it’s a bit like ordering soup at a pizza parlour.

While a lot of the FP tools are there, the majority of the community doesn’t use functional patterns as their main development paradigm, and you’d probably get a few “That’s not Pythonic” comments on your pull request.

Conclusion

None of this is to suggest that anyone else should use R. R, like all other languages, is a bundle of trade-offs, and those are bad trade-offs in many contexts. In my context, however, the flexibility of R is extremely useful, and I can’t give up those language features without my work suffering.