By Steve Miller.
A few weeks ago, I came across a LinkedIn post titled “R Should Be Your Second Language (If It’s Not Your First)” by Paul Allison. Paul’s an old grad school friend and colleague from many years ago who went on to a distinguished academic career as a professor at the University of Pennsylvania, and who now manages and teaches with his statistical training company, Statistical Horizons. I had the opportunity to take a Statistical Horizons class on longitudinal analysis five years ago and was very pleased with the experience.
I’d characterize Paul’s expertise as revolving around applied statistical methods and models for research on social behavior. Examples include longitudinal analysis, survival analysis, structural equation models, and causal analysis. He’s a long-standing SAS and Stata expert who has taught himself R in recent years.
My own work, in contrast, is more diffuse, not nearly as deeply statistical as Paul’s, instead focused more uniformly on the ABCDEs of Data Science – (in usage order) Business, Data, Computation, Exploration, and Algorithms. What I do share with Paul is expertise in SAS and R.
As a data scientist and R advocate, I have some strong opinions on R past, present, and future. It turns out that Paul, even as a self-described R neophyte, has independently adopted similar views. What follows are the reactions I sent to Paul.
Hi Paul –
Great read. Sure doesn’t seem you’re inexperienced with R to me!
I shared your frustration transitioning from SAS to S/R in the early 2000s, struggling at first to make the leap from SAS proc/data step/macro coding to the functional array/vector/dataframe approach of R. I well remember an early scripting blunder of attempting to loop through individual records of a large dataframe. I think the job’s still running!
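To make that contrast concrete, here is a minimal sketch of the record-by-record habit next to the vectorized idiom. The `orders` data frame, its column names, and the row count are hypothetical and chosen only for illustration.

```r
# Hypothetical data: 10,000 orders with a unit price and a quantity.
set.seed(42)
n <- 10000
orders <- data.frame(
  price = runif(n, min = 1, max = 100),
  qty   = sample(1:10, n, replace = TRUE)
)

# SAS data step habit: visit one record at a time.
loop_total <- numeric(n)
for (i in seq_len(n)) {
  loop_total[i] <- orders[i, "price"] * orders[i, "qty"]
}

# R idiom: operate on whole columns at once.
vec_total <- orders$price * orders$qty

# Both approaches give the same answer; only the running time differs.
stopifnot(isTRUE(all.equal(loop_total, vec_total)))
```

Wrapping each approach in `system.time()` shows the gap, which widens quickly as the row count grows.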
It took a good six months to get comfortable with the new paradigm, but R then started to click for me. The “data management tasks (that) seemed much harder in R than in SAS” instead flipped to being easier in R. In addition, the powerful, consistent, and performant tidyverse and data.table programming ecosystems that have emerged over the last 10 years have greatly simplified data programming in R. Now, almost 20 years later, there’s no way I could return to full-time work with SAS.
I’ve seen the same prevalence of R (and Python) in my consulting work that you’re seeing in the marketplace at Statistical Horizons. R is the lingua franca of graduate statistics programs and is also growing in other areas of academia. Economists often choose between R and Stata, while computational social scientists, biologists, and quantitative finance practitioners increasingly prefer R/Python to commercial competitors. In my college recruiting capacity, I’ve had no difficulty finding undergraduate quant majors with backgrounds in R/Python.
You’re right there’s a curse of riches in R, with now over 12,000 available packages. How’s a developer to make sense of this largesse? It’s not easy, but one response from the R community is the development of topical task views maintained by experts. “CRAN task views aim to provide some guidance which packages on CRAN are relevant for tasks related to a certain topic. They give a brief overview of the included packages and can be automatically installed using the ctv package.” I always use task views to get started with an unfamiliar statistical topic in R.
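As a rough sketch of how a task view can be installed, the helper below wraps the ctv package's `install.views()` function. The helper name `install_task_view` and the default mirror URL are assumptions made for illustration, and "Survival" is used only as an example view name.

```r
# Hypothetical helper: install the ctv package if needed, then install
# every package listed in one CRAN task view (needs internet access).
install_task_view <- function(view, repos = "https://cloud.r-project.org") {
  if (!requireNamespace("ctv", quietly = TRUE)) {
    install.packages("ctv", repos = repos)
  }
  ctv::install.views(view, repos = repos)
}

# Example call, left commented out because it downloads many packages:
# install_task_view("Survival")
```

`ctv::available.views()` lists the task views currently published on CRAN, which is a sensible first call before picking one.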
You’re also likely to see new functionality in open source platforms like R and Python before it’s available in commercial products like SAS/Stata. Indeed, it’s almost a requirement that the latest statistical learning algorithms from academia be validated in R packages. Of course, the commercial vendors still question the quality of open source code. Open source advocates counter that more eyes make for less buggy programs.
Thank goodness the days of paper user guides and language references are gone. With ubiquitous platforms like R and Python, simply googling a well-specified question will almost certainly lead to an answer, generally on Stack Overflow. And for mature packages/ecosystems like tidyverse, data.table, and Pandas, comprehensive online language references are readily available as supplements.
You’re also correct about the annoying conflicts between functions with the same name in multiple packages. In some instances, I’ve taken to using the “double colon operator” to explicitly identify which package’s version I want, e.g.:
dplyr::summarize(Mean.Sepal.Length=mean(Sepal.Length))
An R programmer could use the double colon prefix for all functions and not have to issue “library(...)” calls at the beginning of her script.
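Here is a small sketch of a complete script written in that double-colon style. It assumes the dplyr package is installed and uses the `iris` data set that ships with R. The grouping by `Species` is added for illustration and is not part of the fragment above.

```r
# No library(dplyr) call: every dplyr function is reached through `::`.
# iris comes from the datasets package, which R attaches by default.
by_species <- dplyr::group_by(iris, Species)
result <- dplyr::summarize(by_species, Mean.Sepal.Length = mean(Sepal.Length))
print(result)
```

The trade-off is verbosity: the prefix makes the origin of every function explicit and sidesteps name conflicts, at the cost of longer lines.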
One topic that’s not emphasized in your blog, perhaps because you might have been insulated from it in academia, is the cost of the software. Core R is free and open source, while the RStudio development platform is “freemium”: much functionality is provided at no cost, but advanced features are available only commercially. (Microsoft offers a freemium version of R that provides extensions for commercial users.) Contrast that with proprietary SAS and Stata, which license their software for $$ out of the gate.
And at least in the SAS instance, that tariff is steep. Of course, there’s more to the cost of working with statistical software than simply software licensing, but for my consulting customers, most of whom are startup “data” companies, the SAS (or Stata) entry fee is a non-starter. An open source stack that includes R and/or Python for statistics is therefore a sine qua non.
A few years back, I participated in a LinkedIn discussion group go-round on the merits of R vs SAS. The SAS proselytes challenged my cost argument, noting that R required expensive additional computer memory to function on large data sets. Fair enough, but I countered by contrasting the then first-year SAS licensing fee of almost $9,000 with the $2,500 I paid for a 64 GB RAM, 2 TB solid state drive Wintel notebook that I’ve now been using for 4 years and that effortlessly handles 5 GB+ R data.tables. Indeed, last year I loaded 25 GB of census data into two R data.tables in a single script on the notebook without a hitch. For data larger than this, I rely on cloud computing.
As a data scientist, I enjoy working with R and Python. Both are powerful, both have rapidly growing ecosystems, and both are fun to program. What RStudio is for R development, Jupyter Notebook is for Python. Yet I also develop with R in Jupyter and with Python in RStudio. The comprehensive Pandas library in Python solves data management challenges similar to those that tidyverse/data.table address in R. Data programming in R is easier than Python developers think, while the statistical capabilities available in Python surpass what the R community acknowledges.
We’re also seeing more and more interoperability between the two languages, in effect the beginnings of a polyglot combination of R and Python (and perhaps Julia). I interweave the two all the time in my work now. Relatedly, we’re experiencing the emergence of highly performant, cross-platform APIs for machine learning in R, Python, and Java. H2O, XGBoost, and Keras are examples. This will inevitably lead to the demise of one-off, non-scalable “dissertation libraries”.
In summary, I agree with your assessment that R (and Python) have bright futures. Having worked closely with the R community for almost 20 years, I second that “R has reached such a critical mass that it will be very hard to stop.”
Again, thanks for the great read. You’ve motivated me to write a data science version of your blog, perhaps to be entitled “R and Python Should Be Your First and Second Languages for Data Science.”
For those wishing to get jump-started with R, I’m sure the options from Statistical Horizons are a good bet.