Both Python and R are popular programming languages for Data Science. While R’s functionality is developed with statisticians in mind (think of R's strong data visualization capabilities!), Python is often praised for its easy-to-understand syntax.
Ross Ihaka and Robert Gentleman created the open-source language R in 1995 as an implementation of the S programming language. The purpose was to develop a language that focused on delivering a better and more user-friendly way to do data analysis, statistics and graphical models.
Python was created by Guido van Rossum in 1991 and emphasizes productivity and code readability. Its main users for statistical purposes are programmers who want to delve into data analysis or apply statistical techniques.
As a data scientist, it’s your job to pick the language that best fits your needs. Some questions that can help you decide:
What problems do you want to solve?
What are the net costs for learning a language?
What are the commonly used tools in your field?
What are the other available tools and how do these relate to the commonly used tools?
When and how to use R?
R is mainly used when the data analysis task requires standalone computing or analysis on individual servers. It’s great for exploratory work, and it's handy for almost any type of data analysis because of the huge number of packages and readily usable tests that often provide you with the necessary tools to get up and running quickly. R can even be part of a big data solution.
When getting started with R, a good first step is to install the RStudio IDE. Once this is done, we recommend that you have a look at the following popular packages:
dplyr, plyr and data.table to easily manipulate data,
stringr to manipulate strings,
zoo to work with regular and irregular time series,
ggvis, lattice, and ggplot2 to visualize data, and
caret for machine learning.
When and how to use Python?
You can use Python when your data analysis tasks need to be integrated with web apps or if statistics code needs to be incorporated into a production database. Being a fully fledged programming language, it’s a great tool to implement algorithms for production use.
While the immaturity of Python packages for data analysis was an issue in the past, this has improved significantly over the years. Make sure to install NumPy/SciPy (scientific computing) and pandas (data manipulation) to make Python usable for data analysis. Also have a look at matplotlib for graphics and scikit-learn for machine learning.
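To give a feel for how these packages divide the work, here is a minimal sketch, assuming NumPy and pandas are installed. The data set (exam scores for two made-up groups) and all variable names are hypothetical and only serve as illustration; pandas handles the tabular manipulation, while NumPy does the array arithmetic.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: exam scores for two made-up study groups.
scores = pd.DataFrame({
    "group": ["a", "a", "a", "b", "b", "b"],
    "score": [70.0, 80.0, 90.0, 60.0, 65.0, 70.0],
})

# pandas: split the table by group and compute the mean score per group.
group_means = scores.groupby("group")["score"].mean()

# NumPy: vectorised arithmetic on the underlying array (standardised scores).
values = scores["score"].to_numpy()
z_scores = (values - values.mean()) / values.std()

print(group_means)
print(np.round(z_scores, 2))
```

The same pattern (load a table with pandas, summarise it, then hand the numeric columns to NumPy, SciPy or scikit-learn) is a common starting point, though real analyses will of course need more care with missing values and data types.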
Unlike R, Python has no clear “winning” IDE. We recommend that you have a look at Spyder, IPython Notebook (now Jupyter Notebook) and Rodeo to see which one best fits your needs.
* We recommend that all our students learn both programming languages and use them where appropriate, since many Data Science teams today are bilingual, leveraging both R and Python in their work.