Python and R

Python vs R

One common question that arises when someone first gets into the data science world is whether they should first learn Python or R. Numerous articles have been published about this, and the answer is simply, "it depends". Python and R are two of the most popular languages for data science work. Both were developed in the early 1990s. Both are free and open source. Both have active, growing communities, and both are highly sought after by potential employers. According to ZipRecruiter in January 2019, the national average starting salary for people with these skills is $68,999.

Overview of R

R was developed specifically for statistics and data science work. R is good for statistics-heavy projects, ad hoc analysis, and one-time dives into a dataset.

R has been in use for statistics and data science longer than Python. Currently, R offers a wider variety of visualization packages than Python.

R is mainly used when the data analysis task requires standalone computing or analysis by an individual scientist.

One significant limitation of R is that it is difficult to integrate with workflows, databases, and websites.

Another significant limitation of R is that it can be difficult to program in. People who already know how to program can find R tedious and needlessly complex. I once began writing a program in R and realized it would take me about 6 hours to complete. The R modules were needlessly complex and hard to configure. I decided to write the program in another language instead and completed it in about an hour and a half. Why? Because R was not designed to be general or easily extensible. In the other language, I was able to call functions to do much of what I wanted. In R, I would have had to call modules that were not well designed for what I wanted to do. It simply wasn't worth it.

Overview of Python

With the development of several Python libraries such as NumPy (scientific computing), pandas (data manipulation), matplotlib (data visualization), and scikit-learn (machine learning) in the 2000s, there has been rapid growth in the number of Python users in the data science community, especially in industry.

An important advantage of Python over R is that Python is a general-purpose programming language. Thus, with Python, you can automate work, run it on web servers, and incorporate data science tasks into a production environment.
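To make the point above concrete, here is a minimal sketch of a small analysis function exposed through a web endpoint, using only the Python standard library. The names summarize, app, and serve, the JSON request format, and the port are all assumptions made for illustration. A production setup would more likely use a dedicated web framework and a library such as pandas.

```python
import json
import statistics
from wsgiref.simple_server import make_server


def summarize(values):
    """Return basic summary statistics for a list of numbers."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        # The sample standard deviation needs at least two data points.
        "stdev": statistics.stdev(values) if len(values) > 1 else 0.0,
    }


def app(environ, start_response):
    """Hypothetical WSGI endpoint: expects a JSON list of numbers in the
    request body and responds with its summary statistics as JSON."""
    try:
        length = int(environ.get("CONTENT_LENGTH") or 0)
        values = json.loads(environ["wsgi.input"].read(length) or b"[]")
        body = json.dumps(summarize(values)).encode("utf-8")
        status = "200 OK"
    except (ValueError, TypeError, statistics.StatisticsError) as exc:
        # Malformed JSON, non-numeric data, or an empty list ends up here.
        body = json.dumps({"error": str(exc)}).encode("utf-8")
        status = "400 Bad Request"
    headers = [
        ("Content-Type", "application/json"),
        ("Content-Length", str(len(body))),
    ]
    start_response(status, headers)
    return [body]


def serve(host="127.0.0.1", port=8000):
    """Run the endpoint with the standard library's reference server
    (suitable for local experiments only, not for production)."""
    with make_server(host, port, app) as server:
        server.serve_forever()
```

Calling serve() starts a local server that accepts a POST request whose body is a JSON list such as [1, 2, 3, 4]. The same summarize function can also be called from a scheduled script, which is the kind of reuse the paragraph above describes.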

For aspiring data scientists, Python is generally considered easier to pick up. Two of its advantages are its readability and the fact that it is a more general programming language. Since R was built with statisticians in mind, it represents the way statisticians think pretty well, but many programmers find the design of R irritating because it is different from what they are used to.

Also, since this class mainly focuses on machine learning, which places greater emphasis on large-scale applications and prediction accuracy, Python is a better choice than R because of its flexibility for production use, especially when data science tasks need to be integrated with web applications.

If you are interested in some of the statistics, facts, and comparisons between R and Python, take a look at the great infographic available at https://www.datacamp.com/community/tutorials/r-or-python-for-data-analysis