1.4 Why Python?
The two dominant programming languages among data scientists are R and Python, both of which are open source and interpreted. Conceived in 1992 and initially released in 1995, R was the favorite among statisticians for many years and is still quite prominent and growing.1 It is particularly well known for its fantastic packages and libraries for data visualization and advanced statistics. However, it fell behind Python as the most popular programming language for data science sometime between 2015 and 2016.2
Interpreted programming languages execute instructions directly. In other words, the code does not need to be compiled into machine-language instructions beforehand, as it does with compiled programming languages. However, a programming language is not inherently interpreted or compiled. Many programming languages can be executed in either form but are typically executed in one form or the other.
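Python itself illustrates why the line between the two forms is blurry. The standard interpreter (CPython) first compiles source text into an intermediate bytecode and then interprets that bytecode. The short sketch below makes the two steps visible using the built-in `compile()` and `exec()` functions and the standard `dis` module. The variable names (`price`, `quantity`, `total`) are made up for illustration.

```python
import dis

# A one-line program held as plain text (names are illustrative only)
source = "total = price * quantity"

# Step 1: compile the source text into a code object that holds bytecode
code_obj = compile(source, "<example>", "exec")

# Step 2: have the interpreter execute that bytecode in a namespace we supply
namespace = {"price": 2.5, "quantity": 4}
exec(code_obj, namespace)
result = namespace["total"]
print(result)

# Show the bytecode instructions the interpreter actually ran
dis.dis(code_obj)
```

The exact bytecode listing printed by `dis.dis` varies between Python versions, so treat it as something to inspect rather than a fixed output.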
While there are advantages and disadvantages to both approaches, the differences between them are beginning to diminish with capabilities like “just-in-time” compiling. Absent such technologies, however, the advantages of one are typically the disadvantages of the other. Generally speaking, interpreted languages are platform-independent and allow dynamic typing, dynamic scoping, and reflection. However, they can be less reliable (because they lack static type-checking), more susceptible to code injection attacks, and slower to execute.
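A few of these traits are easy to see in Python. The following is a minimal sketch, not a complete treatment: it shows dynamic typing, a type mistake that surfaces only when the offending line runs (the reliability cost of having no static type-checking), and reflection through the built-in `hasattr()` and `getattr()` functions. The function `double` and all variable names are hypothetical.

```python
def double(value):
    # No parameter type is declared: any object that supports `* 2` is accepted
    return value * 2

number_result = double(21)    # works on an integer
text_result = double("ab")    # the same function also works on a string

# Dynamic typing: one name can be rebound to a value of a different type
item = 3
item = "three"

# With no static type-checking, a type mistake appears only at run time
try:
    double(None)
    caught_error = False
except TypeError:
    caught_error = True

# Reflection: look up and call a method by its name while the program runs
method_name = "upper"
has_method = hasattr(text_result, method_name)
reflected = getattr(text_result, method_name)()

print(number_result, text_result, caught_error, has_method, reflected)
```

In a statically typed, compiled language, the call `double(None)` would typically be rejected before the program ever ran.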
Today, Python is a highly rated and widely used programming language. It has received the following rankings:
- #3 for search engine queries (TIOBE)
- #1 for programming tutorial searches (PYPL)
- #3 for StackOverflow tags and GitHub projects (Redmonk)
- #1 for CodeEval challenge submissions
- #4 in HackerRank’s developer skills report
- #4 overall and #1 fastest growing language based on StackOverflow’s annual developers survey
- #3 based on a broad range of Google searches, StackOverflow, GitHub, Reddit, HackerNews, and job postings on Indeed, CareerBuilder, Dice and others (IEEE Spectrum)
These rankings are out of all programming languages used for any purpose. Of the languages used for data science (Python, SQL, R, SAS, etc.), Python is ranked number one.3 However, it should be noted that the use of R is not actually shrinking as the rankings might imply. Rather, this shift toward Python is happening because the market of data scientists is growing rapidly, and most new entrants are learning Python.
What has led to Python's popularity? Like R, Python has a growing collection of packages for advanced statistics. Python also includes many useful packages for data cleaning and preparation. But perhaps most importantly, Python is also commonly used for web application development and for deploying machine learning models into data-driven software products. In other words, Python allows you to take a data analytics project all the way through a machine learning cycle in one language (although many other tools will still be needed along the way). Additionally, Python (like R) is easy to learn and has relatively simple syntax as far as programming languages go.
Let’s continue by selecting the tool we want to use to write Python.