Machine Learning with Python

Python is an increasingly popular data analytics platform that is commonly used for research and in industry. Useful machine learning libraries exist for Python. Python can be used for stand-alone analytics, and Python programs can also be embedded in web pages.

  1. We will use Python 3.x (not Python 2.x) in this class.

  2. Jupyter Notebook. You will use this to conduct analysis and to document your code and progress as you work. It lets you enter instructions and see results in the browser.

  3. Google Colaboratory Notebook.

  4. Various data-science-based Python libraries. NumPy, Scikit-Learn, Pandas, Matplotlib, and many more are available. Most of these machine learning libraries are preinstalled in Anaconda, so we do not need to install them. Rather, you just import the libraries that you want to use in the Jupyter Notebook.

  5. Additional Python libraries can be installed by using pip or other accepted installation utilities.
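Before importing, it can help to confirm which of the libraries listed above are actually present in your environment. The following is a minimal sketch using only the standard library; the helper name `find_missing` and the list `LIBRARIES` are illustrative names, not part of any package.

```python
import importlib.util

# Import names for the libraries mentioned above.
# Note: Scikit-Learn is installed as "scikit-learn" but imported as "sklearn".
LIBRARIES = ["numpy", "pandas", "sklearn", "matplotlib"]


def find_missing(names):
    """Return the import names that are not installed in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = find_missing(LIBRARIES)
if missing:
    # In a notebook cell, a missing package can usually be installed with
    # the %pip magic, for example: %pip install scikit-learn
    print("Not installed:", ", ".join(missing))
else:
    print("All listed libraries are available to import.")
```

In an Anaconda environment this would normally report that everything is available, after which a plain `import numpy` (and so on) is all that is needed.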

Python Notebook-Based IDEs

To write Python code, we need an integrated development environment (IDE). An IDE is a software application that gives programmers a comprehensive set of development tools, such as a code editor, build automation tools, and debugging features. An IDE can be installed locally or hosted in the cloud.

There are many good IDE options for Python, and you can select any of them for this course. However, the best IDEs for data analytics use the .ipynb file format (based on the Jupyter Notebook IDE format), which is just one of several formats in which to write Python. These notebooks are very handy for crunching data because they break programs into cells and let you execute one or more cells at a time. They also make it easy to document your code and display your results. While these are not the only options available, they are the options that I’ve personally used and would recommend, and many of my friends and associates working as professional data scientists use them as well.

Advantages of Working in Jupyter Notebooks

When doing data analytics in Python, practitioners typically work in Jupyter Notebooks because notebooks provide the following advantages:

  • They let you write and execute Python commands in the browser window.

  • They are designed for interactive work (not for writing long Python programs). Code can be executed cell by cell or as a sequence of cells.

  • They let you easily blend code, results, and documentation.

  • Notebook files are stored with the .ipynb extension (IPython notebook file).

  • They are free to use.

What Gets Stored in a Jupyter Notebook

A notebook is a document that keeps code, results, and explanations together. Data analysts and researchers use notebooks to write and edit code that tells a story. Because they mix code, outputs, and explanation, notebooks are ideal for running and describing your analysis. In addition, the code in a notebook can be executed to perform analysis in real time and can be run again later, if desired. This means you can also download and run notebooks saved by others, which is a convenient way to learn how to do analyses.

A notebook contains:

  1. Instructions, comments, and explanations

  2. Code, including executable code and calls to libraries and functions

  3. Results, tables, and plots

  4. Hotlinks and images
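On disk, an .ipynb file is a JSON document in which each cell records its type, its source text, and (for code cells) its saved outputs. The sketch below builds a small hand-written notebook in the version 4 layout and reads it back with the standard `json` module. The cell contents are invented for illustration, and real notebooks carry more metadata than is shown here.

```python
import json

# A hand-written, minimal notebook in the nbformat 4 layout (illustrative content).
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {
            # Explanations and instructions live in markdown cells.
            "cell_type": "markdown",
            "metadata": {},
            "source": ["# Example analysis\n", "This cell explains the next step."],
        },
        {
            # Code cells store the code and the results it produced.
            "cell_type": "code",
            "execution_count": 1,
            "metadata": {},
            "source": ["total = 2 + 3\n", "total"],
            "outputs": [
                {
                    "output_type": "execute_result",
                    "execution_count": 1,
                    "data": {"text/plain": ["5"]},
                    "metadata": {},
                }
            ],
        },
    ],
}

# Serialize to the text that would be saved in the .ipynb file, then load it back.
text = json.dumps(notebook, indent=1)
loaded = json.loads(text)

cell_types = [cell["cell_type"] for cell in loaded["cells"]]
print(cell_types)  # ['markdown', 'code']
```

Because the results are saved alongside the code, someone who opens the file can see the outputs without rerunning anything.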

Jupyter Notebook and Google Colaboratory

  • Jupyter Notebook: the original .ipynb IDE. This IDE is installed on your computer. It appears to run in the cloud because it uses a web browser (e.g., Chrome, Firefox) as its interface. However, it runs on a local server on your own machine. Jupyter Notebook was formerly known as the IPython notebook. The name Jupyter is derived from the words Julia, Python, and R, the first programming languages that the Jupyter application targeted. Now, Jupyter Notebook supports many other programming languages.

    • Advantages: It is faster than other local installation IDEs and gives you full control over the packages that you install. Jupyter Notebook is available for PC and Mac, and it comes as part of the Anaconda installation, which manages packages for you.

    • Disadvantages: If your laptop is slow, then Jupyter will run slowly. If there are errors in your computer's configuration or you install incompatible libraries, then it’s up to you to figure out how to fix the problem. Bugs are not common, but dealing with them can be a pain. You typically have to search online help resources to determine how to fix your specific problem.

  • Google Colaboratory: This is a cloud-based IDE (it is served from a web server on the internet, so that your web browser acts as the client and is the only thing installed on your local machine).

    • Advantages: This option requires no installation, and there are no bugs that you have to fix. Google Colab is very fast. For large jobs, it beats Jupyter Notebook in speed because Google gives you a 4-core CPU, a GPU, and 20 GB of memory for your virtual machine. By comparison, Azure Notebooks gives you a 2-core CPU, no GPU, and 4 GB of memory. Colab actually timed faster than Jupyter Notebook on a laptop with 64 GB of RAM and an Intel Core i9 processor. Colab also makes it easy to import libraries because many are preinstalled.

    • Disadvantages: It is difficult to import packages that aren’t preinstalled in Colab. It is nearly impossible to get the pyodbc package (for connecting to Azure SQL Server databases) to work in Colab. Accessing local CSV files and importing .py files (with your own functions) takes a bit longer and is clunky to the point of being annoying. It is easier to pull CSV files from the web over HTTP.
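
As a rough illustration of pulling a CSV file from a URL, here is a minimal sketch that uses only the standard library. The function name `read_csv_from_url` is hypothetical, and the demo uses a `data:` URL as a stand-in so that the example runs without network access; in practice you would pass the http(s) address of your own file. If Pandas is available, `pandas.read_csv` also accepts a URL directly.

```python
import csv
import io
from urllib.request import urlopen


def read_csv_from_url(url, encoding="utf-8"):
    """Download a CSV file and return its rows as a list of dictionaries."""
    with urlopen(url) as response:
        text = response.read().decode(encoding)
    # DictReader uses the first line of the file as the column names.
    return list(csv.DictReader(io.StringIO(text)))


# Stand-in for a real address such as "https://example.com/data.csv" (hypothetical).
# A data: URL carries the file contents inline, so no network is required here.
DEMO_URL = "data:text/csv,name,score%0Aada,90%0Alin,85"

rows = read_csv_from_url(DEMO_URL)
print(rows[0]["name"], rows[0]["score"])  # ada 90
```

Note that every value comes back as a string, so numeric columns need to be converted before doing arithmetic on them.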