Roles, Topics, Intended Audience

Roles

After students watch Dr. Patil’s address, they ask me the following questions: (1) What kind of job or role should I look for in the field of data science? and (2) What do I study next to learn how to become a data scientist?

The core of the data science discipline comes from applied mathematics, statistics, and computer science. However, data science is applied to many disciplines, and it is enhanced by the subject matter expertise within each of these disciplines. The ideal data scientist “unicorn” will have some expertise across all areas (see this data scientist article). See the diagram below.

Figure 0.1: The Data Science Unicorn.

However, in practice, students of data science at academic institutions tend to fall into one of two types:

  1. The data science theorist: These people typically come from a reference discipline (mathematics, statistics, or computer science). They understand (and improve) the formulas behind the analyses themselves, and they push the boundaries of new models and techniques to develop data science theory.

  2. The applied data scientist: These people typically come from an applied discipline. They want to use the analyses developed by data science theorists to solve real-world problems that are relevant to their discipline. Rather than create new data science theory, they draw from existing theory, often in unexpected and unique ways, to explain “why” and “how” businesses succeed. This includes identifying the best measures and variables needed for that explanation. These people can come from any discipline because data analytics are relevant to nearly everything in today’s world. For example, sociology might use data analytics to explain and predict relationship patterns. Engineering could use it to create artificial intelligence. Business (our context) may use it to make an organization more efficient. Applied data scientists don’t need to understand mathematics and programming to the same degree that data science theorists do. They do need to understand the rules, boundaries, and trade-offs between various types of analyses, but their true strength and “value-add” is their understanding of the real-world problems where data analytics can be applied. They also understand enough data analytics to validly apply data solutions and data products to solve those problems.

    The role of applied data scientists is evolving as new technologies make it possible to build data products without being a core data scientist.

There are many different roles in this discipline that fall within one or both of those classifications. However, you can benefit from deciding as early as possible which of these two general roles you want to work toward in your academic career. If you are taking this course as a freshman or sophomore, then you have some time to choose. If you have already made significant progress in your discipline, then the information below will show you how to use this course to your advantage, given your background.

Course Topics

The purpose of this book is to help you start at the beginning of the data mining process using a dominant, industry-leading tool: Python programming. You will get a high-level, broad idea of the process as well as practical, technical Python skills for each phase.
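To give a rough sense of what practical Python skills for the early phases of the process can look like, the sketch below inspects a tiny dataset and then cleans one column. It is only an illustration, not material taken from the course: the sample data, the column names, and the helper functions (`load_rows`, `to_number`, `summarize`, `fill_missing`) are all hypothetical, and it uses only the Python standard library rather than any particular analytics package.

```python
import csv
import io
import statistics

# Hypothetical sample data; the column names and values are invented for illustration.
RAW_CSV = """customer_id,age,monthly_spend
1,34,120.50
2,,89.99
3,45,not available
4,29,150.00
"""


def load_rows(text):
    """Read CSV text into a list of dictionaries (one per row)."""
    return list(csv.DictReader(io.StringIO(text)))


def to_number(value):
    """Convert a string to float, returning None when it is blank or not numeric."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def summarize(rows, column):
    """Inspect a column: count missing values and compute the mean of the valid ones."""
    values = [to_number(row[column]) for row in rows]
    valid = [v for v in values if v is not None]
    return {
        "count": len(values),
        "missing": len(values) - len(valid),
        "mean": statistics.mean(valid) if valid else None,
    }


def fill_missing(rows, column):
    """Clean a column: replace missing or invalid values with the column mean."""
    mean = summarize(rows, column)["mean"]
    cleaned = []
    for row in rows:
        new_row = dict(row)  # copy so the original rows stay untouched
        value = to_number(row[column])
        new_row[column] = mean if value is None else value
        cleaned.append(new_row)
    return cleaned


rows = load_rows(RAW_CSV)
age_summary = summarize(rows, "age")
clean_rows = fill_missing(rows, "age")
print(age_summary)
```

Filling gaps with the column mean is just one simple choice made for this sketch; a real project would pick a strategy that suits the data and the question being asked.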

The book is structured to teach (1) introductory Python programming (1/3 of the course) and (2) the data mining process (2/3 of the course). The book includes a sample of the most basic and common tasks of the data understanding and preparation phases, with only a little exposure to modeling. However, this book is also the first of a three-course sequence.

In summary, this is one of four related courses (a sequence of three and a stand-alone course) on MyEducator, as listed in Table 0.1:

Table 0.1

| Course/Book | Technologies (in order of emphasis) | Teaching Style |
| --- | --- | --- |
| Introduction to Python Data Analytics (beginners) | 1. Python programming | In-class: students follow along with the instructor writing Python code on their laptops (in either Jupyter, Google Colab, or Azure Notebooks) with time remaining for practice. Materials: video tutorial of the same material covered in class; reading and documentation available to support videos. |
| Data Analytics and Machine Learning (intermediate) | 1. Python programming, 2. Azure ML Studio (cloud-based point-and-click) | |
| Advanced Data Analytics (advanced) | 1. JMP, 2. Python programming | In-class: students follow along with the instructor using JMP (allows more material to be covered in less time) on their laptops. Materials: video tutorial of the same material covered in class; Python code provided to replicate what the students learned in JMP. |
| Data-Mining Projects and Database Essentials (stand-alone course) | 1. Azure ML Studio (cloud-based point-and-click), 2. (optional) Azure Data Studio SQL programming, 3. Tableau, 4. Excel | In-class: students follow along with the instructor using Tableau, Azure ML Studio, and Excel on their laptops (possibly Azure Data Studio as well if the course includes SQL select statements). Materials: video tutorial of the same material covered in class; reading and documentation available to support videos. |

Audience

Who could benefit from taking these courses? These books were written to support the courses we are currently teaching. As a result, they are refined every semester based on the needs and experiences of our students. Generally, we use these books to teach three types of students, who vary in terms of their technical background (statistics and programming) and the number of courses they can take:

  1. Nontechnical students who want to develop intermediate technical and analytical expertise.

    • For example:

      • Freshmen and sophomores who haven’t declared a major, have heard about the discipline of data science, and want to learn more.

      • Students who have declared a major and want to get a minor in data science or business analytics to complement their major.

    • For these students, we teach the entire three-course sequence, although it can be adapted into a two-course sequence as well.

    • Primary technology: Python (with Azure ML Studio and JMP used to a lesser degree).

  2. Nontechnical students who want a broad exposure across the discipline of data science in a single course.

    • For example:

      • Graduate students who need a strong, one-time exposure to data science but will not likely take a job as a full-time data scientist (e.g., MBA and MPA students).

      • Undergraduate business students who don’t want to learn programming but need to work closely with data analysts in project teams and understand the programming process.

    • For these students, we teach the data mining process without getting too technical. However, these students will still learn analytics skills like modeling, feature selection, text analytics, and recommendation engines in significant detail—enough that they could contribute to a project team building data products but not enough for them to lead the analysis.

    • Primary technology: Azure ML Studio (Tableau and Excel to a lesser degree).

    • Optionally: Azure Data Studio, including the SQL language focusing on read (select) statements.

  3. Technical students who already have a background in either programming or statistics.

    • For example:

      • Statistics majors who want to learn Python and learn how to write code to automate the analytical process and decision-making.

      • Information systems or computer science majors who understand programming and automation well but want to learn the analyses and the data mining process.

    • The statistics majors typically need the first course (Introduction to Python Data Analytics). This course teaches programming for beginners and techniques for automation. The portions covering basic statistics will be unnecessary for these students. However, I’ve learned from experience that my stats majors appreciate having that content in the course because it allows them to focus more on learning the programming skills for automation while other students are catching up on their basic statistics.

      The information systems majors (my department) get either a two- or three-course sequence. Some of them are very adept at programming and can combine elements of this introductory course with the intermediate course (Data Analytics and Machine Learning) into one course. However, others who struggle with programming end up taking the entire three-course sequence for more practice.

      At my university, the computer science students already have machine learning and deep learning courses in their department. These students are usually very adept at programming. However, I still get a handful of computer science students who want to learn Python in a classroom. For these students, and for computer science students at other universities who need their first course on machine learning, I recommend the intermediate course (Data Analytics and Machine Learning). This assumes that those students have already taken a basic statistics course (as most computer science programs would require).

    • Primary technology: Python (with Azure ML Studio to a lesser degree).

In summary, this course is intended to be the first of a three-course sequence. It is targeted toward students who have a limited background in programming or statistics but want to build a thorough technical skill base that can position them for the role of data scientist within their selected discipline. Although these three courses certainly do not cover the breadth of topics and skills available to data scientists, they can position students to continue developing those skills on the job and make them more advanced than other non-statistics majors. This book typically comes before the intermediate course (Data Analytics and Machine Learning), which is then followed by the advanced course (Advanced Data Analytics). However, it can also be combined with concepts from the intermediate course and delivered to more advanced students in a one- or two-course sequence.

In addition, there is a fourth book that is designed to be a stand-alone course that requires less technical skill (i.e., no coding in Python, just click-and-drag tools like Tableau, Azure ML Studio, and Excel). Because new tools have made advanced topics more accessible, it covers a wider range of topics, including some of the most advanced ones (e.g., text analytics, recommendation engines, and web service deployment).