The Data Life Cycle

Although we have already explained that data science is a major part of the IS discipline, it is a very large domain itself and deserves its own introduction. Data science is the core discipline behind academic programs like business data analytics and AI in business.

In the prior chapter, we introduced you to several concepts to help you position the IS domain in your understanding, including a Venn-like diagram tying together organizations, people, technology, and data. We introduced you to systems (as opposed to linear) thinking and the concept of IS as unstructured problem solving. To help you understand the related domains of business data analytics and AI in business, it may help to understand the overall life cycle of data. Then, we'll identify where all three domains (IS, business data analytics, and AI in business) play a role in this life cycle.

1. Data Creation/Acquisition

The data life cycle begins when data is generated or collected. During research and development, data is generated through experiments and research designs. In business, it's more often created when transactions of any kind occur. For example, a customer makes a purchase, an employee is hired or fired, or inventory is acquired from a supplier. Data can also be generated by Internet-capable devices like smartphones, smart refrigerators, cars, lights, plugs, locks, watches, glasses, heart monitors, washing machines, ovens, machines, robotics, and more. Data can also be gathered through web scraping, surveys, and social media sites. Sensors for weather, traffic, air quality, light, and pressure generate data as well, as do accelerometers, gyroscopes, infrared devices, ultrasonic devices, and laser light (LIDAR) sensors. The list of data source possibilities continually increases.
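To make one of these acquisition methods concrete, here is a minimal Python sketch that pulls transaction records from a hypothetical web API. The URL, endpoint, and field names are placeholders for illustration, not a real service.

    # Minimal sketch of acquiring data from a hypothetical web API.
    # The URL and endpoint below are placeholders, not a real service.
    import requests

    def fetch_transactions(base_url):
        # Request one page of transaction records as JSON
        response = requests.get(f"{base_url}/transactions", params={"page": 1})
        response.raise_for_status()   # stop if the request failed
        return response.json()        # e.g., a list of purchase records

    records = fetch_transactions("https://api.example.com")
    print(f"Acquired {len(records)} records")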

2. Data Storage

Once data is acquired, it needs to be stored securely somewhere for future use. This is where database management systems (DBMS) come in, which you'll learn more about later. There are both SQL (e.g., MySQL, PostgreSQL, SQL Server) and NoSQL (e.g., MongoDB, Cassandra) DBMSs. More advanced analytics-oriented storage options have also emerged, including data warehouses (e.g., Amazon Redshift, Google BigQuery, Snowflake, Databricks), data lakes (e.g., AWS Lake Formation, Azure Data Lake, Google Cloud Storage), and cloud object storage (e.g., Amazon S3, Google Cloud Storage, Microsoft Azure Blob Storage). However, data storage can also be as simple as a file cabinet for data stored on paper.
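As a small illustration of relational (SQL) storage, the sketch below uses SQLite through Python's built-in sqlite3 module as a lightweight stand-in for the larger DBMSs named above; the table and column names are invented for this example.

    # Minimal sketch of SQL storage using Python's built-in sqlite3 module.
    # SQLite stands in for a full DBMS; the table and columns are illustrative.
    import sqlite3

    conn = sqlite3.connect("store.db")   # creates the database file if needed
    conn.execute("""CREATE TABLE IF NOT EXISTS purchases (
                        id INTEGER PRIMARY KEY,
                        customer TEXT,
                        amount REAL,
                        purchased_at TEXT)""")
    conn.execute("INSERT INTO purchases (customer, amount, purchased_at) VALUES (?, ?, ?)",
                 ("Ada Lovelace", 42.50, "2024-01-15"))
    conn.commit()

    for row in conn.execute("SELECT customer, amount FROM purchases"):
        print(row)   # each stored record comes back as a tuple
    conn.close()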

3. Data Processing

After the data are safely stored, we often need to perform some sort of processing, or "cleaning," to get them into a more useful format. You'll learn later that this is part of the extract-transform-load (ETL) process. Technically, when processing occurs as part of ETL, it is performed before the data are stored, using tools like Apache NiFi, Talend, and Informatica. OpenRefine, Trifacta, and many Python libraries (e.g., Pandas) are used for cleaning data that have already been stored. Tools like Apache Kafka allow for stream processing, MuleSoft for API integration, and Apache Spark for large-scale data processing.
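Here is a minimal Pandas sketch of the kind of cleaning described above; the columns and messy values are made up for illustration.

    # Minimal sketch of data cleaning with Pandas; the columns and messy
    # values are invented for illustration.
    import pandas as pd

    raw = pd.DataFrame({
        "customer": ["Ada", "ada ", None, "Grace"],
        "amount":   ["42.50", "17", "not available", "99.99"],
    })

    clean = raw.copy()
    clean["customer"] = clean["customer"].str.strip().str.title()      # normalize names
    clean["amount"] = pd.to_numeric(clean["amount"], errors="coerce")  # bad values become NaN
    clean = clean.dropna()                                             # drop incomplete rows

    print(clean)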

4. Data Analysis

Once the data are stored and cleaned, we can then analyze them to generate insights. These are likely the tools you may already be most familiar with. There are programming languages (e.g., SQL, Python, and R) and point-and-click programs like Excel that support very broad analyses. There are many good point-and-click tools for descriptive data mining, including Weka, RapidMiner, and Adobe Analytics. The R programming language and Python libraries like Scikit-Learn and TensorFlow allow us to perform more sophisticated predictive analyses from code, while KNIME and Azure Machine Learning Studio Designer let you do the same with the point-and-click method. We can also perform prescriptive analyses using IBM Decision Optimization, SAS, and Python libraries like SciPy and PuLP.
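As a small taste of predictive analysis in code, the sketch below fits a logistic regression with Scikit-Learn on a handful of made-up customer records; the features and outcomes are invented for illustration.

    # Minimal sketch of a predictive analysis with Scikit-Learn on made-up data.
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    # Hypothetical features (ad spend, prior purchases) and outcome (repeat buyer?)
    X = [[100, 1], [250, 3], [50, 0], [300, 5], [80, 1], [220, 4], [60, 0], [270, 2]]
    y = [0, 1, 0, 1, 0, 1, 0, 1]

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
    model = LogisticRegression().fit(X_train, y_train)
    print("Accuracy on held-out data:", model.score(X_test, y_test))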

5. Data Visualization

In addition to descriptive, diagnostic, predictive, and prescriptive analyses, we can also generate elaborate data visualizations to make data much more understandable. You are likely already familiar with visualizations in Excel, but tools like Tableau, Microsoft Power BI, and Qlik will take you much further. Programming languages like R and Python libraries such as Matplotlib and Seaborn will help you create beautiful visualizations from code that can be easily integrated into applications. Reporting tools like Looker and Google Data Studio are useful for automatically generated visualizations, while Tableau and Power BI will help you extend visualizations into dashboards and storytelling presentations.
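For example, a few lines of Matplotlib are enough to produce a basic chart from code; the sales figures below are made up for illustration.

    # Minimal sketch of a chart with Matplotlib; the sales figures are made up.
    import matplotlib.pyplot as plt

    months = ["Jan", "Feb", "Mar", "Apr"]
    sales = [120, 135, 150, 170]

    plt.bar(months, sales)
    plt.title("Monthly Sales (illustrative data)")
    plt.ylabel("Units sold")
    plt.savefig("sales.png")   # or plt.show() to open an interactive window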

6. Data Sharing and Distribution

Once you have discovered the relevant stories to tell from your data, you need to distribute those insights to relevant stakeholders. We can create application programming interfaces (APIs) for this purpose using Swagger, Postman (for API testing), and GraphQL. Data portals like CKAN and Socrata make it easy to distribute these insights throughout company portals. And there are nearly countless collaboration tools like Microsoft Teams, Slack, and Confluence for sharing reports and dashboards. Conceptually, you might think of the Internet as the data distribution infrastructure, websites and apps as the platform for data distribution, and the individual apps that generate summaries, tables, visualizations, and statistics as the data distribution itself.
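To illustrate the API idea, here is a minimal sketch of serving an analysis result over the web using Flask, a common Python framework that is not among the tools named above; the endpoint and summary values are invented.

    # Minimal sketch of exposing an insight through a web API using Flask.
    # Flask is one common Python option (an assumption here, not a tool named
    # in this chapter); the endpoint and summary values are invented.
    from flask import Flask, jsonify

    app = Flask(__name__)

    # Pretend this summary was produced by the analysis step
    MONTHLY_SUMMARY = {"month": "2024-01", "total_sales": 120, "top_product": "Widget A"}

    @app.route("/api/summary")
    def summary():
        return jsonify(MONTHLY_SUMMARY)   # stakeholders or dashboards fetch this as JSON

    if __name__ == "__main__":
        app.run(port=5000)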

7. Data Archiving

At some point, data expires or becomes less relevant because newer data has taken its place. We typically do not want to delete that data, but we may need to archive it to another location so that it doesn't slow down the processing of newer, more relevant data. Amazon Glacier and Google Coldline Storage are ideal for what we call data "cold storage." We may also want to set up data retention policies that automatically archive data based on conditions, using tools like Veeam or Veritas. Finally, there are many laws that govern the use of data that does not explicitly belong to us or is "co-owned," like customer or patient data. In these cases, we may use tools like OneTrust or TrustArc to ensure that we are following relevant regulations regarding co-owned data.
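To show the idea behind a retention policy, the sketch below moves files that have not been modified in a year into an archive folder. This is a simplified, home-grown illustration, not how Veeam, Veritas, or cloud lifecycle rules actually work, and the paths and file pattern are hypothetical.

    # Minimal sketch of a home-grown retention policy: move files that have not
    # been modified in a year into an archive folder. Real archiving tools are
    # far more robust; the paths and file pattern here are hypothetical.
    import shutil
    import time
    from pathlib import Path

    RETENTION_DAYS = 365
    source = Path("data/active")
    archive = Path("data/archive")
    archive.mkdir(parents=True, exist_ok=True)

    cutoff = time.time() - RETENTION_DAYS * 24 * 60 * 60
    for path in source.glob("*.csv"):
        if path.stat().st_mtime < cutoff:            # last modified before the cutoff?
            shutil.move(str(path), str(archive / path.name))
            print(f"Archived {path.name}")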

8. Data Disposal

When data are co-owned, there may be laws giving one or both owners the right to have their data deleted. The European Union's (EU) General Data Protection Regulation (GDPR) formalized this concept as the "right to be forgotten." If a consumer requests it, companies must securely delete their data from their primary and backup databases. There are many tools for this, called "secure delete utilities" and "data sanitization tools." Blancco and DBAN (Darik's Boot and Nuke) will virtually shred files from hard drives to make sure they are not recoverable.
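The sketch below illustrates the basic idea behind secure deletion, overwriting a file's contents before removing it. This is only a conceptual illustration, not how Blancco or DBAN actually work; it is not reliable on solid-state drives or for backup copies, and the file name shown is hypothetical.

    # Minimal sketch of the idea behind secure deletion: overwrite a file's
    # bytes before removing it. This is a conceptual illustration only; real
    # sanitization tools are far more thorough, and this approach is not
    # reliable on SSDs or for backup copies.
    import os

    def overwrite_and_delete(path, passes=3):
        size = os.path.getsize(path)
        with open(path, "r+b") as f:
            for _ in range(passes):
                f.seek(0)
                f.write(os.urandom(size))   # replace contents with random bytes
                f.flush()
                os.fsync(f.fileno())        # push the overwrite to disk
        os.remove(path)                     # finally remove the directory entry

    # overwrite_and_delete("customer_export.csv")   # hypothetical file name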

Summary

In summary, the data life cycle reviewed above provides a framework for what you will learn in this book. Like the prior chapter introducing IS, this chapter introduces concepts that we will touch upon throughout the remainder of the book. There is no way to cover every concept and technology relevant to IS and the data life cycle in a single course or book. But everything we cover in this book is relevant to some portion of the information systems development life cycle, the data life cycle, or both. And those cycles both impact and reinforce each other.