1.1 Introduction
Assuming you have learned the basics of programming in Python (simple variables and data structures, if statements, iteration, and packages), you are ready to learn the primary data structure used in Python data science: the pandas DataFrame. DataFrames may be one of the primary reasons that Python has become the most prominent language for data science. They offer many useful features for analyzing, editing, and extracting data from the rows and columns of a table structure. The table below is an example of how a DataFrame is displayed in a Jupyter notebook (.ipynb):
A DataFrame is a size-mutable two-dimensional labeled data structure with columns of potentially different types. Think of it as an in-memory spreadsheet. Review the constructor for DataFrames below:
DataFrame([data, index, columns, dtype, copy])
data: the actual data to be stored in a tabular format (i.e., rows and columns); can be a dictionary, a list, a pandas Series, or many other list-like objects
index: the label of each row; can be a number or a name; can be specified as a separate list (the length of the list must equal the number of rows in the data) or taken from one of the existing columns; defaults to a RangeIndex (0, 1, 2, ...) if the input data carries no index information and no index is provided
columns: the label of each column; defaults to a RangeIndex (0, 1, 2, ...) if the input data carries no column labels and no columns are provided
dtype: the intended data type of the columns; only a single dtype is allowed, so if it is set, it applies to every column; if it is not set, the dtype is inferred from the data individually for each column; if set, it must be appropriate for the data (e.g., an error will occur at runtime if a column is forced to int when it contains non-numeric characters)
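copy: whether to copy the input data; if set to True, the new DataFrame holds its own copy of the original data, so updates to one will not affect the other; in recent versions of pandas the default is None, which copies dictionary input but does not copy DataFrame or two-dimensional array input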
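The parameters above can be sketched in a short example. This is a minimal illustration, not the only way to build a DataFrame; the sample values and the variable names (people, grid, source, safe) are made up for demonstration.

```python
import numpy as np
import pandas as pd

# data as a dictionary: the keys become the column labels.
# index supplies one label per row (three rows, three labels).
people = pd.DataFrame(
    {"name": ["Ana", "Ben", "Cy"], "age": [31, 25, 47]},
    index=["a", "b", "c"],
)

# data as a list of lists carries no labels, so columns are given separately.
# dtype=float applies the same type to every column.
# No index is given, so the rows are labeled with a RangeIndex (0, 1).
grid = pd.DataFrame([[1, 2], [3, 4]], columns=["x", "y"], dtype=float)

# copy=True makes the DataFrame independent of the array it was built from.
source = np.array([[1, 2], [3, 4]])
safe = pd.DataFrame(source, copy=True)
source[0, 0] = 99  # change the original array only

print(people)
print(grid.dtypes)
print(safe)
```

Printing safe after the assignment shows that its first value is still 1, because the DataFrame was built from a copy of the array.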
A constructor is used whenever a variable needs to be created to store an object that is more complicated than the basic data types you learned previously (e.g., int, float, str, bool). Typically, this means that certain parameters describing the object can or must be specified when it is created. For example, the DataFrame constructor accepts data, index, columns, dtype, and copy.
Depending on the specific constructor, you often have the option of omitting some or all of the parameters, in which case each one takes a default value or is inferred from the other parameters. In other words, you can create an empty DataFrame using the command DataFrame() and default values will be used. Or, you can set one or all of them as you’ll see next.
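As a brief sketch of these defaults, the example below calls the constructor once with no parameters and once with only data. The column name score and its values are invented for illustration.

```python
import pandas as pd

# No parameters at all: an empty DataFrame with no rows and no columns.
empty = pd.DataFrame()
print(empty.shape)  # (0, 0)

# Only data is given: index defaults to a RangeIndex and dtype is inferred.
partial = pd.DataFrame({"score": [90, 85]})
print(partial.index)
print(partial.dtypes)
```

Because the score values are all whole numbers, pandas infers an integer dtype for that column without being told.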