Creating DataFrames

First, let’s create a DataFrame by converting a Python dictionary of lists into a Pandas DataFrame. Remember when you learned how to import packages? The name "pd" is an alias for the package. You can choose any name you want, although pd is the common convention. It gives you access to all of the capabilities of the Pandas library. To use the DataFrame constructor, you first need to reference pd, as shown below:

      # Import the pandas package
      import pandas as pd

      # Create a dictionary of lists where each key represents a column label and the list represents the cell values in each row of the column
      HeartRates = {'participant': ['p1', 'p2', 'p3', 'p4'], 'hr1': [98.1, 78, 65, 64], 'hr2': [110, 120, 129, 141], 'hr3': [76, 87, 77, 59]}
      df = pd.DataFrame(HeartRates) # Cast dictionary into DataFrame using the pandas package
      

Next, let’s view the DataFrame we just created. You know how to use the print() command, so let’s begin with that:

      print(df)
      

In these results, you see the four key-value pairs of the original dictionary organized in table form. Let’s view df again, but this time without the print() command. The print() command outputs plain text, which drops the default table styling that Pandas provides for DataFrames in a notebook:

      df
      

That looks a bit better, right? We will typically use that notation (without the print() command) to view DataFrames. You may have noticed that writing just an object's name displays the object, much like the print() command does. However, it is important to note that this only works when the name is the last line of a particular code block. An object name that is followed by other statements is not displayed at all. For example, notice that the nicely formatted version doesn't show up when we use the print() command after stating df:

      df        # Not displayed, because it is not the last line of the code block
      print(df) # This is the version you see below

      # Output
      #   participant   hr1  hr2  hr3
      # 0          p1  98.1  110   76
      # 1          p2  78.0  120   87
      # 2          p3  65.0  129   77
      # 3          p4  64.0  141   59
      

Let’s examine a few more details. First, there is an unlabeled list of numbers from 0 to 3 in the leftmost column. This is a RangeIndex that was automatically added by the DataFrame constructor to uniquely identify each row. The dictionary keys became the column names, which serve as a labeled index for the columns. Although it is not displayed in the output, each column also has a numbered position (0 to n-1), just like the rows. However, when you print a DataFrame, only the labels (for both columns and rows) are displayed. When no row labels have been specified, the RangeIndex is displayed instead, as it is for the rows in the example above. In summary, both rows and columns can be referenced by number and by label. The rows use the RangeIndex as their labels unless an existing column or a list of labels is specified as the index. When the data comes from a dictionary, as it does here, the columns take their labels from the dictionary keys.
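
To make the difference between numbered positions and labels concrete, here is a minimal sketch that reads the same cell both ways. The .loc and .iloc accessors are standard pandas features that this section has not introduced yet, and the variable names (heart_rates, by_label, by_position) are only illustrative.

```python
import pandas as pd

heart_rates = {
    'participant': ['p1', 'p2', 'p3', 'p4'],
    'hr1': [98.1, 78, 65, 64],
    'hr2': [110, 120, 129, 141],
    'hr3': [76, 87, 77, 59],
}
df = pd.DataFrame(heart_rates)

# Row labels come from the default RangeIndex, so they equal the row positions
row_labels = list(df.index)

# Column labels come from the dictionary keys
column_labels = list(df.columns)

# Label-based lookup: the row labeled 1 and the column labeled 'hr2'
by_label = df.loc[1, 'hr2']

# Position-based lookup: second row, third column (the same cell)
by_position = df.iloc[1, 2]

print(row_labels)
print(column_labels)
print(by_label, by_position)
```

Because the rows use the default RangeIndex, the row label 1 and the row position 1 point to the same row. That stops being true once a different row index is set.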

Now let’s return to the constructor in the example above. We only put in the data. The row index, column labels, and data types were inferred automatically. We can read those attributes if we want to verify what was inferred:

      print(df.index)
      print('\n')
      print(df.columns)
      print('\n')
      print(df.dtypes)

      # Output
      # RangeIndex(start=0, stop=4, step=1)

      # Index(['participant', 'hr1', 'hr2', 'hr3'], dtype='object')

      # participant    object
      # hr1           float64
      # hr2             int64
      # hr3             int64
      # dtype: object
      

Notice that df.index refers to the RangeIndex of the rows. Because we didn’t specify one of the columns as the index in the original constructor, the row index defaulted to a RangeIndex from 0 to n-1, similar to the positions you learned to use for Python lists. The df.columns attribute returned the labeled index of the columns (called Index), which was taken from the keys of the original dictionary passed into the DataFrame constructor. Lastly, the DataFrame constructor correctly inferred that participant was non-numeric (object is a data type that can hold any Python object and is commonly used for strings), hr1 was a float, and hr2 and hr3 were integers.
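
The type inference happens one column at a time. The short sketch below, which uses a reduced version of the heart rate data, suggests why hr1 became a float column even though most of its values were written as integers. Selecting a column with df['hr1'] is standard pandas syntax that has not been covered yet in this section.

```python
import pandas as pd

df = pd.DataFrame({
    'participant': ['p1', 'p2'],
    'hr1': [98.1, 78],   # one float in the list makes the whole column float
    'hr2': [110, 120],   # all integers, so the column stays integer
})

hr1_dtype = df['hr1'].dtype
hr2_dtype = df['hr2'].dtype
hr1_values = df['hr1'].tolist()

print(hr1_dtype, hr2_dtype)
print(hr1_values)   # the integer 78 is now stored as the float 78.0
```

A column holds a single data type, so pandas picks one that can represent every value in that column.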

Let’s create a DataFrame directly in the constructor without converting a dictionary, and let’s see the differences. We will begin first with an empty DataFrame:

      import pandas as pd
      df = pd.DataFrame(columns=['participant', 'hr1', 'hr2', 'hr3'])
      df.set_index('participant', inplace=True)
      df          
      

We often create empty DataFrames like the one above when we want to iterate through another DataFrame or iterable object and generate a summary table of results. Before we enter that loop to generate the summary table, we have to first create an empty DataFrame and then add the rows one at a time as we iterate through another DataFrame. But we’ll come to that later.
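
As a brief preview of that pattern, here is a minimal sketch under a few assumptions: the summary columns (min_hr, max_hr) are hypothetical, and iterrows() and .loc are standard pandas features that are covered later rather than in this section.

```python
import pandas as pd

# Source data to iterate through (a reduced version of the heart rate table)
source = pd.DataFrame(
    {'hr1': [98.1, 78], 'hr2': [110, 120], 'hr3': [76, 87]},
    index=['p1', 'p2'],
)

# Start with an empty summary table that only has column labels
summary = pd.DataFrame(columns=['min_hr', 'max_hr'])

# Add one summary row per participant while iterating through the source
for participant, row in source.iterrows():
    summary.loc[participant] = [row.min(), row.max()]

print(summary)
```

Assigning to a row label that does not exist yet adds a new row, which is how the empty table grows during the loop.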

Now let’s create a DataFrame with data in the constructor:

      # Option 1:
      # Data in list of lists (i.e. lists of rows)
      # Column labels declared separately
      # Set index in the constructor
      import pandas as pd
      df = pd.DataFrame(data=[[98.1, 110, 76], [78, 120, 87], [65, 129, 77], [64, 141, 59]], index=['p1', 'p2', 'p3', 'p4'], columns=['hr1', 'hr2', 'hr3'])
      df
      

Notice that the “participant” column is gone because the participant IDs are now the labeled row index. The rows can still be referenced by their numbered positions, but those numbers are no longer displayed. Also, the data is input as a list of row lists. Whereas a dictionary groups the data into column-name/column-values pairs, the "data=" parameter of the DataFrame constructor receives the data here as a list of row-value lists.
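
The two layouts can be compared side by side. The sketch below builds the same table once from rows and once from columns, then checks that the results match. The variable names (from_rows, from_columns) are only illustrative.

```python
import pandas as pd

participants = ['p1', 'p2', 'p3', 'p4']

# Row-oriented: each inner list holds one participant's three readings
from_rows = pd.DataFrame(
    data=[[98.1, 110, 76], [78, 120, 87], [65, 129, 77], [64, 141, 59]],
    index=participants,
    columns=['hr1', 'hr2', 'hr3'],
)

# Column-oriented: each key is a column label and each list is that column's values
from_columns = pd.DataFrame(
    {'hr1': [98.1, 78, 65, 64], 'hr2': [110, 120, 129, 141], 'hr3': [76, 87, 77, 59]},
    index=participants,
)

# equals() compares the values, labels, and data types of the two DataFrames
same = from_rows.equals(from_columns)
print(same)
```

Which layout is more convenient usually depends on how the data arrives: record by record, or one variable at a time.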

You can also set the index after a DataFrame has been created. The advantage of this technique is that the index keeps the name of the column it came from ("participant") in case it is needed later.

      # Option 2: 
      # Data in list of lists including row index labels
      # Set index after the constructor
      import pandas as pd
      df = pd.DataFrame(data=[['p1', 98.1, 110, 76], ['p2', 78, 120, 87], ['p3', 65, 129, 77], ['p4', 64, 141, 59]], columns=['participant', 'hr1', 'hr2', 'hr3'])
      df.set_index('participant', inplace=True)
      df        
      

If you want to keep the dictionary form and input the data directly into the DataFrame constructor, you can keep the data organized in columns and specify the index separately:

      # Option 3: 
      # Inputting data as a dictionary allows a column format with labels
      # Set index in the constructor and optionally specify a column label for the row index
      import pandas as pd
      df = pd.DataFrame({'hr1':[98.1, 78, 65, 64], 'hr2':[110, 120, 129, 141], 'hr3':[76, 87, 77, 59]}, index=['p1', 'p2', 'p3', 'p4'])
      df.index.name = 'participant' # Optional: gives the row index a name
      df        
      

Each technique above may be useful depending on the scenario.
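
One practical difference between the options is whether the row index ends up with a name. The sketch below compares the Option 1 and Option 2 styles on a reduced data set. The reset_index() method is a standard pandas method that this section has not introduced, and the variable names are only illustrative.

```python
import pandas as pd

# Option 1 style: labels are passed through index=, so the index has no name
unnamed = pd.DataFrame(
    data=[[98.1, 110, 76], [78, 120, 87]],
    index=['p1', 'p2'],
    columns=['hr1', 'hr2', 'hr3'],
)

# Option 2 style: set_index() carries the column name over to the index
named = pd.DataFrame(
    data=[['p1', 98.1, 110, 76], ['p2', 78, 120, 87]],
    columns=['participant', 'hr1', 'hr2', 'hr3'],
).set_index('participant')

print(unnamed.index.name)   # None
print(named.index.name)     # participant

# Because the name was kept, the index can be turned back into a regular column
restored = named.reset_index()
print(list(restored.columns))
```

A named index is mostly a convenience, but it makes it easier to move the index back into the table without renaming anything.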

Why create an index at all? Well, you’ll see for yourself later in this course. For now, just know that indexes are important for fast lookup and access. Also, note that while a RangeIndex is always unique, labeled indexes (for rows and columns) do not have to be unique. In fact, it’s common to use a categorical grouping of some sort as a labeled index to speed up returning all records whose index equals a particular group.
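
Here is a minimal sketch of a non-unique labeled index. The group labels ('control' and 'treatment') are hypothetical and are not part of the heart rate data above, and .loc is a standard pandas accessor that is covered later.

```python
import pandas as pd

df = pd.DataFrame(
    {'participant': ['p1', 'p2', 'p3', 'p4'], 'hr1': [98.1, 78, 65, 64]},
    index=['control', 'treatment', 'control', 'treatment'],  # hypothetical groups
)
df.index.name = 'group'

# The labels repeat, so this index is not unique
index_is_unique = df.index.is_unique
print(index_is_unique)

# A single label lookup returns every row that carries that label
control_rows = df.loc['control']
print(control_rows)
```

Looking up one label returns all of the matching rows as a smaller DataFrame.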

Next, let’s learn how to read, update, and delete information from DataFrames.