Reading DataFrames

Columns

Now you have a fancy new DataFrame. How do you read data from it? Let’s learn to read columns, rows, and cells beginning with columns. You can read a DataFrame column by index using either the labeled or numbered index. Let’s read the entire hr3 column:

        # Example with labeled row index
        import pandas as pd
        df = pd.DataFrame(data=[['p1', 98.1, 110, 76], ['p2', 78, 120, 87], ['p3', 65, 129, 77], ['p4', 64, 141, 59]], columns=['participant', 'hr1', 'hr2', 'hr3'])

        print(df['hr3'])
        print('\n')

        # Example without labeled row index
        df = pd.DataFrame(data=[['p1', 98.1, 110, 76], ['p2', 78, 120, 87], ['p3', 65, 129, 77], ['p4', 64, 141, 59]], columns=['participant', 'hr1', 'hr2', 'hr3'])
        df.set_index('participant', inplace=True) # Option 2a from prior example

        print(df['hr3'])

        # Output:
        # 0    76
        # 1    87
        # 2    77
        # 3    59
        # Name: hr3, dtype: int64

        # participant
        # p1    76
        # p2    87
        # p3    77
        # p4    59
        # Name: hr3, dtype: int64
        

Reading an entire column (no matter how many values are in it) is as simple as adding the column index name in brackets: df['hr3']. I gave you an example with and without a row index just so that you could see how the results get labeled. There isn’t really a practical difference when we’re just reading column data (although there may be, depending on what we want to do with the column data after reading it). Can you read a column by the range index number? Yes, but let’s come back to that in a bit. It’s not as simple as replacing the label with a number.

Rows

Next, select all values of a particular row using the .loc() and .iloc() methods:

        # Read a row by RangeIndex
        df.iloc[3]

        # Read a row by (labeled) Index
        df.loc['p4']
        
        # Output:
        hr1     64.0
        hr2    141.0
        hr3     59.0
        Name: p4, dtype: float64
        

As you can see, reading a row is easy to do by either the RangeIndex number or the Index label. However, you only have the option of reading the Index label if a designated index was set either in the DataFrame constructor or afterward using set_index(). Also, notice that the index (“participant”) does not get printed in the row results.

Pandas Series

When you select a row or a column of a DataFrame, the results are typically some form of a list: either all column values for a particular row/case or all row values for a column. Either way, the result is saved in a Pandas Series. A Series is similar to the list (array) collection object in native Python, but it has most of the same properties and methods as a DataFrame. In fact, when selecting a row or column from a DataFrame, you often don’t realize that you are now working with a Series because almost all relevant properties and methods will work exactly the same. However, you will see this concept come up as you search StackOverflow and other websites for help.

Pandas Series()

Series([data=None, index=None, dtype=None, name=None, copy=False]): documentation

  • data: the actual data to be stored in a tabular format (i.e., rows and columns); it can be a dictionary, list, pandas series object, or many other list-like objects

  • index: the index of each row; can be a number or a name; it can be specified in a separate list (list n must equal the number of rows in the data) or as one of the existing columns; default to RangeIndex if no indexing information part of input data and no index provided

  • columns: the label names of each column; default to RangeIndex if no indexing information part of input data and no index provided

  • dtype: the intended data type of each column; if set, then it must be the same dtype for all columns; otherwise, it will be inferred from the data individually for each column; if set, it must be appropriate for the data (i.e., a runtime error will occur if a column is set to be an int when there are non-numeric characters in the column)

  • copy: defaulted to False; if set to True, then the new DataFrame will be a copy of the original; updates to one will not affect the other

Cells

Finally, you’ll often need to read a particular cell value (row/column combination). There are a couple of valid ways to do this. First, let’s adapt the same .loc() and .iloc() properties you previously learned to retrieve cell values. To do this, simply add a comma followed by the appropriate column index.

        # Return the cell value of the 4th row (index 3 of the rows) and the 3rd column (index 2 of the columns)
        df.iloc[3, 2]         # Numeric range indexes for iloc
        df.loc['p4', 'hr3']   # Labeled text indexes for loc

        # Output: 59
        

You might recall that I mentioned a way to refer to a column index number to retrieve all row values of a column. To do that, substitute the row index number for the colon ":" character:

        df.iloc[:, 2]       # The ':' character means return all row values of the 3rd column (index 2 of the columns)
        df.loc[:, 'hr3']

        # Output:
        # participant
        # p1    76
        # p2    87
        # p3    77
        # p4    59
        # Name: hr3, dtype: int64
        

While .loc() and .iloc() are reliable and appropriate techniques for reading (and updating) DataFrame cells, they are less efficient than another option. That is because .loc() and .iloc() both select an entire row of data before filtering out everything except the column value you are looking for. Therefore, if you just need to access a cell value (or a scalar value), it will be faster to use the .at() or .iat() method, which only selects a cell value in the first place without the additional need to filter out the other cell values in the row:

        df.iat[3, 2]         # Numeric range indexes for iat
        df.at['p4', 'hr3']   # Same as above, but labeled text indexes for at

        # Output: 59
        # The output only prints once because the last line always
        # overwrites the first line if the print() command is not used
        

Indexes

There will also be times when you’ll want to read the indexes of a DataFrame—usually the column label indexes but sometimes the row indexes as well. Typically, this will be associated with a loop through the index values. Let’s read the row and column labels and then print them out one at a time:

        # Option 1: Cast the DataFrame to a Python list which will only keep the column labels
        columns = list(df) 
        columns

        # Output:
        # ['hr1', 'hr2', 'hr3']
        

Notice that if you have a row index label column set, then it won’t be included in the list of column labels (i.e., “participant” didn’t show up).

Although the list-cast method above works just fine, there is a built-in method in the DataFrame object that will return column labels:

        # Option 2: Use built-in Pandas DataFrame.columns method
        columns = df.columns 
        columns

        # Output:
        # Index(['hr1', 'hr2', 'hr3'], dtype='object')
        

As you can see, this technique returns the same results but also some extra metadata that can be used by other Pandas objects and methods. We will typically use this method. There is a similar method for returning the row index labels:

        rows = df.index
        rows

        # Output:
        # Index(['p1', 'p2', 'p3', 'p4'], dtype='object')
        

Now let’s loop through these labels and print them out one at a time.

        # Loop through the column labels; print them out
        for col in columns:
          print(col)

        # Output:
        # hr1
        # hr2
        # hr3
        

Recall from the chapter on iterations that the "col" in the code above refers to the value in the iterable object we are looping through. In this case, that object is a list of column labels. Therefore, "col" refers to each label name (type=string) as we iterate. However, because col represents a column label, we can also use it in the context of the original DataFrame (df). That gives us some powerful options for iterating through a dataset. For example:

        # Loop through the column labels: use them to refer to the DataFrame ('df')
        for col in df:
          print("Column: " + str(col) + "\n" + str(df[col]) + "\n")

        # Output:
        # Column: hr1
        # p1    98.1
        # p2    78.0
        # p3    65.0
        # p4    64.0
        # Name: hr1, dtype: float64

        # Column: hr2
        # p1    110
        # p2    120
        # p3    129
        # p4    141
        # Name: hr2, dtype: int64

        # Column: hr3
        # p1    76
        # p2    87
        # p3    77
        # p4    59
        # Name: hr3, dtype: int64
        

Notice a few things. First, we didn’t even need to create a list of column labels. You can iterate through DataFrame columns simply by referring to the DataFrame (df). Next, the variable used to iterate (col) still refers to the column label when you iterate through a DataFrame. Last, when it is used in the context of the DataFrame (i.e., "df[col]"), you can reference an entire column (or Pandas Series). If you recall, that’s what you learned at the top of this screen for reading DataFrame columns. We will use this looping pattern many times in the remainder of this course.