Modifying DataFrames

Add Rows

Adding values to a DataFrame is done differently for rows versus columns. Let’s begin by manually adding rows to our empty heart rate DataFrame. For rows, we will use either the .loc or .iloc properties:

        import pandas as pd
        df = pd.DataFrame(columns=['participant', 'hr1', 'hr2', 'hr3'])
        df.set_index('participant', inplace=True)

        p1_list = [98.1, 110, 76]
        p2_list = [78.0, 120, 87]
        p3_list = [65.0, 129, 77]
        p4_list = [64.0, 141, 59]

        df.loc['p1'] = p1_list
        df.loc['p2'] = p2_list
        df.loc['p3'] = p3_list
        df.loc['p4'] = p4_list

        # ...or, add the lists directly, like so:

        df.loc['p1'] = [98.1, 110, 76]
        df.loc['p2'] = [78.0, 120, 87]
        df.loc['p3'] = [65.0, 129, 77]
        df.loc['p4'] = [64.0, 141, 59]

        df
        

Notice that all we had to do was refer to the row index label. These labels didn’t exist before we used them in those four lines. As a result, the row index labels were also created by those lines.

However, this technique doesn’t work the same for the .iloc() method using RangeIndex numbers. Rather, you have to prespecify the number of rows for the DataFrame in order to refer to them by the RangeIndex number:

        import pandas as pd
        df = pd.DataFrame(index=[0, 1, 2, 3], columns=['participant', 'hr1', 'hr2', 'hr3'])
        df.set_index('participant', inplace=True)

        df.iloc[0] = [98.1, 110, 76]
        df.iloc[1] = [78.0, 120, 87]
        df.iloc[2] = [65.0, 129, 77]
        df.iloc[3] = [64.0, 141, 59]

        df
        

Notice that if we set a labeled row-index column as we did previously, the values end up being empty (NaN) because we never specified them. You can avoid this by adding the row index label along with the other values and not setting a row index until after all values are added. Also, a native Python function commonly used in this scenario is range(), which will return a list of numbers from 0 to n - 1 of whatever number is passed in:

        import pandas as pd
        df = pd.DataFrame(index=range(4), columns=['participant', 'hr1', 'hr2', 'hr3'])

        df.iloc[0] = ['p1', 98.1, 110, 76]
        df.iloc[1] = ['p2', 78.0, 120, 87]
        df.iloc[2] = ['p3', 65.0, 129, 77]
        df.iloc[3] = ['p4', 64.0, 141, 59]

        df.set_index('participant', inplace=True)
        df
        

Add Columns

Adding columns is simple on the surface but can come with some added complications, depending on why we’re adding new columns. Let’s start by using the .insert() method to add simple, native Python list objects:

        import pandas as pd
        df = pd.DataFrame(data=[['p1', 98.1, 110, 76], ['p2', 78, 120, 87], ['p3', 65, 129, 77], ['p4', 64, 141, 59]], columns=['participant', 'hr1', 'hr2', 'hr3'])
        df.set_index('participant', inplace=True)

        # These are native Python lists
        hr4 = [81, 84, 75, 64]
        age = [25, 49, 51, 18]

        df['hr4'] = hr4             # Add to the end of the columns 
        df.insert(0, 'Age', age)    # Insert into location (location, column name, value list)
        df
        

Adding lists is very easy to do. However, it comes with some limitations. The list must be added in the order it is stored (although you can sort lists). That means you don’t have the ability to match records by row index. Sometimes that won’t matter, but other times it may be very useful. The .join() method resolves some of these issues by allowing two DataFrames, Series, or a combination of both to be joined based on an index or key using relational types: inner, outer, left, and right.

        import pandas as pd
        df = pd.DataFrame({'hr1':[98.1, 78, 65, 64], 'hr2':[110, 120, 129, 141], 'hr3':[76, 87, 77, 59]}, index=['p1', 'p2', 'p3', 'p4'])

        # This is a Pandas Series; notice there are five records instead of four
        age = pd.Series([25, 49, 51, 18, 36], name='Age', index=['p1', 'p2', 'p3', 'p4', 'p5'])

        # This is a Pandas DataFrame; notice the index is in a different sort order
        df2 = pd.DataFrame({'hr4':[81, 84, 75, 64, 72], 'hr5':[88, 92, 79, 67, 80]}, index=['p4', 'p1', 'p3', 'p5', 'p2'])

        df = df.join(age, how='inner')   # Series and DataFrames need to be added to a new version of the DataFrame (even 
        df = df.join(df2, how='inner')   # if it's the same name) because the join occurs on a copy of the original object
        df
        

As you can see, adding DataFrame Series or DataFrames to another DataFrame has its own (dis)advantages. You can’t specify where you want the new column(s) added. However, you can use the indexes to your advantage to perform database-like joins (inner, left, right, and outer) that allow you to join mismatched numbers of rows and match them up by row index label, even if the two DataFrames are sorted differently.

Let’s say you needed to add this heart rate data to some other patient information. For example, maybe you were having the patients walk on a treadmill for some measurements, and you want to indicate what their activity status was at the time of measurement. In that case, you need to add both columns and rows. That is what the .merge() method was made for. Merging two DataFrames (or Series) allows a relational database style merge, in which the values in a DataFrame can be duplicated for multiple relational records in another DataFrame.

For example, let’s say the heart rate data were structured a bit differently:

        import pandas as pd
        hr_df = pd.DataFrame([98.1, 78, 65, 64, 76, 87, 77, 59], columns=['hr'], index=['p1', 'p2', 'p3', 'p4', 'p5', 'p2', 'p3', 'p4'])

        # Looks like:
        #      hr participant
        # 0  98.1          p1
        # 1  78.0          p2
        # 2  65.0          p3
        # 3  64.0          p4
        # 4  76.0          p5
        # 5  87.0          p2
        # 6  77.0          p3
        # 7  59.0          p4

        age = pd.DataFrame([[25, 49, 51, 18, 36], ['m', 'f', 'f', 'm', 'f']], columns=['Age', 'Gender'], index=['p1', 'p2', 'p3', 'p4', 'p5'])

        #  Looks like:
        #    age gender participant
        # 0   25      m          p1
        # 1   49      f          p2
        # 2   51      f          p3
        # 3   18      m          p4
        # 4   36      f          p5

        print(hr_df)
        print(age_df) 
        age_df.merge(hr_df)
        

Notice that Pandas knew to use “participant” as the primary/foreign key relationship because the column names were equal. However, you can specify two different column names (assuming they represent a relational key) to use as the merge index: age_df.merge(hr_df, left_key='participant', right_key='participant').

Join versus Merge

Basic differences or use cases for join versus merge:

  • Join can automatically combine based on set indexes; however, each DataFrame must have a specified index.

  • Merge can automatically combine based on two column names that are equal; however, you need to have a column in each DataFrame with the same name, and which theoretically map to each other like a primary key/foreign key relationship, and which are not the designated indexes for each DataFrame.

Therefore, if you have set indexes, use join. If not, as long as you have two column names that are equal, use merge.

Edit/Update

Next, let’s edit and update values. Updating an entire row or column is accomplished almost exactly the way new rows and columns are added. The only difference is that you specify row and column labeled indexes that already exist. For example:

        import pandas as pd
        df = pd.DataFrame({'hr1':[98.1, 78, 65, 64], 'hr2':[110, 120, 129, 141], 'hr3':[76, 87, 77, 59]}, index=['p1', 'p2', 'p3', 'p4'])

        df.loc['p1'] = [99, 111, 77]  # This changes the top record to these values
        df.iloc[0] = [99, 111, 77]    # This changes the top record to these values
        df
        

Updating columns works the same way:

        import pandas as pd
        df = pd.DataFrame({'hr1':[98.1, 78, 65, 64], 'hr2':[110, 120, 129, 141], 'hr3':[76, 87, 77, 59]}, index=['p1', 'p2', 'p3', 'p4'])

        df['hr1'] = [99.1, 78.4, 76.6, 63.9]      # This changes the first column to these values
        df.iloc[:, 0] = [99.1, 78.4, 76.6, 63.9]  # This changes the first column to these values
        df
        

Note that if you update a row or a column, you must specify a value for every case or field (even if some of them are being set to their original value). Next, let’s update individual cell values. Again, although this can be done with the .loc() and .iloc() methods, it is more efficient to use .at() and .iat(), which only access the single value rather than accessing the entire Series before filtering down to one value.

        df.at['p1', 'hr1'] = 99   # This changes 99.1 to 99.0
        df.iat[1, 0] = 78         # This changes 78.4 to 78.0
        df
        

Delete

Now let’s learn to delete both rows and columns.

        import pandas as pd
        df = pd.DataFrame({'hr1':[98.1, 78, 65, 64], 'hr2':[110, 120, 129, 141], 'hr3':[76, 87, 77, 59]}, index=['p1', 'p2', 'p3', 'p4'])

        df.drop(['p1', 'p3'], inplace=True)         # Works for one or more rows based on label index 
        df.drop(['hr1'], axis=1, inplace=True)      # Works for one or more columns based on label index
        df
        

As you might guess, the axis=1 property is necessary to drop columns rather than rows. The default is axis=0 (i.e., rows). You can also drop based on a row RangeIndex number:

        import pandas as pd
        df = pd.DataFrame({'hr1':[98.1, 78, 65, 64], 'hr2':[110, 120, 129, 141], 'hr3':[76, 87, 77, 59]}, index=['p1', 'p2', 'p3', 'p4'])

        df.drop(df.index[2], inplace=True)            # Remember that .index refers to a list of row labeled indexes, but we are referring to it by number because .index returns a list 
        df.drop(df.columns[1], axis=1, inplace=True)  # Similarly, columns returns a list of columns which we can refer to by the list index number
        df
        

Or a range of indexes:

        import pandas as pd
        df = pd.DataFrame({'hr1':[98.1, 78, 65, 64], 'hr2':[110, 120, 129, 141], 'hr3':[76, 87, 77, 59]}, index=['p1', 'p2', 'p3', 'p4'])

        df.drop(df.index[[1, 2]], inplace=True)
        df
        

Or from the bottom up:

        import pandas as pd
        df = pd.DataFrame({'hr1':[98.1, 78, 65, 64], 'hr2':[110, 120, 129, 141], 'hr3':[76, 87, 77, 59]}, index=['p1', 'p2', 'p3', 'p4'])

        df.drop(df.index[[-2]], inplace=True) # Using a negative index indicates that we start from the end of the list and move forward (not zero-based)
        df