1.4 Modifying DataFrames
Add Rows
Adding values to a DataFrame is done differently for rows versus columns. Let’s begin by manually adding rows to our empty heart rate DataFrame. For rows, we will use either the .loc or .iloc properties:
import pandas as pd
df = pd.DataFrame(columns=['participant', 'hr1', 'hr2', 'hr3'])
df.set_index('participant', inplace=True)
p1_list = [98.1, 110, 76]
p2_list = [78.0, 120, 87]
p3_list = [65.0, 129, 77]
p4_list = [64.0, 141, 59]
df.loc['p1'] = p1_list
df.loc['p2'] = p2_list
df.loc['p3'] = p3_list
df.loc['p4'] = p4_list
# ...or, add the lists directly, like so:
df.loc['p1'] = [98.1, 110, 76]
df.loc['p2'] = [78.0, 120, 87]
df.loc['p3'] = [65.0, 129, 77]
df.loc['p4'] = [64.0, 141, 59]
df
Notice that all we had to do was refer to the row index label. These labels didn’t exist before we used them in those four lines. As a result, the row index labels were also created by those lines.
However, this technique doesn’t work the same for the .iloc() method using RangeIndex numbers. Rather, you have to prespecify the number of rows for the DataFrame in order to refer to them by the RangeIndex number:
import pandas as pd
df = pd.DataFrame(index=[0, 1, 2, 3], columns=['participant', 'hr1', 'hr2', 'hr3'])
df.set_index('participant', inplace=True)
df.iloc[0] = [98.1, 110, 76]
df.iloc[1] = [78.0, 120, 87]
df.iloc[2] = [65.0, 129, 77]
df.iloc[3] = [64.0, 141, 59]
df
Notice that if we set a labeled row-index column as we did previously, the values end up being empty (NaN) because we never specified them. You can avoid this by adding the row index label along with the other values and not setting a row index until after all values are added. Also, a native Python function commonly used in this scenario is range(), which will return a list of numbers from 0 to n - 1 of whatever number is passed in:
import pandas as pd
df = pd.DataFrame(index=range(4), columns=['participant', 'hr1', 'hr2', 'hr3'])
df.iloc[0] = ['p1', 98.1, 110, 76]
df.iloc[1] = ['p2', 78.0, 120, 87]
df.iloc[2] = ['p3', 65.0, 129, 77]
df.iloc[3] = ['p4', 64.0, 141, 59]
df.set_index('participant', inplace=True)
df
Add Columns
Adding columns is simple on the surface but can come with some added complications, depending on why we’re adding new columns. Let’s start by using the .insert() method to add simple, native Python list objects:
import pandas as pd
df = pd.DataFrame(data=[['p1', 98.1, 110, 76], ['p2', 78, 120, 87], ['p3', 65, 129, 77], ['p4', 64, 141, 59]], columns=['participant', 'hr1', 'hr2', 'hr3'])
df.set_index('participant', inplace=True)
# These are native Python lists
hr4 = [81, 84, 75, 64]
age = [25, 49, 51, 18]
df['hr4'] = hr4 # Add to the end of the columns
df.insert(0, 'Age', age) # Insert into location (location, column name, value list)
df
Adding lists is very easy to do. However, it comes with some limitations. The list must be added in the order it is stored (although you can sort lists). That means you don’t have the ability to match records by row index. Sometimes that won’t matter, but other times it may be very useful. The .join() method resolves some of these issues by allowing two DataFrames, Series, or a combination of both to be joined based on an index or key using relational types: inner, outer, left, and right.
import pandas as pd
df = pd.DataFrame({'hr1':[98.1, 78, 65, 64], 'hr2':[110, 120, 129, 141], 'hr3':[76, 87, 77, 59]}, index=['p1', 'p2', 'p3', 'p4'])
# This is a Pandas Series; notice there are five records instead of four
age = pd.Series([25, 49, 51, 18, 36], name='Age', index=['p1', 'p2', 'p3', 'p4', 'p5'])
# This is a Pandas DataFrame; notice the index is in a different sort order
df2 = pd.DataFrame({'hr4':[81, 84, 75, 64, 72], 'hr5':[88, 92, 79, 67, 80]}, index=['p4', 'p1', 'p3', 'p5', 'p2'])
df = df.join(age, how='inner') # Series and DataFrames need to be added to a new version of the DataFrame (even
df = df.join(df2, how='inner') # if it's the same name) because the join occurs on a copy of the original object
df
As you can see, adding DataFrame Series or DataFrames to another DataFrame has its own (dis)advantages. You can’t specify where you want the new column(s) added. However, you can use the indexes to your advantage to perform database-like joins (inner, left, right, and outer) that allow you to join mismatched numbers of rows and match them up by row index label, even if the two DataFrames are sorted differently.
Let’s say you needed to add this heart rate data to some other patient information. For example, maybe you were having the patients walk on a treadmill for some measurements, and you want to indicate what their activity status was at the time of measurement. In that case, you need to add both columns and rows. That is what the .merge() method was made for. Merging two DataFrames (or Series) allows a relational database style merge, in which the values in a DataFrame can be duplicated for multiple relational records in another DataFrame.
For example, let’s say the heart rate data were structured a bit differently:
import pandas as pd
hr_df = pd.DataFrame([98.1, 78, 65, 64, 76, 87, 77, 59], columns=['hr'], index=['p1', 'p2', 'p3', 'p4', 'p5', 'p2', 'p3', 'p4'])
# Looks like:
# hr participant
# 0 98.1 p1
# 1 78.0 p2
# 2 65.0 p3
# 3 64.0 p4
# 4 76.0 p5
# 5 87.0 p2
# 6 77.0 p3
# 7 59.0 p4
age = pd.DataFrame([[25, 49, 51, 18, 36], ['m', 'f', 'f', 'm', 'f']], columns=['Age', 'Gender'], index=['p1', 'p2', 'p3', 'p4', 'p5'])
# Looks like:
# age gender participant
# 0 25 m p1
# 1 49 f p2
# 2 51 f p3
# 3 18 m p4
# 4 36 f p5
print(hr_df)
print(age_df)
age_df.merge(hr_df)
Notice that Pandas knew to use “participant” as the primary/foreign key relationship because the column names were equal. However, you can specify two different column names (assuming they represent a relational key) to use as the merge index: age_df.merge(hr_df, left_key='participant', right_key='participant').
Join versus Merge
Basic differences or use cases for join versus merge:
-
Join can automatically combine based on set indexes; however, each DataFrame must have a specified index.
-
Merge can automatically combine based on two column names that are equal; however, you need to have a column in each DataFrame with the same name, and which theoretically map to each other like a primary key/foreign key relationship, and which are not the designated indexes for each DataFrame.
Therefore, if you have set indexes, use join. If not, as long as you have two column names that are equal, use merge.
Edit/Update
Next, let’s edit and update values. Updating an entire row or column is accomplished almost exactly the way new rows and columns are added. The only difference is that you specify row and column labeled indexes that already exist. For example:
import pandas as pd
df = pd.DataFrame({'hr1':[98.1, 78, 65, 64], 'hr2':[110, 120, 129, 141], 'hr3':[76, 87, 77, 59]}, index=['p1', 'p2', 'p3', 'p4'])
df.loc['p1'] = [99, 111, 77] # This changes the top record to these values
df.iloc[0] = [99, 111, 77] # This changes the top record to these values
df
Updating columns works the same way:
import pandas as pd
df = pd.DataFrame({'hr1':[98.1, 78, 65, 64], 'hr2':[110, 120, 129, 141], 'hr3':[76, 87, 77, 59]}, index=['p1', 'p2', 'p3', 'p4'])
df['hr1'] = [99.1, 78.4, 76.6, 63.9] # This changes the first column to these values
df.iloc[:, 0] = [99.1, 78.4, 76.6, 63.9] # This changes the first column to these values
df
Note that if you update a row or a column, you must specify a value for every case or field (even if some of them are being set to their original value). Next, let’s update individual cell values. Again, although this can be done with the .loc() and .iloc() methods, it is more efficient to use .at() and .iat(), which only access the single value rather than accessing the entire Series before filtering down to one value.
df.at['p1', 'hr1'] = 99 # This changes 99.1 to 99.0
df.iat[1, 0] = 78 # This changes 78.4 to 78.0
df
Delete
Now let’s learn to delete both rows and columns.
import pandas as pd
df = pd.DataFrame({'hr1':[98.1, 78, 65, 64], 'hr2':[110, 120, 129, 141], 'hr3':[76, 87, 77, 59]}, index=['p1', 'p2', 'p3', 'p4'])
df.drop(['p1', 'p3'], inplace=True) # Works for one or more rows based on label index
df.drop(['hr1'], axis=1, inplace=True) # Works for one or more columns based on label index
df
As you might guess, the axis=1 property is necessary to drop columns rather than rows. The default is axis=0 (i.e., rows). You can also drop based on a row RangeIndex number:
import pandas as pd
df = pd.DataFrame({'hr1':[98.1, 78, 65, 64], 'hr2':[110, 120, 129, 141], 'hr3':[76, 87, 77, 59]}, index=['p1', 'p2', 'p3', 'p4'])
df.drop(df.index[2], inplace=True) # Remember that .index refers to a list of row labeled indexes, but we are referring to it by number because .index returns a list
df.drop(df.columns[1], axis=1, inplace=True) # Similarly, columns returns a list of columns which we can refer to by the list index number
df
Or a range of indexes:
import pandas as pd
df = pd.DataFrame({'hr1':[98.1, 78, 65, 64], 'hr2':[110, 120, 129, 141], 'hr3':[76, 87, 77, 59]}, index=['p1', 'p2', 'p3', 'p4'])
df.drop(df.index[[1, 2]], inplace=True)
df
Or from the bottom up:
import pandas as pd
df = pd.DataFrame({'hr1':[98.1, 78, 65, 64], 'hr2':[110, 120, 129, 141], 'hr3':[76, 87, 77, 59]}, index=['p1', 'p2', 'p3', 'p4'])
df.drop(df.index[[-2]], inplace=True) # Using a negative index indicates that we start from the end of the list and move forward (not zero-based)
df