Filtering

Oftentimes, we don’t actually want to delete any records or rows unless they are truly junk data. Usually, we prefer to create a new, filtered version of the DataFrame instead. Pandas has some great filtering capabilities.

Filter by Row

Pandas gives us the ability to filter out rows using the following syntax:

        df[conditional]
        

The conditional referred to in that code is the same form of binary expression you learned for if statements, and it represents a certain form of data we want to keep or remove and store in a new DataFrame. It is also important to note that you can access a column of the DataFrame within the expression by specifying:

        df.columnlabel
        

Let’s work through an example with a few more columns.

        import pandas as pd
        df = pd.DataFrame({'age':[29, 55, 65, 18], 'gender':['male', 'female', 'male', 'female'], 'hr1':[98.1, 78, 65, 64], 'hr2':[110, 120, 129, 141], 'hr3':[76, 87, 77, 59]}, index=['p1', 'p2', 'p3', 'p4'])
        filtered_df = df[df.age > 30] # This conditional is evaluated as True or False for each record
        filtered_df
        

Notice that the entire record is returned if the particular cell value (age) is evaluated as True for the conditional. See if you can repeat the code above, but this time filter out all males and ignore age.

        import pandas as pd
        df = pd.DataFrame({'age':[29, 55, 65, 18], 'gender':['male', 'female', 'male', 'female'], 'hr1':[98.1, 78, 65, 64], 'hr2':[110, 120, 129, 141], 'hr3':[76, 87, 77, 59]}, index=['p1', 'p2', 'p3', 'p4'])
        filtered_df = df[df.gender == 'female']
        filtered_df
        

Now return all records of females over 30 years old.

        import pandas as pd
        df = pd.DataFrame({'age':[29, 55, 65, 18], 'gender':['male', 'female', 'male', 'female'], 'hr1':[98.1, 78, 65, 64], 'hr2':[110, 120, 129, 141], 'hr3':[76, 87, 77, 59]}, index=['p1', 'p2', 'p3', 'p4'])
        filtered_df = df[(df.gender == 'female') & (df.age > 30)]
        filtered_df
        

Notice that Pandas is a bit different from native Python in this regard. You must include parentheses around each condition, use & in place of 'and', and use '|' in place of 'or'. Now, return all records where hr1 or hr3 are over 90.

        import pandas as pd
        df = pd.DataFrame({'age':[29, 55, 65, 18], 'gender':['male', 'female', 'male', 'female'], 'hr1':[98.1, 78, 65, 64], 'hr2':[110, 120, 129, 141], 'hr3':[76, 87, 77, 59]}, index=['p1', 'p2', 'p3', 'p4'])
        filtered_df = df[(df.hr1 > 90) | (df.hr3 > 90)]
        filtered_df
        

As you’ll eventually see, filtering DataFrames is a great way to avoid looping through every record to find the values you want (and it is significantly faster).

Filter by Column

Pandas offers a few ways to filter data by column. One of these allows you to select all of the columns you need by name, using the following syntax:

        df[[column_list]]
        

Let’s try it out:

        import pandas as pd
        df = pd.DataFrame({'age':[29, 55, 65, 18], 'gender':['male', 'female', 'male', 'female'], 'hr1':[98.1, 78, 65, 64], 'hr2':[110, 120, 129, 141], 'hr3':[76, 87, 77, 59]}, index=['p1', 'p2', 'p3', 'p4'])
        filtered_df = df[['age', 'gender']]  # Notice that we need an extra set of brackets '[]' which indicates we're inputting a list
        filtered_df
        

DataFrame.filter()

Pandas also has a powerful, built-in .filter() method for DataFrames that allows the use of regular expressions (i.e., 'regex') and the SQL-based 'like' functionality for searching. We could spend way too much time demonstrating these tools, but instead, let’s review a simple example. This first example uses the 'like' operator to return any column where the label contains the letters 'hr':

        import pandas as pd
        df = pd.DataFrame({'age':[29, 55, 65, 18], 'gender':['male', 'female', 'male', 'female'], 'hr1':[98.1, 78, 65, 64], 'hr2':[110, 120, 129, 141], 'hr3':[76, 87, 77, 59]}, index=['p1', 'p2', 'p3', 'p4'])
        filtered_df = df.filter(like='hr', axis=1) # filter based on column labels that include 'hr'
        filtered_df