Dates and Times

A common early step in data cleaning is parsing date and time values into numerical or standardized categorical formats for use in models. For example, a date like "11/11/2011" cannot be directly used in a predictive model because it is stored as a string due to the "/" symbols. While categorical values can be useful, they must adhere to the 5% rule, which states that each category should represent at least five percent of the dataset. If a category is too rare, it may need to be dropped or merged. Date/time values, however, are often too unique to be treated as categorical data.

One way to handle date/time data is by extracting numeric and standardized categorical features, such as weekday (Sunday through Saturday), year, month, day (of the month), hour, minute, or second. These derived features provide structured information that can be effectively used in modeling.

Another approach is to calculate time spans. For example, if we want to analyze whether the length of employment affects annual performance evaluation scores, we can compute the number of days or months between an employee’s start date and today’s date. Similarly, if we are predicting a customer’s product rating on Amazon, we might calculate the number of hours between the order date and the received date.

The function below demonstrates how to parse date values:

      def parse_date(df, features=[], days_since_today=False, drop_date=True, messages=True):
        import pandas as pd
        from datetime import datetime as dt
        
        for feat in features:
          if feat in df.columns:
            df[feat] = pd.to_datetime(df[feat])
            df[f'{feat}_year'] = df[feat].dt.year
            df[f'{feat}_month'] = df[feat].dt.month
            df[f'{feat}_day'] = df[feat].dt.day
            df[f'{feat}_weekday'] = df[feat].dt.day_name()
          
            if days_since_today:
              df[f'{feat}_days_until_today'] = (dt.today() - df[feat]).dt.days
            if drop_date:
              df.drop(columns=[feat], inplace=True)
          else:
            if messages:
              print(f'{feat} does not exist in the DataFrame provided. No work performed.')
        
        return df
      

This function takes a DataFrame and a list of column names containing date values. It iterates through the provided columns, first checking if they exist in the DataFrame to prevent runtime errors. If a column is found, it is converted to a Pandas datetime format using pd.to_datetime(). The function then extracts key components such as year, month, day, and weekday, storing them in new columns. If the days_since_today parameter is enabled, an additional column calculates the number of days between today's date and the given date. The function also includes an option to suppress messages for non-existent columns.

Next, let's test this function using the Airbnb dataset. We'll include one valid column name (last_review) and one that does not exist (doesnt exist) to check the function's error handling.

      df_airbnb = parse_date(df_airbnb, days_since_today=True, features=['last_review', 'doesnt exist'])
      df_airbnb.head()
      
      # Output:
      # See the output in your own code
      

Since the modified dataset contains multiple new columns, the output is not displayed here. However, in your own results, scroll to the far right of the DataFrame to locate the newly added columns: last_review_year, last_review_month, last_review_day, last_review_weekday, and last_review_days_until_today.