Datasets to Explore (optional)

Many interesting datasets are available to explore and visualize, and you may want to practice with some of those listed below. Some cover fun and entertaining topics; others, such as the mass shootings and suicide datasets, are quite serious. I debated whether to include these topics and ultimately decided that one appropriate way to address such grave events is objective data analysis, which lets us understand these issues without the biases of political or cultural beliefs. As you analyze datasets like these, remember: 1) these datasets may not be perfect or 100 percent complete, and 2) if you share your results, please be sensitive to those who have experience with these topics.

Also, keep in mind that most of these datasets are regularly updated at their sources, so the current source data may not exactly match the linked data files below.

  • Airbnb listings: 20,025 rental listings (source: Insideairbnb.com)

    Air BnB Listings
  • Airline customer satisfaction: 129,880 ratings (source: Kaggle.com)

  • Auto accidents: 252,500 accidents (source: Utah.gov)

  • Gas mileage: 392 automobiles (source: Kaggle.com)

  • Bike buyers: 1,000 random customers (source: Kaggle.com; this is a snapshot of the classic Microsoft AdventureWorks database that came with the original ML add-in for Excel. It is no longer published by Microsoft but can be found on Kaggle)

    Bike Buyers (with numeric versions included) CSV
  • Credit card fraud: 100,000 transactions (source: Kaggle.com)

  • Crime statistics: 2,688 state annual reports (source: Kaggle.com)

  • Disney movie revenue: 513 movies (source: Kaggle.com)

  • Fake news: 12,791 claims evaluated for truthfulness (source: Paperswithcode.com)

  • Health app reviews: 274 reviews across 137 apps (source: University of Michigan)

  • Heart attacks: 8,763 patients (source: Kaggle.com)

  • Home sales: 1,460 homes sold (source: Kaggle.com)

  • IMDB movie ratings: 1,000 movies (source: Kaggle.com)

  • Lending Club loans: 10,476 loans (source: Kaggle.com)

    LendingClub (small)
  • Mass shootings: 128 cases (source: Kaggle.com)

  • Mental health of tech workers: 1,259 surveys (source: Kaggle.com)

  • NBA games: 121,107 unique player-game combinations from 2020 to 2022 (source: Sportsdata.io)

  • NFL plays: 35,267 plays from the 2022 season, including the number of yards gained

  • Netflix titles: 8,801 movie and series titles available on the platform (source: https://www.kaggle.com/datasets/shailajakodag1/netflix-titlescsv; found elsewhere in the book: 19.2)

  • Network traffic (hacking attempts): 125,973 packets of network traffic, each labeled as "normal" or by the name of a known attack type. Great for multiclass classification models.

  • Personality and culture: 432 students (generated by the author)

    available in 10.4
  • Pokemon: 781 characters (source: Kaggle.com)

  • Spotify music: 232,726 songs (source: Spotify.com)

  • Student mental health: 101 students (source: Kaggle.com)

  • Suicide statistics (1950-2021): 9,504 (source: Kaggle.com)

  • Tweets about Covid and autism: 24,037 tweets (generated by the author)

    Twitter "AWS" Tweets
  • USA school shootings 1840 to 2023 (partial): 1,818 cases (source: Kaggle.com)

  • Video game sales: 16,599 games (source: Kaggle.com)
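
Because the sources are updated regularly, the row counts listed above may not match a file you download later. One quick sanity check is to count the rows in your copy and compare the result with the listed figure. The following is a minimal sketch using only Python's standard library; the function names and the tiny in-memory sample are hypothetical stand-ins for illustration, not part of any dataset above.

```python
import csv
import io


def count_rows(csv_file):
    """Count the data rows in an open CSV file, excluding the header row."""
    reader = csv.reader(csv_file)
    next(reader, None)  # skip the header row, if there is one
    return sum(1 for row in reader if row)  # ignore completely blank lines


def matches_listing(csv_file, listed_count):
    """Return (actual_count, whether it equals the listed count)."""
    actual = count_rows(csv_file)
    return actual, actual == listed_count


# Hypothetical in-memory stand-in for a downloaded data file.
sample = io.StringIO("name,type\nBulbasaur,Grass\nCharmander,Fire\nSquirtle,Water\n")
actual, ok = matches_listing(sample, listed_count=3)
print(actual, ok)
```

To run the same check against a real download, open the file with `open(path, newline="")` and pass the file object to `matches_listing` along with the count from the list. A mismatch does not necessarily mean the file is broken; it may only mean the source has been updated since the count was recorded.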