21.1 Datasets to Explore (optional)
Many interesting datasets are available to explore and visualize, and you may want to practice with some of those listed below. Some cover fun and entertaining topics. Others, such as the mass shootings and suicide datasets, are quite serious. I debated whether to include these topics and ultimately decided that one appropriate way to address such grave events is with objective data analysis, so that we can understand these issues without the biases of political or cultural beliefs. As you analyze datasets like these, remember: 1) they may not be perfect or 100 percent complete, and 2) if you share your results, please be sensitive to those who have experience with these topics.
Also, keep in mind that most of these datasets are regularly updated at their sources, so the current source data may not exactly match the linked data files below.
AirBnB listings: 20,025 rental listings (source: Insideairbnb.com; data file: Air BnB Listings)
Airline customer satisfaction: 129,880 ratings (source: Kaggle.com)
Auto accidents: 252,500 accidents (source: Utah.gov)
Gas mileage: 392 automobiles (source: Kaggle.com)
Bike buyers: 1,000 random customers (source: Kaggle.com). This is a snapshot of the classic Microsoft AdventureWorks database that came with the original ML add-in for Excel; it is no longer published by Microsoft but can be found on Kaggle. Data file: Bike Buyers (with numeric versions included) CSV
Credit card fraud: 100,000 transactions (source: Kaggle.com)
Crime statistics: 2,688 state annual reports (source: Kaggle.com)
Disney movie revenue: 513 movies (source: Kaggle.com)
Fake news: 12,791 claims evaluated for truthfulness (source: Paperswithcode.com)
Health app reviews: 274 reviews across 137 apps (source: University of Michigan)
Heart attacks: 8,763 patients (source: Kaggle.com)
Home sales: 1,460 homes sold (source: Kaggle.com)
IMDB movie ratings: 1,000 movies (source: Kaggle.com; data file: imdb.csv)
Lending Club loans: 10,476 loans (source: Kaggle.com; data file: LendingClub (small))
Mass shootings: 128 cases (source: Kaggle.com)
Mental health of tech workers: 1,259 surveys (source: Kaggle.com)
NBA games: 121,107 unique player-game records from 2020-2022 (source: Sportsdata.io)
NFL Plays: 35,267 plays from the 2022 season, including the number of yards gained
Netflix Titles: 8,801 movie and series titles available on the platform (source: https://www.kaggle.com/datasets/shailajakodag1/netflix-titlescsv). Found elsewhere in the book: 19.2
Network traffic (hacking attempts): 125,973 packets of network traffic labeled as "normal" or by the name of a known attack type. Great for multiclass classification models.
Personality and culture: 432 students (generated by the author; available in 10.4)
Pokemon: 781 characters (source: Kaggle.com; data file: pokemon.csv)
Spotify music: 232,726 songs (source: Spotify.com)
Student mental health: 101 students (source: Kaggle.com)
Suicide statistics (1950-2021): 9,504 (source: Kaggle.com)
Tweets about Covid and autism: 24,037 tweets (generated by the author; data file: Twitter "AWS" Tweets)
USA school shootings 1840 to 2023 (partial): 1,818 cases (source: Kaggle.com)
Video game sales: 16,599 games (source: Kaggle.com)
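Because these datasets may not be perfect or complete, a quick count of missing values per column is a reasonable first step before any analysis. The following is a minimal sketch using only Python's standard library. The function name `count_missing`, the list of missing-value markers, and the sample rows are all hypothetical placeholders chosen for illustration; they do not come from any of the files listed above.

```python
import csv
import io


def count_missing(csv_file, missing_markers=("", "NA", "N/A", "null")):
    """Return (row_count, {column: missing_count}) for an open CSV file.

    The default markers are an assumption; real files may flag
    missing values differently.
    """
    reader = csv.DictReader(csv_file)
    missing = {name: 0 for name in (reader.fieldnames or [])}
    rows = 0
    for row in reader:
        rows += 1
        for name in missing:
            value = row.get(name)
            # A short row yields None; otherwise compare the trimmed text.
            if value is None or value.strip() in missing_markers:
                missing[name] += 1
    return rows, missing


# Hypothetical placeholder rows, not taken from any dataset listed above.
sample = io.StringIO(
    "title,year,rating\n"
    "Movie A,2001,7.9\n"
    "Movie B,,8.1\n"
    "Movie C,2010,NA\n"
)
rows, missing = count_missing(sample)
print(rows, missing)
```

To try this on a downloaded file, assuming it has been saved locally under a name such as imdb.csv, open it with `open("imdb.csv", newline="", encoding="utf-8")` and pass the file object to `count_missing`. Comparing the row count with the figure in the list above also shows whether the source has been updated since the list was written.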