Reducing the number of records

If your goal is to reduce the size of your dataset, you can take a simple random sample of its records. In a simple random sample, each record has an equal likelihood of being selected.

This Jupyter notebook shows how to sample a dataset in Python.
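For readers without the notebook at hand, the idea can be sketched in a few lines of plain Python. This is a minimal illustration, not the notebook's code: the function name `sample_csv` and its parameters are hypothetical, and it assumes the file has a single header row and fits in memory.

```python
import csv
import random


def sample_csv(in_path, out_path, n, seed=None):
    """Write a simple random sample of n data rows from in_path to out_path.

    Hypothetical helper for illustration; assumes one header row and a file
    small enough to load into memory.
    """
    with open(in_path, newline="") as src:
        reader = csv.reader(src)
        header = next(reader)  # keep the header row out of the sample
        rows = list(reader)    # load all remaining records

    if n > len(rows):
        raise ValueError(f"cannot sample {n} rows from {len(rows)} records")

    # random.sample draws without replacement, so every record has the
    # same chance of being selected and no record appears twice.
    rng = random.Random(seed)
    sampled = rng.sample(rows, n)

    with open(out_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        writer.writerow(header)
        writer.writerows(sampled)
    return len(sampled)
```

Under these assumptions, a call such as `sample_csv("brooklyn_homes.csv", "Brooklyn10k.csv", 10_000, seed=42)` would write a 10,000-record sample to a new file. Passing a fixed `seed` makes the sample reproducible; omit it to get a different sample on each run.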

The following video shows how to do this in JMP.

The following video shows how to do this in R.

You can also do this with a simple R script. An example is shown in the image below. In this example, R reads in a large data file ("brooklyn_homes.csv") that contains 24,209 records. It then creates a sample of 10,000 records and writes the sample out to a new file ("Brooklyn10k.csv").

Figure 23.1: R script for sampling data