In the Chapter 4 exercise, you used a data set that contained information about cars. For this exercise, we will enhance this data set and then build a linear regression model on it. Complete the following steps:

  1. Open the data set for the Chapter 4 exercise (car data) in your spreadsheet program.

  2. Split your data set's observations in two: a training portion and a scoring portion. Give the training portion 261 observations and put the other 135 observations in the scoring portion. Save these as two separate CSV files with descriptive names.

  3. Import both of your data sets into your RapidMiner repository or connect to them via Read CSV. Add them to a new process and rename them "Training" and "Scoring" so you can tell them apart.

  4. If you did not designate MPG as the label during data import, use a Set Role operator to designate the MPG attribute as the label for the training data. Do not designate MPG as the label in the scoring data.

  5. Add a Linear Regression operator to your training data stream, then use an Apply Model operator to apply the resulting model to your scoring data set.

  6. Run your model. In the Results perspective, examine your attribute coefficients and the predictions for the cars in your scoring data set.

  7. Report your results:

    1. Which attributes have the greatest predictive power?

    2. Were any attributes dropped from the data set as non-predictors? If so, which ones and why do you think they weren't effective predictors?

    3. Compare the predicted MPG values to the actual MPG values in the scoring data set. How close are the predictions? On average, how far off are your model's predictions?

    4. What other attributes do you think would help your model better predict fuel efficiency?

  8. Open the R Console and construct the same linear regression model in R. Examine the model to determine whether any predictor attributes should be dropped. Determine the statistical significance and predictive ability of your model. Apply your model to your cars scoring data. Repeat step 7 above using your results in R.
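
Two pieces of arithmetic in the steps above may be easier to see in code: the 261/135 split from step 2 and the "on average, how far off" question in step 7. The following is a minimal Python sketch, not part of the RapidMiner or R workflow. The function names `split_observations` and `mean_absolute_error` are hypothetical, the rows are placeholders, and the MPG values are made up purely to show the calculation. Shuffling before splitting is also an assumption, since the exercise does not say how to choose which observations go where.

```python
import random


def split_observations(rows, n_training, seed=0):
    """Shuffle a copy of rows, then split it into (training, scoring)."""
    shuffled = list(rows)
    # Assumption: a seeded random shuffle, so the split is repeatable.
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_training], shuffled[n_training:]


def mean_absolute_error(actual, predicted):
    """Average absolute gap between actual and predicted values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Placeholder rows standing in for the 396 car observations (not real data).
rows = [{"id": i} for i in range(396)]
training, scoring = split_observations(rows, n_training=261)
print(len(training), len(scoring))  # 261 135

# Made-up MPG values, used only to illustrate the calculation.
actual_mpg = [20.0, 30.0, 25.0]
predicted_mpg = [22.0, 27.0, 26.0]
print(mean_absolute_error(actual_mpg, predicted_mpg))  # 2.0
```

With your own results, the two lists passed to `mean_absolute_error` would be the actual MPG column and the prediction column from your scoring data. The result is in miles per gallon, which gives a direct answer to how far off the model is on average.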