Final Project

Introduction

Project Purpose. The purpose of this project is to allow you to deepen your skills and demonstrate each of the techniques you've learned in this course in a relevant context.

Group Size. Groups should consist of 2 to 3 people. Group sizes smaller or larger than that must be approved by the instructors. I sometimes allow an individual to do a project because the person is working on a proprietary dataset or has a very specific interest. I sometimes allow four people to work as a group if the project is more complex or requires more effort for data collection or processing.

Assume that you are data mining analysts turning your report in to your supervisor (me), who is the chief data scientist at the organization that we work for.

Project Selection Criteria

Your project should discover and articulate additional knowledge about the chosen problem domain. You should begin by identifying a predictive modeling problem you'd like to solve and select a dataset that you can analyze to provide decision support for that problem.

Consider the following criteria to help you select an appropriate project.

  • Choose an interesting topic. Part of the grade for your report and presentation will be based on whether it is interesting, so choose a topic that you find interesting and that will also interest the class and the professor.

  • Project Scope. Your project should be of sufficient size as to be non-trivial but not so big as to be overwhelming. Its scope should be commensurate with the number of group members to allow all members of your group to make a meaningful contribution.

  • It must be a predictive problem. Since the project is intended to enable you to demonstrate and extend your knowledge of predictive data mining principles and tools, your outcome variable(s) must involve a categorical prediction, a continuous numeric prediction, or both. You must therefore have training data that includes one or more outcome variables. In addition, it must have adequate training data so that the machine learning algorithms can learn. For example, if you are interested in predicting which drivers will and will not have automobile accidents, your dataset must include records of people who did and did not have accidents. If your data included only records on drivers who had accidents, you could only do descriptive statistics, so it would not meet the requirements for the research project.

  • You should have some good predictive information in multiple input variables. It is impossible to do a good predictive project if you do not have input variables that contain good predictive information. For categorical prediction, this means that your prediction error percent should be substantially lower than the base rate and that AUC should be significantly higher than random (0.5). For numeric prediction, this means that on the validation or test partitions, your predictive model should produce reasonable MAE, RMSE, and R2 values. Moreover, your project should include more than one or two input variables because projects with only one or two input variables are trivial.

Do not waste a lot of time on a project that is not a good prediction problem. It is impossible to make a project interesting if there are no meaningful findings. The best way to avoid this is to do early testing to see whether you are getting at least reasonably good prediction (a minimal sketch of such a check follows). If not, you can redirect your time to a more appropriate project before you invest a lot of effort in a dubious one.
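Below is one minimal early sanity check in Python with scikit-learn. The file path and the "outcome" column name are placeholders, and it assumes a 0/1 binary outcome with missing values already resolved; compare the model's validation accuracy and AUC against the base rate.

  import pandas as pd
  from sklearn.model_selection import train_test_split
  from sklearn.linear_model import LogisticRegression
  from sklearn.metrics import roc_auc_score

  df = pd.read_csv("your_data.csv")                 # placeholder path
  y = df["outcome"]                                 # placeholder 0/1 outcome column
  X = pd.get_dummies(df.drop(columns=["outcome"]))  # quick-and-dirty encoding

  X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=1)

  # Accuracy of always guessing the majority class -- the bar to beat.
  base_rate = max(y_val.mean(), 1 - y_val.mean())

  model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
  print(f"Base rate (majority class): {base_rate:.3f}")
  print(f"Validation accuracy:        {model.score(X_val, y_val):.3f}")
  print(f"Validation AUC:             {roc_auc_score(y_val, model.predict_proba(X_val)[:, 1]):.3f}")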

Sources of Data

You are encouraged to identify your own dataset. You may not use datasets that are used as examples in this class. There are a variety of possible sources of data.

Your current or past job or internship can be a source of data. You can also do projects with organizations for which you have done volunteer work. Organizations may have questions they would like answered and datasets upon which you can conduct your analysis. I will keep such data confidential and not share it with anyone unless you explicitly state that it is sharable.

It can also be a dataset you find online or data that you harvest from a website or an online database. For example, in the past, some students have done website screen scraping to get information from online web resources. Interesting projects come from interesting problems and data, so find a project you like.

Existing data mining datasets are available from Kaggle, the UC Irvine Machine Learning Repository, and other sources; you may also be able to find additional sources online.

High-Level Requirements

  1. Your project documentation should include

    • Project Report. This will be in the form of an electronic PDF file; no need to print anything. The format is "as-short-as-you-can-make-it-while-meeting-all-requirements." There are no specific font, margin, or other style guidelines. Just make it as professional-looking and readable as possible.

    • Your slide deck that you use to present your project.

  2. Your analysis should be a 1) continuous numeric prediction, 2) classification prediction, or both. Consult with your instructor if you have questions about whether your project meets this criterion.

  3. Evidence of sufficient model comparison and evaluation should be documented. This includes both 1) variable selection, and 2) algorithm selection.

  4. Your final, cleaned data file(s) and data description.

  5. Create an Azure ML Web Service to allow easy predictions based on input variables and an algorithm you developed. Upload the Excel file that contains the call to that web service to make predictions.

Detailed Requirements

  1. Data Collection and Preparation

    • Collect additional related data. Sometimes the original dataset lacks useful predictors and can be supplemented by collecting additional data. For example, assume you had crime rate data at various locations. You may want to collect additional information about each location that would likely be related to the crime rate: Is it in a business or residential area? Is it in a rich or poor part of town? Adding such data may significantly increase your ability to predict the crime rate.

    • Convert data to a usable form. Sometimes your data comes in formats that are unusable and needs to be cleaned or put into an analyzable form before it can be used. For example, if you scraped the data from a website, remove the extraneous tags.

    • Resolve missing data. Generally speaking, columns and rows with more than 50% of the data missing will likely need to be eliminated altogether. For those with less than 50% missing, decide whether to impute a value from other variables or to use an average or some other value. In some cases you may leave the value blank, which will still allow that record to be analyzed. (A pandas sketch of these rules appears after this list.)

    • Evaluate potential outliers. Determine if the outliers represent errors or valid values and whether they should be included or excluded in your analysis.
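    The following is a minimal pandas sketch of the missing-data and outlier steps above; the file path and the "income" and "region" column names are placeholders.

      import pandas as pd

      df = pd.read_csv("your_data.csv")  # placeholder path

      # Drop columns, then rows, that are more than 50% missing.
      df = df.loc[:, df.isna().mean() <= 0.5]
      df = df.loc[df.isna().mean(axis=1) <= 0.5]

      # Impute what remains: mean for a numeric column, mode for a categorical one.
      df["income"] = df["income"].fillna(df["income"].mean())
      df["region"] = df["region"].fillna(df["region"].mode()[0])

      # Flag potential outliers (e.g., beyond 3 standard deviations) for manual
      # review rather than deleting them automatically.
      z = (df["income"] - df["income"].mean()) / df["income"].std()
      print(df.loc[z.abs() > 3])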

  2. Data Understanding (completed recursively with the Data Preparation stage)

    You can create data visualizations in JMP, Python, Azure ML Studio, Excel, or some combination of these tools.

    • Data Visualizations. Create and document a series of visualizations that describe what insights you can find from descriptive data analyses. Use a combination of histograms, boxplots, scatterplots, etc. There is no minimum number of visualizations required. However, your story--and your documentation--should briefly but adequately indicate why you created each visualization and what you learned from it.

    • Correlation Matrix. Include a correlation matrix for all numeric variables that identifies which input variables are most correlated with the dependent variable and which independent variables are possibly collinear or multicollinear. The correlation matrix should have the following characteristics (a Python sketch at the end of this section shows one way to build it). Note: In this class we did one like this in Python and one in MS Excel, so you could use either of these.

      • Color-saturated to easily visually identify collinear variables and separate negative from positive correlations.

      • The outcome variable should be the first column in the correlation matrix. This is most easily done by changing the order of the input data so the outcome variable is in the first column.

      • The correlation matrix should show each correlation only once. It should not be the mirrored version produced in JMP, in which each correlation is shown twice, above and below the diagonal.

    • Check for linear and non-linear relationships. Use scatterplots and calculate R2 when appropriate to determine linear and curvilinear relationships.

    • Explain categorical variable choices and manipulations. How did you determine which categorical variables to include in and exclude from the model? Did you create meaningful categories from non-categorical data? Did you consolidate some categories into fewer categories?

    • Feature Engineering. If necessary, create useful features from the existing data. For example, you may need to compute ratios of existing variables, such as cost per square foot. Or you may need to extract categories such as Mr., Mrs., Miss, and Master, as we did with the Titanic Survivors data exercise. Or you may need to create other variables from dates, such as dummy variables and time indexes.
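    The following sketches one way to build the correlation matrix described above, in Python with pandas and seaborn; the file path and the "price" outcome column are placeholders.

      import numpy as np
      import pandas as pd
      import matplotlib.pyplot as plt
      import seaborn as sns

      df = pd.read_csv("your_data.csv")  # placeholder path

      # Reorder so the outcome variable is the first column, then correlate
      # the numeric variables.
      cols = ["price"] + [c for c in df.columns if c != "price"]
      corr = df[cols].select_dtypes("number").corr()

      # Mask the upper triangle so each correlation appears only once, and use
      # a diverging palette so positive and negative correlations separate visually.
      mask = np.triu(np.ones_like(corr, dtype=bool))
      sns.heatmap(corr, mask=mask, cmap="coolwarm", center=0, annot=True, fmt=".2f")
      plt.tight_layout()
      plt.show()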

  3. Input Variable Evaluation

    • Determine which input variables are predictive. Demonstrate a comprehensive set of testing to determine which input variables you included and excluded and why.

    • Determine which input variables are invalid. Just because an input variable is predictive does not mean it is appropriate as a predictor. The following types of variables are not valid predictors.

      • Diagnostic tests of the outcome variable. It is inappropriate to use the result of a diagnostic test of an outcome to predict the value of the outcome variable. For example, if a doctor uses a specific diagnostic test to determine whether or not a person has a disease, the result of the diagnostic test determines whether the person has the disease. So to know the result of the test is to know the value of the outcome variable.

      • Y predicting Y. One or more measures of a construct should not be used to predict another measure of the same construct. For example, net income, earnings before income taxes (EBIT), and earnings per share (EPS) are all measures of income, so it is inappropriate to predict one of these measures from another. For example, you should not predict net income from EBIT or predict EPS from net income.

      • Values not known at initiation. The values of some variables are not known until the outcome variable is known. Assume one is trying to predict the number of YouTube likes a video will receive at the time the video is posted. Some attributes would be known at the time a video is posted, such as who the author of the video is, the number of subscribers the author has, and the theme of the video. These would be valid candidate predictors for the stated purpose because they are known before the video is posted. But the number of comments posted in response to a video would only be known later, so it would be an inappropriate predictor of the number of likes.

  4. Evaluate algorithms. Which algorithm produced the best results? Show comparative model quality indicators for numeric prediction (R2, MAE, RMSE, MAPE) or categorical prediction (misclassification rate, recall, precision, F-measure) to demonstrate why you chose a particular algorithm over the others (a sketch of such a comparison follows).
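    For illustration, here is a hedged sketch of comparing several classification algorithms with cross-validation in Python/scikit-learn; for numeric prediction, swap in regressors and scoring such as "r2" and "neg_root_mean_squared_error". The file path and "outcome" column name are placeholders.

      import pandas as pd
      from sklearn.model_selection import cross_validate
      from sklearn.linear_model import LogisticRegression
      from sklearn.tree import DecisionTreeClassifier
      from sklearn.ensemble import RandomForestClassifier

      df = pd.read_csv("prepared_data.csv")             # placeholder path
      y = df["outcome"]                                 # placeholder outcome column
      X = pd.get_dummies(df.drop(columns=["outcome"]))

      models = {
          "Logistic Regression":  LogisticRegression(max_iter=1000),
          "Decision Tree (CART)": DecisionTreeClassifier(max_depth=5),
          "Random Forest":        RandomForestClassifier(n_estimators=200, random_state=1),
      }
      # Report mean cross-validated quality indicators for each algorithm.
      for name, model in models.items():
          scores = cross_validate(model, X, y, cv=5,
                                  scoring=["accuracy", "recall", "precision", "f1"])
          print(name,
                f'acc={scores["test_accuracy"].mean():.3f}',
                f'recall={scores["test_recall"].mean():.3f}',
                f'precision={scores["test_precision"].mean():.3f}',
                f'F1={scores["test_f1"].mean():.3f}')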

  5. Azure Web service-based estimator

    • Create a web service in Azure ML Studio. Download the Excel file that calls the web service to create estimates in the spreadsheet. Include the data column names and sample input data in the Excel file. Most of the time, Azure algorithms give results roughly comparable to the JMP and Python implementations, but there are some exceptions where an Azure implementation of an algorithm simply does not perform as well as another tool. Not to worry. The requirement to build an Azure web service predictor is not to make sure it is as good as can be achieved; it is so you can practice making a predictor. So you can use another algorithm that gives "okay" results; just note in your report that you could not get Azure to perform well with the same algorithm, so you used another. (For reference, a sketch of calling the service programmatically follows.)
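    The Excel file is the required deliverable, but the same web service can also be called programmatically. The following is a hedged Python sketch following the classic Azure ML Studio request/response pattern; the URL, API key, and column names are placeholders, and the exact payload shape should be taken from your own service's API help page.

      import json
      import urllib.request

      url = "https://<region>.services.azureml.net/workspaces/<ws>/services/<id>/execute?api-version=2.0"
      api_key = "<your-api-key>"  # placeholder key from the web service dashboard

      payload = {
          "Inputs": {
              "input1": {
                  "ColumnNames": ["sqft", "bedrooms", "age"],  # placeholder inputs
                  "Values": [["1500", "3", "12"]],
              }
          },
          "GlobalParameters": {},
      }
      req = urllib.request.Request(
          url,
          data=json.dumps(payload).encode("utf-8"),
          headers={"Content-Type": "application/json",
                   "Authorization": "Bearer " + api_key},
      )
      # The response JSON contains the scored label or predicted value.
      print(urllib.request.urlopen(req).read().decode("utf-8"))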

Report Outline

This is a general outline of what to include in your report.

Document every step outlined above in the requirements description. However, do not write more than is necessary. Bullet points are acceptable where appropriate. Only include what is necessary, but do not leave out important content that shows your work.

There is no specific styling format requirement. Just make it a professional, clear business document with a consistent style.

  1. Title Page

    • Descriptive project name
    • Submission date

    • Your group number (e.g., S2G12) that was assigned by the professor

    • First and last name of each member of your group

  2. Project Abstract (about one page long). Summarizes the main research question(s) and gives succinct information on the purpose, methods, results, and conclusions reported.

    • Problem Overview: Define and explain your research question(s), identify the outcome variable and whether it is a continuous numeric value or a categorical value, describe the population you are studying, and explain why your research questions are interesting and worth answering.
    • Summarize your research methods. Briefly summarize the steps you took to solve the problem.

    • Summarize your findings: What did you find?

      • Briefly state the model quality indicators for your ultimate best predictive model(s).

        • For a numeric outcome variable: R2, RMSE/RASE, MAE, MAPE

        • For a categorical outcome variable: recall, precision, F-score, or accuracy/error

      • Which predictive algorithm(s) were best?

      • What did you provide as an explanatory model (e.g., CART, MLR, other)?

  3. Data Description and Preparation

    • Describe the source of the data, including the URL if it was obtained online. If you scraped data, where did you scrape it from, and what tool did you use to scrape it? If you collected data from multiple sources, describe what data was collected from each source and how you combined the data. State when the data was gathered.

    • State the number of variables and number of records.

    • Variables Table. Provide a brief but clear definition of the variables in a table that includes the following columns: attribute name, data type, and description. For categorical variables, if there are just a few categories, list them; if there are many, include a few example categories. Place your outcome variable(s) at the top of the table. If you start with around 12 input variables or fewer, include all of them. If you started with many input variables, state how many you started with, then provide details in the table for just the 8 to 12 most important variables. (An example table appears after this list.)

    • Explain if you excluded some of the data to make the problem more manageable.

    • Was there missing data? If so, explain how you resolved the missing data.

    • Did outliers exist? What did you do to deal with potential outliers? Did you delete the records or use them? Why?

    • Explain any feature engineering or recategorization of data that you did, if any.
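    For illustration, a variables table for a hypothetical home-price project might look like the following (all attribute names are examples only):

      Attribute      Data Type    Description
      price          Numeric      Sale price in USD (outcome variable)
      sqft           Numeric      Finished living area in square feet
      bedrooms       Numeric      Number of bedrooms
      neighborhood   Categorical  Neighborhood name (e.g., Downtown, Riverside)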

  4. Data Understanding. The requirements for this phase are described in detail under Detailed Requirements above. Include the color-coded correlation table and the visualizations, along with a brief description of what they portray.

  5. Input Variable Evaluation and Model Testing.

    • Describe what model quality indicator(s) matters the most to your problem and why.

      • For numeric prediction: R2, RMSE, MAE, MAPE?

      • For categorical prediction: recall, precision, F-measure, accuracy, or error? Also, for a categorical prediction problem, state whether TP or TN is more important, and whether FP or FN is the worse outcome.

    • For your predictive model(s), include tables that show which algorithms you tried with which input variables. Include model quality indicators for the various tests. Highlight the best model(s) that you chose. Use model simplicity as the tiebreaker when model performance indicators are very close. Summarize in a paragraph what the best model is and why you chose it.

    • If your most predictive model is a directly interpretable algorithm (MLR, CART, LogReg), interpret it directly. Conversely, sometimes your best predictive model is not directly interpretable (ANN, KNN, Boosted Trees, Random Forests, SVM). This means that the relationship between the inputs and the output is difficult or impossible to explain. For example, it is possible to interpret the results of an ANN model (confusion matrix, RMSE, etc.), but it is difficult or impossible to explain how specific inputs are related to the outcome because the influence of each input variable is transferred through multiple links, neurons, and transfer functions in the network. When this is the case, also include a descriptive model that is interpretable. How? Describe the relative contributions of the input variables (one approach is sketched after this list).

    • Explain what predictors are the most important. Show the results of the evaluation that determines which input variables matter and how much better the algorithm predicts when important predictors are added.

    • Describe whether you looked for subpopulations in the data and whether you tested to see if clusters improved the results.

    • Economic Analysis: If it is possible to associate a net profit with TPs and a treatment cost with predicted positives (TPs and FPs), perform an economic analysis for the problem. The net profits and treatment costs do not have to be exact. For example, if each TP yields $100 of net profit and each predicted positive costs $5 to treat, expected profit = $100 x TP - $5 x (TP + FP). If you are building a pricing model (e.g., cars, real estate), identify a few of the most underpriced items that you think would be worth investigating.
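    The following is a minimal sketch of describing relative variable contributions for a model that is not directly interpretable, using permutation importance in Python/scikit-learn; a random forest stands in as the example model, and the file path and "outcome" column name are placeholders.

      import pandas as pd
      from sklearn.ensemble import RandomForestClassifier
      from sklearn.inspection import permutation_importance
      from sklearn.model_selection import train_test_split

      df = pd.read_csv("prepared_data.csv")             # placeholder path
      y = df["outcome"]                                 # placeholder outcome column
      X = pd.get_dummies(df.drop(columns=["outcome"]))
      X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=1)

      model = RandomForestClassifier(n_estimators=200, random_state=1).fit(X_train, y_train)

      # Permutation importance: how much validation performance drops when each
      # input column is shuffled; larger drops indicate more important predictors.
      result = permutation_importance(model, X_val, y_val, n_repeats=10, random_state=1)
      for i in result.importances_mean.argsort()[::-1]:
          print(f"{X_val.columns[i]:<25} {result.importances_mean[i]:.4f}")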

  6. Web-service-based Excel estimator. Briefly describe your web-service-based estimator. Is it based on the same algorithm that you found to be best for the overall project? Or did you use another algorithm because the Azure implementation of the model that did best overall for your project did not perform well?

  7. Future Research. What could be done in the future to improve the model? Is there data that could be collected in the future that might improve the results?

Presentation Outline

  1. Title Page. Project name, section and group number, presentation date, and the names of all group members.

  2. Introduction. Introduce the purpose and goals of the project. Summarize background material necessary to understand the presentation.

  3. Source of the data. Describe where and how you obtained the data.

  4. Describe the outcome variable and the most important input variables

  5. Results.

    1. Briefly summarize what you did to conduct the analysis.

    2. Show a summary of the algorithms attempted and the one that you selected as best.

    3. Explain the largest contributing variables to the outcome variable.

  6. Conclusion

  7. Questions