UCI Machine Learning

Part 2 – Run an exercise on the imports-85 dataset from imports-85.csv (note again that we are NOT using the credit approval or the vertebral column dataset this week), completing this report and providing the commands, output screenshots, and discussion/interpretation as requested. Ensure that all commands are saved in this report AND in an R script.

For Reference: UCI Machine Learning Repository: Imports 85

a. Introduction:

i. Identify the dependent variable and independent variables in the imports-85 dataset.

ii. Based on what you have learned this week about multiple linear regression, provide a one-paragraph, master's-level response describing what you anticipate the lm algorithm will accomplish for the imports-85 data. Be specific about the behavior and structure of the multiple linear regression model.

b. Data Pre-Processing: Load the imports-85 data into R Studio using the read.csv command (do not use File > Import Dataset > From CSV in the R Studio GUI, as this uses read_csv(), which results in significantly different variable types!).

i. Run the commands to remove the following variables: engine_type, make, num_of_cylinders, fuel_system. Include the commands and output screenshot.

Command(s): >

Output:

ii. What additional data pre-processing (if any) does the lm() method require for the imports-85 data? Include the commands you ran and the output screenshot.

Command(s): >

Output:
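As a hedged illustration of the kind of pre-processing this section asks about, the sketch below works on a small made-up data frame rather than the real file. Everything in it is an assumption except the four variable names to drop: the other column names, the values, and the use of "?" as a missing-value code are all hypothetical, so check them against your own imports-85.csv.

```r
# Toy stand-in for the imports-85 data. Only the four dropped column names
# come from the assignment; the other columns and all values are made up.
cars <- data.frame(
  make             = c("audi", "bmw", "honda", "mazda"),
  engine_type      = c("ohc", "ohc", "ohc", "rotor"),
  num_of_cylinders = c("four", "six", "four", "two"),
  fuel_system      = c("mpfi", "mpfi", "1bbl", "4bbl"),
  horsepower       = c("102", "121", "?", "101"),
  price            = c(13950, 16430, 6479, 10945),
  stringsAsFactors = FALSE
)

# Drop the four variables by name.
drop_cols <- c("engine_type", "make", "num_of_cylinders", "fuel_system")
cars <- cars[, !(names(cars) %in% drop_cols), drop = FALSE]

# Assumption: missing values are coded as "?". Recode them as NA and
# convert the column from character to numeric.
cars$horsepower[cars$horsepower == "?"] <- NA
cars$horsepower <- as.numeric(cars$horsepower)

# Remove incomplete rows explicitly so the remaining row count is visible.
cars_clean <- na.omit(cars)
str(cars_clean)
```

If the file does use "?" for missing values, passing na.strings = "?" to read.csv achieves the same recoding at load time.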

c. Multiple Linear Regression – Running the Method with Training Data:

i. Run ‘set.seed(12345)’ and then split the data into a training set consisting of 70% of the instances and a test set containing the remaining 30% of the instances. Include the commands below.

Commands: >
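One common way to make a reproducible 70/30 split is to sample row indices after setting the seed. The sketch below uses R's built-in mtcars data purely as a stand-in for imports-85; the object names train_idx, train, and test are illustrative choices, not names required by the assignment.

```r
# Stand-in data: the built-in mtcars data frame (NOT the imports-85 data).
set.seed(12345)

n <- nrow(mtcars)

# Sample 70% of the row indices without replacement for the training set.
train_idx <- sample(seq_len(n), size = floor(0.7 * n))

train <- mtcars[train_idx, ]   # 70% of the rows
test  <- mtcars[-train_idx, ]  # the remaining 30%

c(train = nrow(train), test = nrow(test))
```

Setting the seed immediately before sample() is what makes the split repeatable; any other random call in between would change which rows are drawn.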

ii. Run the lm() function to build the multiple linear regression model, storing the results in a variable called ‘mlr_model’. Include the command you ran and a brief discussion about the default input parameters used.

Command: >

Discussion:

iii. Run the command ‘summary(mlr_model)’. Include the output screenshot and answer the following questions:

Output:

How does the model represent the relationship between dependent and independent variables in the imports-85 dataset?

How does the method handle categorical variables?

What does the residuals section of the output mean?

What are the coefficients and what do they mean?

What is an intercept and what does it mean?

What do the p-values tell us about the significance of each variable?

What is the overall accuracy of the model?
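The quantities these questions refer to can also be read directly from the summary object, which may help when interpreting the printed output. The sketch below is a minimal illustration on the built-in mtcars data, not the imports-85 model; the formula and the name demo_model are assumptions chosen for the example, and factor(cyl) is included only to show how a categorical predictor is expanded into dummy (indicator) coefficients.

```r
# Stand-in data and model (mtcars, NOT imports-85).
set.seed(12345)
train_idx <- sample(seq_len(nrow(mtcars)), size = floor(0.7 * nrow(mtcars)))
train <- mtcars[train_idx, ]

# factor(cyl) makes lm() treat cylinder count as categorical: one level
# becomes the baseline and each other level gets its own coefficient.
demo_model <- lm(mpg ~ wt + hp + factor(cyl), data = train)
demo_summary <- summary(demo_model)

# Coefficient table: Estimate, Std. Error, t value, Pr(>|t|).
demo_summary$coefficients

# Five-number view of the residuals (what the "Residuals" section prints).
quantile(residuals(demo_model))

# Overall fit measures.
demo_summary$r.squared      # proportion of variance explained
demo_summary$adj.r.squared  # R-squared penalized for the number of predictors
demo_summary$sigma          # residual standard error
```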

d. Multiple Linear Regression – Evaluate the Model with Test Data:

i. Run the command to evaluate the ‘mlr_model’ on the imports-85 test data. Include the command below.

Command: >

ii. Run the command to build the predicted vs. actual (observed) value scatter plot. Add a diagonal line to this plot. Include the commands and the final plot with the diagonal line below.

Commands: >

Output:

iii. What does the distance between points and the diagonal line tell us about the accuracy of the prediction?
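A hedged sketch of this evaluation step, again on the built-in mtcars data rather than imports-85: predict() scores the held-out rows, the scatter plot compares predictions with observed values, and abline(a = 0, b = 1) draws the line on which a perfect prediction would fall. The RMSE and correlation lines are optional extras, not something the assignment asks for, and all object names are illustrative.

```r
# Stand-in data and model (mtcars, NOT imports-85).
set.seed(12345)
train_idx <- sample(seq_len(nrow(mtcars)), size = floor(0.7 * nrow(mtcars)))
train <- mtcars[train_idx, ]
test  <- mtcars[-train_idx, ]

demo_model <- lm(mpg ~ wt + hp, data = train)

# Predict the response for the held-out test rows.
pred <- predict(demo_model, newdata = test)

# Predicted vs. actual scatter plot with the y = x reference line.
plot(test$mpg, pred,
     xlab = "Actual mpg", ylab = "Predicted mpg",
     main = "Predicted vs. actual (test set)")
abline(a = 0, b = 1, col = "red")

# Optional numeric summaries of prediction quality.
rmse <- sqrt(mean((test$mpg - pred)^2))  # typical size of a prediction error
r    <- cor(test$mpg, pred)              # linear agreement between the two
c(rmse = rmse, correlation = r)
```

The vertical distance from each point to the diagonal is that observation's prediction error, so a tight cloud around the line corresponds to a small RMSE.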

e. Multiple Linear Regression – Residual Plots:

i. Run the ‘plot(mlr_model)’ command to build the residual plots. Interpret at least one of the plots. Include the command, the plot, and the interpretation of that plot below.

Command: >

Output:

Interpretation:
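By default, plot() on an lm object produces four diagnostic plots in sequence (Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage). A small convenience, shown here on a stand-in mtcars model rather than the imports-85 model, is to arrange them on one 2 x 2 page so that all four fit in a single screenshot.

```r
# Stand-in model (mtcars, NOT imports-85).
demo_model <- lm(mpg ~ wt + hp, data = mtcars)

# Show the four default diagnostic plots on a single 2 x 2 page,
# then restore the previous graphics settings.
old_par <- par(mfrow = c(2, 2))
plot(demo_model)
par(old_par)
```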

f. Multiple Linear Regression – Minimum Adequate Model:

i. What is the minimum adequate model? Why do we build it? Provide a one-paragraph, master's-level response.

ii. Run the command to build the minimum adequate model and store the model in a variable named ‘mlr_model_min’. Include the command and output screenshot.

Command: >

Output:
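One widely used way to arrive at a reduced model is backward stepwise selection with step(), which removes terms while doing so lowers the AIC. Whether this is the procedure intended here is an assumption (the course may instead expect terms to be removed manually by p-value), so treat the sketch below, run on the built-in mtcars data with illustrative names, as one option rather than the required method.

```r
# Stand-in data and full model (mtcars, NOT imports-85).
full_model <- lm(mpg ~ wt + hp + qsec + drat + disp, data = mtcars)

# Backward elimination by AIC; trace = 0 suppresses the step-by-step log.
min_model <- step(full_model, direction = "backward", trace = 0)

# Terms that survived the selection.
formula(min_model)

# Compare the size of the two models.
c(full = length(coef(full_model)), reduced = length(coef(min_model)))
```

Leaving trace at its default prints each elimination step, which is useful when the question asks which variables were removed.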

iii. Run the ‘summary(mlr_model_min)’ command. Include the command, output screenshot, and answers to the following questions:

Command: >

Output:

Which variables were eliminated and which variables remain?

What are the coefficients and the intercept? What do the coefficients and intercept mean?

Compare the prediction accuracy of the minimum adequate model with the prediction accuracy of the original model. Provide a one-paragraph, master's-level response.

g. New Instance:

i. Suppose that we have a new car added to the imports-85 dataset. We know the values of the independent variables. How would you use the model to predict the value of the dependent variable for the new car? (Hint: Use the lessons learned and hints from the prior week to complete this exercise.) Include the command you would run below:

Command: >
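The general pattern for scoring a new observation is to put its predictor values in a one-row data frame whose column names match the model's predictors and pass it to predict() as newdata. The sketch below shows the pattern on a stand-in mtcars model; the predictor values for new_car are made up for illustration.

```r
# Stand-in model (mtcars, NOT imports-85).
demo_model <- lm(mpg ~ wt + hp, data = mtcars)

# One-row data frame for the new car; the column names must match the
# predictors in the model formula. The values here are hypothetical.
new_car <- data.frame(wt = 2.8, hp = 120)

# Point prediction for the new car.
new_pred <- predict(demo_model, newdata = new_car)
new_pred

# Same prediction with a 95% prediction interval (columns fit, lwr, upr).
new_interval <- predict(demo_model, newdata = new_car, interval = "prediction")
new_interval
```

If the model includes factor predictors, the new row must use levels that were present in the training data, otherwise predict() stops with an error.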

h. Summary:

i. Is the multiple linear regression method appropriate for predicting the values of dependent variables in the imports-85 dataset? Explain why or why not. Provide a one-paragraph, master's-level response.

ii. (Not graded) Which part of this exercise did you find the most challenging and what steps did you take to resolve the challenge?
