Develop a better understanding of the
HBAT dataset, specifically by exploring the characteristics of its customers and the
relationship between their perceptions of HBAT and their actions towards HBAT.
Make sure to address all the following questions.
- [2.5 points]. Run a thorough univariate and bivariate graphical and statistical
 examination of your data. Do you notice any irregularities in your data? What does your
 data look like? Normal, skewed?
 Tips. Make sure to number and label each table and graph (e.g., Table 1. Summary
 statistics for the HBAT missing dataset).
 Provide a title and a detailed interpretation of each chart.
 Remember the rule of thumb: for any table and/or graph, you can have at most 3
 genuine findings. If you have more, you have probably made them up.
- [2.0 points]. Missing values analysis. Do you have any missing values in your data? If
 yes, determine the extent of missing values per variable and case. Are there any
 variables/cases that you need to delete? Use ≥ 30% of missing values as the threshold
 for deletion.
 a. After deleting variables and/or cases with ≥ 30% missing data, construct
 summary statistics for your data. Do you still observe any missing values? If yes,
 decide how to impute these missing values. Limit your imputation
 technique to mean or median substitution. Justify your choice.
- [2.0 points]. Detection and treatment of outliers.
 Math and Statistics for Analytics Nov 2020
 Are there any univariate outliers in your dataset? Use both Tukey's fences and the
 z-score approach (with the z threshold set at 2.5, since you have a small sample size) to
 identify them. Do you notice any discrepancy between the two methods? Explain.
 How many values were detected as outliers? Will you keep these outliers or delete
 them? Justify your decision. Discuss the impact of your decision on the remaining data
 analysis.
- [2.5 points]. After treating missing values and outliers, construct summary statistics for
 your data. Compare and contrast your results with question 1. Develop two
 hypothetical questions that you can answer using graphical and/or empirical
 techniques. Provide correct answers.
- [1.0 point]. If you did not treat missing values and/or outliers, what
 would be the impact on subsequent data analysis?
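 For the mean-or-median choice in the missing values question, a minimal Python sketch may help. The function name `impute` and the sample ratings are hypothetical and are not taken from the HBAT data; missing cells are assumed to be represented as `None`.

```python
import statistics

def impute(values, method="median"):
    """Replace None entries with the mean or median of the observed values."""
    observed = [v for v in values if v is not None]
    if method == "median":
        fill = statistics.median(observed)
    elif method == "mean":
        fill = statistics.mean(observed)
    else:
        raise ValueError("method must be 'mean' or 'median'")
    return [fill if v is None else v for v in values]

# Hypothetical, right-skewed ratings with one missing cell (not HBAT data).
ratings = [2.0, 3.0, None, 4.0, 21.0]

median_filled = impute(ratings, method="median")
mean_filled = impute(ratings, method="mean")
print(median_filled)
print(mean_filled)
```

 In this made-up sample the single large value pulls the mean well above the median, so the two methods fill the gap with noticeably different numbers. That sensitivity to skew and extreme values is the usual argument for preferring the median when a variable is not roughly symmetric.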
 Dataset: HBAT Industries Dataset – HBAT missing (you can access the dataset in Additional
 Documentation / Assignment 1).
 Context: HBAT is a manufacturer of paper products. The dataset is hypothetical and based on
 surveys of HBAT customers completed on a secure website managed by an established marketing
 research company.
 Sample size: 70 observations on 14 separate variables based on a market segmentation study
 of HBAT customers in the newsprint industry and the magazine industry.
 Categories of data:
 • Numerical variables: V1 to V9.
 • Categorical variables: V10 to V14.
 Additional information related to the variables is available in the Excel file (HBAT missing) in
 the Metadata spreadsheet.
 Prerequisite: before working on this assignment, you need to watch the series of videos Data
 examination – Excel (Campus Online / Additional Documentation / Module 2 / Recorded
 Videos).
 For detecting outliers, you are already familiar with the boxplot method as well as Tukey's
 fences (video). You can also use the z-score approach: to calculate the z value for each
 observation, you can use Excel's built-in function STANDARDIZE.
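 The two outlier rules described above can be sketched in Python as follows. This is an illustration under stated assumptions, not a prescribed method: the function names and sample values are hypothetical, the quartiles use the standard library's "inclusive" interpolation (the quartile convention used in the course videos may differ), the fence multiplier is the conventional 1.5, and the z score uses the sample standard deviation.

```python
import statistics

def tukey_outliers(values, k=1.5):
    """Return values outside Tukey's fences [Q1 - k*IQR, Q3 + k*IQR]."""
    # Assumption: "inclusive" quartiles (linear interpolation, min and max included).
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

def zscore_outliers(values, threshold=2.5):
    """Return values whose absolute z score exceeds the threshold."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # assumption: sample standard deviation
    return [v for v in values if abs((v - mean) / sd) > threshold]

# Hypothetical ratings (not HBAT data) with one unusually high value.
ratings = [4.0, 4.2, 4.5, 4.7, 5.0, 5.1, 5.3, 5.5, 5.8, 8.0]

print("Tukey:", tukey_outliers(ratings))
print("z score:", zscore_outliers(ratings))
```

 With these made-up values, Tukey's fences flag 8.0 while the z-score rule at 2.5 does not. The extreme value inflates the mean and standard deviation that the z score depends on, whereas the quartiles are barely affected, which is one common reason the two methods disagree.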
 For missing value analysis, you can use the COUNT function to count the number of cells
 in a range that contain numbers, and use it to determine the extent of missing
 values per case and per variable.
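 The same per-variable and per-case counting can be sketched outside Excel. The Python example below is a minimal illustration, assuming missing cells are represented as `None`; the function names, the `THRESHOLD` constant, and the small table are hypothetical and not taken from the HBAT file.

```python
THRESHOLD = 0.30  # deletion threshold from the assignment: >= 30% missing

def missing_share_per_variable(rows):
    """Share of missing cells in each column (variable)."""
    n_rows = len(rows)
    n_cols = len(rows[0])
    return [sum(row[j] is None for row in rows) / n_rows for j in range(n_cols)]

def missing_share_per_case(rows):
    """Share of missing cells in each row (case)."""
    return [sum(v is None for v in row) / len(row) for row in rows]

# Hypothetical table: 5 cases by 4 variables, None marks a missing cell.
rows = [
    [5.1, 3.0, None, 7.2],
    [4.8, None, None, 6.9],
    [None, None, None, 7.0],
    [5.5, 3.4, 2.1, 7.4],
    [5.0, 3.1, None, 6.8],
]

var_share = missing_share_per_variable(rows)
case_share = missing_share_per_case(rows)

# Zero-based positions of variables and cases at or above the threshold.
variables_to_drop = [j for j, s in enumerate(var_share) if s >= THRESHOLD]
cases_to_drop = [i for i, s in enumerate(case_share) if s >= THRESHOLD]

print("Missing share per variable:", var_share)
print("Missing share per case:", case_share)
print("Variables to drop:", variables_to_drop)
print("Cases to drop:", cases_to_drop)
```

 Note that deleting variables changes the share of missing values in each remaining case, so the per-case shares are worth recomputing after any variable is removed.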
 Additional resources:
 Introduction to data analysis in Excel: https://www.youtube.com/watch?v=Rs4082ewxgA
 Introduction to Pivot tables: https://www.youtube.com/watch?v=9NUjHBNWe9M
Sample Solution