You work for a marketing company that holds a demographic dataset used for audience analytics. Along with
this dataset, you receive a document listing all of the cities in which the company operates. Before you can
begin your analysis, you first need to collect and process the city data so that it can be joined with the
demographic data and support a proper analysis for your manager. You also want to follow best practices so
that your colleagues can reuse your city dataset and possibly contribute to it in the future.
Data
City: Boston
State: MA
Latitude: 42.3188
Longitude: -71.0846
Population: 4,637,537
Input Date: 2020-02-01
City: Houston
State: TX
Latitude: 29.7869
Longitude: -95.3905
Population: 5446468
Input Date: 2020/02/01
City: Dallas
State: T.X.
Latitude: 32.79
Longitude: -96.7662
Population:
Input Date: 02/01/2020
City: San Francisco
State: CA
Latitude: 37.75
Longitude: -122.443
Population: 3603761
Input Date: 2020-02-01
City: Los Angeles
State: california
Latitude: 34.1139
Longitude: -118.4068
Population: -
Input Date: 2020-02-01
City: Miami
State: FL
Longitude: -80.2102
Population: 6381966
Input Date: 2020-02-01
City: Manhattan
State: ny
Latitude: 40.7834
Longitude: -73.96
Population: 1643734.00
Input Date: 2020-02-01
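The assignment itself is carried out in Google Sheets, but the "Key: Value" layout above can also be read programmatically. The following is a minimal Python sketch, not part of the assignment, that turns such records into one row per city without correcting anything. The names FIELDS, SAMPLE and parse_records are hypothetical, the sample holds only three of the seven records, and the sketch assumes every record begins with a "City:" line.

```python
FIELDS = ["City", "State", "Latitude", "Longitude", "Population", "Input Date"]

# Three records copied verbatim from the Data section (illustrative subset).
SAMPLE = """\
City: Boston
State: MA
Latitude: 42.3188
Longitude: -71.0846
Population: 4,637,537
Input Date: 2020-02-01
City: Dallas
State: T.X.
Latitude: 32.79
Longitude: -96.7662
Population:
Input Date: 02/01/2020
City: Miami
State: FL
Longitude: -80.2102
Population: 6381966
Input Date: 2020-02-01
"""


def parse_records(text):
    """Return one dict per city; values stay as raw, uncorrected strings."""
    records = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # Split on the first colon only, so the value is kept intact.
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "City":
            # Assumption: a "City" line always starts a new record.
            records.append({field: "" for field in FIELDS})
        if not records or key not in FIELDS:
            continue  # ignore stray lines that are not known fields
        records[-1][key] = value
    return records


raw_rows = parse_records(SAMPLE)
for row in raw_rows:
    print(row)
```

Fields that are absent from a record (Miami has no Latitude line) and fields with an empty value (Dallas has no Population) both come out as empty strings, so the raw data set keeps every column for every city and leaves the gaps visible for the later cleaning steps.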
Assignment
Perform the following steps using Google Sheets and provide a write-up in a Google Doc.
- Create a raw data set from the above data.
- Create a tall and wide dataset from your raw data.
a. Explain the benefits and tradeoffs as they pertain to your data.
b. Move forward with your tall dataset.
- Define your values, variables and observations and provide reasoning as to why you have made these decisions.
- Perform the following data cleaning analysis on your data set. Provide your findings (there may be no
findings), and what you will do to correct them.
a. Validity checks
i. Data Types
ii. Ranges
iii. Missing
iv. Unique
v. Membership
vi. Regex
b. Completeness
c. Uniformity
- Create a data dictionary.
- Correct your curated data set to ensure a valid set of data.
- Write a README.
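The validity checks listed above can be illustrated outside of Google Sheets with a short Python sketch. It is one possible reading of the six checks, not the expected solution. The membership set VALID_STATES, the ISO-style date pattern, and the coordinate ranges are assumptions made for the example, and the four sample rows are copied from the Data section with empty strings standing in for absent values.

```python
import re

# Sample of wide rows; values copied from the Data section.
rows = [
    {"City": "Boston", "State": "MA", "Latitude": "42.3188",
     "Longitude": "-71.0846", "Population": "4,637,537", "Input Date": "2020-02-01"},
    {"City": "Dallas", "State": "T.X.", "Latitude": "32.79",
     "Longitude": "-96.7662", "Population": "", "Input Date": "02/01/2020"},
    {"City": "Los Angeles", "State": "california", "Latitude": "34.1139",
     "Longitude": "-118.4068", "Population": "-", "Input Date": "2020-02-01"},
    {"City": "Miami", "State": "FL", "Latitude": "",
     "Longitude": "-80.2102", "Population": "6381966", "Input Date": "2020-02-01"},
]

VALID_STATES = {"MA", "TX", "CA", "FL", "NY"}   # membership set (assumed)
ISO_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")   # regex check (assumed target format)
RANGES = {"Latitude": (-90.0, 90.0), "Longitude": (-180.0, 180.0)}


def is_number(text):
    """Data type check: can the text be read as a number?"""
    try:
        float(text)
        return True
    except ValueError:
        return False


def check_row(row):
    """Return a list of (field, failed check) pairs for one row."""
    findings = []
    for field, value in row.items():          # missing
        if value == "":
            findings.append((field, "missing"))
    for field, (low, high) in RANGES.items():  # data type, then range
        value = row[field]
        if value == "":
            continue
        if not is_number(value):
            findings.append((field, "data type"))
        elif not low <= float(value) <= high:
            findings.append((field, "range"))
    population = row["Population"]             # whole numbers only
    if population != "" and not population.isdigit():
        findings.append(("Population", "data type"))
    if row["State"] not in VALID_STATES:       # membership
        findings.append(("State", "membership"))
    if not ISO_DATE.match(row["Input Date"]):  # regex
        findings.append(("Input Date", "regex"))
    return findings


def duplicate_cities(all_rows):
    """Unique check: return city names that appear more than once."""
    seen, dupes = set(), set()
    for row in all_rows:
        if row["City"] in seen:
            dupes.add(row["City"])
        seen.add(row["City"])
    return sorted(dupes)


report = {row["City"]: check_row(row) for row in rows}
for city, findings in report.items():
    print(city, findings)
print("duplicates:", duplicate_cities(rows))
```

Each finding names the field and the check it failed, which maps directly onto the "findings and corrections" write-up the assignment asks for. The sketch only reports problems and changes nothing.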
Be sure to follow the best practices below (as outlined in lecture) when developing your solutions:
- Consistency
- Named ranges
- Organization
- Naming conventions (files and variables)
- Dates
- Missing data
- Formatting
- Index column
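Several of these practices (dates, missing data, naming conventions, index column) can be sketched in Python as a complement to the Google Sheets work. The lecture's exact rules are not reproduced here, so the choices below are assumptions: dates are normalized to YYYY-MM-DD, the slash-separated form 02/01/2020 is read month-first, missing values become None, column names use snake_case, and city_id is a hypothetical index column. Latitude and longitude are left out for brevity.

```python
from datetime import datetime

# Accepted input layouts; the last one assumes US month-first order.
DATE_FORMATS = ("%Y-%m-%d", "%Y/%m/%d", "%m/%d/%Y")
VALID_STATES = {"MA", "TX", "CA", "FL", "NY"}   # assumed membership set
STATE_NAMES = {"CALIFORNIA": "CA"}              # full names seen in the data
MISSING = {"", "-"}                             # markers treated as missing


def clean_date(text):
    """Return the date as YYYY-MM-DD, or None if no layout matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def clean_population(text):
    """Strip thousands separators and decimals; missing markers become None."""
    text = text.strip().replace(",", "")
    if text in MISSING:
        return None
    return int(float(text))


def clean_state(text):
    """Map variants such as 'T.X.', 'ny', 'california' to a two-letter code."""
    code = text.strip().replace(".", "").upper()
    code = STATE_NAMES.get(code, code)
    return code if code in VALID_STATES else None


def clean_rows(rows):
    """Return cleaned wide rows with a 1-based index column."""
    return [
        {
            "city_id": index,
            "city": row["City"].strip(),
            "state": clean_state(row["State"]),
            "population": clean_population(row["Population"]),
            "input_date": clean_date(row["Input Date"]),
        }
        for index, row in enumerate(rows, start=1)
    ]


def to_tall(cleaned):
    """Reshape wide rows into (city_id, variable, value) observations."""
    return [
        {"city_id": row["city_id"], "variable": name, "value": value}
        for row in cleaned
        for name, value in row.items()
        if name != "city_id"
    ]


# Sample rows copied from the Data section (subset of columns).
sample_rows = [
    {"City": "Houston", "State": "TX", "Population": "5446468", "Input Date": "2020/02/01"},
    {"City": "Dallas", "State": "T.X.", "Population": "", "Input Date": "02/01/2020"},
    {"City": "Los Angeles", "State": "california", "Population": "-", "Input Date": "2020-02-01"},
    {"City": "Manhattan", "State": "ny", "Population": "1643734.00", "Input Date": "2020-02-01"},
]

cleaned = clean_rows(sample_rows)
tall = to_tall(cleaned)
for row in cleaned:
    print(row)
```

Keeping city_id on every tall row means each value can be traced back to its city after reshaping, which is the main reason to add an index column before moving to the tall layout.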
Sample Solution