Clean and profile the tweets by removing hashtags, emoticons, and any redundant data that is not useful for analysis. Organize the user_location column into a common standard format. The dataset is attached, or you can get it from the link below:
https://www.kaggle.com/datasets/taniaj/australian-election-2019-tweets
Tasks:
Data profiling: Write MapReduce Java code to characterize (profile) the data in each column.
Data cleaning: Clean the tweets by removing hashtags, emoticons, and any redundant data that is not useful for analysis. Write MapReduce Java code to ETL (extract, transform, load) the data source: drop unimportant columns, normalize the data in a column, and detect badly formatted rows.
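The profiling task could be sketched roughly as follows. This is plain Java that collapses the map and reduce phases into a single method so it runs without a Hadoop installation; in a real job the per-field emit would presumably sit in a Mapper and the aggregation in a Reducer. The class name ColumnProfiler, the Stats fields, and the sample rows are illustrative assumptions, and the CSV parser assumes that fields contain no embedded newlines.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical profiling sketch: per-column count, blank count, and min/max length. */
final class ColumnProfiler {

    /** Per-column statistics (what a Reducer would accumulate for one key). */
    static final class Stats {
        long count;                       // values seen for this column
        long empty;                       // values that are blank after trimming
        int minLen = Integer.MAX_VALUE;   // shortest trimmed value
        int maxLen = 0;                   // longest trimmed value

        void add(String value) {
            String v = value.trim();
            count++;
            if (v.isEmpty()) empty++;
            minLen = Math.min(minLen, v.length());
            maxLen = Math.max(maxLen, v.length());
        }
    }

    /** Splits one CSV line, honouring double-quoted fields and "" escapes. */
    static List<String> parseCsvLine(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == '"' && i + 1 < line.length() && line.charAt(i + 1) == '"') {
                    cur.append('"');      // escaped quote inside a quoted field
                    i++;
                } else if (c == '"') {
                    inQuotes = false;     // closing quote
                } else {
                    cur.append(c);
                }
            } else if (c == '"') {
                inQuotes = true;          // opening quote
            } else if (c == ',') {
                fields.add(cur.toString());
                cur.setLength(0);
            } else {
                cur.append(c);
            }
        }
        fields.add(cur.toString());
        return fields;
    }

    /** "Map": key each field by its column index. "Reduce": fold values into Stats. */
    static Map<Integer, Stats> profile(List<String> lines) {
        Map<Integer, Stats> byColumn = new TreeMap<>();
        for (String line : lines) {
            List<String> fields = parseCsvLine(line);
            for (int col = 0; col < fields.size(); col++) {
                byColumn.computeIfAbsent(col, k -> new Stats()).add(fields.get(col));
            }
        }
        return byColumn;
    }

    public static void main(String[] args) {
        // Made-up rows for illustration only, not taken from the dataset.
        List<String> sample = Arrays.asList(
            "1,\"Vote early, vote often #auspol\",Sydney",
            "2,Hello,",
            "3,\"She said \"\"hi\"\"\",Melbourne");
        profile(sample).forEach((col, s) ->
            System.out.printf("col=%d count=%d empty=%d minLen=%d maxLen=%d%n",
                col, s.count, s.empty, s.minLen, s.maxLen));
    }
}
```

Keying by column index means a row with too many or too few fields shows up as an unexpected column count in the profile, which is one simple way to spot malformed rows before cleaning.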
Sample Solution
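One possible sketch of the per-record cleaning logic is shown below. It is not a reference solution, only plain Java that a Mapper's map() method could call for each parsed row. The class name TweetCleaner, the 11-column layout and its indices, the choice of which columns to keep, the "UNKNOWN" placeholder, and the upper-case location format are all assumptions made for illustration and should be checked against the actual CSV header.

```java
import java.util.Arrays;
import java.util.Locale;
import java.util.regex.Pattern;

/** Hypothetical cleaning sketch: strip noise from tweets, normalize location, flag bad rows. */
final class TweetCleaner {
    // Assumed column layout; verify against the real header before use.
    static final int EXPECTED_COLUMNS = 11;
    static final int CREATED_AT = 0, ID = 1, FULL_TEXT = 2,
                     RETWEET_COUNT = 3, FAVORITE_COUNT = 4, USER_LOCATION = 9;

    private static final Pattern URL = Pattern.compile("https?://\\S+");
    private static final Pattern HASHTAG = Pattern.compile("#\\w+");
    // Symbol-category code points (most emoji), skin-tone modifiers, ZWJ, variation selector.
    private static final Pattern EMOJI =
        Pattern.compile("[\\p{So}\\x{1F3FB}-\\x{1F3FF}\\x{200D}\\x{FE0F}]");
    // Simple ASCII emoticons such as :) ;-) =D, only when delimited by whitespace.
    private static final Pattern EMOTICON =
        Pattern.compile("(?<!\\S)[:;=][-']?[)(\\]\\[DPp/\\\\|](?!\\S)");

    /** Removes URLs, hashtags, emoji and ASCII emoticons, then collapses whitespace. */
    static String cleanText(String text) {
        String s = URL.matcher(text).replaceAll(" ");
        s = HASHTAG.matcher(s).replaceAll(" ");
        s = EMOJI.matcher(s).replaceAll(" ");
        s = EMOTICON.matcher(s).replaceAll(" ");
        return s.replaceAll("\\s+", " ").trim();
    }

    /** Keeps letters, digits and commas; collapses whitespace; upper-cases the result. */
    static String normalizeLocation(String raw) {
        if (raw == null) return "UNKNOWN";
        String s = raw.replaceAll("[^\\p{L}\\p{N},\\s]", " ")
                      .replaceAll("\\s+", " ")
                      .trim();
        return s.isEmpty() ? "UNKNOWN" : s.toUpperCase(Locale.ROOT);
    }

    /** A row counts as badly formatted if the field count is wrong or the id is blank. */
    static boolean isWellFormed(String[] fields) {
        return fields.length == EXPECTED_COLUMNS && !fields[ID].trim().isEmpty();
    }

    /** Returns the kept, cleaned columns, or null for a badly formatted row. */
    static String[] transform(String[] f) {
        if (!isWellFormed(f)) return null;
        return new String[] {
            f[CREATED_AT].trim(),
            f[ID].trim(),
            cleanText(f[FULL_TEXT]),
            f[RETWEET_COUNT].trim(),
            f[FAVORITE_COUNT].trim(),
            normalizeLocation(f[USER_LOCATION])
        };
    }

    public static void main(String[] args) {
        // Made-up row for illustration only, not taken from the dataset.
        String[] row = {
            "2019-05-20 09:00:00", "1001",
            "Great debate tonight :) #auspol \uD83D\uDE00 https://t.co/abc123",
            "0", "0", "42", "Example User", "example_user", "bio",
            " sydney,   nsw ", "2010-01-01 00:00:00"
        };
        System.out.println(Arrays.toString(transform(row)));
    }
}
```

In a Hadoop job, a null result from transform could be counted with a job counter or written to a separate output rather than silently discarded, so that the number of rejected rows stays visible.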