General information:
General rule: There may be some underspecified parts in the project description. This is on
purpose! In those cases, make your own design choices and document them!
You can choose from the following project types:
Software: Pick a dataset (see Resources below) and define a problem you want to solve.
Select as many data mining techniques as you would like to use to solve the problem, implement them from scratch, clean and analyze the data, compare the results from the different techniques, and present the findings. Alternatively, you may define a problem associated with data that can be obtained from a website or web platform. Write a crawler that scrapes data from that platform, making sure that you respect all the crawling policies that the website has in place (usually by checking the robots.txt file of the website, or its data policy). As in the dataset option, choose the data mining techniques that you want to use; if the crawling and the data preparation require non-trivial implementation work, it is OK to implement one fewer technique.
Research: In this project type you can propose your own idea along the lines of your own research, or along the lines of improving the state of the art in solving an existing data mining problem. If there are any well-known techniques for that problem, you should use them as baselines, and you should propose a novel solution to that problem. You must implement at least one method, as in the Software option. This project type can earn extra credit.
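For the crawler variant of the Software project type, a robots.txt check can be sketched as follows. This is a minimal sketch using Python's standard urllib.robotparser module; the robots.txt content, the user agent name "course-project-bot" and the example.com URLs are hypothetical and only serve to make the sketch runnable offline. A real crawler would load the site's actual robots.txt and would also need to follow the site's data policy, which this check does not cover.

```python
from urllib import robotparser


def _load(robots_txt: str) -> robotparser.RobotFileParser:
    """Parse the text of a robots.txt file into a RobotFileParser."""
    parser = robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the rules permit user_agent to fetch url."""
    return _load(robots_txt).can_fetch(user_agent, url)


def crawl_delay(robots_txt: str, user_agent: str):
    """Return the Crawl-delay that applies to user_agent, or None if unset."""
    return _load(robots_txt).crawl_delay(user_agent)


# Hypothetical robots.txt content (an assumption, not taken from any real site).
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

AGENT = "course-project-bot"  # hypothetical user agent name
print(is_allowed(EXAMPLE_ROBOTS, AGENT, "https://example.com/public/page"))
print(is_allowed(EXAMPLE_ROBOTS, AGENT, "https://example.com/private/data"))
print(crawl_delay(EXAMPLE_ROBOTS, AGENT))
```

In a crawler, every candidate URL would pass through a check like is_allowed before it is fetched, and the crawler would sleep for at least the reported crawl delay between requests.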
Project Deliverables:
Project Proposal
Description:
In the proposal you must introduce your project concisely. In particular, you have to clearly define the problem your project proposes to solve. You should be able to distill the
essence of your proposal to a statement like:
Given ___ Use ___ To ___
For example:
Given Netflix data Use Collaborative Filtering algorithms To recommend new movies to users
or
Given Twitter data Use Matrix Factorization To detect fake followers
In special cases, you may be able to relax the above format for the problem statement, but it is
fairly generic and applies to a wide variety of problem statements. In any case, make sure you
define what problem you are going to solve, and very importantly, describe how you are
planning to evaluate your approach.
In addition to the above, make sure you include:
- The type of the project.
- Evaluation plan
Depending on the project type you chose, you need to clearly describe your plan for obtaining
the data that you will use.
– Here is how I will find labeled data
– Given labeled data, here’s what I’ll do
– Without labeled data, here’s what I’ll do
The page limit for the proposal is 2 pages, single column.
Final Project Deliverable
Description:
The final project deliverable should include:
- The project report in .pdf format.
- The code for your implementation.
- If you collected any dataset(s) for your project, include it/them in your deliverable, if that
is possible. If the dataset comes with restrictions, there is no need to include it.
Details for the report:
Your final report should resemble a KDD paper (download the ACM “tight” format here
http://www.acm.org/publications/proceedings-template) and the page limit is 10 pages in double column format including the references.
For all project types, the report must include:
1) an Introduction where you describe and motivate the problem, give an outline of your contributions, and motivate your approach; for a Research project you also have to argue that your proposed approach is sufficiently novel with respect to the state of the art, by stating how existing methods do not adequately address the problem you are solving,
2) a Related Work section where you outline relevant papers that work on the same problem,
3) a Proposed Method section where you describe the method(s) you used to solve the problem,
4) an Experimental Evaluation section where you compare the methods used; for a Research project you have to further demonstrate that the proposed approach outperforms the baselines (at least in some cases); this can earn extra credit, and
5) a Discussion & Conclusions section where you draw the conclusions of your paper and outline potential future research directions.
For the code, make sure you include:
- All source files you wrote with comments that explain your implementation.
- A README file that describes what each file does.
Page limit: 5 pages + 1 for references (KDD-style double column format, ACM “tight” style)
Project Implementation
You need to implement one method. “Implementation” means writing the code for the method from scratch. For those implementations, you may use packages like Pandas, NumPy etc., but only for their basic functionality. You may not use an existing library implementation of the method itself.
If you find a website/tutorial/blog that outlines the implementation, you may use it as inspiration/guide but anything you submit must be your own implementation. Verbatim (or nearly verbatim) copies will not be allowed or tolerated (see the academic integrity section below).
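To illustrate what "from scratch" with only basic package functionality can look like, here is a minimal sketch of k-means (Lloyd's algorithm) that uses NumPy only for array arithmetic and random sampling. The choice of k-means, the function name kmeans, its parameters and the toy points are all assumptions made for this illustration; the project description does not prescribe any particular technique.

```python
import numpy as np


def kmeans(points, k, n_iters=100, seed=0):
    """Plain Lloyd's algorithm. Returns (centroids, labels)."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    # Initialise the centroids with k distinct data points.
    centroids = points[rng.choice(len(points), size=k, replace=False)].copy()
    for _ in range(n_iters):
        # Squared Euclidean distance from every point to every centroid.
        diffs = points[:, None, :] - centroids[None, :, :]
        dists = (diffs ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of the points assigned to it.
        new_centroids = centroids.copy()
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                new_centroids[j] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break  # assignments are stable, so the algorithm has converged
        centroids = new_centroids
    return centroids, labels


# Toy data (hypothetical): two small, well-separated groups of 2-D points.
toy_points = np.array(
    [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float
)
toy_centroids, toy_labels = kmeans(toy_points, k=2)
print(toy_labels)
print(toy_centroids)
```

By contrast, calling a ready-made clustering routine from a machine learning library would count as using an existing implementation of the method, not as implementing it.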
There are some techniques for which, by exception, you may use existing implementations in packages:
• Neural Networks: You may use packages like TensorFlow, PyTorch etc., and as part of your implementation you should experiment thoroughly with different architectures.
• You may use an existing implementation of the Singular Value Decomposition.
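As an illustration of the SVD exception, the sketch below uses an existing SVD routine as a building block and writes the surrounding step by hand. It assumes NumPy's numpy.linalg.svd as the existing implementation and computes a rank-limited approximation of a small matrix; the function name low_rank_approx and the toy matrix are hypothetical and chosen only for this example.

```python
import numpy as np


def low_rank_approx(matrix, rank):
    """Rank-`rank` approximation of `matrix` built from its truncated SVD."""
    matrix = np.asarray(matrix, dtype=float)
    # The decomposition itself comes from the library (the allowed exception).
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    # Keep the `rank` largest singular values and their singular vectors.
    return (u[:, :rank] * s[:rank]) @ vt[:rank, :]


# Toy matrix (hypothetical), e.g. 4 users by 3 items.
toy_matrix = np.array(
    [[5, 4, 0],
     [4, 5, 1],
     [0, 1, 5],
     [1, 0, 4]],
    dtype=float,
)
approx = low_rank_approx(toy_matrix, rank=2)
# Frobenius norm of the reconstruction error.
print(np.linalg.norm(toy_matrix - approx))
```

Everything built on top of the decomposition, such as how the rank is chosen or how the factors are used for the actual task, would still be your own implementation.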
Academic Integrity
This is EXTREMELY IMPORTANT, please read carefully:
Project submission: You must only submit work that is yours. If you receive help from any external source, you must properly credit that source and describe the exact amount of overlap, and if the help is significant, the appropriate grade reduction will be applied. For example, if your entire code is taken from an online source, which you properly credit, you will not receive any credit, but this will not be regarded as plagiarism, since you properly credited the source.
Regarding Plagiarism: Always cite your sources! Never (ever!) copy any text, figure, method, or any part of a paper/book/source code/any intellectual work verbatim from that source. If you have to use a short piece of original text verbatim, always put it in quotes and cite the original source right next to the quote. For code, you must specify exactly where a snippet of code came from. Plagiarism is a very serious offence that can get a researcher banned from publishing for a number of years.
Resources
COVID-19 Related Projects
Given the current situation of the COVID-19 pandemic, there is a lot of interest in using data
science to help scientists with tackling this global problem. As a result, there is already a very
rich collection of datasets and problems that you are encouraged to consider for your class
project. Below are the resources at the time of writing of this document:
- COVID-19 Open Research Dataset (CORD-19): This dataset contains research papers
that talk about different strains of the Coronavirus and the goal here is to develop
techniques that can help experts sift through the literature more efficiently.
https://www.semanticscholar.org/cord19
- COVID-19 Global Forecasting: This challenge aims at predicting the spread of the virus.
https://www.kaggle.com/c/covid19-global-forecasting-week-1
- COVID-19 Twitter Dataset from USC: Analyze what people on Twitter talk about as it
relates to the pandemic and the virus.
https://github.com/echen102/COVID-19-TweetIDs
Problem ideas
You can find ideas for problems in the following links:
- KDD Cup Archives https://www.kdd.org/kdd-cup
- WSDM 2019 Cup http://www.wsdm-conference.org/2019/wsdm-cup-2019.php
- Yelp dataset challenge https://www.yelp.com/dataset/challenge
- Kaggle https://www.kaggle.com/
Datasets
You can find data for your project in the following links:
- Stanford SNAP Datasets http://snap.stanford.edu/data/index.html
- Aminer Network Data https://www.aminer.org/data
- Koblenz Network Data http://konect.uni-koblenz.de/
- Microsoft Research Asia T-Drive Data https://www.microsoft.com/en-us/research/publication/t-drive-trajectory-data-sample/
Sample Solution