CS 236: Project Description
The goal of this project is to understand and implement basic
top-k algorithms available in the literature.
A top-k query returns the k objects with the highest aggregate score.
The score is computed by a function f over certain attributes of each
object. For example, given a relational table Restaurants with the
attributes price, quality, and location, a top-k query returns the k
restaurants with the highest score for
f(t)=sum(-price(t), quality(t), -distance(t, my_hotel)) (i.e., the cheapest,
with the best quality, and closest to my hotel).
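As a minimal sketch of the restaurant example, the query can be answered by scoring every row and keeping the k largest scores. The sample rows, the precomputed distance column, and the names score and top_k are hypothetical and serve only as illustration.

```python
import heapq

# Hypothetical sample rows: (name, price, quality, distance_to_my_hotel).
restaurants = [
    ("A", 30.0, 8.0, 2.0),
    ("B", 10.0, 6.0, 5.0),
    ("C", 20.0, 9.0, 1.0),
    ("D", 50.0, 9.5, 0.5),
]

def score(row):
    """f(t) = sum(-price(t), quality(t), -distance(t, my_hotel))."""
    _, price, quality, distance = row
    return -price + quality - distance

def top_k(rows, k):
    """Return the k rows with the highest score, best first."""
    return heapq.nlargest(k, rows, key=score)

best = top_k(restaurants, 2)
```

This brute-force scan reads every row. The algorithms required in this project aim to answer the same query while accessing fewer items.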
What you have to implement in this project
In this project, you are required to implement both of the following top-k algorithms:
- Fagin's Algorithm (FA): this algorithm performs sorted access
in parallel on all the lists. Sorted access stops once at least k objects
have been seen in every list; random access is then used to fetch the
missing attribute values of the objects seen so far;
- No Random Access (NRA): this algorithm performs
sorted access in parallel on all the lists. It maintains lower and upper
bounds on each object's score to stop early. (In the slides, this algorithm is also called
Restricting Random Access.)
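The two algorithms can be sketched as follows. This is an illustration under stated assumptions, not a reference solution: rows is a non-empty in-memory dict from object id to a tuple of attribute values, the ranking function is a plain sum over all attributes, k is at least 1, and floor is the assumed smallest possible attribute value (0 for a dataset in the range 0 to 1, -1 for a dataset in the range -1 to 1). The names build_lists, fagin, and nra are hypothetical.

```python
import heapq

def build_lists(rows):
    """One list per attribute, sorted by value in descending order."""
    m = len(next(iter(rows.values())))
    return [sorted(((oid, vals[i]) for oid, vals in rows.items()),
                   key=lambda p: p[1], reverse=True) for i in range(m)]

def fagin(rows, k):
    """FA sketch: returns ([(score, id), ...], sorted-access depth per list)."""
    lists = build_lists(rows)
    m, n = len(lists), len(rows)
    seen = {}            # id -> set of list indexes where the id was seen
    complete = depth = 0
    while complete < k and depth < n:
        for i in range(m):
            oid = lists[i][depth][0]
            where = seen.setdefault(oid, set())
            where.add(i)
            if len(where) == m:
                complete += 1
        depth += 1
    # Random access phase: rows[oid] stands in for looking up missing values.
    scored = [(sum(rows[oid]), oid) for oid in seen]
    return heapq.nlargest(k, scored), depth

def nra(rows, k, floor=0.0):
    """NRA sketch: returns (top-k ids, sorted-access depth per list)."""
    lists = build_lists(rows)
    m, n = len(lists), len(rows)
    known = {}           # id -> {list index: value seen under sorted access}
    ranked = []
    for depth in range(n):
        last = []        # last value read from each list in this round
        for i in range(m):
            oid, value = lists[i][depth]
            known.setdefault(oid, {})[i] = value
            last.append(value)

        def lower(o):    # unknown attributes take the smallest possible value
            return sum(known[o].get(i, floor) for i in range(m))

        def upper(o):    # unknown attributes are at most the last value seen
            return sum(known[o].get(i, last[i]) for i in range(m))

        ranked = sorted(known, key=lower, reverse=True)
        if len(ranked) < k:
            continue
        top, rest = ranked[:k], ranked[k:]
        kth = lower(top[-1])
        # Unseen objects score at most sum(last); seen ones at most upper(o).
        unseen_ok = len(known) == n or kth >= sum(last)
        if unseen_ok and all(upper(o) <= kth for o in rest):
            return top, depth + 1
    return ranked[:k], n
```

Note that this NRA sketch returns the ids of the top-k set ordered by lower bound; the exact scores of the returned objects may not be fully known when it stops.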
To compare the performance of the algorithms, you have to implement a
data structure that supports the access pattern of each algorithm under
test. You may use any programming language of your choice. In particular,
if you implement the project in C++, you should take advantage of the STL.
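One possible shape for such a data structure, sketched here in Python, is a per-attribute list that supports both sorted and random access and counts each kind of access so the numbers can be reported later. The class name AttributeList and its methods are hypothetical, and the whole list is assumed to fit in memory.

```python
class AttributeList:
    """One attribute of the dataset, with access counters."""

    def __init__(self, pairs):
        pairs = list(pairs)                      # (object id, value) pairs
        # Descending order by value supports sorted access.
        self.sorted_pairs = sorted(pairs, key=lambda p: p[1], reverse=True)
        # A hash index by object id supports random access.
        self.by_id = dict(pairs)
        self.cursor = 0
        self.sorted_accesses = 0
        self.random_accesses = 0

    def next_sorted(self):
        """Return the next (id, value) in descending order, or None at the end."""
        if self.cursor >= len(self.sorted_pairs):
            return None
        pair = self.sorted_pairs[self.cursor]
        self.cursor += 1
        self.sorted_accesses += 1
        return pair

    def random_access(self, oid):
        """Return the value stored for object oid."""
        self.random_accesses += 1
        return self.by_id[oid]
```

FA would use both methods, while NRA would only call next_sorted. In C++, a sorted std::vector plus a std::unordered_map could play the same two roles.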
For both datasets, use the following function to rank the objects (attrN is the N-th attribute of
t):
- f1(t) = sum( t.attr1, t.attr2, t.attr7, t.attr8, t.attr9)
You can rank the objects in the dataset using the ranking
score provided by the function above.
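A brute-force version of this ranking is useful as a baseline for checking the output of both algorithms. The sketch below assumes that attribute numbering starts after the object id, so that attrs[0] holds attr1; this layout is an assumption and should be checked against the actual files. The names F1_ATTRS, f1, and rank are hypothetical.

```python
# Attribute numbers used by f1, as listed in the project description.
F1_ATTRS = (1, 2, 7, 8, 9)

def f1(attrs):
    """Sum of attr1, attr2, attr7, attr8, attr9.

    Assumption: attrs[0] is attr1, i.e. the object id is stored separately.
    """
    return sum(attrs[n - 1] for n in F1_ATTRS)

def rank(table, k):
    """table maps object id -> sequence of attribute values.

    Returns the k best (score, id) pairs, highest score first.
    """
    return sorted(((f1(a), oid) for oid, a in table.items()), reverse=True)[:k]
```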
Data set description
The two datasets for this project are available at datasets.tar.gz. The first dataset contains
20 attributes (in the range of 0 to 1), while the second one has 9
attributes (in the range of -1 to 1). The first attribute in both
datasets is the object's identifier.
What you should submit
You must write a report with your results, findings, and errors/problems/bugs,
as well as a detailed description of your source code. The report must
include statistics of your implementation for a set of queries. For each
ranking function, report the total number of data items accessed per list (i.e.,
how many items are accessed in each list) and the average running time of each
ranking algorithm for 10 queries, starting with k=5 and going up to k=50
in increments of 5 (i.e., k=5, 10, 15, ..., 50). For the
running time, you have to take the average of at least 3 runs.
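The measurement loop described above can be sketched as follows. The parameter algorithm stands for any callable that takes k and runs one query; average_runtime and run_experiment are hypothetical names, and time.perf_counter is used as the clock.

```python
import time

def average_runtime(algorithm, k, runs=3):
    """Average wall-clock time in seconds of algorithm(k) over `runs` runs."""
    total = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        algorithm(k)
        total += time.perf_counter() - start
    return total / runs

def run_experiment(algorithm, runs=3):
    """Map each k in 5, 10, ..., 50 to the average running time."""
    return {k: average_runtime(algorithm, k, runs) for k in range(5, 51, 5)}
```

For example, run_experiment(lambda k: my_fa(k)) would time a hypothetical my_fa function for the 10 required values of k.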
You also have to report the answer (only the object's id and the final
score) for each query of your experiment (you only have to report the result
for each query regardless of the algorithm, since both algorithms return the
same result).
Along with your report (in PDF format), you also have to submit your source code
and a README file that explains how to reproduce your experiments.
The project is to be done on your own. Any copying or sharing of source code is prohibited.
Deadline for the project
The deadline for this project is Sunday, December 11, 11:59pm.
Please submit a tarball named "username1_username2" (username=your CS email account) to
with the subject "cs236: final project".
Report questions or problems to
last day modified: 11/21/2011 3:03pm