CS 236

CS 236: Project Description

The goal of this project is to understand and implement basic top-k algorithms available in the literature.

A top-k query returns the k objects with the highest aggregate score. The aggregate score is computed by a function f over selected attributes of the object. For example, given a relational table Restaurants with the attributes price, quality, and location, a top-k query returns the k restaurants with the highest score for f(t) = sum(-price(t), quality(t), -distance(t, my_hotel)), i.e., the cheapest, best-quality restaurants closest to my hotel.
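As a point of reference, the restaurant query can be answered naively by scoring every object and keeping the k best. The sketch below does this in Python; the sample rows, the field names, and the helper name `naive_top_k` are hypothetical and serve only to illustrate the definition. The top-k algorithms in this project aim to return the same answer without scanning every object.

```python
import heapq

def naive_top_k(rows, k, score):
    """Score every row and keep the k rows with the highest score."""
    return heapq.nlargest(k, rows, key=score)

def f(t):
    # Aggregate score from the example: cheap, good, and close is best.
    return sum((-t["price"], t["quality"], -t["distance"]))

# Hypothetical sample data; "distance" stands for distance(t, my_hotel).
restaurants = [
    {"name": "A", "price": 30, "quality": 8, "distance": 2},
    {"name": "B", "price": 15, "quality": 6, "distance": 5},
    {"name": "C", "price": 20, "quality": 9, "distance": 1},
    {"name": "D", "price": 50, "quality": 10, "distance": 3},
]

top2 = [t["name"] for t in naive_top_k(restaurants, 2, f)]
print(top2)
```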

What you have to implement in this project

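You must implement BOTH of the following algorithms:

Fagin's Algorithm (FA): this algorithm performs sorted access in parallel on all the lists. Sorted access stops once at least k objects have been seen in every list; the missing scores of all seen objects are then fetched by random access, and the k objects with the highest aggregate score are returned;
No Random Access (NRA): this algorithm performs sorted access in parallel on all the lists. It maintains lower and upper bounds on each object's aggregate score and uses them to stop early. (In the slides, this algorithm is also called Restricting Random Access.)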
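The two algorithms can be sketched in Python as follows. This is a minimal illustration rather than a reference solution. It assumes the aggregate is a sum, that each list holds (obj_id, score) pairs sorted by score in descending order, that every object appears exactly once in every list, and that k >= 1. The function names and the `min_score` parameter are hypothetical; `min_score` is the smallest value an attribute can take (0 for the first dataset, -1 for the second). The NRA sketch recomputes all bounds in every round, which is simple but not efficient.

```python
import heapq

def fagin_topk(lists, k):
    """FA sketch: returns ([(obj_id, score), ...], sorted-access depth)."""
    m = len(lists)
    by_id = [dict(lst) for lst in lists]   # stand-in for random access
    seen_in = {}                           # obj_id -> set of list indices
    complete = 0                           # objects seen in all m lists
    depth = 0
    while complete < k and depth < len(lists[0]):
        for i, lst in enumerate(lists):    # one round of sorted access
            obj, _ = lst[depth]
            where = seen_in.setdefault(obj, set())
            where.add(i)
            if len(where) == m:
                complete += 1
        depth += 1
    # Random-access phase: compute the full score of every seen object.
    scored = [(sum(idx[obj] for idx in by_id), obj) for obj in seen_in]
    best = heapq.nlargest(k, scored)
    return [(obj, score) for score, obj in best], depth

def nra_topk(lists, k, min_score=0.0):
    """NRA sketch: returns ([(obj_id, lower bound), ...], depth)."""
    m = len(lists)
    partial = {}                           # obj_id -> {list index: score}
    lower = {}
    for depth in range(len(lists[0])):
        for i, lst in enumerate(lists):    # one round of sorted access
            obj, score = lst[depth]
            partial.setdefault(obj, {})[i] = score
        last = [lst[depth][1] for lst in lists]   # last score seen per list
        # Unknown scores count as min_score (lower) or last[i] (upper).
        lower = {o: sum(p.get(i, min_score) for i in range(m))
                 for o, p in partial.items()}
        upper = {o: sum(p.get(i, last[i]) for i in range(m))
                 for o, p in partial.items()}
        if len(partial) < k:
            continue
        top = heapq.nlargest(k, lower, key=lower.get)
        rivals = [upper[o] for o in partial if o not in top]
        rivals.append(sum(last))           # bound for objects not seen yet
        if lower[top[-1]] >= max(rivals):  # nobody can overtake the top k
            return [(o, lower[o]) for o in top], depth + 1
    top = heapq.nlargest(k, lower, key=lower.get)
    return [(o, lower[o]) for o in top], len(lists[0])

# Hypothetical two-attribute example.
lists = [
    [("a", 0.9), ("b", 0.8), ("d", 0.3), ("c", 0.1)],
    [("b", 0.9), ("a", 0.7), ("c", 0.2), ("d", 0.1)],
]
fa_result, fa_depth = fagin_topk(lists, 2)
nra_result, nra_depth = nra_topk(lists, 2)
print([o for o, _ in fa_result], [o for o, _ in nra_result])
```

Note that the NRA sketch returns lower bounds, which equal the exact scores only for objects that have been seen in every list by the time the algorithm stops.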

To compare the performance of the algorithms, you have to implement a data structure that supports the access pattern each algorithm requires. The data structure can be implemented in any programming language of your choice. In particular, if you choose C++, you should take advantage of the STL.
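One possible shape for such a data structure is sketched below in Python: a list that offers sorted access through a cursor and random access through a dictionary, and that counts both kinds of access so the numbers can be reported. The class name `AccessList` and its method names are hypothetical; this is one design among many, not a required interface.

```python
class AccessList:
    """One attribute list with sorted and random access, plus counters."""

    def __init__(self, scores):
        # scores: dict mapping obj_id -> score for a single attribute.
        self._by_id = dict(scores)
        self._sorted = sorted(self._by_id.items(),
                              key=lambda entry: entry[1], reverse=True)
        self._cursor = 0
        self.sorted_accesses = 0
        self.random_accesses = 0

    def next_sorted(self):
        """Return the next (obj_id, score) in descending score order,
        or None once the list is exhausted."""
        if self._cursor >= len(self._sorted):
            return None
        entry = self._sorted[self._cursor]
        self._cursor += 1
        self.sorted_accesses += 1
        return entry

    def random_access(self, obj_id):
        """Return the score of obj_id directly (used by FA, not by NRA)."""
        self.random_accesses += 1
        return self._by_id[obj_id]

# Hypothetical usage.
lst = AccessList({"a": 0.2, "b": 0.9, "c": 0.5})
first = lst.next_sorted()
second = lst.next_sorted()
score_a = lst.random_access("a")
print(first, second, score_a, lst.sorted_accesses, lst.random_accesses)
```

Keeping the counters inside the list makes the per-list access totals available without changing the algorithms themselves.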

For both datasets, use the following function to rank the objects (attrN is the N-th attribute of t):

f1(t) = sum( t.attr1, t.attr2, t.attr7, t.attr8, t.attr9)

You can rank the objects in the dataset using the ranking score provided by the function above.
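The sketch below shows one way to turn f1 into the per-attribute sorted lists that FA and NRA consume. It rests on an assumption that should be checked against the actual files: each row is taken to be a sequence whose column 0 is the object's id and whose column N holds attrN. If the attribute numbering in the data instead counts the id as the first attribute, only the hypothetical `first_attr_col` parameter needs to change. The names `F1_ATTRS`, `f1`, and `build_lists` are likewise illustrative.

```python
F1_ATTRS = (1, 2, 7, 8, 9)   # attribute numbers used by f1

def f1(row, first_attr_col=1):
    """Sum of attr1, attr2, attr7, attr8, attr9 for one row.
    ASSUMPTION: attrN is stored at column first_attr_col + N - 1."""
    return sum(row[first_attr_col + n - 1] for n in F1_ATTRS)

def build_lists(rows, id_col=0, first_attr_col=1):
    """One list per attribute of f1, each holding (obj_id, score) pairs
    sorted by score in descending order."""
    lists = []
    for n in F1_ATTRS:
        col = first_attr_col + n - 1
        entries = [(row[id_col], row[col]) for row in rows]
        entries.sort(key=lambda entry: entry[1], reverse=True)
        lists.append(entries)
    return lists

# Hypothetical rows: (id, attr1, ..., attr9).
rows = [
    (101, 1, 2, 3, 4, 5, 6, 7, 8, 9),
    (102, 9, 8, 7, 6, 5, 4, 3, 2, 1),
]
scores = [f1(row) for row in rows]
lists = build_lists(rows)
print(scores, lists[0])
```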

Data set description

The two datasets for this project are available at datasets.tar.gz. The first dataset contains 20 attributes (in the range 0 to 1), while the second has 9 attributes (in the range -1 to 1). In both datasets, the first attribute is the object's identifier.

What you should submit

You must write a report with your results, findings, and any errors, problems, or bugs, as well as a detailed description of your source code. The report must include statistics of your implementation for a set of queries. For each ranking function, report the total number of data items accessed per list (i.e., how many items are accessed in each list) and the average running time of each ranking algorithm for 10 queries, from k=5 to k=50 in increments of 5 (i.e., k=5, 10, 15, ..., 50). For the running time, take the average of at least 3 runs.
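The timing part of the experiment can be organized along the lines of the sketch below. It assumes each algorithm is a callable taking `(lists, k)`; the names `benchmark` and `stand_in` are hypothetical, and `stand_in` is only a placeholder for an FA or NRA implementation.

```python
import time

def benchmark(algorithm, lists, ks=range(5, 55, 5), runs=3):
    """Average wall-clock time in seconds of algorithm(lists, k) for each k,
    taken over `runs` executions."""
    averages = {}
    for k in ks:
        elapsed = []
        for _ in range(runs):
            start = time.perf_counter()
            algorithm(lists, k)
            elapsed.append(time.perf_counter() - start)
        averages[k] = sum(elapsed) / runs
    return averages

def stand_in(lists, k):
    # Placeholder workload: first k entries of each list.
    return [lst[:k] for lst in lists]

sample_lists = [[(i, 1.0 - i / 100) for i in range(100)] for _ in range(5)]
timings = benchmark(stand_in, sample_lists)
for k, seconds in timings.items():
    print(f"k={k:2d}  avg time={seconds:.6f}s")
```

The default `ks` covers k=5, 10, ..., 50, which gives the 10 queries requested above.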

You also have to report the answer (only the object's id and the final score) for each query of your experiment. Report the result once per query, regardless of the algorithm, since both algorithms return the same result.

Along with your report (in PDF format), you also have to submit your source code and a README file that explains how to reproduce your experiments.

The project is to be done on your own. Any copying or sharing of source code is prohibited.

Deadline for the project

The deadline for this project is Sunday, December 11, 11:59pm. Please submit a tarball named "username1_username2" (username=your CS email account) to with the subject "cs236: final project".

References:

R. Fagin. Combining fuzzy information: an overview. SIGMOD Record, 2002

Report questions or problems to

last day modified: 11/21/2011 3:03pm