#
# (C) Copyright University of California Riverside. 2001-2002.
#
# Peerware - P2P Simulation Infrastructure
#
# @version         : 1.0
# @document author : Demetris Zeinalipour (csyiazti@cs.ucr.edu)
# Project Supervision : Dimitris Gunopulos (dg@cs.ucr.edu)
# Computer Science Department , University of California, Riverside
#
# Document Description
# ********************
This document provides a brief overview of the peerware system.


A) Overview
=========================================================================
Peerware is a prototype Peer-to-Peer Information Retrieval system which
can be deployed on a Network of Workstations (e.g. Instructional Labs,
Clusters, etc). It consists of the following modules:
1) A P2P environment written in JAVA (Java Middleware)
2) The Lucene Information Retrieval API used by each peer to locally
   lookup queries
3) A set of bash shell scripts which can be used to deploy the Java
   Middleware.

Peerware is generally modular in a sense that you can replace various
to construct different realistic simulation environments. 
For example you could substitute the random graph generator by a powerlaw
graph generator and keep everything else unaffected. 
Although peerware has been designed and tested over a LAN (in 3 subnets), 
in practice it could also be deployed over a WAN (assuming that nodes
are not firewalled.


B) Installation
=========================================================================
* Prerequisites:
1) A network of linux boxes (or windows machines equipped with cygwin).
   Each box should have the sshd up and running.
   The peerware software has to be placed on a folder which is accessible
   by all machines on the same path (e.g. a folder on NFS - typical setting)
   

1) Install JAVA from http://www.java.com 
   The system has been tested with Sun's JAVA 1.4.2 although any other java
   should also work.
   
2) Set public/private keys: Consult docs/sshkeys.txt for more details
   The main reason peerware requires this setting is to allow the various 
   shell scripts to automatically connect to remote machines and to 
   perform various tasks.

* Install
1) Unzip: tar -zxvf peerware-1.0-rc1.tar.gz

2) cd peerware

3) make 
   Sets all appropriate parameters and compiles the sources

4) Edit LINUX_DIR and LINUX_TEMP_DIR in config.txt (set them to your preferred path)

5) Optional: Put the peerware folder into your PATH (e.g. in ~/.profile)


C) Execute  - Order of Commands
=========================================================================
1) cd peerware

2) Set Machines: vi net.txt.all
Set the available machines in net.txt.all using a text editor 

3) Probe Machines: ./createNetFile.sh
This creates net.txt that contains all active machines (some of the machines
in net.txt.all might not be available)

4) Create Graph: ./1-graphgen.sh {new|rebuild}
Refer to docs/graphgen.txt for more details.

5) Start network: ./startAll.sh
This connects to all machines on net.txt and launches the respective peers.
(./stopAll - kills all remote processes)

You will note that a number of strings appear on the screen (">>", "<<", "|").
">>"  -> Some peer tries to establish an outgoing connection
"<<"  -> Some peer received a request for an incoming connection
"||"  -> Some peer accepted a connection

When these messages stop (it means that all nodes established their connections
and you can proceed to the querying phase


6) Traceroute: ./4-query.sh trace                        
Perform a tracroute at the overlay to see the paths taken by the trace message
(useful for having a glance at the overlay structure)

7) Query the system: ./4-query.sh
This connects to the first host in net.txt and sends him all queries
in keywords.txt (sequentially). The results are stored into various files
recall.txt	  =>	Number of documents found per query until next query
recallFULL.txt	  =>	Number of documents found per query overall 
                        ("infinite" query waiting time window)
querytime.txt	  =>	Milliseconds to receive answers (until next query)
querytimeFULL.txt =>	Milliseconds to receive answers overall 
                        ("infinite" query waiting time window)

8) Traceroute (again): ./4-query.sh trace                        
This allows each peer to dump to local storage various logging parameters
buffered in memory (for efficiency purposes). Do not execute ./stopAll
before issuing this command because any results that haven't been logged
to files and are still in memory will be lost

9) Kill the system: ./stopAll

10) Run ./ka.sh at the queryhing node to kill the querying process

11) Gather all logged messages: ./5-fetchlogs.sh
Since each node might create an arbitrary large number of log files for
messages routed in the system, those messages are stored locally at each node.
(in /tmp/peerware). This script visits all machines, tars the logs,
retrieves them on the filesystem.

12) Merge logged messagess: ./6-mergelogs
This script merges all individual logs and calculates how many messages
have been spent per query. The results are stored in "messages.txt"


Directory Structure
=========================================================================

+ classes/  -> contains the compiled java classes

+ data/  -> contains the local data of each peer. 
  Currently the format is a Lucene IR index strucutre. The system
  can easily support other types of format/indexes if the class
  src/ucr/app/Peer.java is changed accordingly

+ docs/  -> contains Java documentation and other documents 

+ extra/  -> contains additional data sources/queries

+ jars/  -> contains various utilized jar Files
	
	Apache Lucene: lucene-1.3-rc1.jar 
	=> http://lucene.apache.org/java/docs/
	
	IBM XML4JAVA XML Parser: xml4j_2_0_13.jar
	=> http://www.alphaworks.ibm.com/tech/xml4j
	
	PDOM-XQL Persistent Document Object Model: xqlpdom_1_0_2.jar
	http://www.ipsi.fraunhofer.de/oasys/projects/pdom/index_e.html

+ conf/	-> contains configuration files for the various peers.
  (one text file per peer)

+ rawdata/ -> a number of xml files to be indexed by the lucene indexer

+ scripts -> contains directories of initialization files 
  (1 directory per participating pc). These initialization files 
  allow the system to boot the respective peers on the various 
  machines. This folder is automatically constructed by graphgen
  based on the available resources

+ src/ -> The project sources

+ logs/   -> contains various logs that allow that can be used
  for monitoring/debugging purposes.
  
	 
ADDING/CHANGING data sources
=========================================================================
Adding new/additional data sources to the peerware system is straightforward:
The ./buildIndexes script allows you to create lucene indexes that
can be utilized by the Peers. The idea is that you want to create
a bunch of indexes in data/ and then execute graphgen to construct
a random p2p graph among these indexes. In order to reconstruct the 
lucene indexes do the following:

1) Put any collection of XML documents into rawdata/
   (e.g. city1.xml, city2.xml... OR newspaper-us.xml, news-uk.xml etc)
   * the name can be anything
   
   Note: Some additional data sources are locate in extra/datasets/ 
         Copy those over to rawdata/ to use them

2) Change DATA_XML_PATH in config.txt. This is essentially the path you 
   want to extract from the XML collection (at any granularity!)
   <XML>
   	<CAR>
   	   <PRICE>
   	   <DESCR>Blah Blah</DESCR>
   	   <AUX>
   	</CAR>
   	<CAR>
   	   <PRICE>19.2M</PRICE>
   	   <DESCR>Blah Blah</DESCR>
   	   <AUX>
   	</CAR>   	
   	....
   </XML>
   
   e.g. 
   DATA_XML_PATH=//XML//CAR//DESCR => only descriptions
   
   DATA_XML_PATH=//XML//CAR        => the whole record

3) execute:  ./buildIndexes.sh

4) The new indexes will be placed into data/ (one per xml file)
   
5) Finished!


B) Auxiliary Shell Script Tools
=========================================================================

+ ./dclearlogs.sh
Distributed Clear Logs: This script connects to all machines in the net.txt 
file and deletes the folder /tmp/peerware.

+ ./startAll.sh
Start All Peers: This script starts all peers on all machines. Before
executing this script make sure that you have generated the respective
configuration files for each peer through the ./graphgen.sh script.

+ ./2-nodes.sh
Start All Peers on a particular machine: You can use this script to start
all peers on a particular machine. Just ssh to that machine and then execute
this script. Assuming that the respective conf/ and script/ fildes are in 
place this should start all machines on that machine.

+ ./2-node.sh [node_id]
Start peer [node_id] on this particular machine: You can use this script to start
a particular peer on a particular machine - in case the peer process crashed or 
for some other reason. Just ssh to that machine and then execute
this script. Assuming that the respective conf/ and script/ fildes are in 
place this should start the respective peer 

+ ./stopAll.sh
Stop All Peers: This script stops all peers on all machines. (based on net.txt)
If you want to kill all processes individually at a host use:

	$./ka.sh

The above will kill all java process on a particular machine

+ ./pcstarted.sh
PC's Started: This method can be invoked at any time after all peers are
started using $startall. This will allow you to understand which PCs were
not sucessfully started so that you can take them off the network.

+ ./dtop 
Distributed Top: This script performs a sequential top at each remote host
It is particularly useful for checking if all remote processes
were successfully killed


C) Compilation
=========================================================================
Although the current tar file contains a precompiled version of Peerware
you can use "make" to rebuild in part of it. You have the foltowing options:

- make => compiles everything (graph generator, core classes, peer/query node, 
          lucene tools into classes/  and java documentation into docs/javadoc
	  
- make graphgen => compile the graph generator into classes/

- make core	=> compile core protocol into classes/

- make app	=> compile peer/query node into classes/

- make javadoc	=> compile java documentation into classes/

- make tools	=> compiles tools helpful for constructing lucene indices

- make clean	=> deletes all classes - so that you can compile it from scratch

- make fresh	=> deletes all files generated during the execution of one or more 
                   components of the peerware system (temporary files, statistics, etx)