#
# (C) Copyright University of California Riverside. 2001-2002.
#
# Peerware - P2P Simulation Infrastructure
#
# @version            : 1.0
# @document author    : Demetris Zeinalipour (csyiazti@cs.ucr.edu)
# Project Supervision : Dimitris Gunopulos (dg@cs.ucr.edu)
# Computer Science Department, University of California, Riverside
#
# Document Description
# ********************

This document provides a brief overview of the Peerware system.

A) Overview
=========================================================================
Peerware is a prototype Peer-to-Peer Information Retrieval system which
can be deployed on a Network of Workstations (e.g. instructional labs,
clusters, etc.). It consists of the following modules:
 1) A P2P environment written in JAVA (Java Middleware).
 2) The Lucene Information Retrieval API, used by each peer to look up
    queries locally.
 3) A set of bash shell scripts which can be used to deploy the Java
    Middleware.

Peerware is modular in the sense that you can replace individual modules
to construct different realistic simulation environments. For example,
you could substitute the random graph generator with a power-law graph
generator and keep everything else unaffected. Although Peerware has been
designed and tested over a LAN (in 3 subnets), in practice it could also
be deployed over a WAN (assuming that nodes are not firewalled).

B) Installation
=========================================================================
* Prerequisites:
 1) A network of Linux boxes (or Windows machines equipped with cygwin).
    Each box should have sshd up and running. The Peerware software has
    to be placed in a folder which is accessible by all machines under
    the same path (e.g. a folder on NFS - a typical setting).
 2) Install JAVA from http://www.java.com
    The system has been tested with Sun's JAVA 1.4.2, although any other
    Java version should also work.
 3) Set up public/private keys: consult docs/sshkeys.txt for more
    details. The main reason Peerware requires this setting is to allow
    the various shell scripts to automatically connect to remote machines
    and perform various tasks.

* Install:
 1) Unzip: tar -zxvf peerware-1.0-rc1.tar.gz
 2) cd peerware
 3) make
    Sets all appropriate parameters and compiles the sources.
 4) Edit LINUX_DIR and LINUX_TEMP_DIR in config.txt (set them to your
    preferred paths).
 5) Optional: put the peerware folder into your PATH (e.g. in ~/.profile).

C) Execute - Order of Commands
=========================================================================
 1) cd peerware
 2) Set machines: vi net.txt.all
    Set the available machines in net.txt.all using a text editor.
 3) Probe machines: ./createNetFile.sh
    This creates net.txt, which contains all active machines (some of the
    machines in net.txt.all might not be available).
 4) Create graph: ./1-graphgen.sh {new|rebuild}
    Refer to docs/graphgen.txt for more details.
 5) Start network: ./startAll.sh
    This connects to all machines in net.txt and launches the respective
    peers. (./stopAll.sh kills all remote processes.)
    You will notice that a number of strings appear on the screen:
      ">>" -> some peer is trying to establish an outgoing connection
      "<<" -> some peer received a request for an incoming connection
      "||" -> some peer accepted a connection
    When these messages stop, all nodes have established their
    connections and you can proceed to the querying phase.
 6) Traceroute: ./4-query.sh trace
    Performs a traceroute on the overlay to see the paths taken by the
    trace message (useful for having a glance at the overlay structure).
 7) Query the system: ./4-query.sh
    This connects to the first host in net.txt and sends it all queries
    in keywords.txt (sequentially).
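The SSH key setup required by the Installation prerequisites (see
docs/sshkeys.txt for the authoritative steps) typically boils down to
generating a key pair and appending the public key to authorized_keys;
because home directories are on NFS, the same file is visible on every
machine. A minimal sketch, using a temporary directory instead of ~/.ssh
so it can be tried safely - the paths and options here are illustrative,
not taken from docs/sshkeys.txt:

```shell
# Illustrative key setup; real keys would live in ~/.ssh instead.
KEYDIR=$(mktemp -d)
ssh-keygen -q -t rsa -N "" -f "$KEYDIR/id_rsa"        # key pair, empty passphrase
cat "$KEYDIR/id_rsa.pub" >> "$KEYDIR/authorized_keys" # authorize the key
chmod 600 "$KEYDIR/authorized_keys"                   # sshd refuses looser permissions
```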
    The results are stored in various files:
      recall.txt        => number of documents found per query until the
                           next query
      recallFULL.txt    => number of documents found per query overall
                           ("infinite" query waiting time window)
      querytime.txt     => milliseconds to receive answers (until the
                           next query)
      querytimeFULL.txt => milliseconds to receive answers overall
                           ("infinite" query waiting time window)
 8) Traceroute (again): ./4-query.sh trace
    This makes each peer dump to local storage various logging parameters
    buffered in memory (for efficiency purposes). Do not execute
    ./stopAll.sh before issuing this command, because any results that
    haven't been logged to files and are still in memory will be lost.
 9) Kill the system: ./stopAll.sh
10) Run ./ka.sh at the querying node to kill the querying process.
11) Gather all logged messages: ./5-fetchlogs.sh
    Since each node might create an arbitrarily large number of log files
    for messages routed in the system, those messages are stored locally
    at each node (in /tmp/peerware). This script visits all machines,
    tars the logs, and retrieves them to the local filesystem.
12) Merge logged messages: ./6-mergelogs
    This script merges all individual logs and calculates how many
    messages have been spent per query. The results are stored in
    "messages.txt".

Directory Structure
=========================================================================
+ classes/ -> contains the compiled java classes
+ data/    -> contains the local data of each peer. Currently the format
              is a Lucene IR index structure.
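The per-query statistics files produced during the execution phase above
lend themselves to simple post-processing. As a sketch - assuming, since
the file format is not documented here, that querytime.txt holds one
millisecond value per line - the mean query time could be computed with
awk:

```shell
# Average the values in a one-number-per-line statistics file such as
# querytime.txt (format assumed, not verified against the peerware sources).
avg_ms() {
    awk '{ sum += $1; n++ } END { if (n) printf "%.1f\n", sum / n }' "$1"
}
```

Usage would then be e.g. `avg_ms querytime.txt`.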
              The system can easily support other types of
              formats/indexes if the class src/ucr/app/Peer.java is
              changed accordingly.
+ docs/    -> contains Java documentation and other documents
+ extra/   -> contains additional data sources/queries
+ jars/    -> contains various utilized jar files:
              Apache Lucene: lucene-1.3-rc1.jar
                => http://lucene.apache.org/java/docs/
              IBM XML4JAVA XML Parser: xml4j_2_0_13.jar
                => http://www.alphaworks.ibm.com/tech/xml4j
              PDOM-XQL Persistent Document Object Model: xqlpdom_1_0_2.jar
                => http://www.ipsi.fraunhofer.de/oasys/projects/pdom/index_e.html
+ conf/    -> contains configuration files for the various peers
              (one text file per peer)
+ rawdata/ -> a number of xml files to be indexed by the lucene indexer
+ scripts/ -> contains directories of initialization files (1 directory
              per participating pc). These initialization files allow the
              system to boot the respective peers on the various
              machines. This folder is automatically constructed by
              graphgen based on the available resources.
+ src/     -> the project sources
+ logs/    -> contains various logs that can be used for
              monitoring/debugging purposes.

ADDING/CHANGING data sources
=========================================================================
Adding new/additional data sources to the Peerware system is
straightforward: the ./buildIndexes.sh script allows you to create Lucene
indexes that can be utilized by the peers. The idea is that you create a
bunch of indexes in data/ and then execute graphgen to construct a random
p2p graph among these indexes. In order to reconstruct the Lucene indexes
do the following:
 1) Put any collection of XML documents into rawdata/
    (e.g. city1.xml, city2.xml... OR newspaper-us.xml, news-uk.xml etc.)
    * the names can be anything
    Note: some additional data sources are located in extra/datasets/.
    Copy those over to rawdata/ to use them.
 2) Change DATA_XML_PATH in config.txt. This is essentially the path you
    want to extract from the XML collection (at any granularity!)
    e.g. DATA_XML_PATH=//XML//CAR//DESCR => only the descriptions
         DATA_XML_PATH=//XML//CAR        => the whole record
 3) Execute: ./buildIndexes.sh
 4) The new indexes will be placed into data/ (one per xml file).
 5) Finished!

B) Auxiliary Shell Script Tools
=========================================================================
+ ./dclearlogs.sh
  Distributed Clear Logs: this script connects to all machines in the
  net.txt file and deletes the folder /tmp/peerware.
+ ./startAll.sh
  Start All Peers: this script starts all peers on all machines. Before
  executing this script make sure that you have generated the respective
  configuration files for each peer through the ./1-graphgen.sh script.
+ ./2-nodes.sh
  Start All Peers on a particular machine: you can use this script to
  start all peers on a particular machine. Just ssh to that machine and
  then execute this script. Assuming that the respective conf/ and
  scripts/ files are in place, this should start all peers on that
  machine.
+ ./2-node.sh [node_id]
  Start peer [node_id] on this particular machine: you can use this
  script to start a particular peer on a particular machine - in case the
  peer process crashed or for some other reason. Just ssh to that machine
  and then execute this script. Assuming that the respective conf/ and
  scripts/ files are in place, this should start the respective peer.
+ ./stopAll.sh
  Stop All Peers: this script stops all peers on all machines (based on
  net.txt). If you want to kill all processes individually at a host use:
    $ ./ka.sh
  The above will kill all java processes on a particular machine.
+ ./pcstarted.sh
  PCs Started: this script can be invoked at any time after all peers are
  started with ./startAll.sh. It will allow you to see which PCs were not
  successfully started so that you can take them off the network.
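Returning briefly to the ADDING/CHANGING data sources section above: the
"edit config.txt, then rebuild" cycle can be scripted. The sketch below is
hypothetical - it assumes (as the examples in that section suggest) that
config.txt uses a simple KEY=VALUE format, and the helper names are my
own, not part of the distribution:

```shell
# Replace the DATA_XML_PATH line in the given config file (KEY=VALUE
# format assumed from the examples in the data-sources section).
set_xml_path() {
    sed -i "s|^DATA_XML_PATH=.*|DATA_XML_PATH=$2|" "$1"
}

# Hypothetical wrapper: point config.txt at a new XML path, then rebuild
# the Lucene indexes (one index per xml file is written into data/).
rebuild_indexes() {
    set_xml_path config.txt "$1"
    ./buildIndexes.sh
}
```

A run such as `rebuild_indexes //XML//CAR//DESCR` would then index only
the description elements, as in the example above.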
+ ./dtop
  Distributed Top: this script performs a sequential top at each remote
  host. It is particularly useful for checking whether all remote
  processes were successfully killed.

C) Compilation
=========================================================================
Although the current tar file contains a precompiled version of Peerware,
you can use "make" to rebuild all or part of it. You have the following
options:
 - make          => compiles everything (graph generator, core classes,
                    peer/query node, and lucene tools into classes/, and
                    java documentation into docs/javadoc)
 - make graphgen => compiles the graph generator into classes/
 - make core     => compiles the core protocol into classes/
 - make app      => compiles the peer/query node into classes/
 - make javadoc  => compiles the java documentation into docs/javadoc
 - make tools    => compiles tools helpful for constructing lucene indices
 - make clean    => deletes all classes - so that you can compile from
                    scratch
 - make fresh    => deletes all files generated during the execution of
                    one or more components of the peerware system
                    (temporary files, statistics, etc.)
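Many of the auxiliary scripts in section B (./dclearlogs.sh, ./startAll.sh,
./stopAll.sh, ./dtop) share the same pattern: iterate over the hosts in
net.txt and run a command on each one via ssh. A hypothetical sketch of
that pattern - the function name and its details are mine, not taken from
the peerware sources:

```shell
# Run the given command on every host listed in net.txt, one host at a
# time, printing a header per host (in the spirit of ./dtop).
for_each_host() {
    cmd=$1
    while read -r host; do
        [ -z "$host" ] && continue   # skip blank lines in net.txt
        echo "== $host =="
        ssh "$host" "$cmd"           # e.g. 'top -b -n 1 | head' or 'rm -rf /tmp/peerware'
    done < net.txt
}
```

For example, `for_each_host 'rm -rf /tmp/peerware'` would behave roughly
like a distributed log cleanup.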