**Abstract**

Peer-to-peer (P2P) databases are becoming prevalent on the Internet for distribution and sharing of documents, applications, and other digital media. The problem of answering large-scale ad hoc analysis queries, for example, aggregation queries, on these databases poses unique challenges. Exact solutions can be time consuming and difficult to implement, given the distributed and dynamic nature of P2P databases. In this paper, we present novel sampling-based techniques for approximate answering of ad hoc aggregation queries in such databases. Computing a high-quality random sample of the database efficiently in the P2P environment is complicated by several factors: the data is distributed (usually in uneven quantities) across many peers; within each peer, the data is often highly correlated; and, moreover, even collecting a random sample of the peers is difficult to accomplish. To counter these problems, we have developed an adaptive two-phase sampling approach based on random walks of the P2P graph, as well as block-level sampling techniques. We present extensive experimental evaluations to demonstrate the feasibility of our proposed solution.


**Two Phases of Approximate Query Processing (AQP)**

1. Offline pre-processing of the database

Ø E.g., generate histograms or random samples

Ø OK to use considerable space and time (hours)

2. Runtime query processing

Ø Query answers must be fast (seconds)

Ø Only time to access a small amount of data

Ø E.g., extrapolate from a random sample
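As a toy illustration of the runtime step, a uniform random sample can be extrapolated to full-table aggregates: AVG is estimated by the sample mean, while SUM and COUNT scale the sample statistics up by the table size. This is a generic sketch; the class and method names are illustrative and not part of the system described here.

```java
// Sketch: answering aggregates from a uniform random sample of n tuples
// drawn from a table of N tuples. Names are illustrative only.
public class SampleExtrapolation {

    // AVG is estimated directly by the sample mean.
    public static double estimateAvg(double[] sample) {
        double sum = 0;
        for (double v : sample) sum += v;
        return sum / sample.length;
    }

    // SUM scales the sample mean up by the table size N.
    public static double estimateSum(double[] sample, long tableSize) {
        return estimateAvg(sample) * tableSize;
    }

    // COUNT of tuples satisfying a predicate (here: value > threshold):
    // the matching fraction of the sample, scaled up by N.
    public static double estimateCountAbove(double[] sample, long tableSize,
                                            double threshold) {
        int hits = 0;
        for (double v : sample) if (v > threshold) hits++;
        return (double) hits / sample.length * tableSize;
    }
}
```

For example, given the sample {2, 4, 6} from a table of 6 tuples, the estimated AVG is 4 and the estimated SUM is 24.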

**Small Group Sampling**

The main idea is to treat small and large groups differently.

Large groups:

Ø Use a uniform random sample

Ø Large groups are well represented in a uniform sample, so their aggregates are estimated with good quality

**Existing System**

Let us now discuss what it takes for sampling-based AQP techniques to be incorporated into P2P systems. We first observe that two main approaches have emerged for constructing P2P networks today: structured and unstructured. Structured P2P networks (such as Pastry [33] and Chord [37]) are organized in such a way that data items are located at specific nodes in the network, and nodes maintain some state information to enable efficient retrieval of the data. This organization maps data items to particular nodes and assumes that all nodes are equal in terms of resources, which can lead to bottlenecks and hot spots. Our work focuses on unstructured P2P networks, which make no assumptions about the location of data items at nodes and allow nodes to join the system at random times and depart without a priori notification. Several recent efforts have demonstrated that unstructured P2P networks can be used efficiently for multicast, distributed object location, and information retrieval.

**Limitation of Existing System**

Ø It involves scanning the entire P2P repository, which is difficult.

Ø Since no centralized storage exists, it is not clear where the precomputed sample should reside.

Ø The very dynamic nature of P2P systems means that precomputed samples quickly become stale unless they are frequently refreshed.

**Proposed System**

We briefly describe the framework of our approach. Essentially, we abandon trying to pick true uniform random samples of the tuples, as such samples are likely to be extremely impractical to obtain. Instead, we consider an approach where we are willing to work with skewed samples, provided that we can accurately estimate the skew during the sampling process. To get the accuracy in the query answer desired by the user, our skewed samples can be larger than the size of a corresponding uniform random sample that delivers the same accuracy; however, our samples are much more cost efficient to generate.

Our approach has two major phases. In the first phase, we initiate a fixed-length random walk from the query node. This random walk should be long enough to ensure that the visited peers represent a close sample from the underlying stationary distribution (the appropriate length of such a walk is determined in a preprocessing step). We then retrieve certain information from the visited peers, such as the number of tuples, the aggregate of tuples (for example, SUM, COUNT, AVG, and so forth) that satisfy the selection condition, and send this information back to the query node. This information is then analyzed at the query node to determine the skewed nature of the data that is distributed across the network, such as the variance of the aggregates of the data at peers, the amount of correlation between tuples that exists within the same peers, the variance in the degrees of individual nodes in the P2P graph (recall that the degree has a bearing on the probability that a node will be sampled by the random walk), and so on.

Once this data has been analyzed at the query node, an estimate is made of how many more samples are required (and in what way these samples should be collected) so that the original query can be optimally answered within the desired accuracy, with high probability. For example, the first phase may recommend that the best way to answer this query is to visit m0 more peers and, from each peer, randomly sample t tuples. We note that the first phase is not overly driven by heuristics. Instead, it is based on underlying theoretical principles such as the theory of random walks, as well as statistical techniques such as cluster sampling, block-level sampling, and cross-validation.
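The actual planner rests on cluster and block-level sampling theory, but as a much simplified sketch of the underlying idea, the classical normal-approximation bound m ≥ (z·σ/ε)² already shows how a variance estimate from the first phase translates into a second-phase sample budget. The class and method names below are illustrative assumptions, not the paper's algorithm.

```java
// Simplified sketch (NOT the actual planner): given the standard
// deviation sigma estimated in the first phase, compute how many samples
// are needed so the sample mean lands within +/- epsilon of the true
// mean, at the confidence level implied by the normal z-score.
public class PhaseOnePlanner {

    // Normal-approximation bound: m >= (z * sigma / epsilon)^2.
    public static long requiredSamples(double sigma, double epsilon, double z) {
        return (long) Math.ceil(Math.pow(z * sigma / epsilon, 2));
    }
}
```

For instance, with sigma = 10, epsilon = 1, and z = 2 (roughly 95 percent confidence), about 400 samples would be recommended; higher observed skew (larger sigma) directly inflates the second-phase budget.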

The second phase is then straightforward: A random walk is reinitiated, and tuples are collected according to the recommendations made by the first phase. Effectively, the first phase is used to “sniff” the network and determine an optimal-cost “query plan,” which is then implemented in the second phase. For certain aggregates such as COUNT and SUM, further optimizations may be achieved by pushing the selections and aggregations to the peers; that is, the local aggregates instead of raw samples are returned to the query node, which are then composed into a final answer.

We introduce the important problem of AQP in P2P databases, which is likely to be of increasing significance in the future.

**Advantages of Proposed System**

Peer-to-peer systems are compelling for many reasons:

Ø Scalability

Ø Robustness

Ø Lack of need for administration

Ø Anonymity and resistance to censorship

**The modules that are included in this project are**

Ø Peer-to-Peer Node Construction

Ø Random Selection of Node

Ø Random Selection of Records

Ø Performance Evaluation

**Module 1: Peer-to-Peer Node Construction**

In this module, we establish the peer-to-peer connections. We construct one server and multiple clients, so that the system can act as a distributed model. The server maintains information about all the clients.
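A minimal sketch of this wiring in Java (the language named in the software specification) might look as follows. The registration protocol, class name, and message format are invented for illustration; a real deployment would track many peers concurrently.

```java
import java.io.*;
import java.net.*;

// Illustrative sketch: a server that registers one connecting peer and
// acknowledges it, plus a client helper that performs the registration.
public class PeerServer {
    private final ServerSocket server;

    public PeerServer() throws IOException {
        server = new ServerSocket(0); // bind to any free port
    }

    public int port() { return server.getLocalPort(); }

    // Accept a single peer, read its name, reply with an acknowledgement.
    public String acceptOnePeer() throws IOException {
        try (Socket peer = server.accept();
             BufferedReader in = new BufferedReader(
                 new InputStreamReader(peer.getInputStream()));
             PrintWriter out = new PrintWriter(peer.getOutputStream(), true)) {
            String name = in.readLine();
            out.println("REGISTERED " + name);
            return name;
        }
    }

    // Client side: connect, announce our name, return the server's reply.
    public static String registerWithServer(String host, int port, String name)
            throws IOException {
        try (Socket s = new Socket(host, port);
             PrintWriter out = new PrintWriter(s.getOutputStream(), true);
             BufferedReader in = new BufferedReader(
                 new InputStreamReader(s.getInputStream()))) {
            out.println(name);
            return in.readLine();
        }
    }
}
```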

**Module 2: Random Selection of Node**

It is well known that if this walk is carried out long enough, the probability of reaching any peer p converges to a stationary distribution.

To make this more precise, let P = {p1, p2, . . . , pM} denote the entire set of peers.
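For an undirected, connected, non-bipartite P2P graph, this stationary distribution assigns each peer a probability proportional to its degree, prob(p) = degree(p) / (2|E|), so high-degree peers are visited more often. A small simulation (the graph, step count, and names below are illustrative) shows visit frequencies converging to those values:

```java
import java.util.Random;

// Sketch: simulating a random walk on a small undirected P2P graph and
// measuring how often each peer is visited.
public class RandomWalk {

    // adj[i] lists the neighbours of peer i.
    public static double[] visitFrequencies(int[][] adj, int steps, long seed) {
        Random rng = new Random(seed);
        int[] visits = new int[adj.length];
        int current = 0; // start the walk at peer 0 (the query node)
        for (int i = 0; i < steps; i++) {
            // Move to a uniformly chosen neighbour of the current peer.
            current = adj[current][rng.nextInt(adj[current].length)];
            visits[current]++;
        }
        double[] freq = new double[adj.length];
        for (int p = 0; p < adj.length; p++) {
            freq[p] = (double) visits[p] / steps;
        }
        return freq;
    }
}
```

On a four-peer graph with degrees 3, 2, 2, 1 (so 2|E| = 8), a long walk visits the peers with frequencies close to 3/8, 2/8, 2/8, and 1/8.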

**Module 3: Random Selection of Records**

Consider a fixed-size sample of peers S = {s1, s2, . . . , sm}, where each si is drawn from P. This sample is picked by the random walk in the first phase. We can approximate this process as picking peers in m rounds, where in each round a random peer si is picked from P with probability prob(si). We also assume that peers may be picked with replacement; that is, multiple copies of the same peer may be added to the sample, as this greatly simplifies the statistical derivations below.
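Because peers are drawn with replacement with known (possibly unequal) probabilities prob(si), a network-wide total can be estimated with the standard Hansen-Hurwitz estimator, which divides each sampled peer's local aggregate by its selection probability and averages: this is how a skewed sample can still yield an unbiased answer. The class name is an illustrative assumption.

```java
// Standard Hansen-Hurwitz estimator for with-replacement sampling:
// given local aggregates a(s1)..a(sm) and selection probabilities
// prob(s1)..prob(sm), the unbiased estimate of the network-wide total is
//   (1/m) * sum_i a(si) / prob(si).
public class SkewCorrectedEstimate {

    public static double estimateTotal(double[] localAggregates, double[] probs) {
        int m = localAggregates.length;
        double sum = 0;
        for (int i = 0; i < m; i++) {
            sum += localAggregates[i] / probs[i];
        }
        return sum / m;
    }
}
```

Intuitively, a peer that the random walk visits often (high prob(si)) is down-weighted, and a rarely visited peer is up-weighted, compensating for the skew measured in the first phase.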

**Module 4: Performance Evaluation**

Our algorithms are evaluated based on the cost of execution and how close they get to the desired accuracy. As discussed earlier, we use latency as a measure of our cost, noting that in our case, it is proportional to the number of peers participating in the random walk. In fact, if the number of tuples to be sampled is the same for all peers (which is true in our experiments), then latency is also proportional to the total number of sample tuples drawn by the overall algorithm. Thus, we use the number of sample tuples used as a surrogate for latency in describing our results.

**Benefits of Query Processing**

Query processing of this kind has the following characteristics and application areas:

Ø Large data warehouses

Ø Gigabytes to terabytes of data

Ø Data analysis applications

Ø Decision support

Ø Data mining

Ø Query characteristics:

Ø Queries access a large fraction of the database

Ø Queries seek to identify general patterns/trends

**Hardware Specification**

Ø Hard disk : 40 GB

Ø RAM : 512 MB

Ø Processor Speed : 3.00 GHz

Ø Processor : Pentium IV Processor

**Software Specification**

Ø Operating System : Windows XP Professional

Ø Required Software : JDK 1.5

Ø Front End : Java Swing

Ø Back End : MS SQL Server 2000