Data availability analysis in P2P networks
Date
2012-09-28
Author
Sanaullah, Nazir
Abstract
P2P network architectures have gained popularity as applications for sharing files
between users. A P2P network provides a scalable, robust, and economical storage
architecture. These features have led to the extended use of P2P network applications,
ranging from file sharing to data sharing for video and telecommunication
domains.
The shift from high-cost, reliable servers to user-centered storage devices has
led to reliability and availability problems for P2P networks. Peers are users'
machines, which can go offline at any time, and the data stored on them are
unavailable while they are offline. Data replication, in which multiple copies of
files are placed on different peers in the network, is a common approach for
handling data unavailability. In data replication, peers transfer complete or
partial data to other nodes. Therefore, data replication provides
higher data availability in case of churn.
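
The effect of replication on availability can be illustrated with a minimal sketch. It assumes that peers go offline independently of one another, which is a simplification and not necessarily the model used in this thesis; the function names `data_availability` and `min_replicas` are hypothetical.

```python
from math import prod


def data_availability(peer_availabilities):
    """Probability that at least one replica holder is online.

    Assumes peers fail independently (an illustrative simplification).
    """
    return 1.0 - prod(1.0 - p for p in peer_availabilities)


def min_replicas(peer_availability, target):
    """Smallest k with 1 - (1 - p)^k >= target, for k identical peers."""
    if not 0.0 < peer_availability <= 1.0:
        raise ValueError("peer_availability must be in (0, 1]")
    if not 0.0 <= target < 1.0:
        raise ValueError("target must be in [0, 1)")
    k = 1
    while data_availability([peer_availability] * k) < target:
        k += 1
    return k
```

Under this assumption, two peers that are each online half of the time give a data availability of 0.75, and the required replica count grows quickly as the target availability approaches 1.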
In this thesis, I present data replication algorithms that improve the availability
of data in the network. Because replication increases overhead as well as
availability, the basic challenges in developing a data replication algorithm are: (i) How
many replicas for a data object should be created? (ii) On which peer(s) should the
replicated data objects be stored? (iii) Which files should be replicated?
Initial work in data replication considered the static replication of data based on
the overall availability of nodes in the network. These approaches overestimated
the number of replicas, which led to high maintenance costs. Dynamic approaches
for estimating replica numbers were developed to handle this issue. From the analysis
of the current approaches, I found that the proposed mechanisms for dynamic
approaches to replication did not provide a balanced replication of data. Data were
only replicated to highly available nodes, which were overloaded with data. The
second issue was the inability to adapt to the changing behaviour of peers. In this
thesis, I present an approach that selects a node set comprising both high-availability
and low-availability nodes, in order to provide load balancing in the network.
I provide a feedback-based approach in which previous behaviour is incorporated
into the next behavioural analysis. Compared to the existing approaches to replica
calculation, this approach is able to determine the appropriate number of replicas
and placement locations with the changing dynamics of the system.
The replication system relies on node behaviour prediction algorithms using
Monte Carlo simulation and time series analysis. Each node performs an analysis
on the historical traces of its online and offline times in the network. Each node
shares the availability log with the replication initiator node, and the prediction of
future behaviour is made based on the logs received. The data-owning peer uses this
information to run the replica placement algorithm to select nodes that are present
for a particular duration, supporting each other's presence in the network.
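
The prediction step can be sketched as a small Monte Carlo estimate. This is only an illustration under stated assumptions, not the algorithm developed in this thesis: the availability log is assumed to be a list of days, each holding 24 hourly online/offline flags, and hours are sampled independently. The function names and the log layout are hypothetical.

```python
import random


def hourly_online_probability(trace):
    """Fraction of logged days on which the node was online in each hour.

    trace: list of days, each a list of 24 booleans (True = online).
    """
    days = len(trace)
    return [sum(day[hour] for day in trace) / days for hour in range(24)]


def estimate_window_availability(trace, start_hour, end_hour,
                                 trials=10_000, seed=0):
    """Monte Carlo estimate of the probability that the node stays online
    for every hour in [start_hour, end_hour).

    Hours are sampled independently, which is an illustrative assumption.
    """
    probs = hourly_online_probability(trace)
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    hits = 0
    for _ in range(trials):
        if all(rng.random() < probs[h] for h in range(start_hour, end_hour)):
            hits += 1
    return hits / trials
```

A replication initiator could run such an estimate on each received log and keep only the nodes whose estimated availability for the required window exceeds a chosen threshold.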
The system supports partial data replication by applying a Zipf distribution to
identify the most popular files.
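
As a rough illustration of how a Zipf distribution can drive partial replication, the sketch below assigns each file a request probability proportional to 1 / rank^s and counts how many of the top-ranked files are needed to cover a given share of requests. The exponent, the coverage threshold, and the function names are assumptions made for this example and are not taken from the thesis.

```python
def zipf_weights(n_files, exponent=1.0):
    """Normalised Zipf request probabilities for ranks 1..n_files."""
    raw = [1.0 / rank ** exponent for rank in range(1, n_files + 1)]
    total = sum(raw)
    return [w / total for w in raw]


def files_to_replicate(n_files, coverage, exponent=1.0):
    """Number of top-ranked files whose combined request probability
    reaches the requested coverage (a fraction between 0 and 1)."""
    cumulative = 0.0
    for count, weight in enumerate(zipf_weights(n_files, exponent), start=1):
        cumulative += weight
        if cumulative >= coverage:
            return count
    return n_files
```

With four files and an exponent of 1, the two most popular files already account for about 72% of requests, so replicating only those two would cover a 70% target.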
I evaluated my replication approach against existing dynamic replica
placement algorithms on the following parameters: replica count, reliability
of data, average availability of nodes in the replica set, and failure analysis for
querying data. The replica count analysis shows that the number of replicas required
was almost half that of the previous dynamic approaches. The reliability
analysis shows that the overall reliability of the data was better with this approach
than with the other dynamic replica placement algorithms. My replication algorithm
produced replica sets with a lower average availability than the replica
sets of the other approaches, but the reliability analysis suggests that my approach
distributes data more evenly between nodes, resulting in better overall data availability
in the network than the other approaches achieved.
The failure analysis of data requests shows that my replication algorithm
has a better node selection mechanism than the other approaches, with better
data availability.