pStore: A Secure Distributed Backup System

 Kenneth Barr, Christopher Batten, 
  Arvind Saraf, Stanley Trepetin
 6.824 Project Proposal
 October 18, 2001

Current backup systems for personal and small-office computer users
usually rely on secondary on-site storage of their data.  Although
these on-site backups provide data redundancy, they are vulnerable
to localized catastrophe.  More sophisticated off-site backups are
possible but are usually expensive, difficult to manage, and are still
a centralized form of redundancy.  We propose pStore, a distributed
backup system based on an adaptive peer-to-peer network.  pStore
exploits the growing amount of unused personal hard drive space
attached to the Internet to provide the distributed redundancy needed
for reliable and effective data backup.  

Approach

We envision users who wish to achieve redundant, distributed backups
will be willing to donate some of their hard drive space to the pStore
community in exchange for some amount of space distributed amongst
the community.  A pStore participant (or his automatic software agent)
will mark certain files for backup and insert them into the pStore
peer-to-peer network as they are changed.  The network will be
bestowed with the ability to appropriately replicate and distribute
the files to assure they can be recovered in situations where node(s)
are removed from the network or node(s) remove the files.  Backed up
files may be retrieved from the network when needed.

We are currently evaluating several peer-to-peer frameworks such as
Gnutella, Freenet, and Chord to see if they meet our needs.  Usage of
the pStore peer-to-peer network is likely to be significantly
different than traditional file-sharing or one-to-many publication
usage patterns.  In pStore, data is inserted into the network often
but that data is only requested rarely and always by the original
inserter.  We are also considering a custom peer-to-peer framework to
better handle this usage pattern.

Furthermore we must consider file compression to save bandwidth, file
encryption and authentication to secure data, file caching to avoid
retrieval from the network in some cases, and file versioning to
handle updates to previously stored files.  We must decide where to
store the data and how many copies must be stored and how to manage
the data store when it becomes full.  We must establish the set of
metadata needed to support these operations and allow future extensions.  
The user interface could be as simple as a command line interface or
something more complex such as a GUI or file system interface.

Challenges

Initial brainstorming and conversation has yielded the following
challenges.  We plan on discussing these challenges and selecting a
subset on which to focus our initial implementation. 

+ Social Contract:  We must construct a compelling model which
  entices people to participate in the peer-to-peer network.  While it
  is likely that a single user can not fill his hard drive to capacity
  with his own data, he will still be reluctant to let other people
  share the space unless he is confident he will get something in
  return.  We need to enforce this sort of quota or this system will be
  restrict to use in trusted environments. 

+ Ease of use, retrieval:  The retrieval system could be
  implemented as a file system inspired by Network Appliances .snapshot
  directories and the Plan 9 dumpfs.  Semantics for such a file system
  are unclear and must be defined.  The user interface could be
  completely transparent (eg, a daemon started at boot-time).  We could
  then add configurability to support advanced users (eg, fine-grained
  selection of files to backup, priority, server selection, manual
  backups - perhaps through a writable filesystem interface).  

+ Encryption and Security:  Users must feel confident that their
  data cannot be read or altered by those who volunteer to store it.
  Long keys should be chosen to resist an attack even in the future when
  computer hardware is faster.  Key revocation must be considered to
  handle lost or stolen keys.  Anonymity should be considered so users
  do not know what they are storing.  This protects them and reduces the
  incentive to meddle with it.  The system must be resistant to
  crackers.

+ Reliability through redundancy:  Services such as Idrive and
  Livevault seem to be obvious single points of failure.  Peer-to-peer
  lets us replicate data quickly to multiple hosts.  If one host is lost
  the data is likely to exist on several others.  More important data
  could be replicated to a greater extent.  If network
  performance/proximity metrics could be collected, they may influence
  the decision of where to place data. 

+ Impact to network traffic:  Many users will be content to
  backup their personal data, but others will want to store entire disk
  images.  We believe block-sharing (SFSRO) is a difficult problem in an
  encrypted environment.  We must be conscious of the impact replicating
  gigabytes of data will have on the Internet. 

+ Choice of Peer-to-peer system:  Our system functions in the
  opposite direction of Napster and Gnutella (Our common operation is
  the push of data into the network rather than retrieval from peers).
  In addition, we have different goals from systems such as Freenet
  (which is push-centric, but strives for anonymity and availability of
  frequently requested documents).  While we will investigate existing
  systems, we foresee ourselves implementing a simple system of our own
  if only to have a deep understanding of its operation.  Chord offers a
  proven way of locating data, but as a low-level primitive, it does not
  consider our unique direction of traffic or (in)frequency of
  requests. 

+ Versioning:  While most users will want to retrieve the most
  up-to-date backup, others will want to retrieve particular backed up
  versions.  We hope to support both schemes.  This may entail
  non-encrypted expiration data and storing file differences instead of
  new copies. 

Evaluation

Our evaluation will include two parts: a proof-of-concept implementation
on a small number of machines and simulations illustrating how our system
scales to larger networks.  We will examine and explain various
trade-offs in the system design and how our system responds to node
failures.  We hope to show that pStore is a viable distributed backup
system.