Publications

On the interplay between data redundancy and retrieval times in P2P storage systems

Lluís Pàmies-Juárez, Marc Sánchez-Artigas, Pedro García-López, Rubén Mondéjar, Rahma Chaabouni

Computer Networks, Vol. 59. 2014.

Abstract

Peer-to-peer (P2P) storage systems aggregate spare storage resources from end users to build a large collaborative online storage solution. In these systems, however, the high levels of user churn (peers failing or leaving temporarily or permanently) affect the quality of the storage service and might put data reliability at risk. Indeed, one of the main challenges of P2P storage systems has traditionally been how to guarantee that stored data can always be retrieved within some time frame. To meet this challenge, existing systems store objects with high amounts of data redundancy, rendering data availability values close to 100%, which in turn ensures optimal retrieval times (constrained only by network limits). Unfortunately, this redundancy reduces the overall net capacity of the system and increases data maintenance costs. To alleviate these problems, data redundancy can be reduced at the expense of longer retrieval times. The problem is that both the rewards and the disadvantages of doing so are not well understood. In this paper we present a novel analytical framework that allows us to model retrieval times in P2P storage systems and to describe the interplay between data redundancy and retrieval times under different churn patterns. Using availability traces from real P2P applications, we show that our framework provides accurate estimates of retrieval times in realistic environments.
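
As a back-of-the-envelope illustration of the tradeoff this abstract describes (a textbook availability model, not the paper's actual framework): if an object is replicated on k peers, each independently online with probability p, it is immediately retrievable with probability

    A(k) = 1 - (1 - p)^k

With p = 0.5, pushing A(k) above 99.9% already requires k >= 10, multiplying storage and repair costs tenfold; shrinking k saves that cost but means a request may have to wait for offline peers to return, which is exactly the retrieval-time effect the framework quantifies.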

PoWerStore: Proofs of Writing for Efficient and Robust Storage

Dan Dobre, Ghassan O. Karame, Wenting Li, Matthias Majuntke, Neeraj Suri, Marko Vukolić

20th ACM Conference on Computer and Communications Security (CCS'13). November 4-8, 2013. Berlin, Germany.

Abstract

Existing Byzantine fault tolerant (BFT) storage solutions that achieve strong consistency and high availability are costly compared to solutions that tolerate simple crashes. This cost is one of the main obstacles to deploying BFT storage in practice.

In this paper, we present PoWerStore, a robust and efficient data storage protocol. PoWerStore’s robustness comprises tolerating network outages, the maximum number of Byzantine storage servers, any number of Byzantine readers and crash-faulty writers, and guaranteeing high availability (wait-freedom) and strong consistency (linearizability) of read/write operations. PoWerStore’s efficiency stems from combining lightweight cryptography, erasure coding and metadata write-backs, where readers write back only metadata to achieve strong consistency. Central to PoWerStore is the concept of “Proofs of Writing” (PoW), a novel data storage technique inspired by commitment schemes. PoW rely on a two-round write procedure, in which the first round writes the actual data and the second round only serves to “prove” the occurrence of the first round. PoW enable efficient implementations of strongly consistent BFT storage through metadata write-backs and low-latency reads.
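
The two-round write behind PoW can be illustrated with a plain hash commitment. The sketch below is a toy, single-process illustration under assumed names (Server, write), not the PoWerStore protocol itself; it omits Byzantine quorums, erasure coding and metadata handling entirely:

    import hashlib, os

    class Server:
        """Toy storage server: serves a value only once its write is proven."""
        def __init__(self):
            self.store = {}   # ts -> (value, commitment)
            self.proofs = {}  # ts -> revealed nonce

        def write_round1(self, ts, value, commitment):
            self.store[ts] = (value, commitment)

        def write_round2(self, ts, nonce):
            # Accept the proof only if it opens the stored commitment.
            value, commitment = self.store[ts]
            if hashlib.sha256(nonce).hexdigest() == commitment:
                self.proofs[ts] = nonce

        def read(self, ts):
            # Only writes whose first round has been "proven" are readable.
            return self.store[ts][0] if ts in self.proofs else None

    def write(servers, ts, value):
        nonce = os.urandom(16)
        commitment = hashlib.sha256(nonce).hexdigest()
        for s in servers:  # round 1: store the data plus the commitment
            s.write_round1(ts, value, commitment)
        for s in servers:  # round 2: reveal the nonce, proving round 1 happened
            s.write_round2(ts, nonce)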

We implemented PoWerStore and show its improved performance when compared to existing robust storage protocols, including protocols that tolerate only crash faults.

Reducing Costs in the Personal Cloud: Is BitTorrent a Better Bet?

Rahma Chaabouni, Marc Sánchez-Artigas, Pedro García-López

IEEE P2P'14. September 8-12, 2014. London, England.

Abstract

Lately, Personal Cloud storage services, like Dropbox, have emerged as user-centric solutions that provide easy management of the users’ data. To meet the requirements of their clients, such services require a huge amount of storage and bandwidth. In an attempt to reduce these costs, we focus on maximizing the benefit that can be derived from the interest of users in the same content by introducing the BitTorrent protocol. In general, it is assumed that BitTorrent is only effective for large files and/or large swarms, while the client-server approach is better suited for small files and/or small swarms. However, there is still no concrete study on the comparative efficiency of both protocols for small files.

In this paper, we study the download time and offload ratio of BitTorrent compared to HTTP. Based on this study, we propose an algorithm that manages both protocols, choosing between them based on a prediction of their efficiency for each request. We validate our algorithm on a real trace of the Ubuntu One file service, achieving significant savings in cloud bandwidth without degrading the download time.
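
The abstract does not spell out the prediction model, so the following sketch only illustrates the shape of such a protocol-selection rule, with made-up capacity parameters: estimate both download times from crude bandwidth models and pick whichever protocol is predicted to be faster.

    def predict_http_time(size_bytes, server_bw):
        """Client-server: the cloud uploads the whole file to the requester."""
        return size_bytes / server_bw

    def predict_bt_time(size_bytes, server_bw, peers, peer_up_bw, overhead=1.2):
        """Crude swarm model: seeder and peers pool upload capacity; a constant
        factor stands in for tracker and piece-exchange overhead."""
        return overhead * size_bytes / (server_bw + peers * peer_up_bw)

    def choose_protocol(size_bytes, server_bw, peers, peer_up_bw):
        # A lone downloader gains nothing from BitTorrent.
        if peers == 0:
            return "http"
        bt = predict_bt_time(size_bytes, server_bw, peers, peer_up_bw)
        return "bittorrent" if bt < predict_http_time(size_bytes, server_bw) else "http"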

SDGen: Mimicking Datasets for Content Generation in Storage Benchmarks

Raúl Gracia-Tinedo, Danny Harnik, Dalit Naor, Dmitry Sotnikov, Sivan Toledo, Aviad Zuck

13th USENIX Conference on File and Storage Technologies (FAST '15). February 16-19, 2015. Santa Clara, CA, USA.

Abstract

Storage system benchmarks either use samples of proprietary data or synthesize artificial data in simple ways (such as using zeros or random data). However, many storage systems behave completely differently on such artificial data than they do on real-world data. This is the case with systems that include data reduction techniques, such as compression and/or deduplication.

To address this problem, we propose a benchmarking methodology called mimicking and apply it in the domain of data compression. Our methodology is based on characterizing the properties of real data that influence the performance of compressors. Then, we use these characterizations to generate new synthetic data that mimics the real data in many aspects of compression. Unlike current solutions that only address the compression ratio of data, mimicking is flexible enough to also emulate compression times and data heterogeneity. We show that these properties matter to the system’s performance.

In our implementation, called SDGen, characterizations take at most 2.5 KB per data chunk (e.g., 64 KB) and can be used to efficiently share benchmarking data in a highly anonymized fashion; sharing it carries few or no privacy concerns. We evaluated our data generator’s accuracy on compressibility and compression times using real-world datasets and multiple compressors (lz4, zlib, bzip2 and lzma). As a proof of concept, we integrated SDGen as a content generation layer in two popular benchmarks (LinkBench and Impressions).
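
As a minimal stand-in for the idea (SDGen's actual characterization format is not given here, and it additionally captures compression times and heterogeneity), one can summarize a chunk by its measured compressibility and then synthesize a fresh chunk that mixes incompressible and trivially compressible bytes to land near the same ratio:

    import os
    import zlib

    def characterize(chunk):
        """Tiny per-chunk summary: just the measured zlib compression ratio."""
        return {"size": len(chunk), "ratio": len(zlib.compress(chunk)) / len(chunk)}

    def mimic(char):
        """Synthesize a chunk with roughly the same zlib ratio: the random bytes
        stay incompressible, the repeated tail compresses to almost nothing."""
        k = int(char["size"] * char["ratio"])
        return os.urandom(k) + b"A" * (char["size"] - k)

    # Hypothetical usage: synthetic = mimic(characterize(original_chunk))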

Separating Data and Control: Asynchronous BFT Storage with 2t+1 Data Replicas

Christian Cachin, Dan Dobre, Marko Vukolić

16th International Symposium on Stabilization, Safety, and Security of Distributed Systems (SSS 2014). September 28-October 1, 2014. Paderborn, Germany.

Abstract

The overhead of Byzantine fault tolerant (BFT) storage is a primary concern that prevents its adoption in practice. The cost stems from the need to maintain at least 3t + 1 copies of the data at different storage replicas in the asynchronous model, so that t Byzantine replica faults can be tolerated. This paper presents MDStore, the first fully asynchronous BFT storage protocol that reduces the number of replicas that store the payload data to as few as 2t + 1 and maintains metadata at 3t + 1 replicas on (possibly) different servers. At the heart of MDStore lies a metadata service built upon a new abstraction called "timestamped storage." Timestamped storage allows for conditional writes (facilitating the implementation of the metadata service) and has consensus number one (making it implementable with wait-free semantics in an asynchronous system despite faults). In addition to its low replication overhead, MDStore offers strong guarantees by emulating a multi-writer multi-reader atomic register, providing wait-free termination, and tolerating any number of Byzantine readers and crash-faulty writers.
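
A minimal, single-process sketch of the "timestamped storage" interface as the abstract describes it; MDStore's actual replicated, Byzantine-tolerant realization is of course far more involved:

    class TimestampedStorage:
        """A register holding (timestamp, value) that only supports
        timestamp-conditional updates."""
        def __init__(self):
            self.ts, self.value = 0, None

        def read(self):
            return self.ts, self.value

        def cond_write(self, ts, value):
            # The write takes effect only with a strictly newer timestamp,
            # so stale or delayed writers can never roll the state back.
            if ts > self.ts:
                self.ts, self.value = ts, value
                return True
            return False

The conditional update is the feature the abstract points to: strong enough to build the metadata service on, yet weak enough (consensus number one) to admit a wait-free asynchronous implementation.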

Smart Cloud Seeding for BitTorrent in Datacenters

Xavier León, Rahma Chaabouni, Marc Sánchez-Artigas, Pedro García-López

IEEE Internet Computing, Vol. 18. 2014, pp. 47-54.

Abstract

Cloud content providers must deliver vast amounts of data to an ever-growing number of users while maintaining responsive performance, thus increasing bandwidth-provisioning expenditures. To mitigate this problem, the authors transparently integrate BitTorrent into the cloud provider infrastructure and leverage users’ upstream capacity to reduce bandwidth costs. They also allocate seeder bandwidth optimally among swarms to maximize throughput. Their system delivers higher performance than the traditional client-server paradigm when dealing with large volumes of data.
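
The article's allocation objective is not given in this summary, so the sketch below shows only one generic policy for splitting a seeder's upload capacity among swarms (all names hypothetical): an equal-share, water-filling split in which capacity declined by low-demand swarms is recycled to the rest.

    def allocate(total_bw, demands):
        """Split seeder bandwidth across swarms, capping each at its demand."""
        alloc = {s: 0.0 for s in demands}
        remaining, active = float(total_bw), set(demands)
        while active and remaining > 1e-9:
            share = remaining / len(active)   # equal share per round
            remaining = 0.0
            for s in list(active):
                give = min(share, demands[s] - alloc[s])
                alloc[s] += give
                remaining += share - give     # recycle what this swarm declined
                if alloc[s] >= demands[s] - 1e-9:
                    active.remove(s)
        return alloc

    # e.g., allocate(10, {"swarm-a": 2, "swarm-b": 20}) gives a 2 and b 8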

StackSync: Bringing Elasticity to Dropbox-like File Synchronization

Pedro García-López, Marc Sanchez-Artigas, Sergi Toda, Cristian Cotes, John Lenton

ACM/IFIP/USENIX Middleware'14. December 8-14, 2014. Bordeaux, France.

Abstract

The design of elastic file synchronization services like Dropbox is an open and complex issue that the major commercial providers have not yet unveiled, as it includes challenges like fine-grained programmable elasticity and efficient change notification to millions of devices. In this paper, we propose a novel architecture for file synchronization that aims to solve these two major challenges. At the heart of our proposal lies ObjectMQ, a lightweight framework for providing programmatic elasticity to distributed objects using messaging. The efficient use of indirect communication: i) enables programmatic elasticity based on queue message processing; ii) simplifies change notifications by offering simple unicast and multicast primitives; and iii) provides transparent load balancing based on queues.
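
ObjectMQ's real API is not shown here, so the toy broker below (hypothetical names throughout) merely illustrates the indirection pattern the abstract lists: per-device queues, a unicast primitive, and a multicast primitive that fans a change notification out to every device subscribed to a shared folder.

    from collections import defaultdict, deque

    class Broker:
        """In-memory stand-in for a messaging layer like ObjectMQ."""
        def __init__(self):
            self.queues = defaultdict(deque)  # device_id -> pending changes
            self.groups = defaultdict(set)    # share_id -> subscribed devices

        def subscribe(self, share_id, device_id):
            self.groups[share_id].add(device_id)

        def unicast(self, device_id, change):
            self.queues[device_id].append(change)

        def multicast(self, share_id, change):
            # Fan the change out to every subscriber of the shared folder.
            for device_id in self.groups[share_id]:
                self.queues[device_id].append(change)

        def poll(self, device_id):
            q = self.queues[device_id]
            return q.popleft() if q else None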

Our reference implementation is StackSync, an open source elastic file synchronization Cloud service developed in the context of the FP7 project CloudSpaces. StackSync supports both predictive and reactive provisioning policies on top of ObjectMQ that adapt to real traces from the Ubuntu One service. The feasibility of our approach has been extensively validated with an open benchmark, including commercial synchronization services like Dropbox or OneDrive.

StoreSim: Optimizing Information Leakage in Multicloud Storage Services

H. Zhuang, R. Rahman, K. Aberer

IEEE CloudCom 2015

Abstract

Many schemes have recently been advanced for storing data on multiple clouds. Distributing data over different cloud storage providers (CSPs) automatically provides users with a certain degree of information leakage control, as no single point of attack can leak all of a user's information. However, unplanned distribution of data chunks can lead to high information disclosure even while using multiple clouds. In this paper, to address this problem we present StoreSim, an information leakage aware multicloud storage system. StoreSim aims to store syntactically similar data on the same cloud, thus minimizing the user's information leakage across multiple clouds. We design an approximate algorithm to efficiently generate similarity-preserving signatures for data chunks based on MinHash and Bloom filters, and also design a function to compute the information leakage based on these signatures. Next, we present an effective storage plan generation algorithm based on clustering for distributing data chunks with minimal information leakage across multiple clouds. Finally, we evaluate our scheme using two real datasets from Wikipedia and GitHub. We show that our scheme can reduce information leakage by up to 60% compared to unplanned placement.
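
The abstract names the ingredients (MinHash and Bloom filters) but not the construction, so this is a generic MinHash sketch rather than StoreSim's algorithm, and it omits the Bloom-filter component: each chunk is summarized by per-seed minimum hashes over its byte shingles, and the fraction of matching minimums estimates the Jaccard similarity of two chunks.

    import hashlib

    def shingles(data, w=8):
        """Sliding w-byte windows: the chunk's feature set."""
        return {data[i:i + w] for i in range(len(data) - w + 1)}

    def minhash(data, num_hashes=64):
        """Similarity-preserving signature: one minimum per salted hash."""
        feats = shingles(data)
        sig = []
        for seed in range(num_hashes):
            salt = seed.to_bytes(4, "big")
            sig.append(min(int.from_bytes(hashlib.sha1(salt + f).digest()[:8], "big")
                           for f in feats))
        return sig

    def similarity(sig_a, sig_b):
        # Matching fraction estimates Jaccard similarity of the shingle sets.
        return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)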

The Curious Case of the PDF Converter that Likes Mozart: Dissecting and Mitigating the Privacy Risk of Personal Cloud Apps

H. Harkous, R. Rahman, B. Karlas, K. Aberer

Privacy Enhancing Technologies Symposium (PETS'15). 2015.

Abstract

Third-party apps that work on top of personal cloud services, such as Google Drive and Dropbox, require access to the user’s data in order to provide some functionality. Through detailed analysis of a hundred popular Google Drive apps from Google’s Chrome store, we discover that the existing permission model is quite often misused: around two-thirds of the analyzed apps are over-privileged, i.e., they access more data than they need to function. In this work, we analyze three different permission models that aim to discourage users from installing over-privileged apps. In experiments with 210 real users, we discover that the most successful permission model is our novel ensemble method, which we call Far-reaching Insights. Far-reaching Insights inform users about the data-driven insights that apps can make about them (e.g., their topics of interest, collaboration and activity patterns, etc.). Thus, they seek to bridge the gap between what third parties can actually know about users and users’ perception of their privacy leakage. The efficacy of Far-reaching Insights in bridging this gap is demonstrated by our results, as Far-reaching Insights prove to be, on average, twice as effective as the current model in discouraging users from installing over-privileged apps. In an effort to promote general privacy awareness, we deployed PrivySeal, a publicly available privacy-focused app store that uses Far-reaching Insights. Based on the knowledge extracted from the data of the store’s users (over 115 gigabytes of Google Drive data from 1440 users with 662 installed apps), we also delineate the ecosystem of third-party cloud apps from the standpoint of developers and cloud providers. Finally, we present several general recommendations that can guide future work in the area of privacy for the cloud. To the best of our knowledge, ours is the first work that tackles the privacy risk posed by third-party apps on cloud platforms in such depth.

Understanding Data Sharing in Private Personal Clouds

Raúl Gracia-Tinedo, Pedro García-López, Alberto Gómez, Anastasio Illana

IEEE 9th International Conference on Cloud Computing (CLOUD 2016). June 27-July 2, 2016. San Francisco, CA, USA.

Abstract

Data sharing in Personal Clouds blurs the lines between online storage and content distribution with a strong social component. Such social information may be exploited by researchers to devise optimized data management techniques for Personal Clouds. Unfortunately, due to the proprietary nature of these systems, data sharing is one of their least studied facets.

In this work, we present the first study of data sharing in a private Personal Cloud. Concretely, we contribute a dataset collected at the metadata back-end of NEC, an enterprise-oriented Personal Cloud. First, our analysis provides a deep inspection of the storage layer of NEC, comparing it with a well-known public vendor (Ubuntu One). Second, we study the social structure of NEC user communities, as well as the storage characteristics of user sharing links, via multiplex network techniques.

Finally, we discuss a battery of data management optimizations for NEC derived from our findings, which may be of independent interest for other similar systems. Our proposals include content distribution, caching and data placement. We believe that both our study and dataset will foster further research in this field.
