NEC Personal Cloud Trace

The NEC dataset integrates two sources of information: the storage layer and sharing interactions. The information used to build our dataset was collected directly by the provider from the back-end (SQL Server and OpenStack Swift).

Regarding storage (CS_FileTraces), the trace is a snapshot of the data store contents (OpenStack Swift). Specifically, the trace contains log lines that identify and describe files (size, extension), as well as the file owner and the container/folder where each file is stored. This enables us to analyze the storage layer of this service in detail.

The sharing trace (CS_SharingTraces) contains log lines describing sharing interactions across users and information about shared files. This trace contains all sharing interactions in the NEC Personal Cloud (Madrid datacenter) from March 7th 2013 to September 9th 2015. To our knowledge, this is the most extensive trace of data sharing in Personal Clouds to date.

Dataset limitations: The collected dataset has some limitations. For instance, we did not trace information about the locations of users (e.g., IP addresses) for privacy reasons. Moreover, it should be noted that while the sharing trace captures the whole sharing activity of users for several months, the storage layer information corresponds to a snapshot at a specific point in time. Consequently, not all the files captured in the sharing trace exist in the storage trace, and vice-versa. The sharing trace also lacks temporal information.

Download the NEC trace (.zip) - 7.7MB (md5: 27f9f5478d7cbffcedc1b317e8765321)

Citation Policy.

To benefit from this dataset in your research you should cite the original measurement paper: "Understanding Data Sharing in Private Personal Clouds". R. Gracia-Tinedo, Pedro García-López, Alberto Gómez and Anastasio Illana. IEEE International Conference on Cloud Computing (CLOUD'16), San Francisco, June 2016 (http://thecloudcomputing.org/2016/).


 

UbuntuOne (U1) Measurement

Protocol Entities

Before describing the trace format, it is important to understand a few concepts:

  1. Volume: it can be considered a directory. During the installation of the U1 client, the client creates an initial volume with id=0 (root/predefined) to store files. There are 3 types of volumes: i) root/predefined, ii) udf (user defined folder, which is a folder created by the user) and iii) share (a sub-volume of another user to which the current user has access).

  2. Node: A node is a file or a directory in the system. We can distinguish between both in the node_type field.

  3. Session: A user interacts with the server in the context of a U1-storage protocol session (not HTTP or any other session type). This session is used to identify the requests of a single user during the session lifetime. Sessions do not usually expire automatically; a client may disconnect, or a server process may go down, and that ends the session.

  4. Request types: There are different request types. Types "storage"/"storage_done" relate to the storage operations of users (put, get, delete, ...). Type "session" denotes operations related to the management of user sessions. Type "rpc" entries are the translation of storage protocol messages (of type "storage*") into RPC calls to the metadata backend.

Protocol Operations

 Meaning of operation types:

  • ListShares: this operation lists all the volumes of a user that are of type "share". In this operation, the field "shared_by" is the owner of the volume and "shared_to" is the user with whom that volume was shared. The field `shares` is analogous to `udfs`; it represents the number of volumes of type share for this user.
  • PutContentResponse/GetContentResponse: these are the actual file uploads and downloads, respectively. The notification goes to the U1 backend, but the actual data is stored in a separate service (Amazon S3).
  • Unlink: Delete a file or a directory from a volume.
  • Make: Request/Response to a touch operation in the backend. Normally this is associated with RPC upload jobs, as a consequence of a file upload.
  • QuerySetCapsResponse: response carrying a capability set, e.g. frozenset([u'resumable-uploads', u'no-content', u'account-info', u'volumes', u'generations', u'fix462230']).
  • GetDeltaResponse: Get the differences between the server volume and the local one (generations).
  • AuthenticateRequest/Response: Operations managed by the servers to create sessions for users.
  • ListVolumes: operation normally performed at the beginning of a session; it lists all the volumes of a user (own, shared, ...).

Trace Description

Trace logfile name. Consider a logfile name such as "production-whitecurrant-23-20140128". All names start with "production-", because we did not look at logs from staging, which is the other prefix. After that prefix comes the name of the physical machine (whitecurrant), followed by the number of the server process (23) and the date the logfile was "cut" (there is one log file per server/service and day). The server process number can migrate between nodes for load balancing.
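
For illustration, a minimal Python sketch (the helper name is ours, not part of the dataset tooling) that splits such a logfile name into its components, assuming the four fields are hyphen-separated as in the example:

    def parse_logfile_name(name):
        # e.g. "production-whitecurrant-23-20140128"
        env, machine, process, date = name.split("-")
        return {"environment": env,       # always "production" in this dataset
                "machine": machine,       # physical machine, e.g. "whitecurrant"
                "process": int(process),  # server process number within the machine
                "date": date}             # day the logfile was cut, YYYYMMDD

    # parse_logfile_name("production-whitecurrant-23-20140128")
    # -> {'environment': 'production', 'machine': 'whitecurrant', 'process': 23, 'date': '20140128'}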

Normally there are between 8 and 16 processes per physical machine. The identifier of the process is unique within a machine, although it can migrate from one machine to another. Sharding happens in the metadata storage backend, behind the point where traces are taken; this means that in these traces any combination of server/process can handle any user. To obtain a strictly sequential notion of the activity of a user, we should take the session into account and sort the trace by timestamp (a user may have more than one parallel connection). A session starts on the least loaded machine and lives on the same node until it finishes, making user events strictly sequential.
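
As an illustration, a minimal pandas sketch of this ordering; the file name is hypothetical, and the column names (sid, tstamp) are described in the list below:

    import pandas as pd

    # Sort by session id and timestamp so that the events of each session
    # appear in strictly sequential order.
    trace = pd.read_csv("u1_trace.csv", parse_dates=["tstamp"])
    ordered = trace.sort_values(["sid", "tstamp"])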

Explanation of the columns in the trace file (.csv); a short loading sketch follows the list:

  • T: trace type. Distinct types of traces have different columns associated.
  • addr: Inbound address and port. Probably not particularly interesting, as they'll be internal addresses.
  • caps: the caps of a client. Newer clients will have more capabilities than older ones. It's like feature flags. This allows U1 to distinguish which clients can perform certain actions, and even to notify a client that it needs to be updated.
  • client_metadata: metadata of a client. For example, operating system, client build information, etc.
  • current_gen: current volume generation. A volume's generation is a monotonic counter increased for every operation on the volume. It is needed for clients to detect changes between the server version and their own.
  • ext: file extension.
  • failed: error message (if any).
  • free_bytes: available space in the volume.
  • from_gen: client generation. This is for an operation that requests all the changes to a volume from a certain generation onwards. The client keeps track of the generation it's at, and on reconnection asks for changes from that point on.
  • hash: sha1sum of the file
  • level: trace level.
  • logfile_id: trace file identifier.
  • method: rpc method name.
  • mime: file type.
  • msg: request message.
  • node_id: id of a file or a directory.
  • nodes: number of files and directories within a volume.
  • pid: # of server process.
  • req_id: request identifier. There are separate entries for the start and end of every operation. `req_id` is a unique identifier for this server and session, where the server is part of the logfile_id (without the date). `req_id` is present in entries of type `storage` and `storage_done` (a specialization of `storage`). `storage` requests are in turn a specialization of `session`.
  • req_t: request type.
  • root: root of the volume. All volumes have a root node that would be similar to the mount point in a file system.
  • server: server type.
  • shared_by: owner of the shared volume.
  • shared_to: user to which the shared folder is shared with.
  • shares: number of shared volumes for this user.
  • sid: id of the ubuntuone-storageprotocol session (not http).
  • size: file size.
  • time: request processing time (database processing).
  • tstamp: trace timestamp.
  • type: node type (here we can distinguish between files and directories).
  • udfs: number of user volumes.
  • user_id: user identifier.
  • user: user name (anonymized using int32 random numbers). Only used when the user starts a session.
  • vol_id: volume identifier. "0" is the identifier of the root volume for users. This means that all the users have a volume with vol_id=0.
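
As referenced above, a minimal loading sketch; since the full trace is large, it is read in chunks and the request types are counted. The file name is hypothetical; the column names follow the list above.

    import pandas as pd

    counts = {}
    for chunk in pd.read_csv("u1_trace.csv", chunksize=1_000_000):
        for req_t, n in chunk["req_t"].value_counts().items():
            counts[req_t] = counts.get(req_t, 0) + n
    # e.g. number of 'storage', 'storage_done', 'session' and 'rpc' entries
    print(counts)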

Download the U1 dataset (.gz) - 84GB (md5: 62cf355792e59e85a95886e8e23b4ae3).

Citation Policy.

To benefit from this dataset in your research you should cite the original measurement paper: "Dissecting UbuntuOne: Autopsy of a Global-scale Personal Cloud Back-end". R. Gracia-Tinedo, Yongchao Tian, Josep Sampé, Hamza Harkous, John Lenton, Pedro García-López, Marc Sánchez-Artigas and Marko Vukolic. ACM Internet Measurement Conference (IMC'15), Tokyo, October 2015 (http://conferences2.sigcomm.org/imc/2015/).

 


 

Active Personal Cloud Measurement

We have created a script named "measurement.py". It takes two arguments:

  • Provider: -p or --provider and [dropbox|box|sugarsync]
  • Test: -t or --type and [load_and_transfer|service_variability]

Depending on the test type, the script executes one of the following files (a sketch of the command-line interface follows the list):

  • load_and_transfer: File load_and_transfer_test.py
  • service_variability: File service_variability.py
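
For illustration, a minimal sketch of such a command-line interface; this is not the actual measurement.py source, and how the provider is passed on to the per-test scripts is an assumption:

    import argparse
    import subprocess
    import sys

    parser = argparse.ArgumentParser(description="Active Personal Cloud measurement")
    parser.add_argument("-p", "--provider", required=True,
                        choices=["dropbox", "box", "sugarsync"])
    parser.add_argument("-t", "--type", dest="test_type", required=True,
                        choices=["load_and_transfer", "service_variability"])
    args = parser.parse_args()

    # Dispatch to the file associated with the chosen test type.
    script = {"load_and_transfer": "load_and_transfer_test.py",
              "service_variability": "service_variability.py"}[args.test_type]
    subprocess.call([sys.executable, script, args.provider])

Example invocation: python measurement.py -p dropbox -t load_and_transfer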

Load and Transfer Workload

The objective of this workload was twofold: measuring the maximum up/down transfer speed of operations, and detecting correlations between the transfer speed and the load of an account. Intuitively, the first objective was achieved by alternating upload and download operations, since the provider only needed to handle one operation per account at a time. We achieved the second point by acquiring information about the load of an account in each API call. This workload was executed continuously at each node as follows: first, the node created synthetic files of a size chosen at random from a predefined set of sizes (see below). The node then uploaded files until the capacity of the account was exhausted. At that point, the node downloaded all the files, also in random order. After each download, the file was deleted.

Implementation: First of all, the script creates 4 files of different sizes (25, 50, 100 and 150 MB). Once done, it starts to upload random files until the account is full and an error is returned by the provider. When this happens, the script starts to download random files from the account, removing each one when its download has finished.
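
A minimal sketch of this loop, assuming a hypothetical client wrapper that exposes upload, download, delete and list operations for one provider account (the actual load_and_transfer_test.py may differ):

    import os
    import random

    SIZES_MB = [25, 50, 100, 150]

    def load_and_transfer(client):
        # 1. Create the four synthetic files.
        for size in SIZES_MB:
            with open("file_%dMB.dat" % size, "wb") as f:
                f.write(os.urandom(size * 1024 * 1024))
        # 2. Upload randomly chosen files (under unique remote names) until
        #    the provider reports that the account is full.
        i = 0
        try:
            while True:
                size = random.choice(SIZES_MB)
                # hypothetical upload signature
                client.upload("file_%dMB.dat" % size, remote_name="file_%d.dat" % i)
                i += 1
        except Exception:
            pass  # the provider returned an error: the account quota is exhausted
        # 3. Download all stored files in random order, deleting each one
        #    after its download finishes.
        remote_files = client.list()
        random.shuffle(remote_files)
        for name in remote_files:
            client.download(name)
            client.delete(name)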

This test runs for approximately 5 days.

Service Variability Workload

This workload maintained a nearly continuous upload and download transfer flow on every node in order to analyze the performance variability of the service over time. This workload provides an appropriate substrate for a time-series analysis of these services. The procedure was as follows: the upload process first created files corresponding to each defined file size, which were labeled as "reserved" since they were never deleted from the account. This ensured that the download process was never interrupted, since at least the reserved files were always available for download. Then, the upload process started uploading synthetic random files until the account was full. When the account was full, this process deleted all files except the reserved ones in order to continue uploading files. In parallel, the download process was continuously downloading random files stored in the account.

Implementation: The script creates 4 files as in the previous test. When the files are ready, it uploads a file named reserved.dat (which remains in the account until the test ends) with a size of 50 MB. As soon as that file is completely uploaded, the script creates two threads, one for download and one for upload (sketched after the list below):

  • The upload thread constantly uploads files sized [25, 50, 100, 150] MB until the account is full. Once it is full, the thread removes all files except "reserved.dat". Then, it starts its cycle again.
  • The download thread continuously lists all files in the account and downloads one chosen at random. There will always be at least one file ("reserved.dat").
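
As referenced above, a minimal sketch of the two threads, again assuming a hypothetical client wrapper with upload, download, delete and list operations (the actual service_variability.py may differ):

    import random
    import threading

    SIZES_MB = [25, 50, 100, 150]

    def upload_loop(client):
        # Upload randomly sized files until the account is full, then delete
        # everything except reserved.dat and start over.
        i = 0
        while True:
            try:
                size = random.choice(SIZES_MB)
                client.upload("file_%dMB.dat" % size, remote_name="file_%d.dat" % i)
                i += 1
            except Exception:
                # Account full: keep only the reserved file.
                for name in client.list():
                    if name != "reserved.dat":
                        client.delete(name)

    def download_loop(client):
        # Continuously download a randomly chosen file; reserved.dat guarantees
        # that the listing is never empty.
        while True:
            client.download(random.choice(client.list()))

    def service_variability(client):
        # reserved.dat (50 MB) is assumed to have been created beforehand.
        client.upload("reserved.dat")
        threading.Thread(target=upload_loop, args=(client,), daemon=True).start()
        threading.Thread(target=download_loop, args=(client,), daemon=True).start()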

Deployment

Finally, we executed the experiments in different ways depending on the chosen platform. In the case of PlanetLab, we employed the same machines in each test, and therefore we needed to execute all the combinations of workloads and providers sequentially. This minimized the impact of hardware and network heterogeneity, since all the experiments were executed under the same conditions. In contrast, in our labs we executed a given workload for all providers in parallel (i.e., assigning 10 machines per provider). This provided two main advantages: the measurement process was substantially faster, and a fair comparison of the three services over the same period of time was possible.

Traces

(NEW!) Here we provide some of the traces collected during our measurements.

Trace Format. Files are in .csv format. The column fields are listed below, followed by a short analysis sketch:

  • row_id: database row identifier.
  • account_id: Personal Cloud account used to perform this API call.
  • file_size: size of the uploaded/downloaded file in bytes.
  • operation_time_start: starting time of the API call.
  • operation_time_end: Finishing time of the API call.
  • time_zone (not used): Time zone of a node for PlanetLab tests (http://www.planet-lab.org/).
  • operation_id: Hash to identify this API call.
  • operation_type: PUT/GET API call.
  • bandwidth_trace: time-series trace of a file transfer (Kbytes/sec) obtained with vnstat (http://humdi.net/vnstat/).
  • node_ip: Network address of the node executing this operation.
  • node_name: Host name of the node executing this operation.
  • quota_start: Amount of data in the Personal Cloud account at the moment of starting the API call.
  • quota_end: Amount of data in the Personal Cloud account at the moment of finishing the API call.
  • quota_total: Storage capacity of this Personal Cloud account.
  • capped (not used): Indicates if the current node is being capped (for PlanetLab tests).
  • failed: Indicates if the API call has failed (1) or not (0).
  • Failure info: Includes the available failure information in this API call (if any).
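
As referenced above, a minimal sketch that loads one of the trace files listed below and estimates the average transfer speed of each successful API call (the timestamp columns are assumed to be in a format parseable by pandas):

    import pandas as pd

    df = pd.read_csv("measurement_dropbox_load_test.csv")
    ok = df[df["failed"] == 0].copy()
    duration_s = (pd.to_datetime(ok["operation_time_end"]) -
                  pd.to_datetime(ok["operation_time_start"])).dt.total_seconds()
    ok["kbytes_per_sec"] = (ok["file_size"] / 1024.0) / duration_s
    # Summary of PUT/GET speeds per operation type.
    print(ok.groupby("operation_type")["kbytes_per_sec"].describe())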

Files and experiment description:

Load and Transfer Test (University Labs, from 2012-06-28 18:09:36 to 2012-07-03 18:36:07) - Box (http://ast-deim.urv.cat/pc_measurement/measurement_box_load_test.csv)

Load and Transfer Test (University Labs, from 2012-06-28 17:23:05 to 2012-07-03 15:52:37) - DropBox (http://ast-deim.urv.cat/pc_measurement/measurement_dropbox_load_test.csv)

Load and Transfer Test (University Labs, from 2012-06-29 14:12:37 to 2012-07-04 14:00:31) - SugarSync (http://ast-deim.urv.cat/pc_measurement/measurement_sugarsync_load_test.csv)

Load and Transfer Test (PlanetLab, from 2012-07-11 16:02:53 to 2012-07-16 06:05:29) - Box (http://ast-deim.urv.cat/pc_measurement/measurement_box_load_transfer_pl.csv)

Load and Transfer Test (PlanetLab, from 2012-06-22 17:05:05 to 2012-06-28 08:52:38) - DropBox (http://ast-deim.urv.cat/pc_measurement/measurement_dropbox_load_transfer_pl.csv)

Load and Transfer Test (PlanetLab, from 2012-07-11 16:03:53 to 2012-07-17 09:37:24) - SugarSync (http://ast-deim.urv.cat/pc_measurement/measurement_sugarsync_load_transfer_pl.csv)

Service Variability Test (University Labs, from 2012-07-04 16:11:51 to 2012-07-09 10:19:24) - Box (http://ast-deim.urv.cat/pc_measurement/measurement_box_service_variability.csv)

Service Variability Test (University Labs, from 2012-07-03 18:30:47 to 2012-07-09 10:02:50) - DropBox (http://ast-deim.urv.cat/pc_measurement/measurement_dropbox_service_variability.csv)

Service Variability Test (University Labs, from 2012-07-04 16:17:13 to 2012-07-09 14:34:07) - SugarSync (http://ast-deim.urv.cat/pc_measurement/measurement_sugarsync_service_variability.csv)

Citation Policy.

To benefit from this dataset in your research you should cite the original measurement paper: "Actively Measuring Personal Cloud Storage". Raúl Gracia-Tinedo, Marc Sánchez-Artigas, Adrián Moreno-Martínez, Cristian Cotes and Pedro García-López. 6th IEEE International Conference on Cloud Computing (Cloud'13), pages 301-308. June 27-July 2, 2013, Santa Clara Marriot, CA, USA. (http://www.thecloudcomputing.org/2013/).

Bibtex format:

@conference{gracia_actively_cloud_13,
  author    = "Gracia-Tinedo, Ra{\'u}l and S{\'a}nchez-Artigas, Marc and Moreno-Mart{\'i}nez, Adri{\'a}n and Cotes-Gonz{\'a}lez, Cristian and Garc{\'i}a-L{\'o}pez, Pedro",
  booktitle = "IEEE CLOUD'13",
  pages     = "301-308",
  title     = "{A}ctively {M}easuring {P}ersonal {C}loud {S}torage",
  year      = "2013",
}

Paper in PDF format: http://ants.etse.urv.es/web/administrator/components/com_jresearch/files/publications/Actively%20Measuring%20Personal%20Cloud%20Storage.pdf

For any questions, email any of the article authors.

Enjoy :)

Overview

StackSync website http://stacksync.org
StackSync Documentation  http://stacksync.org/documentation
StackSync Synchronization service  https://github.com/stacksync/sync-service/releases
StackSync Store API implementation https://github.com/stacksync/swift-API
StackSync Desktop Client https://github.com/stacksync/desktop/releases
StackSync Web Client https://github.com/stacksync/web
StackSync Management Interface https://github.com/stacksync/manager
StackSync Android Client  https://github.com/stacksync/android
StackSync Cloud ABE branch https://github.com/stacksync/desktop/tree/abe
StackSync Hybris integration https://github.com/pviotti/stacksync-desktop
StackSync PrivySeal https://privyseal.epfl.ch/#!/

 

Introduction

StackSync is a Dropbox-like open source synchronization tool that runs on top of OpenStack Swift. StackSync is specifically designed to address the real needs of organizations, with features like scalability, openness, security and the ease of use typical of personal clouds.

In general terms, StackSync can be divided into three main blocks: the clients, the synchronization service, and the storage back-end. An overview of the architecture with the main components and their interaction is shown in the figure below.

[Figure: StackSync architecture overview, showing the clients, the synchronization service and the storage back-end.]

As we can see in the figure above, the storage back-end is separated from the rest of the architecture by a line. This means that StackSync can be used with external storage back-ends such as Amazon S3 or RackSpace. This enables StackSync to fit different organizations and to be offered in three different configurations:

  • Public Cloud. Data and metadata is stored in a public storage provider such as Amazon or Rackspace.
  • Private Cloud. StackSync is installed on-premise. Data and metadata is stored on the company’s infrastructure.
  • Hybrid Cloud. Data is stored in a public storage provider and metadata is kept inside the company’s infrastructure. This allows organizations to keep sensitive information on-premise, while raw data is stored encrypted on a third-party storage service.

 

Last Release Features

Sync your files

Keep your files synced across all your devices.

Sharing

Share your files with your loved ones.

Adaptable to your needs

StackSync can adapt to your organization’s needs. It can be installed on-premise, on a public Cloud, or on a hybrid architecture.

Scalability

StackSync is designed to handle large numbers of users.

Wherever you are

With the mobile apps and web apps you can access your files when you are on the move.

Quota Service 

It helps to assign a quota limit to each user. In this way, the admin has control over storage.

Administration Interface

It allows the admin to create user groups, add new users, remove users, and assign quotas to users.

 

 

Overview

eyeOS Virtual Desktop https://github.com/cloudspaces/eyeos-u1db
Share API Protocol https://github.com/cloudspaces/interop-protocol
Store API Specification https://github.com/stacksync/swift-API/blob/master/StackSync_API_Specifications.md

Introduction

This document describes an open service platform for Personal Clouds including a number of services ensuring both horizontal and vertical interoperability.

Horizontal interoperability is focused on exchanging and sharing information between heterogeneous Personal Clouds. It includes services for data storage and for sharing through shared workspaces. Horizontal interoperability will be demonstrated between StackSync and the NEC Personal Cloud using the Share and Store APIs.

First, we describe the first version of the sharing protocol, which will enable different Personal Clouds to share resources among them via an API, without forcing users to be in the same provider. More generally, the sharing protocol creates a freely-implementable and generic methodology for allowing Personal Cloud interoperability. We provide the protocol specification as well as different use case scenarios that the protocol may face.

Next, we describe the storage API. This API is meant to consolidate a standard among Personal Clouds to achieve easier interoperability and facilitate access to third parties. We specify the different actions and resources available, as well as the parameters needed to perform queries and the distinct error codes available.

Vertical interoperability refers to external third-party applications accessing a Personal Cloud. Vertical interoperability will be demonstrated thanks to the eyeOS web desktop infrastructure and tools. eyeOS services like eyeFiles and eyeCalendar will demonstrate the aforementioned Store and Share services.

The goal of this platform is for users to retake control of their information stored in Personal Clouds. Users can decide how (access control) and by whom (users, applications) the information stored in their Personal Clouds can be accessed. The service platform offers open specifications that may be adopted by third-party providers (Personal Clouds, applications) and thus break the fragmentation of the market and its implicit vendor lock-in.

 

 

Share API

The sharing protocol enables different Personal Clouds to share resources among them via an API, without forcing users to be in the same provider. More generally, the sharing protocol creates a freely-implementable and generic methodology for allowing Personal Cloud interoperability.

Prerequisites

Consider two Personal Clouds (Personal Cloud 1 and Personal Cloud 2) that intend to interoperate with each other. They must meet the following requirements before using the present specification.

  1. Once the sharing process is completed, Personal Clouds must use APIs to access protected resources. In case they do not implement the Storage API proposed below, Personal Cloud 1 must implement an adapter to access Personal Cloud 2's API, and vice versa.
  2. Personal Cloud 1 must be registered in Personal Cloud 2 and validated as an authorized service in order to obtain its credentials, and vice versa. The method in which Personal Clouds register with each other and agree to cooperate is beyond the scope of this specification.
Request URL

The sharing protocol defines three endpoints:

  • Share URL. The URL used to present the sharing proposal to the user and obtain authorization.
  • Unshare URL. The URL used to finish the sharing agreement.
  • Credentials URL. The URL used to provide the access credentials.
Interoperability process overview

The sharing protocol consists of three steps:

  1. User A invites User B to its folder located in Personal Cloud A.
  2. Personal Cloud A creates the sharing proposal.
  3. Personal Cloud A sends the access credentials to Personal Cloud B.

The sharing process is divided into the three steps listed above. First, a user in Personal Cloud A expresses the intention of sharing a file with an external user (User B). Personal Cloud A will send an email with information about the proposal to the external user. The external user will select its preferred Personal Cloud, in which it has an account, namely Personal Cloud B. In the second step, Personal Cloud A will create the sharing proposal and send it to Personal Cloud B, which will require User B to authorize the proposal. The result of the proposal will be returned to Personal Cloud A. Finally, Personal Cloud A will hand the access credentials over to Personal Cloud B, granting forthcoming access to the shared resource.

User Invitation
  1. User sends an invitation: User A wants to share a folder with User B. Therefore, User A goes to Personal Cloud A, selects the folder he/she wants to share, and enters the email address of User B, who will receive an email indicating the intention of User A to share a folder with him/her and a link to a website located on Personal Cloud A.
  2. The recipient selects its Personal Cloud: User B clicks on the link and is taken to Personal Cloud A, where it is asked to select its Personal Cloud from a list of services that have an agreement with Personal Cloud A. User B selects Personal Cloud B.
  3. Creating the sharing proposal: At this time, Personal Cloud A creates the sharing proposal. To create a sharing proposal, Personal Cloud A sends an HTTP request to Personal Cloud B’s share URL. The Personal Cloud B documentation specifies the HTTP method for this request, and HTTP POST is RECOMMENDED.

     

    Fields of the sharing proposal (an example request is sketched after the list):

    • share_id: A random value that uniquely identifies the sharing proposal.
    • resource_url: An absolute URL to access the shared resource located in Personal Cloud A.
    • owner_name: The name corresponding to the owner of the folder.
    • owner_email: The email corresponding to the owner of the folder.
    • folder_name: The name of the folder.
    • permission: Permissions granted to the recipient. Options are READ-ONLY and READ-WRITE.
    • recipient: The email corresponding to the user with whom the folder has been shared.
    • callback: An absolute URL to which Personal Cloud B will redirect the User back when the invitation acceptance step is completed.
    • protocol_version: MUST be set to 1.0. Services MUST assume the protocol version to be 1.0 if this parameter is not present.
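
As referenced above, an illustrative Python sketch of this request using the requests library; the URLs and field values are examples, not part of the specification:

    import requests

    # Personal Cloud A posts the sharing proposal to Personal Cloud B's share URL.
    proposal = {
        "share_id": "7ba263c1f93a",
        "resource_url": "https://cloud-a.example.com/api/resource/1234",
        "owner_name": "User A",
        "owner_email": "usera@cloud-a.example.com",
        "folder_name": "Holiday photos",
        "permission": "READ-ONLY",
        "recipient": "userb@cloud-b.example.com",
        "callback": "https://cloud-a.example.com/share/callback",
        "protocol_version": "1.0",
    }
    response = requests.post("https://cloud-b.example.com/share", data=proposal)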

 

Invitation acceptance
  1. The user accepts the proposal: Personal Cloud B displays to User B the details of the folder invitation request. User B must provide its credentials and explicitly accept the invitation.
  2. Returning the proposal response: Once Personal Cloud B has obtained approval or denial from User B, Personal Cloud B must use the callback to inform Personal Cloud A about User B's decision. Personal Cloud B uses the callback to construct an HTTP GET request and directs the User's web browser to that URL with the following values added as query parameters:

     

    Query parameters (an example redirect follows the list):

    • share_id: A random value that uniquely identifies the sharing proposal.
    • accepted: A string indicating whether the invitation has been accepted or denied. "true" and "false" are the only possible values.
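
An illustrative sketch of how Personal Cloud B could build this redirect URL; the callback URL and values are examples:

    from urllib.parse import urlencode

    callback = "https://cloud-a.example.com/share/callback"   # received in the proposal
    params = {"share_id": "7ba263c1f93a", "accepted": "true"}
    redirect_url = callback + "?" + urlencode(params)
    # e.g. https://cloud-a.example.com/share/callback?share_id=7ba263c1f93a&accepted=true
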
Access Credentials
  1. Granting access to the service: 

    When Personal Cloud A receives the proposal result, it must provide the access credentials to Personal Cloud B so that Personal Cloud B is able to obtain the shared resource. Personal Cloud A sends an HTTP request to Personal Cloud B's credentials URL. The Personal Cloud B documentation specifies the HTTP method for this request, and HTTP POST is RECOMMENDED.

    Personal Cloud A specifies what type of authentication protocol and version must be used to access the resource. To this end, Personal Cloud B must check the auth_protocol and auth_protocol_version parameters. The authentication protocol and version used by Personal Cloud A are beyond the scope of this specification, but OAuth 1.0a or OAuth 2.0 is RECOMMENDED.


    Fields:

    • share_id: A random value that uniquely identifies the sharing proposal.
    • auth_protocol: The authentication protocol used to access the shared resource (e.g. oauth).
    • auth_protocol_version: The version of the authentication protocol (e.g. 1.0a).

    Other authentication-specific parameters are sent together with the above parameters; these parameters may include values such as tokens, timestamps or signatures.
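
An illustrative sketch of this request; the URLs are examples, and the OAuth-specific fields shown merely exemplify such authentication-specific parameters:

    import requests

    # Personal Cloud A hands the access credentials over to Personal Cloud B's
    # credentials URL after receiving the proposal result.
    credentials = {
        "share_id": "7ba263c1f93a",
        "auth_protocol": "oauth",
        "auth_protocol_version": "1.0a",
        "access_token": "example-token",        # example authentication-specific value
        "access_token_secret": "example-secret",
    }
    requests.post("https://cloud-b.example.com/credentials", data=credentials)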

Storage API

In this section we describe the storage API. This API is meant to consolidate a standard among Personal Clouds to achieve easier interoperability and facilitate access to third parties. We specify the different actions and resources available, as well as the parameters needed to perform queries and the distinct error codes available.

API Specification: https://github.com/stacksync/swift-API/blob/master/StackSync_API_Specifications.md

Source Code: https://github.com/stacksync/swift-API 

 
