Personal Cloud Datasets

NEC Personal Cloud Trace

The NEC dataset integrates two sources of information: the storage layer and sharing interactions. The information to build the dataset was collected directly by the provider from the back-end (SQL Server & OpenStack Swift).

Regarding storage (CS_FileTraces), the trace is a snapshot of the data store contents (OpenStack Swift). Specifically, the trace contains log lines that identify and describe files (size, extension), as well as the file owner and the container/folder where each file is stored. This enables us to analyze the storage layer of this service in detail.

The sharing trace (CS_SharingTraces) contains log lines describing sharing interactions across users and information about shared files. This trace contains all sharing interactions in the NEC Personal Cloud (Madrid datacenter) from March 7th 2013 to September 9th 2015. To our knowledge, this is the most extensive trace of data sharing in Personal Clouds to date.

Dataset limitations: The collected dataset has some limitations. For instance, we did not trace information about the locations of users (e.g., IP addresses) for privacy reasons. Moreover, it should be noted that while the sharing trace captures the whole sharing activity of users for several months, the storage layer information corresponds to a snapshot at a specific point in time. Consequently, not all the files captured in the sharing trace exist in the storage trace, and vice-versa. The sharing trace also lacks temporal information.

Download the NEC trace (.zip) - 7.7MB (md5: 27f9f5478d7cbffcedc1b317e8765321)

Citation Policy.

If you use this dataset in your research, please cite the original measurement paper: "Understanding Data Sharing in Private Personal Clouds". R. Gracia-Tinedo, Pedro García-López, Alberto Gómez and Anastasio Illana. IEEE International Conference on Cloud Computing (CLOUD'16), San Francisco, June 2016 (http://thecloudcomputing.org/2016/).


 

UbuntuOne (U1) Measurement

Protocol Entities

Before describing the trace format, it is important to understand a few concepts:

  1. Volume: a volume can be considered a directory. During the installation of the U1 client, the client creates an initial volume to store files, with id=0 (root/predefined). There are 3 types of volumes: i) root/predefined, ii) udf (user-defined folder, i.e., a folder created by the user), and iii) share (a sub-volume of another user to which the current user has access).

  2. Node: A node is a file or a directory in the system. We can distinguish between both in the node_type field.

  3. Session: A user interacts with the server in the context of a U1-storage protocol session (not HTTP or any other session type). This session is used to identify the requests of a single user during the session lifetime. Sessions do not usually expire automatically; a client disconnecting or a server process going down is what ends a session.

  4. Request types: There are different request types. Type “storage”/“storage_done” relates to the storage operations of users (put, get, delete, ...). Type “session” denotes operations related to the management of user sessions. Type "rpc" entries are the translation of storage protocol messages (of type “storage*”) into RPC calls to the metadata backend.

Protocol Operations

 Meaning of operation types:

  • ListShares: this operation lists all the volumes of a user that are of type “share”. In these entries, the field “shared_by” is the owner of the volume and “shared_to” is the user with whom that volume was shared. The field `shares` is analogous to `udfs`; it represents the number of volumes of type share of this user.
  • PutContentResponse/GetContentResponse: these are the actual file uploads and downloads, respectively. The notification goes to the U1 backend, but the actual data is stored in a separate service (Amazon S3).
  • Unlink: Delete a file or a directory from a volume.
  • Make: request/response of a touch operation in the backend. Normally this is associated with RPC upload jobs, as a consequence of a file upload.
  • QuerySetCapsResponse: response to a capabilities (caps) query; for example: frozenset([u'resumable-uploads', u'no-content', u'account-info', u'volumes', u'generations', u'fix462230']).
  • GetDeltaResponse: Get the differences between the server volume and the local one (generations).
  • AuthenticateRequest/Response: Operations managed by the servers to create sessions for users.
  • ListVolumes: operation that is normally performed at the beginning of a session; it lists all the volumes of a user (own, shared, ...).

Trace Description

Trace logfile name. Consider a logfile identifier like "production-whitecurrant-23-20140128". All identifiers start with "production-", because we did not look at logs from staging (the other prefix). After that prefix comes the name of the physical machine (whitecurrant), followed by the number of the server process (23), which can migrate between nodes for load balancing, and finally the date the logfile was "cut" (there is one log file per server/service and day).
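
For reference, here is a minimal sketch (in Python) that splits such a logfile identifier into its parts, assuming the "production-<machine>-<process>-<date>" scheme described above:

  # Split a U1 logfile identifier into its components, assuming the
  # "production-<machine>-<process>-<date>" scheme described above.
  def parse_logfile_id(logfile_id):
      prefix, machine, process, date = logfile_id.split("-")
      return {"machine": machine, "process": int(process), "date": date}

  print(parse_logfile_id("production-whitecurrant-23-20140128"))
  # -> {'machine': 'whitecurrant', 'process': 23, 'date': '20140128'}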

Normally there are between 8 and 16 processes per physical machine. The identifier of a process is unique within a machine, although it can migrate from one machine to another. Sharding is done in the metadata storage backend, so it is behind the point where traces are taken. This means that in these traces any combination of server/process can handle any user. To obtain a strictly sequential view of the activity of a user, one should take the session into account and sort the trace by timestamp (a user may have more than one parallel connection). A session starts on the least loaded machine and lives on the same node until it finishes, making the events of a session strictly sequential.
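
As an illustration, the following sketch (Python with pandas) orders trace entries per session; the column names sid and tstamp are those documented below, while the file path is a hypothetical placeholder:

  # Order the trace so that the events of each session are sequential.
  # Column names (sid, tstamp) are documented below; the path is a placeholder.
  import pandas as pd

  trace = pd.read_csv("u1_trace.csv")
  trace = trace.sort_values(["sid", "tstamp"])
  for sid, events in trace.groupby("sid", sort=False):
      pass  # 'events' holds the chronologically ordered requests of one session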

Explanation of the columns in the trace file (.csv): 

  • T: trace type. Distinct types of traces have different columns associated.
  • addr: Inbound address and port. Probably not particularly interesting, as they'll be internal addresses.
  • caps: the caps (capabilities) of a client. Newer clients will have more capabilities than older ones; they work like feature flags. This allows U1 to distinguish which clients can perform certain actions, and even to notify a client that it needs to be updated.
  • client_metadata: metadata of a client. For example, operating system, client build information, etc.
  • current_gen: current volume generation. A volume's generation is a monotonic counter increased for every operation on the volume. It is needed for clients to detect changes between the server version and their own.
  • ext: file extension.
  • failed: error message (if any).
  • free_bytes: available space in the volume.
  • from_gen: client generation. This is for an operation that requests all the changes to a volume from a certain generation onwards. The client keeps track of the generation it's at, and on reconnection asks for changes from that point on.
  • hash: sha1sum of the file
  • level: trace level.
  • logfile_id: trace file identifier.
  • method: rpc method name.
  • mime: file type.
  • msg: request message.
  • node_id: id of a file or a directory.
  • nodes: number of files and directories within a volume.
  • pid: number of the server process.
  • req_id: request identifier. There are separate entries for the start and end of every operation. `req_id` is unique within a given server and session, where the server is part of the logfile_id (without the date). `req_id` is present in entries of type `storage` and `storage_done` (a specialization of `storage`); `storage` requests are in turn a specialization of `session`.
  • req_t: request type.
  • root: root of the volume. All volumes have a root node that would be similar to the mount point in a file system.
  • server: server type.
  • shared_by: owner of the shared volume.
  • shared_to: user to which the shared folder is shared with.
  • shares: number of shared volumes for this user.
  • sid: id of the ubuntuone-storageprotocol session (not http).
  • size: file size.
  • time: request processing time (database processing).
  • tstamp: trace timestamp.
  • type: node type (here we can distinguish between files and directories).
  • udfs: number of user volumes.
  • user_id: user identifier.
  • user: user name (anonymized using int32 random numbers). Only present when the user starts a session.
  • vol_id: volume identifier. "0" is the identifier of the root volume for users. This means that all the users have a volume with vol_id=0.
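
As an example of how these columns fit together, the following sketch (Python with pandas) summarizes the file sizes of completed storage operations (req_t equal to "storage_done", as described above). The file path and chunk size are placeholders, since the full trace is large:

  # Summarize file sizes of completed storage operations.
  # The path and chunk size are placeholders for this sketch.
  import pandas as pd

  sizes = []
  for chunk in pd.read_csv("u1_trace.csv", chunksize=1000000):
      done = chunk[chunk["req_t"] == "storage_done"]
      sizes.append(done["size"].dropna())
  print(pd.concat(sizes).describe())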

Download the U1 dataset (.gz) - 84GB (md5: 62cf355792e59e85a95886e8e23b4ae3).

Citation Policy.

If you use this dataset in your research, please cite the original measurement paper: "Dissecting UbuntuOne: Autopsy of a Global-scale Personal Cloud Back-end". R. Gracia-Tinedo, Yongchao Tian, Josep Sampé, Hamza Harkous, John Lenton, Pedro García-López, Marc Sánchez-Artigas and Marko Vukolic. ACM Internet Measurement Conference (IMC'15), Tokyo, October 2015 (http://conferences2.sigcomm.org/imc/2015/).

 


 

Active Personal Cloud Measurement

We have created a script named "measurement.py". It takes two arguments (example invocations are shown below):

  • Provider: -p or --provider and [dropbox|box|sugarsync]
  • Test: -t or --type and [load_and_transfer|service_variability]

Depending on the test type, the script executes one of the following files:

  • load_and_transfer: File load_and_transfer_test.py
  • service_variability: File service_variability.py
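
For instance, typical invocations look as follows (assuming the script is run with a standard Python interpreter):

  python measurement.py --provider dropbox --type load_and_transfer
  python measurement.py -p box -t service_variability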

Load and Transfer Workload

The objective of this workload was twofold: measuring the maximum up/down transfer speed of operations and detecting correlations between the transfer speed and the load of an account. The first objective was achieved by alternating upload and download operations, since the provider only needed to handle one operation per account at a time. We achieved the second by acquiring information about the load of the account in each API call. This workload was executed continuously at each node as follows: first, the node created synthetic files of a size chosen at random from a predefined set of sizes (see below). The node then uploaded files until the capacity of the account was exhausted. At that point, it downloaded all the files, also in random order, deleting each file after its download.

Implementation: First of all, the script creates 4 files of different sizes (25, 50, 100 and 150 MB). Once done, it uploads files chosen at random until the account is full and the provider returns an error. When this happens, the script starts downloading a random file from the account and removes it when the download has finished.
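
The following sketch outlines this loop; upload, download, delete_file and list_files are hypothetical placeholders for the provider API wrappers used by the real script:

  # Sketch of the load-and-transfer loop described above.
  # upload(), download(), delete_file() and list_files() are hypothetical
  # placeholders for the provider API wrappers, not the script's actual code.
  import os, random

  SIZES_MB = [25, 50, 100, 150]

  def create_synthetic_files():
      # Create the 4 synthetic files of 25, 50, 100 and 150 MB.
      for mb in SIZES_MB:
          with open("file_%dMB.dat" % mb, "wb") as f:
              f.write(os.urandom(mb * 1024 * 1024))

  def load_and_transfer(upload, download, delete_file, list_files):
      create_synthetic_files()
      # Upload files chosen at random until the provider rejects the upload
      # because the account is full.
      while True:
          try:
              upload("file_%dMB.dat" % random.choice(SIZES_MB))
          except Exception:
              break
      # Download the stored files in random order, deleting each one afterwards.
      remote = list_files()
      random.shuffle(remote)
      for name in remote:
          download(name)
          delete_file(name)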

This test runs for approximately 5 days.

Service Variability Workload

This workload maintained a nearly continuous upload and download flow on every node in order to analyze the performance variability of the service over time, providing an appropriate substrate for a time-series analysis of these services. The procedure was as follows: the upload process first created files corresponding to each defined file size, which were labeled as “reserved” since they were never deleted from the account. This ensured that the download process was never interrupted, since at least the reserved files were always available for download. Then, the upload process started uploading synthetic random files until the account was full. When the account was full, this process deleted all files except the reserved ones in order to continue uploading files. In parallel, the download process continuously downloaded random files stored in the account.

Implementation: The script creates 4 files as in the previous test. When the files are ready, it uploads a file named reserved.dat (which remains in the account until the test ends) with a size of 50 MB. As soon as this file is completely uploaded, the script creates two threads, one for upload and one for download (a sketch follows the list below).

  • The upload thread constantly uploads files sized [25, 50, 100, 150] MB until the account is full. Once it is full, the thread removes all files except "reserved.dat". Then, it starts its cycle again.
  • The download thread continuously lists all files in the account and downloads one chosen at random. There will always be at least one file available ("reserved.dat").
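
A sketch of the two concurrent flows follows; as before, upload, download, delete_file and list_files are hypothetical placeholders for the provider API wrappers:

  # Sketch of the service-variability workload described above.
  # upload(), download(), delete_file() and list_files() are hypothetical
  # placeholders for the provider API wrappers.
  import random, threading

  SIZES_MB = [25, 50, 100, 150]

  def upload_loop(upload, delete_file, list_files):
      # Keep uploading until the account is full, then clean up and start over.
      while True:
          try:
              upload("file_%dMB.dat" % random.choice(SIZES_MB))
          except Exception:  # account full
              for name in list_files():
                  if name != "reserved.dat":
                      delete_file(name)

  def download_loop(download, list_files):
      # Continuously download a randomly chosen file; "reserved.dat" is always there.
      while True:
          download(random.choice(list_files()))

  def service_variability(upload, download, delete_file, list_files):
      upload("reserved.dat")  # 50 MB file that is never deleted
      threading.Thread(target=upload_loop, args=(upload, delete_file, list_files)).start()
      threading.Thread(target=download_loop, args=(download, list_files)).start()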

Deployment

Finally, we executed the experiments in different ways depending on the chosen platform. In the case of PlanetLab, we employed the same machines in each test and therefore needed to execute all the combinations of workloads and providers sequentially. This minimized the impact of hardware and network heterogeneity, since all the experiments were executed under the same conditions. In contrast, in our labs we executed a given workload for all providers in parallel (i.e., assigning 10 machines per provider). This provided two main advantages: the measurement process was substantially faster, and a fair comparison of the three services over the same period of time was possible.

Traces

(NEW!) Here we provide some of the traces collected during our measurements.

Trace Format. Files are in .csv format. The column fields are:

  • row_id: database row identifier.
  • account_id: Personal Cloud account used to perform this API call.
  • file_size: size of the uploaded/downloaded file in bytes.
  • operation_time_start: starting time of the API call.
  • operation_time_end: Finishing time of the API call.
  • time_zone (not used): Time zone of a node for PlanetLab tests (http://www.planet-lab.org/).
  • operation_id: Hash to identify this API call.
  • operation_type: PUT/GET API call.
  • bandwidth_trace: time-series trace of a file transfer (Kbytes/sec) obtained with vnstat (http://humdi.net/vnstat/).
  • node_ip: Network address of the node executing this operation.
  • node_name: Host name of the node executing this operation.
  • quota_start: Amount of data in the Personal Cloud account at the moment of starting the API call.
  • quota_end: Amount of data in the Personal Cloud account at the moment of finishing the API call.
  • quota_total: Storage capacity of this Personal Cloud account.
  • capped (not used): Indicates if the current node is being capped (for PlanetLab tests).
  • failed: Indicates if the API call has failed (1) or not (0).
  • Failure info: Includes the available failure information in this API call (if any).
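
For instance, the effective transfer speed of each successful API call can be derived from these columns. A sketch in Python with pandas, assuming the timestamps are in a format pandas can parse (the chosen trace file is the Dropbox load test listed below):

  # Effective transfer speed per successful API call, in KB/s.
  import pandas as pd

  trace = pd.read_csv("measurement_dropbox_load_test.csv",
                      parse_dates=["operation_time_start", "operation_time_end"])
  ok = trace[trace["failed"] == 0].copy()
  seconds = (ok["operation_time_end"] - ok["operation_time_start"]).dt.total_seconds()
  ok["speed_kBps"] = (ok["file_size"] / 1024.0) / seconds
  print(ok.groupby("operation_type")["speed_kBps"].describe())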

Files and experiment description:

Load and Transfer Test (University Labs, from 2012-06-28 18:09:36 to 2012-07-03 18:36:07) - Box (http://ast-deim.urv.cat/pc_measurement/measurement_box_load_test.csv)

Load and Transfer Test (University Labs, from 2012-06-28 17:23:05 to 2012-07-03 15:52:37) - DropBox (http://ast-deim.urv.cat/pc_measurement/measurement_dropbox_load_test.csv)

Load and Transfer Test (University Labs, from 2012-06-29 14:12:37 to 2012-07-04 14:00:31) - SugarSync (http://ast-deim.urv.cat/pc_measurement/measurement_sugarsync_load_test.csv)

Load and Transfer Test (PlanetLab, from 2012-07-11 16:02:53 to 2012-07-16 06:05:29) - Box (http://ast-deim.urv.cat/pc_measurement/measurement_box_load_transfer_pl.csv)

Load and Transfer Test (PlanetLab, from 2012-06-22 17:05:05 to 2012-06-28 08:52:38) - DropBox (http://ast-deim.urv.cat/pc_measurement/measurement_dropbox_load_transfer_pl.csv)

Load and Transfer Test (PlanetLab, from 2012-07-11 16:03:53 to 2012-07-17 09:37:24) - SugarSync (http://ast-deim.urv.cat/pc_measurement/measurement_sugarsync_load_transfer_pl.csv)

Service Variability Test (University Labs, from 2012-07-04 16:11:51 to 2012-07-09 10:19:24) - Box (http://ast-deim.urv.cat/pc_measurement/measurement_box_service_variability.csv)

Service Variability Test (University Labs, from 2012-07-03 18:30:47 to 2012-07-09 10:02:50) - DropBox (http://ast-deim.urv.cat/pc_measurement/measurement_dropbox_service_variability.csv)

Service Variability Test (University Labs, from 2012-07-04 16:17:13 to 2012-07-09 14:34:07) - SugarSync (http://ast-deim.urv.cat/pc_measurement/measurement_sugarsync_service_variability.csv)

Citation Policy.

If you use this dataset in your research, please cite the original measurement paper: "Actively Measuring Personal Cloud Storage". Raúl Gracia-Tinedo, Marc Sánchez-Artigas, Adrián Moreno-Martínez, Cristian Cotes and Pedro García-López. 6th IEEE International Conference on Cloud Computing (CLOUD'13), pages 301-308. June 27-July 2, 2013, Santa Clara Marriott, CA, USA. (http://www.thecloudcomputing.org/2013/).

Bibtex format: @conference{gracia_actively_cloud_13, author = "Gracia-Tinedo, Ra{\'u}l and S{\'a}nchez-Artigas, Marc and Moreno-Mart{\'i}nez, Adri{\'a}n and Cotes-Gonz{\'a}lez, Cristian and Garc{\'i}a-L{\'o}pez, Pedro", booktitle = "IEEE CLOUD'13", pages = "301-308", title = "{A}ctively {M}easuring {P}ersonal {C}loud {S}torage", year = "2013", }

Paper in PDF format: http://ants.etse.urv.es/web/administrator/components/com_jresearch/files/publications/Actively%20Measuring%20Personal%20Cloud%20Storage.pdf

For any questions, email any of the article authors.

Enjoy :)
