Collecting and analyzing large amounts of data about the operation of vital systems such as traffic systems and the financial system is ever more important. Using the worldwide BitTorrent file-sharing system as an example, we show how workflows for big-data queries can be created.
Many vital systems in current society are monitored in ever more detail in order to understand and steer their operation. Examples of such systems are man-made systems such as traffic systems, security systems, the financial system, and business process systems, and natural systems such as the weather, oceans with their currents, and forests that are prone to fires. Performing time-based analytics of such systems, that is, processing and analyzing the time-based data collected from them, is a big challenge: the data have to be processed in time and without error. Currently, no good methods exist for building solutions for large-scale data-analytics problems that are resilient to failures and deliver the required performance.
Here, we present the BTWorld use case for time-based big data analytics, which aims at understanding the operation and evolution of BitTorrent, a major Internet application for file sharing with significant traffic and over 100 million users. Our use case extends prior work on MapReduce workloads with a comprehensive use case that focuses on a new application domain, a workflow of coupled MapReduce jobs, and an empirical study based on a multi-year data set. With BTWorld, we are also able to extend over a decade of theoretical BitTorrent research with knowledge that can only be acquired from a big-data-driven study.
BitTorrent is a peer-to-peer (P2P) file-sharing protocol whose success comes mainly from facilitating and incentivizing collaboration between peers. BitTorrent breaks up files into pieces that can be shared individually by peers. For each file shared in BitTorrent, the file name and information about its pieces form a metadata file (a torrent), which is uniquely identified. A swarm is a group of BitTorrent peers sharing the same torrent. Among the peers of a swarm, seeders possess all the pieces of the file, while leechers possess only some of the pieces and are downloading the remainder. To help peers find each other, for example when joining a swarm for the first time, BitTorrent uses trackers, which are centralized servers that, upon request, return lists of peers in the swarm of a particular torrent (see Figure 1).
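The roles described above can be illustrated with a toy in-memory model. This is only a sketch under stated assumptions: the names `Torrent`, `Peer`, `Tracker`, `join`, and `scrape` are hypothetical, and the code does not implement the actual BitTorrent wire protocol or the interface of any real tracker.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Torrent:
    # Hypothetical stand-in for the metadata file: a unique id and a piece count.
    torrent_id: str
    num_pieces: int


@dataclass
class Peer:
    name: str
    pieces: set = field(default_factory=set)  # indices of the pieces this peer holds

    def is_seeder(self, torrent: Torrent) -> bool:
        # A seeder holds every piece; any other peer in the swarm is a leecher.
        return len(self.pieces) == torrent.num_pieces


class Tracker:
    """Toy centralized tracker: maps each torrent to the peers in its swarm."""

    def __init__(self) -> None:
        self.swarms = {}  # torrent_id -> {peer name: Peer}

    def join(self, torrent: Torrent, peer: Peer) -> list:
        # Register the peer and return the other peers already in the swarm.
        swarm = self.swarms.setdefault(torrent.torrent_id, {})
        others = [p for p in swarm.values() if p.name != peer.name]
        swarm[peer.name] = peer
        return others

    def scrape(self, torrent: Torrent) -> dict:
        # Summarize the swarm as seeder and leecher counts.
        swarm = self.swarms.get(torrent.torrent_id, {}).values()
        seeders = sum(p.is_seeder(torrent) for p in swarm)
        return {"seeders": seeders, "leechers": len(swarm) - seeders}
```

In this model, the first peer to join a swarm receives an empty list, and later peers receive the peers that joined before them. The `scrape` summary is an assumed, simplified form of the per-swarm information a tracker can report.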
BTWorld focuses on understanding BitTorrent and its evolution, which have a significant impact on the operation of the entire Internet. Traditional BitTorrent theory can predict interesting steady-state phenomena, but fails to account for complex transient behavior (e.g., flashcrowds), for complex technical limitations (e.g., firewalls), and for complex interdependencies between global BitTorrent elements (e.g., legal effects). Measurement studies have been performed [IPTPS2005, IPTPS2010], but these usually focus on specific questions and consider relatively short time periods. As a consequence, many important questions related to non-functional system properties (availability, performance) cannot be answered. As an alternative, with BTWorld we propose a data-driven approach to acquiring knowledge about BitTorrent. By continuously collecting data over a long time period (currently already more than 4 years) that can be used in statistical models and in the validation of theories, BTWorld promises to solve many of the problems faced by the current theoretical and measurement approaches. However, a data-driven approach raises many challenges in building an efficient, scalable, and cost-effective system for data processing and preservation.
In BTWorld, we obtain information on swarms by scraping the trackers that serve them, rather than by contacting the individual peers. Table I presents an overview of the BTWorld dataset we have collected.
Given the richness of the data set, it is common for the BitTorrent analyst to design new queries, which could (and in our experience do) traverse the entire data set to produce their output. At the same time, queries need to be translated to executable code as quickly and simply as possible. The BTWorld workflow is designed as a set of interdependent queries written in Pig Latin, an SQL-like programming language designed to be automatically translated to MapReduce jobs. The queries are presented in Figure 2, together with their data dependencies, which were created to maximize data reuse. Table II summarizes their function.
As examples, the query Trackers over Time (ToT) reports, for each tracker, the number of swarms served, the total peer population, and the ratios per swarm of the numbers of seeders and leechers, and the query Active Swarms (AS) reports, based on the output of ToT, the total number of active swarms in the system.
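The dependency between ToT and AS can be sketched in plain Python. This is not the Pig Latin code of the BTWorld workflow; it is a minimal illustration that assumes a hypothetical record schema (`Scrape`), assumes that a swarm counts as active when it has at least one peer, and reports seeder and leecher totals rather than the exact ratios computed by the real query.

```python
from collections import defaultdict
from typing import NamedTuple


class Scrape(NamedTuple):
    # Hypothetical schema for one scraped record; the real BTWorld schema may differ.
    timestamp: int
    tracker: str
    swarm: str
    seeders: int
    leechers: int


def trackers_over_time(records):
    """ToT-like query: a full scan, grouped by (tracker, timestamp)."""
    groups = defaultdict(list)
    for r in records:
        groups[(r.tracker, r.timestamp)].append(r)

    result = {}
    for key, rows in groups.items():
        # Assumption: a swarm is active if it has at least one peer.
        active = [r for r in rows if r.seeders + r.leechers > 0]
        result[key] = {
            "swarms": len(active),
            "peers": sum(r.seeders + r.leechers for r in active),
            "seeders": sum(r.seeders for r in active),
            "leechers": sum(r.leechers for r in active),
        }
    return result


def active_swarms(tot):
    """AS-like query: reads only the ToT output, never the raw records."""
    totals = defaultdict(int)
    for (_tracker, timestamp), row in tot.items():
        totals[timestamp] += row["swarms"]
    return dict(totals)
```

Because `active_swarms` consumes only the much smaller ToT output, its cost depends on the number of trackers and timestamps rather than on the size of the raw data set. Note that this sketch would count a swarm served by several trackers once per tracker.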
We have executed the BTWorld workflow first on a 100 GB dataset and later on a 1.5 TB dataset on a 24-node Hadoop cluster deployed on our DAS-4 system with the configuration given in Table III. Some queries scan the complete dataset, and their runtimes are therefore proportional to the dataset size. Others are small post-processing queries that summarize the output data of other queries and have runtimes that are almost independent of the dataset size. For instance, the query Trackers over Time (ToT) reports the number of swarms and their composition in numbers of leechers and seeders for each tracker over time, and does scan the complete dataset. In contrast, the query Active Swarms (AS) takes the output of ToT and simply reports the total number of swarms that are active in BitTorrent over time. The query response times in Figure 3 reflect this (in)dependence on the dataset size.
Some properties of the execution of the BTWorld workflow are shown in Table IV. As it turns out, the workflow scales sublinearly: the total runtime increases by a factor of only about 7 for a dataset that is 15 times as large.
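The sublinear scaling can be made concrete with a short calculation. The sketch below assumes a power-law model, runtime proportional to size raised to some exponent, which is an illustrative assumption fitted to only two data points and not a claim made in the study. It also assumes decimal units, so that 1.5 TB is 1500 GB.

```python
import math


def scaling_exponent(size_ratio: float, runtime_ratio: float) -> float:
    """Exponent a such that runtime_ratio == size_ratio ** a (assumed power law)."""
    return math.log(runtime_ratio) / math.log(size_ratio)


size_ratio = 1500 / 100   # 1.5 TB versus 100 GB, decimal units assumed
runtime_ratio = 7.0       # "a factor of about 7", as reported in the text

alpha = scaling_exponent(size_ratio, runtime_ratio)
relative_cost_per_gb = runtime_ratio / size_ratio

print(f"scaling exponent: {alpha:.2f}")
print(f"runtime per GB relative to the small run: {relative_cost_per_gb:.2f}")
```

An exponent below 1 indicates sublinear scaling; here it is roughly 0.72, and the runtime per gigabyte drops to a little under half of its value on the small dataset. This is consistent with a workflow that mixes queries whose runtimes are proportional to the dataset size with post-processing queries whose runtimes are nearly constant.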
Our method for creating the BTWorld workflow carries over easily to systems in other application areas. Examples of such systems are man-made systems such as traffic systems, security systems, the financial system, and business process systems, and natural systems such as the weather, oceans with their currents, and forests that are prone to fires.