BTWorld
  • BTWorld: MapReduce Workflows for Time-Based Analytics

    Collecting and analyzing large amounts of data about the operation of vital systems, such as traffic systems and the financial system, is ever more important. Using the worldwide BitTorrent file-sharing system as an example, we show how workflows for big-data queries can be created.

  • 01
    Big Data Processing

    Many vital systems in today's society are monitored in ever greater detail in order to understand and steer their operation. Examples include man-made systems such as traffic systems, security systems, the financial system, and business process systems, and natural systems such as the weather, oceans with their currents, and forests that are prone to fires. Performing time-based analytics of such systems, that is, processing and analyzing the time-based data collected from them, is a big challenge: the data have to be processed in time and without error. Currently, no good methods exist for building solutions to large-scale data-analytics problems that are resilient to failures and deliver the required performance.

    Here, we present the BTWorld use case for time-based big data analytics, which aims at understanding the operation and evolution of BitTorrent, a major Internet application for file sharing with significant traffic and over 100 million users. BTWorld extends prior work on MapReduce workloads with a comprehensive use case that covers a new application domain, a workflow of coupled MapReduce jobs, and an empirical study based on a multi-year data set. With BTWorld, we are also able to extend over a decade of theoretical BitTorrent research with knowledge that can only be acquired from a big-data-driven study.

  • 02
    Use case: BitTorrent

    BitTorrent is a peer-to-peer (P2P) file-sharing protocol whose success comes mainly from facilitating and incentivizing collaboration between peers. BitTorrent breaks up files into pieces that can be shared individually by peers. For each file shared in BitTorrent, the file name and information about its pieces form a metadata file (a torrent), which is uniquely identified. A swarm is a group of BitTorrent peers sharing the same torrent. Among the peers of a swarm, seeders possess all the pieces of the file, while leechers possess only some of the pieces and are downloading the remainder. To help peers find each other, for example when joining a swarm for the first time, BitTorrent uses trackers, which are centralized servers that return, upon request, lists of peers in the swarm of a particular torrent (see Figure 1).
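
    The seeder/leecher distinction above can be sketched in a few lines of Python. This is only an illustration of the terminology: the `Peer` class, its fields, and the helper names are hypothetical and are not part of BitTorrent or of BTWorld.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Peer:
    # Hypothetical model of a peer: an identifier plus the piece indices it holds.
    peer_id: str
    pieces_held: frozenset


def is_seeder(peer: Peer, total_pieces: int) -> bool:
    """A seeder holds every piece of the file; any other peer is a leecher."""
    return len(peer.pieces_held) == total_pieces


def split_swarm(peers, total_pieces):
    """Partition the peers of one swarm into (seeders, leechers)."""
    seeders = [p for p in peers if is_seeder(p, total_pieces)]
    leechers = [p for p in peers if not is_seeder(p, total_pieces)]
    return seeders, leechers
```

    For a three-piece file, a peer holding pieces {0, 1, 2} would be classified as a seeder and a peer holding only {0} as a leecher.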

    BTWorld focuses on understanding BitTorrent and its evolution, which have a significant impact on the operation of the entire Internet. Traditional BitTorrent theory can predict interesting steady-state phenomena, but fails to account for complex transient behavior (e.g., flashcrowds), for complex technical limitations (e.g., firewalls), and for complex inter-dependencies between global BitTorrent elements (e.g., legal effects). Measurement studies have been performed [IPTPS2005, IPTPS2010], but these usually focus on specific questions and consider relatively short time periods. As a consequence, many important questions related to non-functional system properties, such as availability and performance, cannot be answered. As an alternative, with BTWorld we propose a data-driven approach to acquiring knowledge about BitTorrent. By continuously collecting data over a long time period (currently already more than 4 years) that can be used in statistical models and validation of theories, BTWorld promises to solve many of the problems faced by the current theoretical and measurement approaches. However, a data-driven approach raises many challenges in building an efficient, scalable, and cost-effective system for data processing and preservation.

    In BTWorld, we obtain information about swarms by scraping the trackers that serve them, rather than by contacting the individual peers. Table I presents an overview of the BTWorld dataset we have collected.
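
    To make the scraping step concrete, the sketch below decodes a tracker scrape response in Python. It assumes the widely used scrape convention, in which a tracker returns a bencoded dictionary with a `files` entry that maps each torrent's info hash to `complete` (seeder) and `incomplete` (leecher) counts. The function names and the output row layout are hypothetical; the format in which BTWorld actually stores its records is not described here.

```python
def bdecode(data: bytes, i: int = 0):
    """Decode one bencoded value starting at offset i; return (value, next_offset)."""
    c = data[i:i + 1]
    if c == b"i":                        # integer: i<digits>e
        end = data.index(b"e", i)
        return int(data[i + 1:end]), end + 1
    if c == b"l":                        # list: l<items>e
        i += 1
        items = []
        while data[i:i + 1] != b"e":
            item, i = bdecode(data, i)
            items.append(item)
        return items, i + 1
    if c == b"d":                        # dictionary: d<key><value>...e
        i += 1
        result = {}
        while data[i:i + 1] != b"e":
            key, i = bdecode(data, i)
            value, i = bdecode(data, i)
            result[key] = value
        return result, i + 1
    # byte string: <length>:<bytes>
    colon = data.index(b":", i)
    length = int(data[i:colon])
    start = colon + 1
    return data[start:start + length], start + length


def swarm_counts(scrape_response: bytes):
    """Turn one scrape response into per-swarm rows (hypothetical row layout)."""
    decoded, _ = bdecode(scrape_response)
    rows = []
    for info_hash, stats in decoded[b"files"].items():
        rows.append({
            "info_hash": info_hash.hex(),
            "seeders": stats[b"complete"],
            "leechers": stats[b"incomplete"],
        })
    return rows
```

    Because a single scrape can cover every swarm a tracker serves, this kind of collection needs no contact with individual peers, which matches the approach described above.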

    Learn more

    Download IPTPS2005
    Download IPTPS2010
    Table I
    Figure 1

    15 TB of data produced by monitoring the BitTorrent trackers for 4 years

  • 03
    MapReduce Workflow

    Given the richness of the data set, it is common for the BitTorrent analyst to design new queries, which could (and in our experience do) traverse the entire data set to produce their output. At the same time, queries need to be translated to executable code as quickly and simply as possible. The BTWorld workflow is designed as a set of interdependent queries written in Pig Latin, an SQL-like programming language designed to be automatically translated to MapReduce jobs. The queries are presented in Figure 2, together with their data dependencies, which were created to maximize data reuse. Table II summarizes their function.

    As an example, the query Tracker over Time (ToT) reports the number of swarms served, the total peer population, and the ratios per swarm of the numbers of seeders and leechers. The query ActiveSwarms (AS), in turn, uses the output of ToT to report the total number of active swarms in the system.
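
    The dependency between these two queries can be illustrated with a small Python sketch. It is not the Pig Latin code of the BTWorld workflow: the record layout `(timestamp, tracker, info_hash, seeders, leechers)`, the function names, and the reading of "ratios per swarm" as average seeders and leechers per swarm are all assumptions made for illustration. The grouping by key followed by aggregation mirrors the map and reduce phases that such a query would be translated to.

```python
from collections import defaultdict


def tracker_over_time(records):
    """ToT-like query: aggregate per (tracker, timestamp) over the full data set.

    records: iterable of (timestamp, tracker, info_hash, seeders, leechers).
    """
    groups = defaultdict(lambda: {"swarms": 0, "seeders": 0, "leechers": 0})
    for timestamp, tracker, _info_hash, seeders, leechers in records:
        g = groups[(tracker, timestamp)]   # "map": key each record
        g["swarms"] += 1                   # "reduce": aggregate per key
        g["seeders"] += seeders
        g["leechers"] += leechers
    return {
        key: {
            "swarms": g["swarms"],
            "peers": g["seeders"] + g["leechers"],
            "seeders_per_swarm": g["seeders"] / g["swarms"],
            "leechers_per_swarm": g["leechers"] / g["swarms"],
        }
        for key, g in groups.items()
    }


def active_swarms(tot):
    """AS-like query: reuse the ToT output to total the swarms per timestamp.

    Assumption: swarm counts are simply summed across trackers.
    """
    totals = defaultdict(int)
    for (_tracker, timestamp), row in tot.items():
        totals[timestamp] += row["swarms"]
    return dict(totals)
```

    Note that `active_swarms` never touches the raw records: it reads only the much smaller ToT output, which is the kind of data reuse the workflow's dependencies are meant to enable.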

    Learn more

    IEEE-BigData-2013
    CCGrid-SCALE-2013
    Figure 2
    Table II

  • 04
    Results & System Performance

    We have executed the BTWorld workflow first on a 100 GB dataset and later on a 1.5 TB dataset, using a 24-node Hadoop cluster deployed on our DAS-4 system with the configuration given in Table III. Some queries scan the complete dataset, and their runtimes are therefore proportional to the dataset size. Others are small post-processing queries that summarize the output of other queries and have runtimes that are almost independent of the dataset size. For instance, the query Tracker over Time (ToT) reports the number of swarms and their composition in numbers of leechers and seeders for each tracker over time, and does scan the complete dataset. In contrast, the query ActiveSwarms (AS) takes the output of ToT and simply reports the total number of swarms that are active in BitTorrent over time. The query response times in Figure 3 reflect this (in)dependence on the dataset size.

    Some properties of the execution of the BTWorld workflow are shown in Table IV. As it turns out, the workflow scales sublinearly: the total runtime increases by a factor of only about 7 for a dataset that is 15 times as large.
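
    The sublinear scaling can be quantified with a short calculation. The sketch below uses only the two ratios stated above (a dataset 15 times as large, a runtime about 7 times as long). Fitting a power law through just two measurements is an assumption made for illustration, not an analysis taken from the BTWorld results.

```python
import math


def scaling_exponent(size_ratio: float, runtime_ratio: float) -> float:
    """Exponent k such that runtime_ratio == size_ratio ** k (two-point power-law fit)."""
    return math.log(runtime_ratio) / math.log(size_ratio)


size_ratio = 15.0     # 1.5 TB dataset versus 100 GB dataset
runtime_ratio = 7.0   # "a factor of about 7" in total runtime

k = scaling_exponent(size_ratio, runtime_ratio)
runtime_per_byte = runtime_ratio / size_ratio   # relative to the small dataset

print(f"scaling exponent ~ {k:.2f}")
print(f"relative runtime per byte ~ {runtime_per_byte:.2f}")
```

    An exponent below 1 corresponds to sublinear scaling. Under these assumptions the exponent comes out at roughly 0.72, and each byte of the larger dataset costs a little under half the runtime it did in the smaller one.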

    Learn more

    Table III
    Figure 3
    Table IV

  • 05
    Other uses of time-based analytics

    Our method for creating the BTWorld workflow carries over easily to systems in other application areas. Examples include man-made systems such as traffic systems, security systems, the financial system, and business process systems, and natural systems such as the weather, oceans with their currents, and forests that are prone to fires.

  • BTWorld is a project that is realised with help of
    You can get more information from
    Dick Epema (d.h.j.epema@tudelft.nl) and
    Alexandru Iosup (a.iosup@tudelft.nl)