SparkTrails
SparkTrails is a MapReduce implementation of HypTrails based on Apache Spark, enabling fast, scalable, and distributed comparison of hypotheses about human trails.
For a primer, we refer to the paper introducing SparkTrails as well as the original HypTrails paper. In the following, we give additional details on SparkTrails and how to use it, and elaborate on the evaluation we have conducted in the SparkTrails paper.
-
Martin Becker, Hauke Mewes, Andreas Hotho, Dimitar Dimitrov, Florian Lemmerich, Markus Strohmaier
SparkTrails: A MapReduce Implementation of HypTrails for Comparing Hypotheses About Human Trails
Proceedings of the 25th International Conference on World Wide Web, 2016
-
Philipp Singer, Denis Helic, Andreas Hotho, Markus Strohmaier
HypTrails: A Bayesian Approach for Comparing Hypotheses About Human Trails on the Web
Proceedings of the 24th International Conference on World Wide Web, 2015
Quickstart
In this section, we give a quickstart tutorial on how to get and use SparkTrails. SparkTrails is implemented for Apache Spark and is written in Scala. Thus, the relevant code snippets are mostly in Scala. For more details, also see the GitHub page.
Get and build SparkTrails:
Get Spark and run the Spark Shell including SparkTrails:
Run our first example:
You can plot the results, for example, using R. For additional examples, we refer to the SparkTrails repository, where we will successively add more tutorials.
Evaluation
For a general evaluation we refer to the SparkTrails paper. In this section we elaborate on the datasets used for runtime experiments as well as the baseline code.
Datasets
In the SparkTrails paper, we use several datasets to evaluate the runtime of our approach. This includes several synthetic datasets, a Wikipedia dataset, and a Flickr dataset, all aimed at demonstrating the scalability of SparkTrails. In this section, we elaborate on these datasets.
Generally, HypTrails works with trails over a finite set of states in a first-order Markov chain scenario. Thus, each evaluation dataset consists of two matrices: an observation matrix and a hypothesis matrix. The observation matrix contains the number of observations for each state transition, and the hypothesis matrix contains the transition probabilities between states according to a user hypothesis.
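To make the role of the two matrices concrete, the following sketch (our own illustration, not the SparkTrails API) shows how HypTrails scores a single state: the hypothesis row is expressed as Dirichlet pseudo-counts and combined with the observed transition counts via the Dirichlet-Multinomial marginal likelihood. For simplicity, the pseudo-counts are assumed to be integers, so the log-gamma function reduces to a sum of logarithms.

```scala
// log Γ(n) for integer n >= 1, via log(n!) = sum of logs (sufficient for integer pseudo-counts)
def logGammaInt(n: Int): Double = (1 until n).map(k => math.log(k.toDouble)).sum

// Log marginal likelihood of one row of transition counts under a
// Dirichlet prior with integer pseudo-counts derived from a hypothesis.
def logEvidenceRow(counts: Array[Int], alphas: Array[Int]): Double = {
  val post = counts.zip(alphas).map { case (n, a) => n + a }  // posterior pseudo-counts
  (logGammaInt(alphas.sum) - logGammaInt(post.sum)
    + post.map(logGammaInt).sum - alphas.map(logGammaInt).sum)
}

// Observed transitions out of one state, scored under a uniform hypothesis
// (pseudo-count 1 for every target state).
val row = Array(4, 1, 0)
val uniform = Array(1, 1, 1)
println(logEvidenceRow(row, uniform))  // log evidence of this row under the uniform hypothesis
```

Summing such row terms over all states yields the (log) evidence of a hypothesis; SparkTrails distributes exactly this per-state computation.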
Synthetic Datasets
Generally, we use synthetic datasets because we can easily scale their size. Additionally, we can identify the characteristics of SparkTrails with regard to the sparsity of the given datasets. Thus, overall, we work with two types of synthetic datasets: dense datasets and sparse datasets. Both are generated randomly. The general procedure is as follows:
- Define the number of states for the datasets, a seed for the pseudo-random generator and an upper limit for the observation count.
- Create a list with all the states and assign a pseudo-random number, a state seed, to each state, using the seed from step 1.
- For each state, create another list with all the states and assign a pseudo-random value between 1 and the upper limit specified in step 1 to each of them, using the state seed from step 2.
With this approach, we get a dense observation matrix, i.e., a matrix with no zero entries. However, since typical datasets are usually sparse, i.e., many entries are equal to zero, we need a way to imitate this characteristic. Hence, we introduce a so-called row fraction and column fraction. The row fraction determines the number of non-zero rows in the observation matrix; the column fraction determines the number of non-zero entries in the remaining rows. Both are used as the probability of each row/column being non-zero and are applied before iterating through the list of states in steps 2 and 3.
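The procedure above can be sketched as follows (an illustrative re-implementation with names of our own choosing, not the paper's generator code); the matrix is kept as a map from coordinates to counts so zero entries are simply absent:

```scala
import scala.util.Random

// Generate a pseudo-random count matrix following the three steps above,
// with row/column fractions controlling sparsity.
def generateMatrix(numStates: Int, seed: Long, maxCount: Int,
                   rowFraction: Double, colFraction: Double): Map[(Int, Int), Int] = {
  val global = new Random(seed)                        // step 1: one global seed
  (0 until numStates).flatMap { i =>
    val stateSeed = global.nextLong()                  // step 2: a state seed per row
    if (global.nextDouble() >= rowFraction) Seq.empty  // row fraction: row stays all-zero
    else {
      val local = new Random(stateSeed)                // step 3: per-row generator
      (0 until numStates).flatMap { j =>
        if (local.nextDouble() >= colFraction) None    // column fraction: entry stays zero
        else Some((i, j) -> (1 + local.nextInt(maxCount)))  // count in [1, maxCount]
      }
    }
  }.toMap
}

// Row and column fractions of 1.0 reproduce the dense case: no zero entries.
val dense = generateMatrix(4, seed = 42L, maxCount = 10, rowFraction = 1.0, colFraction = 1.0)
```

With fractions below 1.0, the same call yields a sparse matrix in the sense described above.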
Dense Datasets
The dense datasets (s1 and s2) use dense observation and hypothesis matrices created as described above. For s1 we use 93000 states, and for s2 we use 186000 states. In both cases, a single generated matrix serves as both the observation and the hypothesis matrix.
Sparse Datasets
The sparse dataset (r) is randomly created using the row fraction and column fraction as described above. We use 26350000 states and a row as well as a column fraction of 1%. Note that while this is very sparse, the fractions are still larger than, for example, those of the Wikipedia dataset. Again, a single generated matrix serves as both the observation and the hypothesis matrix. In the sparse case, this represents the worst case with regard to runtime, since SparkTrails takes advantage of the general sparsity structure as described in the corresponding paper.
Wikipedia Dataset
The Wikipedia datasets (wself and wnw) are based on transitions between Wikipedia articles.
The observation matrix for both datasets (wself and wnw) contains transition counts extracted from the logs of the desktop version of Wikipedia from February 2015 [1] by Wulczyn and Taraborelli. Log entries identified as coming from spiders and bots are not included. Transitions that occur less than 10 times in the logs are removed. The dataset contains transitions between articles from the main namespace but also includes incoming traffic from external sources. For our experiments, we discard the latter.
For the "self" dataset (wself), we reuse the observation matrix as hypothesis matrix.
For the network dataset (wnw), the hypothesis matrix is based on the underlying link network created from the links between the articles in the XML dump of the English Wikipedia from 4th of March 2015. The transition probability from a state i to a state j is 0 if there is no link from i to j, and 1 divided by the number of outgoing links of state i otherwise.
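This construction can be sketched in a few lines (illustrative names, with the link network given as an adjacency map; the real dataset is of course built from the Wikipedia dump):

```scala
// Turn a link network into a hypothesis matrix:
// P(j|i) = 1 / outdegree(i) if i links to j, and 0 otherwise (absent entries are zero).
def networkHypothesis(links: Map[Int, Set[Int]]): Map[(Int, Int), Double] =
  links.flatMap { case (i, outgoing) =>
    outgoing.map(j => (i, j) -> 1.0 / outgoing.size)
  }

// Toy network: article 0 links to 1 and 2; article 1 links back to 0.
val links = Map(0 -> Set(1, 2), 1 -> Set(0))
val hyp = networkHypothesis(links)
```

Each row of the resulting matrix is a uniform distribution over the outgoing links of the corresponding article.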
-
[1] Ellery Wulczyn, Dario Taraborelli
Wikipedia clickstream
figshare, 2015 (accessed: 2015-12-13)
Flickr Dataset
For the Flickr dataset (f), we consider the phototrails of individual Flickr users. The dataset we use is based on the "Photowalking the city" paper by Becker et al. [1]. In particular, we use the Los Angeles observation matrix and the proximity hypothesis with a spreading parameter of 2.5 km.
The observation matrix is generated by dividing the city area into cells of 200m by 200m and counting the transitions between these grid cells with regard to photo sequences generated by individual Flickr users.
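A minimal sketch of this counting step (our own simplification with illustrative names, assuming planar coordinates in meters and a non-negative coordinate origin):

```scala
// Map photo coordinates to 200 m x 200 m grid cells and count transitions
// between consecutive photos of one user's trail.
val cellSize = 200.0 // meters

// project (x, y) coordinates in meters to a grid-cell id (assumes x, y >= 0)
def cell(x: Double, y: Double): (Int, Int) =
  ((x / cellSize).toInt, (y / cellSize).toInt)

// one user's photo trail as planar coordinates in meters
val trail = Seq((10.0, 20.0), (250.0, 30.0), (260.0, 35.0))

// consecutive photo pairs become grid-cell transitions, which are then counted
val counts = trail.map { case (x, y) => cell(x, y) }
  .sliding(2).collect { case Seq(a, b) => (a, b) }
  .toSeq.groupBy(identity).map { case (k, v) => k -> v.size }
```

Summing such counts over all users yields the observation matrix, with grid cells as states.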
For the hypothesis matrix, the transition probabilities are based on proximity. That is, the transition probability between a state i and a state j is small if the distance between the two states is large: P(j|i) = (1/Z) · exp(−dist(i, j)² / (2σ²)).
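A row of such a proximity hypothesis is a row-normalized Gaussian kernel over the distances to all other cells. A minimal sketch (illustrative names; distances and σ in the same unit, e.g. kilometers):

```scala
// One row of a proximity hypothesis: Gaussian weights over the distances
// from the current cell to all cells, normalized to a probability distribution.
def proximityRow(dists: Array[Double], sigma: Double): Array[Double] = {
  val weights = dists.map(d => math.exp(-d * d / (2 * sigma * sigma)))
  val z = weights.sum                 // the normalization constant Z for this row
  weights.map(_ / z)
}

// spreading parameter sigma = 2.5 km, as used for the Flickr dataset
val row = proximityRow(Array(0.0, 2.5, 5.0), sigma = 2.5)
// probabilities sum to 1; closer cells get higher probability
```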
-
[1] Martin Becker, Philipp Singer, Florian Lemmerich, Andreas Hotho, Denis Helic, Markus Strohmaier
Photowalking the city: comparing hypotheses about urban photo trails on Flickr
Social Informatics, Volume 9471 of the series Lecture Notes in Computer Science, 2015
Baselines
While a standard Python implementation of HypTrails exists, we use a custom, simplified Python implementation as a baseline. The original implementation is flexible with regard to several special cases and allows the use of specialized storage mechanisms for large observation and hypothesis matrices. However, in our evaluation scenario, we found it to be slow in comparison to our custom, simplified version. Thus, for evaluating SparkTrails, we use the simplified version, which will be available on GitHub.
Authors
-
Martin Becker
University of Würzburg
http://www.is.informatik.uni-wuerzburg.de/staff/becker
-
Hauke Mewes
University of Würzburg
-
Andreas Hotho
University of Würzburg
http://www.is.informatik.uni-wuerzburg.de/staff/hotho/
-
Dimitar Dimitrov
GESIS - Leibniz Institute for the Social Sciences
http://www.dimitardimitrov.info
-
Florian Lemmerich
GESIS - Leibniz Institute for the Social Sciences
http://www.is.informatik.uni-wuerzburg.de/staff/lemmerich/
-
Markus Strohmaier
GESIS - Leibniz Institute for the Social Sciences and University of Koblenz-Landau
http://www.gesis.org/das-institut/mitarbeiterverzeichnis/?name=markus%2Cstrohmaier