SparkTrails
SparkTrails is a MapReduce implementation of HypTrails based on Apache Spark, enabling fast, scalable, and distributed comparison of hypotheses about human trails.
For a primer, we refer to the paper introducing SparkTrails as well as the original HypTrails paper. In the following, we give additional details on SparkTrails and how to use it, and elaborate on the evaluation we have conducted in the SparkTrails paper.
-
Martin Becker, Hauke Mewes, Andreas Hotho, Dimitar Dimitrov, Florian Lemmerich, Markus Strohmaier
SparkTrails: A MapReduce Implementation of HypTrails for Comparing Hypotheses About Human Trails
Proceedings of the 25th International Conference on World Wide Web, 2016
-
Philipp Singer, Denis Helic, Andreas Hotho, Markus Strohmaier
HypTrails: A Bayesian Approach for Comparing Hypotheses About Human Trails on the Web
Proceedings of the 24th International Conference on World Wide Web, 2015
Quickstart
In this section, we give a quickstart tutorial on how to get and use SparkTrails. SparkTrails is implemented for Apache Spark and is written in Scala. Thus, the relevant code snippets are mostly in Scala. For more details, also see the GitHub page.
Get and build SparkTrails:
Get Spark and run the Spark Shell including SparkTrails:
Run our first example:
You can plot the results, for example, using R. For additional examples, we refer to the SparkTrails repository, where we will successively add more tutorials.
Evaluation
For a general evaluation we refer to the SparkTrails paper. In this section we elaborate on the datasets used for runtime experiments as well as the baseline code.
Datasets
In the SparkTrails paper, we use several datasets to evaluate the runtime of our approach. This includes several synthetic datasets, a Wikipedia dataset, and a Flickr dataset, all aimed at demonstrating the scalability of SparkTrails. In this section, we elaborate on these datasets.
Generally, HypTrails works with trails over a finite set of states in a first-order Markov chain scenario. Thus, each evaluation dataset consists of two matrices: an observation matrix and a hypothesis matrix. The observation matrix contains the number of observations for each state transition, and the hypothesis matrix contains the transition probabilities between states according to a user hypothesis.
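To make the role of the two matrices concrete, the following sketch (our own illustration, not the SparkTrails API) shows how HypTrails scores a single state: the hypothesis row is expressed as Dirichlet pseudo-counts and combined with the observed transition counts via the Dirichlet-Multinomial marginal likelihood. For simplicity, the pseudo-counts are assumed to be integers, so the log-gamma function reduces to a sum of logarithms.

```scala
// log Γ(n) for integer n >= 1, via log(n!) = sum of logs (sufficient for integer pseudo-counts)
def logGammaInt(n: Int): Double = (1 until n).map(k => math.log(k.toDouble)).sum

// Log marginal likelihood of one row of transition counts under a
// Dirichlet prior with integer pseudo-counts derived from a hypothesis.
def logEvidenceRow(counts: Array[Int], alphas: Array[Int]): Double = {
  val post = counts.zip(alphas).map { case (n, a) => n + a }  // posterior pseudo-counts
  (logGammaInt(alphas.sum) - logGammaInt(post.sum)
    + post.map(logGammaInt).sum - alphas.map(logGammaInt).sum)
}

// Observed transitions out of one state, scored under a uniform hypothesis
// (pseudo-count 1 for every target state).
val row = Array(4, 1, 0)
val uniform = Array(1, 1, 1)
println(logEvidenceRow(row, uniform))  // log evidence of this row under the uniform hypothesis
```

Summing such row terms over all states yields the (log) evidence of a hypothesis; SparkTrails distributes exactly this per-state computation.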
Synthetic Datasets
Generally, we use synthetic datasets because we can easily scale their size. Additionally, we can identify the characteristics of SparkTrails with regard to the sparsity of the given datasets. Thus, overall, we work with two types of synthetic datasets: dense datasets and sparse datasets. Both are generated randomly. The general procedure is as follows:
- Define the number of states for the datasets, a seed for the pseudo-random generator and an upper limit for the observation count.
- Create a list with all the states and assign a pseudo-random number, a state seed, to each state, using the seed from step 1.
- For each state, create another list with all the states and assign a pseudo-random value between 1 and the upper limit specified in step 1 to each of them, using the state seed from step 2.
With this approach, we get a dense observation matrix, i.e., a matrix with no zero entries. However, since typical datasets are usually sparse, i.e., many entries are equal to zero, we need a way to imitate this characteristic. Hence, we introduce a so-called row fraction and column fraction. The row fraction determines the number of non-zero rows in the observation matrix; the column fraction determines the number of non-zero entries in the remaining rows. Both are used as the probability of each row/column being non-zero and are applied before iterating through the list of states in steps 2 and 3.
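The procedure above can be sketched as follows (an illustrative re-implementation with names of our own choosing, not the paper's generator code); the matrix is kept as a map from coordinates to counts so zero entries are simply absent:

```scala
import scala.util.Random

// Generate a pseudo-random count matrix following the three steps above,
// with row/column fractions controlling sparsity.
def generateMatrix(numStates: Int, seed: Long, maxCount: Int,
                   rowFraction: Double, colFraction: Double): Map[(Int, Int), Int] = {
  val global = new Random(seed)                        // step 1: one global seed
  (0 until numStates).flatMap { i =>
    val stateSeed = global.nextLong()                  // step 2: a state seed per row
    if (global.nextDouble() >= rowFraction) Seq.empty  // row fraction: row stays all-zero
    else {
      val local = new Random(stateSeed)                // step 3: per-row generator
      (0 until numStates).flatMap { j =>
        if (local.nextDouble() >= colFraction) None    // column fraction: entry stays zero
        else Some((i, j) -> (1 + local.nextInt(maxCount)))  // count in [1, maxCount]
      }
    }
  }.toMap
}

// Row and column fractions of 1.0 reproduce the dense case: no zero entries.
val dense = generateMatrix(4, seed = 42L, maxCount = 10, rowFraction = 1.0, colFraction = 1.0)
```

With fractions below 1.0, the same call yields a sparse matrix in the sense described above.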
Dense Datasets
The dense datasets (s1 and s2) use dense observation and hypothesis matrices created as described above. For s1 we use 93000 states, and for s2 we use 186000 states. In both cases, a single generated matrix serves as both the observation and the hypothesis matrix.
Sparse Datasets
The sparse dataset (r) is randomly created using the row fraction and column fraction as described above. We use 26350000 states and a row as well as a column fraction of 1%. Note that while this is very sparse, the fractions are still larger than, for example, those of the Wikipedia dataset. Again, a single generated matrix serves as both the observation and the hypothesis matrix. In the sparse case, this represents the worst case with regard to runtime, since SparkTrails takes advantage of the general sparsity structure as described in the corresponding paper.
Wikipedia Dataset
The Wikipedia datasets (wself and wnw) are based on transitions between Wikipedia articles.
The observation matrix for both datasets (wself and wnw) contains transition counts extracted from the logs of the desktop version of Wikipedia from February 2015 [1] by Wulczyn and Taraborelli. Log entries identified as coming from spiders and bots are not included. Transitions that occur less than 10 times in the logs are removed. The dataset contains transitions between articles from the main namespace but also includes incoming traffic from external sources. For our experiments, we discard the latter.
For the "self" dataset (wself), we reuse the observation matrix as hypothesis matrix.
For the network dataset (wnw), the hypothesis matrix is based on the underlying link network created from the links between the articles in the XML dump of the English Wikipedia from 4th of March 2015. The transition probability from a state i to a state j is 0 if there is no link from i to j, and 1 divided by the number of outgoing links of state i otherwise.
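This construction can be sketched in a few lines (illustrative names, with the link network given as an adjacency map; the real dataset is of course built from the Wikipedia dump):

```scala
// Turn a link network into a hypothesis matrix:
// P(j|i) = 1 / outdegree(i) if i links to j, and 0 otherwise (absent entries are zero).
def networkHypothesis(links: Map[Int, Set[Int]]): Map[(Int, Int), Double] =
  links.flatMap { case (i, outgoing) =>
    outgoing.map(j => (i, j) -> 1.0 / outgoing.size)
  }

// Toy network: article 0 links to 1 and 2; article 1 links back to 0.
val links = Map(0 -> Set(1, 2), 1 -> Set(0))
val hyp = networkHypothesis(links)
```

Each row of the resulting matrix is a uniform distribution over the outgoing links of the corresponding article.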
-
[1] Ellery Wulczyn, Dario Taraborelli
Wikipedia clickstream
figshare, 2015 (accessed: 2015-12-13)
Flickr Dataset
For the Flickr dataset (f), we consider the phototrails of individual Flickr users. The dataset we use is based on the "Photowalking the city" paper by Becker et al. [1]. In particular, we use the Los Angeles observation matrix and the proximity hypothesis with a spreading parameter of 2.5 km.
The observation matrix is generated by dividing the city area into cells of 200m by 200m and counting the transitions between these grid cells with regard to photo sequences generated by individual Flickr users.
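A minimal sketch of this counting step (our own simplification with illustrative names, assuming planar coordinates in meters and a non-negative coordinate origin):

```scala
// Map photo coordinates to 200 m x 200 m grid cells and count transitions
// between consecutive photos of one user's trail.
val cellSize = 200.0 // meters

// project (x, y) coordinates in meters to a grid-cell id (assumes x, y >= 0)
def cell(x: Double, y: Double): (Int, Int) =
  ((x / cellSize).toInt, (y / cellSize).toInt)

// one user's photo trail as planar coordinates in meters
val trail = Seq((10.0, 20.0), (250.0, 30.0), (260.0, 35.0))

// consecutive photo pairs become grid-cell transitions, which are then counted
val counts = trail.map { case (x, y) => cell(x, y) }
  .sliding(2).collect { case Seq(a, b) => (a, b) }
  .toSeq.groupBy(identity).map { case (k, v) => k -> v.size }
```

Summing such counts over all users yields the observation matrix, with grid cells as states.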
For the hypothesis matrix, the transition probabilities are based on proximity. That is, the transition probability between a state i and a state j is small if the distance between the two states is large: P(j|i) = (1/Z) · exp(−dist(i, j)² / (2σ²)).
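A row of such a proximity hypothesis is a row-normalized Gaussian kernel over the distances to all other cells. A minimal sketch (illustrative names; distances and σ in the same unit, e.g. kilometers):

```scala
// One row of a proximity hypothesis: Gaussian weights over the distances
// from the current cell to all cells, normalized to a probability distribution.
def proximityRow(dists: Array[Double], sigma: Double): Array[Double] = {
  val weights = dists.map(d => math.exp(-d * d / (2 * sigma * sigma)))
  val z = weights.sum                 // the normalization constant Z for this row
  weights.map(_ / z)
}

// spreading parameter sigma = 2.5 km, as used for the Flickr dataset
val row = proximityRow(Array(0.0, 2.5, 5.0), sigma = 2.5)
// probabilities sum to 1; closer cells get higher probability
```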
-
[1] Martin Becker, Philipp Singer, Florian Lemmerich, Andreas Hotho, Denis Helic, Markus Strohmaier
Photowalking the city: comparing hypotheses about urban photo trails on Flickr
Social Informatics, Volume 9471 of the series Lecture Notes in Computer Science, 2015
Baselines
While a standard Python implementation of HypTrails exists, we use a custom, simplified Python implementation as a baseline. The original implementation is flexible with regard to several special cases and allows the use of specialized storage mechanisms for large observation and hypothesis matrices. However, in our evaluation scenario, we found it to be slow in comparison to our custom, simplified version. Thus, for evaluating SparkTrails, we use the simplified version, which will be available on GitHub.
Authors
-
Martin Becker
University of Würzburg
http://www.is.informatik.uni-wuerzburg.de/staff/becker
-
Hauke Mewes
University of Würzburg
-
Andreas Hotho
University of Würzburg
http://www.is.informatik.uni-wuerzburg.de/staff/hotho/
-
Dimitar Dimitrov
GESIS - Leibniz Institute for the Social Sciences
http://www.dimitardimitrov.info
-
Florian Lemmerich
GESIS - Leibniz Institute for the Social Sciences
http://www.is.informatik.uni-wuerzburg.de/staff/lemmerich/
-
Markus Strohmaier
GESIS - Leibniz Institute for the Social Sciences and University of Koblenz-Landau
http://www.gesis.org/das-institut/mitarbeiterverzeichnis/?name=markus%2Cstrohmaier