For a primer, we refer to the paper introducing SparkTrails as well as the original HypTrails paper. In the following, we give additional details on SparkTrails and how to use it, and elaborate on the evaluation conducted in the SparkTrails paper.


In this section we give a quickstart tutorial on how to obtain and use SparkTrails. SparkTrails is implemented for Apache Spark and written in Scala; thus, the relevant code snippets are mostly Scala. For more details, also see the GitHub page.

Get and build SparkTrails:

# clone SparkTrails repository
git clone

# change to repository directory
cd sparktrails

# build and package SparkTrails using Maven
mvn package

Get Spark and run the Spark Shell including SparkTrails:

# download Spark

# unpack Spark
tar -xf spark-1.5.2-bin-hadoop2.6.tgz

# run the shell
./spark-1.5.2-bin-hadoop2.6/bin/spark-shell --jars target/sparktrails-core_2.10-1.0.0.jar

Run our first example:

// imports
import org.dmir.sparktrails.Matrix._
import org.dmir.sparktrails.MatrixUtils
import org.dmir.sparktrails.HypTrails
import org.dmir.sparktrails.row.MarkovChainK

// set the number of states
val numberOfStates = 4

// define prefix for files
val prefix = "src/main/resources/tutorial/quickstart"

// load the transition count matrix
val d = MatrixUtils.loadRowMatrix(s"$prefix/transitioncounts.row")(sc)

// load hypotheses
val h1 = MatrixUtils.loadRowMatrix(s"$prefix/hypothesis1.row")(sc)
val h2 = MatrixUtils.loadRowMatrix(s"$prefix/hypothesis2.row")(sc)

// define the concentration parameters to use
val ks = Seq[Double](1,2,3,4,5,10,100,1000,10000)

// elicit prior parameters
val h1Elicited = HypTrails.Prior.elicit(h1.toMapRows, ks)
val h2Elicited = HypTrails.Prior.elicit(h2.toMapRows, ks)

// calculate the log of the marginal likelihood
// for each hypothesis and each k
val e1 = MarkovChainK.evidence(numberOfStates, ks.size, d, h1Elicited)(sc)
val e2 = MarkovChainK.evidence(numberOfStates, ks.size, d, h2Elicited)(sc)

// output results
import java.io.PrintWriter

val pw = new PrintWriter("results.csv")
pw.append("hypothesis,k,evidence\n")
ks.zip(e1).foreach { case (k,e) => pw.append(s"h1,$k,$e\n") }
ks.zip(e2).foreach { case (k,e) => pw.append(s"h2,$k,$e\n") }
pw.close()
You can plot the results, for example, using R:

# load ggplot2
library(ggplot2)

# read evidence values
e <- read.csv("results.csv")

# plot evidence curves
ggplot(e, aes(x = k, y = evidence, colour = hypothesis)) + geom_point() + geom_line()

For additional examples, we refer to the SparkTrails repository, where we will successively add more tutorials.


For a general evaluation we refer to the SparkTrails paper. In this section we elaborate on the datasets used for runtime experiments as well as the baseline code.


In the SparkTrails paper, we use several datasets to evaluate the runtime of our approach. This includes several synthetic datasets, a Wikipedia dataset, and a Flickr dataset, all aimed at demonstrating the scalability of SparkTrails. In this section we elaborate on these datasets.

Generally, HypTrails works with trails on a finite set of states in a Markov chain scenario of order 1. Thus, each evaluation dataset consists of two matrices: an observation matrix and a hypothesis matrix. The observation matrix contains the number of observations for each state transition, and the hypothesis matrix contains the transition probabilities between states according to a user hypothesis.
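As a toy illustration in plain Scala (not the actual SparkTrails file format or API), the two matrices can be represented as sparse row maps, with each hypothesis row forming a probability distribution over successor states:

```scala
// Hypothetical toy data: sparse observation counts (state -> successor -> count).
val observations: Map[Int, Map[Int, Double]] = Map(
  0 -> Map(1 -> 5.0, 2 -> 3.0),
  1 -> Map(0 -> 2.0),
  2 -> Map(0 -> 1.0, 1 -> 4.0)
)

// A hypothesis assigns each state a probability distribution over successors;
// here we simply normalize each row of the counts as an example hypothesis.
val hypothesis: Map[Int, Map[Int, Double]] = observations.map { case (i, row) =>
  val total = row.values.sum
  i -> row.map { case (j, count) => j -> count / total }
}
```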

Synthetic Datasets

Generally, we use synthetic datasets because we can easily scale their size. Additionally, we can identify the characteristics of SparkTrails with regard to the sparsity of the given datasets. Thus, overall, we work with two types of synthetic datasets: dense datasets and sparse datasets. Both are generated randomly. The general procedure is as follows:

  1. Define the number of states for the datasets, a seed for the pseudo-random generator and an upper limit for the observation count.
  2. Create a list with all the states and assign a pseudo-random number, a state seed, to each state, using the seed from step 1.
  3. For each state, create another list with all the states and assign a pseudo-random value between 1 and the upper limit specified in step 1 to each of them, using the state seed from step 2.

With this approach, we get a dense observation matrix, i.e., a matrix with no zero entries. However, since typical datasets are usually sparse, i.e., many entries are equal to zero, we need a way to imitate this characteristic. Hence, we introduce a so-called row fraction and column fraction. The row fraction is used to determine the number of non-zero rows in our observation matrix, the column fraction to determine the number of non-zero entries in the remaining rows. Both are used as a probability for each row/column to be zero or not and are applied before iterating through the list of states in steps 2 and 3.
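The procedure above, including the row and column fractions, can be sketched in plain Scala; the function and parameter names here are ours, not those of the actual SparkTrails generator. Setting both fractions to 1.0 yields the dense case:

```scala
import scala.util.Random

// Sketch of the synthetic dataset generation (illustrative, not the real generator).
// Returns a sparse matrix: state -> (successor -> observation count).
def generate(numberOfStates: Int, seed: Long, upperLimit: Int,
             rowFraction: Double, colFraction: Double): Map[Int, Map[Int, Int]] = {
  val global = new Random(seed)                 // step 1: global seed
  (0 until numberOfStates).flatMap { i =>
    val stateSeed = global.nextLong()           // step 2: per-state seed
    if (global.nextDouble() < rowFraction) {    // row fraction: keep this row?
      val local = new Random(stateSeed)
      val row = (0 until numberOfStates).flatMap { j =>
        if (local.nextDouble() < colFraction)   // column fraction: keep this entry?
          Some(j -> (local.nextInt(upperLimit) + 1)) // step 3: count in [1, upperLimit]
        else None
      }.toMap
      Some(i -> row)
    } else None
  }.toMap
}
```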

Dense Datasets

The dense datasets (s1 and s2) use dense observation and hypothesis matrices created as described above. For s1 we use 93000 states and for s2 we use 186000 states. In both cases we use a single generated matrix as both observation and hypothesis matrix.

Sparse Datasets

The sparse dataset (r) is randomly created using a row fraction and a column fraction as described above. We use 26350000 states and a row as well as a column fraction of 1%. Note that while this is very sparse, the fractions are still larger than, for example, those of the Wikipedia dataset. Again, we use a single generated matrix as both observation and hypothesis matrix. In the sparse case, this represents the worst case with regard to runtime, since SparkTrails takes advantage of the general sparsity structure as described in the corresponding paper.

Wikipedia Dataset

The Wikipedia datasets (wself and wnw) are based on transitions between Wikipedia articles.

The observation matrix for both datasets (wself and wnw) contains transition counts extracted from the logs of the desktop version of Wikipedia from February 2015 [1] by Wulczyn and Taraborelli. Log entries identified as coming from spiders and bots are not included. Transitions that occur fewer than 10 times in the logs are removed. The dataset contains transitions between articles from the main namespace but also includes incoming traffic from external sources. For our experiments, we discard the latter.

For the "self" dataset (wself), we reuse the observation matrix as hypothesis matrix.

For the network dataset (wnw), the hypothesis matrix uses the underlying link network created from the links between the articles in the XML dump of the English Wikipedia from March 4, 2015. The transition probability between a state i and a state j is 0 if there is no link from i to j, and 1 over the number of outgoing links of state i otherwise.
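A minimal sketch of this construction in plain Scala, assuming a toy link network (the identifiers are ours and not part of SparkTrails):

```scala
// Toy link network: article -> set of articles it links to.
val links: Map[Int, Set[Int]] = Map(0 -> Set(1, 2), 1 -> Set(0), 2 -> Set.empty)

// Network hypothesis: uniform probability over the out-links of state i,
// and 0 for any state j that i does not link to.
def networkHypothesis(i: Int, j: Int): Double = {
  val out = links.getOrElse(i, Set.empty)
  if (out.contains(j)) 1.0 / out.size else 0.0
}
```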

Flickr Dataset

For the Flickr dataset (f), we consider phototrails of individual Flickr users. The dataset we use is based on the "Photowalking the city" paper by Becker et al. [1]. In particular, we use the Los Angeles observation matrix and the proximity hypothesis with a spreading parameter of 2.5km.

The observation matrix is generated by dividing the city area into cells of 200m by 200m and counting the transitions between these grid cells with regard to photo sequences generated by individual Flickr users.

For the hypothesis matrix, the transition probabilities are based on proximity. That is, the transition probability between a state i and a state j is small if the distance between the two states is large: P(j|i) = (1/Z) · exp(−dist(i, j)² / (2σ²)).
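A minimal Scala sketch of such a proximity hypothesis, assuming toy grid-cell coordinates and normalizing over all target states (including self-transitions, which is our assumption, not necessarily the paper's setup):

```scala
// Toy grid-cell coordinates (e.g. cell centers in km); purely illustrative.
val cells: Seq[(Double, Double)] = Seq((0.0, 0.0), (0.0, 1.0), (3.0, 4.0))
val sigma = 2.5 // spreading parameter

// Euclidean distance between two cells.
def dist(a: (Double, Double), b: (Double, Double)): Double =
  math.hypot(a._1 - b._1, a._2 - b._2)

// P(j|i) = (1/Z) * exp(-dist(i,j)^2 / (2 * sigma^2)), Z normalizing row i.
def proximity(i: Int, j: Int): Double = {
  val weights = cells.map { c =>
    math.exp(-math.pow(dist(cells(i), c), 2) / (2 * sigma * sigma))
  }
  weights(j) / weights.sum
}
```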


While a standard Python implementation of HypTrails exists, we use a custom, simplified Python implementation as a baseline. The original implementation is flexible with regard to several special cases and allows the use of specialized storage mechanisms for large observation and hypothesis matrices. However, in our evaluation scenario, we found this implementation to be slow in comparison to our custom, simplified version. Thus, for evaluating SparkTrails, we use this simplified version, which will be available on GitHub.



Contact us for further information.