Adam Warski

8 Oct 2013

Creating an on-line recommender system with Apache Mahout

machine learning

Recently we’ve been implementing a recommender system for Yap.TV: you can see it in action after installing the app and going to the “Just for you” tab. We’re using Apache Mahout as the base for doing recommendations. Mahout is a “scalable machine learning library” and contains both local and distributed implementations of user- and item-based recommenders using collaborative filtering algorithms.


For now we’ll focus on the local, single-machine implementation. It should work well if you have up to tens of millions of preference values. Above that, you should probably consider the Hadoop-based implementation, as the data simply won’t fit into memory.

Writing a basic recommender with Mahout is quite simple. As Mahout is very configurable, there are usually several implementations to choose from; I’ll just describe what I think are “good starting points”.


First you need a file with the input data. The format is quite simple: comma-separated lines containing either (user id, item id) pairs or (user id, item id, preference value) triples. This expresses what you already know: which users like which items, and optionally how much (e.g. on a 1-5 scale). The ids must be integers; the preference value is treated as a float.
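To make the format concrete, here is a small sketch with made-up ids and values. The `Preference` case class and `parseLine` helper are hypothetical names used only for illustration; Mahout reads the file itself, so you don't need a parser like this in practice.

```scala
// Made-up sample lines for the two variants of the input format
val unaryLines = List("1,101", "1,102", "2,101")             // user id, item id
val ratedLines = List("1,101,5.0", "1,102,3.0", "2,101,4.5") // user id, item id, preference value

// Hypothetical helper type, not part of Mahout
case class Preference(userId: Long, itemId: Long, value: Option[Float])

// Parses one line of either variant; ids must be integers, the value is a float
def parseLine(line: String): Preference =
  line.split(",").map(_.trim) match {
    case Array(user, item)        => Preference(user.toLong, item.toLong, None)
    case Array(user, item, value) => Preference(user.toLong, item.toLong, Some(value.toFloat))
    case _                        => throw new IllegalArgumentException(s"Malformed line: $line")
  }

val unaryPrefs = unaryLines.map(parseLine)
val ratedPrefs = ratedLines.map(parseLine)
```

A single file should stick to one of the two variants, as the choice of similarity measure and recommender depends on whether preference values are present.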

Let’s first create a user-based recommender: that is, a recommender which, when asked for recommendations for user A, first looks up users “similar” to A, and then tries to find the best items which these similar users have rated, but A hasn’t. To do that, we need to create 4 components:

  • data model: this will use the file
  • user similarity: a measure which, given two users, returns a number representing how similar they are
  • neighborhood: for finding the neighborhood of a given user
  • recommender: which ties these pieces together to produce recommendations

For unary input data (where users either like items or we don’t know), a good starting point is:

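import org.apache.mahout.cf.taste.impl.model.file.FileDataModel
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood
import org.apache.mahout.cf.taste.impl.recommender.GenericBooleanPrefUserBasedRecommender
import org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity
val dataModel = new FileDataModel(file) // file is a java.io.File with the input data
val userSimilarity = new LogLikelihoodSimilarity(dataModel)
val neighborhood = new NearestNUserNeighborhood(25, userSimilarity, dataModel) // 25 nearest users
val recommender = new GenericBooleanPrefUserBasedRecommender(dataModel, neighborhood, userSimilarity)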

If we have preference values (triples in the input data):

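import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity
val dataModel = new FileDataModel(file)
val userSimilarity = new PearsonCorrelationSimilarity(dataModel)
val neighborhood = new NearestNUserNeighborhood(25, userSimilarity, dataModel)
val recommender = new GenericUserBasedRecommender(dataModel, neighborhood, userSimilarity)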
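To give an idea of what a user similarity measure computes, here is a minimal plain-Scala sketch of the Pearson correlation over the items two users have both rated. This is not Mahout's code, just the textbook formula; the `pearson` function and the sample ratings are made up for illustration.

```scala
// Ratings per user: item id -> preference value (made-up numbers)
val alice = Map(1L -> 5.0, 2L -> 3.0, 3L -> 4.0)
val bob   = Map(1L -> 4.0, 2L -> 2.0, 3L -> 3.0, 4L -> 5.0)
val carol = Map(1L -> 1.0, 2L -> 3.0, 3L -> 2.0)

// Pearson correlation over the items both users have rated.
// Returns None when it is undefined: fewer than 2 common items or zero variance.
def pearson(a: Map[Long, Double], b: Map[Long, Double]): Option[Double] = {
  val common = a.keySet.intersect(b.keySet).toSeq
  if (common.size < 2) None
  else {
    val xs = common.map(a)
    val ys = common.map(b)
    val meanX = xs.sum / xs.size
    val meanY = ys.sum / ys.size
    val cov  = xs.zip(ys).map { case (x, y) => (x - meanX) * (y - meanY) }.sum
    val varX = xs.map(x => (x - meanX) * (x - meanX)).sum
    val varY = ys.map(y => (y - meanY) * (y - meanY)).sum
    if (varX == 0.0 || varY == 0.0) None
    else Some(cov / math.sqrt(varX * varY))
  }
}

val aliceBob   = pearson(alice, bob)   // same ordering of items: strongly positive
val aliceCarol = pearson(alice, carol) // opposite ordering: strongly negative
```

The result is between -1 and 1. A neighborhood of the N nearest users is then simply the N users with the highest similarity to the user in question.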

Now we are ready to get some recommendations; this is as simple as:

import scala.collection.JavaConverters._

// Gets 10 recommendations
val result = recommender.recommend(userId, 10)

// We get back a Java list of (item id, estimated preference value) pairs,
// sorted from the highest score
result.asScala.foreach(r => println(r.getItemID() + ": " + r.getValue()))


What about the on-line aspect? The above will work great for existing users, but what about new users who register in the service? We certainly want to provide some reasonable recommendations for them as well. Creating a recommender instance is expensive (it definitely takes longer than a “normal” network request), so we can’t just create a new recommender each time.

Luckily, Mahout supports adding temporary users to a data model. The general setup then is:

  • periodically re-create the whole recommender using current data (e.g. each day or each hour – depending on how long it takes)
  • when doing a recommendation, check if the user exists in the system
  • if yes, do the recommendation as always
  • if no, create a temporary user, fill in the preferences, and do the recommendation

The first part (periodically re-creating the recommender) may actually be quite tricky if you are limited on memory: when creating the new recommender, you need to hold two copies of the data in memory (to still be able to serve requests from the old one). But as that doesn’t really have anything to do with recommendations, I won’t go into details here.
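The swap itself can be sketched briefly, though. The snippet below is only one possible approach, assuming a background job that builds a complete new instance and then atomically replaces the old one; the `Refreshable` class is a hypothetical helper, and plain strings stand in for recommender instances to keep the example self-contained.

```scala
import java.util.concurrent.atomic.AtomicReference

// Holds the instance currently used to serve requests and lets a
// background job replace it once a new one has been fully built.
class Refreshable[T](initial: T) {
  private val current = new AtomicReference[T](initial)

  // Called by request-handling threads
  def get: T = current.get()

  // Called by the periodic job. `build` runs to completion before the swap,
  // so the old and the new instance are both in memory while it runs.
  def refresh(build: () => T): Unit = {
    val fresh = build()
    current.set(fresh)
  }
}

// Strings stand in for recommender instances here
val holder = new Refreshable[String]("recommender built at startup")
holder.refresh(() => "recommender built from fresh data")
```

Requests always go through `holder.get`, so they keep using the old instance until the new one is ready; the old one becomes garbage once no request references it any more.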

As for the temporary users, we can wrap our data model with a PlusAnonymousConcurrentUserDataModel instance. This class lets us obtain a temporary user id; the id must later be released so that it can be re-used (there’s a limited number of such ids). After obtaining the id, we have to fill in the preferences, and then we can proceed with the recommendation as always:

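val dataModel = new PlusAnonymousConcurrentUserDataModel(
    new FileDataModel(file),
    // size of the pool of temporary user ids; maxAnonymousUsers is a
    // placeholder name, pick a value that fits your traffic
    maxAnonymousUsers)

// built on top of dataModel, as shown earlier
val recommender: Recommender = ...

// we are assuming a unary model: we only know which items a user likes
def recommendFor(userId: Long, userPreferences: List[Long]) = {
  if (userExistsInDataModel(userId)) {
    recommendForExistingUser(userId)
  } else {
    recommendForNewUser(userPreferences)
  }
}

def recommendForNewUser(userPreferences: List[Long]) = {
  val tempUserId = dataModel.takeAvailableUser()

  try {
    // filling in a Mahout data structure with the user's preferences
    val tempPrefs = new BooleanUserPreferenceArray(userPreferences.size)
    tempPrefs.setUserID(0, tempUserId)
    userPreferences.zipWithIndex.foreach { case (preference, idx) =>
      tempPrefs.setItemID(idx, preference)
    }
    dataModel.setTempPrefs(tempPrefs, tempUserId)

    // from now on the temporary user can be treated like any other user
    recommendForExistingUser(tempUserId)
  } finally {
    // give the temporary id back so that it can be re-used
    dataModel.releaseUser(tempUserId)
  }
}

def recommendForExistingUser(userId: Long) = {
  recommender.recommend(userId, 10)
}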
Incorporating business logic

It often happens that we want to boost the score of selected items because of some business rules. In our use-case, for example, if a show has a new episode, we want to give it a higher score. That’s possible using Mahout’s IDRescorer interface. An instance of a rescorer is passed when invoking Recommender.recommend. For example:

val rescorer = new IDRescorer {
  // boost new shows by 20%, leave other scores unchanged
  def rescore(id: Long, originalScore: Double) = {
    if (showIsNew(id)) {
      originalScore * 1.2
    } else {
      originalScore
    }
  }

  // we don't exclude any items from the results
  def isFiltered(id: Long) = false
}

// Gets 10 recommendations
val result = recommender.recommend(userId, 10, rescorer)


Mahout is a great basis for creating recommenders. It’s very configurable and provides many extension points. There’s still quite a lot of work in picking the right configuration parameter values, setting up rescoring and evaluating the recommendation results, but the algorithms are solid, so there’s one thing less to worry about.

There’s also a very good book, Mahout in Action, which covers recommender systems and other components of Mahout. It’s based on version 0.5 (current one is 0.8), but the code examples mostly work and the main logic of the project is the same.

