A better algorithm isn't enough to fix Netflix's recommendations

by Stefan Klocek on August 7, 2009

There has been a lot of hype recently as Netflix announced the provisional winners of its million-dollar contest to improve its recommendation algorithm. The goal was to improve matching by 10%. Since it took over 50,000 entrants the better part of three years to get past 10%, this is probably a trick Netflix can only pull off once. Given that the current recommendation engine does a miserable job of recommending movies for me, even a 10% improvement isn't likely to be particularly satisfying.

I've rated just shy of 800 movies on Netflix and just over 150 items on Amazon, yet Amazon's recommendations are usually satisfying while Netflix struggles to accurately recommend any movies I'd like to see. This isn't a case of esoteric movie tastes. In fact, I'm fairly mainstream, largely preferring the entertainment of a summer blockbuster to the intellectualism of an indie documentary. The books I like are the opposite: non-fiction, obscure, expensive, limited-run, or out of print. In short, not popular. And still, Amazon recommends the right books.

Pandora is a music service that delights me by consistently recommending new music I like. Netflix can't give me great recommendations. Amazon and Pandora do. Why?

Clearly the algorithm is a critical part of any good recommendation engine. But there seem to be limits to what can be accomplished just by tweaking the math. If Netflix can only squeeze a 10% improvement out of the calculation for recommendations, where can they turn to get additional increases in quality?

Tweaking what happens before and after the algorithm seems to be the only other opportunity. Both of these are ultimately interaction design solutions. Let's take a look at a few approaches to recommendations used by Netflix, Amazon and Pandora and see how they lead to different results.


Upstream: Garbage in, garbage out


Pandora made a deep investment in creating a way for experts to tag songs before they are ever recommended to users. Their Music Genome Project employs music experts who listen to and tag songs with up to 400 different distinguishing characteristics. This early and expert interaction paradigm means that recommendations can be made for nuances in songs that I'm not even consciously aware of.


Primary matchmaking for recommendations in Pandora is done by matching the qualities of a song I rate with other songs tagged with similar qualities. Because the assignment of distinct qualities is made by an expert Pandora employee before the song enters the database, nothing I rate will ever change these qualities.

My ratings (of like or dislike) simply help the system understand my preferences. Other people's ratings are never surfaced to me. I am in a walled garden. My ratings help match my music desires to the independent set of song qualities. Because ratings don't contaminate the song qualities dataset—and recommendations are based largely on song qualities—other people can't dilute what is recommended to me.
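The matching described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Pandora's actual method: the trait names, the scores, and the helper names (`SONG_TRAITS`, `build_profile`, `recommend_by_traits`) are all invented, and cosine similarity is simply one plausible way to compare a listener's profile with a song's tags. The point it shows is that likes and dislikes shape only the listener's profile, while the expert-assigned traits are never modified.

```python
from math import sqrt

# Hypothetical expert-assigned trait scores (0.0 to 1.0) per song.
# Trait names and values are invented for illustration only.
SONG_TRAITS = {
    "song_a": {"vamping": 0.9, "melodic_phrasing": 0.8, "acoustic": 0.1},
    "song_b": {"vamping": 0.8, "melodic_phrasing": 0.7, "acoustic": 0.2},
    "song_c": {"vamping": 0.1, "melodic_phrasing": 0.2, "acoustic": 0.9},
}


def cosine(a, b):
    """Cosine similarity between two sparse trait vectors (dicts)."""
    keys = set(a) | set(b)
    dot = sum(a.get(k, 0.0) * b.get(k, 0.0) for k in keys)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_profile(liked, disliked, traits):
    """Add the traits of liked songs and subtract those of disliked songs."""
    profile = {}
    for song in liked:
        for trait, value in traits[song].items():
            profile[trait] = profile.get(trait, 0.0) + value
    for song in disliked:
        for trait, value in traits[song].items():
            profile[trait] = profile.get(trait, 0.0) - value
    return profile


def recommend_by_traits(liked, disliked, traits):
    """Rank unrated songs by similarity to the listener's own profile.

    Only this listener's ratings are used; the expert tags are read-only.
    """
    profile = build_profile(liked, disliked, traits)
    rated = set(liked) | set(disliked)
    candidates = [song for song in traits if song not in rated]
    return sorted(candidates, key=lambda s: cosine(profile, traits[s]), reverse=True)
```

Calling `recommend_by_traits(["song_a"], [], SONG_TRAITS)` ranks `song_b` ahead of `song_c`, because its tags sit much closer to those of the liked song. No other listener's ratings appear anywhere in the calculation.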

Amazon relies heavily on matching what is recommended with other shared purchases. This item-to-item matching follows connections between purchased objects, rather than matching personal ratings with other people's ratings. Amazon augments this by evaluating what is in my wishlist and my Amazon browsing history. All three of these indicators are pulled from activity patterns (logs of what I've done) rather than from self-reflective reporting of what I think I like. Activity patterns are honest in a way that self-reports simply can't match. The activities I performed are what they are; the system can't be mistaken about what I did in the way I might be when I self-report my enjoyment. When people reflect on their preferences they are rarely accurate, but their behavior patterns can't lie or make subjective mistakes.
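A rough Python sketch of item-to-item matching driven purely by behavior might look like the following. It is a simplified stand-in and makes no claim about Amazon's real system: the purchase logs and the names `PURCHASES`, `co_purchase_counts`, and `recommend_from_purchases` are hypothetical, and raw co-purchase counts are used in place of whatever similarity measure a production system would apply. Note that no ratings appear anywhere; the only input is what people actually bought.

```python
from collections import Counter
from itertools import permutations

# Hypothetical purchase logs: one set of item ids per customer.
PURCHASES = {
    "alice": {"book_1", "book_2", "book_3"},
    "bob": {"book_1", "book_2"},
    "carol": {"book_2", "book_4"},
}


def co_purchase_counts(purchases):
    """For each item, count how often every other item was bought with it."""
    counts = {}
    for basket in purchases.values():
        for item, other in permutations(basket, 2):
            counts.setdefault(item, Counter())[other] += 1
    return counts


def recommend_from_purchases(user_items, counts, top_n=3):
    """Score unowned items by how often they co-occur with the user's items."""
    scores = Counter()
    for item in user_items:
        for other, n in counts.get(item, {}).items():
            if other not in user_items:
                scores[other] += n
    return [item for item, _ in scores.most_common(top_n)]
```

With these logs, a customer who bought only `book_1` is pointed first to `book_2` (bought alongside it twice) and then to `book_3` (once). Wishlist and browsing events could feed the same counting step as additional behavior signals.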

Amazon pays attention to a different part of my interaction than Netflix does. Amazon pays attention to my passive interaction: behavior. Netflix pays attention to my active interaction: ratings.

Amazon also has a robust ratings system, but its ratings supply supplementary information instead of forming the basis for a match. Unlike on Pandora, other people's ratings are surfaced, but they don't appear to have a strong effect on recommendations. Books with low ratings, or none at all, may still be recommended. A well-matched recommendation for me recently had two stars; luckily, that didn't keep the book out of my recommendations.


While Netflix does base recommendations on movie characteristics such as genre, customer ratings also play a large role. Personal ratings and the averaged ratings of all Netflix users are both used to calculate recommendations. This is augmented with information about my past viewing and current movie queue.

My ratings are self-reflective, self-reported assessments of the movie I watched. Not only are ratings impacted by context, but ratings are subjective. I can rate a movie high because it was entertaining, you might rate the same movie high because you liked the performance of the lead actress. Netflix's five star rating system can't capture the nuances of this. So ratings become a mixed up amalgam of subjectively meaningful evaluations.

The interaction problem with Netflix's ratings is that they can't capture what I mean by my rating. An "I loved it" rating isn't nearly articulate enough to support another recommendation. Apparently even 800 ratings still don't provide enough data to make good assessments. Because ratings form the backbone of Netflix recommendations, the process for rating a movie needs to enable people to describe not only "what" the rating is, but "why." While this process would introduce other problems, such as a more cumbersome rating system, the data would be more precise, allowing the algorithm to perform better matches.

Simple user-driven star ratings are no substitute for deeply sophisticated and nuanced expert tagging, or for the honesty of behavior patterns. Pandora gets away with a simple ratings system because its team of experts works hard to create meaningful distinctions on the back end. They move the interaction away from user feedback and put it early in the process, when musicians listen to a song and tag it, entering the specific qualities which make it distinct in the system. Amazon avoids the problem by looking for behavior patterns rather than qualities or ratings.

A ratings system can't be simple, user-driven, and accurate all at the same time. Pandora has invested heavily in the Music Genome Project. If Netflix made the same kind of investment in a Movie Genome Project, the resulting high-quality data would allow a simple front-end experience and deliver more satisfying recommendations. Without this, as the saying goes: garbage in, garbage out.

Downstream: Help me help you improve my recommendations

Amazon, Netflix and Pandora all make it easy to see why something was recommended. Netflix does the bare minimum, showing me a list of up to three of the movies "I enjoyed" upon which it has based the current recommendation. There is no way to stop it from using these movies as the basis for recommendations.

Pandora offers equally limited feedback. I can click to see the qualities of the song that were used to make the recommendation. The qualities are a bit esoteric, with terms like "extensive vamping" and "intricate melodic phrasing."

Because these were assigned by their team of experts I don't have any ability to change them.

Amazon has a much more satisfying experience. Mouse over a recommendation and Amazon displays a dialog box telling me whether the book was recommended because of a purchase, a rating, or a wishlist item, and which one. If multiple items were used for the recommendation, it tells me this too.

It also provides an easy way to fix the recommendation if I think it was off. I can click a control which opens a pop-up where all the items used for the recommendation are listed.

I can either rate them or indicate I don't want them used for recommendations. In this way Amazon allows me to directly influence and correct the data which is contributing to my recommendations.

Netflix could easily add a way to do something similar with recommendations. Simply adding a control which allows me to indicate that I don't want a movie to be used as the basis for recommendations would give me a stronger ability to help improve my recommendations. Over time with steady user participation the database for recommendations would be refined by users themselves.
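The kind of control proposed here can be illustrated with a small Python sketch. Everything in it is assumed for the sake of the example: the `SIMILAR` lookup table, the movie ids, and the function name `recommend_with_reasons` are hypothetical and do not describe how Netflix or Amazon store or compute recommendations. It shows two things together: each recommendation carries the list of titles that produced it, and any title the user has excluded is skipped as a basis.

```python
# Hypothetical "similar titles" table; ids and links are invented.
SIMILAR = {
    "movie_a": ["movie_x", "movie_y"],
    "movie_b": ["movie_y", "movie_z"],
    "movie_c": ["movie_w"],
}


def recommend_with_reasons(enjoyed, excluded, similar):
    """Map each recommended title to the enjoyed titles that led to it.

    Titles in `excluded` are ones the user has asked not to be used
    as the basis for recommendations, so they are skipped as seeds.
    """
    reasons = {}
    for seed in enjoyed:
        if seed in excluded:
            continue
        for candidate in similar.get(seed, []):
            if candidate in enjoyed:
                continue  # never recommend something already watched
            reasons.setdefault(candidate, []).append(seed)
    return reasons
```

For a user who enjoyed `movie_a`, `movie_b`, and `movie_c`, the result explains that `movie_y` was suggested because of both `movie_a` and `movie_b`. Passing `excluded={"movie_c"}` removes `movie_w` from the results entirely, which is the correction the user asked for.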

Is it enough?

If Netflix created a database of well-defined characteristics which experts applied to movies, it would certainly help improve recommendations. If they provided a way for users to help them recover from poor recommendations, it would not only help users feel in control, but would also slowly but surely improve recommendations.

But is it enough? If users could create their own movie channels (like on Pandora) based on movie characteristics, could that lead to better recommendations? Could limiting the number of ratings, or adding a cost to rating, improve the quality of ratings and the resulting recommendations? What if Netflix stopped sending emails making it easy for people to rate movies they watched over the weekend? Could the additional effort required weed out people who are not thinking deeply enough about their ratings? Would making each movie a fixed rental cost change people's viewing behavior, making them more discriminating consumers? What other interaction or service changes could help Netflix deliver better recommendations?

Filed under: Critiques, Interaction design, Interaction Patterns, Web


Stefan Klocek

Stefan Klocek is an interaction designer of the Design Communicator flavor at Cooper. Unquenchable curiosity and practiced critical thinking make him especially well suited for this position. He has worked in a range of creative environments (from startu



Comments

On Aug 7, 2009, Dan said:

I actually think you may be categorically wrong. Using metadata was fully-allowed in the contest, and pretty much everyone who tried it had lesser results. I don't think there's any solid evidence it can help at all.

On Aug 7, 2009, Toddler said:

What you complain about above, that a mainstream taste gives poor recommendations while a more niche taste gives better, seems to me quite logical. I've run into this many times and I actually, just for fun, did a little research on it by opening different accounts on last.fm and emulating tastes. If you have a narrow taste, chances are that someone will have a similar narrow taste that the recommendation engine can pick up on. The more mainstream material you associate with, the more "polluted" your recommendations will be. Take Radiohead or Coldplay on last.fm for example: everyone listens to them, so naturally almost any other artist will seem similar to them to some extent. So, if you've shown that you like them, your recommendations could be almost anything. On the other hand, pick any obscure unsigned electro act and you surely will get recommended something that actually is similar, because anyone that found them in the first place would share the same taste as you. You're never unique on the internet.. =)

Other than that, I agree, recommendations mostly but not always suck. It's just that I think having a more narrow taste gives you better recommendations than a mainstream one.

On Aug 7, 2009, Jussi Pasanen said:

Stefan, I came across this in a recent project where we looked at book recommendations: Have you tried Jinni that claims to "find movies, TV shows matching your taste" and is currently in beta? Their system is based on a "Movie genome", which sounds very much like Pandora's Music Genome Project. The interface is pretty nifty, too.

Cheers, Jussi

On Aug 8, 2009, Lisa said:

Try Clerkdogs.com - entirely new movie and TV recommendation engine powered by humans - not algorithms

On Aug 13, 2009, Phoebe said:

Thanks for the thought-provoking piece. You mention that Netflix uses movie characteristics such as genre in their recommendations. It's perhaps worth clarifying that adding metadata has a limited influence on a ratings-based collaborative filtering algorithm, and is quite different from Pandora's extensive Genome or a deep semantic approach like Jinni's.

On Aug 31, 2009, Al Bredenberg said:

I like your point that the real areas for improvement are what happens before and after the algorithm.

After a few years using Netflix to get access to a marvelous "long tail" of video content, their system has still not figured out that I never rent "R" rated movies, which still appear prominently in the recommendations.

On top of that, I have never found any way to give unstructured feedback to Netflix. As far as I can see, there is no way to send even a brief message to Netflix making a suggestion about how they might make their service better.

I wonder whether this is intentional, or just a reflection of the mindset of the company's developers?

On Sep 17, 2009, Sam Horodezky said:

I question the implied assertion that humans can do better than statistical analysis at uncovering regularities in a large data set (with or without meta-data). It reminds me a little bit of the attitude from 15 years ago that a computer could never beat a human in chess.
FWIW the Netflix system works pretty well for me.

 
