A better algorithm isn't enough to fix Netflix's recommendations

There has been a lot of hype recently as Netflix announced provisional winners of their million-dollar contest to improve their recommendation algorithm. The goal was a 10% improvement in prediction accuracy. Since it took over 50,000 entrants the better part of three years to get past 10%, this is probably a trick they can only pull off once. And given that their current recommendation engine does a miserable job of recommending movies for me, even a 10% improvement isn't likely to be particularly satisfying.

I've rated just shy of 800 movies on Netflix and just over 150 items on Amazon, yet Amazon's recommendations are usually satisfying while Netflix struggles to recommend movies I'd actually like to see. This isn't a case of esoteric movie tastes; in fact, I'm fairly mainstream, largely preferring the entertainment of a summer blockbuster to the intellectualism of an indie documentary. The books I like are the opposite: non-fiction, obscure, expensive, limited runs, or out-of-print. In short, not popular. And still, Amazon recommends the right books.

Pandora is a music service that delights me by consistently recommending new music I like. Netflix can't give me great recommendations. Amazon and Pandora can. Why?

Clearly the algorithm is a critical part of any good recommendation engine. But there seem to be limits to what can be accomplished just by tweaking the math. If Netflix can only squeeze a 10% improvement out of the calculation for recommendations, where can they turn to get additional increases in quality?

The only other opportunities are tweaking what happens before and after the algorithm runs. Both are ultimately interaction design problems. Let's take a look at a few approaches to recommendations used by Netflix, Amazon, and Pandora and see how they lead to different results.

Upstream: Garbage in, garbage out

Pandora made a deep investment in creating a way for experts to tag songs before they are ever recommended to users. Their Music Genome Project employs music experts who listen to each song and tag it with up to 400 distinguishing characteristics. This early, expert interaction means recommendations can be made on nuances in songs that I'm not even consciously aware of.

[Image: Pandora]
Primary matchmaking in Pandora is done by matching the qualities of a song I rate with other songs tagged with similar qualities. Because the assignment of those qualities is made by an expert Pandora employee before the song ever enters the database, nothing I rate will ever change them.

My ratings (of like or dislike) simply help the system understand my preferences. Other people's ratings are never surfaced to me. I am in a walled garden. My ratings help match my music desires to the independent set of song qualities. Because ratings don't contaminate the song qualities dataset—and recommendations are based largely on song qualities—other people can't dilute what is recommended to me.
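
To make the contrast concrete, here is a minimal sketch of this style of matching. The songs, qualities, and scoring below are invented for illustration; Pandora's real system is far richer, but the shape is the same: the tags are fixed by experts, and my ratings only shape a preference profile that gets matched against them.

```python
# Minimal sketch of Pandora-style content matching.
# Song qualities are fixed, expert-assigned tags; thumbs up/down
# only shapes my preference profile, never the tags themselves.
# All songs and qualities here are invented for illustration.

SONG_QUALITIES = {
    "song_a": {"minor key", "extensive vamping", "acoustic guitar"},
    "song_b": {"minor key", "intricate melodic phrasing"},
    "song_c": {"major key", "synth textures"},
}

def build_profile(liked_songs):
    """Count how often each quality appears in songs I liked."""
    profile = {}
    for song in liked_songs:
        for quality in SONG_QUALITIES[song]:
            profile[quality] = profile.get(quality, 0) + 1
    return profile

def recommend(profile, exclude):
    """Score unheard songs by overlap with my preference profile."""
    scores = {
        song: sum(profile.get(q, 0) for q in qualities)
        for song, qualities in SONG_QUALITIES.items()
        if song not in exclude
    }
    return max(scores, key=scores.get)

profile = build_profile(["song_a"])
print(recommend(profile, exclude={"song_a"}))  # -> song_b (shares "minor key")
```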

Amazon relies heavily on item-to-item matching: it follows connections between purchased objects, rather than matching my ratings against other people's ratings. Amazon augments this by evaluating what is in my wishlist and my browsing history. All three of these signals come from activity patterns, logs of what I've done, rather than from self-reflective reports of what I think I like. Activity patterns are honest in a way that self-reports simply can't match: the activities I performed are what they are, and the system can't be mistaken about what I did in the way I might be when I report my own enjoyment. People are rarely accurate when they reflect on their preferences, but their behavior can't lie or make subjective mistakes.
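
Here is a rough sketch of the "customers who bought this also bought" idea, with invented purchase histories. Amazon's production algorithm is far more elaborate, but the core move is the same: count co-occurrences in behavior logs instead of interpreting ratings.

```python
# Rough sketch of item-to-item matching on purchase behavior:
# count how often pairs of items appear in the same purchase history,
# then recommend whatever co-occurs most with the items I bought.
# Purchase histories here are invented for illustration.
from collections import Counter
from itertools import combinations

purchase_histories = [
    {"book_a", "book_b"},
    {"book_a", "book_b", "book_c"},
    {"book_b", "book_c"},
]

# Co-purchase counts for every unordered pair of items.
co_counts = Counter()
for history in purchase_histories:
    for pair in combinations(sorted(history), 2):
        co_counts[pair] += 1

def also_bought(my_items, n=1):
    """Items most often purchased alongside the ones I own."""
    scores = Counter()
    for (a, b), count in co_counts.items():
        if a in my_items and b not in my_items:
            scores[b] += count
        elif b in my_items and a not in my_items:
            scores[a] += count
    return [item for item, _ in scores.most_common(n)]

print(also_bought({"book_a"}))  # -> ['book_b'], bought together twice
```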

[Image: Amazon]

Amazon pays attention to a different part of my interaction than Netflix does. Amazon watches my passive interaction: behavior. Netflix watches my active interaction: ratings.

Amazon also has a robust ratings system, but its ratings supply supplementary information instead of forming the basis for a match. Unlike on Pandora, other people's ratings are surfaced, but they don't appear to have a strong effect on recommendations. Books may be recommended which have no or low ratings. A well-matched recommendation for me recently had two stars; luckily that didn't keep the book out of my recommendations.

[Image: Netflix rating]
While Netflix does base recommendations on movie characteristics such as genre, customer ratings also play a large role. Personal ratings and the averaged ratings of all Netflix users are both used to calculate recommendations, augmented with information about my past viewing and my current queue.
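
Netflix's production system is proprietary, but the contest made one ratings-only baseline widely known: predict a rating as the global average adjusted by a per-user and a per-movie offset. A toy version with invented numbers shows just how little information the stars actually carry.

```python
# Toy version of a ratings-only baseline predictor of the kind made
# familiar by the Netflix Prize: global average + user bias + movie bias.
# All ratings are invented; Netflix's actual system is proprietary.

ratings = {  # (user, movie) -> stars
    ("me", "movie_a"): 5, ("me", "movie_b"): 4,
    ("you", "movie_a"): 2, ("you", "movie_c"): 3,
}

global_avg = sum(ratings.values()) / len(ratings)

def bias(key_index, key):
    """Average deviation from the global mean for one user (0) or movie (1)."""
    devs = [r - global_avg for k, r in ratings.items() if k[key_index] == key]
    return sum(devs) / len(devs) if devs else 0.0

def predict(user, movie):
    return global_avg + bias(0, user) + bias(1, movie)

print(predict("me", "movie_c"))  # my likely rating of an unseen movie: 4.0
```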

My ratings are self-reflective, self-reported assessments of the movies I watched. Not only are ratings affected by context, they are subjective. I might rate a movie highly because it was entertaining; you might rate the same movie highly because you liked the performance of the lead actress. Netflix's five-star rating system can't capture this nuance, so ratings become a mixed-up amalgam of subjectively meaningful evaluations.

The interaction problem with Netflix's ratings is that they can't capture what I mean by my rating. An "I loved it" rating isn't nearly articulate enough to drive another recommendation, and apparently even 800 ratings still don't give enough data to make good assessments. Because ratings form the backbone of Netflix recommendations, the process for rating a movie needs to let people describe not only "what" the rating is, but "why." While this would introduce other problems, such as a more cumbersome rating system, the data would be more precise, allowing the algorithm to make better matches.
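
One way to picture a rating that captures "why" as well as "what" is to attach reason tags to the stars. The schema below is hypothetical, just enough to show how two identical star ratings can carry entirely different meanings.

```python
# Hypothetical sketch of a rating that records "why," not just "what."
# Reason tags give the algorithm something to match on beyond a number:
# two people can give five stars for entirely different reasons.
from dataclasses import dataclass, field

@dataclass
class Rating:
    movie: str
    stars: int                                  # the "what": 1-5
    reasons: set = field(default_factory=set)   # the "why"

mine = Rating("movie_a", 5, {"entertaining", "great action"})
yours = Rating("movie_a", 5, {"lead actress performance"})

# Same stars, different meaning; a plain 5 collapses this distinction.
print(mine.reasons & yours.reasons)  # -> set(), nothing in common
```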

Simple user-driven star ratings are no substitute for deep, nuanced, expert tagging, or for the honesty of behavior patterns. Pandora gets away with a simple ratings system because their team of experts works hard to create meaningful distinctions on the back end. They move the interaction away from user feedback to early in the process, when musicians listen to a song and enter the specific qualities that make it distinct. Amazon avoids the problem by looking for behavior patterns rather than qualities or ratings.

It's not possible to be simple, user-driven, and accurate all at the same time. Pandora has invested heavily in the Music Genome Project; if Netflix made the same kind of investment in a Movie Genome Project, the resulting high-quality data would allow a simple front-end experience and deliver more satisfying recommendations. Without this, as the saying goes: garbage in, garbage out.

Downstream: Help me, help you improve my recommendations

Amazon, Netflix, and Pandora all make it easy to see why something was recommended. Netflix does the bare minimum, showing me a list of up to three movies "I enjoyed" on which the current recommendation is based. There is no way to tell it to stop using them.


Pandora's feedback is equally limited. I can click to see the qualities of the song that were used to make the recommendation. The qualities are a bit esoteric, with terms like "extensive vamping" and "intricate melodic phrasing."


Because these were assigned by their team of experts, I don't have any ability to change them.

Amazon offers a much more satisfying experience. Mouse over a recommendation and Amazon displays a dialog box explaining that the book was recommended because of a purchase, a rating, or a wishlist item, and it tells me which one. If multiple items were used for the recommendation, it tells me that too.

And it provides an easy way to fix the recommendation if I think it was off. I can click a control which opens a pop-up listing all the items used for the recommendation.

I can either rate them or indicate I don't want them used for recommendations. In this way Amazon allows me to directly influence and correct the data which is contributing to my recommendations.

Netflix could easily do something similar. Simply adding a control that lets me indicate I don't want a movie used as the basis for recommendations would give me a stronger hand in improving them. Over time, with steady user participation, the recommendation database would be refined by users themselves.
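
Behind the scenes, such a control could be as simple as a per-user exclusion list that filters a movie out of the basis set before the algorithm runs. A hypothetical sketch:

```python
# Hypothetical sketch of the proposed "don't use this movie" control:
# an exclusion list filters my ratings before they feed the recommender,
# so one regrettable weekend rental stops driving my recommendations.

my_ratings = {"movie_a": 5, "movie_b": 4, "movie_c": 2}
excluded_from_basis = {"movie_b"}  # set via a control in the UI

def recommendation_basis(ratings, excluded):
    """Ratings the engine is allowed to match against."""
    return {m: r for m, r in ratings.items() if m not in excluded}

print(recommendation_basis(my_ratings, excluded_from_basis))
# -> {'movie_a': 5, 'movie_c': 2}
```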

Is it enough?

If Netflix created a database of well-defined characteristics which experts applied to movies, it would certainly help improve recommendations. If they provided a way for users to help them recover from poor recommendations, it would not only help users feel in control, it would slowly but surely improve the recommendations themselves.

But is it enough? If users could create their own movie channels (like on Pandora) based on movie characteristics, could that lead to better recommendations? Could limiting the number of ratings, or adding a cost to rating, improve the quality of ratings and the resulting recommendations? What if Netflix stopped sending emails making it easy for people to rate the movies they watched over the weekend? Could the additional effort required weed out people who aren't thinking deeply about their ratings? Would a fixed rental cost per movie change people's viewing behavior, making them more discriminating consumers? What other interaction or service changes could help Netflix deliver better recommendations?