NLP for Customer Insights: Beyond Sentiment Analysis

April 1, 2025
7 min read
NLP
Customer Research
Practice

There's an episode of *Curb* where Larry orders a sandwich and spends the rest of the meal trying to explain to the waiter that he didn't get what he ordered, while the waiter insists nothing is wrong. That's the experience of reading customer reviews and watching the product team reassure itself nothing is wrong. The signal is there. The system isn't built to hear it.

Sentiment analysis is the version of NLP everyone reaches for first. Take a pile of customer text, score it positive or negative, put the score on a dashboard. It's been the default for so long that most teams treat it as the whole job, when really it's the first ten percent of the work. The interesting part is everything that happens after you stop asking whether the customer is happy and start asking what they're happy or unhappy about, in their own words, in detail you didn't anticipate.

I learned this concretely at Beats by Dre, doing NLP on roughly 2,000 customer reviews of a product line. The brief was "tell us what customers are saying." The first version of the analysis was a sentiment dashboard with a positive-neutral-negative split. It said the product was 73% positive. The product team already knew that. They had a sales chart.

The dashboard wasn't wrong. It was useless, which is a different problem and a worse one, because useless analysis is harder to argue with than wrong analysis. Wrong analysis you can fix. Useless analysis just sits there, technically correct and operationally inert, while the team upstream goes back to their existing intuitions because the data didn't tell them anything they could act on.

What changed the analysis from useless to useful was framing the question differently. Instead of "are customers happy," the question became "what are customers happy or unhappy about, specifically, and in what context." That reframing pushes you from pure sentiment into a few related techniques that get talked about less.

Aspect-based sentiment analysis was the first one that mattered. The technique is simple in principle: extract the specific things the review is about, score sentiment for each one separately, aggregate the scores across reviews. In practice this surfaced patterns that pure sentiment had washed out. Battery life was getting positive sentiment in some product variants and negative in others. Comfort was getting positive sentiment from short-session listeners and negative from people who wore the product for hours. Bluetooth pairing was almost universally negative across every variant, which the average score had been hiding because pairing complaints sat next to enthusiastic praise about sound quality in the same review. The team had been reading "73% positive" and hearing "the product is good." The aspect-level data said "the product is good in three specific ways and bad in two specific ways and you should fix one of the bad ones immediately."
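
To make the extract, score, aggregate loop concrete, here is a minimal sketch. It is not the pipeline used on the Beats data: the keyword lists, the clause splitting, and the sample reviews are all assumptions made up for illustration, and a real system would use a trained aspect extractor and sentiment model in their place.

```python
import re
from collections import defaultdict

# Hypothetical lexicons for illustration only; real pipelines learn these.
ASPECT_TERMS = {
    "battery": {"battery", "charge"},
    "comfort": {"comfort", "comfortable", "fit", "hurt"},
    "pairing": {"pairing", "bluetooth", "connect"},
    "sound": {"sound", "bass", "audio"},
}
POSITIVE = {"great", "love", "amazing", "comfortable", "good"}
NEGATIVE = {"bad", "terrible", "hurt", "fails", "drops", "awful"}

def clause_polarity(tokens):
    """Return +1, -1, or 0 for a set of tokens using the toy lexicons."""
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    return (score > 0) - (score < 0)

def aspect_sentiments(review):
    """Score each clause separately so praise and complaints don't cancel out."""
    result = {}
    for clause in re.split(r"[.,;!?]|\bbut\b", review.lower()):
        tokens = set(re.findall(r"[a-z']+", clause))
        polarity = clause_polarity(tokens)
        if polarity == 0:
            continue
        for aspect, terms in ASPECT_TERMS.items():
            if tokens & terms:
                result[aspect] = result.get(aspect, 0) + polarity
    return result

def aggregate(reviews):
    """Average the per-review polarity for each aspect across the corpus."""
    totals = defaultdict(list)
    for review in reviews:
        for aspect, polarity in aspect_sentiments(review).items():
            totals[aspect].append(polarity)
    return {aspect: sum(v) / len(v) for aspect, v in totals.items()}

# Made-up reviews: each one mixes praise for sound with a pairing complaint.
reviews = [
    "The sound is amazing, but bluetooth pairing fails constantly.",
    "Love the bass. Pairing drops every morning.",
    "Great sound, terrible pairing.",
]
scores = aggregate(reviews)
print(scores)
```

A single score per review would call all three of these mixed or mildly positive. Scored per aspect, sound comes out uniformly positive and pairing uniformly negative, which is the pattern the overall number was hiding.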

Topic modeling was the second one. Aspect-based analysis works when you already know what aspects to look for. Topic modeling works when you don't. Run it on a review corpus and it surfaces the themes customers are actually talking about, including the ones you didn't ask about. The most useful thing it found in the Beats data was an unanticipated topic: a substantial number of customers were discussing the product in the context of working out, even though the product wasn't marketed as fitness gear. That topic had been invisible in the aspect analysis because nobody had thought to add "workout suitability" as an aspect to score. It was visible in topic modeling because the customers themselves wouldn't shut up about it. That single insight ended up driving a marketing campaign.
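
A proper topic model (LDA, NMF, or an embedding-based clusterer) is the right tool for this, and the sketch below is not one. It is a much cruder stand-in that shows the same blind spot: count how many reviews mention each word, then list the frequent words the aspect list doesn't cover. The aspect vocabulary, the stopword set, the threshold, and the reviews are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical inputs: words the aspect list already covers, plus a tiny stopword set.
KNOWN_ASPECT_WORDS = {"battery", "sound", "bass", "pairing", "bluetooth", "comfort"}
STOPWORDS = {"the", "a", "and", "is", "it", "i", "my", "for", "on", "in", "at",
             "to", "but", "of", "are"}

def uncovered_terms(reviews, min_share=0.5):
    """Return words outside the aspect list that appear in at least
    min_share of the reviews, sorted alphabetically."""
    doc_freq = Counter()
    for review in reviews:
        tokens = set(re.findall(r"[a-z]+", review.lower()))
        doc_freq.update(tokens - KNOWN_ASPECT_WORDS - STOPWORDS)
    threshold = min_share * len(reviews)
    return sorted(term for term, n in doc_freq.items() if n >= threshold)

# Made-up reviews in which a context nobody planned for keeps coming up.
reviews = [
    "Great sound at the gym",
    "Battery lasts for my gym sessions",
    "Pairing is flaky but fine for the gym",
    "Bass is strong on my commute",
]
print(uncovered_terms(reviews))
```

On these four reviews the only word that clears the threshold is "gym", which no aspect was set up to score. A real topic model does the same job with far less hand-holding, grouping related words into themes instead of reporting them one at a time.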

Named entity recognition was the third. Less glamorous, more useful than it sounds. Apply NER to customer reviews and it tags every product mention, every feature, every competitor, every accessory. Track those mentions over time and patterns emerge that no human reading 2,000 reviews would catch. Spike in mentions of a specific competitor's product right after their launch. Sudden drop in mentions of a feature you stopped marketing. Quiet rise in mentions of an accessory the team didn't realize people were buying separately. The technique itself is old. The application of it to a continuously updating review feed is what makes it actually useful.
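
The tracking step is simple once the mentions are tagged. The sketch below assumes a plain lookup list in place of a trained NER model, and the entity names ("RivalBuds", "carrying case"), the dates, the spike rule, and the reviews are invented for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical entity list; a trained NER model would produce these tags instead.
ENTITIES = {"rivalbuds": "competitor", "carrying case": "accessory"}

def monthly_mentions(dated_reviews):
    """Count reviews mentioning each entity per month.
    dated_reviews is an iterable of (YYYY-MM, text) pairs."""
    counts = defaultdict(Counter)
    for month, text in dated_reviews:
        lowered = text.lower()
        for entity in ENTITIES:
            if entity in lowered:
                counts[entity][month] += 1
    return counts

def spikes(counts, months, factor=2.0, min_count=2):
    """Flag (entity, month) pairs where mentions reached min_count and grew
    by at least `factor` over the previous month (zero counts as one)."""
    flagged = []
    for entity, by_month in counts.items():
        for prev, cur in zip(months, months[1:]):
            baseline = max(by_month[prev], 1)
            if by_month[cur] >= min_count and by_month[cur] >= factor * baseline:
                flagged.append((entity, cur))
    return flagged

# Made-up feed: competitor mentions jump in the second month.
dated = [
    ("2024-01", "Sound is fine"),
    ("2024-01", "Better than RivalBuds"),
    ("2024-02", "RivalBuds pair faster"),
    ("2024-02", "Switched from RivalBuds"),
    ("2024-02", "I bought a carrying case separately"),
    ("2024-02", "RivalBuds are cheaper"),
]
months = sorted({month for month, _ in dated})
flagged = spikes(monthly_mentions(dated), months)
print(flagged)
```

The competitor goes from one mention to three and gets flagged; the accessory shows up once and does not. Run on a feed that keeps updating, the same two functions become a cheap alert for the launches and quiet trends described above.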

Text classification, beyond binary sentiment, is the fourth. The teams I've seen do this well train classifiers for the categories they actually need to route, not the categories an off-the-shelf model offers. Technical issue versus feature request versus pricing complaint versus compliment versus something we can't categorize and need to read manually. The last category is the one that matters most, because the things you can't categorize are usually where the next product insight lives.
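
The routing logic, including the catch-all, can be sketched in a few lines. The keyword cues below are a hypothetical stand-in for a trained classifier's scores, and the category names and margin rule are assumptions. The part worth copying is the fallback: when no category clearly wins, the review goes to a person.

```python
import re

# Hypothetical keyword cues standing in for a trained classifier's features.
CATEGORY_CUES = {
    "technical_issue": {"crash", "broken", "disconnects", "won't", "fails"},
    "feature_request": {"wish", "add", "would", "please"},
    "pricing_complaint": {"expensive", "overpriced", "price", "cost"},
    "compliment": {"love", "great", "perfect"},
}

def route(review, min_margin=1):
    """Return the winning category, or 'needs_manual_read' when nothing
    matches or the top two categories are too close to call."""
    tokens = set(re.findall(r"[a-z']+", review.lower()))
    scores = {cat: len(tokens & cues) for cat, cues in CATEGORY_CUES.items()}
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    (best, top), (_, runner_up) = ranked[0], ranked[1]
    if top == 0 or top - runner_up < min_margin:
        return "needs_manual_read"
    return best

# Made-up reviews covering a clear case, an unmatched case, and a tie.
print(route("The left earcup disconnects and pairing fails"))
print(route("My dog chewed them and they still work"))
print(route("Love them but the price is too high"))
```

The first review routes cleanly to technical issues. The second matches nothing and the third splits evenly between compliment and pricing, so both land in the manual pile, which is where a human should be reading anyway.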

The newer wave, large language models doing review analysis, is interesting but worth being skeptical about for a specific reason. LLMs are very good at producing fluent summaries that feel insightful and are actually wrong in ways that are hard to detect. If you ask an LLM to summarize a thousand customer reviews, it will produce three paragraphs that sound like an executive briefing. The paragraphs will reference patterns that are mostly real, lightly invented, and not tied back to the source data in a way you can audit. For the kind of work where the analysis informs a product decision, this is dangerous. The fix is to use LLMs for the parts of the pipeline where their failure modes are bounded, such as extraction, classification, and individual review summarization, rather than the parts where they can hallucinate at scale. Aspect identification, yes. Strategic summary, no.
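
One way to keep an extraction step auditable is to require every extracted claim to carry a quote, and then check that the quote really appears in the review it cites. The sketch below makes no LLM call: `extractions` is a hypothetical example of what an extraction step might return, and the field names are assumptions.

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivial formatting differences don't matter."""
    return " ".join(text.lower().split())

def audit_extractions(reviews, extractions):
    """Split extractions into (grounded, rejected) depending on whether each
    quote appears verbatim in the review it claims to come from."""
    grounded, rejected = [], []
    for item in extractions:
        source = normalize(reviews.get(item["review_id"], ""))
        quote = normalize(item["quote"])
        if quote and quote in source:
            grounded.append(item)
        else:
            rejected.append(item)
    return grounded, rejected

# Made-up source reviews keyed by id.
reviews = {
    "r1": "Sound is great but pairing drops every morning.",
    "r2": "Battery easily lasts a full day.",
}
# Hypothetical extraction output; the third quote does not appear in r2.
extractions = [
    {"review_id": "r1", "aspect": "pairing", "quote": "pairing drops every morning"},
    {"review_id": "r2", "aspect": "battery", "quote": "lasts a full day"},
    {"review_id": "r2", "aspect": "comfort", "quote": "hurts after an hour"},
]
grounded, rejected = audit_extractions(reviews, extractions)
print([item["aspect"] for item in grounded])
print([item["aspect"] for item in rejected])
```

The invented comfort complaint is rejected because its quote isn't in the review. This kind of check is only possible when the task is per-review extraction. A corpus-level summary has no single source to compare against, which is why its errors are so hard to catch.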

A few smaller things that are easy to forget:

The reviews you read are not the customers you have. People who write reviews are systematically unrepresentative. They're the people angry enough or pleased enough to type. Most of your customers are silent, and the silent ones are usually the median experience, which the review corpus will not tell you about. Mix review analysis with survey data and behavioral data, or you'll be analyzing the loudest few percent and calling it the whole.

The text customers write is a translation of the experience they had. Something happened, they tried to put it into words, and what showed up in the review is two layers removed from the underlying experience. Treat the text as evidence about the experience, not as the experience itself.

Negative reviews are more useful than positive ones, almost always. Positive reviews tell you what's working. You probably already knew. Negative reviews tell you what's broken in ways you didn't predict, which is the only kind of information that drives improvement.

The customer who's almost-but-not-quite happy is the most valuable signal in the corpus. They're not so unhappy that they leave. They're not so happy that they're unconditional. They're telling you exactly which detail is keeping them from being a fan, and that detail is usually fixable.

The harder thing to internalize is that no NLP technique extracts insight automatically. The techniques surface patterns. Insight is what happens when a person who understands the business looks at the patterns and recognizes which ones matter. The most sophisticated model in the world, applied without that interpretive layer, produces another dashboard that nobody uses.

The reason I still find this work interesting is that customers have always been telling companies the truth. The companies that listen, in the right way, with the right tools, learn things their competitors don't. The companies that aggregate the voices into a single average are the ones who, like Larry's waiter, keep insisting nothing is wrong while the customer is still trying to explain that the sandwich was wrong from the moment it hit the table.