What Is Stemming?

Written by Coursera Staff

Learn what stemming is and how it supports the natural language processing behind chatbots, automated customer service lines, Alexa, and smart home devices. Read on to also discover a career connected to stemming.

[Featured Image] A data scientist uses stemming to simplify data.

Stemming is a natural language processing (NLP) technique that reduces words to their core form, or stem, so that related words can be treated alike. Within artificial intelligence (AI), the development of NLP was a major advancement. It has reshaped how we converse with machines, making interactions more natural and intuitive, like a conversation between two people. NLP relies on techniques for understanding and retrieving information, one of which is stemming.

In this article, we will explore what stemming is, how it relates to NLP, common examples of stemming, and additional information retrieval techniques. Finally, if you’re interested in a career related to stemming, we will provide a way to get started.

What is natural language processing?

Natural language processing (NLP), a branch of artificial intelligence, enables computers to understand, process, interpret, and generate human language. While we used to interact with computers primarily through formal languages, such as the programming language Java or the markup language HTML, natural language processing allows you to communicate with computers much as you would with another person.

By combining linguistics and computer science, NLP development has transformed how you can speak with and understand computers. NLP drives many applications we interact with daily, such as voice-activated assistants, online customer support chatbots, smart home devices, email filters, and translation services.

When thinking about NLP, you can think of computers as the students and humans as the teachers. Just like you would teach someone to read and speak, you can teach computers to understand, interpret, and respond to human input. Over time, you can train NLP algorithms to recognize and understand more complex language and undertones, like sarcasm.

How NLP works

Human language is complex, so you must train an NLP algorithm to do several things. Core tasks in an NLP system often include:

  • Tokenization: Breaking text into separate units, or “tokens,” such as words, names, and punctuation

  • Part-of-speech tagging: Recognizing and labeling parts of speech in text (e.g., nouns)

  • Parsing: Determining the grammatical structure of sentences

  • Semantic analysis: Going beyond definitions to understand meaning in context

  • Sentiment analysis: Recognizing the emotional undertones of a text
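To make the first task above concrete, here is a minimal tokenization sketch in Python. It assumes a simple regular-expression approach, and the `tokenize` function is a hypothetical helper written for this illustration. Production NLP libraries handle many more cases (abbreviations, URLs, other languages).

```python
import re

def tokenize(text):
    """Split text into word and punctuation tokens (illustrative only)."""
    # \w+(?:'\w+)* matches a word, allowing internal apostrophes ("don't").
    # [^\w\s] matches a single punctuation character.
    return re.findall(r"\w+(?:'\w+)*|[^\w\s]", text)

tokens = tokenize("Alexa, don't stop the music!")
print(tokens)  # ['Alexa', ',', "don't", 'stop', 'the', 'music', '!']
```

Each token can then be passed to later steps, such as part-of-speech tagging or stemming.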

What is stemming?

Stemming is the part of NLP that reduces words to their root forms so that different forms of the same word can be matched to one another. As you might imagine, being able to parse words and interpret meaning is an important function for an NLP algorithm. This can be difficult when several different word forms express the same core idea, which is where conflation comes in.

Conflation involves treating two distinct words or phrases as semantically equivalent because they refer to the same core idea. For example, “decided” and “decidable” might not be synonyms but are equivalent in certain contexts. Stemming is one way NLP algorithms conflate words: by mapping related word forms to a shared stem, an algorithm can retrieve text that uses any of those forms.

The idea behind stemming is to take away different endings of words to find the most basic part, which is the “stem.” For instance, the words “swimmer,” “swimming,” and “swims” all share the root word “swim.” Reducing them to that root helps NLP algorithms recognize that the words are related, and the simplified words are easier for computers to process. More specifically, Porter’s algorithm for stemming defines a set of suffixes, each paired with a condition on the part of the word that would remain, such as a minimum length or the presence of a vowel, so that the algorithm can determine whether removing the suffix is reasonable (e.g., removing “-ing” from “feeding” but not from “ring”). The algorithm applies its rules in a fixed sequence of steps, checking these conditions at each one.
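The suffix check described above can be sketched in a few lines of Python. This is a simplified illustration of a single rule in the spirit of Porter’s algorithm, not the full algorithm. The `strip_suffix` and `contains_vowel` names are hypothetical, and the sketch assumes only two suffixes and a plain five-letter vowel set.

```python
VOWELS = set("aeiou")

def contains_vowel(stem):
    """Return True if the candidate stem has at least one vowel.

    Simplification: this sketch only looks for a, e, i, o, or u and
    does not treat "y" as a vowel in any position.
    """
    return any(ch in VOWELS for ch in stem)

def strip_suffix(word, suffixes=("ing", "ed")):
    """Remove a suffix only if what remains still contains a vowel."""
    for suffix in suffixes:
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            if contains_vowel(stem):
                return stem
            return word  # the check failed, so leave the word unchanged
    return word

print(strip_suffix("feeding"))  # feed
print(strip_suffix("ring"))     # ring ("r" has no vowel, so "-ing" stays)
print(strip_suffix("fished"))   # fish
```

A complete stemmer chains many such rules and adds cleanup steps, such as reducing the doubled consonant left behind in “swimm.”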

Examples of stemming

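  • “Fishing,” “fished,” and “fisher” can stem to “fish”

  • “Argue,” “argued,” “argument,” “arguing,” and “arguer” can stem to “argu”

  • “Create,” “creative,” “creativity,” “creator,” and “creating” can stem to “creat”

As “argu” and “creat” show, a stem doesn’t have to be a real word. The exact result also depends on the stemming algorithm you use: a more conservative stemmer may leave forms such as “fisher” or “argument” unchanged.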

Advantages and disadvantages of stemming

When choosing stemming as your information retrieval technique, knowing the benefits and limitations can help you avoid common pitfalls and ensure you use the right technique for your needs. Consider the following advantages and disadvantages.

Advantages

  • Improved search results: Stemming links related word forms, so a search for one form can also retrieve text that contains the others.

  • Reduced dimensionality: By collapsing multiple forms of a word into a single representation, stemming reduces dimensionality and can make statistical processing easier.

  • Quick processing: Stemming algorithms are generally straightforward and fast, which speeds up the processing time for large volumes of text.
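To see the reduced dimensionality in practice, the following Python sketch counts distinct words before and after stemming. The `toy_stem` function is a hypothetical, deliberately crude stemmer invented for this example, and the word list is made up.

```python
def toy_stem(word):
    """Hypothetical stemmer: strip a few common suffixes from longer words."""
    for suffix in ("ing", "ed", "er", "s"):
        # Require at least three remaining letters so short words stay intact.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

words = "fishing fished fisher fish jumps jumped jumping jump".split()

vocabulary = set(words)
stemmed_vocabulary = {toy_stem(word) for word in words}

print(len(vocabulary), "->", len(stemmed_vocabulary))  # 8 -> 2
```

Eight distinct word forms collapse into two stems, so any statistics you compute afterward deal with far fewer distinct terms.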

Disadvantages

  • Over-stemming: Sometimes, stemming can be too aggressive, resulting in different words being reduced to the same stem even though they have different meanings (e.g., “metaphor” and “metaphysical” might stem to “meta”).

  • Under-stemming: At other times, stemming might not be aggressive enough, failing to conflate words that are practically the same (e.g., “ready” and “readiness” might not show as related).

  • Difficulty with irregular forms: Stemming relies on a predefined set of suffixes, so irregular forms that don’t follow those patterns (e.g., “ran” as the past tense of “run”) may go unrecognized or be stemmed improperly.
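You can reproduce these pitfalls with a small Python experiment. The sketch below assumes a hypothetical, deliberately naive stemmer (`toy_stem`) and a helper (`conflated`) that checks whether two words end up with the same stem. Real stemmers make different mistakes, but the failure modes are the same in kind.

```python
def toy_stem(word):
    """Hypothetical stemmer: strip a few common suffixes from longer words."""
    for suffix in ("ing", "ed", "er", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def conflated(first, second):
    """Return True if both words are reduced to the same stem."""
    return toy_stem(first) == toy_stem(second)

# Over-stemming: unrelated words collapse to the same stem.
print(conflated("wander", "wand"))      # True, both become "wand"

# Under-stemming: related words keep different stems.
print(conflated("ready", "readiness"))  # False

# Irregular forms: "ran" has no suffix for the rules to remove.
print(conflated("ran", "running"))      # False
```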

Explore more NLP information retrieval techniques

Stemming is a powerful NLP tool, but it isn’t the only one. When learning about NLP, exploring different conflation techniques can help you better understand how computers process language and information. Different methods might be more effective for you depending on the type of text to which you are applying the algorithm.

Some additional methods for conflation include: 

  • Direct matching: Comparing the character sequences of two words to calculate how similar they are. If the similarity exceeds a chosen threshold, the words are considered equivalent.

  • Lemmatization: This process reduces each word to its lemma, the dictionary (base) form shared by a group of related word forms (e.g., “ran” and “running” share the lemma “run”). To accomplish this, lemmatization looks up lexical information kept in electronic dictionaries and lexicons.

  • Cluster-based conflation: This approach creates clusters of equivalent words based on associations in a text corpus. 

  • N-gram conflation: This involves breaking down words into N-letter fragments (N-grams) and identifying similar words based on these fragments. 
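Two of the methods above, direct matching and N-gram conflation, are simple enough to sketch with Python’s standard library. The function names and the 0.5 threshold are assumptions made for this illustration. The N-gram score uses the Dice coefficient, which is one common choice among several.

```python
from difflib import SequenceMatcher

THRESHOLD = 0.5  # hypothetical cutoff; tune it for your own text

def char_ngrams(word, n=2):
    """Return the set of N-letter fragments in a word."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}

def dice_similarity(first, second, n=2):
    """Dice coefficient: 2 * shared fragments / total fragments."""
    grams_a, grams_b = char_ngrams(first, n), char_ngrams(second, n)
    if not grams_a or not grams_b:
        return 0.0
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))

def direct_match_ratio(first, second):
    """Character-sequence similarity between 0 and 1, using difflib."""
    return SequenceMatcher(None, first, second).ratio()

for other in ("fished", "banana"):
    ngram_score = dice_similarity("fishing", other)
    direct_score = direct_match_ratio("fishing", other)
    print(other, ngram_score >= THRESHOLD, direct_score >= THRESHOLD)
```

With these settings, “fishing” and “fished” clear the threshold under both methods, while “fishing” and “banana” do not.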

How to start a career in stemming

If you’re interested in a career involving stemming, a natural language processing position is a good choice. For example, as a natural language processing engineer, depending on your job title and type of business, you could create natural language processing systems, analyze speech patterns, and develop AI speech recognition, along with machine translation, syntactic analysis, and algorithm construction. The algorithms you build for this work will use stemming to help machines both understand and communicate with language rather than numbers. According to Glassdoor, if you choose to work as a natural language processing engineer, you can expect to earn an average annual salary of $87,262 [1].

To become a natural language processing engineer, you’ll probably need to earn at least an associate or bachelor’s degree in academic areas such as engineering, computer science, data science, or artificial intelligence. Additionally, you might consider completing a master’s or even a PhD to increase your employability. Along with a degree, internships can give you experience in the real world and help you sharpen some of the necessary abilities, such as computer programming, statistical analysis, and machine learning methods. Finally, as a natural language processing engineer, you will most likely be working with others, so developing your workplace skills can contribute to your marketability.

Expand your natural language processing knowledge on Coursera

You won’t find a shortage of NLP concepts to explore, and the field continues to expand. The US Bureau of Labor Statistics expects the economic sector in which you’ll find natural language processing to grow 35 percent from 2022 to 2032, which is well above average [2]. Whether you are interested in focusing on NLP, learning general AI along with machine learning techniques, or have a specific concept in mind, you can take your next steps with beginner and advanced courses on Coursera. Consider the Natural Language Processing Specialization by DeepLearning.AI for a broad overview of NLP.

Article sources

1. Glassdoor. “Natural Language Processing Engineer Salaries,” https://www.glassdoor.com/Salaries/natural-language-processing-engineer-salary-SRCH_KO0,36.htm. Accessed March 21, 2024.
