What Is Lexical Analysis?

Written by Coursera Staff • Updated on

Lexical analysis is the first step of text processing used in many artificial intelligence algorithms. Learn why this process is a key step in natural language processing, allowing machines to understand human text more effectively.

[Featured Image] A programmer sits at a desk and uses a laptop application that includes lexical analysis.

Lexical analysis is one of the first steps in natural language processing, allowing computers to break down input text into individual units for further analysis. This article will explore key terms related to lexical analysis, the steps of lexical analysis, advantages and limitations, and what types of careers utilize this process.

Key terms in lexical analysis

When learning about lexical analysis, having a firm grasp of several key terms can help you understand the underlying process and how lexical analysis fits into the larger picture of natural language processing and artificial intelligence. Some keywords and phrases to become familiar with include the following.

NLP (Natural language processing)

NLP is a branch of computer science and artificial intelligence that centers on designing ways for computers to communicate with humans. NLP aims to enable computers to listen to and converse with humans using natural language. NLP programs enable computers to read, understand, interpret, and mimic human language in a valuable and meaningful way.

Token 

A token is a sequence of characters grouped into a single entity. Each token represents a set of character sequences conveying a specific meaning. In programming languages, tokens can be keywords, operators, identifiers, or other elements that have a syntactical role.

Tokenizer 

A tokenizer is a program that divides an input into separate tokens. These tokens have distinct meanings and represent individual entities. The tokenizer needs to identify the boundaries of tokens, which can vary depending on the context and the rules of the specific language. Tokenization is typically the first step in a natural language processing pipeline.
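As a rough sketch of what a tokenizer does, the following Python function (illustrative only, not a production NLP tool) splits a sentence into word and punctuation tokens by scanning it character by character:

```python
def tokenize(text):
    """Split text into word tokens and punctuation tokens."""
    tokens = []
    word = ""
    for ch in text:
        if ch.isalnum():
            word += ch              # build up a word token character by character
        else:
            if word:                # a non-alphanumeric character ends the current word
                tokens.append(word)
                word = ""
            if not ch.isspace():
                tokens.append(ch)   # punctuation becomes its own token
    if word:                        # flush a word that runs to the end of the input
        tokens.append(word)
    return tokens

print(tokenize("The cat sat, then slept."))
# ['The', 'cat', 'sat', ',', 'then', 'slept', '.']
```

Real tokenizers handle many more boundary cases (contractions, hyphens, Unicode), but the core job is the same: decide where one token ends and the next begins.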

Lexer (lexical analyzer)

A lexer, short for lexical analyzer, is a more complex program that tokenizes the input text and classifies these tokens into predefined categories. For example, in programming languages, a lexer would categorize tokens as keywords, operators, literals, etc. The lexer plays a crucial role in the parsing stage, as it feeds the parser with tokens to facilitate syntactical analysis.

Lexeme

In linguistics, a lexeme is the abstract base unit of meaning underlying a family of related word forms. For example, “write,” “wrote,” “writing,” and “written” would generally all belong to the lexeme “write,” unless you wanted to treat each form separately. In compiler terminology, a lexeme is the actual sequence of characters matched to a token: in the expression “5 + 6,” the lexemes are “5,” “+,” and “6.”
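As a toy illustration of the linguistic sense, a lookup table can map each inflected form of “write” to its lexeme. In practice a lemmatizer (such as those in NLTK or spaCy) computes this mapping rather than hard-coding it:

```python
# Hand-written, illustrative mapping from inflected forms to their base lexeme.
LEXEME_OF = {
    "write": "write",
    "wrote": "write",
    "writing": "write",
    "written": "write",
}

print(LEXEME_OF["wrote"])    # write
print(LEXEME_OF["writing"])  # write
```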

What is lexical analysis?

Lexical analysis, or scanning, is a fundamental step in NLP. In programming languages, this process involves the lexical analyzer (lexer or scanner) reading the source code character by character to group these characters into tokens, the smallest units in the code that convey meaning. These tokens typically fall into categories such as constants (like integers, doubles, characters, strings), operators (arithmetic, logical, relational), punctuation (commas, semicolons, braces), and keywords (reserved words with predefined meanings like if, while, return).

Once the lexical analyzer, or lexer, scans the text, it produces a stream of tokens. This tokenized format is essential for the next processing or program compilation stages. Lexical analysis is also an appropriate time to perform other data-cleaning chores, like stripping white space or discarding comments. You can think of lexical analysis as a pre-processing step before more complex NLP analysis.

Steps of lexical analysis

Lexical analysis is a set of key steps that transform an input text into tokens or lexemes for further NLP analysis. While the process will vary depending on the method used and the type of input, most lexical analysis processes follow these general steps.

1. Identify tokens.

The first step is to determine a fixed set of input symbols. These include letters, digits, operators, brackets, and other special symbols. Each of these symbols or combinations has a specific token type.

2. Assign strings to tokens.

The lexer is programmed to recognize and categorize inputs. For example, it might be set up to recognize “cat” as a string token and “2023” as an integer token. Keywords, identifiers, whitespace, and other elements are similarly categorized.

3. Return the lexeme or value of the token.

The lexeme is the actual sequence of characters in the input that matched the token’s pattern. The lexer returns this lexeme, along with the token’s category or value, which subsequent processing stages can then use.
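The three steps above can be sketched as a small table-driven lexer in Python. The token names (INTEGER, IDENT, OP) and the tiny expression language are invented for illustration, not taken from any particular compiler:

```python
import re

# Step 1: identify the token types and the symbols that make them up.
TOKEN_SPEC = [
    ("INTEGER", r"\d+"),            # one or more digits
    ("IDENT",   r"[A-Za-z_]\w*"),   # names like "count"
    ("OP",      r"[+\-*/=]"),       # arithmetic and assignment operators
    ("SKIP",    r"\s+"),            # whitespace, discarded during scanning
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def lex(source):
    for match in MASTER.finditer(source):
        kind = match.lastgroup      # step 2: the category this string maps to
        lexeme = match.group()      # step 3: the matched character sequence
        if kind != "SKIP":
            yield (kind, lexeme)

print(list(lex("count = 5 + 6")))
# [('IDENT', 'count'), ('OP', '='), ('INTEGER', '5'), ('OP', '+'), ('INTEGER', '6')]
```

Note that the order of entries in the table matters: each position in the input is claimed by the first pattern that matches there.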

Types of lexical analysis

When choosing what type of lexical analysis method to use to process your text or input, you will likely want to use one of two primary types: “Loop and Switch” or “Regular Expressions and Finite Automata.” Each method uses a distinct algorithm to analyze the input and break it down into tokens that are more easily processed by machines.

Loop and switch algorithm 

Loop constructs are like the tools used to read through a book line by line. They do a similar job in lexical analysis, going through the code one character at a time. Think of the loop as being on a mission to read every letter and symbol in the code so that it doesn’t overlook anything, continuing until it reaches the end of the code. This method helps the lexer capture every piece of the code and break it down into small, meaningful tokens.

Switch statements act like quick decision-makers. Once the lexer reads a character or a group of characters, the switch statement jumps in to decide what type of token these characters belong to. This is like sorting items while packing your garage and deciding which box each item should go into. For example, if the loop reads “dog”, the switch statement quickly decides it’s a string or keyword token. This step is crucial for organizing the code into different categories like keywords, numbers, or operators, making it easier to understand and process.
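Python has no C-style switch statement, but an if/elif chain plays the same dispatching role. This sketch, with invented token names, shows the loop reading one character at a time and the “switch” deciding what kind of token to build:

```python
def loop_and_switch_lex(code):
    tokens = []
    i = 0
    while i < len(code):                # the loop: read one character at a time
        ch = code[i]
        # the "switch": dispatch on what kind of character we just read
        if ch.isspace():
            i += 1                      # skip whitespace
        elif ch.isdigit():
            start = i                   # consume the full run of digits
            while i < len(code) and code[i].isdigit():
                i += 1
            tokens.append(("NUMBER", code[start:i]))
        elif ch.isalpha():
            start = i                   # consume the full word
            while i < len(code) and code[i].isalnum():
                i += 1
            tokens.append(("WORD", code[start:i]))
        else:
            tokens.append(("SYMBOL", ch))
            i += 1
    return tokens

print(loop_and_switch_lex("dog = 2023"))
# [('WORD', 'dog'), ('SYMBOL', '='), ('NUMBER', '2023')]
```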

Regular expressions and finite automata 

Regular expressions describe patterns in text. In lexical analysis, they define the rules for how different tokens should look. For instance, a regular expression might describe what an email address or phone number should look like. The lexer uses these expressions to identify tokens by matching the text in the code with these patterns. It’s like having a checklist to see if a piece of text meets certain criteria to be considered a specific type of token.

Finite automata are like smart robots that follow a set of instructions to perform a task. In lexical analysis, they take the rules described by regular expressions and use them to analyze the code. They check each part of the code against these rules to see if they match. If they do, they identify a token.
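As an illustration of the connection, here is a hand-built finite automaton equivalent to the regular expression `\d+` (one or more digits). The state names are invented; real lexer generators build these transition tables automatically from the regular expressions:

```python
# Transition table: (current state, input class) -> next state.
TRANSITIONS = {
    ("start", "digit"): "in_number",
    ("in_number", "digit"): "in_number",
}
ACCEPTING = {"in_number"}   # states in which the input so far is a valid integer

def classify(ch):
    return "digit" if ch.isdigit() else "other"

def matches_integer(text):
    state = "start"
    for ch in text:
        state = TRANSITIONS.get((state, classify(ch)))
        if state is None:   # no rule for this input: reject immediately
            return False
    return state in ACCEPTING

print(matches_integer("2023"))  # True
print(matches_integer("20a3"))  # False
```

A lexer runs automata like this one in parallel (or as one combined machine) and emits a token whenever an accepting state is reached.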

Should you choose lexical analysis for your text processing? 

When deciding whether to choose lexical analysis for your text processing, you should consider the advantages and disadvantages of lexical analysis to make an informed decision. While lexical analysis is a common method for text pre-processing within NLP, it is not a perfect algorithm. Some key advantages and limitations are as follows.

Advantages of lexical analysis

  • Data cleaning: It effectively removes extraneous elements like white spaces or comments, making the source program cleaner.

  • Simplifies input for further analysis: By organizing the input into relevant tokens and discarding irrelevant information, lexical analysis simplifies subsequent syntactical analysis tasks.

  • Compresses the input: Beyond simplification, the lexer reduces the input to a compact stream of tokens that later stages can process more efficiently than raw text.

Limitations of lexical analysis

  • Ambiguity: Lexical analysis can sometimes be ambiguous in its categorization of tokens.

  • Lookahead limitations: The lexer often requires a lookahead feature to decide on the categorization of tokens, which can be a complex process.

  • Localized view of the source program: Lexical analyzers may not detect issues like garbled sequences, undeclared identifiers, or misspelled words, as they only report separate tokens without understanding their interrelation.

Start a career using lexical analysis.

A wide range of careers and fields leverage the power of lexical analysis. As NLP and artificial intelligence continue to grow, the applications across industries will likely increase. One of the most common careers that uses NLP and lexical analysis is NLP engineer. NLP engineers’ job duties vary, but typically, they design natural language processing systems, work with speech systems within artificial intelligence applications, implement new NLP algorithms, and refine models, among other tasks. In the United States, on average, NLP engineers take home an annual base salary of $122,726, according to Glassdoor’s January 2024 data [1].

Other related careers that may use NLP, depending on your focus, may include:

  • Software engineers

  • Data scientists

  • Machine learning engineers

  • Artificial intelligence engineers

  • Language engineers

  • Research engineers

Continue learning with Coursera.

You can stay updated on existing and new advancements within the NLP field with engaging Coursera courses offered by leading research institutions and industry organizations. Consider the Natural Language Processing Specialization offered by DeepLearning.AI. This four-course program will help you gain NLP-related skills, including recurrent neural networks, sentiment analysis, and machine translation.

Article sources

  1. Glassdoor. “How much does a NLP Engineer make?,” https://www.glassdoor.com/Salaries/nlp-engineer-salary-SRCH_KO0,12.htm. Accessed January 19, 2024.


This content has been made available for informational purposes only. Learners are advised to conduct additional research to ensure that courses and other credentials pursued meet their personal, professional, and financial goals.