Tokenizer in Python
In Python, a tokenizer breaks a string into smaller units, called tokens. A token may be a word, a number, a punctuation mark, or any other meaningful unit in context. Tokenization is an essential step in natural language processing (NLP) and text analysis.
A detailed explanation of tokenization, with examples, is given below:
1. Tokenizing Strings Using str.split()
The simplest and most naive way to tokenize a string is with split(). It splits the string on a delimiter; when no argument is passed, any run of whitespace is used as the default.
Important Properties:
- Default: Splits on any whitespace (spaces, tabs, newlines).
- Custom separator: You can pass your own separator, for instance split(",").
- Limitations:
- Does not split punctuation away from words.
- Treats punctuation as part of the adjacent word.
Example:
text = "Hello, world! Welcome to Python programming."
tokens = text.split()
print(tokens)
Output:
['Hello,', 'world!', 'Welcome', 'to', 'Python', 'programming.']
- Input string: "Hello, world! Welcome to Python programming."
- The split() function finds spaces and splits the string wherever they occur.
- Result: ['Hello,', 'world!', 'Welcome', 'to', 'Python', 'programming.']
- Notice how punctuation such as , and . stays attached to words, as in Hello, and programming.
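The custom-separator behavior mentioned above is worth a quick sketch. The sample strings here are made up for illustration; split() also accepts an optional maxsplit argument that caps the number of splits:

```python
# Splitting on an explicit separator instead of whitespace.
csv_line = "name,age,city"       # hypothetical comma-separated string
fields = csv_line.split(",")     # split on every comma
print(fields)                    # ['name', 'age', 'city']

# maxsplit limits how many splits are performed.
first, rest = "one two three four".split(" ", 1)
print(first)                     # 'one'
print(rest)                      # 'two three four'
```

The maxsplit form is handy when only the first field matters and the remainder should stay intact.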
2. Tokenizing with nltk Library
For more advanced tokenization than split(), the nltk library offers two major functions:
- word_tokenize: Breaks a sentence into words, separating punctuation from the words.
- sent_tokenize: Breaks a text into individual sentences.
Installation:
pip install nltk
word_tokenize Details:
- Uses trained algorithms to handle punctuation and whitespace correctly.
- Emits punctuation marks such as commas, periods, and exclamation points as separate tokens.
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt') # Download required data for tokenization
text = "Hello, world! Welcome to Python programming."
tokens = word_tokenize(text)
print(tokens)
Output:
['Hello', ',', 'world', '!', 'Welcome', 'to', 'Python', 'programming', '.']
- Input: "Hello, world! Welcome to Python programming."
- Tokenization result: ['Hello', ',', 'world', '!', 'Welcome', 'to', 'Python', 'programming', '.']
- Words and punctuation marks are returned as separate tokens.
sent_tokenize Details:
- Splits text into sentences based on punctuation and capitalization patterns.
- Useful for paragraph-level tokenization.
from nltk.tokenize import sent_tokenize
text = "Hello, world! Welcome to Python programming. Let's learn NLP."
sentences = sent_tokenize(text)
print(sentences)
Output:
['Hello, world!', 'Welcome to Python programming.', "Let's learn NLP."]
- Input: "Hello, world! Welcome to Python programming. Let's learn NLP."
- Sentence tokenization output: ['Hello, world!', 'Welcome to Python programming.', "Let's learn NLP."]
- Each sentence is a separate string in the list.
3. Regular Expressions (re Module)
Python’s re module provides powerful and flexible tokenization based on regular expressions. You define patterns for matching the tokens you are interested in extracting.
Important Properties:
- Flexibility: You can define patterns that match exactly the tokens you need.
- Pattern Explanation:
- \b: Matches a word boundary.
- \w+: Matches one or more word characters (letters, digits, or underscores).
Example:
import re
text = "Hello, world! Welcome to Python programming."
tokens = re.findall(r'\b\w+\b', text)
print(tokens)
Output:
['Hello', 'world', 'Welcome', 'to', 'Python', 'programming']
- Input: "Hello, world! Welcome to Python programming."
- The regular expression \b\w+\b matches only the words and excludes the punctuation.
- Result: ['Hello', 'world', 'Welcome', 'to', 'Python', 'programming']
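Because the pattern is entirely under your control, you can just as easily keep punctuation marks as tokens of their own. A sketch using an alternative pattern:

```python
import re

text = "Hello, world! Welcome to Python programming."

# \w+ matches runs of word characters; [^\w\s] matches any single
# character that is neither a word character nor whitespace (i.e. punctuation).
tokens = re.findall(r"\w+|[^\w\s]", text)
print(tokens)
# ['Hello', ',', 'world', '!', 'Welcome', 'to', 'Python', 'programming', '.']
```

This reproduces word_tokenize-style output with nothing but the standard library, though it lacks nltk's handling of contractions and abbreviations.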
4. Tokenizing Using spaCy
spaCy is a high-performance NLP library built for production-level NLP tasks. Its tokenizer automatically handles the language-specific rules involved in tokenization.
Installation:
pip install spacy
python -m spacy download en_core_web_sm
Important Properties:
- Pretrained Models: Language-specific models for accurate tokenization.
- Advanced Handling: Treats punctuation, whitespace, numbers, and special characters as distinct tokens.
- Rich Output: Returns each token as an object carrying additional metadata (e.g., part-of-speech tags, dependency labels).
Example:
import spacy
nlp = spacy.load("en_core_web_sm")
text = "Hello, world! Welcome to Python programming."
doc = nlp(text)
tokens = [token.text for token in doc]
print(tokens)
Output:
['Hello', ',', 'world', '!', 'Welcome', 'to', 'Python', 'programming', '.']
- Input: "Hello, world! Welcome to Python programming."
- The spaCy tokenizer breaks the string into individual tokens: ['Hello', ',', 'world', '!', 'Welcome', 'to', 'Python', 'programming', '.']
- Punctuation marks such as , and ! are treated as separate tokens.
5. Tokenizing with transformers for NLP Models
The Hugging Face transformers library is typically used for preparing text data for deep learning models such as BERT and GPT. It preprocesses the text by tokenizing it into subword units to accommodate large vocabularies.
Installation:
pip install transformers
Important Properties:
- Subword tokenization: Breaks words into smaller units if they are not in the vocabulary.
- Pretrained Models: Each model has its own tokenizer designed for compatibility.
Example:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
text = "Hello, world! Welcome to Python programming."
tokens = tokenizer.tokenize(text)
print(tokens)
Output:
['hello', ',', 'world', '!', 'welcome', 'to', 'python', 'programming', '.']
- Input: "Hello, world! Welcome to Python programming."
- Result: ['hello', ',', 'world', '!', 'welcome', 'to', 'python', 'programming', '.']
- Words are converted to lowercase (specific to bert-base-uncased).
- A word that is not in the model's vocabulary is divided into subwords.
6. Custom Tokenization
Custom tokenization involves writing one’s own logic to break up the text into tokens based on a specific requirement.
Important Properties:
- Flexibility: Tailored to your needs (e.g., ignoring certain punctuation, handling specific delimiters).
- Trade-offs: More effort is required compared to using existing libraries.
Example:
def custom_tokenizer(text):
    tokens = text.replace(",", "").replace(".", "").split()
    return tokens
text = "Hello, world! Welcome to Python programming."
tokens = custom_tokenizer(text)
print(tokens)
Output:
['Hello', 'world!', 'Welcome', 'to', 'Python', 'programming']
- Input: "Hello, world! Welcome to Python programming."
- The custom logic removes the punctuation marks , and . before splitting on whitespace.
- Result: ['Hello', 'world!', 'Welcome', 'to', 'Python', 'programming']
- Note that world! keeps its ! because the custom logic never removes that character.
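If the requirement is to drop every punctuation mark rather than just , and ., the custom logic can be extended with str.translate. A sketch (strip_punct_tokenizer is a hypothetical name for this variant):

```python
import string

def strip_punct_tokenizer(text):
    # Build a translation table that deletes every ASCII punctuation character.
    table = str.maketrans("", "", string.punctuation)
    return text.translate(table).split()

text = "Hello, world! Welcome to Python programming."
print(strip_punct_tokenizer(text))
# ['Hello', 'world', 'Welcome', 'to', 'Python', 'programming']
```

Unlike chained replace() calls, this handles all of string.punctuation in one pass, so the ! is removed as well.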
Summary Table
| Method | Pros | Cons |
|---|---|---|
| str.split() | Simple and easy to use. | Does not handle punctuation or advanced scenarios. |
| nltk | Well-suited for basic NLP tasks. | Requires downloading additional data. |
| re (Regular Expressions) | Highly customizable for specific rules. | Requires knowledge of regular expressions. |
| spaCy | Fast, efficient, and feature-rich. | Larger memory footprint and setup required. |
| transformers | Optimized for deep learning tasks. | Not ideal for simple text analysis. |
| Custom Tokenization | Fully tailored to your needs. | Time-consuming to develop and debug. |