Python Regex

Regular expressions (regex) in Python are patterns used to match strings or parts of strings. The re module in the standard library provides the functions for working with them. Let's break the concept down step by step.

Basics of Regular Expressions

A regular expression is a sequence of characters defining a search pattern. These patterns are used to search, match, extract, and manipulate strings.

Regex Syntax

Special Characters (Meta-characters)

  • .: Matches any character except a newline.
  • ^: Matches the start of a string.
  • $: Matches the end of a string.
  • *: Matches 0 or more repetitions of the preceding element (a character, character class, or group).
  • +: Matches 1 or more repetitions of the preceding element.
  • ?: Matches 0 or 1 occurrence of the preceding element.
  • {m,n}: Matches between m and n repetitions of the preceding element.
  • []: Matches any one character within the brackets.
  • |: Acts as an OR operator.
  • (): Groups patterns and captures matched text.
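
A few of the metacharacters above can be tried out directly. The snippet below is a minimal sketch; the sample strings are made up for illustration.

```python
import re

# "." matches any single character except a newline, so "c\nt" is skipped.
dot_matches = re.findall(r"c.t", "cat cot c\nt cut")
print(dot_matches)  # ['cat', 'cot', 'cut']

# "|" chooses between alternatives; "()" groups them and captures the match.
# With one capturing group, findall returns the captured text only.
pets = re.findall(r"(cat|dog)s?", "cats and dogs and a cat")
print(pets)  # ['cat', 'dog', 'cat']

# "[]" matches exactly one character from the set.
vowels = re.findall(r"[aeiou]", "regex")
print(vowels)  # ['e', 'e']
```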

Escaping Special Characters

To match a literal special character, escape it with a backslash (\):

  • \. matches a literal dot (.).
  • \* matches a literal asterisk (*).
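
To see why escaping matters, the sketch below compares an unescaped and an escaped dot. It also uses re.escape(), which escapes every special character in a string for you; the sample text is an assumption made for illustration.

```python
import re

# An unescaped dot matches any character, so "3.50" also matches "3x50".
loose = re.search(r"3.50", "3x50") is not None
# An escaped dot matches only a literal ".".
strict = re.search(r"3\.50", "3x50") is not None
print(loose, strict)  # True False

# re.escape() builds a pattern that matches the given string literally.
text = "Total: 3.50 (approx.)"
found = re.search(re.escape("(approx.)"), text).group()
print(found)  # (approx.)
```

re.escape() is mostly useful when the text to match comes from user input or another variable rather than being typed into the pattern by hand.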

Character Classes

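  • \d: Matches any digit (0-9; for str patterns, also other Unicode decimal digits unless re.ASCII is set).
  • \D: Matches any non-digit.
  • \w: Matches any word character: letters, digits, and underscore. For str patterns this includes Unicode letters and digits; with re.ASCII it is limited to a-z, A-Z, 0-9, and _.
  • \W: Matches any character that is not a word character.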
  • \s: Matches any whitespace (spaces, tabs, newlines).
  • \S: Matches any non-whitespace character.
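
The character classes can be compared side by side on one string. This is a small sketch with an invented sample string; the re.ASCII flag is used only to show how it narrows \w.

```python
import re

text = "Order #42 shipped on 2024-12-25\tto user_7"

digits = re.findall(r"\d+", text)
print(digits)  # ['42', '2024', '12', '25', '7']

# \s+ treats the tab the same way as the spaces.
parts = re.split(r"\s+", text)
print(parts)  # ['Order', '#42', 'shipped', 'on', '2024-12-25', 'to', 'user_7']

# By default \w is Unicode-aware for str patterns; re.ASCII restricts it.
unicode_words = re.findall(r"\w+", "caf\u00e9")
ascii_words = re.findall(r"\w+", "caf\u00e9", re.ASCII)
print(unicode_words, ascii_words)
```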

Anchors

  • ^: Matches the beginning of a string (or beginning of a line in multiline mode).
  • $: Matches the end of a string (or end of a line in multiline mode).
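
The effect of multiline mode on the anchors is easiest to see on a string with several lines. The following is a minimal sketch using a made-up three-line string.

```python
import re

text = "first line\nsecond line\nthird line"

# Without re.MULTILINE, ^ matches only at the very start of the string.
start_default = re.findall(r"^\w+", text)
print(start_default)  # ['first']

# With re.MULTILINE, ^ also matches right after each newline.
start_multiline = re.findall(r"^\w+", text, re.MULTILINE)
print(start_multiline)  # ['first', 'second', 'third']

# The same applies to $ at the end of the string versus the end of each line.
end_default = re.findall(r"\w+$", text)
end_multiline = re.findall(r"\w+$", text, re.MULTILINE)
print(end_default, end_multiline)  # ['line'] ['line', 'line', 'line']
```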

Repetitions

  • *: Matches 0 or more occurrences.
  • +: Matches 1 or more occurrences.
  • ?: Matches 0 or 1 occurrence.
  • {m}: Matches exactly m occurrences.
  • {m,n}: Matches between m and n occurrences.
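
The quantifiers above can be compared on the same input. In this sketch the sample strings are invented, and \b (a word boundary, not covered in the list above) is used only to keep a match from starting or ending inside a longer number.

```python
import re

sample = "a ab abbb"
star = re.findall(r"ab*", sample)      # zero or more "b"
plus = re.findall(r"ab+", sample)      # one or more "b"
optional = re.findall(r"ab?", sample)  # zero or one "b"
print(star)      # ['a', 'ab', 'abbb']
print(plus)      # ['ab', 'abbb']
print(optional)  # ['a', 'ab', 'ab']

numbers = "7 42 123 4567"
exactly_three = re.findall(r"\b\d{3}\b", numbers)
two_or_three = re.findall(r"\b\d{2,3}\b", numbers)
print(exactly_three)  # ['123']
print(two_or_three)   # ['42', '123']
```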

Grouping and Capturing

  • (...): Groups a pattern and captures it.
  • (?:...): Groups a pattern but does not capture it (non-capturing group).
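
The difference between the two kinds of group shows up in what the match object reports. This is a short sketch with made-up input strings.

```python
import re

# Two capturing groups: each is available by number, group(0) is the whole match.
m = re.search(r"(\d{4})-(\d{2})", "Released 2024-12")
print(m.group(0))  # 2024-12
print(m.group(1))  # 2024
print(m.groups())  # ('2024', '12')

# (?:...) groups the alternatives without capturing them,
# so only the name ends up in groups().
n = re.search(r"(?:Mr|Ms)\. (\w+)", "Dear Ms. Lovelace")
print(n.groups())  # ('Lovelace',)
```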

Flags

  • re.IGNORECASE or re.I: Case-insensitive matching.
  • re.MULTILINE or re.M: Makes ^ and $ match the start and end of each line.
  • re.DOTALL or re.S: Makes . match any character, including newlines.
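
Flags are passed as the last argument to the re functions and can be combined with the | operator. The sketch below uses an invented two-line string to show re.IGNORECASE and re.DOTALL.

```python
import re

text = "Hello\nWORLD"

# re.IGNORECASE lets lowercase patterns match text in any case.
any_case = re.findall(r"hello|world", text, re.IGNORECASE)
print(any_case)  # ['Hello', 'WORLD']

# By default "." does not match the newline between the two words.
without_dotall = re.search(r"Hello.WORLD", text)
with_dotall = re.search(r"Hello.WORLD", text, re.DOTALL)
print(without_dotall is None, with_dotall is not None)  # True True

# Flags can be combined with "|".
combined = re.search(r"hello.world", text, re.I | re.S)
print(combined is not None)  # True
```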

Python re Module

You can use the re module to work with regex in Python. Common functions include:

1. re.match()

Matches the pattern at the beginning of the string.

import re
pattern = r"hello"
result = re.match(pattern, "hello world")
print(result.group()) # Output: hello
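
Note that re.match() returns None when the string does not start with the pattern, and calling .group() on None raises an AttributeError. A minimal sketch of a safer pattern follows; the helper name leading_hello is hypothetical.

```python
import re

def leading_hello(text):
    """Return the matched text if `text` starts with "hello", else None."""
    match = re.match(r"hello", text)
    return match.group() if match else None

print(leading_hello("hello world"))  # hello
print(leading_hello("say hello"))    # None
```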

2. re.search()

Searches the entire string for a match.

result = re.search(r"world", "hello world")
print(result.group()) # Output: world
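
The contrast with re.match() can be sketched on the same input: re.search() finds "world" in the middle of the string, while re.match() does not because the string starts with "hello". The match object also reports where the match was found.

```python
import re

text = "hello world"

found = re.search(r"world", text)
print(found.span())  # (6, 11)

# re.match() only tries at position 0, so it finds nothing here.
at_start = re.match(r"world", text)
print(at_start)  # None
```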

3. re.findall()

Returns a list of all matches in the string.

result = re.findall(r"\d+", "There are 3 cats, 4 dogs, and 5 birds.")
print(result) # Output: ['3', '4', '5']
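
When the pattern contains more than one capturing group, re.findall() returns a list of tuples, one tuple per match. A short sketch reusing the sentence above:

```python
import re

text = "There are 3 cats, 4 dogs, and 5 birds."

# Each match yields a (count, animal) tuple because the pattern has two groups.
pairs = re.findall(r"(\d+) (\w+)", text)
print(pairs)  # [('3', 'cats'), ('4', 'dogs'), ('5', 'birds')]
```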

4. re.sub()

Replaces matches with a specified string.

result = re.sub(r"cat", "dog", "The cat is cute.")
print(result) # Output: The dog is cute.
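
re.sub() also accepts backreferences in the replacement, a count limit, and a function as the replacement. The sketch below shows all three on invented sample strings.

```python
import re

# \1, \2, \3 refer to the captured groups, here used to reorder a date.
reordered = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", "2024-12-25")
print(reordered)  # 25/12/2024

# count limits how many matches are replaced.
first_only = re.sub(r"cat", "dog", "cat cat cat", count=1)
print(first_only)  # dog cat cat

# A function receives each match object and returns the replacement text.
doubled = re.sub(r"\d+", lambda m: str(int(m.group()) * 2), "3 cats and 4 dogs")
print(doubled)  # 6 cats and 8 dogs
```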

5. re.split()

Splits the string at each match of the pattern.

result = re.split(r"\s+", "Split this sentence into words.")
print(result) # Output: ['Split', 'this', 'sentence', 'into', 'words.']
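
Unlike str.split(), re.split() can split on several different delimiters at once, and maxsplit limits the number of splits. A minimal sketch with made-up input:

```python
import re

# Split on a comma or semicolon followed by optional whitespace.
fields = re.split(r"[,;]\s*", "a, b;c,  d")
print(fields)  # ['a', 'b', 'c', 'd']

# maxsplit=2 performs at most two splits; the rest stays in the last item.
limited = re.split(r"\s+", "one two three four", maxsplit=2)
print(limited)  # ['one', 'two', 'three four']
```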

6. re.compile()

Compiles a regex pattern for reuse.

pattern = re.compile(r"\d+")
result = pattern.findall("123 and 456")
print(result) # Output: ['123', '456']
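
Compiling pays off mainly when the same pattern is applied to many strings. The sketch below filters a list with one compiled pattern; the name DATE_RE and the log lines are hypothetical.

```python
import re

# Compile once, then reuse the pattern object's methods.
DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")

lines = [
    "2024-12-25 backup finished",
    "no date here",
    "restart scheduled for 2025-01-01",
]

# Keep only the lines that contain a date, then pull the date out of each.
dated = [line for line in lines if DATE_RE.search(line)]
first_dates = [DATE_RE.search(line).group() for line in dated]
print(first_dates)  # ['2024-12-25', '2025-01-01']
```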

Practical Examples

1. Validate an Email Address

pattern = r"^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$"  # \. matches the literal dot in the domain
email = "example@mail.com"
if re.match(pattern, email):
    print("Valid email")
else:
    print("Invalid email")
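
The same check can be wrapped in a reusable function. This is a sketch only: the helper name is_valid_email is hypothetical, re.fullmatch() replaces the explicit ^ and $ anchors, and a pattern this simple is a rough format check that does not cover every valid address.

```python
import re

# Local part, "@", domain label, a literal dot, then the rest of the domain.
EMAIL_RE = re.compile(r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9.-]+")

def is_valid_email(address):
    """Return True if the whole string looks like an email address."""
    return EMAIL_RE.fullmatch(address) is not None

print(is_valid_email("example@mail.com"))     # True
print(is_valid_email("example@mailcom"))      # False
print(is_valid_email("user@sub.domain.org"))  # True
```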

2. Extract Dates

text = "The event is on 2024-12-25 and 2025-01-01."
dates = re.findall(r"\d{4}-\d{2}-\d{2}", text)
print(dates) # Output: ['2024-12-25', '2025-01-01']
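
If the dates are needed as date objects rather than strings, capturing groups can split each match into year, month, and day. A minimal sketch, assuming every matched string is a real calendar date (datetime.date raises ValueError otherwise):

```python
import re
from datetime import date

text = "The event is on 2024-12-25 and 2025-01-01."

# With three groups, findall returns (year, month, day) tuples of strings.
parsed = [
    date(int(year), int(month), int(day))
    for year, month, day in re.findall(r"(\d{4})-(\d{2})-(\d{2})", text)
]
print(parsed)
```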

3. Replace Multiple Spaces with a Single Space

text = "This   is   a    test."
result = re.sub(r"\s+", " ", text)
print(result) # Output: This is a test.
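
Because \s also matches tabs and newlines, the same substitution normalizes mixed whitespace. The sketch below adds str.strip() to drop the single space left at each end; the messy input string is invented.

```python
import re

messy = "  This   is\ta    test.\n"

# Collapse every run of whitespace to one space, then trim both ends.
normalized = re.sub(r"\s+", " ", messy).strip()
print(normalized)  # This is a test.
```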

Tips for Learning Regex

  • Use an online regex tester like regex101 to experiment with patterns.
  • Break down complex patterns into smaller parts.
  • Start with simple patterns and gradually build complexity.