Python Regex
Regular expressions, or regex, are patterns used to match strings or parts of strings. Python's built-in re
module provides the functions for working with them. Let's break the concept down step by step.
Basics of Regular Expressions
A regular expression is a sequence of characters defining a search pattern. These patterns are used to search, match, extract, and manipulate strings.
Regex Syntax
Special Characters (Meta-characters)
- . : Matches any character except a newline.
- ^ : Matches the start of a string.
- $ : Matches the end of a string.
- * : Matches 0 or more repetitions of the preceding character.
- + : Matches 1 or more repetitions of the preceding character.
- ? : Matches 0 or 1 occurrence of the preceding character.
- {m,n} : Matches between m and n repetitions of the preceding character.
- [] : Matches any one character within the brackets.
- | : Acts as an OR operator.
- () : Groups patterns and captures matched text.
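Several of these metacharacters can be tried directly with re.findall(). The snippet below is a minimal sketch; the sample strings and variable names are invented for illustration.

```python
import re

# . matches any single character except a newline
dot_matches = re.findall(r"c.t", "cat cot cut c\nt")        # ['cat', 'cot', 'cut']

# * allows zero or more of the preceding character, + requires at least one
star_matches = re.findall(r"ab*", "a ab abbb")              # ['a', 'ab', 'abbb']
plus_matches = re.findall(r"ab+", "a ab abbb")              # ['ab', 'abbb']

# ? makes the preceding character optional
optional_matches = re.findall(r"colou?r", "color colour")   # ['color', 'colour']

# [] matches one character from a set, | chooses between alternatives
set_matches = re.findall(r"[bh]at", "bat hat cat")          # ['bat', 'hat']
either_matches = re.findall(r"cat|dog", "cat, dog, bird")   # ['cat', 'dog']
```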
Escaping Special Characters
To match a literal special character, escape it with a backslash (\):
- \. matches a literal dot (.)
- \* matches a literal asterisk (*)
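The difference between an escaped and an unescaped dot is easy to miss, so here is a small sketch of it. The sample text is made up; re.escape() is the standard-library helper that escapes every special character in a string.

```python
import re

text = "Version 3.11 costs 3x11 credits"

# Unescaped dot: "." matches any character, so "3x11" also matches
loose = re.findall(r"3.11", text)     # ['3.11', '3x11']

# Escaped dot: only a literal "." is accepted
strict = re.findall(r"3\.11", text)   # ['3.11']

# re.escape() builds a pattern that matches the given string literally
escaped = re.escape("1+1=2?")
literal = re.findall(escaped, "Is 1+1=2? Yes.")   # ['1+1=2?']
```

Writing patterns as raw strings (the r"..." prefix) keeps Python from interpreting the backslash before the regex engine sees it.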
Character Classes
- \d : Matches any digit (0-9).
- \D : Matches any non-digit.
- \w : Matches any alphanumeric character or underscore (a-z, A-Z, 0-9, _).
- \W : Matches any character that is not alphanumeric or an underscore.
- \s : Matches any whitespace (spaces, tabs, newlines).
- \S : Matches any non-whitespace character.
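As a rough illustration of these classes, the sketch below runs each one over short sample strings (the strings and variable names are assumptions chosen for the example).

```python
import re

text = "Order #42 shipped to user_7 on May 3"

digits = re.findall(r"\d+", text)          # runs of digits: ['42', '7', '3']
words = re.findall(r"\w+", text)           # runs of letters, digits, underscore
non_word = re.findall(r"\W", "a-b c!")     # everything \w rejects: ['-', ' ', '!']
non_space = re.findall(r"\S+", "a  b\tc")  # runs of non-whitespace: ['a', 'b', 'c']
blanks = re.findall(r"\s", "a b\tc\nd")    # [' ', '\t', '\n']
```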
Anchors
- ^ : Matches the beginning of a string (or beginning of a line in multiline mode).
- $ : Matches the end of a string (or end of a line in multiline mode).
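A short sketch of how anchors restrict where a match may occur, using re.search() on made-up strings:

```python
import re

# ^ anchors the pattern to the start of the string
starts = re.search(r"^hello", "hello world")      # match object
not_start = re.search(r"^world", "hello world")   # None: "world" is not at the start

# $ anchors the pattern to the end of the string
ends = re.search(r"world$", "hello world")        # match object

# Used together, the pattern has to account for the whole string
whole = re.search(r"^\d+$", "12345")              # match object
partial = re.search(r"^\d+$", "123abc")           # None: trailing letters
```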
Repetitions
- * : Matches 0 or more occurrences.
- + : Matches 1 or more occurrences.
- ? : Matches 0 or 1 occurrence.
- {m} : Matches exactly m occurrences.
- {m,n} : Matches between m and n occurrences.
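The counted forms {m} and {m,n} can be sketched as follows. The inputs are invented, and re.fullmatch() is used here so that the whole string has to fit the pattern.

```python
import re

# {5}: exactly five digits
exact = re.fullmatch(r"\d{5}", "90210")      # match object
too_short = re.fullmatch(r"\d{5}", "9021")   # None

# {4,6}: between four and six digits
candidates = ["123", "1234", "123456", "1234567"]
ranged = [s for s in candidates if re.fullmatch(r"\d{4,6}", s)]

# Quantifiers are greedy: each match takes as many repetitions as allowed
greedy = re.findall(r"a{2,3}", "aaaaa")      # ['aaa', 'aa']
```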
Grouping and Capturing
- (...) : Groups a pattern and captures it.
- (?:...) : Groups a pattern but does not capture it (non-capturing group).
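To make the difference between capturing and non-capturing groups concrete, here is a minimal sketch with made-up input strings. Note how re.findall() returns the captured text, not the whole match, once a pattern contains a capturing group.

```python
import re

# Capturing groups are numbered from 1; group(0) is the whole match
match = re.search(r"(\d{4})-(\d{2})", "Released 2024-12")
whole = match.group(0)     # '2024-12'
year = match.group(1)      # '2024'
month = match.group(2)     # '12'
both = match.groups()      # ('2024', '12')

# With a capturing group, findall returns only what the group captured
captured = re.findall(r"(ab)+c", "ababc abc")      # ['ab', 'ab']

# With a non-capturing group, findall returns the full matches
uncaptured = re.findall(r"(?:ab)+c", "ababc abc")  # ['ababc', 'abc']
```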
Flags
- re.IGNORECASE or re.I : Case-insensitive matching.
- re.MULTILINE or re.M : Makes ^ and $ match the start and end of each line.
- re.DOTALL or re.S : Makes . match any character, including newlines.
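The effect of each flag is easiest to see on a multi-line string. The sketch below uses an invented three-line sample; flags are passed as the last argument and can be combined with |.

```python
import re

text = "First line\nsecond LINE\nthird line"

# re.IGNORECASE: "line" also matches "LINE"
any_case = re.findall(r"line", text, re.IGNORECASE)   # ['line', 'LINE', 'line']

# Without re.MULTILINE, ^ only matches at the very start of the string
first_only = re.findall(r"^\w+", text)                # ['First']
each_line = re.findall(r"^\w+", text, re.MULTILINE)   # ['First', 'second', 'third']

# Without re.DOTALL, . stops at newlines
single_line = re.search(r"First.*third", text)             # None
across_lines = re.search(r"First.*third", text, re.DOTALL) # match object

# Flags can be combined with |
combined = re.findall(r"^\w+ line$", text, re.I | re.M)
```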
Python re Module
You can use the re module to work with regex in Python. Common functions include:
1. re.match()
Matches the pattern at the beginning of the string.
import re
pattern = r"hello"
result = re.match(pattern, "hello world")
print(result.group()) # Output: hello
2. re.search()
Searches the entire string for a match.
result = re.search(r"world", "hello world")
print(result.group()) # Output: world
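Both re.match() and re.search() return None when nothing matches, so calling .group() on the result without checking raises an AttributeError. The sketch below contrasts the two functions and shows one way to guard against that; first_match is a hypothetical helper name, not part of the re module.

```python
import re

text = "hello world"

# re.match() only looks at the start of the string
at_start = re.match(r"world", text)    # None: "world" is not at position 0

# re.search() scans the whole string
anywhere = re.search(r"world", text)   # match object, span (6, 11)

def first_match(pattern, string):
    """Return the first matched text, or None if there is no match."""
    m = re.search(pattern, string)
    return m.group() if m else None
```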
3. re.findall()
Returns a list of all matches in the string.
result = re.findall(r"\d+", "There are 3 cats, 4 dogs, and 5 birds.")
print(result) # Output: ['3', '4', '5']
4. re.sub()
Replaces matches with a specified string.
result = re.sub(r"cat", "dog", "The cat is cute.")
print(result) # Output: The dog is cute.
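Beyond a fixed replacement string, re.sub() also accepts group references, a count limit, and a function. The following is a small sketch of those three options with made-up inputs.

```python
import re

# \1, \2, \3 in the replacement refer to the captured groups
reordered = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", "2024-12-25")
# '25/12/2024'

# count limits how many matches are replaced
first_only = re.sub(r"cat", "dog", "cat cat cat", count=1)
# 'dog cat cat'

# A function receives each match object and returns the replacement text
doubled = re.sub(r"\d+", lambda m: str(int(m.group()) * 2), "3 cats, 4 dogs")
# '6 cats, 8 dogs'
```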
5. re.split()
Splits the string at each match of the pattern.
result = re.split(r"\s+", "Split this sentence into words.")
print(result) # Output: ['Split', 'this', 'sentence', 'into', 'words.']
6. re.compile()
Compiles a regex pattern for reuse.
pattern = re.compile(r"\d+")
result = pattern.findall("123 and 456")
print(result) # Output: ['123', '456']
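A compiled pattern exposes the same operations as methods (findall, search, sub, and so on), which is convenient when one pattern is applied to many strings. A minimal sketch, with invented sample lines:

```python
import re

number = re.compile(r"\d+")   # compile once, reuse below

lines = ["3 cats", "no pets", "12 dogs and 1 bird"]

# The compiled object offers the same functions as the re module
counts = [number.findall(line) for line in lines]            # [['3'], [], ['12', '1']]
has_number = [bool(number.search(line)) for line in lines]   # [True, False, True]
masked = number.sub("#", lines[2])                           # '# dogs and # bird'
```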
Practical Examples
1. Validate an Email Address
# \. matches the literal dot before the top-level domain
pattern = r"^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$"
email = "example@mail.com"
if re.match(pattern, email):
    print("Valid email")
else:
    print("Invalid email")
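For repeated checks, the same idea can be wrapped in a function. This is only a sketch: is_valid_email and EMAIL_PATTERN are hypothetical names, the dot before the top-level domain is escaped as \. so it matches only a literal dot, and the pattern is a simple format check, not a full validation of email address rules.

```python
import re

# Simplified email shape: local part, "@", domain, literal ".", rest of the domain
EMAIL_PATTERN = re.compile(r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9.-]+")

def is_valid_email(address):
    """Return True if the whole string has the simplified email shape."""
    return EMAIL_PATTERN.fullmatch(address) is not None

checks = {
    "example@mail.com": is_valid_email("example@mail.com"),
    "user@mail.co.uk": is_valid_email("user@mail.co.uk"),
    "examplemail.com": is_valid_email("examplemail.com"),
    "user@mailcom": is_valid_email("user@mailcom"),
}
```

Using fullmatch() makes the ^ and $ anchors unnecessary, because the whole string must match.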
2. Extract Dates
text = "The event is on 2024-12-25 and 2025-01-01."
dates = re.findall(r"\d{4}-\d{2}-\d{2}", text)
print(dates) # Output: ['2024-12-25', '2025-01-01']
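If the year, month, and day are needed separately, capturing groups can be added to the same pattern. In this sketch re.findall() returns one tuple per date; the variable names are illustrative.

```python
import re

text = "The event is on 2024-12-25 and 2025-01-01."

# One capturing group per component: findall returns a tuple for each match
parts = re.findall(r"(\d{4})-(\d{2})-(\d{2})", text)
# [('2024', '12', '25'), ('2025', '01', '01')]

# Convert the captured strings to integers where numbers are needed
years = [int(year) for year, month, day in parts]   # [2024, 2025]
```

The pattern only checks the shape of the text, so a string such as 2024-99-99 would also match.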
3. Replace Multiple Spaces with a Single Space
text = "This   is  a    test."
result = re.sub(r"\s+", " ", text)
print(result) # Output: This is a test.
Tips for Learning Regex
- Use an online regex tester like regex101 to experiment with patterns.
- Break down complex patterns into smaller parts.
- Start with simple patterns and gradually build complexity.