Tabula Python
tabula-py is a Python library that extracts tables from PDF files. It is most helpful for PDFs whose tabular data would be tedious or error-prone to copy out by hand. Tabula itself is written in Java, and tabula-py calls Tabula-Java under the hood. This guide walks through using Tabula with Python, step by step:
1. Installing Tabula
Before using Tabula in Python, you need to install the tabula-py library, which serves as a Python wrapper for Tabula-Java.
Installation:
Run the following command in your terminal or command prompt:
pip install tabula-py
Note: Tabula requires Java to be installed on your system. Ensure Java is installed and properly configured in your system’s environment variables.
You can check if Java is installed by running:
java -version
If not, download and install Java from the official Java website.
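The same check can be done from Python before any extraction is attempted. This is a minimal sketch using only the standard library; the helper name java_available is made up for illustration and is not part of tabula-py.

```python
import shutil
import subprocess


def java_available() -> bool:
    """Return True if a `java` executable is on PATH and runs successfully."""
    java_path = shutil.which("java")
    if java_path is None:
        return False
    try:
        # `java -version` writes to stderr; only the exit code matters here.
        result = subprocess.run(
            [java_path, "-version"],
            capture_output=True,
            timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


if __name__ == "__main__":
    print("Java found" if java_available() else "Java not found on PATH")
```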
2. Importing Tabula
Once tabula-py is installed, import the necessary functions into your Python script. Here’s how:
from tabula import read_pdf, convert_into
You’ll need these two functions:
- read_pdf: To extract tables from the PDF.
- convert_into: To convert PDF tables directly into files like CSV.
3. Extracting Tables from PDFs
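The read_pdf function is the most commonly used feature of Tabula. It reads tables from a PDF file and returns them as Pandas DataFrames. In current versions of tabula-py the result is a list with one DataFrame per detected table, even when only one table is found.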
Syntax:
read_pdf(input_path, pages=1, output_format="dataframe", **kwargs)
Example 1: Extracting a Single Table
Let’s say you have a PDF file named example.pdf with a table on the first page.
from tabula import read_pdf
# Read the tables on page 1 of the PDF (read_pdf returns a list of DataFrames)
tables = read_pdf("example.pdf", pages=1)
# Display the first table
print(tables[0])
Output:
If the PDF contains a table like this:
| Name | Age | City |
|---|---|---|
| John | 25 | New York |
| Alice | 30 | Los Angeles |
| Bob | 22 | Chicago |
The printed output in Python will be:
    Name  Age         City
0   John   25     New York
1  Alice   30  Los Angeles
2    Bob   22      Chicago
Example 2: Extracting Multiple Tables
If the page contains more than one table, you can use the multiple_tables option to return a list of DataFrames:
tables = read_pdf("example.pdf", pages=1, multiple_tables=True)
# Print all extracted tables
for i, table in enumerate(tables):
    print(f"Table {i+1}:")
    print(table)
Output: If there are two tables on the page, the output will look like this:
Table 1:
    Name  Age         City
0   John   25     New York
1  Alice   30  Los Angeles
Table 2:
   Product  Price
0   Laptop   $999
1    Phone   $499
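When a page yields several tables, it is often easier to pick one by its column names than by its position in the list. The sketch below assumes the list holds ordinary Pandas DataFrames; find_table is a hypothetical helper, and the two sample frames stand in for what read_pdf might return for the tables above.

```python
import pandas as pd


def find_table(tables, required_columns):
    """Return the first DataFrame whose columns include every required name, else None."""
    wanted = set(required_columns)
    for table in tables:
        if wanted.issubset(table.columns):
            return table
    return None


# Stand-ins for the two extracted tables shown above.
people = pd.DataFrame({"Name": ["John", "Alice"], "Age": [25, 30], "City": ["New York", "Los Angeles"]})
prices = pd.DataFrame({"Product": ["Laptop", "Phone"], "Price": ["$999", "$499"]})

match = find_table([people, prices], ["Product", "Price"])
print(list(match.columns))  # ['Product', 'Price']
```

Selecting by header keeps the script working if the order of detected tables changes between PDFs.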
4. Converting Tables to Other Formats
If you want to directly save the extracted table into a file (e.g., CSV or JSON), you can use the convert_into function.
Syntax:
convert_into(input_path, output_path, output_format="csv", pages=1, **kwargs)
Example: Saving a Table as a CSV File
from tabula import convert_into
# Convert table from page 1 of the PDF into a CSV file
convert_into("example.pdf", "output.csv", output_format="csv", pages=1)
print("PDF table has been saved as a CSV file.")
Expected Output: A file named output.csv is created with the following content:
Name,Age,City
John,25,New York
Alice,30,Los Angeles
Bob,22,Chicago
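A quick way to confirm the conversion worked is to read the CSV back and inspect a few rows. The sketch below uses only the standard library's csv module. Because it cannot assume example.pdf exists, it writes the sample content shown above to a temporary file as a stand-in for the real output.csv; read_csv_rows is an illustrative helper name.

```python
import csv
import tempfile
from pathlib import Path


def read_csv_rows(csv_path):
    """Load a CSV file into a list of dicts keyed by the header row."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


# Stand-in for the file convert_into would write (content copied from above).
sample = "Name,Age,City\nJohn,25,New York\nAlice,30,Los Angeles\nBob,22,Chicago\n"
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / "output.csv"
    csv_path.write_text(sample, encoding="utf-8")
    rows = read_csv_rows(csv_path)

print(len(rows))         # 3
print(rows[0]["City"])   # New York
```

Note that csv.DictReader returns every value as a string, so Age is "25" rather than 25 until you convert it.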
5. Fine-Tuning Table Extraction
Tabula provides the following refinements to adjust the extraction process.
Common Options:
- area: Extracts tables from a specific region of the page.
  - Format: [top, left, bottom, right] in points.
- lattice: Use for tables with gridlines.
- stream: Use for tables without gridlines.
- guess: Automatically locates the table area.
Example: Extract Table from a Specific Area
table = read_pdf(
    "example.pdf",
    pages=1,
    area=[100, 50, 500, 400],  # Define the table area: [top, left, bottom, right] in points
    stream=True,  # Use stream-based extraction
)
print(table)
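Since area is given in points, measurements taken with a ruler or a PDF viewer usually need converting first. The sketch below assumes the standard PDF unit of 72 points per inch and that distances are measured from the top-left corner of the page; area_from_inches is a hypothetical helper, not a tabula-py function.

```python
POINTS_PER_INCH = 72  # PDF unit: 1 point = 1/72 inch


def area_from_inches(top, left, bottom, right):
    """Convert a region measured in inches into a [top, left, bottom, right] list of points."""
    if bottom <= top or right <= left:
        raise ValueError("bottom/right must be greater than top/left")
    return [round(v * POINTS_PER_INCH, 2) for v in (top, left, bottom, right)]


area = area_from_inches(1.5, 0.75, 7.0, 5.5)
print(area)  # [108.0, 54.0, 504.0, 396.0]
```

The resulting list can be passed as the area argument of read_pdf.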
6. Combining Tabula with Pandas
Since Tabula outputs tables as Pandas DataFrames, you can easily analyze or manipulate the extracted data.
Example:
import pandas as pd
from tabula import read_pdf
# Extract the tables and take the first one (read_pdf returns a list of DataFrames)
df = read_pdf("example.pdf", pages=1)[0]
# Perform basic analysis
print(df.head()) # Display the first 5 rows
print(df.describe()) # Statistical summary
print(df['Age'].mean()) # Calculate the average age
Sample Output: For a table like this:
| Name | Age | City |
|---|---|---|
| John | 25 | New York |
| Alice | 30 | Los Angeles |
| Bob | 22 | Chicago |
Output would be:
    Name  Age         City
0   John   25     New York
1  Alice   30  Los Angeles
2    Bob   22      Chicago
             Age
count   3.000000
mean   25.666667
std     4.041452
min    22.000000
25%    23.500000
50%    25.000000
75%    27.500000
max    30.000000
25.666666666666668
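Numeric columns extracted from a PDF sometimes arrive as text, for example when a cell holds a placeholder such as "n/a", and in that case mean() will not work as expected. The sketch below builds a small stand-in DataFrame (not real Tabula output) and uses pandas.to_numeric to coerce the column before analysis.

```python
import pandas as pd

# Stand-in for an extracted table in which Age was read as text.
df = pd.DataFrame(
    {
        "Name": ["John", "Alice", "Bob"],
        "Age": ["25", "30", "n/a"],
        "City": ["New York", "Los Angeles", "Chicago"],
    }
)

# Cells that cannot be parsed become NaN instead of raising an error.
df["Age"] = pd.to_numeric(df["Age"], errors="coerce")

print(df["Age"].isna().sum())  # 1 cell failed to parse
print(df["Age"].mean())        # 27.5 (NaN is skipped by default)
```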
7. Troubleshooting Common Issues
- "Java not found" error:
  - Ensure Java is installed and added to your PATH.
- Empty tables:
  - Ensure the table is actually on the page you specified.
  - Try switching between the stream and lattice options.
- Missing dependencies: run:
pip install pandas numpy
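The advice for empty tables can be automated by trying one mode and falling back to the other. This is a rough sketch rather than an established pattern: extract_with_fallback is a hypothetical helper, and it takes the reader function as an argument so that it works with read_pdf or with any callable that accepts the same pages, lattice, and stream keywords.

```python
def extract_with_fallback(reader, pdf_path, pages=1):
    """Try lattice mode first, then stream mode; return (mode, non_empty_tables)."""
    for mode in ("lattice", "stream"):
        tables = reader(pdf_path, pages=pages, **{mode: True})
        # Keep only tables that actually contain rows.
        non_empty = [t for t in tables if len(t) > 0]
        if non_empty:
            return mode, non_empty
    return None, []


# Usage with tabula-py (requires Java and a real PDF):
# from tabula import read_pdf
# mode, tables = extract_with_fallback(read_pdf, "example.pdf")
```

A result of (None, []) means neither mode found any rows, which usually points to a wrong page number or a scanned, image-only PDF.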
8. Alternatives to Tabula
While Tabula is very powerful, it may not work perfectly in all cases. Other alternatives include:
- Camelot: Works better with PDFs that have well-defined table structures.
- PyPDF2: Useful for basic text extraction, but not so good for tables.
- pdfplumber: Another library that extracts tables and other data from PDF files.