Tabula Python

Tabula extracts tables from PDF files, and the tabula-py library makes it available from Python. It is most helpful with PDFs whose tabular data would be tedious or error-prone to copy out by hand. tabula-py is a wrapper around Tabula-Java, so it needs a Java runtime to work. The sections below walk through using Tabula with Python step by step:

1. Installing Tabula

Before using Tabula in Python, you need to install the tabula-py library, which serves as a Python wrapper for Tabula-Java.

Installation:

Run the following command in your terminal or command prompt:

pip install tabula-py

Note: Tabula requires Java to be installed on your system. Ensure Java is installed and properly configured in your system’s environment variables.

You can check if Java is installed by running:

java -version

If not, download and install Java from the official Java website.
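The same check can be scripted from Python before any PDF is processed. The following is a minimal sketch that uses only the standard library; the helper name java_version is illustrative and is not part of tabula-py.

```python
import shutil
import subprocess


def java_version():
    """Return the first line of `java -version`, or None if Java is unavailable."""
    java = shutil.which("java")  # looks for the executable on PATH
    if java is None:
        return None
    # `java -version` normally writes its report to stderr, not stdout.
    result = subprocess.run([java, "-version"], capture_output=True, text=True)
    if result.returncode != 0:
        return None
    output = (result.stderr or result.stdout).strip()
    return output.splitlines()[0] if output else None


if __name__ == "__main__":
    version = java_version()
    print(version if version else "Java was not found on PATH.")
```

If this prints the "not found" message, tabula-py will most likely fail as well, since it launches Java in the background.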

2. Importing Tabula

Once tabula-py is installed, import the necessary functions into your Python script. Here’s how:

from tabula import read_pdf, convert_into

You’ll need these two functions:

  • read_pdf: To extract tables from the PDF.
  • convert_into: To convert PDF tables directly into files like CSV.

3. Extracting Tables from PDFs

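The read_pdf function is the most commonly used feature of Tabula. It reads tables from a PDF file and returns them as Pandas DataFrames. In recent tabula-py versions (2.x), the return value is a list containing one DataFrame per detected table, even when only one table is found.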

Syntax:

read_pdf(input_path, pages=1, output_format='dataframe', **kwargs)

Example 1: Extracting a Single Table

Let’s say you have a PDF file named example.pdf with a table on the first page.

from tabula import read_pdf

# Read the tables on page 1; recent tabula-py versions return a list of DataFrames
tables = read_pdf("example.pdf", pages=1)

# Display the first (here, the only) table
print(tables[0])

Output:

If the PDF contains a table like this:

Name   Age  City
John   25   New York
Alice  30   Los Angeles
Bob    22   Chicago

The printed output in Python will be:

    Name  Age         City
0   John   25     New York
1  Alice   30  Los Angeles
2    Bob   22      Chicago

Example 2: Extracting Multiple Tables

If the page contains more than one table, read_pdf returns each one as a separate DataFrame in a list. Passing multiple_tables=True makes this explicit (it is already the default in recent tabula-py versions):

tables = read_pdf("example.pdf", pages=1, multiple_tables=True)

# Print all extracted tables
for i, table in enumerate(tables):
    print(f"Table {i+1}:")
    print(table)

Output: If there are two tables on the page, the output will look like this:

Table 1:

    Name  Age         City
0   John   25     New York
1  Alice   30  Los Angeles

Table 2:

   Product  Price
0   Laptop   $999
1    Phone   $499

4. Converting Tables to Other Formats

If you want to directly save the extracted table into a file (e.g., CSV or JSON), you can use the convert_into function.

Syntax:

convert_into(input_path, output_path, output_format='csv', pages=1, **kwargs)

Example: Saving a Table as a CSV File

from tabula import convert_into

# Convert table from page 1 of the PDF into a CSV file
convert_into("example.pdf", "output.csv", output_format="csv", pages=1)

print("PDF table has been saved as a CSV file.")

Expected Output: A file named output.csv is created with the following content:

Name,Age,City
John,25,New York
Alice,30,Los Angeles
Bob,22,Chicago
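It can be worth confirming that the saved file really has the expected header and rows. The sketch below reads a CSV back with Python's standard csv module; load_csv_rows is a hypothetical helper name, and UTF-8 encoding for the output file is an assumption.

```python
import csv


def load_csv_rows(path):
    """Read a CSV file and return (header, rows) as lists of strings."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader, [])               # first line is the header
        rows = [row for row in reader if row]   # skip blank lines
    return header, rows
```

Calling load_csv_rows("output.csv") after the conversion should give the header ['Name', 'Age', 'City'] and three data rows if the extraction matched the table shown earlier. Note that every value comes back as a string, including the ages.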

5. Fine-Tuning Table Extraction

Tabula provides several options for adjusting the extraction process. They can be passed as keyword arguments to read_pdf and convert_into.

Common Options:

  1. area: Extracts tables only from a specific region of the page.
    • Format: [top, left, bottom, right] in points.
  2. lattice: Use for tables whose cells are separated by gridlines.
  3. stream: Use for tables without gridlines, where columns are separated by whitespace.
  4. guess: Automatically detects the table area (enabled by default).

Example: Extract Table from a Specific Area

tables = read_pdf(
    "example.pdf",
    pages=1,
    area=[100, 50, 500, 400],  # [top, left, bottom, right] in points
    stream=True                # Use stream-based extraction
)

# read_pdf returns a list of DataFrames in recent tabula-py versions
print(tables[0])
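Because area is given in points, a region measured in inches (for example with a PDF viewer's ruler) has to be converted first; one inch is 72 points. The helper below is a small sketch of that conversion. The name area_from_inches is made up for illustration, and it assumes distances are measured from the top-left corner of the page.

```python
POINTS_PER_INCH = 72  # PDF unit: 1 point = 1/72 inch


def area_from_inches(top, left, bottom, right):
    """Convert a region in inches to [top, left, bottom, right] in points."""
    if bottom <= top or right <= left:
        raise ValueError("bottom must exceed top and right must exceed left")
    return [round(v * POINTS_PER_INCH, 2) for v in (top, left, bottom, right)]


print(area_from_inches(1.5, 0.75, 7.0, 5.5))  # [108.0, 54.0, 504.0, 396.0]
```

The resulting list can be passed directly as the area argument.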

6. Combining Tabula with Pandas

Since Tabula outputs tables as Pandas DataFrames, you can easily analyze or manipulate the extracted data.

Example:

import pandas as pd
from tabula import read_pdf

# Extract the tables on page 1 and take the first one
# (recent tabula-py versions return a list of DataFrames)
df = read_pdf("example.pdf", pages=1)[0]

# Perform basic analysis
print(df.head())       # Display the first 5 rows
print(df.describe())   # Statistical summary
print(df['Age'].mean())  # Calculate the average age

Sample Output: For a table like this:

Name   Age  City
John   25   New York
Alice  30   Los Angeles
Bob    22   Chicago

Output would be:

    Name  Age         City
0   John   25     New York
1  Alice   30  Los Angeles
2    Bob   22      Chicago

             Age
count   3.000000
mean   25.666667
std     4.041452
min    22.000000
25%    23.500000
50%    25.000000
75%    27.500000
max    30.000000

25.666666666666668
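Extracted columns are not always numeric. A price column such as the one in the Product/Price table shown earlier may arrive as text like "$999" and needs cleaning before any arithmetic. The sketch below builds a stand-in DataFrame by hand instead of calling read_pdf, so the values are assumptions that mirror that example rather than real extraction output.

```python
import pandas as pd

# Stand-in for a DataFrame returned by read_pdf; Price is text, not a number.
df = pd.DataFrame({"Product": ["Laptop", "Phone"], "Price": ["$999", "$499"]})

# Strip the currency symbol, then convert; cells that cannot be parsed become NaN.
df["Price"] = pd.to_numeric(
    df["Price"].str.replace("$", "", regex=False), errors="coerce"
)

print(df["Price"].mean())  # 749.0
```

Using errors="coerce" keeps one malformed cell from aborting the whole conversion, at the cost of having to check for NaN values afterwards.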

7. Troubleshooting Common Issues

  1. Java not found error:
    • Ensure Java is installed and added to your PATH.
  2. Empty tables:
    • Ensure the table is actually on the page you specified.
    • Try switching between stream and lattice modes.
  3. Missing dependencies: run:
pip install pandas numpy
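Switching between lattice and stream modes when a result comes back empty can be automated. The following is a rough sketch, not a tabula-py feature: extract_with_fallback is a hypothetical helper, and its reader parameter exists only so that the logic can be tried without Java or a real PDF.

```python
def extract_with_fallback(path, pages=1, reader=None):
    """Try lattice mode first, then stream mode; return (mode, tables)."""
    if reader is None:
        # Imported lazily so the function can be defined without Java present.
        from tabula import read_pdf as reader

    for mode in ("lattice", "stream"):
        tables = reader(path, pages=pages, **{mode: True})
        # Keep only tables that contain at least one row.
        non_empty = [t for t in tables if len(t) > 0]
        if non_empty:
            return mode, non_empty
    return None, []
```

A call such as mode, tables = extract_with_fallback("example.pdf") reports which mode produced data. If both modes return nothing, the result is (None, []), which suggests checking the page number or the area instead.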

8. Alternatives to Tabula

While Tabula is very powerful, it may not work perfectly in all cases. Other alternatives include:

  • Camelot: Works better with PDFs that have well-defined table structures.
  • PyPDF2: Useful for basic text extraction, but not well suited to tables.
  • pdfplumber: Another library that extracts tables and other data from PDF files.