How to Create a DataFrame in Python

A DataFrame is a two-dimensional, tabular data structure provided by the pandas library in Python. DataFrames are used frequently in data analysis and manipulation because they resemble Excel spreadsheets or SQL tables.

Here is a step-by-step walkthrough of how to create a DataFrame in pandas:

1. Installing pandas

Before you start, you need to install the pandas library, a Python package for manipulating and analyzing data. Use the following command:

pip install pandas
  • What does this do?
    • pip is Python’s package installer. This command downloads and installs the pandas library and its dependencies (like NumPy) on your system.
  • How do you verify that pandas is installed? Run the following Python code:
import pandas as pd
print(pd.__version__)

This should display the installed version of pandas.

2. Importing pandas

After installation, you need to import the library in your Python script or Jupyter Notebook.

import pandas as pd
  • Why as pd? This is a common convention in the Python community to shorten the name. Instead of typing pandas.DataFrame repeatedly, you can type pd.DataFrame.

3. Understanding DataFrame

A DataFrame is a 2-dimensional, tabular data structure in pandas. It can be imagined as an Excel spreadsheet, with:

  • Rows corresponding to individual records (e.g., a person, a transaction).
  • Columns corresponding to attributes or fields (e.g., Name, Age, City).

Important properties of a DataFrame:

  • It can store data of different types: each column has its own data type (such as integers, floats, or strings), and different columns can have different types.
  • Its rows are labeled by the index, and its columns are labeled by column names (the headers).
  • It can be created from diverse data sources, such as dictionaries, lists, and NumPy arrays, or from external files, such as CSV and Excel.
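The properties above can be inspected directly on a small DataFrame. The following is a minimal sketch; the variable name `people` and the sample values (including the `Height` column) are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data, used only to illustrate the properties above
people = pd.DataFrame({
    'Name': ['Alice', 'Bob'],
    'Age': [25, 30],
    'Height': [1.65, 1.80],
})

print(people.dtypes)          # one data type per column
print(list(people.index))     # row labels -> [0, 1]
print(list(people.columns))   # column labels -> ['Name', 'Age', 'Height']
```

When no index is given, pandas numbers the rows starting from 0, which is why the row labels here are 0 and 1.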

4. Creating DataFrames

Method 1: From a Dictionary

What is a Dictionary? A Python dictionary is a collection of key-value pairs. When a dictionary is passed to pd.DataFrame:

  • Keys represent column names.
  • Values represent the data in those columns.

Example:

data = {
    'Name': ['Alice', 'Bob', 'Charlie'],  # Column 1
    'Age': [25, 30, 35],                 # Column 2
    'City': ['New York', 'Los Angeles', 'Chicago']  # Column 3
}

df = pd.DataFrame(data)  # Convert the dictionary to a DataFrame
print(df)

Explanation:

  • 'Name': The first key becomes a column with the values 'Alice', 'Bob', 'Charlie'.
  • 'Age': The second key becomes a column with the values 25, 30, 35.
  • 'City': The third key becomes a column with the values 'New York', 'Los Angeles', 'Chicago'.

Output:

      Name  Age         City
0    Alice   25    New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago
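The row labels 0, 1, 2 in this output are the default index. As a possible variation, pd.DataFrame also accepts an index argument for custom row labels. The sketch below assumes the labels 'a', 'b', 'c' and the name `df_labeled`, which are chosen only for illustration.

```python
import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
}

# The labels 'a', 'b', 'c' are arbitrary; there must be one label per row
df_labeled = pd.DataFrame(data, index=['a', 'b', 'c'])
print(df_labeled)

# Look up a single value by row label and column name
print(df_labeled.loc['b', 'Name'])  # Bob
```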

Method 2: From a List of Dictionaries

What is a List of Dictionaries? It is a Python list whose items are dictionaries. Each dictionary represents one row of data in the DataFrame.

Example:

data = [
    {'Name': 'Alice', 'Age': 25, 'City': 'New York'},  # Row 1
    {'Name': 'Bob', 'Age': 30, 'City': 'Los Angeles'},  # Row 2
    {'Name': 'Charlie', 'Age': 35, 'City': 'Chicago'}  # Row 3
]

df = pd.DataFrame(data)  # Convert the list of dictionaries to a DataFrame
print(df)

Explanation:

  • Each dictionary is treated as a row.
  • Keys in each dictionary become column headers.
  • Values become the data for those columns.

Output:

      Name  Age         City
0    Alice   25    New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago
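One case worth knowing about is a dictionary that lacks a key the others have. The sketch below, using made-up records and the illustrative name `df_records`, shows that pandas fills the gap with a missing value rather than raising an error.

```python
import pandas as pd

records = [
    {'Name': 'Alice', 'Age': 25, 'City': 'New York'},
    {'Name': 'Bob', 'Age': 30},  # deliberately has no 'City' key
]

df_records = pd.DataFrame(records)
print(df_records)

# isna() marks the cells that pandas filled in as missing
print(df_records['City'].isna().tolist())  # [False, True]
```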

Method 3: From Separate Lists

What are Lists? Lists are ordered collections of data in Python. You can use separate lists for columns and combine them into a DataFrame.

Example:

names = ['Alice', 'Bob', 'Charlie']
ages = [25, 30, 35]
cities = ['New York', 'Los Angeles', 'Chicago']

df = pd.DataFrame({
    'Name': names,  # Column 1
    'Age': ages,    # Column 2
    'City': cities  # Column 3
})
print(df)

Explanation:

  • Combine the lists into a dictionary where keys are column names, and lists are column values.
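  • Pandas matches the values by position: the first item of each list forms the first row, the second item forms the second row, and so on. All lists must therefore have the same length.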
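To illustrate why the lists need the same length, here is a small sketch in which one list is deliberately too short. pandas raises a ValueError in this situation; the variable name `error_message` is just for illustration.

```python
import pandas as pd

names = ['Alice', 'Bob', 'Charlie']
ages = [25, 30]  # deliberately one value short

error_message = None
try:
    pd.DataFrame({'Name': names, 'Age': ages})
except ValueError as exc:
    # pandas refuses to build a DataFrame from columns of unequal length
    error_message = str(exc)
    print('Could not build the DataFrame:', error_message)
```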

Method 4: From a NumPy Array

What is a NumPy Array? NumPy is a library for numerical computations, and its arrays are similar to matrices. You can convert a 2D NumPy array into a DataFrame.

Example:

import numpy as np

data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])  # 2D array
df = pd.DataFrame(data, columns=['A', 'B', 'C'])  # Assign column names
print(df)

Explanation:

  • The rows of the array become rows in the DataFrame.
  • The columns of the array become columns in the DataFrame.
  • Column names ('A', 'B', 'C') are explicitly assigned.

Output:

   A  B  C
0  1  2  3
1  4  5  6
2  7  8  9
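If the columns argument is left out, pandas falls back to integer column labels. The sketch below contrasts that default with explicitly named columns and rows; the labels 'row1' and 'row2' and the variable names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

values = np.array([[1, 2, 3], [4, 5, 6]])  # 2 rows, 3 columns

# Without column names, pandas numbers the columns 0, 1, 2
df_default = pd.DataFrame(values)
print(list(df_default.columns))  # [0, 1, 2]

# With explicit column names and (hypothetical) row labels
df_array = pd.DataFrame(values, columns=['A', 'B', 'C'], index=['row1', 'row2'])
print(df_array)
print(df_array.loc['row2', 'B'])  # 5
```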

Method 5: From External Files

Pandas can directly read data from files like CSV, Excel, or JSON.

  • From CSV:
df = pd.read_csv('data.csv')  # Replace 'data.csv' with your file path
print(df)
  • From Excel:
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')  # Specify sheet name
print(df)

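Explanation: Pandas reads the file, identifies the rows and columns, and creates a DataFrame. By default, read_csv treats the first line of the file as the column headers. Reading .xlsx files with read_excel requires an additional engine package, such as openpyxl, to be installed.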
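To try read_csv without creating a file first, the CSV text can be wrapped in an in-memory buffer. This is a minimal sketch; the CSV content and the names `csv_text` and `df_csv` are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical CSV content; the first line holds the column headers
csv_text = "Name,Age,City\nAlice,25,New York\nBob,30,Los Angeles\n"

# io.StringIO lets read_csv treat the string like an open file
df_csv = pd.read_csv(io.StringIO(csv_text))
print(df_csv)
print(df_csv.shape)  # (2, 3)
```

With a real file, the path string is passed instead of the buffer, as in the CSV example above.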

5. Basic DataFrame Operations

Once a DataFrame is created, you can perform the following basic operations:

View Data

  • Display the first few rows:
print(df.head())  # Default: First 5 rows
print(df.head(10))  # Specify number of rows (e.g., 10)
  • Display the last few rows:
print(df.tail())  # Default: Last 5 rows

Data Summary

  • Get the number of rows and columns:
print(df.shape)  # Output: (number of rows, number of columns)
  • Get column names:
print(df.columns)  # Returns column headers
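The viewing and summary calls above are easier to follow on a DataFrame with more than five rows. The following sketch builds a hypothetical 20-row table (the name `numbers` and its columns are illustrative) and applies head, tail, shape, and columns to it.

```python
import pandas as pd

# Hypothetical 20-row table: the numbers 1..20 and their squares
numbers = pd.DataFrame({
    'n': list(range(1, 21)),
    'square': [i * i for i in range(1, 21)],
})

first_rows = numbers.head()    # default: first 5 rows (n = 1..5)
last_rows = numbers.tail(3)    # last 3 rows (n = 18..20)

print(first_rows)
print(last_rows)
print(numbers.shape)           # (20, 2)
print(list(numbers.columns))   # ['n', 'square']
```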

Access Specific Data

  • Access a single column:
print(df['Name'])  # Returns the 'Name' column as a pandas Series
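Selecting several columns works similarly, but takes a list of names and returns a DataFrame rather than a Series. A minimal sketch, using the sample data from the earlier examples and the illustrative names `df_people`, `name_column`, and `subset`:

```python
import pandas as pd

df_people = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago'],
})

name_column = df_people['Name']        # single name -> Series
subset = df_people[['Name', 'City']]   # list of names -> DataFrame

print(type(name_column).__name__)  # Series
print(type(subset).__name__)       # DataFrame
print(subset)
```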

Modify Data

  • Add a new column:
df['Country'] = ['USA', 'USA', 'USA']  # Add a new column; the list needs one value per row
print(df)
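A new column can also be filled from a single value or computed from an existing column. The sketch below assumes the sample data from the earlier examples; the column name 'AgeNextYear' is hypothetical.

```python
import pandas as pd

df_people = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
})

# A single value is repeated for every row
df_people['Country'] = 'USA'

# A column computed element by element from an existing column
df_people['AgeNextYear'] = df_people['Age'] + 1

print(df_people)
```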

Save Data

  • Save the DataFrame to a CSV file:
df.to_csv('output.csv', index=False)  # `index=False` leaves the index (row labels) out of the file
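One way to check that a save worked is to read the file back. The sketch below writes to a temporary directory so that it leaves nothing behind; the sample data and the names `df_out`, `df_back`, and `csv_path` are assumptions made for illustration.

```python
import os
import tempfile

import pandas as pd

df_out = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})

# Write to a temporary directory, which is deleted when the block ends
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, 'output.csv')
    df_out.to_csv(csv_path, index=False)  # index is not written to the file
    df_back = pd.read_csv(csv_path)       # read the saved file back

print(df_back)
```

Because index=False was used, the file contains only the Name and Age columns, and read_csv assigns a fresh default index when loading it.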

The pandas library provides an easy and powerful way to create and manipulate structured data using DataFrames.