How to Create a DataFrame in Python
A DataFrame is a two-dimensional, tabular data structure provided by the pandas library in Python. DataFrames are used frequently in data analysis and manipulation because they are comparable to Excel spreadsheets or SQL tables.
Here is a step-by-step guide to creating a DataFrame in pandas:
1. Installing pandas
Before you start, you need to install the pandas library. This Python package is designed for manipulating and analyzing data. Use the following command:
pip install pandas
- What does this do? pip is Python’s package installer. This command downloads and installs the pandas library and its dependencies (like NumPy) on your system.
- How to verify if pandas is installed? Run the following Python code:
import pandas as pd
print(pd.__version__)
This should display the installed version of pandas.
2. Importing pandas
After installation, you need to import the library in your Python script or Jupyter Notebook.
import pandas as pd
- Why as pd? This is a common convention in the Python community to shorten the name. Instead of typing pandas.DataFrame repeatedly, you can type pd.DataFrame.
3. Understanding DataFrame
A DataFrame is a 2-dimensional, tabular data structure in pandas. It can be imagined as an Excel spreadsheet, with:
- Rows corresponding to individual records (e.g., a person, a transaction).
- Columns corresponding to attributes or fields (e.g., Name, Age, City).
Important properties of a DataFrame:
- It can store data of different types, such as integers, floats, and strings, with each column holding its own type.
- It has labeled rows (the index) and labeled columns (the column headers).
- It can be created from diverse data sources such as dictionaries, lists, and NumPy arrays, or from external files such as CSV and Excel.
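The properties listed above can be checked on a small DataFrame. This is a minimal sketch; the variable name `people`, the `Height` column, and the sample values are illustrative assumptions.

```python
import pandas as pd

# Illustrative sample data (hypothetical values)
people = pd.DataFrame({
    'Name': ['Alice', 'Bob'],
    'Age': [25, 30],
    'Height': [1.65, 1.80],
})

print(people.dtypes)          # one data type per column
print(list(people.index))     # row labels (the index)
print(list(people.columns))   # column labels (the headers)
```

Here the `Age` column holds integers and the `Height` column holds floats, while the default index labels the rows 0 and 1.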
4. Creating DataFrames
Method 1: From a Dictionary
What is a Dictionary? A Python dictionary is a collection of key-value pairs. When a dictionary is used to build a DataFrame:
- Keys represent column names.
- Values represent the data in those columns.
Example:
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],            # Column 1
    'Age': [25, 30, 35],                            # Column 2
    'City': ['New York', 'Los Angeles', 'Chicago']  # Column 3
}
df = pd.DataFrame(data) # Convert the dictionary to a DataFrame
print(df)
Explanation:
- 'Name': The first key becomes a column with values 'Alice', 'Bob', 'Charlie'.
- 'Age': The second key becomes a column with values 25, 30, 35.
- 'City': The third key becomes a column with values 'New York', 'Los Angeles', 'Chicago'.
Output:
      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago
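The leftmost numbers in the output are the default index. As a small sketch of how row labels can be set at creation time, the `index` parameter of `pd.DataFrame` accepts a list of labels; the labels 'a', 'b', 'c' and the name `df_labeled` are assumptions chosen for illustration.

```python
import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
}

# `index` supplies row labels in place of the default 0, 1, 2
df_labeled = pd.DataFrame(data, index=['a', 'b', 'c'])
print(df_labeled)

# Look up a single value by row label and column name
print(df_labeled.loc['b', 'Age'])
```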
Method 2: From a List of Dictionaries
What is a List of Dictionaries? Each dictionary represents one row of data in the DataFrame.
Example:
data = [
    {'Name': 'Alice', 'Age': 25, 'City': 'New York'},   # Row 1
    {'Name': 'Bob', 'Age': 30, 'City': 'Los Angeles'},  # Row 2
    {'Name': 'Charlie', 'Age': 35, 'City': 'Chicago'}   # Row 3
]
df = pd.DataFrame(data) # Convert the list of dictionaries to a DataFrame
print(df)
Explanation:
- Each dictionary is treated as a row.
- Keys in each dictionary become column headers.
- Values become the data for those columns.
Output:
      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago
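Because each dictionary is a separate row, the rows do not need identical keys. The sketch below, with hypothetical names `rows` and `df_rows`, shows what happens when one row lacks a key: pandas fills the gap with a missing value.

```python
import pandas as pd

rows = [
    {'Name': 'Alice', 'Age': 25, 'City': 'New York'},
    {'Name': 'Bob', 'Age': 30},   # no 'City' key in this row
]

df_rows = pd.DataFrame(rows)
print(df_rows)

# isna() marks the cells that were filled with a missing value
print(df_rows['City'].isna().tolist())
```

This makes a list of dictionaries convenient for records collected one at a time, where some fields may be absent.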
Method 3: From Separate Lists
What are Lists? Lists are ordered collections of data in Python. You can use separate lists for columns and combine them into a DataFrame.
Example:
names = ['Alice', 'Bob', 'Charlie']
ages = [25, 30, 35]
cities = ['New York', 'Los Angeles', 'Chicago']
df = pd.DataFrame({
    'Name': names,   # Column 1
    'Age': ages,     # Column 2
    'City': cities   # Column 3
})
print(df)
Explanation:
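- Combine the lists into a dictionary where keys are column names and lists are column values.
- Pandas pairs the values by position, so the first item of each list forms the first row. All lists must have the same length; otherwise pandas raises a ValueError.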
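As an alternative sketch, separate lists can also be paired row by row with Python's built-in `zip` and passed along with a `columns` list. The names `df_zipped` and `length_error` are illustrative assumptions; the second half shows the error raised when the lists differ in length.

```python
import pandas as pd

names = ['Alice', 'Bob', 'Charlie']
ages = [25, 30, 35]

# zip pairs the lists by position, giving one tuple per row
df_zipped = pd.DataFrame(list(zip(names, ages)), columns=['Name', 'Age'])
print(df_zipped)

# Lists of unequal length cannot form the columns of one DataFrame
try:
    pd.DataFrame({'Name': names, 'Age': [25, 30]})
    length_error = None
except ValueError as exc:
    length_error = str(exc)
print(length_error)
```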
Method 4: From a NumPy Array
What is a NumPy Array? NumPy is a library for numerical computations, and its arrays are similar to matrices. You can convert a 2D NumPy array into a DataFrame.
Example:
import numpy as np
data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]) # 2D array
df = pd.DataFrame(data, columns=['A', 'B', 'C']) # Assign column names
print(df)
Explanation:
- The rows of the array become rows in the DataFrame.
- The columns of the array become columns in the DataFrame.
- Column names ('A', 'B', 'C') are explicitly assigned.
Output:
   A  B  C
0  1  2  3
1  4  5  6
2  7  8  9
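To illustrate why the column names are assigned explicitly, the following sketch compares a DataFrame built with and without them. The row labels 'r1' and 'r2' and the variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

arr = np.arange(1, 7).reshape(2, 3)   # [[1, 2, 3], [4, 5, 6]]

# Without `columns`, pandas numbers the columns 0, 1, 2
df_default = pd.DataFrame(arr)

# Both column names and row labels can be supplied
df_named = pd.DataFrame(arr, columns=['A', 'B', 'C'], index=['r1', 'r2'])

print(list(df_default.columns))
print(df_named)
```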
Method 5: From External Files
Pandas can directly read data from files like CSV, Excel, or JSON.
- From CSV:
df = pd.read_csv('data.csv') # Replace 'data.csv' with your file path
print(df)
- From Excel:
df = pd.read_excel('data.xlsx', sheet_name='Sheet1') # Specify sheet name
print(df)
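Explanation: Pandas reads the file, identifies the rows and columns, and creates a DataFrame. By default, the first row of the file is used as the column headers. Reading .xlsx files also requires an Excel engine such as openpyxl to be installed.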
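To try `read_csv` without a file on disk, the text can be wrapped in `io.StringIO`, which `read_csv` accepts in place of a path. This is a minimal sketch; the CSV content and the names `csv_text` and `df_csv` are made up for illustration.

```python
import io
import pandas as pd

# In-memory text standing in for the contents of a CSV file (hypothetical data)
csv_text = "Name,Age,City\nAlice,25,New York\nBob,30,Los Angeles\n"

# read_csv accepts a file path or a file-like object such as StringIO
df_csv = pd.read_csv(io.StringIO(csv_text))
print(df_csv)
print(df_csv.shape)
```

The first line of the text becomes the column headers, and each remaining line becomes one row.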
5. Basic DataFrame Operations
Once a DataFrame is created, you can perform the following basic operations:
View Data
- Display the first few rows:
print(df.head()) # Default: First 5 rows
print(df.head(10)) # Specify number of rows (e.g., 10)
- Display the last few rows:
print(df.tail()) # Default: Last 5 rows
Data Summary
- Get the number of rows and columns:
print(df.shape) # Output: (number of rows, number of columns)
- Get column names:
print(df.columns) # Returns column headers
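The viewing and summary calls above can be tried together on a small DataFrame. In this sketch the name `df_demo` and the sample rows are assumptions.

```python
import pandas as pd

# Illustrative sample data (hypothetical values)
df_demo = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Dana'],
    'Age': [25, 30, 35, 40],
})

first_two = df_demo.head(2)   # first 2 rows
last_one = df_demo.tail(1)    # last row

print(first_two)
print(last_one)
print(df_demo.shape)           # (rows, columns)
print(list(df_demo.columns))
```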
Access Specific Data
- Access a single column:
print(df['Name']) # Returns the 'Name' column as a pandas Series
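Beyond a single column, a few other common ways to access data are sketched below: selecting several columns, picking a row by position with `iloc`, picking a value by label with `loc`, and filtering rows with a condition. The variable names and the age threshold are illustrative assumptions.

```python
import pandas as pd

df_demo = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago'],
})

subset = df_demo[['Name', 'City']]       # list of column names -> DataFrame
first_row = df_demo.iloc[0]              # row by position -> Series
bob_city = df_demo.loc[1, 'City']        # single value by row label and column
over_28 = df_demo[df_demo['Age'] > 28]   # rows where the condition is True

print(subset)
print(first_row)
print(bob_city)
print(over_28)
```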
Modify Data
- Add a new column:
df['Country'] = ['USA', 'USA', 'USA'] # Add a new column; the list length must match the number of rows
print(df)
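A new column does not have to come from a list. As a short sketch, a single value can be assigned to every row, and a column can be computed from an existing one; the column name `AgeNextYear` is a hypothetical example.

```python
import pandas as pd

df_demo = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
})

df_demo['Country'] = 'USA'                    # a single value fills every row
df_demo['AgeNextYear'] = df_demo['Age'] + 1   # computed from an existing column
print(df_demo)
```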
Save Data
- Save the DataFrame to a CSV file:
df.to_csv('output.csv', index=False) # `index=False` avoids saving row numbers
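To see the effect of `index=False` without writing a file, `to_csv` can be called with no path, in which case it returns the CSV text as a string. The sketch below uses hypothetical variable names and sample data.

```python
import pandas as pd

df_demo = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})

# With no path given, to_csv returns the CSV text instead of writing a file
with_index = df_demo.to_csv()
without_index = df_demo.to_csv(index=False)

print(with_index)      # each row starts with its index label
print(without_index)   # only the column data is written
```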
The pandas library provides an easy and powerful way to create and manipulate structured data using DataFrames.