Python Excel Automation in Practice: Data Filtering and Categorization with Pandas

Why Use Python to Work with Excel

In everyday work, we frequently need to filter, categorize, and analyze data in Excel. When the dataset is small, manual operations are manageable. But when you're dealing with hundreds or even thousands of records, manually copying, pasting, and filtering becomes extremely inefficient and error-prone.

Thanks to its rich ecosystem of third-party libraries, Python can accomplish complex data processing tasks with minimal code. This article demonstrates Python's power for Excel automation through a real-world case study — automatically categorizing 250 movies by year and genre.

bilibili source: 【2026精选】102个Python实战项目，从入门到进阶，练完即可就业，活该你进大厂，基础到框架，建议码住！新学期卷死同学丨毕设？面试实战？统统不是问题！

The Core Tool for Python Excel Operations: The Pandas Library

Environment Setup and Installation

Python doesn't natively support direct Excel file manipulation — we need to install the Pandas module. Pandas is one of the most powerful data processing libraries in the Python ecosystem, developed by Wes McKinney in 2008, originally to meet the needs of financial data analysis. Its name derives from "Panel Data," a term in econometrics for multidimensional structured data. The core data structure in Pandas is the DataFrame, essentially a two-dimensional table with row indices and column labels, which maps conceptually very closely to an Excel worksheet. Installation is straightforward:

pip install pandas

Once installed, import it in your code with an alias:

import pandas as pd

This reflects one of Python's core strengths — a pluggable, modular design. Python itself is a lightweight language; you install the modules you need and combine them like building blocks to create powerful functionality. Under the hood, Pandas relies on NumPy for numerical computation and uses engines like openpyxl or xlrd to read and write Excel files. This layered architecture gives Pandas near-C-level computational performance while maintaining Python-level ease of use.

Reading Excel Data with Pandas

Reading an Excel file takes just one line of code:

data = pd.read_excel('数据.xlsx')

This loads all 250 movie records from the Excel file (including movie name, rating, director, cast, year/country/genre, etc.) into your Python program. Once loaded, the data is stored in memory as a DataFrame. You can query, filter, and aggregate it just like a database table — without repeatedly opening and closing the Excel file.

Hands-On: Automatically Categorizing Movie Data by Year

Requirements Analysis

In the original dataset, year, country, and genre information are stored in a single column separated by slashes (e.g., "1994/USA/Crime Drama"). Our goals are:

Split this information into separate columns
Save the data into different Sheet tabs based on year

Data Splitting

Using Pandas' apply method combined with Lambda anonymous functions, we can batch-process every row:

data['year'] = data['type'].apply(lambda x: x.split('/')[0].strip())
data['c'] = data['type'].apply(lambda x: x.split('/')[1].strip())
data['t'] = data['type'].apply(lambda x: x.split('/')[2].strip())

These three lines mean: for each entry in the type column, split by slash and extract the first part (year), second part (country), and third part (genre), trim whitespace, and store them in new columns.

This involves two important programming concepts. Lambda anonymous functions originate from Lambda calculus in the functional programming paradigm. They are Python's syntactic sugar for creating small, one-off functions. Unlike regular functions defined with def, Lambda functions are inline expressions, ideal for simple data transformation scenarios. The apply method is essentially a higher-order function — it takes another function as a parameter and applies it row-by-row (or column-by-column) to each element of the DataFrame. This pattern avoids writing explicit for loops, resulting in cleaner code and, in some cases, better performance through Pandas' internal vectorization optimizations.

Saving to Different Sheets by Year

# Get all unique years
years = data['year'].unique()

# Create an Excel writer
writer = pd.ExcelWriter('temp.xlsx')

# Filter by year and save
for i in years:
    data[data['year'] == i].to_excel(writer, sheet_name=i)

writer.close()

After running this, the generated Excel file has a separate worksheet for each year — 1994 movies in one Sheet, 1997 movies in another — neatly organized at a glance.

pd.ExcelWriter is a context manager provided by Pandas for writing Excel files. It allows you to write multiple DataFrames to different Sheets within the same Excel file. Under the hood, it relies on different writing engines: for .xlsx format, it defaults to the openpyxl engine; for the legacy .xls format, it uses the xlwt engine. It's worth noting that when dealing with very large files (millions of rows), you might consider using the xlsxwriter engine, which employs a streaming write mode with lower memory usage and faster write speeds.

Advanced: Challenges and Solutions for Categorizing by Genre

The Problem

Categorizing by genre is more complex than by year because a single movie can belong to multiple genres (e.g., "Drama Sci-Fi Thriller"). Simple equality matching would miss a large amount of data.

Solution: Using Contains Matching

First, we need to extract all unique genre labels. Since genres are separated by spaces, we use a generator expression combined with set for deduplication:

type_list = set(z for i in data['t'] for z in i.split(' '))

This single line is equivalent to: iterate through each record's genre field, split by space, extract each genre individually, and use set to automatically deduplicate.

What's actually used here is a generator expression, which has similar syntax to a list comprehension but uses parentheses. The difference is that a generator doesn't load all results into memory at once — it produces them on demand, making it more memory-friendly for large datasets. set is a Python data structure based on hash tables, with the core property that elements cannot be duplicated. When we pass the generator into set(), Python automatically deduplicates all genre labels with a time complexity close to O(n), far more efficient than manual deduplication through iteration.

Then we use the str.contains() method for substring matching:

for ty in type_list:
    data[data['t'].str.contains(ty)].to_excel(writer, sheet_name=ty)

This way, as long as a movie's genre contains "Sci-Fi," it will be placed in the Sci-Fi Sheet — even if it's also categorized as "Drama" and "Adventure."

str.contains() is part of Pandas' vectorized string operations. It performs substring matching across an entire column, returning a boolean Series (True/False sequence), which is then used with boolean indexing to filter matching rows. By default, this method supports Regular Expressions, meaning you can go beyond simple substring matching and use wildcards, character classes, quantifiers, and other advanced patterns for complex text filtering. For example, str.contains('^Crime') matches genres starting with "Crime," and str.contains('Drama|Thriller') matches records containing either "Drama" or "Thriller." This capability makes Pandas equally adept at handling unstructured text data.

Core Advantages of Python for Excel Operations

The entire auto-categorization feature requires only 6-8 lines of core code. Compared to the tedious manual Excel workflow (filter → copy → create new Sheet → paste → repeat N times), the Python approach offers three major advantages:

Comparison	Manual Operation	Python Automation
Processing Speed	Scales linearly with data volume	Virtually unaffected by data volume
Reusability	Redo from scratch each time	Change the filename and reuse
Data Scale	Large files freeze or crash	Handles large datasets with ease

It's worth adding that Excel itself has row limits — .xlsx format supports a maximum of 1,048,576 rows, while .xls format only supports 65,536 rows. When data exceeds these limits, Excel is completely helpless. Python with Pandas can easily handle millions or even tens of millions of rows. For even larger datasets, you can seamlessly switch to Dask (a distributed version of Pandas) or PySpark and other big data processing frameworks, enabling smooth scaling from a single machine to a cluster.

Python Career Paths and Learning Advice

Major Career Directions

Full-Stack Web Development: Independent front-end and back-end development capabilities, using frameworks like Django, Flask, etc. Django follows a "batteries included" philosophy, with built-in ORM (Object-Relational Mapping), template engine, form validation, user authentication, admin panel, and nearly every component needed for web development — ideal for rapidly building feature-complete large-scale applications. Early versions of Instagram and Pinterest were built on Django. Flask embraces a "micro-framework" philosophy, providing only routing and request handling at its core, with other features added through extension plugins as needed, giving developers maximum flexibility. In recent years, FastAPI has risen rapidly thanks to its native async support and automatic API documentation generation, becoming increasingly popular for building high-performance RESTful APIs and microservices.
Web Scraping Engineer: Automated data collection — a small amount of code can scrape large volumes of structured information. Python's web scraping ecosystem is very mature, ranging from the lightweight Requests + BeautifulSoup combination to the enterprise-grade Scrapy framework, and tools like Selenium and Playwright for handling JavaScript dynamic rendering, covering virtually every web data collection scenario.
Data Analysis: Data statistics, user behavior analysis, recommendation systems across industries. Python's data analysis stack typically includes Pandas (data processing), Matplotlib/Seaborn (data visualization), Scikit-learn (machine learning modeling), and Jupyter Notebook (interactive analysis environment), forming a complete workflow from data cleaning to insight generation.
Artificial Intelligence: Machine learning and deep learning — Python's ecosystem is the most mature. Major deep learning frameworks like TensorFlow, PyTorch, and Keras all use Python as their primary interface language. Combined with the thriving open-source community around Hugging Face and others, Python has become the de facto standard language in the AI field.
Automated Testing/DevOps: Script development to boost work efficiency

Learning Advice

Set Quantifiable Goals: Define what city, what salary range you're targeting, and work backward to determine the skills you need
Build a Knowledge Framework Quickly: Prioritize breadth before depth — you don't need to master every single topic
Practice-Driven Learning: Write more code, build more projects, and solidify your knowledge through hands-on experience
Stay Consistent: Programming skills require continuous practice — avoid the trap of sporadic effort

Conclusion

Python's approach to Excel automation essentially replaces large amounts of repetitive manual work with a few lines of code. This isn't just about improving office efficiency — it embodies computational thinking: abstracting repetitive labor into reusable program logic. In software engineering, this mindset is known as the DRY principle (Don't Repeat Yourself), one of the cornerstones of efficient programming.

For those looking to get started with Python, office automation is an excellent entry point: low barrier, quick results, and highly practical. Once you've learned basic syntax and the Pandas library, you can immediately apply it to real work — truly putting knowledge into practice. Beyond Excel operations, Python can easily automate other office scenarios such as batch Word document generation (python-docx), PDF file processing (PyPDF2), automated email sending (smtplib), and more. Once you've grasped the core concepts, you'll find it remarkably easy to extend your skills to new domains.