Claude Code + Databricks Hands-On Tutorial: Natural Language-Driven Enterprise Data Analysis

The Evolution of Data Analysis: From Excel to AI-Driven Analysis

The field of data analysis is undergoing a profound transformation. From early local processing in Excel, to the big data era of Python+Hadoop, to today's Data Lake architecture, each leap has expanded the scale and efficiency of data processing.

A Data Lake is an architectural pattern that stores all types of data (structured, semi-structured, unstructured) in their raw format within a unified storage system. Unlike traditional Data Warehouses that require data to be cleaned and structured before writing, Data Lakes adopt a "Schema on Read" strategy, meaning the structure is defined only when data is read. This flexibility enables enterprises to store log files, JSON, images, videos, and other formats, analyzing them on demand. Databricks takes this further with the "Lakehouse" concept, combining the flexibility of Data Lakes with the transactional integrity and performance advantages of Data Warehouses, relying on the open-source Delta Lake project for ACID transaction support.
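
To make the "Schema on Read" idea concrete, here is a minimal Python sketch. The raw records, the field names, and the helper read_with_schema are all invented for illustration. It only shows the principle: raw lines are stored untouched, and a structure is applied when they are read.

```python
import json

# Raw records are stored exactly as they arrived; nothing is validated on write.
# (These sample lines and field names are made up for illustration.)
RAW_LINES = [
    '{"trip_id": 1, "fare": "12.5", "pickup_zip": "10001"}',
    '{"trip_id": 2, "fare": "30", "note": "airport run"}',
]

# The schema lives with the reader, not with the storage layer.
READ_SCHEMA = {"trip_id": int, "fare": float, "pickup_zip": str}


def read_with_schema(lines, schema):
    """Parse raw JSON lines and coerce each field to the reader's schema."""
    rows = []
    for line in lines:
        record = json.loads(line)
        row = {}
        for field, cast in schema.items():
            value = record.get(field)
            # Fields absent from a raw record become None instead of failing.
            row[field] = cast(value) if value is not None else None
        rows.append(row)
    return rows


rows = read_with_schema(RAW_LINES, READ_SCHEMA)
print(rows)
```

A different reader could apply a different schema to the same raw lines, which is the flexibility the paragraph above describes.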

With the emergence of AI coding tools, an even more exciting possibility has surfaced — using Claude Code to connect all these tools together, enabling natural language-driven enterprise data analysis.

This article, based on a hands-on demonstration by a Bilibili content creator, provides a detailed breakdown of how to use Claude Code to connect to Databricks and complete the full workflow from data querying and table creation to Notebook generation.

What Is Databricks? Why Choose It?

Databricks is not an ordinary database — it's an enterprise-grade unified data and AI analytics platform, officially positioned as a "Data Intelligence Platform." Founded in 2013 by the creators of Apache Spark, it initially focused on commercializing large-scale data processing engines before gradually expanding into a comprehensive platform covering the entire data lifecycle. Its core advantages include:

Unified Platform: Integrates data engineering, ad-hoc querying, machine learning, and AI applications in one place
Data Lake Architecture: Ideal for handling large volumes of complex, distributed enterprise data
Wide Enterprise Adoption: Many companies use it as their core data infrastructure

For data analysts, the traditional workflow involves creating queries with SQL statements inside Databricks, or generating Notebooks using Python/PySpark for data processing. PySpark is Apache Spark's Python API, allowing data analysts to use familiar Python syntax to operate on distributed datasets, handling TB or even PB-scale data volumes — a scale that single-machine Python (like Pandas) simply cannot handle. Once Databricks' CLI (Command Line Interface) is connected to Claude Code, the entire interaction shifts from "writing code manually" to "natural language conversation."

Step 1: Establishing the Connection Between Claude Code and Databricks

Environment Setup

The demonstration uses a free Databricks account, which anyone can obtain by registering on the official website. The demo refers to it as the Free Edition with a 14-day free trial; Databricks' free offerings and their terms change over time, so check the official website for the current conditions. The dataset is Databricks' built-in NYC Taxi sample (New York City taxi trips), with fields such as trip start time, distance, fare amount, and pickup/dropoff ZIP codes. NYC Taxi data is one of the most widely used public datasets in data science. It is published by the New York City Taxi and Limousine Commission (TLC); the full archive contains billions of trip records, while the built-in sample is only a small subset. It is commonly used for teaching, benchmarking, and urban transportation research.

Connection Configuration

After launching Claude Code, simply type: "Please connect to my Databricks." The first connection requires configuring a Personal Access Token (PAT). A PAT is a token-based credential widely used to authenticate API and CLI tools. Compared with username+password authentication, tokens can be given an expiration time, can be revoked instantly without changing the account password, and in some systems can be restricted to specific scopes. In Databricks, a PAT lets an external tool access workspace resources under the user's identity without exposing the user's primary account credentials. A PAT is sent as a bearer token, much like an OAuth 2.0 access token, but it is a static, long-lived credential and not part of an OAuth flow; Databricks also supports OAuth for programmatic access. Security best practices recommend creating a separate token for each external integration, with a limited lifetime and the minimum necessary permissions.
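
As a rough illustration of what token-based access looks like on the wire, the sketch below builds, but does not send, an HTTP request that carries a PAT as a bearer token. The helper name build_databricks_request and the placeholder host and token are assumptions, as is the choice of the "current user" endpoint path. This is not necessarily what Claude Code or the Databricks CLI does internally.

```python
import os
import urllib.request


def build_databricks_request(host, token, path="/api/2.0/preview/scim/v2/Me"):
    """Build (but do not send) a request authenticated with a PAT."""
    url = host.rstrip("/") + path
    request = urllib.request.Request(url)
    # The PAT is sent as a bearer token; no username or password is involved.
    request.add_header("Authorization", f"Bearer {token}")
    return request


# Placeholder values; real ones would come from your own workspace.
host = os.environ.get("DATABRICKS_HOST", "https://example.cloud.databricks.com")
token = os.environ.get("DATABRICKS_TOKEN", "dapi-placeholder")

request = build_databricks_request(host, token)
print(request.full_url)  # the token itself is deliberately never printed
```

Reading the token from an environment variable keeps it out of source files, and revoking the token in the workspace invalidates the integration without touching the account password.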

Once configured, Claude Code returns connection confirmation information, including email, workspace address, and token status. The entire process requires no configuration scripts.

Step 2: Completing Data Queries with Natural Language

Basic Queries

Once connected, you can ask questions directly in natural language. For example:

"Query NYC taxi data in the sample catalog. Show me which pickup zip code has highest average fares."

Claude Code completes three steps behind the scenes: it interprets the natural-language request, converts it to SQL, and then executes the query and returns the results. This natural-language-to-SQL conversion (Text-to-SQL, or NL2SQL) is a long-standing research direction in natural language processing. Modern large language models have learned extensive SQL patterns during pre-training. Combined with contextual schema information (table names, field names, data types, and other metadata), they can generate queries that include multi-table JOINs, aggregate functions, and window functions. After about one minute, Claude Code returned a ranking of the top 20 pickup ZIP codes by average fare.
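
The demo does not show the SQL that was generated, so the query below is only a plausible sketch of what such a prompt might translate to. The column names pickup_zip and fare_amount and the full table name samples.nyctaxi.trips are assumptions about the sample schema, and the rows are invented. The query runs here against an in-memory SQLite table so the example is self-contained.

```python
import sqlite3

# SQL of the kind the prompt might translate to. On Databricks the table would
# be referenced by its full name (assumed here to be samples.nyctaxi.trips).
QUERY = """
SELECT pickup_zip,
       ROUND(AVG(fare_amount), 2) AS avg_fare,
       COUNT(*) AS trips
FROM trips
GROUP BY pickup_zip
ORDER BY avg_fare DESC
LIMIT 20
"""

# Invented rows, only so the query has something to run against.
SAMPLE_TRIPS = [
    (11422, 52.0), (11422, 60.0),
    (10001, 9.5), (10001, 12.5),
    (10019, 15.0),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (pickup_zip INTEGER, fare_amount REAL)")
conn.executemany("INSERT INTO trips VALUES (?, ?)", SAMPLE_TRIPS)

ranking = conn.execute(QUERY).fetchall()
for pickup_zip, avg_fare, trips in ranking:
    print(pickup_zip, avg_fare, trips)
```

Including the trip count next to the average is a useful habit when reviewing generated SQL: a ZIP code with only a handful of trips can top an average-fare ranking by chance.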

Intelligent Analytical Insights

Impressively, Claude Code not only returned the data but also interpreted it using real-world geographic knowledge. For example, the top-ranked ZIP code, 11422, is located at the southeastern tip of Queens. Claude Code suggested that "pickups there are likely to be involved in long haul trips to Manhattan or airport": the area is a long way from Manhattan, so trips starting there tend to be long and fares are correspondingly higher.

This ability to combine data results with domain knowledge demonstrates the unique advantage of large language models over traditional BI tools. Traditional BI (Business Intelligence) tools can only present the data itself, while LLMs can leverage world knowledge accumulated during pre-training (such as geographic locations, city layouts, traffic patterns) to give data business meaning.

Compound Analysis

When asked to go further and JOIN pickup and dropoff information to find the most profitable routes, Claude Code returned a multi-dimensional analysis:

Most Profitable Routes by Average Fare: routes ranked by average fare
Top Revenue Routes: routes ranked by total revenue (trip volume × average fare)
Key Takeaways: the main business insights

This move from a single metric to multi-dimensional cross-analysis shows that Claude Code handled the analytical intent and not just the literal query. Notably, it automatically distinguished between "highest average fare" and "highest total revenue," two different business perspectives. The former can surface routes with a high price per trip but low frequency, while the latter reflects overall commercial value. This is the familiar "mean vs. total" distinction in data analysis.
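
The difference between the two rankings is easy to see on a toy example. The sketch below uses invented (pickup ZIP, dropoff ZIP, fare) rows and is not the query from the demo; it only shows that the route with the best average fare need not be the route with the most revenue.

```python
from collections import defaultdict

# Invented (pickup_zip, dropoff_zip, fare) rows for illustration only.
TRIPS = [
    (11422, 10001, 70.0),
    (10001, 10019, 10.0), (10001, 10019, 12.0), (10001, 10019, 11.0),
    (10001, 10019, 9.0), (10001, 10019, 13.0), (10001, 10019, 11.0),
    (10001, 10019, 12.0), (10001, 10019, 10.0),
]

# Group fares by route, where a route is a (pickup, dropoff) pair.
fares_by_route = defaultdict(list)
for pickup_zip, dropoff_zip, fare in TRIPS:
    fares_by_route[(pickup_zip, dropoff_zip)].append(fare)

summary = {
    route: {"trips": len(f), "avg_fare": sum(f) / len(f), "total": sum(f)}
    for route, f in fares_by_route.items()
}

# The same summary ranked two ways gives two different winners.
best_by_average = max(summary, key=lambda r: summary[r]["avg_fare"])
best_by_total = max(summary, key=lambda r: summary[r]["total"])

print("best average fare:", best_by_average, summary[best_by_average])
print("best total revenue:", best_by_total, summary[best_by_total])
```

Here the single long trip wins on average fare, while the short but frequent route wins on total revenue.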

Step 3: Creating Data Tables Through Conversation

Beyond querying, Claude Code can also create new tables directly in Databricks. Input:

"Create a data table in the workspace catalog that stores summary of taxi data by hour of day, populate from the sample data."

Claude Code automatically completed table creation and data population, ultimately generating a summary table called "fare_by_hour" under workspace.default. In traditional workflows, this step requires data engineers to write DDL (Data Definition Language) to define the table structure, then write ETL (Extract-Transform-Load) scripts to complete data extraction, transformation, and loading. DDL is a subset of SQL used to define database structures, including commands like CREATE TABLE and ALTER TABLE. ETL is a core process in data engineering, referring to the complete data pipeline of extracting raw data from source systems, performing cleaning and transformation, and finally loading it into the target system. Traditionally, developing and maintaining ETL scripts is a primary responsibility of data engineers, often involving extensive details like data type mapping, null handling, and deduplication logic.

Claude Code encapsulates these technical details beneath a natural language interface, meaning data analysts can accomplish work that previously required specialized data engineering knowledge through conversational interaction, significantly lowering the barrier to entry for data engineering.
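
For readers who want to see what the hidden DDL and load step might look like, the sketch below builds an hourly summary with a single CREATE TABLE ... AS SELECT statement. It runs against in-memory SQLite with invented rows. The source column names are assumptions about the sample schema, and the statement Claude Code actually issued on Databricks is not shown in the demo.

```python
import sqlite3

# Invented rows: (pickup timestamp, fare). The real source would be the sample table.
SAMPLE_TRIPS = [
    ("2016-02-14 08:05:00", 10.0),
    ("2016-02-14 08:40:00", 14.0),
    ("2016-02-14 17:15:00", 20.0),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (tpep_pickup_datetime TEXT, fare_amount REAL)")
conn.executemany("INSERT INTO trips VALUES (?, ?)", SAMPLE_TRIPS)

# DDL and load in one statement (CREATE TABLE ... AS SELECT).
# SQLite spells hour extraction with strftime; Spark SQL has an hour() function.
conn.execute("""
    CREATE TABLE fare_by_hour AS
    SELECT CAST(strftime('%H', tpep_pickup_datetime) AS INTEGER) AS pickup_hour,
           COUNT(*)         AS trip_count,
           AVG(fare_amount) AS avg_fare,
           SUM(fare_amount) AS total_fare
    FROM trips
    GROUP BY pickup_hour
""")

fare_by_hour = conn.execute(
    "SELECT * FROM fare_by_hour ORDER BY pickup_hour"
).fetchall()
print(fare_by_hour)
```

Even when the statement is generated for you, it is worth reading it once to confirm which columns are aggregated and how the hour is derived.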

Step 4: Automatically Generating Databricks Notebooks

The final demonstration covers Notebooks, one of the most commonly used features in Databricks. Input:

"Create a Databricks notebook called fare analysis, visualize fare trend by hour."

Claude Code automatically generated a complete EDA (Exploratory Data Analysis) Notebook. EDA is a critical early phase of data science projects, popularized by statistician John Tukey in his 1977 book Exploratory Data Analysis. Its core idea is to understand distributions, outliers, correlations, and patterns through visualization and statistical summaries before establishing formal hypotheses. Databricks Notebooks are interactive computing environments similar to Jupyter Notebooks, organizing code, explanatory text, and visualization results in a single document. Each cell can run independently, and Notebooks support multiple languages including Python, SQL, Scala, and R, making them well suited to iterative data exploration and analysis reports.
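
One way a tool can produce such a Notebook is to write it as a plain source file. The sketch below assembles a Python notebook in the Databricks source format, assuming the usual layout: a "# Databricks notebook source" header line, cells separated by "# COMMAND ----------", and Markdown cells written as "# MAGIC" comment lines. The helper names and cell contents are hypothetical, and the demo does not show how Claude Code actually created its Notebook.

```python
NOTEBOOK_HEADER = "# Databricks notebook source"
CELL_SEPARATOR = "# COMMAND ----------"


def markdown_cell(text):
    """Render text as a Markdown cell using MAGIC comment lines."""
    lines = ["# MAGIC %md"]
    lines += [f"# MAGIC {line}" for line in text.splitlines()]
    return "\n".join(lines)


def build_notebook_source(cells):
    """Join cell bodies into one source-format notebook string."""
    separator = f"\n\n{CELL_SEPARATOR}\n\n"
    return NOTEBOOK_HEADER + "\n" + separator.join(cells) + "\n"


cells = [
    markdown_cell("# Fare analysis\nFare trend by hour of day."),
    # Cell contents below are illustrative; `spark` and `display` exist only
    # inside a Databricks notebook, so this string is not executed here.
    'df = spark.table("workspace.default.fare_by_hour")',
    'display(df.orderBy("pickup_hour"))',
]

source = build_notebook_source(cells)
print(source)
```

Because the result is ordinary text, it can be reviewed or version-controlled before it is imported into a workspace.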

The generated Notebook includes:

Data loading and preprocessing code
Fare trend analysis by hour
Visualization of peak hours by trip volume
Comparison charts of average fare vs. median fare
Total revenue distribution by time period
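
The statistics behind the charts listed above can be sketched in a few lines of plain Python. The (hour, fare) pairs are invented, and the generated Notebook would more likely use PySpark or SQL; the point is only to show why mean and median are plotted side by side.

```python
from collections import defaultdict
from statistics import mean, median

# Invented (pickup_hour, fare) pairs for illustration only.
FARES = [
    (8, 10.0), (8, 12.0), (8, 50.0),
    (17, 14.0), (17, 16.0),
]

fares_by_hour = defaultdict(list)
for hour, fare in FARES:
    fares_by_hour[hour].append(fare)

# Per-hour summary: the mean is pulled up by outliers, the median is not.
hourly_stats = {
    hour: {"mean": mean(fares), "median": median(fares), "trips": len(fares)}
    for hour, fares in sorted(fares_by_hour.items())
}

# Peak hour defined here as the hour with the most trips.
peak_hour = max(hourly_stats, key=lambda h: hourly_stats[h]["trips"])

for hour, stats in hourly_stats.items():
    print(hour, stats)
print("peak hour:", peak_hour)
```

In the 8 o'clock bucket a single 50.0 fare lifts the mean to twice the median, which is the kind of gap a reviewer should look for when checking the generated charts.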

Users simply need to open the Notebook, run each cell in order, check that the intermediate results are reasonable, and end up with a complete visual analysis report. This "AI-generated + human-reviewed" workflow keeps the process efficient while leaving analysis quality under human control.

Implications for Data Analysts

This demonstration reveals a fundamental shift in how data analysis work is done:

From "Executor" to "Conductor": Previously, data analysts needed to manually check Excel files, write SQL, and run Python scripts. Now AI can handle much of the repetitive work. But what's truly valuable — business understanding, problem definition, and result interpretation — still requires human involvement. This aligns with the "abstraction level elevation" trend in software engineering: from machine code to assembly, from assembly to high-level languages, from high-level languages to natural language. Each abstraction upgrade enables practitioners to think about problems at a higher level.

Unified Toolchain: Claude Code acts as a "universal glue," seamlessly connecting Databricks, SQL, Python, visualization, and other tools, reducing the cognitive burden of tool switching. In traditional workflows, data analysts frequently switch between multiple interfaces — SQL editors, Python IDEs, visualization tools, documentation systems — with each switch interrupting their flow of thought. A unified natural language interface eliminates this "context switching cost."

Core Competencies for Future Data Analysts: It's not just about using tools — it's about learning to "direct AI in using tools." Clear problem articulation, sound analytical frameworks, and critical thinking about results are the truly irreplaceable capabilities. Specifically, this includes: whether you can translate vague business questions into precise analytical requirements, whether you can assess the statistical validity of AI-returned results, whether you can identify biases and pitfalls in data, and whether you can translate analytical conclusions into actionable business recommendations.

Conclusion

By connecting Claude Code to Databricks, data analysts can use natural language to complete a series of operations including data querying, table creation, and Notebook generation, dramatically improving work efficiency. This isn't about replacing data analysts — it's about freeing them from tedious code writing so they can focus on higher-value business insights and decision support.

From a broader perspective, this represents an important evolution in human-machine collaboration: AI handles the "how" while humans focus on the "what" and the "why." When AI dramatically lowers the barrier to technical execution, what becomes truly scarce is deep business understanding, critical thinking about data, and the decision-making ability to translate analysis into action.