⚡️ Speed up function pivot_table by 4,131% #35

Open · wants to merge 1 commit into contrived-examples from codeflash/optimize-pivot_table-mc9s59u1

Conversation

codeflash-ai[bot] commented on Jun 24, 2025

📄 4,131% (41.31x) speedup for pivot_table in src/numpy_pandas/dataframe_operations.py

⏱️ Runtime: 210 milliseconds → 4.96 milliseconds (best of 99 runs)

📝 Explanation and details

Here is an optimized rewrite of your code.
The main performance bottleneck is the use of `df.iloc[i]` in a per-row loop, which is extremely slow in pandas, especially for large DataFrames.
Instead, we extract the relevant columns as numpy arrays (or pandas Series) once, then iterate over them in a single tight, cache-friendly loop without repeated DataFrame lookups or allocations.

Also, the aggregations can be done efficiently using dictionaries and only looping over the minimal data necessary.
The aggregation helper functions are unchanged.
No change to function signature or output format.
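
For context, the per-row pattern being replaced presumably looks like the sketch below; this is a reconstruction from the description, not the actual source of `src/numpy_pandas/dataframe_operations.py`.

```python
# Hypothetical sketch of the slow per-row pattern described above
# (reconstructed from the PR description, not the original source).
for i in range(len(df)):
    row = df.iloc[i]          # materializes a new Series on every iteration
    idx_val = row[index]      # three label-based lookups on that Series
    col_val = row[columns]
    val = row[values]
    # ... collect val under the (idx_val, col_val) cell, then aggregate later
```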

Optimized version

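Below is a minimal sketch of what the optimized function could look like, reconstructed from the description above and from the behavior the regression tests exercise. The nested-dict output shape (keyed by index value, then column value), the inlined aggregation (the real code keeps its separate helper functions), and the exact error messages are assumptions rather than the code actually shipped in this PR.

```python
from typing import Any, Dict

import pandas as pd


def pivot_table(df: pd.DataFrame, index: Any, columns: Any, values: Any,
                aggfunc: str = "mean") -> Dict[Any, Dict[Any, Any]]:
    # Fail fast on bad inputs, mirroring the KeyError/ValueError cases in the tests.
    for col in (index, columns, values):
        if col not in df.columns:
            raise KeyError(col)
    if aggfunc not in ("mean", "sum", "count"):
        raise ValueError(f"Unsupported aggfunc: {aggfunc}")

    # Pull the three columns out once as numpy arrays instead of calling
    # df.iloc[i] inside the loop.
    idx_arr = df[index].to_numpy()
    col_arr = df[columns].to_numpy()
    val_arr = df[values].to_numpy()

    # Group values per (index, column) cell with plain dicts; setdefault keeps
    # the number of dictionary lookups down.
    groups: Dict[Any, Dict[Any, list]] = {}
    for idx_val, col_val, val in zip(idx_arr, col_arr, val_arr):
        groups.setdefault(idx_val, {}).setdefault(col_val, []).append(val)

    # Aggregate each cell (the real code delegates to its unchanged helpers).
    result: Dict[Any, Dict[Any, Any]] = {}
    for idx_val, cols in groups.items():
        result[idx_val] = {}
        for col_val, vals in cols.items():
            if aggfunc == "mean":
                result[idx_val][col_val] = sum(vals) / len(vals)
            elif aggfunc == "sum":
                result[idx_val][col_val] = sum(vals)
            else:  # "count"
                result[idx_val][col_val] = len(vals)
    return result
```

Under this sketch, the `test_basic_mean` DataFrame would produce `{"foo": {"one": 1.0, "two": 2.0}, "bar": {"one": 3.0, "two": 4.0}}`.
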
Key changes & speedups:

  • Replace `for i in range(len(df)): row = df.iloc[i] ...` with a direct zip of numpy arrays from the selected columns.
  • Use `setdefault` to reduce the number of dictionary lookups and lines.
  • Avoid all per-row DataFrame accesses.
  • Result format and function signature unchanged.

This change will typically speed up the function by 10x–50x or more on large DataFrames.
No external dependencies are added; it uses only pandas and numpy, both of which are already installed.
All the comments from your original code that described distinct sections remain applicable.

Correctness verification report:

| Test | Status |
| --- | --- |
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | 37 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |
🌀 Generated Regression Tests and Runtime
from typing import Any

import pandas as pd
# imports
import pytest  # used for our unit tests
from src.numpy_pandas.dataframe_operations import pivot_table

# unit tests

# ------------------------------
# BASIC TEST CASES
# ------------------------------

def test_basic_mean():
    # Test mean aggregation on simple data
    df = pd.DataFrame({
        "A": ["foo", "foo", "bar", "bar"],
        "B": ["one", "two", "one", "two"],
        "C": [1, 2, 3, 4]
    })
    codeflash_output = pivot_table(df, index="A", columns="B", values="C", aggfunc="mean"); result = codeflash_output # 114μs -> 58.2μs (96.3% faster)

def test_basic_sum():
    # Test sum aggregation on simple data
    df = pd.DataFrame({
        "A": ["foo", "foo", "bar", "bar"],
        "B": ["one", "two", "one", "two"],
        "C": [1, 2, 3, 4]
    })
    codeflash_output = pivot_table(df, index="A", columns="B", values="C", aggfunc="sum"); result = codeflash_output # 110μs -> 57.2μs (92.8% faster)

def test_basic_count():
    # Test count aggregation on simple data
    df = pd.DataFrame({
        "A": ["foo", "foo", "bar", "bar", "foo"],
        "B": ["one", "two", "one", "two", "one"],
        "C": [1, 2, 3, 4, 5]
    })
    codeflash_output = pivot_table(df, index="A", columns="B", values="C", aggfunc="count"); result = codeflash_output # 129μs -> 57.0μs (126% faster)

def test_multiple_entries_per_cell():
    # Test aggregation when there are multiple entries per cell
    df = pd.DataFrame({
        "A": ["foo", "foo", "foo", "bar", "bar"],
        "B": ["one", "one", "two", "one", "two"],
        "C": [1, 3, 2, 3, 4]
    })
    codeflash_output = pivot_table(df, index="A", columns="B", values="C", aggfunc="mean"); result = codeflash_output # 129μs -> 57.7μs (124% faster)

def test_non_numeric_values():
    # Test count aggregation with non-numeric values
    df = pd.DataFrame({
        "A": ["foo", "foo", "bar"],
        "B": ["x", "y", "x"],
        "C": ["apple", "banana", "carrot"]
    })
    codeflash_output = pivot_table(df, index="A", columns="B", values="C", aggfunc="count"); result = codeflash_output # 67.3μs -> 56.1μs (20.0% faster)

# ------------------------------
# EDGE TEST CASES
# ------------------------------

def test_empty_dataframe():
    # Test with an empty DataFrame
    df = pd.DataFrame(columns=["A", "B", "C"])
    codeflash_output = pivot_table(df, index="A", columns="B", values="C", aggfunc="sum"); result = codeflash_output # 1.92μs -> 56.2μs (96.6% slower)

def test_missing_index_column():
    # Test with missing index column
    df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
    with pytest.raises(KeyError):
        pivot_table(df, index="X", columns="B", values="A", aggfunc="sum")

def test_missing_columns_column():
    # Test with missing columns column
    df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
    with pytest.raises(KeyError):
        pivot_table(df, index="A", columns="X", values="B", aggfunc="sum")

def test_missing_values_column():
    # Test with missing values column
    df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
    with pytest.raises(KeyError):
        pivot_table(df, index="A", columns="B", values="X", aggfunc="sum")

def test_unsupported_aggfunc():
    # Test with unsupported aggregation function
    df = pd.DataFrame({"A": [1], "B": [2], "C": [3]})
    with pytest.raises(ValueError):
        pivot_table(df, index="A", columns="B", values="C", aggfunc="median")


def test_duplicate_index_and_columns():
    # Test with duplicate values in index and columns
    df = pd.DataFrame({
        "A": ["a", "a", "a"],
        "B": ["b", "b", "b"],
        "C": [1, 2, 3]
    })
    codeflash_output = pivot_table(df, index="A", columns="B", values="C", aggfunc="sum"); result = codeflash_output # 95.0μs -> 58.8μs (61.7% faster)

def test_single_row():
    # Test with a single-row DataFrame
    df = pd.DataFrame({"A": ["foo"], "B": ["bar"], "C": [42]})
    codeflash_output = pivot_table(df, index="A", columns="B", values="C", aggfunc="mean"); result = codeflash_output # 43.0μs -> 56.7μs (24.0% slower)

def test_all_same_index_column():
    # Test where all rows have the same index and columns
    df = pd.DataFrame({
        "A": ["x"] * 5,
        "B": ["y"] * 5,
        "C": [1, 2, 3, 4, 5]
    })
    codeflash_output = pivot_table(df, index="A", columns="B", values="C", aggfunc="mean"); result = codeflash_output # 128μs -> 57.0μs (125% faster)

def test_non_string_column_names():
    # Test with non-string column names
    df = pd.DataFrame({
        1: ["a", "a", "b"],
        2: ["x", "y", "x"],
        3: [10, 20, 30]
    })
    codeflash_output = pivot_table(df, index=1, columns=2, values=3, aggfunc="sum"); result = codeflash_output # 104μs -> 60.2μs (73.1% faster)

# ------------------------------
# LARGE SCALE TEST CASES
# ------------------------------

def test_large_number_of_rows():
    # Test with a large number of rows and moderate number of unique index/columns
    n = 1000
    df = pd.DataFrame({
        "A": ["foo"] * (n//2) + ["bar"] * (n//2),
        "B": ["one"] * (n//4) + ["two"] * (n//4) + ["one"] * (n//4) + ["two"] * (n//4),
        "C": list(range(n))
    })
    codeflash_output = pivot_table(df, index="A", columns="B", values="C", aggfunc="mean"); result = codeflash_output # 19.0ms -> 232μs (8083% faster)
    # Compute expected means
    foo_one = sum(range(0, n//4)) / (n//4)
    foo_two = sum(range(n//4, n//2)) / (n//4)
    bar_one = sum(range(n//2, n//2 + n//4)) / (n//4)
    bar_two = sum(range(n//2 + n//4, n)) / (n//4)

def test_large_unique_index_and_columns():
    # Test with many unique index and columns values, but only one row per combination
    n = 50
    data = {
        "A": [f"row{i}" for i in range(n) for j in range(n)],
        "B": [f"col{j}" for i in range(n) for j in range(n)],
        "C": [i * n + j for i in range(n) for j in range(n)]
    }
    df = pd.DataFrame(data)
    codeflash_output = pivot_table(df, index="A", columns="B", values="C", aggfunc="sum"); result = codeflash_output # 49.0ms -> 1.10ms (4343% faster)

def test_large_count_aggregation():
    # Test count aggregation with large data
    n = 500
    df = pd.DataFrame({
        "A": ["x"] * n + ["y"] * n,
        "B": ["a"] * (n//2) + ["b"] * (n//2) + ["a"] * (n//2) + ["b"] * (n//2),
        "C": list(range(n)) + list(range(n))
    })
    codeflash_output = pivot_table(df, index="A", columns="B", values="C", aggfunc="count"); result = codeflash_output # 19.2ms -> 204μs (9270% faster)

def test_large_sum_aggregation():
    # Test sum aggregation with large data and check for correctness
    n = 200
    df = pd.DataFrame({
        "A": ["i" + str(i % 10) for i in range(n)],
        "B": ["j" + str(i % 20) for i in range(n)],
        "C": [i for i in range(n)]
    })
    codeflash_output = pivot_table(df, index="A", columns="B", values="C", aggfunc="sum"); result = codeflash_output # 3.88ms -> 108μs (3483% faster)
    # Spot check a few sums
    for i in range(10):
        for j in range(20):
            idx = "i" + str(i)
            col = "j" + str(j)
            # Find all rows where A==idx and B==col
            expected = sum(df[(df["A"] == idx) & (df["B"] == col)]["C"])
            if expected or ((df["A"] == idx) & (df["B"] == col)).any():
                pass

def test_large_sparse_table():
    # Test with a large, sparse table (most cells empty)
    n = 100
    df = pd.DataFrame({
        "A": [f"row{i}" for i in range(n)],
        "B": [f"col{i}" for i in range(n)],
        "C": [i for i in range(n)]
    })
    codeflash_output = pivot_table(df, index="A", columns="B", values="C", aggfunc="sum"); result = codeflash_output # 2.03ms -> 116μs (1647% faster)
    # Each cell should have only one value
    for i in range(n):
        pass
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

from typing import Any

import pandas as pd
# imports
import pytest  # used for our unit tests
from src.numpy_pandas.dataframe_operations import pivot_table

# unit tests

# 1. BASIC TEST CASES

def test_pivot_table_mean_basic():
    # Simple mean aggregation
    df = pd.DataFrame({
        "A": ["foo", "foo", "bar", "bar"],
        "B": ["one", "two", "one", "two"],
        "C": [1, 2, 3, 4]
    })
    codeflash_output = pivot_table(df, index="A", columns="B", values="C", aggfunc="mean"); result = codeflash_output # 113μs -> 59.0μs (91.9% faster)

def test_pivot_table_sum_basic():
    # Simple sum aggregation
    df = pd.DataFrame({
        "A": ["foo", "foo", "bar", "bar"],
        "B": ["one", "two", "one", "two"],
        "C": [1, 2, 3, 4]
    })
    codeflash_output = pivot_table(df, index="A", columns="B", values="C", aggfunc="sum"); result = codeflash_output # 110μs -> 58.0μs (91.2% faster)

def test_pivot_table_count_basic():
    # Simple count aggregation
    df = pd.DataFrame({
        "A": ["foo", "foo", "bar", "bar", "bar"],
        "B": ["one", "two", "one", "two", "one"],
        "C": [1, 2, 3, 4, 5]
    })
    codeflash_output = pivot_table(df, index="A", columns="B", values="C", aggfunc="count"); result = codeflash_output # 129μs -> 57.3μs (126% faster)

def test_pivot_table_multiple_values_per_group():
    # Multiple values per group for mean
    df = pd.DataFrame({
        "A": ["foo", "foo", "foo", "bar", "bar"],
        "B": ["one", "one", "two", "one", "two"],
        "C": [1, 3, 2, 3, 4]
    })
    codeflash_output = pivot_table(df, index="A", columns="B", values="C", aggfunc="mean"); result = codeflash_output # 128μs -> 57.7μs (123% faster)

def test_pivot_table_non_numeric_count():
    # Count works with non-numeric values
    df = pd.DataFrame({
        "A": ["foo", "foo", "bar", "bar"],
        "B": ["one", "two", "one", "two"],
        "C": ["x", "y", "z", "w"]
    })
    codeflash_output = pivot_table(df, index="A", columns="B", values="C", aggfunc="count"); result = codeflash_output # 81.1μs -> 56.4μs (43.8% faster)

# 2. EDGE TEST CASES

def test_pivot_table_empty_dataframe():
    # Empty DataFrame
    df = pd.DataFrame(columns=["A", "B", "C"])
    codeflash_output = pivot_table(df, index="A", columns="B", values="C", aggfunc="mean"); result = codeflash_output # 1.83μs -> 55.8μs (96.7% slower)

def test_pivot_table_one_row():
    # DataFrame with a single row
    df = pd.DataFrame({"A": ["foo"], "B": ["bar"], "C": [42]})
    codeflash_output = pivot_table(df, index="A", columns="B", values="C", aggfunc="sum"); result = codeflash_output # 43.3μs -> 55.8μs (22.3% slower)

def test_pivot_table_all_same_group():
    # All rows belong to one group
    df = pd.DataFrame({"A": ["x"]*4, "B": ["y"]*4, "C": [1, 2, 3, 4]})
    codeflash_output = pivot_table(df, index="A", columns="B", values="C", aggfunc="mean"); result = codeflash_output # 109μs -> 56.9μs (93.0% faster)

def test_pivot_table_missing_column_raises():
    # Should raise if a column is missing
    df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
    with pytest.raises(KeyError):
        pivot_table(df, index="A", columns="B", values="C", aggfunc="sum")

def test_pivot_table_invalid_aggfunc_raises():
    # Should raise ValueError for unsupported aggfunc
    df = pd.DataFrame({"A": [1], "B": [2], "C": [3]})
    with pytest.raises(ValueError):
        pivot_table(df, index="A", columns="B", values="C", aggfunc="min")



def test_pivot_table_non_string_column_names():
    # Non-string column names
    df = pd.DataFrame({
        1: ["foo", "bar"],
        2: ["x", "y"],
        3: [10, 20]
    })
    codeflash_output = pivot_table(df, index=1, columns=2, values=3, aggfunc="sum"); result = codeflash_output # 85.5μs -> 63.0μs (35.7% faster)

def test_pivot_table_unhashable_values():
    # Unhashable values in index/columns
    df = pd.DataFrame({
        "A": [{"x": 1}, {"x": 2}],
        "B": [{"y": 3}, {"y": 4}],
        "C": [5, 6]
    })
    with pytest.raises(TypeError):
        pivot_table(df, index="A", columns="B", values="C", aggfunc="sum")

# 3. LARGE SCALE TEST CASES

def test_pivot_table_large_number_of_rows():
    # Test with 1000 rows, 10 groups, 10 columns
    import random
    random.seed(0)
    n = 1000
    groups = [f"group_{i}" for i in range(10)]
    cols = [f"col_{j}" for j in range(10)]
    data = {
        "A": [random.choice(groups) for _ in range(n)],
        "B": [random.choice(cols) for _ in range(n)],
        "C": [random.randint(1, 100) for _ in range(n)]
    }
    df = pd.DataFrame(data)
    codeflash_output = pivot_table(df, index="A", columns="B", values="C", aggfunc="mean"); result = codeflash_output # 19.1ms -> 280μs (6716% faster)
    for v in result.values():
        pass

def test_pivot_table_large_number_of_groups():
    # Test with 1000 unique groups, 2 columns
    df = pd.DataFrame({
        "A": [f"group_{i}" for i in range(1000)],
        "B": ["x", "y"] * 500,
        "C": [i for i in range(1000)]
    })
    codeflash_output = pivot_table(df, index="A", columns="B", values="C", aggfunc="sum"); result = codeflash_output # 19.7ms -> 619μs (3079% faster)
    # Each group should have one value
    for k, v in result.items():
        for vv in v.values():
            pass

def test_pivot_table_large_number_of_columns():
    # Test with 2 groups, 500 columns
    ncols = 500
    df = pd.DataFrame({
        "A": ["foo"] * ncols + ["bar"] * ncols,
        "B": [f"col_{i}" for i in range(ncols)] * 2,
        "C": list(range(ncols)) + list(range(ncols, 2 * ncols))
    })
    codeflash_output = pivot_table(df, index="A", columns="B", values="C", aggfunc="sum"); result = codeflash_output # 19.5ms -> 431μs (4419% faster)

def test_pivot_table_large_uniform_data():
    # All values are the same, should get the same mean/sum/count
    df = pd.DataFrame({
        "A": ["a"] * 1000,
        "B": ["b"] * 1000,
        "C": [7] * 1000
    })
    codeflash_output = pivot_table(df, index="A", columns="B", values="C", aggfunc="mean"); result_mean = codeflash_output # 19.0ms -> 225μs (8343% faster)
    codeflash_output = pivot_table(df, index="A", columns="B", values="C", aggfunc="sum"); result_sum = codeflash_output # 18.9ms -> 176μs (10612% faster)
    codeflash_output = pivot_table(df, index="A", columns="B", values="C", aggfunc="count"); result_count = codeflash_output # 18.8ms -> 146μs (12755% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

To edit these changes, check out the branch with `git checkout codeflash/optimize-pivot_table-mc9s59u1` and push.

codeflash-ai[bot] added the ⚡️ codeflash (Optimization PR opened by Codeflash AI) label on Jun 24, 2025
codeflash-ai[bot] requested a review from KRRT7 on June 24, 2025 at 00:21