
⚡️ Speed up function correlation by 26,306% #23


Closed



@codeflash-ai codeflash-ai bot commented Jun 21, 2025

📄 26,306% (263.06x) speedup for correlation in src/numpy_pandas/dataframe_operations.py

⏱️ Runtime: 1.96 seconds -> 7.41 milliseconds (best of 90 runs)

📝 Explanation and details

Here is an optimized version of your program. The main bottleneck is calling df.iloc[k][col] in the innermost loop, together with repeated NA checks. Instead, I build a single NumPy mask per column pair so that only rows with complete data for both columns are considered, then compute the statistics with vectorized NumPy operations. Finally, I avoid repeated conversion and slicing.

The optimized implementation is much faster on non-trivial DataFrames: in the generated tests, the 1,000-row cases run roughly 200x to 1,200x faster.
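To make the bottleneck concrete, here is a minimal sketch contrasting the two access patterns. The PR conversation does not show the original code, so the row-wise loop is a hypothetical reconstruction from the description above, and the column names `a` and `b` are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0, np.nan, 4.0], "b": [2.0, np.nan, 6.0, 8.0]})

# Row-wise pattern (hypothetical reconstruction): every df.iloc[k] builds a
# new Series for the row before the column lookup, and NA is checked per value.
slow_pairs = []
for k in range(len(df)):
    x = df.iloc[k]["a"]
    y = df.iloc[k]["b"]
    if not pd.isna(x) and not pd.isna(y):
        slow_pairs.append((x, y))

# Vectorized pattern: convert each column once, then build one boolean mask
# that keeps only rows where both columns hold a value.
a = df["a"].to_numpy(dtype=float)
b = df["b"].to_numpy(dtype=float)
mask = ~np.isnan(a) & ~np.isnan(b)
fast_pairs = list(zip(a[mask], b[mask]))
```

Both patterns select the same rows; the second does so without a Python-level loop over rows or repeated list appends.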

Key optimizations.

  • Avoids slow explicit loops over rows with efficient NumPy masking and computation.
  • Converts columns to NumPy arrays once outside the main loops.
  • Computes means, stds, and covariance using vectorized NumPy functions.
  • Reuses per-column mask arrays for validity checking.
  • Reduces pure Python statement overhead and memory churn from frequent list appends.

This version should be orders of magnitude faster for medium/large DataFrames, preserving all semantics and the function signature.
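The optimizations listed above could be sketched roughly as follows. This is not the code from the PR, which is not shown in this conversation. It assumes that the function returns a dict keyed by `(column, column)` tuples, that population standard deviation (`ddof=0`) is used, and that undefined cases (no overlapping rows, zero variance) yield NaN. The name `correlation_sketch` is hypothetical.

```python
import numpy as np
import pandas as pd


def correlation_sketch(df: pd.DataFrame) -> dict:
    """Pairwise Pearson correlation over rows where both columns are non-NaN.

    Assumptions (not confirmed by the PR): dict result keyed by (col_i, col_j),
    population std (ddof=0), NaN for undefined correlations.
    """
    numeric = df.select_dtypes(include="number")
    cols = list(numeric.columns)
    # Convert each column to a float array once, outside the pair loops.
    arrays = {c: numeric[c].to_numpy(dtype=float) for c in cols}
    # One validity mask per column, reused for every pair it takes part in.
    valid = {c: ~np.isnan(arrays[c]) for c in cols}

    result = {}
    for ci in cols:
        for cj in cols:
            mask = valid[ci] & valid[cj]
            x, y = arrays[ci][mask], arrays[cj][mask]
            if x.size == 0:
                # No row has a value in both columns.
                result[(ci, cj)] = float("nan")
                continue
            sx, sy = x.std(), y.std()
            if sx == 0 or sy == 0:
                # Zero variance: the correlation is undefined.
                result[(ci, cj)] = float("nan")
                continue
            cov = ((x - x.mean()) * (y - y.mean())).mean()
            result[(ci, cj)] = float(cov / (sx * sy))
    return result
```

The only remaining Python loops run over column pairs, so their cost depends on the number of columns rather than the number of rows.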

Correctness verification report:

| Test | Status |
|---|---|
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | ✅ 38 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |
🌀 Generated Regression Tests and Runtime
from typing import Tuple

import numpy as np
import pandas as pd
# imports
import pytest  # used for our unit tests
from src.numpy_pandas.dataframe_operations import correlation

# unit tests

# -------------------- BASIC TEST CASES --------------------

def test_correlation_identity():
    """
    Test that the correlation of a column with itself is always 1.0
    """
    df = pd.DataFrame({'a': [1, 2, 3, 4, 5]})
    codeflash_output = correlation(df); result = codeflash_output # 276μs -> 85.4μs (224% faster)

def test_correlation_two_perfectly_correlated_columns():
    """
    Test two perfectly correlated columns (b = 2 * a)
    """
    df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [2, 4, 6, 8]})
    codeflash_output = correlation(df); result = codeflash_output # 783μs -> 214μs (266% faster)

def test_correlation_two_perfectly_anticorrelated_columns():
    """
    Test two perfectly anti-correlated columns (b = -a)
    """
    df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [-1, -2, -3, -4]})
    codeflash_output = correlation(df); result = codeflash_output # 787μs -> 212μs (270% faster)

def test_correlation_two_uncorrelated_columns():
    """
    Test a varying column against a constant one (b has zero variance)
    """
    df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [7, 7, 7, 7]})
    codeflash_output = correlation(df); result = codeflash_output # 780μs -> 195μs (300% faster)

def test_correlation_three_columns_mixed():
    """
    Test three columns with mixed relationships
    """
    df = pd.DataFrame({'a': [1, 2, 3], 'b': [2, 4, 6], 'c': [3, 3, 3]})
    codeflash_output = correlation(df); result = codeflash_output # 1.30ms -> 382μs (240% faster)

def test_correlation_non_numeric_columns_ignored():
    """
    Test that non-numeric columns are ignored
    """
    df = pd.DataFrame({'a': [1, 2, 3], 'b': [2, 4, 6], 'c': ['x', 'y', 'z']})
    codeflash_output = correlation(df); result = codeflash_output # 920μs -> 225μs (308% faster)

# -------------------- EDGE TEST CASES --------------------

def test_correlation_empty_dataframe():
    """
    Test that an empty DataFrame returns an empty result
    """
    df = pd.DataFrame()
    codeflash_output = correlation(df); result = codeflash_output # 2.54μs -> 2.12μs (19.6% faster)

def test_correlation_single_row():
    """
    Test DataFrame with a single row: correlation should be nan due to zero std
    """
    df = pd.DataFrame({'a': [42], 'b': [99]})
    codeflash_output = correlation(df); result = codeflash_output # 249μs -> 189μs (31.4% faster)

def test_correlation_single_column():
    """
    Test DataFrame with a single column
    """
    df = pd.DataFrame({'a': [1, 2, 3, 4]})
    codeflash_output = correlation(df); result = codeflash_output # 230μs -> 83.7μs (175% faster)

def test_correlation_all_nan():
    """
    Test DataFrame where all entries are NaN
    """
    df = pd.DataFrame({'a': [np.nan, np.nan], 'b': [np.nan, np.nan]})
    codeflash_output = correlation(df); result = codeflash_output # 150μs -> 64.6μs (132% faster)
    for k in result:
        pass

def test_correlation_some_nan():
    """
    Test DataFrame with some NaN values; correlation should be computed only on valid pairs
    """
    df = pd.DataFrame({'a': [1, np.nan, 3], 'b': [4, 5, np.nan]})
    # Only row 0 is valid for both
    codeflash_output = correlation(df); result = codeflash_output # 424μs -> 193μs (119% faster)

def test_correlation_mixed_nan_valid():
    """
    Test DataFrame with enough non-NaN overlap to compute correlation
    """
    df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [2, np.nan, 6, 8]})
    # Only rows 0,2,3 are valid for both
    codeflash_output = correlation(df); result = codeflash_output # 1.04ms -> 209μs (396% faster)

def test_correlation_non_numeric_only():
    """
    Test DataFrame with only non-numeric columns
    """
    df = pd.DataFrame({'a': ['x', 'y', 'z'], 'b': ['a', 'b', 'c']})
    codeflash_output = correlation(df); result = codeflash_output # 45.7μs -> 45.3μs (0.827% faster)

def test_correlation_inf_values():
    """
    Test DataFrame with inf values; should propagate nan due to invalid std/cov
    """
    df = pd.DataFrame({'a': [1, 2, np.inf], 'b': [2, 4, 6]})
    codeflash_output = correlation(df); result = codeflash_output # 932μs -> 243μs (283% faster)

def test_correlation_different_length_columns():
    """
    Test DataFrame with columns of different lengths (shouldn't happen in pandas, but test for robustness)
    """
    df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, np.nan]})
    codeflash_output = correlation(df); result = codeflash_output # 763μs -> 209μs (264% faster)

# -------------------- LARGE SCALE TEST CASES --------------------

def test_correlation_large_random_data():
    """
    Test correlation on a large DataFrame with random data
    """
    np.random.seed(42)
    size = 1000
    a = np.random.randn(size)
    b = 3 * a + np.random.randn(size) * 0.01  # b is almost perfectly correlated with a
    c = np.random.randn(size)  # c is independent
    df = pd.DataFrame({'a': a, 'b': b, 'c': c})
    codeflash_output = correlation(df); result = codeflash_output # 395ms -> 452μs (87440% faster)

def test_correlation_large_constant_column():
    """
    Test large DataFrame with one constant column (should yield nan correlations)
    """
    size = 1000
    df = pd.DataFrame({'a': np.arange(size), 'b': np.full(size, 7)})
    codeflash_output = correlation(df); result = codeflash_output # 176ms -> 233μs (75492% faster)

def test_correlation_large_sparse_nan():
    """
    Test large DataFrame with many NaNs, but enough overlap to compute correlation
    """
    size = 1000
    a = np.arange(size, dtype=float)
    b = a * 2
    # Insert NaNs at random positions in b
    rng = np.random.default_rng(123)
    nan_indices = rng.choice(size, size // 2, replace=False)
    b[nan_indices] = np.nan
    df = pd.DataFrame({'a': a, 'b': b})
    codeflash_output = correlation(df); result = codeflash_output # 131ms -> 247μs (53046% faster)

def test_correlation_large_all_nan_overlap():
    """
    Test large DataFrame where columns never overlap on non-NaN values
    """
    size = 1000
    a = np.arange(size, dtype=float)
    b = np.full(size, np.nan)
    b[::2] = np.arange(size//2)
    a[::2] = np.nan
    df = pd.DataFrame({'a': a, 'b': b})
    codeflash_output = correlation(df); result = codeflash_output # 87.4ms -> 160μs (54255% faster)

def test_correlation_large_negative_correlation():
    """
    Test large DataFrame with perfect negative correlation
    """
    size = 1000
    a = np.linspace(0, 100, size)
    b = -a
    df = pd.DataFrame({'a': a, 'b': b})
    codeflash_output = correlation(df); result = codeflash_output # 175ms -> 229μs (76447% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

from typing import Tuple

import numpy as np
import pandas as pd
# imports
import pytest  # used for our unit tests
from src.numpy_pandas.dataframe_operations import correlation

# unit tests

# 1. Basic Test Cases

def test_correlation_identity():
    # Test that correlation of a column with itself is always 1 (except for all-NaN or constant columns)
    df = pd.DataFrame({
        "A": [1, 2, 3, 4, 5],
        "B": [5, 4, 3, 2, 1]
    })
    codeflash_output = correlation(df); result = codeflash_output # 965μs -> 215μs (349% faster)

def test_correlation_perfect_negative():
    # Test that two perfectly negatively correlated columns return -1
    df = pd.DataFrame({
        "A": [1, 2, 3, 4, 5],
        "B": [5, 4, 3, 2, 1]
    })
    codeflash_output = correlation(df); result = codeflash_output # 964μs -> 213μs (352% faster)

def test_correlation_perfect_positive():
    # Test that two perfectly positively correlated columns return 1
    df = pd.DataFrame({
        "A": [1, 2, 3, 4, 5],
        "B": [2, 4, 6, 8, 10]
    })
    codeflash_output = correlation(df); result = codeflash_output # 959μs -> 212μs (352% faster)

def test_correlation_zero():
    # Test a varying column against a constant one (zero variance; see test_correlation_constant_column, which expects NaN)
    df = pd.DataFrame({
        "A": [1, 2, 3, 4, 5],
        "B": [7, 7, 7, 7, 7]
    })
    codeflash_output = correlation(df); result = codeflash_output # 955μs -> 193μs (394% faster)

def test_correlation_non_numeric_ignored():
    # Test that non-numeric columns are ignored
    df = pd.DataFrame({
        "A": [1, 2, 3],
        "B": [4, 5, 6],
        "C": ["a", "b", "c"]
    })
    codeflash_output = correlation(df); result = codeflash_output # 927μs -> 226μs (311% faster)

# 2. Edge Test Cases

def test_correlation_empty_dataframe():
    # Test that an empty dataframe returns an empty result
    df = pd.DataFrame()
    codeflash_output = correlation(df); result = codeflash_output # 2.50μs -> 1.71μs (46.4% faster)

def test_correlation_single_row():
    # Test that a single-row dataframe returns NaN for all correlations
    df = pd.DataFrame({"A": [1], "B": [2]})
    codeflash_output = correlation(df); result = codeflash_output # 247μs -> 189μs (30.5% faster)
    for v in result.values():
        pass

def test_correlation_single_column():
    # Test that a single-column dataframe returns self-correlation as 1
    df = pd.DataFrame({"A": [1, 2, 3, 4]})
    codeflash_output = correlation(df); result = codeflash_output # 229μs -> 83.4μs (175% faster)

def test_correlation_all_nan_column():
    # Test that a column with all NaN values results in NaN correlations
    df = pd.DataFrame({"A": [np.nan, np.nan, np.nan], "B": [1, 2, 3]})
    codeflash_output = correlation(df); result = codeflash_output # 481μs -> 107μs (348% faster)

def test_correlation_some_nan_values():
    # Test that rows with NaN in any of the two columns are ignored for that pair
    df = pd.DataFrame({
        "A": [1, 2, np.nan, 4],
        "B": [2, np.nan, 6, 8]
    })
    codeflash_output = correlation(df); result = codeflash_output # 603μs -> 202μs (199% faster)

def test_correlation_constant_column():
    # Test that a column with constant values results in NaN correlation with any column
    df = pd.DataFrame({
        "A": [1, 1, 1, 1],
        "B": [2, 3, 4, 5]
    })
    codeflash_output = correlation(df); result = codeflash_output # 777μs -> 194μs (301% faster)

def test_correlation_insufficient_overlap():
    # Test that if two columns have no rows where both are non-NaN, result is NaN
    df = pd.DataFrame({
        "A": [1, np.nan, 3],
        "B": [np.nan, 2, np.nan]
    })
    codeflash_output = correlation(df); result = codeflash_output # 332μs -> 131μs (152% faster)

def test_correlation_mixed_types():
    # Test that mixed types (int, float) are handled correctly
    df = pd.DataFrame({
        "A": [1, 2.5, 3, 4.5],
        "B": [4, 5.5, 6, 7.5]
    })
    codeflash_output = correlation(df); result = codeflash_output # 779μs -> 200μs (289% faster)


def test_correlation_large_random():
    # Test correlation on a large random DataFrame
    np.random.seed(0)
    size = 1000
    data = {
        "A": np.random.randn(size),
        "B": np.random.randn(size),
        "C": np.random.randn(size)
    }
    df = pd.DataFrame(data)
    codeflash_output = correlation(df); result = codeflash_output # 396ms -> 503μs (78663% faster)
    # Self-correlations should be 1
    for col in ["A", "B", "C"]:
        pass
    # Cross-correlations should be close to 0 for independent random data
    for col1 in ["A", "B", "C"]:
        for col2 in ["A", "B", "C"]:
            if col1 != col2:
                pass

def test_correlation_large_perfect():
    # Test correlation on a large perfectly correlated DataFrame
    size = 1000
    x = np.arange(size)
    df = pd.DataFrame({
        "A": x,
        "B": 2 * x + 10
    })
    codeflash_output = correlation(df); result = codeflash_output # 176ms -> 267μs (66039% faster)

def test_correlation_large_nan_blocks():
    # Test large DataFrame with blocks of NaNs
    size = 1000
    a = np.arange(size, dtype=float)
    b = np.arange(size, dtype=float)
    a[:500] = np.nan  # First half NaN in A
    b[500:] = np.nan  # Second half NaN in B
    df = pd.DataFrame({"A": a, "B": b})
    codeflash_output = correlation(df); result = codeflash_output # 87.1ms -> 151μs (57578% faster)

def test_correlation_large_constant_column():
    # Test large DataFrame with a constant column
    size = 1000
    df = pd.DataFrame({
        "A": np.arange(size),
        "B": np.ones(size)
    })
    codeflash_output = correlation(df); result = codeflash_output # 267ms -> 226μs (117810% faster)

def test_correlation_large_sparse():
    # Test large DataFrame with many NaNs, but some overlap
    size = 1000
    a = np.full(size, np.nan)
    b = np.full(size, np.nan)
    # Only 10 overlapping values
    a[100:110] = np.arange(10)
    b[100:110] = np.arange(10, 20)
    df = pd.DataFrame({"A": a, "B": b})
    codeflash_output = correlation(df); result = codeflash_output # 44.8ms -> 209μs (21298% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
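The generated tests above contain no assertions of their own; they rely on comparing the output of the original and optimized functions. How Codeflash performs that comparison is not shown here. A minimal sketch of a NaN-aware comparison, assuming both results are dicts of floats (the helper name `results_match` is hypothetical):

```python
import math


def results_match(original: dict, optimized: dict, rel_tol: float = 1e-9) -> bool:
    """Return True if both dicts have the same keys and NaN-aware close values."""
    if original.keys() != optimized.keys():
        return False
    for key, expected in original.items():
        actual = optimized[key]
        if math.isnan(expected) or math.isnan(actual):
            # NaN never equals itself, so treat "both NaN" as a match explicitly.
            if not (math.isnan(expected) and math.isnan(actual)):
                return False
        elif not math.isclose(expected, actual, rel_tol=rel_tol, abs_tol=1e-12):
            return False
    return True
```

A tolerance-based check matters here because a vectorized computation can differ from a row-by-row one in the last few floating-point digits.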

To edit these changes, run `git checkout codeflash/optimize-correlation-mc5j6gnf` and push.

@codeflash-ai codeflash-ai bot added the ⚡️ codeflash Optimization PR opened by Codeflash AI label Jun 21, 2025
@codeflash-ai codeflash-ai bot requested a review from KRRT7 June 21, 2025 00:59
@KRRT7 KRRT7 closed this Jun 23, 2025
@codeflash-ai codeflash-ai bot deleted the codeflash/optimize-correlation-mc5j6gnf branch June 23, 2025 23:30