⚡️ Speed up function find_last_node by 18,839% #19

Closed

codeflash-ai[bot]

@codeflash-ai codeflash-ai bot commented Jun 21, 2025

📄 18,839% (188.39x) speedup for find_last_node in src/dsa/nodes.py

⏱️ Runtime : 130 milliseconds → 687 microseconds (best of 352 runs)
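The percentages on this page appear to follow (original − optimized) / optimized. That formula is inferred from the numbers shown here, not taken from Codeflash documentation, and the helper name `relative_gain` is hypothetical. A minimal sketch applying it to the rounded runtimes (130 ms originally, 687 μs after) and to one of the "slower" cases (667 ns originally, 833 ns after):

```python
def relative_gain(original_ns: float, optimized_ns: float) -> float:
    """Relative gain, assumed to be (original - optimized) / optimized."""
    return (original_ns - optimized_ns) / optimized_ns


# Headline figures, converted to nanoseconds: 130 ms and 687 us.
overall = relative_gain(130e6, 687e3)
print(f"{overall:.2f}x, {overall * 100:,.0f}%")

# A case reported as slower: 667 ns before, 833 ns after.
slower = relative_gain(667, 833)
print(f"{slower * 100:.1f}%")
```

With the rounded inputs the headline works out to roughly 188x, close to the reported 188.39x; the small gap is presumably because the displayed runtimes are rounded. A negative result corresponds to a case labeled "slower".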

📝 Explanation and details

Here’s an optimized version of your program. The key optimization is to create a set of all node IDs that appear as the `"source"` in the edge list, which allows constant-time lookups instead of repeatedly scanning the edge list for each node. This reduces overall time complexity from O(N*M) to O(N+M), where N is the number of nodes and M is the number of edges.

This simply builds the `source_ids` set once in O(M) time and then finds the first node not in that set in O(N) time. The function signature, output, and comment are unchanged.
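The optimized function body is not reproduced in this thread. The following is a minimal sketch of the approach described above, assuming nodes are dicts with an `"id"` key and edges are dicts with a `"source"` key, as in the generated tests. `find_last_node_naive` is a hypothetical stand-in for the original per-node scan, not the repository's code.

```python
def find_last_node_naive(nodes, edges):
    """Hypothetical baseline: rescan the edge list for every node, O(N*M)."""
    return next(
        (n for n in nodes if all(e["source"] != n["id"] for e in edges)),
        None,
    )


def find_last_node(nodes, edges):
    """Sketch of the set-based version: O(M) to build the set, O(N) to scan."""
    # Collect every ID that appears as an edge source, once.
    source_ids = {e["source"] for e in edges}
    # Return the first node with no outgoing edge, or None if every node has one.
    return next((n for n in nodes if n["id"] not in source_ids), None)


nodes = [{"id": "A"}, {"id": "B"}, {"id": "C"}]
edges = [{"source": "A", "target": "B"}, {"source": "B", "target": "C"}]
print(find_last_node(nodes, edges))
```

One difference to keep in mind with this sketch: the set requires every `"source"` value to be hashable, whereas a pairwise `!=` comparison does not.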

Correctness verification report:

| Test | Status |
| --- | --- |
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | ✅ 39 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |
🌀 Generated Regression Tests and Runtime
import pytest  # used for our unit tests
from src.dsa.nodes import find_last_node

# unit tests

# ---------------------------
# Basic Test Cases
# ---------------------------

def test_single_node_no_edges():
    # One node, no edges: node should be the last node
    nodes = [{"id": "A"}]
    edges = []
    codeflash_output = find_last_node(nodes, edges); result = codeflash_output # 1.62μs -> 1.38μs (18.2% faster)

def test_two_nodes_one_edge():
    # Two nodes, one edge from A to B: B should be the last node
    nodes = [{"id": "A"}, {"id": "B"}]
    edges = [{"source": "A", "target": "B"}]
    codeflash_output = find_last_node(nodes, edges); result = codeflash_output # 2.38μs -> 1.54μs (54.1% faster)

def test_three_nodes_linear_chain():
    # Linear chain A -> B -> C: C should be the last node
    nodes = [{"id": "A"}, {"id": "B"}, {"id": "C"}]
    edges = [{"source": "A", "target": "B"}, {"source": "B", "target": "C"}]
    codeflash_output = find_last_node(nodes, edges); result = codeflash_output # 3.08μs -> 1.62μs (89.7% faster)

def test_multiple_last_nodes_returns_first():
    # Two disconnected nodes: both are last nodes, should return the first one
    nodes = [{"id": "X"}, {"id": "Y"}]
    edges = []
    codeflash_output = find_last_node(nodes, edges); result = codeflash_output # 1.62μs -> 1.33μs (21.8% faster)

def test_branching_graph():
    # A -> B, A -> C: both B and C are last nodes, should return B (first in list)
    nodes = [{"id": "A"}, {"id": "B"}, {"id": "C"}]
    edges = [{"source": "A", "target": "B"}, {"source": "A", "target": "C"}]
    codeflash_output = find_last_node(nodes, edges); result = codeflash_output # 2.46μs -> 1.58μs (55.3% faster)

# ---------------------------
# Edge Test Cases
# ---------------------------

def test_empty_nodes_and_edges():
    # No nodes, no edges: should return None
    nodes = []
    edges = []
    codeflash_output = find_last_node(nodes, edges); result = codeflash_output # 667ns -> 833ns (19.9% slower)

def test_edges_with_nonexistent_nodes():
    # Edges reference nodes not in the node list: node should still be returned
    nodes = [{"id": "A"}]
    edges = [{"source": "B", "target": "C"}]
    codeflash_output = find_last_node(nodes, edges); result = codeflash_output # 1.88μs -> 1.46μs (28.6% faster)

def test_cycle_graph():
    # Cycle: A -> B -> C -> A, no last node, so should return None
    nodes = [{"id": "A"}, {"id": "B"}, {"id": "C"}]
    edges = [
        {"source": "A", "target": "B"},
        {"source": "B", "target": "C"},
        {"source": "C", "target": "A"}
    ]
    codeflash_output = find_last_node(nodes, edges); result = codeflash_output # 3.12μs -> 1.17μs (168% faster)

def test_all_nodes_are_sources():
    # All nodes are sources in at least one edge: no last node
    nodes = [{"id": "A"}, {"id": "B"}]
    edges = [{"source": "A", "target": "B"}, {"source": "B", "target": "A"}]
    codeflash_output = find_last_node(nodes, edges); result = codeflash_output # 2.46μs -> 1.17μs (111% faster)

def test_duplicate_node_ids():
    # Duplicate node IDs: function should return the first one that qualifies
    nodes = [{"id": "A"}, {"id": "A"}]
    edges = []
    codeflash_output = find_last_node(nodes, edges); result = codeflash_output # 1.62μs -> 1.38μs (18.2% faster)


def test_edge_with_extra_keys():
    # Edge with extra keys should not affect logic
    nodes = [{"id": "A"}, {"id": "B"}]
    edges = [{"source": "A", "target": "B", "weight": 3}]
    codeflash_output = find_last_node(nodes, edges); result = codeflash_output # 2.54μs -> 1.62μs (56.4% faster)

def test_node_with_non_string_id():
    # Node IDs can be non-string (e.g., int)
    nodes = [{"id": 1}, {"id": 2}]
    edges = [{"source": 1, "target": 2}]
    codeflash_output = find_last_node(nodes, edges); result = codeflash_output # 2.54μs -> 1.71μs (48.7% faster)

def test_edge_with_none_source():
    # Edge with None as source should not match any node
    nodes = [{"id": "A"}]
    edges = [{"source": None, "target": "A"}]
    codeflash_output = find_last_node(nodes, edges); result = codeflash_output # 2.00μs -> 1.50μs (33.3% faster)

def test_node_with_additional_keys():
    # Node with extra keys should be returned as-is
    nodes = [{"id": "A", "label": "Alpha"}]
    edges = []
    codeflash_output = find_last_node(nodes, edges); result = codeflash_output # 1.67μs -> 1.38μs (21.2% faster)

# ---------------------------
# Large Scale Test Cases
# ---------------------------

def test_large_linear_chain():
    # 1000 node linear chain: last node should be the last in the list
    N = 1000
    nodes = [{"id": str(i)} for i in range(N)]
    edges = [{"source": str(i), "target": str(i+1)} for i in range(N-1)]
    codeflash_output = find_last_node(nodes, edges); result = codeflash_output # 27.2ms -> 107μs (25286% faster)

def test_large_fully_disconnected_nodes():
    # 1000 nodes, no edges: first node should be returned
    N = 1000
    nodes = [{"id": str(i)} for i in range(N)]
    edges = []
    codeflash_output = find_last_node(nodes, edges); result = codeflash_output # 1.75μs -> 1.58μs (10.5% faster)

def test_large_branching_graph():
    # One root node with 999 outgoing edges to unique nodes
    N = 1000
    nodes = [{"id": "root"}] + [{"id": f"leaf{i}"} for i in range(1, N)]
    edges = [{"source": "root", "target": f"leaf{i}"} for i in range(1, N)]
    codeflash_output = find_last_node(nodes, edges); result = codeflash_output # 52.6μs -> 24.2μs (117% faster)

def test_large_cycle_graph():
    # 1000 node cycle: no last node, should return None
    N = 1000
    nodes = [{"id": str(i)} for i in range(N)]
    edges = [{"source": str(i), "target": str((i+1)%N)} for i in range(N)]
    codeflash_output = find_last_node(nodes, edges); result = codeflash_output # 27.0ms -> 107μs (25103% faster)

def test_large_graph_with_multiple_last_nodes():
    # 1000 nodes, only even-indexed nodes have outgoing edges to next node
    N = 1000
    nodes = [{"id": str(i)} for i in range(N)]
    edges = [{"source": str(i), "target": str(i+1)} for i in range(0, N-1, 2)]
    # Odd-indexed nodes have no outgoing edges; the first such node is "1"
    codeflash_output = find_last_node(nodes, edges); result = codeflash_output # 28.5μs -> 35.2μs (19.3% slower)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

import pytest  # used for our unit tests
from src.dsa.nodes import find_last_node

# unit tests

# ------------------------------
# Basic Test Cases
# ------------------------------

def test_single_node_no_edges():
    # Single node, no edges: should return the node itself
    nodes = [{"id": "A"}]
    edges = []
    codeflash_output = find_last_node(nodes, edges); result = codeflash_output # 1.62μs -> 1.38μs (18.2% faster)

def test_two_nodes_one_edge():
    # Two nodes, one edge from A to B: last node is B (no outgoing edges)
    nodes = [{"id": "A"}, {"id": "B"}]
    edges = [{"source": "A", "target": "B"}]
    codeflash_output = find_last_node(nodes, edges); result = codeflash_output # 2.46μs -> 1.54μs (59.5% faster)

def test_three_nodes_chain():
    # Three nodes in a chain: A -> B -> C, last node is C
    nodes = [{"id": "A"}, {"id": "B"}, {"id": "C"}]
    edges = [{"source": "A", "target": "B"}, {"source": "B", "target": "C"}]
    codeflash_output = find_last_node(nodes, edges); result = codeflash_output # 3.17μs -> 1.62μs (94.9% faster)

def test_multiple_terminal_nodes():
    # Two terminal nodes (no outgoing edges): should return the first one found
    nodes = [{"id": "A"}, {"id": "B"}, {"id": "C"}]
    edges = [{"source": "A", "target": "B"}]  # C has no outgoing edges, B has no outgoing edges
    codeflash_output = find_last_node(nodes, edges); result = codeflash_output # 2.42μs -> 1.50μs (61.1% faster)

def test_no_nodes():
    # No nodes at all: should return None
    nodes = []
    edges = []
    codeflash_output = find_last_node(nodes, edges); result = codeflash_output # 667ns -> 875ns (23.8% slower)

# ------------------------------
# Edge Test Cases
# ------------------------------

def test_cycle_graph():
    # Cycle: A -> B -> C -> A, no terminal node, should return None
    nodes = [{"id": "A"}, {"id": "B"}, {"id": "C"}]
    edges = [{"source": "A", "target": "B"}, {"source": "B", "target": "C"}, {"source": "C", "target": "A"}]
    codeflash_output = find_last_node(nodes, edges); result = codeflash_output # 3.08μs -> 1.25μs (147% faster)

def test_all_nodes_terminal():
    # All nodes have no outgoing edges: should return the first node
    nodes = [{"id": "X"}, {"id": "Y"}, {"id": "Z"}]
    edges = []
    codeflash_output = find_last_node(nodes, edges); result = codeflash_output # 1.67μs -> 1.38μs (21.2% faster)

def test_disconnected_nodes():
    # Some nodes are not connected at all: should still return the first node with no outgoing edges (B, which precedes the disconnected C)
    nodes = [{"id": "A"}, {"id": "B"}, {"id": "C"}]
    edges = [{"source": "A", "target": "B"}]  # C is disconnected, B is terminal
    codeflash_output = find_last_node(nodes, edges); result = codeflash_output # 2.38μs -> 1.50μs (58.3% faster)

def test_edge_with_nonexistent_source():
    # Edge with a source node not in nodes: should not affect result
    nodes = [{"id": "A"}, {"id": "B"}]
    edges = [{"source": "X", "target": "A"}]  # X not in nodes
    codeflash_output = find_last_node(nodes, edges); result = codeflash_output # 1.88μs -> 1.50μs (25.0% faster)

def test_edge_with_nonexistent_target():
    # Edge with a target node not in nodes: should not affect last node calculation
    nodes = [{"id": "A"}, {"id": "B"}]
    edges = [{"source": "A", "target": "X"}]  # X not in nodes
    codeflash_output = find_last_node(nodes, edges); result = codeflash_output # 2.38μs -> 1.54μs (54.0% faster)

def test_duplicate_edges():
    # Multiple edges from the same node: still, only nodes with no outgoing edges are terminal
    nodes = [{"id": "A"}, {"id": "B"}, {"id": "C"}]
    edges = [
        {"source": "A", "target": "B"},
        {"source": "A", "target": "C"}
    ]
    codeflash_output = find_last_node(nodes, edges); result = codeflash_output # 2.50μs -> 1.58μs (57.8% faster)

def test_empty_edges_list():
    # All nodes are terminal if no edges
    nodes = [{"id": "A"}, {"id": "B"}]
    edges = []
    codeflash_output = find_last_node(nodes, edges); result = codeflash_output # 1.62μs -> 1.38μs (18.2% faster)

def test_empty_nodes_nonempty_edges():
    # No nodes, but edges present: should return None
    nodes = []
    edges = [{"source": "A", "target": "B"}]
    codeflash_output = find_last_node(nodes, edges); result = codeflash_output # 708ns -> 959ns (26.2% slower)

def test_node_with_self_loop():
    # Node with a self-loop is not terminal
    nodes = [{"id": "A"}, {"id": "B"}]
    edges = [{"source": "A", "target": "A"}]
    codeflash_output = find_last_node(nodes, edges); result = codeflash_output # 2.38μs -> 1.50μs (58.3% faster)

def test_nodes_with_additional_fields():
    # Nodes with extra fields should be returned in full
    nodes = [{"id": "A", "type": "start"}, {"id": "B", "type": "end"}]
    edges = [{"source": "A", "target": "B"}]
    codeflash_output = find_last_node(nodes, edges); result = codeflash_output # 2.33μs -> 1.50μs (55.6% faster)

# ------------------------------
# Large Scale Test Cases
# ------------------------------

def test_large_linear_chain():
    # Large chain: 1000 nodes, each points to the next
    nodes = [{"id": str(i)} for i in range(1000)]
    edges = [{"source": str(i), "target": str(i+1)} for i in range(999)]
    codeflash_output = find_last_node(nodes, edges); result = codeflash_output # 27.1ms -> 107μs (25205% faster)

def test_large_star_graph():
    # Star graph: one central node with edges to all others
    nodes = [{"id": "center"}] + [{"id": f"N{i}"} for i in range(1, 1000)]
    edges = [{"source": "center", "target": f"N{i}"} for i in range(1, 1000)]
    codeflash_output = find_last_node(nodes, edges); result = codeflash_output # 53.8μs -> 24.3μs (121% faster)

def test_large_disconnected_graph():
    # 500 isolated nodes, 500 in a chain
    nodes = [{"id": f"I{i}"} for i in range(500)] + [{"id": f"C{i}"} for i in range(500)]
    edges = [{"source": f"C{i}", "target": f"C{i+1}"} for i in range(499)]
    codeflash_output = find_last_node(nodes, edges); result = codeflash_output # 27.8μs -> 32.3μs (14.1% slower)

def test_large_graph_with_multiple_terminals():
    # 900 nodes in a chain, 100 isolated
    nodes = [{"id": f"C{i}"} for i in range(900)] + [{"id": f"T{i}"} for i in range(100)]
    edges = [{"source": f"C{i}", "target": f"C{i+1}"} for i in range(899)]
    codeflash_output = find_last_node(nodes, edges); result = codeflash_output # 21.7ms -> 99.6μs (21684% faster)

def test_large_graph_all_cyclic():
    # 1000 nodes in a cycle, so no terminal node
    nodes = [{"id": str(i)} for i in range(1000)]
    edges = [{"source": str(i), "target": str((i+1)%1000)} for i in range(1000)]
    codeflash_output = find_last_node(nodes, edges); result = codeflash_output # 26.9ms -> 107μs (24889% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
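
A few cases above are reported as slower after the change. One plausible reading, assuming the original was a per-node scan that stops at the first matching edge, is that when a terminal node sits near the front of the list the scan finishes after about one pass over the edges, while the set-based version always pays for a full pass plus hashing. The sketch below counts edge visits rather than timing anything; both helper names are hypothetical.

```python
def naive_edge_visits(nodes, edges):
    """Edge comparisons made by a short-circuiting per-node scan (assumed baseline)."""
    visits = 0
    for node in nodes:
        for edge in edges:
            visits += 1
            if edge["source"] == node["id"]:
                break  # node has an outgoing edge; move on to the next node
        else:
            return visits  # no edge matched: this node is the answer
    return visits


def set_edge_visits(nodes, edges):
    """Building the source set touches every edge exactly once."""
    return len(edges)


N = 1000
nodes = [{"id": str(i)} for i in range(N)]
# Linear chain: every node but the last is a source.
chain = [{"source": str(i), "target": str(i + 1)} for i in range(N - 1)]
# Only even-indexed nodes are sources, so node "1" is already terminal.
evens = [{"source": str(i), "target": str(i + 1)} for i in range(0, N - 1, 2)]

print(naive_edge_visits(nodes, chain), set_edge_visits(nodes, chain))
print(naive_edge_visits(nodes, evens), set_edge_visits(nodes, evens))
```

For the chain the scan makes hundreds of thousands of comparisons against 999 set insertions, which matches the large speedups; for the even-sources graph the two counts are nearly equal, so the per-edge hashing cost can tip the result the other way.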

To edit these changes, `git checkout codeflash/optimize-find_last_node-mc5hjaqi` and push.

@codeflash-ai codeflash-ai bot added the ⚡️ codeflash Optimization PR opened by Codeflash AI label Jun 21, 2025
@codeflash-ai codeflash-ai bot requested a review from KRRT7 June 21, 2025 00:13
@KRRT7 KRRT7 closed this Jun 23, 2025
@codeflash-ai codeflash-ai bot deleted the codeflash/optimize-find_last_node-mc5hjaqi branch June 23, 2025 23:30