[ENH]: parallel implementation of is_reachable() with numpy arrays. #119

Open · wants to merge 5 commits into main

Conversation

akshitasure12 (Contributor):

  • Adds a NumPy adjacency-matrix based implementation of is_reachable(), yielding better speedups (ref comment).
  • Uses shared memory via joblib's memmapping to avoid passing G across multiple cores (see the sketch below).
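A minimal sketch of that approach, assuming a dense adjacency matrix and joblib's default memmapping behaviour (the helper _chunk_check and the chunking below are illustrative, not the PR's actual code):

import networkx as nx
import numpy as np
from joblib import Parallel, delayed

def _chunk_check(adjM, chunk, s, t):
    # Placeholder for the PR's per-chunk two-neighborhood test; here it only
    # looks for a length-2 path from s to t through a node in the chunk.
    return any(adjM[s, v] and adjM[v, t] for v in chunk)

G = nx.algorithms.tournament.random_tournament(400, seed=1)
adjM = nx.to_numpy_array(G, dtype=bool)  # dense adjacency matrix, built once
node_chunks = np.array_split(np.arange(adjM.shape[0]), 4)

# When an array argument exceeds max_nbytes (default "1M"), joblib dumps it to
# disk and passes each worker a read-only memmap instead of pickling it.
results = Parallel(n_jobs=4, max_nbytes="1M")(
    delayed(_chunk_check)(adjM, chunk, 0, 399) for chunk in node_chunks
)
print(any(results))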

Schefflera-Arboricola (Member) commented Jun 10, 2025

Something along similar lines (in terms of enhancing the tournament algorithms): we could also consider reusing the pool of workers for tournament_is_strongly_connected (in line 80), which might show some further speedups. However, all our functions are currently wrapped in _configure_if_nx_active, which is essentially a joblib.parallel_config context, so this would require changing how _configure_if_nx_active currently works.

PS: this doesn't have to be handled in this PR; it's just something we can think about in the long run.
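For reference, a minimal joblib illustration (not nx-parallel code) of the worker-pool reuse mentioned above: every call made inside the with Parallel(...) block is executed by the same worker processes, so back-to-back parallel calls avoid restarting the pool each time.

from joblib import Parallel, delayed

def square(x):
    return x * x

with Parallel(n_jobs=2) as parallel:
    # Both calls below share the same pool of two workers.
    first = parallel(delayed(square)(i) for i in range(10))
    second = parallel(delayed(square)(i) for i in range(10, 20))
print(first[-1], second[-1])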

Code context from the diff:

    return all(tnc)

def is_closed(G, nodes):
    return all(v in G[u] for u in set(G) - nodes for v in nodes)

def is_closed(adjM, nodes):
A Member commented on this diff:

Try checking the speed with and without this function. That is, inline the contents of the function into line 42 so Python can avoid a function call.

Another possible speed-up could be to put the simple checks first in the compound logical statement in line 42. That should avoid the is_closed code being run at all unless s in S and t not in S.

Do we know where this is taking most of its time? I was just guessing with these suggestions based on the for-loop structure. Using the IPython tool %prun, you might be able to tell where the code spends the most time, but I don't know how that works with parallel joblib. Maybe try it with 1 CPU or with the parallel code part removed.
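To illustrate the ordering suggestion, a hedged sketch (the helpers two_neighborhood and is_closed stand in for the PR's actual functions): with Python's short-circuiting "and", the cheap membership tests guard the expensive is_closed call.

def chunk_check(adjM, chunk, s, t, two_neighborhood, is_closed):
    # Hypothetical per-chunk loop, showing only the ordering: the membership
    # tests run first, so is_closed() is skipped unless s is in S and t is not.
    for v in chunk:
        S = two_neighborhood(adjM, v)
        if s in S and t not in S and is_closed(adjM, S):
            return True
    return False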

akshitasure12 (Contributor, Author) commented Jul 1, 2025

> Try checking the speed with and without this function. That is, inline the contents of the function into line 42 so Python can avoid a function call.

I tried inlining this logic at line 42, but it didn't result in any noticeable speedup, so I've kept it as a separate function for better readability. Please let me know if you think otherwise.

> Do we know where this is taking most of its time? I was just guessing with these suggestions based on the for-loop structure. Using the IPython tool %prun, you might be able to tell where the code spends the most time, but I don't know how that works with parallel joblib. Maybe try it with 1 CPU or with the parallel code part removed.

I removed the parallel execution and ran %prun using the following input:

G = nx.algorithms.tournament.random_tournament(1600)
s, t = 0, 1599

A summary of the output:

   ncalls   tottime  percall  cumtime  percall filename:lineno(function)
     3/1     0.000    0.000    5.383    5.383   {built-in method builtins.exec}
       1     0.000    0.000    5.383    5.383   <string>:1(<module>)
       1     0.000    0.000    5.383    5.383   decorators.py:17(wrapper)
       1     0.000    0.000    5.383    5.383   tournament.py:14(is_reachable)
       1     0.728    0.728    4.324    4.324   tournament.py:30(two_neighborhood_close)
 1279201     0.296    0.000    2.851    0.000   {built-in method builtins.any}
 6404724     0.878    0.000    2.674    0.000   tournament.py:40(<genexpr>)
 10230374    2.418    0.000    2.418    0.000   memmap.py:357(__getitem__)
       1     0.000    0.000    1.056    1.056   decorators.py:783(func)
       1     0.000    0.000    1.056    1.056   <class 'networkx.utils.decorators.argmap'> compilation 21:1(argmap_to_numpy_array_18)
       1     0.000    0.000    1.042    1.042   backends.py:538(_call_if_any_backends_installed)
       1     0.006    0.006    1.042    1.042   backends.py:1460(_call_with_backend)
       1     0.484    0.484    1.035    1.035   convert_matrix.py:881(to_numpy_array)

It is evident from this that the bulk of the time is spent computing two-hop neighborhoods and on memory access through the memmap. I've tried the other alternatives you mentioned, but this version seems to perform the best.
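For reference, roughly the same profile can be collected outside IPython with cProfile (assuming the sequential code path, i.e. with the parallel part removed, as described above):

import cProfile
import pstats

import networkx as nx

G = nx.algorithms.tournament.random_tournament(1600)
s, t = 0, 1599

# Profile the call and print the 15 most expensive functions by cumulative time.
cProfile.run("nx.tournament.is_reachable(G, s, t)", "is_reachable.prof")
pstats.Stats("is_reachable.prof").sort_stats("cumulative").print_stats(15)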

akshitasure12 commented Jul 18, 2025

Hi @Schefflera-Arboricola and @dschult, I'm trying to run asv benchmarks for is_reachable from my branch, but I keep running into a numpy import error. Can you help me out with this? I suspect it might be because NumPy isn't a default dependency. I tried adding it to pyproject.toml, but that didn't resolve the issue.

dschult commented Jul 18, 2025

How is your environment set up? "conda/mamba" or "venv + pip"?
What script/incantation are you using? Can you give me enough info to recreate the issue locally? (I have your branch locally)

akshitasure12 commented Jul 19, 2025

Apologies for the oversight! I was able to figure it out from the documentation. Thank you @dschult!

akshitasure12 commented:

I tried out the following benchmarks:

  1. Compared networkx's recent change to the numpy implementation of is_reachable() on this branch.
    Results:
========== ============ ========== =========== ============ ============
--                                   num_nodes                          
---------- -------------------------------------------------------------
backend        50         100         200         400          800     
========== ============ ========== =========== ============ ============
parallel   6.19±0.4ms   14.5±2ms    44.7±1ms    166±1ms      655±2ms   
networkx    18.5±3ms    74.3±3ms   288±0.7ms   1.13±0.01s   4.53±0.03s 
========== ============ ========== =========== ============ ============
  2. Compared networkx's recent change to a parallel version of the same (i.e., parallelizing the pure-Python NetworkX implementation, without NumPy).
    Results:
========== ============ ========== ============ ========== ============
--                                  num_nodes                          
---------- ------------------------------------------------------------
backend        50         100         200         400         800     
========== ============ ========== ============ ========== ============
parallel   1.15±0.1ms   5.00±1ms   17.5±0.3ms   67.8±1ms   270±0.7ms  
networkx    18.3±1ms    73.7±1ms    289±3ms     1.12±0s    4.59±0.03s 
========== ============ ========== ============ ========== ============

The pure-Python implementation performs better. I haven't committed the pure-Python code yet because I wanted to confirm a few things first. Do these benchmarks imply that converting the graph into a numpy matrix is more expensive than passing the whole graph to each core? Since n_jobs is None by default for these benchmarks, I'm unsure how we are seeing the above speedups.

akshitasure12 commented:

The pure-Python parallel implementation after the recent NetworkX commit, for reference:

# Body of the parallel is_reachable(G, s, t, get_chunks="chunks"); s, t and
# get_chunks come from the enclosing function's signature.
def two_neighborhood_close(G, chunk):
    for v in chunk:
        v_adj = G._adj[v]
        # S = nodes reachable from v in at most two steps
        S = {
            x
            for x, x_pred in G._pred.items()
            if x == v or x in v_adj or any(z in v_adj for z in x_pred)
        }
        if s in S and t not in S and is_closed(G, S):
            return True
    return False

def is_closed(G, S):
    # S is closed if every node outside S has an edge to every node in S.
    return all(u in S or all(v in unbrs for v in S) for u, unbrs in G._adj.items())

if hasattr(G, "graph_object"):
    G = G.graph_object

n_jobs = nxp.get_n_jobs()

if get_chunks == "chunks":
    node_chunks = nxp.chunks(G, n_jobs)
else:
    node_chunks = get_chunks(G)

return not any(
    Parallel()(delayed(two_neighborhood_close)(G, chunk) for chunk in node_chunks)
)

dschult commented Jul 19, 2025

> Do these benchmarks imply that converting the graph into a numpy matrix is more expensive than passing the whole graph to each core? Since n_jobs is None by default for these benchmarks, I'm unsure how we are seeing the above speedups.

Which version of the config is used when n_jobs is None? That is, how many jobs are you actually running? The last benchmark, parallel pure Python vs. sequential Python, looks like parallel is about 16 times faster. So that is either 16 cores, or 4 cores with a quadratic speed-up. I don't see any quadratic speed-up looking at the code, but it is possible I am missing something.

Can you point me to your benchmarking code? Is it already merged in nx-parallel, or is it only on your local machine?

akshitasure12 commented:

> That is, how many jobs are you actually running?

I didn't modify the number of jobs anywhere in the codebase, so joblib's default of n_jobs=None is being used, which means one core. I also verified this by checking the Activity Monitor while running the benchmark: the CPU usage confirmed that only a single core was active during the run.
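A quick way to confirm joblib's default behaviour (an illustration, not part of the PR): Parallel(n_jobs=None) runs the tasks sequentially in the calling process, while an explicit n_jobs spreads them over worker processes.

import os
from joblib import Parallel, delayed

def worker_pid(_):
    return os.getpid()

sequential = set(Parallel(n_jobs=None)(delayed(worker_pid)(i) for i in range(8)))
parallel = set(Parallel(n_jobs=4)(delayed(worker_pid)(i) for i in range(8)))
print(len(sequential), len(parallel))  # typically 1 vs. up to 4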

> Can you point me to your benchmarking code? Is it already merged in nx-parallel, or is it only on your local machine?

The local changes I made were as follows (a rough sketch of the resulting benchmark is included after this list):

  • replacing
     _ = nx.tournament.is_reachable(G, 1, num_nodes, backend=backend)
    
    with:
    _ = nx.tournament.is_reachable(G, 0, num_nodes - 1, backend=backend)
    
    to prevent the target node from going out of bounds.
  • adding "networkx" to the list of backends in common.py.
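A rough sketch of the kind of parameterized asv benchmark this describes (class and method names are illustrative; the actual nx-parallel benchmark file may differ):

import networkx as nx

class TournamentIsReachable:
    # asv times every combination of these parameters.
    params = [["parallel", "networkx"], [50, 100, 200, 400, 800]]
    param_names = ["backend", "num_nodes"]

    def setup(self, backend, num_nodes):
        self.G = nx.algorithms.tournament.random_tournament(num_nodes, seed=42)

    def time_is_reachable(self, backend, num_nodes):
        # Mirrors the call shown above, keeping the target node in bounds.
        _ = nx.tournament.is_reachable(self.G, 0, num_nodes - 1, backend=backend)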
