63 changes: 29 additions & 34 deletions doc/sphinx/source/n3fit/hyperopt.rst
@@ -115,8 +115,6 @@ All loss functions implemented in ``n3fit`` for the optimization of hyperparamet
of all partitions as if they were equivalent.
When they are not equivalent, the ``weight`` flag should be used (see :ref:`hyperoptrc-label`).



- Not all datasets should enter a partition: beware of extrapolation.

Beyond the last dataset that has entered the fit we find ourselves in what is usually known as
@@ -465,21 +463,9 @@ The optimal approach for this combination is still under development.
All the above options are implemented in the :class:`~n3fit.hyper_optimization.rewards.HyperLoss` class
which is instantiated and monitored within the :meth:`~n3fit.model_trainer.ModelTrainer.hyperparametrizable` method.

Restarting hyperoptimization runs
---------------------------------

In addition to the ``tries.json`` files, hyperparameter scans also produce ``tries.pkl`` `pickle <https://docs.python.org/3/library/pickle.html>`_ files,
which are located in the same directory as the corresponding ``tries.json`` file.
The generated ``tries.pkl`` file stores the complete history of a previous hyperoptimization run, making it possible to resume the process using the ``hyperopt`` framework.
To achieve this, use the ``--restart`` option of the ``n3fit`` command, for example:

.. code-block:: bash

n3fit runcard.yml 1 -r 10 --hyperopt 20 --restart

The above command is effective when the number of trials saved in ``test_run/nnfit/replica_1/tries.pkl`` is
less than ``20``. If there are ``20`` or more saved trials, ``n3fit`` will simply terminate and display the best results.

Note that a second hyperoptimization run with the same settings will always try to restart the previous one.
If a completely new hyperoptimization is to be run, it is necessary to either remove the previous one or change
the runcard name.
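
To check how many trials a previous run has already stored, the ``tries.pkl`` file can be inspected directly.
The following is a minimal sketch, assuming (as above) an output folder named ``test_run`` and an environment
where ``n3fit`` and ``hyperopt`` are installed, since unpickling requires the original classes:

.. code-block:: python

    # Minimal sketch: count the trials saved by a previous serial run.
    # Assumes tries.pkl stores the pickled hyperopt Trials object.
    import pickle

    with open("test_run/nnfit/replica_1/tries.pkl", "rb") as f:
        trials = pickle.load(f)

    # Trials objects expose the list of evaluated trials as .trials
    print(f"{len(trials.trials)} trials already saved")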

Running hyperoptimizations in parallel with MongoDB
---------------------------------------------------
@@ -489,28 +475,37 @@ This functionality is provided by the :class:`~n3fit.hyper_optimization.mongofil
which extends the capabilities of `hyperopt <https://github.com/hyperopt/hyperopt>`_'s `MongoTrials` and enables the
simultaneous evaluation of multiple trials.

To run a parallelized hyperopt search, use the following command:
Note that the parallelization capabilities of hyperopt expose MongoDB so that it can be accessed (on the given port)
by an external computer.
The most common situation is a node in a cluster acting as the server while all other jobs connect to it.
It is also possible to run all jobs on the same node, or across the internet, with the appropriate ``--db-host`` and ``--db-port`` options.

To run a parallelized hyperopt search, use the following command to run each of the parallel jobs:

.. code-block:: bash

n3fit hyper-quickcard.yml 1 -r N_replicas --hyperopt N_trials --parallel-hyperopt --num-mongo-workers N
n3fit <runcard>.yml 1 -r N_replicas --hyperopt N_trials --parallel-hyperopt

Each job will look for the ``<runcard>/nnfit/replica_1/hyperopt-db`` folder.
If it exists, the job will try to connect to the database; otherwise it will assume that no database
is currently running and will start a ``mongod`` instance for other jobs to connect to, before spawning its own worker.
Here, ``N`` represents the number of MongoDB workers to be launched in parallel.
Each mongo worker handles one trial in hyperopt, so launching more workers allows more trials to be evaluated simultaneously.
Note that there is no need to manually launch MongoDB databases or mongo workers prior to using ``n3fit``.
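
As an illustration, the following is a minimal sketch of launching two such jobs by hand; the
hostname ``node001`` is a placeholder, and the delay between submissions guards against the
race condition discussed below:

.. code-block:: bash

    # First job, run on node001: finds no hyperopt-db folder and starts
    # its own mongod instance before launching its worker
    n3fit runcard.yml 1 -r 4 --hyperopt 100 --parallel-hyperopt

    # Second job, submitted with some delay from any node that can reach
    # node001: connects to the already running database
    n3fit runcard.yml 1 -r 4 --hyperopt 100 --parallel-hyperopt \
        --db-host node001 --db-port 27017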

.. note::
    Counting the number of trials in parallel runs can be tricky.
    The total number of trials that ``n3fit`` will try to run is given by the ``N_trials`` variable,
    and as soon as it is reached, the jobs that were given that number will finish.
    However, the database will not exit as long as there are workers connected to it, so if the first job is set
    to run for 3 trials while the second is set to run for 10, the first job will stay idle until the second one
    completes (by itself) trials 4 to 10.

There is no need to manually launch MongoDB databases or mongo workers prior to using ``n3fit``,
as the ``mongod`` and ``hyperopt-mongo-worker`` commands are automatically executed
by :meth:`~n3fit.hyper_optimization.mongofiletrials.MongodRunner.start` and
:meth:`~n3fit.hyper_optimization.mongofiletrials.MongoFileTrials.start_mongo_workers` methods, respectively.
By default, the ``host`` and ``port`` arguments are set to ``localhost`` and ``27017``. The database is named ``hyperopt-db-output_name``, where
``output_name`` is set to the name of the runcard. If the ``n3fit -o OUTPUT`` option is provided, ``output_name`` is set to ``OUTPUT``, with the database being referred to as ``hyperopt-db-OUTPUT``.
If necessary, it is possible to modify all the above settings using the ``n3fit --db-host``, ``n3fit --db-port`` and ``n3fit --db-name`` options.

To resume a hyperopt experiment, add the ``--restart`` option to the ``n3fit`` command:

.. code-block:: bash

n3fit hyper-quickcard.yml 1 -r N_replicas --hyperopt N_trials --parallel-hyperopt --num-mongo-workers N --restart

Note that, unlike in serial execution, parallel hyperoptimization runs do not generate ``tries.pkl`` files.
Instead, MongoDB databases are saved as ``hyperopt-db-output_name.tar.gz`` files inside the ``replica_path`` directory.
These are conveniently extracted for reuse in restart runs.
Beware, however, of race conditions when starting the database.
Make sure the second job is submitted with some delay with respect to the first.
By default, the ``port`` and ``host`` arguments are set to ``27017`` and to the hostname of the first job, respectively.
The database is named ``hyperopt-db`` and stored in the ``replica_1`` folder of the current run.
Note that the database is not uploaded to the server when using ``vp-upload`` unless ``--upload-db`` is explicitly specified.
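
For instance, assuming the output folder of the run is called ``test_run``, a sketch of uploading the
results together with the database would be:

.. code-block:: bash

    # --upload-db must be passed explicitly for the database to be included
    vp-upload test_run --upload-db
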
16 changes: 14 additions & 2 deletions n3fit/src/n3fit/hyper_optimization/filetrials.py
@@ -1,7 +1,19 @@
"""
Custom hyperopt trial object for persistent file storage
in the form of json and pickle files within the nnfit folder

The trials can be easily accessed with ``json``:

Example
-------
>>> import json, glob
>>> for jsonfile in glob.glob("*/nnfit/replica_1/tries.json"):
...     name = jsonfile.split("/")[0]
...     trials = json.load(open(jsonfile, "r"))
...     print(f"{len(trials)} trials for {name}")

"""

import json
import logging
import pickle
Expand Down
186 changes: 85 additions & 101 deletions n3fit/src/n3fit/hyper_optimization/hyper_scan.py
@@ -13,14 +13,14 @@
(and, of course the function in the fitting action that calls the minimization).
"""

import contextlib
import copy
import logging
import os

import hyperopt
from hyperopt.pyll.base import scope
import numpy as np

import hyperopt
from n3fit.backends import MetaLayer, MetaModel
from n3fit.hyper_optimization.filetrials import FileTrials

@@ -101,6 +101,8 @@ def optimizer_arg_wrapper(hp_key, option_dict):
choice = hp_uniform(hp_key, min_lr, max_lr)
elif sampling == "log":
choice = hp_loguniform(hp_key, min_lr, max_lr)
else:
raise ValueError(f"Sampling {sampling} not understood")
return choice


@@ -129,60 +131,54 @@ def hyper_scan_wrapper(replica_path_set, model_trainer, hyperscanner, max_evals=
# Tell the trainer we are doing hyperopt
model_trainer.set_hyperopt(True, keys=hyperscanner.hyper_keys)

if hyperscanner.restart_hyperopt:
# For parallel hyperopt restarts, extract the database tar file
# Prepare the context manager in the parallel case (and an empty one otherwise)
if hyperscanner.parallel_hyperopt:
runner_ctx = hyperscanner.mongod_runner
# Upon entering the context it will check whether the database is up
# and, if not, it will start it
else:
runner_ctx = contextlib.nullcontext()

with runner_ctx:
# Generate the trials object, as a MongoFileTrials or a simple sequential FileTrials
if hyperscanner.parallel_hyperopt:
tar_file_to_extract = f"{replica_path_set}/{hyperscanner.db_name}.tar.gz"
log.info("Restarting hyperopt run using the MongoDB database %s", tar_file_to_extract)
MongoFileTrials.extract_mongodb_database(tar_file_to_extract, path=os.getcwd())
# Instantiate `MongoFileTrials` as trials to give to the worker later
trials = MongoFileTrials(
replica_path_set,
hyperscanner.mongod_runner,
num_workers=1, # Only one worker per n3fit job will run
parameters=hyperscanner.as_dict(),
)
else:
# If we are not running in parallel, check whether there's a pickle to load and restart
# For sequential hyperopt restarts, reset the state of `FileTrials` saved in the pickle file
pickle_file_to_load = f"{replica_path_set}/tries.pkl"
log.info("Restarting hyperopt run using the pickle file %s", pickle_file_to_load)
trials = FileTrials.from_pkl(pickle_file_to_load)
pickle_file_to_load = replica_path_set / "tries.pkl"
if pickle_file_to_load.exists():
log.info("Restarting hyperopt run using the pickle file %s", pickle_file_to_load)
trials = FileTrials.from_pkl(pickle_file_to_load)
else:
# Instantiate `FileTrials`
trials = FileTrials(replica_path_set, parameters=hyperscanner.as_dict())

# Initialize seed for hyperopt
trials.rstate = np.random.default_rng(HYPEROPT_SEED)
# And prepare the generic arguments to fmin
fmin_args = {
"fn": model_trainer.hyperparametrizable,
"space": hyperscanner.as_dict(),
"algo": hyperopt.tpe.suggest,
"max_evals": max_evals,
"trials": trials,
"rstate": trials.rstate,
}

if hyperscanner.parallel_hyperopt:
# start MongoDB database by launching `mongod`
hyperscanner.mongod_runner.ensure_database_dir_exists()
mongod = hyperscanner.mongod_runner.start()

# Generate the trials object
if hyperscanner.parallel_hyperopt:
# Instantiate `MongoFileTrials`
# Mongo database should have already been initiated at this point
trials = MongoFileTrials(
replica_path_set,
db_host=hyperscanner.db_host,
db_port=hyperscanner.db_port,
db_name=hyperscanner.db_name,
num_workers=hyperscanner.num_mongo_workers,
parameters=hyperscanner.as_dict(),
)
else:
# Instantiate `FileTrials`
trials = FileTrials(replica_path_set, parameters=hyperscanner.as_dict())

# Initialize seed for hyperopt
trials.rstate = np.random.default_rng(HYPEROPT_SEED)

# Call to hyperopt.fmin
fmin_args = dict(
fn=model_trainer.hyperparametrizable,
space=hyperscanner.as_dict(),
algo=hyperopt.tpe.suggest,
max_evals=max_evals,
trials=trials,
rstate=trials.rstate,
)
if hyperscanner.parallel_hyperopt:
trials.start_mongo_workers()
hyperopt.fmin(**fmin_args, show_progressbar=True, max_queue_len=trials.num_workers)
trials.stop_mongo_workers()
# stop mongod command and compress database
hyperscanner.mongod_runner.stop(mongod)
trials.compress_mongodb_database()
else:
hyperopt.fmin(**fmin_args, show_progressbar=False, trials_save_file=trials.pkl_file)
if hyperscanner.parallel_hyperopt:
trials.start_mongo_workers()
# TODO benchmark how the behaviour depends on max_queue_len (if it does)
hyperopt.fmin(**fmin_args, show_progressbar=True, max_queue_len=12)
trials.stop_mongo_workers()
else:
hyperopt.fmin(**fmin_args, show_progressbar=False, trials_save_file=trials.pkl_file)


class ActivationStr:
@@ -212,56 +208,47 @@ class HyperScanner:
It takes care of known correlations between parameters by tying them together
It also provides methods for updating the parameter dictionaries after using hyperopt

It takes as inpujt the dictionaries defining the NN/fit and the hyperparameter scan
It takes as input the dictionaries defining the NN/fit and the hyperparameter scan
from the NNPDF runcard and substitutes in `parameters` samplers according to the
`hyper_scan` dictionary.

In the sampling dict,


Parameters
----------
`parameters`: dict
the `fitting[parameters]` dictionary of the NNPDF runcard
`sampling_dict`: dict
the `hyperscan` dictionary of the NNPDF runcard defining the search space of the scan
`steps`: int
when taking discrete steps between two parameters, number of steps to take

# Arguments:
- `parameters`: the `fitting[parameters]` dictionary of the NNPDF runcard
- `sampling_dict`: the `hyperscan` dictionary of the NNPDF runcard defining
the search space of the scan
- `steps`: when taking discrete steps between two parameters, number of steps
to take

# Parameters accepted by `sampling_dict`:
- `stopping`:
- min_epochs, max_epochs
- min_patience, max_patience
"""

def __init__(self, parameters, sampling_dict, steps=5):
def __init__(
self, parameters, sampling_dict, steps=5, db_host=None, db_port=None, db_path=None
):
self._original_parameters = parameters
self.parameter_keys = parameters.keys()
self.parameters = copy.deepcopy(parameters)
self.steps = steps

# adding extra options for restarting
restart_config = sampling_dict.get("restart")
self.restart_hyperopt = True if restart_config else False

# adding extra options for parallel execution
parallel_config = sampling_dict.get("parallel")
if parallel_config is None:
self.parallel_hyperopt = False
elif _has_pymongo:
self._db_path = db_path
self._db_host = db_host
self._db_port = db_port
self.mongod_runner = None
self.parallel_hyperopt = False

if db_path is not None:
# If we get a db_path, assume we want to run in parallel, therefore check whether we can
if not _has_pymongo:
raise ModuleNotFoundError(
"Could not import pymongo modules, please install with `.[parallelhyperopt]`"
)
self.parallel_hyperopt = True
else:
raise ModuleNotFoundError(
"Could not import pymongo modules, please install with `.[parallelhyperopt]`"
)

self.parallel_hyperopt = True if parallel_config else False

# setting up MondoDB options
if self.parallel_hyperopt:
# add output_path to db name to avoid conflicts
db_name = f'{sampling_dict.get("db_name")}-{sampling_dict.get("output_path")}'
self.db_host = sampling_dict.get("db_host")
self.db_port = sampling_dict.get("db_port")
self.db_name = db_name
self.num_mongo_workers = sampling_dict.get("num_mongo_workers")
self.mongod_runner = MongodRunner(self.db_name, self.db_port)
self.mongod_runner = MongodRunner(self._db_path, self._db_host, self._db_port)

self.hyper_keys = set([])

@@ -323,14 +310,11 @@ def _update_param(self, key, sampler):

if key not in self.parameter_keys and key != "parameters":
raise ValueError(
"Trying to update a parameter not declared in the `parameters` dictionary: {0} @ HyperScanner._update_param".format(
key
)
f"Trying to update a parameter not declared in the `parameters` dictionary: {key} @ HyperScanner._update_param"
)

self.hyper_keys.add(key)
log.info("Adding key {0} with value {1}".format(key, sampler))

log.info(f"Adding key {key} with value {sampler}")
self.parameters[key] = sampler

def stopping(self, min_epochs=None, max_epochs=None, min_patience=None, max_patience=None):
@@ -376,8 +360,8 @@ def optimizer(self, optimizers):
]
and will sample one from this list.

Note that the keys within the dictionary (`optimizer_name` and `learning_rate`) should be named
as the keys used by the compiler of the model as they are used as they come.
Note that the keys within the dictionary (`optimizer_name` and `learning_rate`)
should be named as the keys used by the compiler of the model.
"""
# Get all accepted optimizer to check against
all_optimizers = MetaModel.accepted_optimizers
@@ -393,7 +377,7 @@ name = optimizer[optname_key]
name = optimizer[optname_key]
optimizer_dictionary = {optname_key: name}

if name not in all_optimizers.keys():
if name not in all_optimizers:
raise NotImplementedError(
f"HyperScanner: Optimizer {name} not implemented in MetaModel.py"
)
@@ -476,8 +460,8 @@ def architecture(
else:
if min_units is None or max_units is None:
raise ValueError(
"A max/min number of units must always be defined if the number of layers is to be sampled"
"i.e., make sure you add the keywords 'min_units' and 'max_units' to the 'architecutre' dict"
"A max/min number of units must always be defined when the number of layers"
"is to be sampled, i.e., add 'min_units' and 'max_units' to 'architecture' dict"
)

activation_key = "activation_per_layer"
@@ -497,7 +481,7 @@ for n in n_layers:
for n in n_layers:
units = []
for i in range(n):
units_label = "nl{0}:-{1}/{0}".format(n, i)
units_label = f"nl{n}:-{i}/{n}"
units_sampler = hp_quniform(
units_label, min_units, max_units, step_size=1, make_int=True
)
@@ -516,7 +500,7 @@ for ini_name in initializers:
for ini_name in initializers:
if ini_name not in imp_init_names:
raise NotImplementedError(
"HyperScanner: Initializer {0} not implemented in MetaLayer.py".format(ini_name)
f"HyperScanner: Initializer {ini_name} not implemented in MetaLayer.py"
)
# For now we are going to use always all initializers and with default values
ini_choices.append(ini_name)