63 changes: 29 additions & 34 deletions doc/sphinx/source/n3fit/hyperopt.rst
@@ -115,8 +115,6 @@ All loss functions implemented in ``n3fit`` for the optimization of hyperparamet
of all partitions as if they were equivalent.
When they are not equivalent, the ``weight`` flag should be used (see :ref:`hyperoptrc-label`).



- Not all datasets should enter a partition: beware of extrapolation.

Beyond the last dataset that has entered the fit we find ourselves in what is usually known as
@@ -465,21 +463,9 @@ The optimal approach for this combination is still under development.
All the above options are implemented in the :class:`~n3fit.hyper_optimization.rewards.HyperLoss` class
which is instantiated and monitored within the :meth:`~n3fit.model_trainer.ModelTrainer.hyperparametrizable` method.

Restarting hyperoptimization runs
---------------------------------

In addition to the ``tries.json`` files, hyperparameter scans also produce ``tries.pkl`` `pickle <https://docs.python.org/3/library/pickle.html>`_ files,
which are located in the same directory as the corresponding ``tries.json`` file.
The generated ``tries.pkl`` file stores the complete history of a previous hyperoptimization run, making it possible to resume the process using the ``hyperopt`` framework.
To achieve this, use the ``--restart`` option of the ``n3fit`` command, for example:

.. code-block:: bash

n3fit runcard.yml 1 -r 10 --hyperopt 20 --restart

The above command is effective when the number of trials saved in ``test_run/nnfit/replica_1/tries.pkl`` is
less than ``20``. If there are ``20`` or more saved trials, ``n3fit`` will simply terminate and display the best results.

Note that a second hyperoptimization run with the same settings will always try to restart the previous one.
If a completely new hyperoptimization is to be run, it is necessary to either remove the previous one or change
the runcard name.
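
To check how many trials a previous run has already stored, the ``tries.pkl`` file can be inspected directly.
The following is a minimal sketch, assuming (as above) an output folder named ``test_run`` and an environment
where ``n3fit`` and ``hyperopt`` are installed, since unpickling requires the original classes:

.. code-block:: python

    # Minimal sketch: count the trials saved by a previous serial run.
    # Assumes tries.pkl stores the pickled hyperopt Trials object.
    import pickle

    with open("test_run/nnfit/replica_1/tries.pkl", "rb") as f:
        trials = pickle.load(f)

    # Trials objects expose the list of evaluated trials as .trials
    print(f"{len(trials.trials)} trials already saved")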

Running hyperoptimizations in parallel with MongoDB
---------------------------------------------------
@@ -489,28 +475,37 @@ This functionality is provided by the :class:`~n3fit.hyper_optimization.mongofil
which extends the capabilities of `hyperopt <https://github.com/hyperopt/hyperopt>`_'s `MongoTrials` and enables the
simultaneous evaluation of multiple trials.

To run a parallelized hyperopt search, use the following command:
Note that the parallelization capabilities of hyperopt expose MongoDB so that it can be accessed (on the given port)
by an external computer.
The most common situation is a node in a cluster acting as the server while all other jobs connect to it.
It is also possible to run all jobs on the same node, or across the internet, with the appropriate ``--db-host`` and ``--db-port`` options.

To run a parallelized hyperopt search, use the following command to run each of the parallel jobs:

.. code-block:: bash

n3fit hyper-quickcard.yml 1 -r N_replicas --hyperopt N_trials --parallel-hyperopt --num-mongo-workers N
n3fit <runcard>.yml 1 -r N_replicas --hyperopt N_trials --parallel-hyperopt

Each job will look for the ``<runcard>/nnfit/replica_1/hyperopt-db`` folder.
If it exists, the job will try to connect to the database; otherwise it will assume that no database
is currently running and will start a ``mongod`` instance for other jobs to connect to, before spawning its own worker.
Here, ``N`` represents the number of MongoDB workers to be launched in parallel.
Each mongo worker handles one trial in hyperopt, so launching more workers allows more trials to be evaluated simultaneously.
Note that there is no need to manually launch MongoDB databases or mongo workers prior to using ``n3fit``.
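
As an illustration, the following is a minimal sketch of launching two such jobs by hand; the
hostname ``node001`` is a placeholder, and the delay between submissions guards against the
race condition discussed below:

.. code-block:: bash

    # First job, run on node001: finds no hyperopt-db folder and starts
    # its own mongod instance before launching its worker
    n3fit runcard.yml 1 -r 4 --hyperopt 100 --parallel-hyperopt

    # Second job, submitted with some delay from any node that can reach
    # node001: connects to the already running database
    n3fit runcard.yml 1 -r 4 --hyperopt 100 --parallel-hyperopt \
        --db-host node001 --db-port 27017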

.. note::
    Counting the number of trials in parallel runs can be tricky.
    The total number of trials that ``n3fit`` will try to run is given by the ``N_trials`` variable,
    and as soon as it is reached, the jobs that were given that number will finish.
    However, the database will not exit as long as there are workers connected to it, so if the first job is set
    to run for 3 trials while the second is set to run for 10, the first job will stay idle until the second one
    completes (by itself) trials 4 to 10.

There is no need to manually launch MongoDB databases or mongo workers prior to using ``n3fit``,
as the ``mongod`` and ``hyperopt-mongo-worker`` commands are automatically executed
by :meth:`~n3fit.hyper_optimization.mongofiletrials.MongodRunner.start` and
:meth:`~n3fit.hyper_optimization.mongofiletrials.MongoFileTrials.start_mongo_workers` methods, respectively.
By default, the ``host`` and ``port`` arguments are set to ``localhost`` and ``27017``. The database is named ``hyperopt-db-output_name``, where
``output_name`` is set to the name of the runcard. If the ``n3fit -o OUTPUT`` option is provided, ``output_name`` is set to ``OUTPUT``, with the database being referred to as ``hyperopt-db-OUTPUT``.
If necessary, it is possible to modify all the above settings using the ``n3fit --db-host``, ``n3fit --db-port`` and ``n3fit --db-name`` options.

To resume a hyperopt experiment, add the ``--restart`` option to the ``n3fit`` command:

.. code-block:: bash

n3fit hyper-quickcard.yml 1 -r N_replicas --hyperopt N_trials --parallel-hyperopt --num-mongo-workers N --restart

Note that, unlike in serial execution, parallel hyperoptimization runs do not generate ``tries.pkl`` files.
Instead, MongoDB databases are saved as ``hyperopt-db-output_name.tar.gz`` files inside the ``replica_path`` directory.
These are conveniently extracted for reuse in restart runs.
Beware, however, of race conditions when starting the database.
Make sure the second job is submitted with some delay with respect to the first.
By default, the ``port`` and ``host`` arguments are set to ``27017`` and to the hostname of the first job, respectively.
The database is named ``hyperopt-db`` and stored in the ``replica_1`` folder of the current run.
Note that the database is not uploaded to the server when using ``vp-upload`` unless ``--upload-db`` is explicitly specified.
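
For instance, assuming the output folder of the run is called ``test_run``, a sketch of uploading the
results together with the database would be:

.. code-block:: bash

    # --upload-db must be passed explicitly for the database to be included
    vp-upload test_run --upload-db
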
16 changes: 14 additions & 2 deletions n3fit/src/n3fit/hyper_optimization/filetrials.py
@@ -1,7 +1,19 @@
"""
Custom hyperopt trial object for persistent file storage
in the form of json and pickle files within the nnfit folder

The trials can be easily accessed with ``json``:

Example
-------
>>> import json, glob
>>> for jsonfile in glob.glob("*/nnfit/replica_1/tries.json"):
...     name = jsonfile.split("/")[0]
...     trials = json.load(open(jsonfile, "r"))
...     print(f"{len(trials)} trials for {name}")

"""

import json
import logging
import pickle
Expand Down
186 changes: 85 additions & 101 deletions n3fit/src/n3fit/hyper_optimization/hyper_scan.py
@@ -13,14 +13,14 @@
(and, of course the function in the fitting action that calls the minimization).
"""

import contextlib
import copy
import logging
import os

import hyperopt
from hyperopt.pyll.base import scope
import numpy as np

import hyperopt
from n3fit.backends import MetaLayer, MetaModel
from n3fit.hyper_optimization.filetrials import FileTrials

@@ -101,6 +101,8 @@ def optimizer_arg_wrapper(hp_key, option_dict):
choice = hp_uniform(hp_key, min_lr, max_lr)
elif sampling == "log":
choice = hp_loguniform(hp_key, min_lr, max_lr)
else:
raise ValueError(f"Sampling {sampling} not understood")
return choice


@@ -129,60 +131,54 @@ def hyper_scan_wrapper(replica_path_set, model_trainer, hyperscanner, max_evals=
# Tell the trainer we are doing hyperopt
model_trainer.set_hyperopt(True, keys=hyperscanner.hyper_keys)

if hyperscanner.restart_hyperopt:
# For parallel hyperopt restarts, extract the database tar file
# Prepare the context manager in the parallel case (and an empty one otherwise)
if hyperscanner.parallel_hyperopt:
runner_ctx = hyperscanner.mongod_runner
# Upon entering the context it will check whether the database is up
# and, if not, it will start it
else:
runner_ctx = contextlib.nullcontext()

with runner_ctx:
# Generate the trials object, as a MongoFileTrials or a simple sequential FileTrials
if hyperscanner.parallel_hyperopt:
tar_file_to_extract = f"{replica_path_set}/{hyperscanner.db_name}.tar.gz"
log.info("Restarting hyperopt run using the MongoDB database %s", tar_file_to_extract)
MongoFileTrials.extract_mongodb_database(tar_file_to_extract, path=os.getcwd())
# Instantiate `MongoFileTrials` as trials to give to the worker later
trials = MongoFileTrials(
replica_path_set,
hyperscanner.mongod_runner,
num_workers=1, # Only one worker per n3fit job will run
parameters=hyperscanner.as_dict(),
)
else:
# If we are not running in parallel, check whether there's a pickle to load and restart
# For sequential hyperopt restarts, reset the state of `FileTrials` saved in the pickle file
pickle_file_to_load = f"{replica_path_set}/tries.pkl"
log.info("Restarting hyperopt run using the pickle file %s", pickle_file_to_load)
trials = FileTrials.from_pkl(pickle_file_to_load)
pickle_file_to_load = replica_path_set / "tries.pkl"
if pickle_file_to_load.exists():
log.info("Restarting hyperopt run using the pickle file %s", pickle_file_to_load)
trials = FileTrials.from_pkl(pickle_file_to_load)
else:
# Instantiate `FileTrials`
trials = FileTrials(replica_path_set, parameters=hyperscanner.as_dict())

# Initialize seed for hyperopt
trials.rstate = np.random.default_rng(HYPEROPT_SEED)
# And prepare the generic arguments to fmin
fmin_args = {
"fn": model_trainer.hyperparametrizable,
"space": hyperscanner.as_dict(),
"algo": hyperopt.tpe.suggest,
"max_evals": max_evals,
"trials": trials,
"rstate": trials.rstate,
}

if hyperscanner.parallel_hyperopt:
# start MongoDB database by launching `mongod`
hyperscanner.mongod_runner.ensure_database_dir_exists()
mongod = hyperscanner.mongod_runner.start()

# Generate the trials object
if hyperscanner.parallel_hyperopt:
# Instantiate `MongoFileTrials`
# Mongo database should have already been initiated at this point
trials = MongoFileTrials(
replica_path_set,
db_host=hyperscanner.db_host,
db_port=hyperscanner.db_port,
db_name=hyperscanner.db_name,
num_workers=hyperscanner.num_mongo_workers,
parameters=hyperscanner.as_dict(),
)
else:
# Instantiate `FileTrials`
trials = FileTrials(replica_path_set, parameters=hyperscanner.as_dict())

# Initialize seed for hyperopt
trials.rstate = np.random.default_rng(HYPEROPT_SEED)

# Call to hyperopt.fmin
fmin_args = dict(
fn=model_trainer.hyperparametrizable,
space=hyperscanner.as_dict(),
algo=hyperopt.tpe.suggest,
max_evals=max_evals,
trials=trials,
rstate=trials.rstate,
)
if hyperscanner.parallel_hyperopt:
trials.start_mongo_workers()
hyperopt.fmin(**fmin_args, show_progressbar=True, max_queue_len=trials.num_workers)
trials.stop_mongo_workers()
# stop mongod command and compress database
hyperscanner.mongod_runner.stop(mongod)
trials.compress_mongodb_database()
else:
hyperopt.fmin(**fmin_args, show_progressbar=False, trials_save_file=trials.pkl_file)
if hyperscanner.parallel_hyperopt:
trials.start_mongo_workers()
# TODO benchmark how the behaviour depends on max_queue_len (if it does)
hyperopt.fmin(**fmin_args, show_progressbar=True, max_queue_len=12)
trials.stop_mongo_workers()
else:
hyperopt.fmin(**fmin_args, show_progressbar=False, trials_save_file=trials.pkl_file)


class ActivationStr:
@@ -212,56 +208,47 @@ class HyperScanner:
It takes care of known correlations between parameters by tying them together
It also provides methods for updating the parameter dictionaries after using hyperopt

It takes as inpujt the dictionaries defining the NN/fit and the hyperparameter scan
It takes as input the dictionaries defining the NN/fit and the hyperparameter scan
from the NNPDF runcard and substitutes in `parameters` samplers according to the
`hyper_scan` dictionary.

In the sampling dict,


Parameters
----------
`parameters`: dict
the `fitting[parameters]` dictionary of the NNPDF runcard
`sampling_dict`: dict
the `hyperscan` dictionary of the NNPDF runcard defining the search space of the scan
`steps`: int
when taking discrete steps between two parameters, number of steps to take

# Arguments:
- `parameters`: the `fitting[parameters]` dictionary of the NNPDF runcard
- `sampling_dict`: the `hyperscan` dictionary of the NNPDF runcard defining
the search space of the scan
- `steps`: when taking discrete steps between two parameters, number of steps
to take

# Parameters accepted by `sampling_dict`:
- `stopping`:
- min_epochs, max_epochs
- min_patience, max_patience
"""

def __init__(self, parameters, sampling_dict, steps=5):
def __init__(
self, parameters, sampling_dict, steps=5, db_host=None, db_port=None, db_path=None
):
self._original_parameters = parameters
self.parameter_keys = parameters.keys()
self.parameters = copy.deepcopy(parameters)
self.steps = steps

# adding extra options for restarting
restart_config = sampling_dict.get("restart")
self.restart_hyperopt = True if restart_config else False

# adding extra options for parallel execution
parallel_config = sampling_dict.get("parallel")
if parallel_config is None:
self.parallel_hyperopt = False
elif _has_pymongo:
self._db_path = db_path
self._db_host = db_host
self._db_port = db_port
self.mongod_runner = None
self.parallel_hyperopt = False

if db_path is not None:
# If we get a db_path, assume we want to run in parallel, therefore check whether we can
if not _has_pymongo:
raise ModuleNotFoundError(
"Could not import pymongo modules, please install with `.[parallelhyperopt]`"
)
self.parallel_hyperopt = True
else:
raise ModuleNotFoundError(
"Could not import pymongo modules, please install with `.[parallelhyperopt]`"
)

self.parallel_hyperopt = True if parallel_config else False

# setting up MondoDB options
if self.parallel_hyperopt:
# add output_path to db name to avoid conflicts
db_name = f'{sampling_dict.get("db_name")}-{sampling_dict.get("output_path")}'
self.db_host = sampling_dict.get("db_host")
self.db_port = sampling_dict.get("db_port")
self.db_name = db_name
self.num_mongo_workers = sampling_dict.get("num_mongo_workers")
self.mongod_runner = MongodRunner(self.db_name, self.db_port)
self.mongod_runner = MongodRunner(self._db_path, self._db_host, self._db_port)

self.hyper_keys = set([])

@@ -323,14 +310,11 @@ def _update_param(self, key, sampler):

if key not in self.parameter_keys and key != "parameters":
raise ValueError(
"Trying to update a parameter not declared in the `parameters` dictionary: {0} @ HyperScanner._update_param".format(
key
)
f"Trying to update a parameter not declared in the `parameters` dictionary: {key} @ HyperScanner._update_param"
)

self.hyper_keys.add(key)
log.info("Adding key {0} with value {1}".format(key, sampler))

log.info(f"Adding key {key} with value {sampler}")
self.parameters[key] = sampler

def stopping(self, min_epochs=None, max_epochs=None, min_patience=None, max_patience=None):
@@ -376,8 +360,8 @@ def optimizer(self, optimizers):
]
and will sample one from this list.

Note that the keys within the dictionary (`optimizer_name` and `learning_rate`) should be named
as the keys used by the compiler of the model as they are used as they come.
Note that the keys within the dictionary (`optimizer_name` and `learning_rate`)
should be named as the keys used by the compiler of the model.
"""
# Get all accepted optimizer to check against
all_optimizers = MetaModel.accepted_optimizers
@@ -393,7 +377,7 @@ name = optimizer[optname_key]
name = optimizer[optname_key]
optimizer_dictionary = {optname_key: name}

if name not in all_optimizers.keys():
if name not in all_optimizers:
raise NotImplementedError(
f"HyperScanner: Optimizer {name} not implemented in MetaModel.py"
)
@@ -476,8 +460,8 @@ def architecture(
else:
if min_units is None or max_units is None:
raise ValueError(
"A max/min number of units must always be defined if the number of layers is to be sampled"
"i.e., make sure you add the keywords 'min_units' and 'max_units' to the 'architecutre' dict"
"A max/min number of units must always be defined when the number of layers"
"is to be sampled, i.e., add 'min_units' and 'max_units' to 'architecture' dict"
)

activation_key = "activation_per_layer"
@@ -497,7 +481,7 @@ for n in n_layers:
for n in n_layers:
units = []
for i in range(n):
units_label = "nl{0}:-{1}/{0}".format(n, i)
units_label = f"nl{n}:-{i}/{n}"
units_sampler = hp_quniform(
units_label, min_units, max_units, step_size=1, make_int=True
)
@@ -516,7 +500,7 @@ for ini_name in initializers:
for ini_name in initializers:
if ini_name not in imp_init_names:
raise NotImplementedError(
"HyperScanner: Initializer {0} not implemented in MetaLayer.py".format(ini_name)
f"HyperScanner: Initializer {ini_name} not implemented in MetaLayer.py"
)
# For now we are going to use always all initializers and with default values
ini_choices.append(ini_name)