ML Tracking Ops is an MLOps Python library and platform for tracking machine learning projects. It lets users track individual training runs as well as complete hyperparameter sweeps.
- Exposes an API to the user which enables them to log Machine Learning/Data Science metrics during training
- Enables users to initiate a hyperparameter sweep and log the sweep artifacts
- Enables users to start an interactive web app for visualizing the experiment results. In this app they can compare different experiments and visualize different metrics
- Enables users to compare different training runs executed within the same hyperparameter sweep
- Simplest form of tracking runs
- Hyperparameter Sweeps
- ML Tracking Ops Web App
- An Important Note
- Licence
Below we can see a PyTorch example of how we can track an experiment using ML-Tracking-Ops.
ML Tracking Ops is library agnostic, i.e. you do not have to use PyTorch. As long as the ExperimentLogger.add_scalar is provided with a simple float the experiment logging process will be possible.
from ml_tracking_ops.experiment.logger import ExperimentLogger

...
# Dataset setup, model instantiation etc.
...

writer = ExperimentLogger(logdir="runs")
max_epochs = 10
train_step = 0  # step counter passed to add_scalar alongside each logged value

for epoch in range(max_epochs):
    print("Epoch:", epoch)
    for x, y_true in dataloader:
        train_step += 1
        optimizer.zero_grad()
        y_pred = model(x)
        loss = loss_fcn(y_pred, y_true)
        loss.backward()
        optimizer.step()
        # We need to pass the scalar value in the form of a simple 'float'
        writer.add_scalar("Loss", loss.item(), train_step)

When an instance of ExperimentLogger is created, a directory named after the logdir argument is created (if it does not already exist). Inside logdir, a new directory is created whose name corresponds to the time at which the ExperimentLogger instance was created. This directory holds the logs of the training run started at that time: a single .dat file containing the time-series logs for each metric logged during that run. See the image below for an example.
Each of these folders represents a different training run (possibly after changing some hyperparameters). The logdir directory should be used to group different training runs so they can be easily compared in the ML Tracking Ops web app.
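The directory layout described above can be inspected with a few lines of standard-library Python. This is only a sketch based on that description: list_runs is a hypothetical helper, not part of ML Tracking Ops, and it makes no assumption about the timestamp format of the directory names or about the contents of the .dat files.

```python
from pathlib import Path


def list_runs(logdir="runs"):
    """Map each sub-directory of `logdir` to the names of the .dat files inside it.

    Hypothetical helper: it only relies on the layout described in this README
    (one sub-directory per run, with .dat log files inside).
    """
    root = Path(logdir)
    if not root.is_dir():
        return {}  # nothing has been logged to this directory yet
    runs = {}
    for run_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        runs[run_dir.name] = sorted(f.name for f in run_dir.glob("*.dat"))
    return runs


if __name__ == "__main__":
    for name, files in list_runs("runs").items():
        print(name, files)
```

Every sub-directory of logdir is listed, so the result makes it easy to check which runs will be grouped together before opening the web app.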
ML Tracking Ops enables users to run a hyperparameter sweep for their machine learning pipeline. This is relatively easy to do: all you need is a simple configuration file and an argument parser in your training script. Once those two things are defined, the sweep can be started with a single command:

ml-tracking-ops --run_sweep=True --logdir=runs

- Passing the logdir argument is not mandatory, since it defaults to the string runs.
- When the sweep is started, a directory named after the logdir argument is created (if it does not already exist). Inside logdir, a new directory is created whose name corresponds to the time the sweep was started. This directory contains an experiment_description.json file, which is created automatically and describes the configuration of the sweep (it is used by the web app and SHOULD NOT be deleted).
- Besides this file, a separate .dat file is created for each hyperparameter combination tried. These files contain the time-series logs written by the ExperimentLogger instances created inside the training script specified in the configuration file, every time a new training run is started. Each file is named after the timestamp at which training for that hyperparameter combination started.
- Specifying --run_sweep=True, on the other hand, is necessary: if this argument is omitted it takes the value False, which starts the ML Tracking Ops web app instead.
The configuration file specifies:
- Which hyperparameters you wish to explore and how to sample them
- The entry point for training your model
- How many different hyperparameter combinations you wish to try. NOTE: The hyperparameter search is not exhaustive; it is limited by the specified maximum number of training runs.
- Whether to apply early stopping to each of the training runs
  - If so, which metric to monitor when optimizing the model
  - Whether the optimization process maximizes or minimizes the optimization_metric

This file must be named "experiment_cfg.json".
Below we can see an example of the configuration file. The JSON object keys main_script_name, max_runs, hyperparameters and early_stopping must be present.
{
    "main_script_name": "train_script.py",
    "hyperparameters": {
        "learning_rate": {
            "type": "uniform",
            "min": 1e-5,
            "max": 1e-2
        },
        "batch_size": {
            "type": "choice",
            "candidates": [32, 64, 128]
        },
        "train_steps": {
            "type": "choice",
            "candidates": [700, 850, 1000]
        }
    },
    "max_runs": 100,
    "early_stopping": true,
    "early_stopping_patience": 5,
    "optimization_metric": "Accuracy",
    "optimization_goal": "max"
}
- In the example above, the hyperparameters we wish to explore are defined in a specific format. Each hyperparameter must have a type key, which takes either the value uniform (a continuous parameter) or choice (a discrete parameter). The remaining keys depend on the type: min and max are required for uniform sampling, and candidates is required for choice sampling. Hyperparameters can have any name the user wants. Note: these names must match the hyperparameter names expected by the script specified in main_script_name.
- We should specify whether to apply the EarlyStopping strategy to each training run. If early_stopping is set to true, the following properties must be specified as well:
  - optimization_metric: the metric that is tracked to decide whether the EarlyStopping event should occur.
  - early_stopping_patience: the maximum number of steps (at which the metric was logged) for which the metric specified by optimization_metric is allowed not to improve. When this threshold is reached, the EarlyStopping event triggers and the training process (for the current hyperparameter combination) terminates.
  - optimization_goal: determines what counts as an improvement of the metric. It takes the values max and min, which correspond to maximization and minimization of optimization_metric, respectively.
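To make the interplay of early_stopping_patience and optimization_goal concrete, here is a minimal sketch of patience-based early stopping. It illustrates the rule as described above and is not the library's own implementation: the EarlyStopping class, its update method and the accuracy values are all hypothetical.

```python
class EarlyStopping:
    """Patience-based early stopping (illustrative; not the library's class)."""

    def __init__(self, patience, goal="max"):
        if goal not in ("max", "min"):
            raise ValueError("goal must be 'max' or 'min'")
        self.patience = patience
        self.goal = goal
        self.best = None
        self.steps_without_improvement = 0

    def update(self, value):
        """Record a newly logged metric value; return True if training should stop."""
        improved = (
            self.best is None
            or (self.goal == "max" and value > self.best)
            or (self.goal == "min" and value < self.best)
        )
        if improved:
            self.best = value
            self.steps_without_improvement = 0
        else:
            self.steps_without_improvement += 1
        return self.steps_without_improvement >= self.patience


# Made-up accuracy values, one per logging step.
stopper = EarlyStopping(patience=2, goal="max")
stopped_at = None
for step, accuracy in enumerate([0.60, 0.65, 0.64, 0.65, 0.70]):
    if stopper.update(accuracy):
        stopped_at = step
        break
print("stopped at step", stopped_at, "with best value", stopper.best)
```

With a patience of 2, the run stops at the second consecutive logged value that fails to beat the best one so far, so the later value 0.70 is never reached. In this sketch a tie does not count as an improvement; whether the library treats ties the same way is not stated in this README.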
Note
Both the configuration file experiment_cfg.json and the training script specified in the main_script_name must be present in the current working directory where the ml-tracking-ops --run_sweep=True --logdir=runs command will be run.
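Since a missing file or key only shows up once the sweep is launched, a small pre-flight check can be handy. The sketch below is an assumption-laden illustration: check_sweep_setup is a hypothetical helper, not something ML Tracking Ops provides, and it only verifies the requirements stated in this README.

```python
import json
from pathlib import Path

# Keys this README says must always be present in experiment_cfg.json.
REQUIRED_KEYS = ("main_script_name", "max_runs", "hyperparameters", "early_stopping")
# Keys this README says are needed once early_stopping is true.
EARLY_STOPPING_KEYS = ("optimization_metric", "early_stopping_patience", "optimization_goal")


def check_sweep_setup(workdir="."):
    """Return a list of problems found; an empty list means the setup looks fine."""
    workdir = Path(workdir)
    cfg_path = workdir / "experiment_cfg.json"
    if not cfg_path.is_file():
        return [f"missing {cfg_path.name}"]
    cfg = json.loads(cfg_path.read_text())
    problems = [f"missing key: {k}" for k in REQUIRED_KEYS if k not in cfg]
    if cfg.get("early_stopping"):
        problems += [f"missing key: {k}" for k in EARLY_STOPPING_KEYS if k not in cfg]
        if cfg.get("optimization_goal") not in (None, "max", "min"):
            problems.append("optimization_goal must be 'max' or 'min'")
    script = cfg.get("main_script_name")
    if script and not (workdir / script).is_file():
        problems.append(f"training script not found: {script}")
    return problems


if __name__ == "__main__":
    for problem in check_sweep_setup("."):
        print("Problem:", problem)
```

Running it from the directory in which you intend to start the sweep prints one line per problem, and nothing if the configuration file and the training script are both in place.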
In each training run we sample a hyperparameter combination according to the previously specified sampling preferences. After this step your training script specified in the main_script_name in the experiment_cfg.json file is started as a separate subprocess and sampled hyperparameters are passed to it in the form of command line arguments.
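The sampling step and the hand-off to the training script could look roughly like the sketch below. The function names are hypothetical, and several details are assumptions rather than documented behavior: that uniform means a plain uniform draw between min and max, that arguments are passed in --name=value form, and that --logdir is forwarded to the script.

```python
import random
import sys


def sample_hyperparameters(space, rng=random):
    """Draw one value per hyperparameter from a search-space description."""
    sampled = {}
    for name, spec in space.items():
        if spec["type"] == "uniform":
            sampled[name] = rng.uniform(spec["min"], spec["max"])
        elif spec["type"] == "choice":
            sampled[name] = rng.choice(spec["candidates"])
        else:
            raise ValueError(f"unknown hyperparameter type: {spec['type']}")
    return sampled


def build_command(script, hyperparameters, logdir="runs"):
    """Turn sampled values into an argv list suitable for subprocess.run."""
    args = [f"--{name}={value}" for name, value in hyperparameters.items()]
    # Assumption: the sweep forwards the log directory to the training script.
    return [sys.executable, script, f"--logdir={logdir}", *args]


# Search space taken from the experiment_cfg.json example above.
space = {
    "learning_rate": {"type": "uniform", "min": 1e-5, "max": 1e-2},
    "batch_size": {"type": "choice", "candidates": [32, 64, 128]},
    "train_steps": {"type": "choice", "candidates": [700, 850, 1000]},
}
params = sample_hyperparameters(space, random.Random(0))
command = build_command("train_script.py", params)
print(command)
```

The resulting list could be handed to subprocess.run(command) to start one training run. Because every hyperparameter becomes a command-line flag, the training script has to declare an argument with the same name.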
This means that, in order to use the exact sampled values of these hyperparameters, the training script needs an argument parser. The parser must accept arguments whose names match the ones defined in the hyperparameters section of experiment_cfg.json.
Below we can see an example of this argument parser. This parser was designed in order to be able to accept hyperparameters defined in the experiment_cfg.json example above.
from argparse import ArgumentParser

parser = ArgumentParser()
# Having this argument is really important, since its value must be passed when creating the ExperimentLogger instance
parser.add_argument("--logdir", type=str, default="runs")
parser.add_argument("--learning_rate", type=float, default=1e-3)
parser.add_argument("--batch_size", type=int, default=32)
parser.add_argument("--train_steps", type=int, default=1000)
# Collect arguments (sampled hyperparameter values) that got passed when the training script was started
config = parser.parse_args()

We can start the ML Tracking Ops web app by running a simple command:

ml-tracking-ops --logdir=runs

The logdir argument represents the directory which contains the experiment and sweep logs which we would like to observe and analyze.
Passing the logdir argument is optional, since it defaults to the string runs. Be aware of this behavior, though: the directory runs may not contain the logs you are interested in, or may not exist at all!
After running the previous command, the app starts on a local server at 127.0.0.1:5000 (or localhost:5000). Visiting either address immediately redirects to a page where the different experiment runs are visualized. An example of the page you would see when you start the app is given below.
As we can see on the Experiments tab below, the sidebar contains the list of all experiments present in the specified logdir directory. This does not include logs which correspond to hyperparameter sweeps.
When an experiment is selected, all of the metrics which were logged in its corresponding log file (directory) are displayed on separate graphs.
It is also possible to select multiple experiments at once and compare them. In this case the graphs for the different experiments are drawn on top of one another, which makes it easier to compare different training runs. An example of such a case is shown below.
As we can see on the Sweeps tab below, the sidebar contains the list of all hyperparameter sweeps present in the specified logdir directory. This does not include logs which correspond to regular training runs which aren't sweeps.
Below we can see an example of what this tab can look like.
When a sweep is selected all of the data relevant for that sweep is displayed.
This section describes the content of the experiment_cfg.json file in a structured and visually appealing way. This section gets automatically created.
This table contains a description of every training run started during the sweep. The description consists of the exact hyperparameter values used for that particular run and the best value of the metric specified in the optimization_metric field. If no value was given for that field, this column won't be present in the table.
This chart shows the selected metric for every run that is present on the current page of the table. As we can see below, the EarlyStopping event was triggered for some of the runs present on the current page.
- Here is a short demo of the Experiments tab:
experiments_demo.mp4
- Here is a short demo of the Sweeps tab:
sweeps_demo.mp4
This tool was created as a part of my learning process and therefore is provided "as is".
Use this tool at your own risk.