1) trajectory.son: An aiMD trajectory file from FHI-vibes, used to obtain the training and testing data.
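For a quick sanity check of the trajectory, the file can be read with the `son` package that ships with the FHI-vibes ecosystem (`son.load("trajectory.son")`). The minimal parser below is only a hedged sketch of the SON layout it assumes (a JSON metadata block terminated by a `===` line, then JSON frames separated by `---` lines); for real files prefer the `son` package.

```python
import json

def load_son(text):
    """Parse a SON-style string: a metadata block terminated by '===',
    then JSON frames separated by '---' lines.
    (Layout assumed from the FHI-vibes SON convention; use the `son`
    package for real trajectory.son files.)"""
    meta_part, _, data_part = text.partition("\n===\n")
    metadata = json.loads(meta_part)
    frames = [json.loads(chunk)
              for chunk in data_part.split("\n---\n") if chunk.strip()]
    return metadata, frames

# Tiny synthetic example, NOT a real aiMD trajectory:
example = '{"MD": {"timestep": 5}}\n===\n{"energy": -1.0}\n---\n{"energy": -1.1}'
meta, frames = load_son(example)
print(meta["MD"]["timestep"], len(frames))  # 5 2
```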
2) geometry.in.supercell: A supercell with the ground-state atomic structure. It should have the same size as the structures you plan to use as training data.
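For reference, an FHI-aims geometry file consists of `lattice_vector` lines (in Å) followed by `atom` lines giving Cartesian coordinates and the species. The fragment below only illustrates the format; the lattice constants and positions are made up and do not correspond to the actual CuI supercell.

```
lattice_vector  12.18   0.00   0.00
lattice_vector   0.00  12.18   0.00
lattice_vector   0.00   0.00  12.18
atom   0.00   0.00   0.00  Cu
atom   3.05   3.05   3.05  I
```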
3) nequip.yaml: A NequIP input file. You need to modify this file for your system. r_max, l_max, and n_features are controlled by input.in; in particular, check chemical_symbols and loss_coeffs.
dataset_seed: 0 # data set seed
append: true # set true if a restarted run should append to the previous log file
default_dtype: float64 # type of float to use, e.g. float32 and float64
allow_tf32: false # whether to use TensorFloat32 if it is available
device: cuda # which device to use. Default: automatically detected cuda or "cpu"
# network
model_builders:
- SimpleIrrepsConfig
- EnergyModel
- PerSpeciesRescale
- ForceOutput
- RescaleEnergyEtc
num_layers: 4 # number of interaction blocks, we find 3-5 to work best
parity: true # whether to include features with odd mirror parity; often turning parity off gives equally good results but faster networks, so do consider this
nonlinearity_type: gate # may be 'gate' or 'norm', 'gate' is recommended
# alternatively, the irreps of the features in various parts of the network can be specified directly:
# the following options use e3nn irreps notation
# either these four options, or the above three options, should be provided; they cannot be mixed.
# radial network basis
num_basis: 8 # number of basis functions used in the radial basis, 8 usually works best
BesselBasis_trainable: true # set true to train the bessel weights
PolynomialCutoff_p: 6 # p-value used in polynomial cutoff function
invariant_layers: 2 # number of radial layers, we found it important to keep this small, 1 or 2
invariant_neurons: 64 # number of hidden neurons in radial function, again keep this small for MD applications, 8 - 32, smaller is faster
avg_num_neighbors: auto # number of neighbors to divide by, None => no normalization.
use_sc: true # use self-connection or not, usually gives big improvement
# data set
# the keys used need to be stated at least once in key_mapping, npz_fixed_field_keys or npz_keys
# key_mapping is used to map the key in the npz file to the NequIP default values (see data/_key.py)
# all arrays are expected to have the shape of (nframe, natom, ?) except the fixed fields
dataset: npz # type of data set, can be npz or ase
key_mapping:
z: atomic_numbers # atomic species, integers
  E: total_energy # total potential energies to train to
F: forces # atomic forces to train to
R: pos # raw atomic positions
unit_cell: cell
pbc: pbc
chemical_symbols:
- Cu
- I
verbose: info # the same as python logging, e.g. warning, info, debug, error; case insensitive
log_batch_freq: 100 # batch frequency, how often to print training errors within the same epoch
log_epoch_freq: 1 # epoch frequency, how often to print
save_checkpoint_freq: -1 # frequency to save the intermediate checkpoint. no saving of intermediate checkpoints when the value is not positive.
save_ema_checkpoint_freq: -1 # frequency to save the intermediate ema checkpoint. no saving of intermediate checkpoints when the value is not positive.
# scalar nonlinearities to use — available options are silu, ssp (shifted softplus), tanh, and abs.
# Different nonlinearities are specified for e (even) and o (odd) parity;
# note that only tanh and abs are correct for o (odd parity).
nonlinearity_scalars:
e: silu
o: tanh
nonlinearity_gates:
e: silu
o: tanh
# training
learning_rate: 0.01 # learning rate, we found 0.01 to work best - this is often one of the most important hyperparameters to tune
batch_size: 1 # batch size, we found it important to keep this small for most applications
max_epochs: 1000000 # stop training after _ number of epochs
train_val_split: random # can be random or sequential. if sequential, first n_train elements are training, next n_val are val, else random
shuffle: true # If true, the data loader will shuffle the data
metrics_key: validation_loss # metrics used for scheduling and saving best model. Options: loss, or anything that appears in the validation batch step header, such as f_mae, f_rmse, e_mae, e_rmse
use_ema: True # if true, use exponential moving average on weights for val/test
ema_decay: 0.999 # ema weight, commonly set to 0.999
ema_use_num_updates: True # whether to use number of updates when computing averages
# early stopping based on metrics values.
# LR, wall and any keys printed in the log file can be used.
# The key can start with Training or validation. If not defined, the validation value will be used.
early_stopping_patiences: # stop early if a metric value stopped decreasing for n epochs
validation_loss: 25
early_stopping_delta: # If delta is defined, a decrease smaller than delta will not be considered as a decrease
validation_loss: 0.005
early_stopping_cumulative_delta: false # If True, the minimum value recorded will not be updated when the decrease is smaller than delta
early_stopping_lower_bounds: # stop early if a metric value is lower than the bound
LR: 1.0e-5
early_stopping_upper_bounds: # stop early if a metric value is higher than the bound
wall: 1.0e+100
metrics_components:
- - forces # key
- rmse # "rmse" or "mse"
- report_per_component: True # if true, statistics on each component (i.e. fx, fy, fz) will be counted separately
- - forces
- mae
- report_per_component: True
- - total_energy
- mae
- PerAtom: True # if true, energy is normalized by the number of atoms
- - total_energy
- rmse
- PerAtom: True
# the name `optimizer_name` is case sensitive
optimizer_name: Adam # default optimizer is Adam in the amsgrad mode
optimizer_amsgrad: true
optimizer_betas: !!python/tuple
- 0.9
- 0.999
optimizer_eps: 1.0e-08
optimizer_weight_decay: 0
# lr scheduler
lr_scheduler_name: ReduceLROnPlateau
lr_scheduler_patience: 50
lr_scheduler_factor: 0.5
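As noted above, r_max, l_max, and n_features are taken from input.in rather than edited in nequip.yaml directly. The sketch below illustrates how such a substitution could be done on the yaml text; the key names `r_max`, `l_max`, and `num_features` are standard NequIP options, but how ALmoMD itself performs this substitution internally may differ.

```python
def apply_input_settings(yaml_text, rmax, lmax, nfeatures):
    """Overwrite the r_max / l_max / num_features keys in a nequip.yaml
    string with the values from input.in.
    (Illustrative sketch only; ALmoMD's own mechanism may differ.)"""
    mapping = {"r_max": rmax, "l_max": lmax, "num_features": nfeatures}
    out = []
    for line in yaml_text.splitlines():
        key = line.split(":", 1)[0].strip()
        if key in mapping:
            out.append(f"{key}: {mapping[key]}")  # replace the value
        else:
            out.append(line)                      # keep everything else
    return "\n".join(out)

template = "r_max: 4.0\nl_max: 1\nnum_features: 16\nnum_layers: 4"
print(apply_input_settings(template, 5.0, 3, 32))
```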
4) job-cont.slurm: A job script for the MLIP-MD.
(This is an example for the COBRA system.)
#!/bin/bash -l
#SBATCH -J test_gpu
#SBATCH -o ./out.%j
#SBATCH -e ./err.%j
#SBATCH -D ./
#SBATCH --ntasks=1 # launch job on a single core
#SBATCH --constraint="gpu"
#SBATCH --gres=gpu:v100:1
#SBATCH --cpus-per-task=1 # on a shared node
#SBATCH --mem=92500 # memory limit for the job
#SBATCH --time=01:00:00
(Load your modules)
conda activate almomd
srun almomd cont >> almomd.out
5) job-nequip-gpu.slurm: A job script for training the NequIP models. The execution part of this script should be left empty.
(This is an example for the COBRA system.)
#!/bin/bash -l
#SBATCH -J test_gpu
#SBATCH -o ./out.%j
#SBATCH -e ./err.%j
#SBATCH -D ./
#SBATCH --ntasks=1 # launch job on a single core
#SBATCH --constraint="gpu"
#SBATCH --gres=gpu:v100:1
#SBATCH --cpus-per-task=1 # on a shared node
#SBATCH --mem=92500 # memory limit for the job
#SBATCH --time=01:00:00
(Load your modules)
conda activate almomd
(Leave this part empty)
6) input.in: An input file for the ALmoMD code.
#[Active learning types]
calc_type : active # active: sample until the required number of training data is reached; period: sample during the assigned period
al_type : force_max # Uncertainty type: force_max (Maximum atomic force uncertainty)
uncert_type : absolute # Absolute or relative uncertainty (absolute / relative)
output_format : trajectory.son # File format of the output file (trajectory.son / aims.out / nequip)
device : cuda # Computing device type (cpu / cuda)
#[Uncertainty sampling criteria]
uncert_shift : 2.0 # How far from the average? 1.0 means one standard deviation
uncert_grad : 1.0 # Gradient near boundary
#[Active learning setting]
nstep : 3 # The number of models built using subsampling
nmodel : 2 # The number of models with different random initialization
ntrain_init : 25 # The number of training data for the initial NequIP model
ntrain : 25 # The number of newly added training data for each AL step
#[Molecular dynamics setting]
ensemble : NVTLangevin # Currently, only NVT Langevin MD is available
temperature : 300
timestep : 5
loginterval : 1
friction : 0.03
#[NequIP setting]
rmax : 5.0
lmax : 3
nfeatures : 32
num_calc : 16 # The number of job scripts for DFT calculations
num_mdl_calc : 6 # The number of job scripts for MLIP training calculations
E_gs : -1378450.95449287 # The reference potential energy (Energy of the geometry.in.supercell)
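The input.in excerpt above follows a simple `key : value  # comment` layout with `#[...]` section headers. A minimal reader for this layout can be sketched as follows; note that this is inferred from the excerpt only, and the real parser in ALmoMD may handle types, defaults, and validation differently.

```python
def parse_input_in(text):
    """Parse 'key : value  # comment' lines into a dict of strings.
    (Sketch based on the input.in excerpt above; ALmoMD's actual
    parser may differ.)"""
    settings = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and section headers
        if not line or ":" not in line:
            continue
        key, value = (s.strip() for s in line.split(":", 1))
        settings[key] = value
    return settings

sample = "rmax : 5.0  # cutoff\nlmax : 3\n#[NequIP setting]\nnfeatures : 32"
print(parse_input_in(sample))  # {'rmax': '5.0', 'lmax': '3', 'nfeatures': '32'}
```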
7) DFT_INPUTS: A directory for the DFT inputs
aims.in: An input file for the FHI-vibes single-point calculation. Make sure that your calculation reproduces the aiMD energies and forces appropriately. This can be tricky, since some older FHI-aims calculations were run in different environments, so the inputs require a very careful check.
job-vibes.slurm: A job script for FHI-vibes. The execution part of this script should also be left empty.
(This is an example for the COBRA system.)
#!/bin/bash -l
#SBATCH -o ./out_o.%j
#SBATCH -e ./out_e.%j
#SBATCH -D ./ # Initial working directory
#SBATCH -J test_slurm # Job Name
#
#SBATCH --nodes=1 # Number of nodes and MPI tasks per node
#SBATCH --ntasks-per-node=40
#
#SBATCH --mail-type=none
#SBATCH --mail-user=kang@fhi-berlin.mpg.de
#
#SBATCH --time=01:40:00 # Wall clock limit (max. is 24 hours)
(Load your modules)
conda activate almomd (or your conda environment for FHI-vibes)
(Leave this part empty)