learning_rate: How much the model’s weights are adjusted per step. Too low, and the model will take a long time to learn or get stuck in a suboptimal solution. Too high, and training can become unstable or diverge.
num_train_epochs: The number of times the training data is iterated over.
weight_decay: A type of regularization. It prevents weights from getting too large, which helps prevent overfitting.
adam_beta1: The beta1 parameter for the Adam with weight decay optimizer.
adam_beta2: The beta2 parameter for the Adam with weight decay optimizer.
adam_epsilon: The epsilon parameter for the Adam with weight decay optimizer.
max_grad_norm: Used to prevent exploding gradients. If the norm of the gradients exceeds “max_grad_norm”, the gradients are scaled down so that their norm equals it.
batch_size: The number of training examples used per iteration.
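
To make the roles of these settings more concrete, here is a minimal sketch in plain Python of how they typically interact during training. It is an illustration under assumptions, not the implementation of any particular library: the function names (total_update_steps, clip_by_global_norm, adamw_step) are hypothetical, the default values are placeholders rather than recommended settings, and weights are represented as a flat list of floats.

```python
import math


def total_update_steps(num_examples, batch_size, num_train_epochs):
    """Number of weight updates: one per batch, repeated for every epoch."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_train_epochs


def clip_by_global_norm(grads, max_grad_norm):
    """Scale the gradients down if their combined L2 norm exceeds max_grad_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm > max_grad_norm:
        scale = max_grad_norm / total_norm
        return [g * scale for g in grads]
    return list(grads)


def adamw_step(weights, grads, m, v, step,
               learning_rate=1e-3, weight_decay=0.0,
               adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-8):
    """One Adam-with-weight-decay update. `step` starts at 1.

    Default values here are placeholders chosen for illustration only.
    """
    new_weights, new_m, new_v = [], [], []
    for w, g, m_i, v_i in zip(weights, grads, m, v):
        # adam_beta1 controls the running average of the gradient.
        m_i = adam_beta1 * m_i + (1 - adam_beta1) * g
        # adam_beta2 controls the running average of the squared gradient.
        v_i = adam_beta2 * v_i + (1 - adam_beta2) * g * g
        # Bias correction for the zero-initialized averages.
        m_hat = m_i / (1 - adam_beta1 ** step)
        v_hat = v_i / (1 - adam_beta2 ** step)
        # weight_decay shrinks each weight toward zero (decoupled from the gradient).
        w = w - learning_rate * weight_decay * w
        # adam_epsilon keeps the division numerically stable.
        w = w - learning_rate * m_hat / (math.sqrt(v_hat) + adam_epsilon)
        new_weights.append(w)
        new_m.append(m_i)
        new_v.append(v_i)
    return new_weights, new_m, new_v


if __name__ == "__main__":
    grads = clip_by_global_norm([3.0, 4.0], max_grad_norm=1.0)
    weights, m, v = adamw_step([1.0, 1.0], grads, [0.0, 0.0], [0.0, 0.0], step=1)
    print(total_update_steps(1000, 32, 3), grads, weights)
```

In this sketch, clipping happens before the optimizer step, so max_grad_norm bounds the size of the gradient that the Adam averages see, while learning_rate scales the final adjustment to each weight.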