Structural risk minimization (SRM) is an inductive principle of use in machine learning. Commonly in machine learning, a generalized model must be selected from a finite data set, with the consequent problem of overfitting – the model becoming too strongly tailored to the particularities of the training set and generalizing poorly to new data. The SRM principle addresses this problem by balancing the model’s complexity against its success at fitting the training data.

Structural risk minimization is implemented by minimizing Etrain + βH(W), where Etrain is the training error, H(W) is a regularization function, and β is a constant.

H(W) is designed to take larger values when the parameters W belong to high-capacity subsets of the parameter space; in other words, as the model becomes more complex, H(W) grows. High-capacity models can fit the training data very well, but they also run a greater risk of overfitting, meaning they may perform poorly on new, unseen data.
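As an illustration, the following is a minimal sketch of this objective in Python, assuming a linear model, mean squared error as the training error Etrain, and the squared weight norm as the regularization function H(W); the function name and signature are illustrative, not a standard API.

    import numpy as np

    def srm_objective(W, X, y, beta):
        """Structural-risk-style objective: training error plus beta * H(W).

        Assumes a linear model y_hat = X @ W, mean squared error as the
        training error Etrain, and the squared weight norm as the capacity
        penalty H(W). All names here are illustrative.
        """
        y_hat = X @ W
        e_train = np.mean((y_hat - y) ** 2)   # training error, Etrain
        h_w = np.sum(W ** 2)                  # regularization function, H(W)
        return e_train + beta * h_w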

By minimizing this combined objective, we effectively limit how high-capacity the accessible parts of the model’s parameter space can be. This controls the trade-off between two goals:

  1. Minimizing the training error (how well the model fits the training data).
  2. Minimizing the difference between the training error and test error (how well the model will generalize to new data).

In simpler terms, penalizing H(W) while minimizing the training error helps the model strike a balance between learning from the training data and remaining general enough to perform well on unseen data.
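To see the trade-off concretely, the sketch below fits a linear model on synthetic data for several values of β, using the closed-form minimizer of the sum-of-squares training error plus β times the squared weight norm (i.e., ridge regression as the regularized fit). The data, the train/test split, and the β grid are arbitrary choices made for illustration.

    import numpy as np

    rng = np.random.default_rng(0)

    # Synthetic data: a few informative features plus noise, split into a
    # training set and a held-out set so the generalization gap is visible.
    X = rng.normal(size=(60, 20))
    true_w = np.zeros(20)
    true_w[:3] = [2.0, -1.0, 0.5]
    y = X @ true_w + rng.normal(scale=0.5, size=60)
    X_train, y_train = X[:30], y[:30]
    X_test, y_test = X[30:], y[30:]

    for beta in [0.0, 0.1, 1.0, 10.0]:
        # Closed-form minimizer of ||X W - y||^2 + beta * ||W||^2.
        W = np.linalg.solve(X_train.T @ X_train + beta * np.eye(20),
                            X_train.T @ y_train)
        e_train = np.mean((X_train @ W - y_train) ** 2)
        e_test = np.mean((X_test @ W - y_test) ** 2)
        print(f"beta={beta:5.1f}  train MSE={e_train:.3f}  test MSE={e_test:.3f}")

With β = 0 the training error is smallest but the held-out error is typically larger; increasing β raises the training error slightly while shrinking the gap between the two.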

For example, linear regression with ℓ1 regularization (the lasso) can be cast as a structural risk minimization problem: minimize (1/n)||y − Xw||² + λ||w||₁ over the weights w.

  • The first term is the mean squared error (MSE) between the learned model’s predictions, Xw, and the given labels y. This term is the training error, Etrain, that was discussed earlier.
  • The second term, λ||w||₁, places a prior over the weights to favor sparsity and penalize larger weights. The trade-off coefficient, λ, is a hyperparameter that places more or less importance on the regularization term. A larger λ encourages sparser weights at the expense of a less optimal MSE, while a smaller λ relaxes the regularization and allows the model to fit the data more closely.
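As a rough usage sketch, the example below fits scikit-learn’s Lasso (whose alpha parameter plays the role of λ above) on synthetic data; the data and the λ grid are made up for illustration.

    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(1)
    X = rng.normal(size=(100, 10))
    true_w = np.array([3.0, -2.0, 0.0, 0.0, 1.5, 0.0, 0.0, 0.0, 0.0, 0.0])
    y = X @ true_w + rng.normal(scale=0.1, size=100)

    # Larger lambda drives more coefficients exactly to zero, trading a
    # worse training MSE for a sparser, lower-capacity model.
    for lam in [0.01, 0.1, 1.0]:
        model = Lasso(alpha=lam).fit(X, y)
        n_nonzero = np.sum(model.coef_ != 0)
        train_mse = np.mean((model.predict(X) - y) ** 2)
        print(f"lambda={lam:4.2f}  nonzero weights={n_nonzero}  train MSE={train_mse:.4f}")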