SECTION - I
A. Learning from Data
B. Gradient Based Learning
In the simplest case, ε is a scalar constant. More sophisticated procedures use a variable ε, or:
- replace it with a diagonal matrix, where each diagonal element is an individual learning rate for one parameter. This accounts for differences in the magnitude of gradients across dimensions.
- replace it with the inverse Hessian matrix. Some algorithms (such as Newton's method) use the inverse Hessian to update parameters, which lets the optimization take into account how the curvature of the error surface varies along each parameter direction.
- replace it with an approximation of the inverse Hessian. Quasi-Newton methods, such as the popular BFGS algorithm, approximate the inverse Hessian rather than computing it exactly, offering a compromise between convergence behavior and computational cost. These methods often outperform gradient descent with a fixed learning rate, particularly on poorly conditioned problems.
- The conjugate gradient method can also be used. By exploiting curvature (Hessian) information, such methods can take larger, better-directed steps toward the minimum. However, computing the inverse Hessian explicitly is expensive and impractical for large-scale problems, especially for machine learning models with millions of parameters.
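To make these options concrete, here is a minimal NumPy sketch comparing a single scalar ε with a per-parameter (diagonal) ε. The toy quadratic loss, its gradient, and the step-size values are illustrative assumptions, not taken from the text above.

```python
import numpy as np

def grad(w):
    # Gradient of a toy ill-conditioned quadratic: E(w) = 0.5 * (w[0]**2 + 100 * w[1]**2)
    return np.array([w[0], 100.0 * w[1]])

eps_scalar = 0.009                     # one scalar step size for all parameters
eps_diag = np.array([0.5, 0.009])      # one step size per parameter (diagonal "epsilon")

w_scalar = np.array([1.0, 1.0])
w_diag = w_scalar.copy()
for _ in range(100):
    w_scalar = w_scalar - eps_scalar * grad(w_scalar)   # W <- W - eps * dE/dW
    w_diag = w_diag - eps_diag * grad(w_diag)            # W <- W - diag(eps) * dE/dW

print(w_scalar)   # first coordinate is still far from the minimum after 100 steps
print(w_diag)     # both coordinates are close to 0
```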
D. Learning in Real Handwriting Recognition Systems
Overview of Handwriting Recognition
- Handwritten Character Recognition has been widely explored and was one of the early successes for neural networks.
- Convolutional Networks (CNNs) are particularly effective in this domain, performing better than other methods on handwritten digits by learning to extract features directly from pixel images.
- Isolated handwritten character recognition is relatively well understood, but segmenting characters from a word or sentence presents a more complex problem.
Challenges in Handwriting Recognition
- Segmentation Problem:
- One of the major difficulties is to correctly segment characters from their neighboring characters within a word or sentence.
- Segmentation refers to the task of separating individual characters in a string for recognition.
- It’s challenging because of the natural overlap and ambiguity of handwritten characters.
- Heuristic Over-Segmentation:
- A widely used method is Heuristic Over-Segmentation, which generates a large number of potential cuts between characters using heuristic image processing techniques.
- These cuts are then scored by the recognizer to select the best combination of characters.
- Accuracy here depends on:
- The quality of cuts generated by the heuristics.
- The ability of the recognizer to differentiate correctly segmented characters from partially cut characters or multiple characters.
- Challenges in Labeling Incorrect Segments:
- Creating a labeled database of incorrectly segmented characters is difficult and expensive.
- Manually labeling hypotheses generated by the segmenter is tedious and prone to inconsistency.
- Ambiguities arise, for example, when deciding if half of a split “4” should be labeled as a “1” or a non-character.
Solutions to the Segmentation Problem
- Training on Whole Strings of Characters:
- Instead of training the system at the individual character level, the system can be trained to recognize whole strings of characters.
- This approach uses Gradient-Based Learning to minimize an overall loss function measuring the probability of an incorrect recognition.
- A differentiable loss function is essential for the application of Gradient-Based methods.
- This method also introduces the use of directed acyclic graphs (DAGs) where arcs carry numerical information, representing alternative hypotheses.
- Segmentation-Free Recognition:
- Another approach eliminates segmentation altogether by sweeping the recognizer over every possible location on the input image.
- This takes advantage of the recognizer’s character-spotting property: its ability to correctly identify well-centered characters in its input field, even when surrounded by other characters.
- The outputs from the sweeping process are fed into a Graph Transformer Network (GTN), which applies linguistic constraints to extract the most likely interpretation of the input.
- This GTN is conceptually similar to Hidden Markov Models (HMM), reminiscent of techniques used in speech recognition.
- This approach is computationally intensive, but Convolutional Neural Networks significantly reduce the computational cost, making it an attractive solution.
Key Takeaways:
- Neural networks, especially CNNs, are highly effective for handwriting recognition, particularly when recognizing individual characters.
- Segmenting characters from a word is a major challenge due to natural variations in handwriting.
- Two primary solutions:
- Training systems on whole strings of characters with gradient-based methods that minimize a string-level loss function.
- Avoiding segmentation with a sweeping recognition technique and applying linguistic models (GTN) to interpret the sequence of characters.
- CNNs help mitigate the computational cost of these advanced methods, making handwriting recognition feasible at a large scale.
SECTION - II
A. Convolutional Networks
Convolutional Networks combine three architectural ideas to ensure some degree of shift, scale, and distortion invariance:
- Local Receptive Fields:
- In a Convolutional Network, each neuron is connected to only a small, localized region of the input image or feature map, known as the local receptive field.
- This concept mimics the structure of biological visual systems, where neurons respond to specific parts of the visual field.
- By limiting the connections to local regions, the network focuses on extracting elementary features such as edges, corners, or textures from small patches of the input.
- As the layers progress, these features are combined to detect higher-level structures (e.g., shapes or objects).
- This helps the network to capture localized patterns while reducing computational complexity.
- Shared Weights (or Weight Replication):
- In Convolutional Networks, a single set of weights (i.e., a filter or kernel) is used for multiple locations across the input. This concept is called weight sharing or weight replication.
- This ensures that the same feature (e.g., an edge or corner) is detected regardless of its position in the image, contributing to shift invariance.
- By applying the same filter across different parts of the image, the network can efficiently detect patterns while significantly reducing the number of parameters it needs to learn.
- This improves generalization and reduces the risk of overfitting, as fewer parameters need to be optimized.
- Spatial or Temporal Sub-Sampling:
- Sub-sampling (also known as pooling) reduces the dimensionality of the feature maps while preserving important information.
- Spatial sub-sampling refers to down-sampling in the spatial domain (i.e., across the width and height of an image), often done through techniques such as max pooling (keeping the most salient value in each region) or average pooling (averaging over each region).
- Temporal sub-sampling applies the same principle to sequential data (e.g., videos or audio), reducing the time resolution while maintaining essential temporal patterns.
- Sub-sampling introduces a degree of scale and distortion invariance by making the network less sensitive to small variations or shifts in the input.
- It also helps in reducing computational load and memory usage while preventing overfitting by simplifying the representation passed to deeper layers.
Summary:
- Local receptive fields focus on extracting localized features.
- Shared weights enable pattern detection across different regions of the input, ensuring shift invariance.
- Spatial or temporal sub-sampling reduces dimensionality and enforces robustness to small variations in scale and distortion, allowing the network to focus on key features across different layers.
Together, these three ideas enable Convolutional Networks to efficiently learn and recognize patterns with robustness to common variations in real-world data, such as shifts, distortions, or changes in scale.
An interesting property of convolution layers is that if the input image is shifted, the feature map output will be shifted by the same amount, but will be left unchanged otherwise. This property is at the basis of the robustness of convolutional networks to shifts and distortions of the input.
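To illustrate local receptive fields, shared weights, and sub-sampling together, here is a minimal NumPy sketch. The 8x8 input, the 3x3 kernel, and the random values are arbitrary choices for illustration, not the paper's implementation.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """One feature map: the same (shared) kernel is applied at every position,
    so each output unit sees only a small local receptive field of the input."""
    H, W = image.shape
    k = kernel.shape[0]
    out = np.zeros((H - k + 1, W - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + k, j:j + k] * kernel)
    return out

def avg_pool_2x2(fmap):
    """Sub-sampling: average over non-overlapping 2x2 regions, halving the resolution."""
    H, W = fmap.shape
    return fmap[:H // 2 * 2, :W // 2 * 2].reshape(H // 2, 2, W // 2, 2).mean(axis=(1, 3))

image = np.random.rand(8, 8)
kernel = np.random.randn(3, 3)              # one shared set of weights
feature_map = conv2d_valid(image, kernel)   # 6x6 feature map
pooled = avg_pool_2x2(feature_map)          # 3x3 after sub-sampling
```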
Reducing Sensitivity to Feature Location via Sub-Sampling
- Imprecision in Exact Feature Location:
- Once a feature (e.g., an edge, corner, or endpoint) has been detected in an image, its exact location becomes less critical.
- What matters is the approximate position of the feature relative to other features, not its precise coordinates.
- For instance, identifying a ‘7’ from its components (an endpoint of a horizontal segment in the upper left, a corner in the upper right, and a vertical segment in the lower area) relies on recognizing the presence of these parts, rather than their exact placement.
- Importance of Relative Position:
- The relative arrangement of features is key to identifying patterns, but minor variations in feature positions (due to noise, transformations, or distortions) are expected in different instances of the same character.
- Hence, encoding the precise location of features can be counterproductive, as slight shifts between instances of the same pattern (e.g., different handwriting styles) can lead to errors in recognition.
- Sub-Sampling Layers:
- A sub-sampling layer addresses this problem by reducing the spatial resolution of the feature map.
- Sub-sampling (often done via pooling operations) performs local averaging or selects the most prominent value (e.g., in max pooling) over a small region, thus downsampling the feature map.
- This reduces the network’s sensitivity to small shifts, distortions, or variations in the exact position of detected features, ensuring shift and distortion invariance.
- Benefits of Sub-Sampling:
- By reducing the resolution of the feature map, sub-sampling helps focus on the presence and relative positions of features without being overly sensitive to their precise locations.
- This makes the network more robust and less prone to errors due to minor positional changes in the input image, such as variations in handwriting.
- Additionally, sub-sampling helps in reducing computational complexity and improving the network’s generalization by simplifying the data passed to deeper layers.
Summary:
Sub-sampling layers help reduce the precision with which feature locations are encoded by lowering the spatial resolution of feature maps. This makes the network less sensitive to small shifts and distortions, which is crucial for tasks like character recognition, where exact feature positions are less important than their relative arrangement.
Convolution and Sub-Sampling Layers in Neural Networks
- Trainable Coefficients and Bias:
- In sub-sampling layers, a trainable coefficient and bias control the effect of the sigmoid non-linearity.
- If the coefficient is small, the sigmoid operates in its roughly linear region and the unit works in a quasi-linear mode, merely blurring the input.
- If the coefficient is large, the sigmoid is driven into its strongly non-linear region and the unit performs a more complex function, akin to a "noisy OR" or a "noisy AND", depending on the value of the bias. This introduces non-linearity into the sub-sampling process and affects how the network aggregates information.
- Layer Architecture: Alternating Convolution and Sub-Sampling:
- Convolutional layers extract features by applying filters to the input data, and sub-sampling layers (often pooling layers) reduce the spatial resolution while preserving essential information.
- These layers are typically alternated in deep neural networks, creating a “bi-pyramid” structure:
- At each successive layer, the number of feature maps increases, while the spatial resolution (the size of each feature map) decreases.
- This results in progressively richer, more abstract representations of the input data as the spatial details get reduced.
- Inspiration from Biological Systems:
- This convolution/sub-sampling combination is inspired by the biological findings of Hubel and Wiesel, who discovered “simple” and “complex” cells in the visual cortex of animals. Simple cells respond to specific features, such as edges or orientations, while complex cells integrate these responses over larger areas to detect more complex patterns.
- This idea was first implemented in the Neocognitron by Fukushima, although it lacked a globally supervised learning procedure like backpropagation at that time.
- Achieving Invariance through Layered Reduction:
- By progressively reducing the spatial resolution while increasing the richness of the representation (i.e., the number of feature maps), a large degree of invariance to geometric transformations of the input can be achieved.
- This means that the network becomes less sensitive to shifts, rotations, or distortions in the input, while still being able to accurately identify important patterns or features.
- Summary:
- Sub-sampling layers are controlled by coefficients and biases that adjust their non-linearity.
- Alternating between convolution and sub-sampling layers creates a structure that increases the richness of the data representation while reducing spatial resolution, allowing the network to maintain important information while becoming invariant to distortions.
- This architecture is inspired by biological vision systems and allows for robust pattern recognition in tasks like image processing.
Capacity of the Network
The capacity of a machine learning model refers to its ability to learn a wide range of functions or mappings from inputs to outputs. A model with high capacity has the potential to fit very complex functions and can capture more intricate patterns in the data. However, high capacity also means the model is more prone to overfitting, especially if the number of training examples is limited.
- Weight sharing reduces the capacity of the network because it limits the number of trainable parameters.
- This reduction in capacity is beneficial because it ensures the network doesn’t learn overly complex patterns that only apply to the training data, and instead focuses on more generalizable patterns that work across both training and test data.
Time-Delay Neural Networks (TDNNs)
- TDNNs are a special type of convolutional network designed to handle temporal data. While standard CNNs are applied to spatial dimensions (e.g., image data), TDNNs apply the same idea of weight sharing along a temporal dimension.
- In CNNs, weights (filters) are shared across different parts of an image. In TDNNs, weights are shared along time steps (one-dimensional temporal data), allowing them to process sequential data.
- Temporal data: Refers to data that unfolds over time, such as audio, video, or sequences of text or handwriting.
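As a minimal sketch of weight sharing along time, here the same kernel is applied at every time step of a one-dimensional sequence. The sequence length, frame dimension, and kernel width are illustrative assumptions.

```python
import numpy as np

def temporal_conv(sequence, kernel):
    """TDNN-style 1-D convolution: the same kernel (shared weights) is applied
    at every time step of the input sequence (shape: time x features)."""
    T, F = sequence.shape
    k = kernel.shape[0]               # the kernel covers k consecutive time steps
    out = np.zeros(T - k + 1)
    for t in range(T - k + 1):
        out[t] = np.sum(sequence[t:t + k, :] * kernel)
    return out

sequence = np.random.rand(50, 12)     # e.g., 50 time frames of 12 features each
kernel = np.random.randn(3, 12)       # one shared set of weights spanning 3 frames
activations = temporal_conv(sequence, kernel)   # 48 activations, one per time position
```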
Applications of TDNNs
- Phoneme recognition: TDNNs have been used to recognize phonemes, the basic units of sound in speech, without using sub-sampling (where the network reduces the resolution of its input). Recognizing phonemes is crucial for tasks like speech recognition.
- Spoken word recognition: TDNNs have also been used to recognize whole words from spoken input. In this case, sub-sampling is used to reduce the complexity of the input data, while still maintaining key features for word identification.
- Online recognition of handwritten characters: TDNNs have been applied to recognize characters as they are being written in real-time, capturing the temporal dynamics of handwriting.
- Signature verification: In this task, TDNNs are used to verify the authenticity of signatures by analyzing the temporal sequence of strokes.
B. LeNet5
LeNet-5 is a Convolutional Neural Network (CNN) designed for image classification tasks, particularly handwriting recognition. This architecture has played a foundational role in the development of deep learning models used in image recognition. Below is a detailed breakdown of the seven layers of LeNet-5, the principles governing its design, and the rationale behind its architecture.
1. Input Layer
- Input Size: The network accepts 32x32 pixel grayscale images. This is larger than a typical character in handwriting datasets, where characters are at most about 20x20 pixels, centered in a 28x28 field. The larger input ensures that distinctive features such as stroke endpoints or corners can appear near the center of the receptive fields of the highest-level feature detectors (since no padding is used, important features should not lie at the image boundary).
- Pixel Value Normalization: The input pixel values are normalized such that:
- The background (white) corresponds to a value of -0.1.
- The foreground (black) corresponds to a value of 1.175.
- This normalization ensures that the input data has a mean close to 0 and a variance of around 1, which helps in speeding up the learning process.
2. Layer C1 (Convolutional Layer)
- Number of Feature Maps: 6
- Filter Size: Each unit in the feature maps is connected to a 5x5 neighborhood in the input image.
- Output Size: Since the input is 32x32 pixels, and the receptive field is 5x5, the output feature maps have a size of 28x28 pixels. This smaller size prevents the receptive fields from falling off the image boundaries.
- Trainable Parameters: 156
- Total Connections: 122,304
- This layer applies convolution to detect basic patterns such as edges, corners, or textures.
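These counts follow directly from the sizes above: trainable parameters = 6 feature maps × (5×5 weights + 1 bias) = 156, and connections = (5×5 + 1) inputs per unit × 6 × 28 × 28 output units = 122,304.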
3. Layer S2 (Sub-sampling Layer)
- Number of Feature Maps: 6 (same as C1)
- Receptive Field Size: Each unit in S2 takes input from a 2x2 non-overlapping region in the corresponding feature map in C1.
- Operation: The values from the 2x2 region are summed, multiplied by a trainable coefficient, added to a trainable bias, and passed through a sigmoid activation function.
- Output Size: Since each unit subsamples a 2x2 area, the output feature maps are of size 14x14 pixels (half the dimensions of C1).
- Trainable Parameters: 12
- Total Connections: 5,880
- Sub-sampling reduces the spatial resolution while preserving the most important information, helping to create invariance to small translations in the input.
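A minimal NumPy sketch of this S2 operation (the coefficient and bias values below are arbitrary; in the real network they are learned). With one coefficient and one bias per map, the six maps account for the 12 trainable parameters listed above.

```python
import numpy as np

def s2_subsample(fmap, coeff, bias):
    """S2-style unit: sum each non-overlapping 2x2 region of one feature map,
    multiply by a single trainable coefficient, add a single trainable bias,
    and pass the result through a sigmoid."""
    H, W = fmap.shape
    summed = fmap[:H // 2 * 2, :W // 2 * 2].reshape(H // 2, 2, W // 2, 2).sum(axis=(1, 3))
    return 1.0 / (1.0 + np.exp(-(coeff * summed + bias)))

c1_map = np.random.rand(28, 28)                     # one 28x28 C1 feature map
s2_map = s2_subsample(c1_map, coeff=0.5, bias=0.0)
print(s2_map.shape)                                  # (14, 14)
```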
4. Layer C3 (Convolutional Layer)
- Number of Feature Maps: 16
- Filter Size: Each unit in C3 is connected to a 5x5 neighborhood in a subset of feature maps from S2.
- Connection Scheme: Unlike C1, which connects all feature maps to the input, not every feature map in C3 is connected to every feature map in S2. This non-complete connection helps control the number of parameters and breaks symmetry in the network, encouraging each feature map to learn different representations.
- For example, some feature maps in C3 are connected to contiguous subsets of 3 or 4 feature maps from S2, while others may receive input from non-contiguous subsets or even all feature maps.
- Output Size: 10x10 pixels (after convolution).
- Trainable Parameters: 1,516
- Total Connections: 151,600
- This layer combines the outputs of multiple S2 feature maps, enabling the network to learn more complex patterns by combining simpler ones.
5. Layer S4 (Sub-sampling Layer)
- Number of Feature Maps: 16 (same as C3)
- Receptive Field Size: Similar to S2, each unit in S4 takes input from a 2x2 non-overlapping region in the corresponding feature map in C3.
- Output Size: The output feature maps have a size of 5x5 pixels.
- Trainable Parameters: 32
- Total Connections: 2,000
- This layer further reduces the spatial resolution, capturing important spatial relationships while discarding less significant details.
6. Layer C5 (Convolutional Layer)
- Number of Feature Maps: 120
- Filter Size: Each unit in C5 is connected to a 5x5 neighborhood on all 16 feature maps from S4.
- Full Connection: Since the input to this layer (S4) has a size of 5x5, and the filter size is also 5x5, each C5 feature map is fully connected to the input feature maps.
- Output Size: 1x1 (the 5x5 filter exactly covers the 5x5 S4 feature maps, so each C5 feature map reduces to a single unit).
- Trainable Parameters: 48,120
- Although C5 is fully connected to S4, it is still considered a convolutional layer rather than a fully-connected layer because, if the input size were increased, the size of C5’s feature maps would also increase. This characteristic distinguishes it from a typical fully-connected layer.
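A quick check of the parameter count: each of the 120 C5 feature maps has one 5×5 kernel for each of the 16 S4 maps plus a bias, i.e. 120 × (16 × 5 × 5 + 1) = 120 × 401 = 48,120.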
7. Layer F6 (Fully-Connected Layer)
- Number of Units: 84
- Connections: Each unit in F6 is fully connected to all 120 units in C5.
- Trainable Parameters: 10,164
- The number 84 is derived from the design of the output layer and its connection to the previous layer (explained later). In this layer, similar to classical neural networks, units compute a dot product between their input vector and weight vector, followed by a bias term and a sigmoid activation function.
- The activation function used is a scaled hyperbolic tangent (a sigmoid), $f(a) = A \tanh(S a)$, where **A = 1.7159** and **S** controls the slope at the origin. This choice provides a smooth non-linearity while keeping the units in a non-saturated regime for faster learning.
8. Output Layer (Radial Basis Function - RBF)
- RBF Units: The output layer consists of Euclidean Radial Basis Function (RBF) units, one for each class in the classification task.
- Number of Inputs: Each RBF unit takes input from all 84 units in F6.
- Operation: Each RBF unit computes the Euclidean distance between its input vector and a parameter vector (associated with a specific class). The further the input vector is from the RBF’s parameter vector, the larger the output of the RBF.
- This output can be interpreted as a penalty term measuring the fit between the input pattern and the model of the class.
- Parameter Vectors: These vectors are chosen by hand and kept fixed initially. They are designed as stylized images of the corresponding character class, drawn on a 7x12 bitmap (hence the 84 = 7×12 inputs). Each component of these vectors is either +1 or -1.
- Because their components are +1 or -1, the parameter vectors lie within the range of the F6 sigmoid units (which prevents saturation) and sit near the points of maximum curvature, so the F6 units operate in their maximally non-linear range.
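In equation form, the output of the i-th RBF unit for an F6 activation vector x and parameter (prototype) vector w_i is the squared Euclidean distance $y_i = \sum_j (x_j - w_{ij})^2$, so a small output means the F6 state is close to the stylized prototype of class i.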
9. Rationale Behind the Output Codes
- The RBF output layer does not use a standard one-hot encoding (or “1 of N” code). Instead, it uses a distributed coding scheme where each character class is represented by a unique code in a high-dimensional space.
- This coding scheme is particularly advantageous when the network has to distinguish between many classes (e.g., the full ASCII character set) or needs to reject non-characters. In cases of confusable characters (e.g., the digit “0” vs. the letter “O”), their corresponding output codes are similar, which makes it easier for a linguistic post-processor to correct classification errors.
- Distributed codes also help prevent problems associated with the saturation of sigmoid units, which occurs when using a large number of output classes in a one-hot scheme.
Summary of Key Concepts:
- LeNet-5 Architecture: A deep CNN designed for image classification with alternating convolutional and sub-sampling layers, followed by fully-connected layers and an RBF output layer.
- Layer Specialization: Each layer extracts progressively more complex features, starting with basic edges and textures in early layers (C1), and moving to more abstract representations in higher layers (C3, C5).
- Sub-sampling Layers: These reduce spatial dimensions while retaining the most important information, helping the model achieve invariance to small translations.
- RBF Output Layer: This layer uses Euclidean distance to classify inputs against pre-defined parameter vectors that represent the stylized appearance of each class in the dataset. This avoids common pitfalls of one-hot encoding, such as saturation of sigmoid units when dealing with many classes.
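For orientation, here is a simplified modern sketch of the architecture in PyTorch. It is not a faithful reproduction of the original: it uses plain average pooling instead of the trainable sub-sampling coefficients, full C3 connectivity instead of the partial connection scheme, a standard tanh instead of the scaled tanh, and a linear output layer instead of the RBF units. The shapes in the comments match the layer sizes described above.

```python
import torch
import torch.nn as nn

class LeNet5Sketch(nn.Module):
    """Simplified LeNet-5-style network (see caveats above)."""
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5),     # C1: 1x32x32 -> 6x28x28
            nn.Tanh(),
            nn.AvgPool2d(2),                    # S2: -> 6x14x14
            nn.Conv2d(6, 16, kernel_size=5),    # C3: -> 16x10x10 (full connectivity here)
            nn.Tanh(),
            nn.AvgPool2d(2),                    # S4: -> 16x5x5
            nn.Conv2d(16, 120, kernel_size=5),  # C5: -> 120x1x1
            nn.Tanh(),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(120, 84),                 # F6
            nn.Tanh(),
            nn.Linear(84, num_classes),         # replaces the RBF output layer
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))

model = LeNet5Sketch()
logits = model(torch.randn(1, 1, 32, 32))       # batch of one 32x32 grayscale image
```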
Loss Function in LeNet-5 Architecture
The LeNet-5 architecture uses a Maximum Likelihood Estimation (MLE) criterion as its basic loss function, which for the Radial Basis Function (RBF) output layer is equivalent to Minimum Mean Squared Error (MSE). Minimizing this criterion drives down the penalty (the output) of the RBF unit that corresponds to the true class of each input pattern.
1. Formulation of the Loss Function
- The MSE-based loss over a training set of P patterns can be written as $E(W) = \frac{1}{P} \sum_{p=1}^{P} y_{D_p}(Z^p, W)$, where $y_{D_p}(Z^p, W)$ is the output (penalty) of the RBF unit associated with the correct class $D_p$ of the input pattern $Z^p$, and W denotes the trainable parameters of the network. Although this criterion is useful for reducing classification error, it has limitations when training the network, especially in terms of class separability and network stability.
2. Limitations of MSE as a Loss Function
- Collapsing Phenomenon: When RBF parameters are allowed to adapt, a trivial solution arises where all RBF parameter vectors become identical. This makes the state of F6 constant, irrespective of the input, resulting in all RBF outputs converging to zero. Consequently, the network no longer responds correctly to different inputs.
- Lack of Class Competition: MSE does not introduce competition between the classes, which is crucial for enhancing classification performance by pushing incorrect classes away while favoring the correct class.
3. Introducing the Maximum a Posteriori (MAP) Criterion
- To counter these issues, a Maximum a Posteriori (MAP) criterion is employed. This criterion is more discriminative and resembles the Maximum Mutual Information criterion used in training some other recognition architectures.
- The MAP criterion maximizes the posterior probability of the correct class (equivalently, it minimizes the negative log-probability of the correct class), given that the input may also come from a background "rubbish" class. It introduces a competitive aspect by also penalizing incorrect classes:
$E(W) = \frac{1}{P} \sum_{p=1}^{P} \left( y_{D_p}(Z^p, W) + \log\left( e^{-j} + \sum_i e^{-y_i(Z^p, W)} \right) \right)$
where:
- The first term pulls down the penalty of the correct class.
- The second (log-sum) term plays a competitive role: it pushes up the penalties of the incorrect classes and ensures the total loss remains positive.
- The positive constant j prevents penalties that are already very large from being pushed further up, keeping the penalty distribution balanced.
- The inclusion of this competition prevents the collapsing effect by keeping the RBF centers distinct, maintaining class separability.
4. Gradient Computation and Backpropagation
- The loss function gradient with respect to the network weights is computed using backpropagation, but with slight modifications for the weight-sharing structure. In practice, the derivatives are first calculated as if the network had no weight sharing. Then, the partial derivatives of all connections sharing the same parameter are aggregated to get the true derivative for that parameter.
- This approach allows efficient gradient computation and helps maintain computational efficiency despite the network’s large size.
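A toy illustration of this rule (the numbers are made up): compute the per-connection derivatives as if the weights were independent, then sum those that share a parameter.

```python
import numpy as np

# Suppose three connections share one parameter w. First compute the
# per-connection derivatives as if the weights were independent, then sum
# them to obtain the true derivative with respect to the shared parameter.
per_connection_grads = np.array([0.12, -0.05, 0.30])  # dE/dw_k for each connection k
shared_grad = per_connection_grads.sum()               # dE/dw for the shared parameter
```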
5. Additional Training Techniques
- Efficient training of such a large architecture necessitates certain advanced techniques. These include:
- Modified sigmoid functions to ensure that units operate within a non-saturated regime.
- Appropriate weight initialization strategies to avoid issues like vanishing or exploding gradients.
- A stochastic diagonal Levenberg-Marquardt approximation for optimization, which adapts a separate learning rate for each parameter using estimates of the diagonal second derivatives (a sketch follows below).
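A hedged sketch of the per-parameter step-size rule behind this approach (the symbol names eta, mu, h_kk and the numerical values below are assumptions for illustration): each parameter k is updated with step size eta / (mu + h_kk), where h_kk is a running estimate of the k-th diagonal entry of the Hessian.

```python
import numpy as np

def diag_lm_step(w, grad, h_diag, eta=0.01, mu=0.02):
    """Per-parameter update: larger curvature estimates (h_diag) give smaller steps,
    while mu keeps the step size bounded when curvature is near zero."""
    return w - (eta / (mu + h_diag)) * grad

w = np.array([0.3, -1.2])
grad = np.array([0.5, 0.1])
h_diag = np.array([2.0, 0.05])   # curvature estimates for each parameter
w = diag_lm_step(w, grad, h_diag)
```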