The goal of this project is to reproduce the methods and experiments of the following paper:
C. Cortes, X. Gonzalvo, V. Kuznetsov, M. Mohri, S. Yang AdaNet: Adaptive Structural Learning of Artificial Neural Networks. We will try to reproduce their method that consists in building neural networks whose structure is learned and optimized at the same time as it’s weights.This method will be applied to a binary classification task on images from the CIFAR-10 dataset.
How does Adanet work?
The network’s structure is generated. The full network, which we will call AdaNet, is augmented at each iteration by a sub-network. This sub-network is chosen in regards to it’s effect on an objective function. This allows us to select the best possible sub-network to add to AdaNet at each step. To be able to understand how this process works we will be using the following notation: the AdaNet network at step , and the candidate sub-networks at step . For each of these sub-networks, is the layer and the neuron in layer . is the maximum number of layers in the sub-network, so .
The algorithm, executes the following steps:
Network initialization: The input and output layers are generated.
Candidate sub-networks generation: 2 candidate sub-networks, and , are generated (cf. figure below):
One with a depth identical to the sub-network selected in the previous step
One that is deeper by 1 level than the previously selected sub-network
Both of these sub-networks follow the same connectivity rules. The first layer of the candidate sub-networks is connected to the input layer of , the last layer of the candidate sub-networks, is connected to the output layer of . For the intermediary layers, each layer is connected to the layer below , and can be conncted (randomly) to any of the lower layers of th epreviously selected sub-networks, that is to say .
During the first iteration, only the sub-network that augments the depth can be generated since .
Sub-network choice: From the 2 candidate sub-networks and , we chose the one that minimises the most the following objective function:
being the training samples, their labels, the number of training samples, the weights of the candidate sub-network and either the exponential or the logistic function. is a regularization term left at during experimentation.\ If , the chosen sub-network is and .
Stop condition: Once the best sub-network is selected, we see if it ameliorates AdaNet. If the objective function of the training of (which is not the same as the one used to select the best candidate sub-network) is better with the sub-network rather than without, then: . Otherwise is returned and the algorithm is stopped. If this stop condition is never met, the algorithm stops after iterations and returns .
It is easy to see that the generation of the network as well as the weights optimization occurs iteratively at each step of the algorithm.
Both of the candidate networks generated at step (red), one of them with a depth of (a), like the previously selected sub-network (blue), and one of depth that adds a depth level. Both of these potential AdaNet networks are the ones that will be evaluated.
One of two experiments on different datasets had to be chosen:
The choice was made on the size of the respective datasets. If we concentrate on a given binary classification task, CIFAR has samples per class, whereas the Criteo dataset has million. Therefore, considering our limited resources, the choice was made to work on the CIFAR-10 dataset.
All programs were coded in
Python using the
Keras API for neural networks because of it’s simple syntax and high modularity. The
TensorFlow backend is used but it has no influence on the
Ambiguities in the paper
Multiple input tensors management
When a layer, is connected (on the input side), to more than one layer of the other sub-networks, it is unclear if the tensors are:
It is not explicitly stated wether the sub-networks are regularized using an or norm. The usage of term leads us to believe that an norm was used. It is also unclear wether it is the loss function or the weights during training that are regularized. We have chosen to the weights of each layer of the sub-networks as well as the general output layer with an norm.
Input data pre-processing
To be able to use images outside of a Convolutional Neural Network framework, we vectorize the images to vector of size
Input and output layers architecture
The input layer is of size , which is equal to in the case of CIFAR-10.
The output layer is composed of a single neuron with a sigmoid activation function, since the classification task is binary.
We chose to generate the sub-networks randomly. Each layer is connected to the preceding layer in it’s own sub-network, however connections to the lower layers in previously selected sub-networks are made randomly.
being the training samples, their labels, the number of training samples, the weights of the candidate sub-network and the exponential function.
is a regularization term left at during experimentation.
We interpret the formula as the following one:
We have tried to be as faithful as possible with regrads to the implementation in the paper. We had to make some decisions however, when the paper did not explicitly state what method was used.
Generation and selection of sub-networks
We give the algorithm a reps parameter, allowing us to specify how many candidate sub-networks are generated a each iteration. Seeing as there are only 2 types of sub-networks generated (equal depth or depth + 1), we have chosen to generate same depth sub-networks for the first generation steps, and deeper sub-networks for the remaining generation steps.
Connections between the different layers follow the connectivity ules laid out earlier, that is to say: the last layer of the sub-network is connected to the output layer, the first layer of the sub-network is connected to the input layer and the intermediary layers are connected to each other within a sub-network. Variability between the different candidate sub-networks lies in their connectivity to the previously selected sub-networks. For each layer of depth of the considered candidate sub-network, a random number of connections to the layers of depth of previously selected sub-networks are established.
The number of neurons by layer is determined at the beginning of the algorithm by a parameter. So all of the variability come from the random connections.
All the candidate sub-networks are inserted into the AdaNet network of the previous step to be evaluated on a test set, after training, using the objective function defined in [subsec:fobj]. The lowest result of this function is stored as well as the corresponding AdaNet network, and updated if a better candidate is found at subsequent steps. At the end of the evaluation of all candidates, the best AdaNet model is saved to diskand used as the base model for the next iteration of the algorithm.
Multiple input management implementation
As mentionned in [subsec:mult], it is not stated in the paper how multiple input connections to a layer are managed. If we concatenate all the input tensors the size of the input tensor for the output layer changes over iterations. This can lead to problems when loading weights from a previous network, since the saved tensor and the new one are not the same size. One solution would be to not save the weights for these concatenation layers, however this could lead to a decrease in performance. We have therefore chosen to add the input tensors for such layers and keep the existing weights.
Since the structure of the network is constantly changing, we cannot save the models directly to disk, since once compiled in
Keras they are not modifiable. To be able to bypass this issue, we store the exact topology of the best model in a dictionnary at each iteration. This dictionnary is then used to rebuild the network from scratch at the next iteration. Once reconstructed, we load the weights of the previous model in the common layers between and . The dictionnary is easily serializable since it is only text.
In the paper, it is stated that the algorithm stops when the addition of a new sub-network doesn’t reduce the objective function. Cosidering our limited computing resources, we made the choice to stop when the decrease in the objective function was under a certain threshold. The training of the various sub-networks was also submitted to a similar stop condition when the classification error did not decrease anymore.
We were unable to redo all the hyperparameter optimizations that were done in the paper, therefore we have chosen a set of hyperparameters for our experiments:
, the number of neurons in each layer
, the normalization parameter
, the learning rate
, the number of iterations
, the number of sub-networks generated at each iteration.
We then executed our algorithm on different binary classification tasks extracted from the CIFAR-10 dataset (identical to the ones used in the paper), measured the classification accuracy on a test set 10 times, to have an average performance as well as standard deviation. It is important to note that CIFAR-10 is already separated in training and testing sets. The results are shown in the following table.
|classification task||our results||paper results|
The results vary a lot between different tasks. This can be explained by the fact that no hyper-parameter optimization was done. Therefore the chosen hyper-parameters favor certain tasks (deer-horse) and lead to poor results for others.
We have strived to reproduce the results from the AdaNet paper. However we were forced to make some choices, notably on the sub-network generation steps, as well as forego any hyper-parameter optimization (it would have required implementation of an additional sophisticated method). Therefore, the results we have obtained, apart from the deer-horse classification task, are significantly lower than the one found in the paper. Certain results (deer-truck, automobile-truck and cat-dog) seem biased since we have obtained accuracies lower than with balanced classes. In anycase, better results can be obtained with transfer learning (accuracy ), so the benefits of this method are unclear. This method is very costly in computing time (due to the generation of a large number of sub-networks). However if we do not have a pre-trained network this method might yield results more quickly than when using convolutional neural networks, and could be interesting.
All the code can be found in this repository