Hello, I'm

Carles Riera

I want to answer why Deep Neural Networks work

I am a Deep Learning indie predoc researcher at Universitat de Barcelona.

My research covers Neural Architecture Search and the Lottery Ticket Hypothesis, and I’m particularly interested in Minimum Width Networks and Zero Initialization.

About me

I am an unconventional predoc researcher at Universitat de Barcelona. I’m more interested in gaining insight into the mechanics (and dynamics) of Neural Networks than in squeezing out 0.0001% more accuracy. Neither hyperparameter tuning nor dataset + model articles are my cup of tea.

Currently, I’m working on training neural networks with minimum width (in terms of units per layer) and on figuring out how to train with initial parameters set to zero (zero initialization). I’m also interested in the Lottery Ticket Hypothesis and Neural Architecture Search (NAS).

Interests

Lottery Ticket

The Lottery Ticket Hypothesis conjectures that every randomly initialized network contains a subnetwork that is responsible for solving the problem. Such a subnetwork can be found using traditional pruning techniques.

What I find interesting is how this subnetwork can match (and even surpass) the performance of the whole network when retrained, but only if it is reinitialized to the same values it originally had in the whole network. When randomly reinitialized, it won’t work. The conclusion is that initialization is the key (hence the name Winning Lottery Ticket, or LT).

I find this phenomenon fascinating, as it suggests that overparametrization is only required at initialization. This opens up the possibility of massive savings in parameters, if we could find a way to train this subnetwork from scratch.
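
For concreteness, tickets are usually found by iterative magnitude pruning with a rewind to the original initialization. Below is a minimal PyTorch sketch under simplifications of my own (the mask is only re-applied between rounds, and train_fn is a placeholder for any training loop); it is not the exact protocol of the original paper.

```python
import copy
import torch

def find_ticket_by_magnitude_pruning(model, train_fn, prune_fraction=0.2, rounds=5):
    """Sketch of iterative magnitude pruning with rewind to the original init.

    `train_fn(model)` is assumed to train the model in place.
    """
    init_state = copy.deepcopy(model.state_dict())  # remember the original (lottery) init
    masks = {n: torch.ones_like(p) for n, p in model.named_parameters() if p.dim() > 1}

    for _ in range(rounds):
        train_fn(model)  # train the current network to convergence
        for name, param in model.named_parameters():
            if name not in masks:
                continue
            # prune a fraction of the smallest-magnitude surviving weights
            alive = param.detach().abs()[masks[name].bool()]
            threshold = torch.quantile(alive, prune_fraction)
            masks[name] *= (param.detach().abs() > threshold).float()

        # rewind: restore the original initialization, then zero out pruned weights
        model.load_state_dict(init_state)
        with torch.no_grad():
            for name, param in model.named_parameters():
                if name in masks:
                    param.mul_(masks[name])
        # NOTE: a full implementation would also enforce the mask after every
        # optimizer step so pruned weights stay at zero during training.
    return model, masks
```

The crucial detail is the rewind to `init_state`: reinitializing the surviving weights randomly instead is exactly what breaks the ticket.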

NAS and Constructive Networks

Neural Architecture Search (NAS) is a technique for automating the design of Neural Networks. This means avoiding the need to choose among a plethora of architectural hyperparameters (e.g. number and width of layers, connectivity, residual or dense connections). Instead of using Reinforcement Learning or Genetic Algorithms, I advocate for something less stochastic: constructive algorithms.

I envision something like online SVMs: you start from scratch and keep adding supports (automatically) until reaching a certain level of complexity that is a function of the regularization term. For Deep Learning, this implies adding units (in the right positions) until convergence is reached (the opposite of pruning). At first sight the problem looks like a combinatorial nightmare, yet it is a very stimulating end goal.
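
As a toy illustration of what "constructive" means here, the sketch below grows a single hidden layer one unit at a time until the validation loss stops improving. Everything in it (the train_and_validate callback, the patience-based stopping rule) is an assumption of mine for the sake of the example, not a worked-out algorithm.

```python
import torch.nn as nn

def grow_hidden_layer(input_dim, output_dim, train_and_validate, max_units=64, patience=2):
    """Toy constructive sketch: grow a hidden layer unit by unit until the
    validation loss stops improving. `train_and_validate(model)` is assumed
    to train the model and return its validation loss (illustrative name).
    """
    best_loss, best_model, stalls, width = float("inf"), None, 0, 0
    while width < max_units and stalls < patience:
        width += 1
        model = nn.Sequential(
            nn.Linear(input_dim, width),
            nn.ReLU(),
            nn.Linear(width, output_dim),
        )
        loss = train_and_validate(model)  # retrain from scratch at the new width
        if loss < best_loss - 1e-4:       # meaningful improvement: keep growing
            best_loss, best_model, stalls = loss, model, 0
        else:                             # no improvement: count towards stopping
            stalls += 1
    return best_model, best_loss
```

In the SVM analogy the stopping point would be governed by a regularization term rather than a patience counter, and a real constructive algorithm would warm-start the already-placed units instead of retraining from scratch at every width.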

Minimum Width Networks

Minimum Width Networks are networks whose layers have the fewest units needed to solve a problem. They are a stepping stone for both NAS and the Lottery Ticket Hypothesis. However, choosing the number of units per layer remains an open problem in Deep Learning, even though some guidelines exist in the form of, for example, EfficientNet or SimpNet. Ultimately, the number of units is still chosen empirically, guided by test error.

My interest is in finding a characterization of the capacity of layers (hopefully extensible to the entire network) that matches the complexity of the dataset (perhaps through some geometrical measure). To do so, I first need to answer the question of why it is so difficult to train thin neural networks. In light of the Lottery Ticket Hypothesis, we know that thinner networks should work.

Zero Initialization

In order to develop a constructive algorithm for NAS, it is desirable to set the weights initially to zero. However, this is not possible with the training algorithms currently used in Deep Learning: with all weights at zero, gradient descent has no way to break the symmetry between units, and learning stalls.

I’m under the impression that random initialization introduces an inductive bias (particularly with the Lottery Ticket Hypothesis in sight) that is poorly justified. Why do we choose one random distribution over another (beyond exploding/vanishing gradient arguments)?

Instead, I advocate for zero initialization: it is clean and theoretically sounder. Hopefully, I can settle, once and for all, the debate over which initialization scheme is superior.
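
A tiny PyTorch experiment of my own illustrates why naive zero initialization fails with the standard recipe: in a zero-initialized ReLU network, every weight gradient is exactly zero (only the output bias receives a gradient), so gradient descent never moves the weights.

```python
import torch
import torch.nn as nn

# Zero-initialize a small ReLU network
model = nn.Sequential(nn.Linear(10, 16), nn.ReLU(), nn.Linear(16, 1))
for p in model.parameters():
    nn.init.zeros_(p)

# One forward/backward pass on random data
x, y = torch.randn(32, 10), torch.randn(32, 1)
loss = nn.functional.mse_loss(model(x), y)
loss.backward()

print(model[0].weight.grad)  # every entry is exactly zero: the hidden units are
                             # indistinguishable, so training cannot get started
```

Making zero initialization work therefore requires changing something else in the training procedure, which is precisely what the separation constraints in the first publication below allow.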

Publications

Superseding Model Scaling by Penalizing Dead Units and Points with Separation Constraints

Carles Riera, Camilo Rey-Torres, Eloi Puertas, Oriol Pujol

2019

In this article, we study a proposal that enables training extremely thin (4 or 8 neurons per layer) and relatively deep (more than 100 layers) feedforward networks without resorting to any architectural modification such as Residual or Dense connections, data normalization or model scaling. We accomplish this by alleviating two problems. The first is neurons whose output is zero for the whole dataset, which renders them useless; this problem is known to the academic community as dead neurons. The other is a less studied problem, dead points: data points that are mapped to zero during the forward pass of the network. As such, the gradient generated by those points is not propagated back past the layer where they die, and thus has no effect on the training process. In this work, we characterize both problems and propose a constraint formulation that, added to the standard loss function, solves them both. As an additional benefit, the proposed method allows the network weights to be initialized with constant or even zero values while still letting the network converge to reasonable results. We show very promising results on a toy dataset, MNIST and CIFAR-10.
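
As a rough illustration of the two failure modes described above, here is a small diagnostic of my own (not the separation constraints proposed in the paper) that counts dead units and dead points from the post-activation outputs of a single layer:

```python
import torch
import torch.nn as nn

def dead_units_and_points(layer, activation, data):
    """Count units that are zero for every data point (dead units) and data
    points mapped to zero by every unit of the layer (dead points)."""
    with torch.no_grad():
        out = activation(layer(data))          # (n_points, n_units) post-activation
    dead_units = (out.abs().sum(dim=0) == 0)   # unit never fires on any point
    dead_points = (out.abs().sum(dim=1) == 0)  # point killed by the whole layer
    return dead_units.sum().item(), dead_points.sum().item()

# Example on random data with an untrained, very thin ReLU layer
layer = nn.Linear(2, 4)
data = torch.randn(1000, 2)
print(dead_units_and_points(layer, torch.relu, data))
```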

  • Dead Point
  • Dead Unit
  • Model Scaling
  • Separation Constraints
Solving internal covariate shift in deep learning with linked neurons

Carles Riera, Oriol Pujol

2017

This work proposes a novel solution to the problems of internal covariate shift and dying neurons using the concept of linked neurons. We define the neuron linkage in terms of two constraints: first, all neuron activations in the linkage must have the same operating point, that is to say, all of them share input weights. Secondly, a set of neurons is linked if and only if there is at least one member of the linkage that has a non-zero gradient with respect to the input of the activation function. This means that for any input to the activation function, there is at least one member of the linkage that operates in a non-flat and non-zero area. This simple change has profound implications for the network's learning dynamics. In this article we explore the consequences of this proposal and show that by using this kind of unit, internal covariate shift is implicitly solved. As a result, the use of linked neurons allows training arbitrarily large networks without any architectural or algorithmic trick, effectively removing the need for re-normalization schemes such as Batch Normalization, which halves the required training time. It also removes the need for standardized input data. Results show that units using the linkage not only effectively solve the aforementioned problems, but are also a competitive alternative with respect to the state of the art, with very promising results.
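
The following PyTorch sketch is my reading of a linked pair in the spirit of Concatenated ReLU (the paper's exact formulation may differ): both activations share the pre-activation z = Wx + b, and for any input at least one of relu(z) and relu(-z) is in its linear regime, so gradient can always flow back through the shared weights.

```python
import torch
import torch.nn as nn

class LinkedReLUPair(nn.Module):
    """Illustrative linked pair of neurons: two ReLU activations sharing the
    same input weights, one applied to z and one to -z."""
    def __init__(self, in_features, out_features):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)  # shared input weights

    def forward(self, x):
        z = self.linear(x)
        return torch.cat([torch.relu(z), torch.relu(-z)], dim=-1)  # doubles the width

# Usage: a small stack of linked units
model = nn.Sequential(LinkedReLUPair(10, 32), nn.Linear(64, 1))
print(model(torch.randn(4, 10)).shape)  # torch.Size([4, 1])
```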

  • Neuron linkage
  • Concatenated ReLU
  • Covariate shift
An Approximate Support Vector Machines Solver with Budget Control

Carles Riera, Oriol Pujol

2016

We propose a novel approach to approximately solving online kernel Support Vector Machines (SVM) with the number of support vectors set beforehand. To this end, we modify the original formulation by introducing a new constraint that penalizes the deviation from the desired number of support vectors. The resulting problem is solved using stochastic subgradient methods with block coordinate descent. Comparison with state-of-the-art online methods shows very promising results.
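
For intuition only, here is a toy budgeted online kernel learner in NumPy. It is not the deviation-penalty formulation of the paper: it simply makes hinge-loss stochastic updates and, whenever the budget is exceeded, drops the support with the smallest coefficient magnitude.

```python
import numpy as np

def rbf(x, y, gamma=0.5):
    return np.exp(-gamma * np.sum((x - y) ** 2))

def budgeted_online_kernel_learner(X, y, budget=20, lr=0.1, reg=0.01, kernel=rbf):
    """Toy sketch: online hinge-loss updates with a hard cap on the number of supports."""
    supports, alphas = [], []
    for x_t, y_t in zip(X, y):
        f = sum(a * kernel(s, x_t) for a, s in zip(alphas, supports))
        alphas = [(1 - lr * reg) * a for a in alphas]  # shrink existing coefficients
        if y_t * f < 1:                                # margin violation: add a support
            supports.append(x_t)
            alphas.append(lr * y_t)
            if len(supports) > budget:                 # enforce the budget
                drop = int(np.argmin(np.abs(alphas)))
                supports.pop(drop)
                alphas.pop(drop)
    return supports, alphas

# Usage on a toy, non-linearly-separable problem
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.sign(X[:, 0] * X[:, 1])                 # XOR-like labels
supports, alphas = budgeted_online_kernel_learner(X, y)
print(len(supports))                           # never exceeds the budget
```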

  • SVM
  • Budget
  • #Supports Constraint
Sparse methods for Online Support Vector Machine

Carles Riera, Oriol Pujol

2016

This work proposes a new online classifier based on kernel SVM that minimizes the number of support vectors thanks to an additional restriction imposed by means of continuation methods. We build a homotopy between the standard SVM formulation and a restricted one with a threshold on the magnitude of the coefficients, tightening it iteratively. We show how the use of the continuation method helps in achieving solutions that would be infeasible to reach directly. Experiments on UCI datasets show significant savings.

  • SVM
  • Homotopy
  • Continuation
  • Pruning

Work

Computer vision engineer @ Colorsensing
10/18 - 02/19

  • Design and implementation of a visual water contamination detection system for Bluephage.
  • The system takes an image of a Bluephage test bottle and outputs whether the water is contaminated and the concentration of viruses present.
  • Various algorithms were tested, such as Random Forests, Convolutional Networks and Logistic Regression.

  • Classification
  • CNN
  • Keras
  • Sklearn
  • Matplotlib

Education

Ph.D. candidate in Mathematics and Computer Science
Universitat de Barcelona

2015 -

M.Sc. in Artificial Intelligence
Universitat Politècnica de Catalunya

2012 - 2014

B.Sc. in Computer Engineering
Universitat de Barcelona

2008 - 2012

B.A. in Philosophy
Universitat de Barcelona

2003 - 2006

Contact

I am available for hire

Send me an email if you are interested in working with me.