# Deep Learning

**Deep learning** is a set of algorithms in machine learning that attempt to learn layered models of inputs, commonly neural networks. The layers in such models correspond to distinct levels of concepts, where higher-level concepts are defined from lower-level ones, and the same lower-level concepts can help to define many higher-level concepts.

“Deep learning is just a buzzword for neural nets, and neural nets are just a stack of matrix-vector multiplications, interleaved with some non-linearities. No magic there.” (Ronan Collobert)
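Taken literally, Collobert's description can be sketched in a few lines. The following is a minimal illustration rather than any particular library's implementation: the layer sizes, the weight values, the added bias terms, and the choice of `tanh` as the non-linearity are all assumptions made for the example.

```python
import math

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]

def forward(layers, x):
    """Apply each (W, b) pair in turn: affine map, then element-wise tanh."""
    h = x
    for W, b in layers:
        z = [s + b_i for s, b_i in zip(matvec(W, h), b)]
        h = [math.tanh(z_i) for z_i in z]
    return h

# Hypothetical weights for a 3 -> 2 -> 1 network, chosen only for illustration.
layers = [
    ([[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]], [0.0, 0.1]),
    ([[1.0, -1.0]], [0.0]),
]
output = forward(layers, [1.0, 2.0, 3.0])
```

Here `output` is a one-element list. Learning would consist of adjusting the entries of each `W` and `b`, which this sketch does not attempt.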

Deep learning is part of a broader family of machine learning methods based on learning representations. An observation (e.g., an image) can be represented in many ways (e.g., a vector of pixels), but some representations make it easier to learn tasks of interest (e.g., is this the image of a human face?) from examples, and research in this area attempts to define what makes better representations and how to learn them.
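A classic toy illustration of this point is sketched below. Its setup is not part of the original text and is assumed purely for the example: two classes of points lie on concentric circles of hypothetical radii 1 and 3. In the raw Cartesian representation no single threshold on one coordinate separates the classes, whereas after re-representing each point in polar coordinates a threshold on the radius does.

```python
import math

def make_ring(radius, n):
    """Return n points evenly spaced on a circle of the given radius."""
    return [(radius * math.cos(2 * math.pi * k / n),
             radius * math.sin(2 * math.pi * k / n)) for k in range(n)]

def to_polar(point):
    """Re-represent a Cartesian point (x, y) as (radius, angle)."""
    x, y = point
    return (math.hypot(x, y), math.atan2(y, x))

def separable_by_threshold(values_a, values_b):
    """True if one threshold on a single feature splits the two classes."""
    return max(values_a) < min(values_b) or max(values_b) < min(values_a)

inner = make_ring(1.0, 16)   # class A (hypothetical data)
outer = make_ring(3.0, 16)   # class B (hypothetical data)

# Raw representation: neither coordinate alone separates the classes.
x_separable = separable_by_threshold([p[0] for p in inner], [p[0] for p in outer])
y_separable = separable_by_threshold([p[1] for p in inner], [p[1] for p in outer])

# Polar representation: the radius alone separates them.
r_separable = separable_by_threshold([to_polar(p)[0] for p in inner],
                                     [to_polar(p)[0] for p in outer])
```

In this sketch the change of representation is chosen by hand. Representation learning aims to discover such transformations from the data instead.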

The term “deep learning” gained traction in the mid-2000s after a publication by Geoffrey Hinton^{[3]}^{[4]} showed how a many-layered neural network could be effectively pre-trained one layer at a time, treating each layer in turn as an unsupervised restricted Boltzmann machine, and then fine-tuned using supervised backpropagation. The field itself, however, is much older and dates back at least to the deep Neocognitron of Kunihiko Fukushima.^{[5]} As early as 1992, Jürgen Schmidhuber showed how a multi-level hierarchy of recurrent neural networks can be effectively pre-trained (through unsupervised learning) one level at a time and then fine-tuned using backpropagation.^{[6]}

Although the backpropagation algorithm had been available for training neural networks since 1974,^{[7]} it was often considered too slow for practical use,^{[3]} due to the so-called vanishing gradient problem analyzed in 1991 by Schmidhuber’s student Sepp Hochreiter (more details in the section on artificial neural networks below). As a result, neural networks fell out of favor in practical machine learning, and simpler models such as support vector machines (SVMs) dominated much of the field in the 1990s and 2000s. However, an SVM learns a decision function that is linear in a fixed feature space (possibly one induced by a kernel), while neural network learning can be highly non-linear. In 2010 it was shown^{[8]} that plain backpropagation in deep non-linear networks can outperform all previous techniques on the famous MNIST handwritten digit benchmark, without unsupervised pre-training.
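The vanishing gradient effect can be illustrated numerically. The sketch below is a deliberate simplification and not Hochreiter's analysis: it assumes a chain of single-unit sigmoid layers sharing one hypothetical weight `w`. By the chain rule, the gradient that reaches the first layer contains one factor `w * a * (1 - a)` per layer, where `a` is that layer's activation. Because `a * (1 - a)` never exceeds 0.25, the product shrinks geometrically with depth whenever `|w| < 4`.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient_through_chain(x, w, depth):
    """Return d(output)/d(input) for `depth` stacked units h <- sigmoid(w * h)."""
    h, grad = x, 1.0
    for _ in range(depth):
        a = sigmoid(w * h)
        grad *= w * a * (1.0 - a)   # chain rule: derivative of sigmoid(w * h) w.r.t. h
        h = a
    return grad

# Hypothetical setting: input 0.5 and weight 1.0 at every layer.
gradients = {depth: gradient_through_chain(0.5, 1.0, depth)
             for depth in (1, 5, 10, 20)}
for depth, grad in gradients.items():
    print(depth, grad)
```

With these assumed values the gradient at depth 20 is bounded above by 0.25 to the power 20, so weight updates in the earliest layers become negligible, which is one way to see why training was considered slow.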

Advances in hardware have been an important enabling factor for the resurgence of neural networks and the advent of deep learning, in particular the availability of powerful and inexpensive graphics processing units (GPUs) that are also suitable for general-purpose computing. GPUs are well suited to the kind of “number crunching” involved in machine learning, and have been shown to speed up training algorithms by orders of magnitude, reducing running times from weeks to days.^{[8]}^{[9]}

Deep learning is often presented as a step towards realising strong AI^{[10]} and has attracted the attention of such thinkers as Ray Kurzweil, who was hired by Google to do deep learning research.^{[11]} Gary Marcus has expressed skepticism of deep learning’s capabilities, noting that

“Realistically, deep learning is only part of the larger challenge of building intelligent machines. Such techniques lack ways of representing causal relationships (…) have no obvious ways of performing logical inferences, and they are also still a long way from integrating abstract knowledge, such as information about what objects are, what they are for, and how they are typically used. The most powerful A.I. systems, like Watson (…) use techniques like deep learning as just one element in a very complicated ensemble of techniques, ranging from the statistical technique of Bayesian inference to deductive reasoning.”^{[3]}

Deep learning algorithms are based on distributed representations, a notion that was introduced with connectionism in the 1980s. The underlying assumption behind distributed representations is that the observed data were generated by the interactions of many factors (not all known to the observer), and that what is learned about a particular factor from some configurations of the other factors can often generalize to other, unseen configurations of the factors. Deep learning adds the assumption (seen as a prior about the unknown, data-generating process) that these factors are organized into multiple levels, corresponding to different levels of abstraction or composition: higher-level representations are obtained by transforming or generating lower-level representations. The relationships between these factors can be viewed as similar to the relationships between entries in a dictionary or in Wikipedia, although these factors can be numerical (e.g., the position of the face in the image) or categorical (e.g., is it a human face?), whereas entries in a dictionary are purely symbolic. The appropriate number of levels and the structure that relates these factors are things that a deep learning algorithm is also expected to discover from examples.
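The generalization claim can be made concrete with a small sketch. Everything in it is assumed for illustration: the three binary factor names, the set of configurations treated as seen, and the hand-written rule standing in for something a learner might acquire. With one unit per factor, three units describe eight configurations, and a rule learned about a single factor carries over to configurations that were never observed.

```python
from itertools import product

# Hypothetical binary factors describing an image; the names are illustrative only.
factors = ["is_face", "is_smiling", "wears_glasses"]

# Distributed code: one unit per factor, any combination of values allowed.
distributed_codes = list(product([0, 1], repeat=len(factors)))

# A local (one-hot) code would need one dedicated unit per configuration.
local_units_needed = len(distributed_codes)

def smiling_detector(code):
    """A rule about a single factor; it ignores the other units."""
    return code[factors.index("is_smiling")] == 1

# Suppose only these configurations were seen during training ...
seen = [(1, 1, 0), (1, 0, 0), (0, 0, 1)]
unseen = [c for c in distributed_codes if c not in seen]

# ... the single-factor rule still applies to every unseen configuration.
predictions = {c: smiling_detector(c) for c in unseen}
```

A one-hot code offers no such sharing: each of its units says nothing about any other configuration.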

Deep learning algorithms often involve other important ideas that correspond to broad a priori beliefs about these unknown underlying factors. An important prior regarding a supervised learning task of interest (e.g., given an input image, predicting the presence of a face and the identity of the person) is that among the factors that explain the variations observed in the inputs (e.g., images), some are relevant to the prediction task of interest. This is a special case of the semi-supervised learning setup, which allows a learner to exploit large quantities of unlabeled data (e.g., images for which the presence of a face and the identity of the person, if any, are not known).

Many deep learning algorithms are actually framed as unsupervised learning, e.g., using many examples of natural images to discover good representations of them. Because most of these learning algorithms can be applied to unlabeled data, they can leverage large amounts of it, even when the data cannot be associated with labels for the immediate tasks of interest.