A.I. Research — of interest…
Some (mostly recent) research that may be of interest.
Title/comment/abstract/link.
Mesh-TensorFlow: Deep Learning for Supercomputers
This paper deals with parallelism (problems & solutions). Though it focuses on supercomputers, I'm interested in parallelism solutions for smaller environments (i.e. not hundreds of cores) … I suspect there are some interesting innovations lurking here for running larger models/data sets in smaller/local environments.
Batch-splitting (data-parallelism) is the dominant distributed Deep Neural Network (DNN) training strategy, due to its universal applicability and its amenability to Single-Program-Multiple-Data (SPMD) programming. However, batch-splitting suffers from problems including the inability to train very large models (due to memory constraints), high latency, and inefficiency at small batch sizes. All of these can be solved by more general distribution strategies (model-parallelism). Unfortunately, efficient model-parallel algorithms tend to be complicated to discover, describe, and implement, particularly on large clusters. We introduce Mesh-TensorFlow, a language for specifying a general class of distributed tensor computations. Where data-parallelism can be viewed as splitting tensors and operations along the “batch” dimension, in Mesh-TensorFlow, the user can specify any tensor-dimensions to be split across any dimensions of a multi-dimensional mesh of processors. A Mesh-TensorFlow graph compiles into an SPMD program consisting of parallel operations coupled with collective communication primitives such as Allreduce. We use Mesh-TensorFlow to implement an efficient data-parallel, model-parallel version of the Transformer [16] sequence-to-sequence model. Using TPU meshes of up to 512 cores, we train Transformer models with up to 5 billion parameters, surpassing state-of-the-art results on the WMT’14 English-to-French translation task and the one-billion-word language modeling benchmark. Mesh-TensorFlow is available at https://github.com/tensorflow/mesh.
http://papers.nips.cc/paper/8242-mesh-tensorflow-deep-learning-for-supercomputers.pdf
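The core idea is easy to demo on one machine. Below is a minimal NumPy sketch (my own construction, not the mesh-tensorflow API) of a matmul laid out on a 2×2 processor mesh: the batch dimension is split along one mesh axis (data parallelism), the contracted dimension along the other (model parallelism), and a plain sum stands in for the Allreduce a real SPMD program would issue.

```python
import numpy as np

# A 2-D "mesh" of 4 simulated processors: 2 along a "batch" axis, 2 along a "hidden" axis.
MESH_ROWS, MESH_COLS = 2, 2

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 6))   # activations: [batch=8, d_in=6]
W = rng.normal(size=(6, 4))   # weights:     [d_in=6, d_out=4]

# Split the batch dimension across mesh rows (data parallelism) and the
# contracted d_in dimension across mesh columns (model parallelism).
X_shards = [np.split(xb, MESH_COLS, axis=1) for xb in np.split(X, MESH_ROWS, axis=0)]
W_shards = np.split(W, MESH_COLS, axis=0)  # each mesh column holds a slice of W

Y_parts = []
for r in range(MESH_ROWS):
    # Each processor multiplies its local shards, producing a *partial* result...
    partials = [X_shards[r][c] @ W_shards[c] for c in range(MESH_COLS)]
    # ...and an allreduce (here: a plain sum) combines partials across mesh columns.
    Y_parts.append(np.sum(partials, axis=0))

Y = np.concatenate(Y_parts, axis=0)
assert np.allclose(Y, X @ W)  # the distributed layout computes the same matmul
```

In Mesh-TensorFlow itself you just name which tensor dimensions map to which mesh dimensions and the compiler emits the parallel operations and collectives; the sketch only mimics that layout on one machine.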
Which Neural Net Architectures Give Rise to Exploding and Vanishing Gradients?
Boris Hanin did some nice work so you don’t have to … you should probably send him some money.
We give a rigorous analysis of the statistical behavior of gradients in a randomly initialized fully connected network N with ReLU activations. Our results show that the empirical variance of the squares of the entries in the input-output Jacobian of N is exponential in a simple architecture-dependent constant β, given by the sum of the reciprocals of the hidden layer widths. When β is large, the gradients computed by N at initialization vary wildly. Our approach complements the mean field theory analysis of random networks. From this point of view, we rigorously compute finite width corrections to the statistics of gradients at the edge of chaos.
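The paper’s β is trivial to compute, and the effect is easy to see empirically. Here’s a small NumPy experiment (my own, not the paper’s code) that forms the input-output Jacobian of randomly initialized ReLU nets and compares the spread of its squared entries for a wide (small-β) and a narrow (large-β) architecture:

```python
import numpy as np

def jacobian_sq_entries(widths, trials=500, rng=None):
    """Squared entries of the input-output Jacobian of a random ReLU net.

    widths = [d_in, h_1, ..., h_L, d_out], He-style init (variance 2/fan_in).
    """
    rng = rng or np.random.default_rng(0)
    samples = []
    layers = list(zip(widths[:-1], widths[1:]))
    for _ in range(trials):
        x = rng.normal(size=widths[0])
        J = np.eye(widths[0])
        for i, (fan_in, fan_out) in enumerate(layers):
            W = rng.normal(scale=np.sqrt(2.0 / fan_in), size=(fan_out, fan_in))
            pre = W @ x
            if i < len(layers) - 1:               # ReLU on hidden layers only
                mask = (pre > 0).astype(float)    # ReLU derivative at this input
                J = (W * mask[:, None]) @ J       # chain rule: accumulate D·W
                x = pre * mask
            else:                                 # linear output layer
                J = W @ J
        samples.append((J ** 2).ravel())
    return np.concatenate(samples)

for widths in [[10, 100, 100, 100, 1], [10, 10, 10, 10, 1]]:
    beta = sum(1.0 / w for w in widths[1:-1])     # sum of reciprocal hidden widths
    sq = jacobian_sq_entries(widths)
    print(f"beta={beta:.2f}  var of squared Jacobian entries={sq.var():.3f}")
```

With He-style initialization the mean squared entry stays roughly constant, so the difference between the two architectures shows up in the variance, which is the paper’s point; this is a qualitative demo consistent with the theorem, not a reproduction of the paper’s experiments.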
Relational recurrent neural networks
My guess is that progress in relational reasoning and inter-memory vector interactions will produce some very important results … particularly around some of the parallelism solutions on the horizon.
Memory-based neural networks model temporal data by leveraging an ability to remember information for long periods. It is unclear, however, whether they also have an ability to perform complex relational reasoning with the information they remember. Here, we first confirm our intuitions that standard memory architectures may struggle at tasks that heavily involve an understanding of the ways in which entities are connected — i.e., tasks involving relational reasoning. We then improve upon these deficits by using a new memory module — a Relational Memory Core (RMC) — which employs multi-head dot product attention to allow memories to interact. Finally, we test the RMC on a suite of tasks that may profit from more capable relational reasoning across sequential information, and show large gains in RL domains (e.g. Mini PacMan), program evaluation, and language modeling, achieving state-of-the-art results on the WikiText-103, Project Gutenberg, and GigaWord datasets.
http://papers.nips.cc/paper/7960-relational-recurrent-neural-networks.pdf
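The mechanism that lets memories interact is standard multi-head dot-product attention applied across memory slots. Here’s a bare-bones NumPy sketch of that one step (names, shapes, and random projections are mine; the real RMC also appends the new input to the memory before computing keys/values and wraps the result in LSTM-style gating):

```python
import numpy as np

def multi_head_attention(memory, num_heads, rng):
    """Memories attend over each other: one interaction step, RMC-style.

    Projections are random here; in a trained model they'd be learned.
    """
    n_slots, d_model = memory.shape
    d_head = d_model // num_heads
    heads = []
    for _ in range(num_heads):
        Wq, Wk, Wv = (rng.normal(scale=d_model ** -0.5, size=(d_model, d_head))
                      for _ in range(3))
        Q, K, V = memory @ Wq, memory @ Wk, memory @ Wv
        scores = Q @ K.T / np.sqrt(d_head)              # slot-to-slot affinities
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)  # softmax over slots
        heads.append(weights @ V)                       # each slot reads from all slots
    return np.concatenate(heads, axis=-1)               # [n_slots, d_model]

rng = np.random.default_rng(0)
memory = rng.normal(size=(8, 64))       # 8 memory slots, 64-dim each
updated = multi_head_attention(memory, num_heads=4, rng=rng)
print(updated.shape)                    # (8, 64)
```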
Multichannel Raw Waveform Neural Network Acoustic Models
A little old, and I’m not sure where things stand now, but this is interesting work around implicit beamforming with neural nets.
https://drive.google.com/file/d/0ByfBg9vVsBJ-UlAzMEJkRmtXYkU/view
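“Implicit beamforming” here means the first layer learns one FIR filter per input channel and sums the filtered channels, i.e., filter-and-sum beamforming learned end-to-end. Below is a tiny NumPy sketch of just that operation (my illustration of the idea; random filters stand in for learned ones, and the channel/filter counts are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
n_channels, n_samples, filt_len = 2, 16000, 25

# Simulated multichannel raw waveform (e.g. a 2-mic array at 16 kHz).
waveforms = rng.normal(size=(n_channels, n_samples))

# One FIR filter per channel per output feature map / "look direction".
n_filters = 8
filters = rng.normal(size=(n_filters, n_channels, filt_len))

# Filter-and-sum: convolve each channel with its filter, then sum channels.
# A trained multichannel conv layer does exactly this, with learned taps
# that can implement delay-and-sum-like spatial filtering implicitly.
out = np.stack([
    sum(np.convolve(waveforms[c], filters[f, c], mode="valid")
        for c in range(n_channels))
    for f in range(n_filters)
])
print(out.shape)  # (8, 15976): n_filters x (n_samples - filt_len + 1)
```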
*******************************************************************
About Nathan Allen
Formerly of Xio Research, an A.I. appliance company. Previously a strategy and development leader at IBM Watson Education. His views do not necessarily reflect anyone’s, including his own. (What.) Nathan’s academic training is in intellectual history; his next book, Weapon of Choice, examines the creation of American identity and modern Western power. Don’t get too excited; Weapon of Choice isn’t about wars but rather about the seemingly ex nihilo development of individual agency … which doesn’t really seem sexy until you consider that individual agency covers everything from voting rights to the cash in your wallet to the reason mass communication even makes sense…