Important and Seminal Papers in AI
I am constantly adding to this curated list. Please come back again in the future for an update.
1994
Bengio, Y., Simard, P. & Frasconi, P., 1994. Learning Long-Term Dependencies with Gradient Descent is Difficult. IEEE Transactions on Neural Networks 5(2):157-166
An early paper by Yoshua Bengio and team that explored training recurrent neural networks to hold long-term memory ("context representation"). They showed that using simple gradient descent to train the context representation becomes increasingly inefficient as the temporal span of the dependencies increases. This was a limiting factor in training language models, for instance, to accurately capture context in long sentences (until LSTM was invented, see below). Footnote: a young Sepp Hochreiter had also studied the difficulties of memory in neural networks in his 1991 diploma thesis.
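To get a feel for the problem, here is a toy sketch I made (not from the paper): back-propagating through many time steps multiplies repeated factors from the recurrent weights, so the gradient signal from distant inputs shrinks exponentially when those factors are smaller than one.
```python
# Toy linear RNN: h_t = w * h_{t-1} + x_t, with a single recurrent weight w.
# The gradient of h_T with respect to an input T steps in the past is w**T,
# so long temporal spans either vanish (|w| < 1) or explode (|w| > 1).
for w in (0.9, 1.1):
    for T in (10, 50, 100):
        grad = w ** T  # d h_T / d x_0 for this toy model
        print(f"w={w}, span={T:3d} steps -> gradient factor {grad:.3e}")
```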
#RNN #NLP
1996
This paper presents the “no free lunch” (NFL) theorem, specifically for supervised machine learning (Wolpert followed up with one for search and optimization the next year). The NFL theorem states that no single algorithm works best across all problems, so for each problem you need to select a particular algorithm. Basically it says that when you have trained and optimized a machine learning model for a particular task, it will be great for that task, but there will be an infinite number of other types of tasks that it will suck at. Hence finding a machine learning model that works well across all different problems is nearly impossible. Of course, Wolpert used elegant math to prove this theorem and did not use technical terms like “suck”. https://doi.org/10.1162/neco.1996.8.7.1341
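The flavour of the theorem can be seen in a tiny brute-force sketch (my own illustration, not Wolpert's proof): average over every possible labelling of a small domain, and two very different learners end up with exactly the same off-training-set accuracy.
```python
from itertools import product

# Over ALL possible target functions on a finite domain, any two learners have
# the same average accuracy on the points they did not see during training.
domain = [0, 1, 2, 3, 4]   # five inputs
train_x = [0, 1, 2]        # seen during training
test_x = [3, 4]            # off-training-set points

def learner_majority(train_labels):
    # Predict the majority training label for every unseen point.
    return 1 if sum(train_labels) * 2 >= len(train_labels) else 0

def learner_minority(train_labels):
    # A deliberately "bad-looking" learner: predict the minority label.
    return 1 - learner_majority(train_labels)

scores = {"majority": 0.0, "minority": 0.0}
targets = list(product([0, 1], repeat=len(domain)))  # all 2**5 labelings
for target in targets:
    train_labels = [target[x] for x in train_x]
    for name, learner in [("majority", learner_majority),
                          ("minority", learner_minority)]:
        pred = learner(train_labels)
        correct = sum(pred == target[x] for x in test_x)
        scores[name] += correct / len(test_x)

for name, total in scores.items():
    print(name, total / len(targets))  # both print 0.5
```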
#supervised-learning
1997
Hochreiter, S. & Schmidhuber, J., 1997. Long Short-Term Memory. Neural Computation 9(8):1735-1780
LSTM and, later, “attention” are the foundations of modern LLMs and other neural networks that are trained for context. This invention introduces a vector called the “cell state” that runs throughout the network and forms the basis of memory. What gets onto the cell state is carefully regulated by four interacting neural network layers (there are modern variants of this structure).
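A minimal numpy sketch of one LSTM step, using the common modern formulation (the forget gate was added after the 1997 paper), to show how the gates regulate the cell state:
```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step. A sketch of the modern variant, not the exact 1997
    formulation. W has shape (4*hidden, input+hidden); b has shape (4*hidden,)."""
    hidden = h_prev.shape[0]
    z = W @ np.concatenate([x, h_prev]) + b
    f = sigmoid(z[0*hidden:1*hidden])   # forget gate: what to drop from the cell state
    i = sigmoid(z[1*hidden:2*hidden])   # input gate: what new information to admit
    g = np.tanh(z[2*hidden:3*hidden])   # candidate values for the cell state
    o = sigmoid(z[3*hidden:4*hidden])   # output gate: what to expose as the hidden state
    c = f * c_prev + i * g              # the regulated "cell state" (the memory)
    h = o * np.tanh(c)
    return h, c

# Usage: run a short random sequence through the cell.
rng = np.random.default_rng(0)
n_in, n_hidden = 8, 16
W = rng.normal(scale=0.1, size=(4 * n_hidden, n_in + n_hidden))
b = np.zeros(4 * n_hidden)
h, c = np.zeros(n_hidden), np.zeros(n_hidden)
for t in range(5):
    h, c = lstm_step(rng.normal(size=n_in), h, c, W, b)
print(h.shape, c.shape)
```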
#RNN #LSTM
2015
Karpathy, A., 2015-05-21. The Unreasonable Effectiveness of Recurrent Neural Networks
Not really a peer-reviewed paper but a much-cited blog post in the field. Karpathy gives a good summary of recurrent neural networks (RNNs) and their ability to create powerful models in machine vision and language modelling. In particular, he shows how language ability emerges in an LSTM as training iterations increase. This write-up can help with understanding the emergent properties of LLMs.
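For a flavour of what character-level generation means, here is a toy stand-in I wrote (nothing like the actual char-rnn code): it counts character bigrams instead of training an RNN, but keeps the same sampling loop of feeding the last character in and drawing the next one out.
```python
import numpy as np

text = "hello world, hello recurrent world. "
chars = sorted(set(text))
idx = {ch: i for i, ch in enumerate(chars)}

counts = np.ones((len(chars), len(chars)))           # +1 smoothing
for a, b in zip(text, text[1:]):
    counts[idx[a], idx[b]] += 1
probs = counts / counts.sum(axis=1, keepdims=True)   # next-character distribution

rng = np.random.default_rng(0)
out = ["h"]
for _ in range(80):
    nxt = rng.choice(len(chars), p=probs[idx[out[-1]]])
    out.append(chars[nxt])
print("".join(out))
```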
#RNN #LSTM
2016
I was blown away when I read this paper, and then blown away again when I tried the model myself. The problem was: how do you colorize greyscale images (e.g. from old photographs and movies) computationally? The innovation was a new architecture that fuses local and global features for colorization. It amazed me that such a simple idea could create a model that was a quantum leap ahead of the SOTA at the time. https://doi.org/10.1145/2897824.2925974
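A rough numpy sketch of the fusion idea as I understand it (not the paper's exact layer or dimensions): replicate the image-level global feature vector at every spatial position, concatenate it with the local feature maps, and mix with a learned 1x1 projection.
```python
import numpy as np

# Hypothetical shapes: 28x28 feature maps, 256 local and 256 global channels.
H, W, C_local, C_global, C_out = 28, 28, 256, 256, 256

local_feats = np.random.randn(H, W, C_local)   # per-pixel (local) features
global_feat = np.random.randn(C_global)        # one vector for the whole image

# Replicate the global vector at every pixel and concatenate with local features.
tiled = np.broadcast_to(global_feat, (H, W, C_global))
fused_in = np.concatenate([local_feats, tiled], axis=-1)     # (H, W, 512)

# Learned 1x1 projection (a per-pixel matrix multiply) followed by ReLU.
W_fuse = np.random.randn(C_local + C_global, C_out) * 0.01
b_fuse = np.zeros(C_out)
fused = np.maximum(fused_in @ W_fuse + b_fuse, 0.0)
print(fused.shape)  # (28, 28, 256): global context is now available at every pixel
```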
#CNN #image-processing
2017
Vaswani et al., 2017. Attention Is All You Need. arXiv:1706.03762
This breakthrough paper from Google Research and Google Brain essentially introduces a method for selectively remembering the most important past information a model has seen in order to hold on to context. That is, the self-attention mechanism allows the model to weigh the importance of different elements in its input and adjust how those elements affect its output. The new architecture is called the “transformer”, and it does away with the need for RNNs.
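The core operation is compact enough to sketch in a few lines of numpy; this is single-head scaled dot-product attention, softmax(QK^T/sqrt(d_k))V, without masking or multi-head projections:
```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Minimal single-head scaled dot-product attention (no mask, no heads)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # how much each query attends to each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # softmax over keys
    return weights @ V, weights

# Usage: 5 tokens with 32-dim embeddings projected to 16-dim queries/keys/values.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 32))
Wq, Wk, Wv = (rng.normal(scale=0.1, size=(32, 16)) for _ in range(3))
out, attn = scaled_dot_product_attention(X @ Wq, X @ Wk, X @ Wv)
print(out.shape, attn.shape)                   # (5, 16) (5, 5)
```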
#transformer #NLP
2019
Machine learning models have been applied to medical imaging (e.g. the detection of tumours from scans) for almost a decade. But what is the state of the art? This paper describes a systematic review and meta-analysis of published papers and conference proceedings. While the data show what we expect, namely that the diagnostic performance of deep learning models was found to be equivalent to that of medical professionals, there were many flaws in the papers reviewed:
1) Very few papers had externally validated results, because results from the ML models were not used in clinical processes.
2) Few studies were done in real clinical environments; most were retrospective and done on a computer, and missing data etc. were rarely described consistently.
3) Different metrics were used to describe performance and, in many cases, thresholds for sensitivity and specificity were chosen without a description of why they were chosen (the sketch below shows how much the choice matters).
4) There was no standardisation of terminology, so there is ambiguity about what was done during model creation; the word “validation” was particularly troublesome.
5) Very few studies tested both the machine and doctors against the same samples, and most studies were neither externally validated (no temporal or geographical split) nor internally validated (random splitting of sets into training, validation, tuning and test sets).
Bottom line: while it appears that machine learning models can be as good as professionals at detecting disease in terms of sensitivity and specificity, we are still unsure how good those models really are because of flawed experimental design and a lack of standardization over the seven years studied.
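To see why point 3 matters, here is a small made-up example (hypothetical scores, not data from the review) showing how the chosen threshold moves sensitivity and specificity in opposite directions:
```python
import numpy as np

# Hypothetical classifier scores on a tiny test set (1 = disease present).
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_score = np.array([0.95, 0.80, 0.60, 0.40, 0.70, 0.35, 0.30, 0.20, 0.10, 0.05])

for threshold in (0.3, 0.5, 0.7):
    y_pred = (y_score >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    sensitivity = tp / (tp + fn)   # recall on diseased cases
    specificity = tn / (tn + fp)   # recall on healthy cases
    print(f"threshold={threshold:.1f}  sensitivity={sensitivity:.2f}  specificity={specificity:.2f}")
```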
#AI-in-Medicine
Chollet, F. 2019. On the Measure of Intelligence. arXiv:1911.01547
François Chollet is a heavyweight in the field of ML who works at Google. The emergent properties of large neural network models have prompted us to ask: is this model intelligent? But what is intelligence? And how do we measure it? These questions matter not only to satisfy our curiosity but also because the field needs a way to measure performance. This paper draws together ideas from psychology, developmental psychology, computer science, cognitive science and mathematical logic to try to answer them. It's a good review of ideas by someone who has spent time thinking about and researching the subject, and it forms a good springboard for further discussion.
#AGI
2022
Hendrycks, D., et al., 2022. Unsolved Problems in ML Safety. arXiv:2109.13916
An important survey (and update) of problems in machine learning safety that have not been solved. The authors, from UC Berkeley, Google and OpenAI, classify the problems into 4 areas:
Robustness. Risk of Black Swan and tail-risk events and accidents; robustness against adversarial attacks.
Monitoring. Detection of anomalies to prevent malicious use (a minimal sketch follows this list); assessing outputs to prevent hazards (i.e. calibrating a model so we understand its accuracy/veracity); detecting hidden model functionalities such as backdoors and emergent hazardous properties.
Alignment. Encoding human goals and intent in objectives; translating human values into actions that optimise for the objective; recognising that the proxies we select for objectives (e.g. laws, rules) can be gamed and can lead to unintended consequences.
Systemic Safety. Detecting and protecting against cyberattacks; preventing institutions from steering ML towards hazardous decisions.
A safe system requires work in all of these areas to mitigate ML system risks, operational risks, institutional and societal risks and future risks.
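As a concrete taste of the monitoring problem, here is a minimal sketch (my own, not from the paper) of a simple and widely used anomaly-scoring baseline: flag inputs whose maximum softmax probability is unusually low.
```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def msp_score(logits):
    # Maximum softmax probability: low values suggest the input is unlike training data.
    return softmax(logits).max(axis=-1)

in_distribution = np.array([[6.0, 1.0, 0.5],      # confident predictions
                            [0.2, 5.5, 1.0]])
out_of_distribution = np.array([[1.1, 0.9, 1.0]])  # no class stands out

threshold = 0.6   # hypothetical operating point, chosen on held-out data
for name, logits in [("in-dist", in_distribution), ("anomaly", out_of_distribution)]:
    for score in msp_score(logits):
        flag = "FLAG" if score < threshold else "ok"
        print(f"{name}: MSP={score:.2f} -> {flag}")
```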
#AI-Safety
2023
OpenAI, 2023. GPT-4 Technical Report. arXiv:2303.08774
The emergent properties of GPT-4 have some people, even engineers, claiming the beginning of sentience or general intelligence. While the jury is still out on those claims, it has kicked off debate in society, from the question of what exactly intelligence is, to how we ensure that models are aligned to protect human well-being. This paper describes the scale of the model, how it performs on various tests and benchmarks, and some of its limitations: for example, it “hallucinates” and is vulnerable to adversarial questioning. The paper is not “technical” in the engineering sense; rather, it describes the model's abilities compared to earlier versions and other LLMs, and it only gives a passing nod to the issue of safety. While I don't think this paper itself is seminal, the model created through scaling of parameters is, and this paper describes that model's capabilities. In fact, a more important paper is the one below. I think this paper also reflects how commercial the field has become, where technical details take second place to chest-puffing.
#GPT #LLM
OpenAI, 2023. GPT-4 System Card. No journal
Together with the technical report for GPT-4, this paper (not pushed to arXiv) describes the raw outputs GPT-4 produced when it was first trained, before it was aligned to decent human values. It includes experiments on disinformation, writing code to hack, asking the system to help create weapons, etc. These were studied by OpenAI, and the mitigations they installed were to remove “inappropriate” material from the training set and to apply Reinforcement Learning from Human Feedback (RLHF), among other methods, to shape the model and prevent it from causing harm. Questions remain: what is defined as “inappropriate”? Who flags or classifies outputs for the human feedback (and is there a mechanism to avoid introducing the classifiers' biases into the system as well)? Nonetheless this paper is interesting in that it reveals what is in the training material (and hence in humanity), and how the model can hallucinate and convince itself (and hence the user) of facts/reality.
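For context on what RLHF involves, here is a sketch (not OpenAI's code) of the pairwise preference loss commonly used to train the reward model from human rankings:
```python
import numpy as np

# Human labellers rank two candidate responses; the reward model is trained so
# the preferred one scores higher, via a Bradley-Terry style pairwise loss.
def preference_loss(reward_chosen, reward_rejected):
    # -log sigmoid(r_chosen - r_rejected): small when the chosen response
    # already out-scores the rejected one, large otherwise.
    return -np.log(1.0 / (1.0 + np.exp(-(reward_chosen - reward_rejected))))

print(preference_loss(2.0, -1.0))   # ~0.05: reward model agrees with the labeller
print(preference_loss(-1.0, 2.0))   # ~3.05: strong signal to fix the ordering
```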
#GPT #LLM #AI-Safety
Bubeck, S., et al., 2023. Sparks of Artificial General Intelligence: Early experiments with GPT-4. arXiv:2303.12712
A paper by a team of Microsoft researchers who had early access to GPT-4 in September 2022. In a series of “experiments” with GPT-4 (compared against ChatGPT), the authors show that this LLM has the ability to “understand” images and concepts at a level beyond mere recall from memorization. They got it to solve complex word problems, mathematical problems and spatial problems, and to compose music and code. The authors, being mathematicians themselves, tested its mathematical sophistication at several levels: technical proficiency, creative reasoning and critical reasoning, using the same techniques they would use when interviewing a PhD student.
#LLM #GPT #AGI