NTD in AI: BERT

Non-technical definitions in AI

BERT is an acronym for Bidirectional Encoder Representations from Transformers. It is a language representation model built by Google and used for natural language processing tasks.

It was trained to predict missing words in a sentence. Google engineers created the massive labeled training set by randomly masking 15% of the words in their data, which was 2.5 billion words from the English Wikipedia and 800 million words from a database called the BooksCorpus (Devlin et al., 2019). The hidden words themselves serve as the labels, so no human annotation is needed. A model trained this way is called a Masked Language Model.
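For readers who like to see the idea in code, here is a minimal sketch of the masking step in Python. It is a simplification under stated assumptions, not Google's actual pipeline: it works on whole words split on spaces, and the function name `mask_words`, the example sentence, and the `[MASK]` placeholder handling are chosen here purely for illustration.

```python
import random

MASK = "[MASK]"

def mask_words(words, mask_rate=0.15, seed=0):
    """Hide roughly mask_rate of the words and remember the hidden originals."""
    rng = random.Random(seed)
    # Assumption: always hide at least one word so short sentences still give an example.
    n_to_mask = max(1, round(len(words) * mask_rate))
    positions = sorted(rng.sample(range(len(words)), n_to_mask))
    masked = list(words)
    answers = {}
    for pos in positions:
        answers[pos] = masked[pos]  # the hidden word becomes the label
        masked[pos] = MASK
    return masked, answers

# Illustrative sentence, not taken from the real training data.
sentence = "the author picked up a pen to pen a short note to her old friend".split()
masked, answers = mask_words(sentence)
print(" ".join(masked))  # what the model would see
print(answers)           # what the model would be asked to predict
```

The `answers` dictionary is what makes the data "labeled": the correct fill-in for every blank is known because it came from the original text.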

When BERT was first presented, it outperformed earlier approaches such as GloVe because it could discern the context of a sentence from the surrounding words before choosing the right words to fill in the blanks. For instance, the word “pen” can be both a noun (the writing instrument) and a verb (“to pen a novel”), a distinction that models like GloVe, which assign each word a single fixed representation, cannot make.

This ability comes from its architecture, which its name describes.

It is “Bidirectional” because the network uses context from both directions: the words that come before a given position and the words that come after it. Neural networks that process text in a single direction soon lose information about earlier elements. In natural language processing, this means that information contained earlier in a sentence becomes less prominent, essentially “forgotten”. This matters because the model could lose the context of a sentence. Using both directions is a very successful strategy for overcoming this problem, and it was first introduced for an earlier kind of neural network (Schuster & Paliwal, 1997).

“Encoder Representations” describes the function of converting the input data (sentences, in BERT’s case) into a form, or representation, that computers can work with: here, a matrix of numbers.
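To make “a matrix of numbers” concrete, here is a toy sketch in Python. It is only an illustration under loud assumptions: the lookup table is filled with random numbers rather than learned ones, each row has just 4 numbers, and the helper names `build_vocab` and `encode` are hypothetical. BERT's real representations are learned and change with context.

```python
import random

def build_vocab(sentences):
    """Assign each distinct word an integer id, in order of first appearance."""
    vocab = {}
    for sentence in sentences:
        for word in sentence.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab

def encode(sentence, vocab, table):
    """Return one row of numbers per word: a matrix for the whole sentence."""
    return [table[vocab[word]] for word in sentence.lower().split()]

sentences = ["She used a pen", "She will pen a novel"]
vocab = build_vocab(sentences)

DIM = 4  # assumption: tiny rows so the output stays readable
rng = random.Random(0)
# Toy table: random numbers stand in for values a real model would learn.
table = [[round(rng.uniform(-1, 1), 2) for _ in range(DIM)] for _ in vocab]

matrix = encode("She will pen a novel", vocab, table)
for word, row in zip("She will pen a novel".split(), matrix):
    print(f"{word:>6}: {row}")
```

In this toy table “pen” gets the same row of numbers in both sentences, which is the limitation of fixed, one-vector-per-word models. A contextual model like BERT would produce different numbers for the noun and the verb.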

Finally, “Transformer” refers to an architecture that first encodes an input into the representations mentioned above and then takes those encodings and produces another output. The original transformer was a language translation model, so the input would be, for instance, an English sentence, which is then transformed into a German sentence (Vaswani et al., 2017). BERT uses only the encoder part of the architecture, hence the “Encoder Representations” in its name.

Machine learning is a technical subject, and the technical terms engineers use can get in the way of clear communication with non-engineers, especially in a business setting. In spare moments I started to put together simple, non-technical definitions of nouns and verbs used in the field of machine learning as a kind of Rosetta Stone for non-engineers. This is a work in progress which I may collect into a book one day. This is one of those definitions.
