https://en.bioerrorlog.work/entry/gpt-2-paper

This is a summary of the GPT-2 paper "Language Models are Unsupervised Multitask Learners."

Introduction

The paper summarized here:

Language Models are Unsupervised Multitask Learners

Published: February 2019
OpenAI
Code: GitHub - openai/gpt-2: Code for the paper "Language Models are Unsupervised Multitask Learners"

This is the GPT-2 paper.

The original GPT paper can be found here:

en.bioerrorlog.work

*All figures in this article are cited from the paper.

Note: This article was translated from my original post.

Language Models are Unsupervised Multitask Learners

Overview

Background
- Natural language processing tasks have traditionally been solved using supervised learning
Challenge
- Previous language models were essentially narrow specialists trained through supervised learning on specific domains
- Creating a generalist language model that works across broad domains had not yet been achieved
What they did
- Created WebText, a web-scraped dataset
- Built a GPT model using unsupervised learning
  - Trained on WebText
  - Used byte-level BPE (Byte Pair Encoding)
  - Architecture based on the original GPT with minor modifications
  - The largest model (1.5B parameters) is called GPT-2
Results
- Matched or exceeded 3 out of 4 baseline systems on a reading comprehension task (CoQA) without any training on labeled data
- Achieved state-of-the-art performance on 7 out of 8 language modeling benchmark datasets in a zero-shot setting
- Demonstrated reasonable performance on other tasks as well
- Larger models consistently produced better results

Method

Creating the WebText Training Dataset

WebText: A dataset created by scraping web pages linked from Reddit
- To ensure text quality, only outbound links from Reddit posts with at least 3 karma were included
  - Number of links: 45 million
  - Used data as of December 2017
- Used Dragnet and Newspaper for text extraction from HTML
- Total: 8 million documents / 40GB of text data
- Excluded Wikipedia pages, since Wikipedia is a common source for other datasets and overlap with evaluation data could complicate the analysis
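
The link-selection rules above can be sketched in a few lines. This is only an illustration, not the authors' pipeline: the `(url, karma)` input format, the `select_links` helper, and the URL-level de-duplication are all assumptions made for the example.

```python
from urllib.parse import urlparse

MIN_KARMA = 3  # threshold mentioned in the summary above


def select_links(posts):
    """Return outbound URLs worth scraping.

    `posts` is a hypothetical iterable of (url, karma) pairs; the real
    data format used for WebText is not described in this article.
    """
    seen = set()
    kept = []
    for url, karma in posts:
        host = urlparse(url).netloc.lower()
        if karma < MIN_KARMA:
            continue  # low-karma posts are treated as low quality
        if host == "wikipedia.org" or host.endswith(".wikipedia.org"):
            continue  # Wikipedia pages are excluded
        if url in seen:
            continue  # assumption: drop exact duplicate URLs
        seen.add(url)
        kept.append(url)
    return kept


posts = [
    ("https://example.com/a", 5),
    ("https://example.com/a", 7),
    ("https://en.wikipedia.org/wiki/GPT-2", 12),
    ("https://example.org/b", 1),
    ("https://example.net/c", 3),
]
print(select_links(posts))
```

Only the first and last URLs survive here: the duplicate, the Wikipedia page, and the low-karma link are dropped.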

Example of English-French translation information found in WebText

Even without explicitly preparing translation pairs as training data, the presence of such bilingual content within web pages means the model could potentially learn to handle translation tasks

BPE: Byte Pair Encoding

How should strings be encoded as input to the model?

Challenges:

Treating Unicode strings as raw UTF-8 byte sequences is fully general, but byte-level language models are not competitive with word-level models on large-scale benchmarks
BPE (Byte Pair Encoding), despite its name, is typically performed on Unicode code points rather than byte sequences
- When applying BPE to Unicode code points, the required vocabulary becomes enormous
- Byte-level BPE keeps the required vocabulary small (256 base tokens)
However, applying BPE directly at the byte level allocates the vocabulary poorly
- Common words get merged with adjacent punctuation into many near-duplicate tokens (for example "dog.", "dog!", "dog?")

Their solution:

Byte-level BPE
Merges across different character categories (for example, letters and punctuation) are prevented, with an exception for spaces

This aims to achieve word-level performance while maintaining the generality of the byte-level approach.
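
To make the idea concrete, here is a minimal sketch of byte-level BPE training in which merges cannot cross character categories. It is not the GPT-2 tokenizer: the `CHUNK` pattern, `train_bpe`, and `merge_word` are names invented for this example, the category split is a simplified regular expression, and the exception for spaces is left out. The restriction is enforced by splitting the text into same-category chunks first and only counting byte pairs inside a chunk.

```python
import re
from collections import Counter

# Assumed, simplified categories: letters | digits | punctuation/symbols | whitespace.
CHUNK = re.compile(r"[^\W\d_]+|\d+|(?:[^\s\w]|_)+|\s+")


def merge_word(word, pair):
    """Replace every adjacent occurrence of `pair` in `word` with one merged token."""
    out = []
    i = 0
    while i < len(word):
        if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
            out.append(word[i] + word[i + 1])
            i += 2
        else:
            out.append(word[i])
            i += 1
    return tuple(out)


def train_bpe(text, num_merges):
    """Learn up to `num_merges` merges over UTF-8 bytes, never across chunks."""
    # Each chunk starts as a tuple of single-byte tokens (base vocabulary of 256).
    words = Counter(
        tuple(bytes([b]) for b in chunk.encode("utf-8"))
        for chunk in CHUNK.findall(text)
    )
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every chunk is already a single token
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        merged = Counter()
        for word, freq in words.items():
            merged[merge_word(word, best)] += freq
        words = merged
    return merges


print(train_bpe("dog. dog! dog? dog", 10))
```

On this toy input the only merges learned are the ones that build `dog`; no merge joins `dog` with `.`, `!`, or `?`, because those characters always sit in a different chunk.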

Model Architecture

Based on the original GPT architecture with some modifications:

Moved layer normalization to the input of each transformer sub-block
Added an additional layer normalization after the final self-attention block
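
The difference in layer-normalization placement is easier to see in code. The sketch below is a structural illustration only: `attn` and `mlp` are placeholder callables standing in for the real self-attention and feed-forward sub-blocks, `layer_norm` has no learned gain or bias, and everything operates on a single vector as a plain Python list.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def add(a, b):
    """Element-wise sum, used for the residual connections."""
    return [p + q for p, q in zip(a, b)]


def post_ln_block(x, attn, mlp):
    # Original GPT ordering: normalize after each residual addition.
    x = layer_norm(add(x, attn(x)))
    return layer_norm(add(x, mlp(x)))


def pre_ln_block(x, attn, mlp):
    # GPT-2 ordering: normalize the input of each sub-block,
    # leaving the residual path itself unnormalized.
    x = add(x, attn(layer_norm(x)))
    return add(x, mlp(layer_norm(x)))


def gpt2_stack(x, blocks):
    """Apply pre-LN blocks, then the additional final layer normalization."""
    for attn, mlp in blocks:
        x = pre_ln_block(x, attn, mlp)
    return layer_norm(x)


# Placeholder sub-blocks (assumptions for the demo, not real attention/MLP).
half = lambda v: [0.5 * t for t in v]
out = gpt2_stack([1.0, 2.0, 3.0, 4.0], [(half, half), (half, half)])
print(out)
```

One consequence of the pre-LN ordering is visible directly: if both sub-blocks output zeros, `pre_ln_block` returns its input unchanged, whereas `post_ln_block` would still rescale it.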

Original GPT architecture | Image from here

They also created models of multiple sizes. The largest model is called GPT-2.

Results

Language Modeling Tasks

Achieved state-of-the-art on 7 out of 8 datasets
Larger improvements were seen on smaller datasets
- WikiText2, PTB
Significant improvements were also seen on datasets requiring long-range dependencies
- LAMBADA, CBT
Still significantly behind prior work on 1BW (One Billion Word Benchmark), likely because its sentence-level shuffling removes long-range structure

Relationship between CBT results and model size

Larger model sizes consistently produced better results

Common Sense Reasoning Ability

Winograd Schema Challenge: Measures common sense reasoning ability through pronoun disambiguation in sentences
Achieved state-of-the-art accuracy of 70.70%, a 7% improvement over the previous best

Reading Comprehension

Tested on Conversation Question Answering (CoQA).

Achieved 55 F1 score
Results matched or exceeded 3 out of 4 baseline systems
- Without any additional training on question-answer pairs
The supervised state of the art is a BERT-based system, which is approaching human performance of 89 F1
- [1810.04805] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
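
For readers unfamiliar with the F1 score used here, the following is a minimal sketch of a word-overlap F1 between a predicted answer and a reference answer. The `token_f1` helper is hypothetical, and it only lowercases and splits on whitespace; real evaluation scripts typically apply further normalization (punctuation, articles) and handle multiple reference answers, which is omitted here.

```python
from collections import Counter


def token_f1(prediction, reference):
    """Word-overlap F1 between two answer strings (simplified normalization)."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    # Multiset intersection counts each shared word at most as often as it
    # appears in both answers.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


print(token_f1("the white cat", "a white cat"))
```

In this example two of three words overlap, so precision and recall are both 2/3 and the F1 is also 2/3. A dataset-level score such as 55 F1 is this quantity averaged over questions and expressed as a percentage.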

Summarization Tasks

CNN and Daily Mail dataset summarization task results

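On ROUGE metrics, results only begin to approach classic neural baselines and barely outperform selecting 3 random sentences from the article
Summaries tend to focus on recent content from the article or confuse specific details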

Translation Tasks

Task Set	GPT-2 Results
WMT-14 English-French	5 BLEU
WMT-14 French-English	11.5 BLEU

WMT-14 English-French results are slightly worse than word-by-word translation using a bilingual lexicon inferred by earlier unsupervised work
WMT-14 French-English results are better than several unsupervised baselines but far below the unsupervised state of the art (33.5 BLEU)

QA Tasks

How well can it answer factual questions?

Dataset: Natural Questions
GPT-2 accuracy: 4.1% (exact match evaluation)
- Better than the smallest model's results (below 1%), suggesting that increasing model size could improve performance
Existing QA systems combining information retrieval achieve 30-50%
- GPT-2's results are significantly lower
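
Exact match is a much stricter metric than F1: an answer counts only if it equals the reference. A minimal sketch follows, assuming one reference per question and a simple lowercase-and-strip normalization; the `exact_match_accuracy` helper and the sample answers are made up for illustration.

```python
def exact_match_accuracy(predictions, references):
    """Fraction of predictions that equal their reference after light normalization."""
    if not references:
        return 0.0
    hits = sum(
        p.strip().lower() == r.strip().lower()
        for p, r in zip(predictions, references)
    )
    return hits / len(references)


preds = ["Paris", "1969 ", "blue whale", "Mount Everest"]
refs = ["paris", "1968", "the blue whale", "K2"]
print(exact_match_accuracy(preds, refs))
```

Only the first answer matches here, giving 0.25. Note that "blue whale" versus "the blue whale" counts as a miss, which is part of why exact-match scores tend to be low for free-form generation.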

High-probability GPT-2 answers from the Natural Questions dataset

Generalization vs Memorization

Are these results truly due to GPT-2's generalization ability? Or is it solving tasks through memorization because of overlap between training and test datasets? Since WebText is a massive collection of diverse web pages, this is a valid concern.

They investigated this question.

8-gram overlap between training and test datasets

Examined overlap between training data (WebText) and each test dataset using 8-gram Bloom filters
WebText overlap ranges from about 1-6%
- Average of 3.2%
The benchmarks' own training sets overlap with their test sets by 5.9% on average
WebText therefore overlaps with the test sets less than the benchmarks' own training data does
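
The overlap measurement can be sketched as follows. This is an illustration under assumptions, not the authors' code: the `BloomFilter` class, its size and hash construction, and the whitespace tokenization in `ngrams` are all choices made for this example. A Bloom filter stores set membership compactly and can report false positives but never false negatives, so the measured overlap is, if anything, slightly overestimated.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter; sizes and hashing scheme are illustrative assumptions."""

    def __init__(self, num_bits=1 << 20, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive several bit positions by hashing the item with different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def ngrams(text, n=8):
    """Lower-cased, whitespace-tokenized n-grams joined by single spaces."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def overlap_rate(train_text, test_text, n=8):
    """Fraction of test n-grams that (probably) also occur in the training text."""
    bloom = BloomFilter()
    for gram in ngrams(train_text, n):
        bloom.add(gram)
    test_grams = ngrams(test_text, n)
    if not test_grams:
        return 0.0
    return sum(gram in bloom for gram in test_grams) / len(test_grams)


train = "a b c d e f g h i j"
test = "a b c d e f g h x y"
print(overlap_rate(train, test))
```

The toy test text contains three 8-grams, of which only the first also appears in the training text, so the reported rate is one third unless a false positive occurs, which is extremely unlikely at this filter size.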

Conclusion/Thoughts

That's my summary of the paper "Language Models are Unsupervised Multitask Learners."

Below are my personal notes:

What are the authors trying to achieve?
- Demonstrate the general language capabilities of models through unsupervised learning and zero-shot transfer
What are the key elements of their approach?
- Creating high-quality web-scraped data: WebText
- Scaling up model size and using unsupervised learning
- Byte-level BPE
Which cited papers would I like to read next?
- BPE [1508.07909] Neural Machine Translation of Rare Words with Subword Units
- [1810.04805] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Thoughts
- While the original GPT relied on fine-tuning, GPT-2 demonstrates strong capabilities through unsupervised learning without fine-tuning. Combined with subsequent work on scaling laws, it's fascinating to see the beginning of the era of "just make models bigger and performance improves", at least in these early stages.