The following content was retrieved from https://en.bioerrorlog.work/entry/gpt-2-paper.


Understanding GPT-2 | Paper Summary: Language Models are Unsupervised Multitask Learners

This is a summary of the GPT-2 paper "Language Models are Unsupervised Multitask Learners."

Introduction

The paper summarized here:

Language Models are Unsupervised Multitask Learners

This is the GPT-2 paper.

The original GPT paper can be found here:

en.bioerrorlog.work

*All figures in this article are cited from the paper.

Note: This article was translated from my original post.

Language Models are Unsupervised Multitask Learners

Overview

  • Background
    • Natural language processing tasks have traditionally been solved using supervised learning
  • Challenge
    • Previous language models were essentially narrow specialists trained through supervised learning on specific domains
    • Creating a generalist language model that works across broad domains had not yet been achieved
  • What they did
    • Created WebText, a web-scraped dataset
    • Built a GPT model using unsupervised learning
      • Trained on WebText
      • Used byte-level BPE (Byte Pair Encoding)
      • Architecture based on the original GPT with minor modifications
      • The largest model (1.5B parameters) is called GPT-2
  • Results
    • Achieved results comparable to existing models on reading comprehension tasks without any additional training on labeled data
    • Achieved state-of-the-art performance on 7 out of 8 language modeling benchmark datasets
    • Demonstrated reasonable performance on other tasks as well
    • Larger models consistently produced better results

Method

Creating the WebText Training Dataset

  • WebText: A dataset created by scraping web pages linked from Reddit
    • To ensure text quality, only included Reddit posts with 3+ karma
      • Number of links: 45 million
      • Used data as of December 2017
    • Used Dragnet and Newspaper for text extraction from HTML
    • Total: 8 million documents / 40GB of text data
    • Excluded Wikipedia pages since proper Wikipedia datasets already exist
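
The karma-based filtering step above can be sketched as follows. The post structure and field names are illustrative assumptions, not the authors' actual scraping pipeline:

```python
# Hypothetical sketch of WebText's quality filter: keep outbound URLs
# only from Reddit posts with at least 3 karma. Field names ("url",
# "karma") are assumptions for illustration.
def filter_links(posts, min_karma=3):
    """Return outbound URLs from posts meeting the karma threshold."""
    return [post["url"] for post in posts if post["karma"] >= min_karma]

posts = [
    {"url": "https://example.com/a", "karma": 5},
    {"url": "https://example.com/b", "karma": 1},
]
print(filter_links(posts))  # ['https://example.com/a']
```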

Example of English-French translation information found in WebText

  • Even without explicitly preparing translation pairs as training data, the presence of such bilingual content within web pages means the model could potentially learn to handle translation tasks

BPE: Byte Pair Encoding

How should strings be encoded as input to the model?

Challenges:

  • The conventional approach of treating Unicode strings as UTF-8 byte sequences doesn't perform well on word-level tasks
  • BPE (Byte Pair Encoding), despite its name, is typically performed on Unicode code points rather than byte sequences
    • When applying BPE to Unicode code points, the required vocabulary becomes enormous
    • Byte-level BPE keeps the required vocabulary small (256 base tokens)
  • However, applying BPE directly at the byte level doesn't optimize well
    • Combinations of common words and punctuation get merged into single tokens inappropriately

Their solution:

  • Byte-level BPE
  • But with merges between different character categories prevented

This aims to achieve word-level performance while maintaining the generality of the byte-level approach.
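
A toy sketch of that restriction: choose the most frequent adjacent pair to merge, but skip pairs that cross character categories (crudely reduced here to letters vs. everything else). This is a simplification for illustration, not the actual GPT-2 tokenizer:

```python
from collections import Counter

def pair_frequencies(token_seqs):
    """Count adjacent symbol pairs across all sequences."""
    counts = Counter()
    for seq in token_seqs:
        for a, b in zip(seq, seq[1:]):
            counts[(a, b)] += 1
    return counts

def best_merge(counts):
    """Most frequent pair whose symbols share a (crude) character
    category, mimicking the rule against cross-category merges."""
    allowed = {pair: n for pair, n in counts.items()
               if pair[0].isalpha() == pair[1].isalpha()}
    return max(allowed, key=allowed.get) if allowed else None

# "dog" followed by varying punctuation: merges like ('g', '.') are
# blocked by the category rule, even though 'g' and '.' are adjacent.
seqs = [list("dog."), list("dog!"), list("dog?")]
counts = pair_frequencies(seqs)
print(best_merge(counts))  # a letter-letter pair, e.g. ('d', 'o')
```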

Model Architecture

Based on the original GPT architecture with some modifications:

  • Moved layer normalization to the input of each transformer sub-block
  • Added an additional layer normalization after the final self-attention block
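
The pre-norm ordering can be sketched with toy stand-ins for the sub-layers. The lambdas below are placeholders, not real attention or MLP layers:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize the last axis to zero mean and unit variance."""
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def pre_norm_block(x, attn, mlp):
    """GPT-2 style ordering: layer norm on the *input* of each
    sub-layer, with the residual path left un-normalized."""
    x = x + attn(layer_norm(x))
    x = x + mlp(layer_norm(x))
    return x

# toy stand-ins for self-attention and the feed-forward MLP
x = np.random.randn(4, 8)
out = pre_norm_block(x, attn=lambda h: 0.5 * h, mlp=np.tanh)
print(out.shape)  # (4, 8)
```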

Original GPT architecture | Image from here


They also created models of multiple sizes. The largest model is called GPT-2.

Hyperparameters for four model sizes

Results

Language Modeling Tasks

Results on each dataset

  • Achieved state-of-the-art on 7 out of 8 datasets
  • Larger improvements were seen on smaller datasets
    • WikiText2, PTB
  • Significant improvements were also seen on datasets requiring long-range dependencies
    • LAMBADA, CBT
  • Poor results on 1BW (One Billion Word Benchmark)

Relationship between CBT results and model size

  • Larger model sizes consistently produced better results

Common Sense Reasoning Ability

Winograd Schema Challenge results

  • Winograd Schema Challenge: Measures common sense reasoning ability through pronoun disambiguation in sentences
  • Achieved state-of-the-art performance

Reading Comprehension

Tested on the Conversational Question Answering dataset (CoQA).

  • Achieves 55 F1 without any training on CoQA's labeled data
  • Matches or exceeds 3 of 4 supervised baseline systems

Summarization Tasks

CNN and Daily Mail dataset summarization task results

  • Results not significantly better than existing models
  • Summaries tend to focus on the most recent content of the article and often confuse specific details

Translation Tasks

GPT-2 zero-shot translation results:

  • WMT-14 English→French: 5 BLEU
  • WMT-14 French→English: 11.5 BLEU
  • WMT-14 English-French results are slightly lower than existing unsupervised models
  • WMT-14 French-English results are better than many unsupervised baselines but lower than the unsupervised state-of-the-art
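
Translation is induced purely through the prompt: the model is conditioned on example pairs in the format `english sentence = french sentence` and then asked to complete `source sentence =`. A sketch of building such a context (the model call itself is omitted):

```python
# Sketch of the zero-shot translation context: condition on
# "english = french" example pairs, then let the model complete
# "source sentence =". Example sentences are illustrative.
def translation_prompt(example_pairs, source_sentence):
    """Build a translation context from (english, french) pairs."""
    lines = [f"{en} = {fr}" for en, fr in example_pairs]
    lines.append(f"{source_sentence} =")
    return "\n".join(lines)

prompt = translation_prompt(
    [("hello", "bonjour"), ("thank you", "merci")],
    "good morning",
)
print(prompt)
```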

QA Tasks

How well can it answer factual questions?

  • Dataset: Natural Questions
  • GPT-2 accuracy: 4.1% (exact match evaluation)
    • Better than the smallest model's results (below 1%), suggesting that increasing model size could improve performance
  • Existing QA systems combining information retrieval achieve 30-50%
    • GPT-2's results are significantly lower

High-probability GPT-2 answers from the Natural Questions dataset

Generalization vs Memorization

Are these results truly due to GPT-2's generalization ability? Or is it solving tasks through memorization because of overlap between training and test datasets? Since WebText is a massive collection of diverse web pages, this is a valid concern.

They investigated this question.

8-gram overlap between training and test datasets

  • Examined overlap between training data (WebText) and each test dataset using 8-gram Bloom filters
  • WebText overlap ranges from about 1-6%
    • Average of 3.2%
  • Training data for each test dataset itself has an average 5.9% overlap with test sets
  • WebText actually had less overlap than typical benchmark training sets
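
The overlap measurement above can be approximated as follows. This minimal Bloom filter is a stand-in for the paper's 8-gram filters, and the size and hash count are arbitrary illustration values:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter; sizes here are illustrative, not the
    paper's actual parameters."""
    def __init__(self, size=10_000, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = bytearray(size)

    def _indexes(self, item):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for idx in self._indexes(item):
            self.bits[idx] = 1

    def __contains__(self, item):
        return all(self.bits[idx] for idx in self._indexes(item))

def ngrams(tokens, n=8):
    """Set of n-grams (joined as strings) over a token list."""
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap(train_tokens, test_tokens, n=8):
    """Fraction of test n-grams that (probably) occur in training data."""
    bloom = BloomFilter()
    for gram in ngrams(train_tokens, n):
        bloom.add(gram)
    test_grams = ngrams(test_tokens, n)
    if not test_grams:
        return 0.0
    return sum(gram in bloom for gram in test_grams) / len(test_grams)

train = "the quick brown fox jumps over the lazy dog today".split()
print(overlap(train, train))  # 1.0 (identical text overlaps completely)
```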

Conclusion/Thoughts

That's my summary of the paper "Language Models are Unsupervised Multitask Learners."

Below are my personal notes:

  • What are the authors trying to achieve?
    • Demonstrate the general language capabilities of models through unsupervised learning and zero-shot transfer
  • What are the key elements of their approach?
    • Creating high-quality web-scraped data: WebText
    • Scaling up model size and using unsupervised learning
    • Byte-level BPE
  • Which cited papers would I like to read next?
  • Thoughts
    • While the original GPT relied on fine-tuning, GPT-2 demonstrates strong capabilities through unsupervised learning without fine-tuning. Combined with subsequent work on scaling laws, it's fascinating to see the beginning of the era of "just make models bigger and performance improves"—at least in these early stages.

