1

Machine Learning for Design

Lecture 5 - Part b

Natural Language Processing

2

Previously on ML4D

3

Natural Language Processing

High-level understanding of the language spoken and written by humans

Also, generation (e.g., ChatGPT)

An enabler for technology like Siri or Alexa

4

Why natural language processing?

5

Fora, social media

6

Product review

7

Books

Digital, or digitised

8

Interviews

9

Big Textual Data = Language at scale

  • One of the largest reflections of the world, a man-made one
  • Essential to better understand people, organisations, products, services, systems
    • and their relationships!
  • Language is a proxy for human behaviour and a strong signal of individual characteristics
    • Language is always situated
    • Language is also a political instrument
10
  • Answer questions using the Web
  • Translate documents from one language to another
  • Do library research; summarize
  • Archive and allow access to cultural heritage
  • Interact with intelligent devices
  • Manage messages intelligently
  • Help make informed decisions
  • Follow directions given by any user
  • Fix your spelling or grammar
  • Grade exams
  • Write poems or novels
  • Listen and give advice
  • Estimate public opinion
  • Read everything and make predictions
  • Interactively help people learn
  • Help disabled people
  • Help refugees/disaster victims
  • Document or reinvigorate indigenous languages
11

What is Natural Language Processing?

12
  • Computer using natural language as input and/or output

  • Natural: human communication, unlike e.g., programming languages
  • Language: signs, meanings, and a code connecting signs with their meanings
  • Processing: computational methods that allow computers to "understand" or to generate language
13

Beyond keyword matching

  • Identify the structure and meaning of words, sentences, texts and conversations
  • Deep understanding of broad language
14

Why is NLP Hard?

15

Human languages are messy, ambiguous, and ever-changing

A string may have many possible interpretations at every level

The correct resolution of the ambiguity will depend on the intended meaning, which is often inferable from the context

16

There is tremendous diversity in human languages

Languages express the same kind of meaning in different ways

Some languages express some meanings more readily/often

17

Knowledge Bottleneck

About language

About the world: Common sense and Reasoning

18

Ambiguity and Expressivity

19

Christopher Robin is alive and well. He is the same person that you read about in the book, Winnie the Pooh. As a boy, Chris lived in a pretty home called Cotchford Farm. When Chris was three years old, his father wrote a poem about him. The poem was printed in a magazine for others to read. Mr. Robin then wrote a book

Who wrote Winnie the Pooh?

Where did Chris live?

20

Lexical ambiguity (Word sense ambiguity)

The presence of two or more possible meanings within a single word

21

Syntactic ambiguity (Structural ambiguity)

The presence of two or more possible meanings within a single sentence or sequence of words

22

Attachment ambiguity

The policeman shot the thief with the gun
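
The two readings of this sentence can be written down as two different tree structures over the same words. The sketch below uses plain nested tuples; the labels and bracketing are simplifying assumptions for illustration, not the output of a real parser.

```python
# Each tree is (label, child, child, ...); leaves are plain strings.
sentence = "The policeman shot the thief with the gun"

# Reading 1: "with the gun" attaches to the verb (the policeman used the gun).
verb_attachment = (
    "S",
    ("NP", "The policeman"),
    ("VP", ("V", "shot"), ("NP", "the thief"), ("PP", "with the gun")),
)

# Reading 2: "with the gun" attaches to the noun (the thief had the gun).
noun_attachment = (
    "S",
    ("NP", "The policeman"),
    ("VP", ("V", "shot"), ("NP", ("NP", "the thief"), ("PP", "with the gun"))),
)


def leaves(tree):
    """Return the words of a tree, left to right."""
    if isinstance(tree, str):
        return tree.split()
    words = []
    for child in tree[1:]:  # tree[0] is the label
        words.extend(leaves(child))
    return words
```

Both trees yield exactly the same word sequence, which is why the string alone cannot tell the readings apart.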

23

Pronoun Reference ambiguity

24

Semantic Ambiguity

Every fifteen minutes a woman in this country gives birth. Our job is to find this woman, and stop her!

Groucho Marx
25

Sparsity

26

Zipf's Law

“... given some document collection, the frequency of any word is inversely proportional to its rank in the frequency table...”
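
A quick way to inspect this relationship on any text is to build the rank-frequency table directly. The sketch below is a minimal illustration using only the standard library; the crude regular-expression tokeniser is an assumption. If frequency is roughly proportional to 1/rank, the product rank × frequency should stay roughly constant on a large collection.

```python
import re
from collections import Counter


def rank_frequency(text):
    """Return (rank, word, frequency, rank * frequency) rows, most frequent first."""
    words = re.findall(r"[a-z']+", text.lower())  # crude tokeniser (assumption)
    counts = Counter(words)
    rows = []
    for rank, (word, freq) in enumerate(counts.most_common(), start=1):
        rows.append((rank, word, freq, rank * freq))
    return rows


rows = rank_frequency("the cat the dog the bird a cat")
```

A toy string like the one above is far too small to show the pattern; it only demonstrates the table layout.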

27
28

Language Evolution

29
LOL Laugh out loud
G2G Got to go
BFN Bye for now
B4N Bye for now
Idk I don't know
FWIW For what it's worth
LUWAMH Love you with all my heart
30
31

NLP Tasks

32

An example of NLP Process

33

Morphology

34

Tokenisation

  • Separation of words (or of morphemes) in a sentence
  • Issues
    • Separators: punctuation
    • Exceptions: "m.p.h.", "Ph.D."
    • Expansions: "we're" = "we are"
    • Multi-word expressions: "New York", "doghouse"
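
The issues listed above can be made concrete with a small whitespace-based tokeniser. This is a minimal sketch: the three lookup tables are hypothetical and deliberately tiny, and a real tokeniser would need far larger resources or a trained model.

```python
# Hypothetical, deliberately tiny lookup tables (assumptions for illustration).
ABBREVIATIONS = {"m.p.h.", "ph.d."}
CONTRACTIONS = {"we're": ["we", "are"]}
MULTIWORD = {("new", "york")}
PUNCTUATION = ".,;:!?"


def tokenise(text):
    tokens = []
    for chunk in text.split():
        if chunk.lower() in ABBREVIATIONS:
            tokens.append(chunk)  # exception: keep the internal full stops
            continue
        core = chunk.rstrip(PUNCTUATION)
        trailing = chunk[len(core):]
        if core.lower() in CONTRACTIONS:
            tokens.extend(CONTRACTIONS[core.lower()])  # expansion
        elif core:
            tokens.append(core)
        tokens.extend(trailing)  # each punctuation mark becomes its own token

    # Second pass: merge known multi-word expressions into one token.
    merged, i = [], 0
    while i < len(tokens):
        pair = tuple(t.lower() for t in tokens[i:i + 2])
        if pair in MULTIWORD:
            merged.append(" ".join(tokens[i:i + 2]))
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


tokens = tokenise("We're in New York, driving at 60 m.p.h. today.")
```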
35

Stop-word Removal

  • Removal of high-frequency words, which carry less information
    • E.g. determiners, prepositions
  • An English stop list contains about 200-300 terms (e.g., been, a, about, otherwise, the)
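
As a minimal sketch, stop-word removal is a filter against a fixed list. The list below is a small hypothetical sample, not a complete English stop list.

```python
# Small hypothetical sample of a stop list (a real one has a few hundred terms).
STOP_WORDS = {"a", "about", "an", "been", "in", "of", "otherwise", "the", "to"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not in the stop list (case-insensitive)."""
    return [t for t in tokens if t.lower() not in stop_words]


content_words = remove_stop_words(["The", "review", "of", "the", "product"])
```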
36

Stemming

  • Heuristic process that chops off the ends of words in the hope of reducing related words to a common base form correctly most of the time
  • Stemming collapses derivationally related words
  • Two basic types:
    • Algorithmic: uses programs to determine related words
    • Dictionary-based: uses lists of related words
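
The two types can be contrasted in a few lines. This is a toy sketch: the suffix list and the related-word table are assumptions made for illustration and are much cruder than established stemmers such as the Porter stemmer.

```python
# Algorithmic: strip the first matching suffix (longest suffixes listed first).
SUFFIXES = ("ations", "ation", "ions", "ion", "ing", "ed", "es", "s")


def algorithmic_stem(word, min_stem=3):
    word = word.lower()
    for suffix in SUFFIXES:
        # Only strip when at least min_stem characters would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


# Dictionary-based: look the word up in a hand-made list of related words.
RELATED_WORDS = {"running": "run", "runs": "run", "runner": "run"}


def dictionary_stem(word):
    return RELATED_WORDS.get(word.lower(), word.lower())
```

With these rules, "connected", "connecting", and "connections" all collapse to "connect". The heuristic also makes mistakes: "news" is reduced to "new".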
37

Lemmatisation

It uses dictionaries and morphological analysis of words to return the base or dictionary form of a word

Example: lemmatisation of "saw" attempts to return "see" or "saw", depending on whether the token is used as a verb or a noun
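
The dependence on part of speech can be sketched as a dictionary lookup keyed on both the token and its tag. The table and the tag names below are hypothetical; a real lemmatiser combines a full dictionary with morphological analysis.

```python
# Hypothetical lemma dictionary keyed on (token, part of speech).
LEMMAS = {
    ("saw", "VERB"): "see",
    ("saw", "NOUN"): "saw",
    ("mice", "NOUN"): "mouse",
    ("better", "ADJ"): "good",
}


def lemmatise(token, pos):
    """Return the dictionary form, falling back to the lower-cased token."""
    return LEMMAS.get((token.lower(), pos), token.lower())
```

Unlike a suffix-stripping stemmer, this lookup can map "mice" to "mouse", but only because the entry is in the dictionary.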

38

Syntax

39

Part-of-speech Tagging

Tagging each word in a sentence with its corresponding part of speech (e.g. noun, verb, adverb)
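
A very rough baseline tagger looks each word up in a lexicon and falls back on suffix rules. Everything in the sketch below (lexicon, tag names, fallback rules) is an assumption for illustration; practical taggers are trained on annotated corpora and use the surrounding context.

```python
# Hypothetical mini-lexicon mapping words to part-of-speech tags.
LEXICON = {
    "the": "DET", "a": "DET", "dog": "NOUN", "park": "NOUN",
    "runs": "VERB", "in": "ADP", "quickly": "ADV",
}


def tag(tokens):
    """Tag each token using the lexicon, then simple suffix heuristics."""
    tagged = []
    for token in tokens:
        word = token.lower()
        if word in LEXICON:
            pos = LEXICON[word]
        elif word.endswith("ly"):
            pos = "ADV"
        elif word.endswith(("ing", "ed")):
            pos = "VERB"
        else:
            pos = "NOUN"  # default guess for unknown words
        tagged.append((token, pos))
    return tagged


tagged = tag("The dog barked loudly".split())
```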

40

Named Entity Recognition

  • Factual information and knowledge are usually expressed by named entities
    • Who, Whom, Where, When, Which, ...
  • Identify words that refer to proper names of interest in a particular application
    • E.g. people, companies, locations, dates, product names, prices, etc.
  • Classify them to the corresponding classes (e.g. person, location)
  • Assign a unique identifier from a database
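
The three steps above (identify, classify, assign an identifier) can be sketched with a gazetteer, i.e. a list of known names. The entries, class labels, and identifiers below are all hypothetical; real systems typically use trained sequence models and a proper knowledge base.

```python
import re

# Hypothetical gazetteer: name -> (class, identifier in an imaginary database).
GAZETTEER = {
    "delft": ("LOCATION", "loc:001"),
    "new york": ("LOCATION", "loc:002"),
    "alan turing": ("PERSON", "per:001"),
}


def find_entities(text):
    """Return (surface form, class, identifier) triples in order of appearance."""
    hits = []
    for name, (label, entity_id) in GAZETTEER.items():
        pattern = r"\b" + re.escape(name) + r"\b"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            hits.append((match.start(), match.group(0), label, entity_id))
    return [(surface, label, entity_id) for _, surface, label, entity_id in sorted(hits)]


entities = find_entities("Alan Turing never visited Delft.")
```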
41
42

Language Analysis

  • Idea: people's language can provide insights into their psychological states (e.g. emotions, thinking style)
  • For instance
    • Frequency of words associated with positive or negative emotions
    • Use of pronouns as a proxy for confidence and character traits
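
Both signals can be approximated by counting how often words from a category list occur. The word lists below are tiny hypothetical samples, and the output keys are names chosen for this sketch.

```python
import re

# Tiny hypothetical category lists (real dictionaries contain thousands of entries).
POSITIVE = {"happy", "love", "great", "good"}
NEGATIVE = {"sad", "hate", "terrible", "bad"}
FIRST_PERSON_SINGULAR = {"i", "me", "my", "mine", "myself"}


def language_profile(text):
    """Return the percentage of words falling in each category."""
    words = re.findall(r"[a-z']+", text.lower())
    total = len(words) or 1  # avoid dividing by zero on empty input

    def rate(category):
        return 100 * sum(w in category for w in words) / total

    return {
        "positive_pct": rate(POSITIVE),
        "negative_pct": rate(NEGATIVE),
        "first_person_pct": rate(FIRST_PERSON_SINGULAR),
    }


profile = language_profile("I love my great job")
```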
43
  • Analytic Thinking: the degree to which people use words that suggest formal, logical, and hierarchical thinking patterns
    • Low analytic thinking: language that is more intuitive and personal
  • Clout: the relative social status, confidence, or leadership that people display through their writing or talking
  • Authenticity: the degree to which a person communicates spontaneously, without self-monitoring
    • Low authenticity: prepared texts (i.e., speeches written ahead of time) and texts where a person is being socially cautious
  • Emotional tone: the higher the number, the more positive the tone. Numbers below 50 suggest a more negative emotional tone.
44
45
46

47

Sentiment Analysis

  • The detection of attitudes

    "enduring, affectively colored beliefs, dispositions towards objects or persons"
  • Main elements
    • Holder (source)
    • Target (aspect)
    • Type of attitude
    • Text containing the attitude
  • Tasks
    • Classification: Is the attitude of the text positive or negative?
    • Regression: Rank the attitude of the text from 1 to 5
    • Advanced: Detect the target, source, or complex attitude types
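
The classification and regression tasks can be sketched with a lexicon-based scorer. This is a simple stand-in, not a trained model: the word lists, the one-word negation rule, the extra "neutral" class, and the mapping to a 1-5 rating are all assumptions. The `Attitude` record mirrors the main elements listed above.

```python
import re
from dataclasses import dataclass

POSITIVE = {"good", "great", "love", "excellent"}
NEGATIVE = {"bad", "poor", "hate", "terrible"}
NEGATORS = {"not", "never", "no"}


@dataclass
class Attitude:
    holder: str    # who holds the attitude (source)
    target: str    # what the attitude is about (aspect)
    polarity: str  # type of attitude, here only positive/negative/neutral
    text: str      # text containing the attitude


def score(text):
    """Sum +1/-1 per sentiment word, flipping the sign after a negator."""
    words = re.findall(r"[a-z']+", text.lower())
    total = 0
    for i, word in enumerate(words):
        value = (word in POSITIVE) - (word in NEGATIVE)
        if i > 0 and words[i - 1] in NEGATORS:
            value = -value
        total += value
    return total


def classify(text):
    s = score(text)
    return "positive" if s > 0 else "negative" if s < 0 else "neutral"


def rate(text):
    """Map the score to a 1-5 rating, with 3 as the neutral midpoint."""
    return max(1, min(5, 3 + score(text)))


review = "The screen is not good"
attitude = Attitude(holder="reviewer", target="screen", polarity=classify(review), text=review)
```

Holder and target are filled in by hand here; detecting them automatically is the "advanced" task.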
48

49

50

Emotion Analysis

51

Semantics

52

Document Categorisation

  • Assigning a label or category to an entire text or document
  • Supervised learning
  • For instance
    • Spam vs. Not spam
    • Language identification
    • Authorship attribution
    • Assigning a library subject category or topic label
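
One classic supervised approach to the spam example is a multinomial Naive Bayes classifier. The sketch below is one possible method written with the standard library only, using add-one smoothing; the four training documents are made up for illustration.

```python
import math
from collections import Counter, defaultdict


def train(labelled_docs):
    """Count documents per class and words per class from (text, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in labelled_docs:
        words = text.lower().split()
        class_counts[label] += 1
        word_counts[label].update(words)
        vocabulary.update(words)
    return class_counts, word_counts, vocabulary


def predict(model, text):
    """Return the class with the highest log-probability for the text."""
    class_counts, word_counts, vocabulary = model
    n_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        log_prob = math.log(class_counts[label] / n_docs)  # class prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word in vocabulary:  # ignore words never seen in training
                smoothed = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
                log_prob += math.log(smoothed)
        if log_prob > best_score:
            best_label, best_score = label, log_prob
    return best_label


# Made-up training data.
model = train([
    ("win money now", "spam"),
    ("cheap money offer", "spam"),
    ("meeting at noon", "not spam"),
    ("project meeting notes", "not spam"),
])
```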

ML4D Course Description

53

Topic Modeling

  • A topic is the subject or theme of a discourse
  • Topic modeling: group documents/text according to their (semantic) similarity
  • An unsupervised machine learning approach
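
The idea of grouping texts by similarity without labels can be sketched with bag-of-words vectors and cosine similarity. This greedy grouping is a simplified stand-in and not a real topic model such as LDA; the threshold value and the example documents are assumptions.

```python
import math
from collections import Counter


def vectorise(text):
    """Bag-of-words vector: word -> count."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def group_documents(docs, threshold=0.3):
    """Greedily put each document in the first group whose seed document is similar enough."""
    vectors = [vectorise(d) for d in docs]
    groups = []  # each group is a list of document indices
    for i, vec in enumerate(vectors):
        for group in groups:
            if cosine(vec, vectors[group[0]]) >= threshold:
                group.append(i)
                break
        else:
            groups.append([i])  # no similar group found: start a new one
    return groups


docs = [
    "football match goal",
    "goal scored in football match",
    "stock market prices",
    "market prices fall",
]
groups = group_documents(docs)
```

No labels are supplied anywhere: the two groups (sport, finance) emerge from word overlap alone.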

ML4D Course Description

54

Word Sense Disambiguation

  • Multiple words can be spelled the same way (homonymy)
  • The same word can also have different, related senses (polysemy)
  • Disambiguation depends on context!
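
One simple way to use context is to pick the sense whose definition shares the most words with the sentence, in the spirit of the simplified Lesk algorithm. The sense inventory, glosses, and stop list below are hypothetical.

```python
# Hypothetical sense inventory: word -> {sense name: short definition (gloss)}.
SENSES = {
    "bank": {
        "financial institution": "an institution that accepts money deposits and gives loans",
        "river side": "sloping land beside a river or lake water",
    }
}
STOP = {"a", "an", "and", "the", "that", "or", "of", "to", "on", "in", "at"}


def disambiguate(word, sentence):
    """Return the sense whose gloss overlaps most with the sentence."""
    context = set(sentence.lower().split()) - STOP
    best_sense, best_overlap = None, -1
    for sense, gloss in SENSES[word].items():
        overlap = len(context & (set(gloss.split()) - STOP))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense
```

With these glosses, "I deposited money at the bank" resolves to the financial sense and "We sat on the bank of the river" to the river sense.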

55

Automated Summarisation

  • Condensing a piece of text to a shorter version while preserving key informational elements and the meaning of content
  • A challenging task!
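
A basic extractive approach keeps the sentences whose words are most frequent in the whole text. The sketch below is a crude baseline under stated assumptions (regular-expression sentence splitting, a tiny stop list, average word frequency as the score); it selects existing sentences and does not rewrite anything.

```python
import re
from collections import Counter

STOP_WORDS = {"a", "an", "and", "the", "is", "was", "of", "to", "in", "also"}


def content_words(text):
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP_WORDS]


def summarise(text, n_sentences=1):
    """Return the n highest-scoring sentences, in their original order."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    freq = Counter(content_words(text))

    def score(sentence):
        words = content_words(sentence)
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:n_sentences])
    return " ".join(sentences[i] for i in keep)


text = "Cats sleep a lot. Cats hunt mice and cats hunt birds. The weather was mild."
summary = summarise(text)
```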

https://textsummarization.net/

https://brevi.app/single-demo (not working!)

56

Machine Translation (popular languages)

57

Machine Translation (languages with fewer resources)

58

Natural Language Instructions / Dialog systems

59

Natural Language Generation

60

State of the Art in NLP

As of 2022

61

Credits: Nava Tintarev

63

Credits

CIS 419/519 Applied Machine Learning. Eric Eaton, Dinesh Jayaraman. https://www.seas.upenn.edu/~cis519/spring2020/

EECS498: Conversational AI. Kevin Leach. https://dijkstra.eecs.umich.edu/eecs498/

CS 4650/7650: Natural Language Processing. Diyi Yang. https://www.cc.gatech.edu/classes/AY2020/cs7650_spring/

Natural Language Processing. Alan W Black and David Mortensen. http://demo.clab.cs.cmu.edu/NLP/

IN4325 Information Retrieval. Jie Yang.

Speech and Language Processing, An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Third Edition. Daniel Jurafsky, James H. Martin.

Natural Language Processing, Jacob Eisenstein, 2018.