Tokenization: finding the words in a text, i.e. knowing where to split → results in tokens.
Lemma: the base (dictionary) form of a token, e.g. "ran" → "run".
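A minimal sketch of the two terms above, assuming whitespace splitting is good enough for the example sentence. The `LEMMAS` table and both helper functions are hypothetical and exist only for illustration; real lemmatizers use much larger dictionaries or rules.

```python
# Toy lemma table (hypothetical, illustration only).
LEMMAS = {"mice": "mouse", "ran": "run", "running": "run"}


def tokenize(text):
    """Naive tokenizer: split on whitespace only."""
    return text.split()


def lemmatize(token):
    """Look up the base form; fall back to the lowercased token itself."""
    lowered = token.lower()
    return LEMMAS.get(lowered, lowered)


tokens = tokenize("The mice ran home")
lemmas = [lemmatize(t) for t in tokens]
print(tokens)  # ['The', 'mice', 'ran', 'home']
print(lemmas)  # ['the', 'mouse', 'run', 'home']
```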
Challenges
- Punctuation marks and special symbols
- where does punctuation go: attached to the word or as its own token?
 
 - Morphosyntactic words
- contractions
- e.g. doesn’t
 - simple expansion may not work: some forms are ambiguous
- doesn’t → does not
 - Carla’s home → home of Carla OR Carla is home
 
 
 - compound words
- e.g. rock-and-roll
 
 
 - numbers
- measurements e.g. 5 miles
 
 - Unicode characters (beyond ASCII)
- emojis
 
 - errors in the source text
- if the source comes from OCR, recognition errors may propagate into the tokens
 
 - other languages
- transliteration (e.g. Hindi written in the Latin alphabet)
 - not whitespace-delimited (e.g. Chinese/Japanese)
 - right-to-left scripts (e.g. Arabic)
 
 - Formats and conventions differ across locales and domains
- telephone numbers, dates, decimals, monetary values, coordinates, etc.
 
 - domain dependence
 - named entities?
- Named Entity Recognition (NER) groups several tokens into one entity
 - e.g. San Francisco
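Some of the challenges listed above can be made concrete with a small regex-based tokenizer. This is only a sketch under simplifying assumptions: the pattern and the `tokenize` name are made up for illustration, it targets English text, and it does not attempt dates, phone numbers, OCR errors, or languages without whitespace. Note that it splits off `'s` without deciding whether it means possession or "is"; that ambiguity has to be resolved later.

```python
import re

# Alternatives are tried left to right, so the specific cases come first.
TOKEN_RE = re.compile(
    r"""
    \d+(?:[.,]\d+)*        # numbers, including decimals: 5, 3.5, 1,000
  | \w+(?:-\w+)+           # hyphenated compounds: rock-and-roll
  | \w+(?=n['’]t\b)        # word stem in front of the negation clitic
  | n['’]t\b               # the negation clitic itself
  | ['’]s\b                # possessive or "is" clitic, left ambiguous
  | \w+                    # ordinary words
  | [^\w\s]                # any other single symbol: punctuation, emoji
    """,
    re.VERBOSE,
)


def tokenize(text):
    """Return the list of tokens matched by TOKEN_RE, in order."""
    return TOKEN_RE.findall(text)


sentence = "Carla's band doesn't play rock-and-roll, even 3.5 miles away 🎸!"
print(sentence.split())    # whitespace only: punctuation stays glued to words
print(tokenize(sentence))  # clitics, punctuation and the emoji are separated
```

Compared with plain `str.split()`, which leaves `rock-and-roll,` and `🎸!` as single tokens, the regex version separates punctuation and clitics while keeping the compound and the decimal number intact.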
 
 
Implementations
spaCy’s tokenization
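As a quick look at spaCy's tokenizer, the sketch below assumes spaCy is installed and uses a blank English pipeline, which contains only the tokenizer and needs no trained model. The example sentence is made up for illustration.

```python
import spacy

# Blank English pipeline: rule-based tokenizer only, no model download needed.
nlp = spacy.blank("en")

doc = nlp("Carla doesn't live in San Francisco.")
tokens = [token.text for token in doc]
print(tokens)
```

The contraction is split into `does` and `n't`, and `San Francisco` stays as two separate tokens; grouping them into a single entity would be the job of a later NER component, which a blank pipeline does not include.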