Documents are typically unstructured, or have complicated structures designed with human readability in mind.
Annotations are mark-ups added to the document (after conversion to raw text) that make it machine-readable. They can serve presentation, metadata, or content understanding.
Types
- Boundary notation annotation
- Inline markup language elements
- Stand-off annotation
 - delimiter-separated values (DSV)
 - JSON
 
 
Boundary Notation Annotation
- done at the level of individual tokens, in the text itself
- for units of interest that span multiple tokens, as in Named Entity Recognition (NER), we can use BIO
- BIO: B = begin, I = inside, O = outside
 - create a table of tokens with their annotations
 - every token that is not part of an entity is tagged O
 - the first token of every entity is tagged B (optionally with the entity type); if the entity spans multiple tokens, the remaining tokens are tagged I
 - e.g. The British Prime Minister Tony Blair and … →
  - The → O
  - British → B-Person
  - Prime → I-Person
  - Minister → I-Person
  - Tony → B-Person
  - Blair → I-Person
  - and → O
 
- cannot handle hierarchical or structured annotations such as nested entities, entity relations and events
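
The BIO scheme above can be sketched in Python. This is a minimal illustration, not a reference implementation: the function names `spans_to_bio` and `bio_to_spans`, and the choice to represent entities as (start token, end token exclusive, type) triples, are assumptions made for this example. It uses the tagging from the example above.

```python
from typing import List, Optional, Tuple

# (start token index, end token index exclusive, entity type): an assumed representation
Span = Tuple[int, int, str]

def spans_to_bio(tokens: List[str], spans: List[Span]) -> List[str]:
    """Assign one BIO tag per token from non-overlapping entity spans."""
    tags = ["O"] * len(tokens)  # every non-entity token stays O
    for start, end, label in spans:
        if any(tag != "O" for tag in tags[start:end]):
            # One tag per token leaves no room for nesting or overlap.
            raise ValueError("BIO cannot encode overlapping or nested spans")
        tags[start] = f"B-{label}"           # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"           # remaining tokens of the entity
    return tags

def bio_to_spans(tags: List[str]) -> List[Span]:
    """Recover entity spans from a BIO tag sequence."""
    spans: List[Span] = []
    start: Optional[int] = None
    label: Optional[str] = None
    for i, tag in enumerate(tags + ["O"]):   # sentinel O closes a trailing entity
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.append((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
        elif tag.startswith("I-") and start is None:
            # Lenient choice: treat an I tag with no preceding B as a new entity.
            start, label = i, tag[2:]
    return spans

tokens = ["The", "British", "Prime", "Minister", "Tony", "Blair", "and"]
spans = [(1, 4, "Person"), (4, 6, "Person")]
tags = spans_to_bio(tokens, spans)
for token, tag in zip(tokens, tags):
    print(f"{token} → {tag}")
```

The two adjacent Person entities show why B is needed: without it, "British Prime Minister Tony Blair" would decode as a single five-token entity.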
 
 
 
An event is a type of entity relation in which multiple entities participate, e.g. the attendees of a meeting, or the attacker and defender in a war.
Inline markup language elements
- markup tags are added within the text itself (in-text)
 - e.g. HTML, XML
- can handle hierarchical and structured annotations: nested entities, entity relations and events
 - relations can be encoded with tag attributes that link elements via IDs
 
- cannot directly encode overlapping/intersecting annotations, because tags must be properly nested
 - e.g. in "Iraqi city of Basra", the annotations "Iraqi city" and "city of Basra" overlap on "city"
 
- the marked-up text must be parsed to read the annotations, and serialized again to write them back
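
The points above can be illustrated with Python's standard `xml.etree.ElementTree` module. This is a sketch under stated assumptions: the sentence is invented around phrases from these notes, and the tag name `ne` and the attributes `id`, `type` and `same_as` are hypothetical names chosen for the example, not part of any standard scheme.

```python
import xml.etree.ElementTree as ET

# Invented sentence; <ne>, id, type and same_as are hypothetical names.
marked_up = (
    '<s>The <ne id="e1" type="Person">British Prime Minister</ne> '
    '<ne id="e2" type="Person" same_as="e1">Tony Blair</ne> visited the '
    '<ne id="e3" type="Location">Iraqi city of '
    '<ne id="e4" type="Location">Basra</ne></ne>.</s>'
)

root = ET.fromstring(marked_up)  # parsing step: markup to tree

# Stripping the tags recovers the plain text.
plain_text = "".join(root.itertext())

# Nested entities: e4 sits inside e3, and both are recovered.
entities = {
    ne.get("id"): (ne.get("type"), "".join(ne.itertext()))
    for ne in root.iter("ne")
}

# Relations: an attribute on one element points at another element's ID.
relations = [
    (ne.get("id"), "same_as", ne.get("same_as"))
    for ne in root.iter("ne")
    if ne.get("same_as") is not None
]

# Overlapping annotations are not well-formed XML, so the parser rejects them.
overlapping = "<s><a>Iraqi <b>city</a> of Basra</b></s>"
try:
    ET.fromstring(overlapping)
    overlap_parsed = True
except ET.ParseError:
    overlap_parsed = False

print(plain_text)
print(entities)
print(relations)
print("overlap parsed:", overlap_parsed)
```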
 
Stand-off annotations
- stored separately from the text
 - annotations are linked to the text by indexing with character offsets
 - can be stored as delimiter-separated values (DSV), such as CSV, or as JSON
  - e.g. Tony Blair → Person 144 154
  - e.g. { "id": "199", "ne_type": "Person", "begin": "144", "end": "154", "surface_form": "Tony Blair" }
 
- can handle hierarchical, structured and overlapping annotations
- not readily human-readable
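
A minimal sketch of stand-off annotation in Python, reusing the field names from the JSON example above (`id`, `ne_type`, `begin`, `end`, `surface_form`). Several things here are assumptions for illustration: the helper names `make_annotation`, `resolve` and `overlaps`, the `Location` labels, storing offsets as integers rather than strings, and treating `end` as exclusive (which matches 154 - 144 = 10 characters for "Tony Blair").

```python
import json

text = "The Iraqi city of Basra"  # the text itself is never modified

def make_annotation(ann_id, ne_type, surface_form):
    """Build a stand-off record with character offsets (end is exclusive)."""
    begin = text.index(surface_form)
    return {
        "id": ann_id,
        "ne_type": ne_type,
        "begin": begin,
        "end": begin + len(surface_form),
        "surface_form": surface_form,
    }

# Two overlapping annotations that inline markup could not express directly.
annotations = [
    make_annotation("1", "Location", "Iraqi city"),
    make_annotation("2", "Location", "city of Basra"),
]

# Stored separately from the text, here as a JSON string.
standoff_json = json.dumps(annotations)

def resolve(source_text, ann):
    """Look up the annotated span and check it against the stored surface form."""
    span = source_text[ann["begin"]:ann["end"]]
    if span != ann["surface_form"]:
        raise ValueError(f"offsets of annotation {ann['id']} do not match the text")
    return span

def overlaps(a, b):
    """True if two annotations share at least one character."""
    return a["begin"] < b["end"] and b["begin"] < a["end"]

loaded = json.loads(standoff_json)
for ann in loaded:
    print(ann["ne_type"], ann["begin"], ann["end"], resolve(text, ann))
print("overlap:", overlaps(loaded[0], loaded[1]))
```

Checking `surface_form` against the offsets is a cheap guard: if the text is edited after annotation, the offsets silently go stale, and this check surfaces the mismatch.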