Generative Grammar and Four Types of Grammars: Parsing Techniques Notes (1)
Generative Grammar
There are several views of defining a language. The computer science and formal linguistics perspective:
terms | definition |
---|---|
Language | a set of sentences |
Sentence | a sequence of symbols |
Alphabet | a set of all symbols |
The semantics - meaning - of a sentence is described by its tokens cooperating with its structure.
Grammar is the set of rules describing a language. Generative Grammar is
- exact
- fixed-sized
Language can be specified by infinite Bit-String
Sorted Alphabet: $\Sigma$ Language $\Sigma^{* }$ contains all combinations of symbols in $\Sigma$
The order of sentences in $\Sigma^* $ follows that:
from length $l = 0$
- Enumerate all combinations that of length $l$
- these combinations are listed according to alphabetical order (lexicographical sorting)
- increment $l$ and repeat from 1.
This will form an infinitely long sorted list. Every language composed by alphabet $\Sigma$ can be identified by choosing from $\Sigma^* $. If we encode this by binary representation, 0 → not including, 1 → including, we can create an infinitely long bit string that indicates every sentence the language contains.
For example: Language $L = 010010110…$
Formal Grammar
Recipe of replacing symbols:
Name -> tom | dick | harry // Name may be replaced by tom, dick or harry
...
A grammar is a 4-tuple $(V_N, V_T, R, S)$:
- $V_N$ non-terminals, $V_T$ terminals are finite sets of symbols
- $V_N \cap V_T = \varnothing$ terminals and non-terminals cannot have common symbols
- $R$ is the set of rules, a set that contains ordered pairs: ${(P, Q) \mid P\in (V_N\cup V_T)^+ \land Q\in (V_N \cup V_T)^* }$
- $S$ is the start symbol, $S\in V_N$
Four types of Grammars
Type 0: Phrase Structure Grammar (PS Grammar)
Most freedom. Follows 4-tuple $(V_N, V_T, R, S)$, without further restriction. Represented as Directed Acyclic Graph
: No cycle exists
Type 1: Context-Sensitive Grammar (CS Grammar)
There are two equivalent definitions: Monotonic
and Context-Sensitive
. Can be represented by a DAG, similar to PS Grammar.
Monotonic
For each rule, left-hand side has more or equal number of symbols to right-hand side.
Context-Sensitive
Every rule is context-sensitive.
- Left-hand side contains only one symbol to be changed in the right-hand side.
Type 2: Context-Free Grammar (CF Grammar)
LHS could only contain one non-terminal symbol. (Thus not related to neighboring symbols). Represented by a tree, as branches of a node is not relevant to other nodes.
The generative power of CF Grammar comes from two operations:
- Concatenation
- Choice (choosing from one of the alternatives in the RHS)
NT -> tom | NT dick | ...
Type 3: Regular Grammar
Mostly referring to right regular grammar
. Each rule could only contain one non-terminal, as the rightmost item. Represented by a list, because each sentential has only one replaceable item (non-terminal), or in a production chain.
All regular grammar can be expressed in a regular expression, which sufficiently equal to all rules in the grammar.
Regular Expression Notation Styles:
Notation | Description |
---|---|
$^{+ }$ | One or more instances of the left-adjacent item |
$^{* }$ | Zero or more instances of the left-adjacent item |
$^{? }$ | Zero or one instance of the left-adjacent item |
$[abc]$ | Choosing one from $(a, b, c)$, i.e., $(a\mid b\mid c)$ |
Example:
$S_S\to(([tdh],)^{* }[tdh] & )^{? }[tdh]$
Type 4: Finite Choice Grammar (FC Grammar)
Each rule could have only terminals in right-hand side. Very limited expressive power.