((G)) A grammar for the SiteRiter Site Definition Language --------- ((G.0)) Table of contents (G.1) Grammar overview (G.2) About EBNF (G.3) An EBNF grammar for the SiteRiter SDL --------- ((G.1)) Grammar overview ((G.1.1)) Programming languages are often described by a 'grammar', giving a set of rules that determine what constitutes a legal string in the language. Grammars can provide a clear and concise description of a language, but they can be difficult to implement directly, because a grammar for a language, on its own, doesn't necessarily provide enough information to specify unambiguously how processing must be performed, and may expose certain aspects of the language that can make efficient implementation difficult. Nonetheless, for what precision it does afford, below we present an EBNF grammar for an SiteRiter SDL program. ((G.1.2)) Low-level SiteRiter SDL input processing ((G.1.2.1)) Traditionally in programming languages grammar specifications, a concept of a 'token' is invoked. A 'token' is sort of a 'generalized input character', meant to simplify grammar descriptions by merging together sets of characters that will be treated identically by the grammar. ((G.1.2.2)) Tokens have _types_ and _values_, the type determining the general category of the token, and the value holding the rest of the needed details. The grammar rules will only be concerned with the types of tokens, but other processing can examine the token values as well. ((G.1.2.2.1)) For example, grammar rules often involve numbers and where they can appear in input, but are rarely concerned with what _specific numbers_ are specified. So, one token type might be something like DECIMALDIGIT, with possible token values ranging over any of the characters: '0', '1', '2', '3', '4', '5', '6', '7', '8', '9'. Or, at a higher level, a token type might be something like NUMBER, with possible token values ranging over anything that fits in an int. ((G.1.2.3)) The definition of what constitutes a token, in any particular language, defines a boundary line between what is called 'lexical analysis' -- which is the job of mapping from a raw input stream to a sequences of tokens -- and what is called 'parsing' -- which is the job of determining whether a given sequence of tokens is grammatically correct, and deciding what to about it if it is (or isn't, for that matter). ((G.1.2.3)) The SiteRiter SDL is broadly defined in terms of three groups of token types: NAME, LITERAL, and OPERATOR. The value of a NAME is the specific identifier names; the value of a LITERAL is either a single- or double-quoted string; the value of an OPERATOR is a single char identifying which operator is involved. ((G.1.2.4)) So we can define the complete set of possible token types as follows: ((G.1.2.4.1)) Token type: EQUAL Token value: '=' ((G.1.2.4.2)) Token type: BAR Token value: '|' ((G.1.2.4.3)) Token type: COLON Token value: ':' ((G.1.2.4.4)) Token type: SEMICOLON Token value: ';' ((G.1.2.4.5)) Token type: DLITERAL Token value: '"' followed by 0 or more non-'"' chars, followed by '"' ((G.1.2.4.5)) Token type: SLITERAL Token value: '\'' followed by 0 or more non-'\'' chars, followed by '\'' ((G.1.2.4.6)) Token type: NAME Token value: One or more char such that the first char is a valid beginning of a Java identifier, and the subsequent chars, if any, are valid parts of a Java identifier ((G.1.2.4.7)) Token type: EOI Token value: none ((G.1.2.5)) The meanings of the seven tokens should mostly be obvious to anyone familiar with the SiteRiter SDL language. ((G.1.2.6)) The EOI -- meaning 'End Of Input' -- token does not correspond to _any_ char whatsoever; it is instead a (conceptual) marker meaning the end of the input stream has been encountered. ((G.1.2.7)) Handling Whitespace ((G.1.2.7.1)) There is one additional important task performed by the SiteRiter lexer that is not covered in that description of the token types. That task is dealing with 'whitespace' -- chars like spaces, TABs, newlines, and whatever other chars are defined to be whitespace according to the Java rules for characters. ((G.1.2.7.2)) In SDL, how whitespace is handled depends on context. In general whitespace chars are discarded by the lexer, except: ((G.1.2.7.2.1)) The appearance of whitespace terminates any in-progress NAME token (and then the whitespace is ignored); and ((G.1.2.7.2.2)) Inside LITERAL tokens each and every whitespace char must be preserved, exactly as given, as part of the token value. ((G.1.2.7.3)) Because whitespace -- which includes newlines -- are significant parts of LITERALs (e.g, see (I.3.3.2)), SDL is emphatically NOT a 'line oriented language', and SiteRiter system implementors are STRONGLY CAUTIONED against trying to parse SDL input on a line-by-line basis. --------- ((G.2)) About EBNF ((G.2.1)) If you are not familiar with EBNF, this section will probably prove very instructive (and the next section very confusing, before understanding this one)! Here's a quick crib sheet on EBNF, which stands for 'Extended Backus-Naur Form': ((G.2.1.1)) EBNF is a notational scheme for describing what counts as syntactically correct input for some given language. It's a 'metalanguage' for describing languages. Sometimes it's used directly as input to a 'compiler-compiler' program---a program that produces a compiler (or pieces of a compiler) given a description of a language. (More often, though, the simpler BNF notation is used for that purpose, and) EBNF is used for semi-formal descriptions intended to be read by people, such as yourself, rather than programs. ((G.2.1.2)) EBNF can be found in a huge number of variations (which is ironic since it was first defined in a paper by Wirth entitled 'What Can We Do About the Unnecessary Diversity of Notation for Syntatic Definitions'. The version presented here is typical but by no means unique. ((G.2.1.3)) EBNF consists of _terminal_ symbols, which in this presentation are token types such as BAR (G.1.2.4.2), _non-terminal_ symbols, such as , below, parentheses for grouping, and three other notations: ((G.2.1.3.1)) The '|' symbol for expressing alternatives, ((G.2.1.3.2)) Surrounding '{}' brackets for expressing 'zero-or-more repetitions of', and ((G.2.1.3.3)) Surrounding '[]' brackets for expressing optional components ('zero-or-one repetition of'). ((G.2.1.4)) The grammar is expressed as a series of _rules_, or _productions_, consisting of a 'left-hand side', which is a single _non-terminal name_, the symbol '::=', and a 'right-hand side', which is some expression indicating how the indicated non-terminal can be rewritten legally in the language. ((G.2.1.5)) The expression on the right-hand side of a rule is interpreted sequentially, except that sub-expressions grouped by '()', '[]', or '{}' are interpreted first. ((G.2.1.6)) '|' has lower precedence that sequentiality, so, for example, the rule: ::= BAR BLETCH | MUMBLE is interpreted as: ::= (BAR BLETCH) | MUMBLE rather than: ::= BAR (BLETCH | MUMBLE) -- i.e., a 'foo' is either a 'BAR'-type token followed by a 'BLETCH'-type token, _or_ just a 'MUMBLE'-type token by itself. ((G.2.1.7)) Unless otherwise indicated, the set of legal strings in a language is determined by the set of all possible strings that can be produced by rewriting the non-terminal appearing on the left-hand-side of rule that appears _first_ in the grammar. ((G.2.1.7.1)) That first non-terminal symbol is usually called the _start symbol_. --------- ((G.3)) An EBNF grammar for the SiteRiter SDL ((G.3.1)) An SiteRiter SDL program can be described by the following simple EBNF grammar, containing only five productions: ((G.3.1.1)) ::= { } EOI ((G.3.1.2)) ::= EQUAL SEMICOLON ((G.3.1.3)) ::= [ COLON ] NAME [ COLON NAME ] ((G.3.1.4)) ::= | { BAR } ((G.3.1.5)) ::= { NAME | SLITERAL | DLITERAL } ((G.3.2)) As an example of using this grammar, we could ask: Is this sequence of tokens a legal SiteRiter SDL program according to the grammar (G.3.1): NAME EQUAL NAME SEMICOLON ((G.3.2.1.1)) (An example of a sequence of _char_ that would lead to that sequence of _tokens_ could be, say, 'foo = bar;'.) ((G.3.2.1.2)) The answer is 'Yes'; we can prove it by _deriving_ that sequence, starting from the start symbol: ((G.3.2.1.2.1)) Begin with start symbol: ((G.3.2.1.2.2)) Rewrite (G.3.1.1): { } EOI ((G.3.2.1.2.2.1)) Pick zero copies for { }: EOI ((G.3.2.1.2.3)) Rewrite (G.3.1.2): EQUAL SEMICOLON ((G.3.2.1.2.4)) Rewrite : [ COLON ] NAME [ COLON NAME ] EQUAL SEMICOLON ((G.3.2.1.2.4.1)) Pick zero copies for both [ ]: NAME EQUAL SEMICOLON ((G.3.2.1.2.5)) Rewrite NAME EQUAL ( | { BAR } ) SEMICOLON ((G.3.2.1.2.5.1)) Pick 2nd alt for |: NAME EQUAL { BAR } SEMICOLON ((G.3.2.1.2.5.2)) Pick zero copies for { }: NAME EQUAL SEMICOLON ((G.3.2.1.2.6)) Rewrite : NAME EQUAL { NAME | SLITERAL | DLITERAL } SEMICOLON ((G.3.2.1.2.6.1)) Pick 1st alt for |: NAME EQUAL NAME SEMICOLON ((G.3.2.1.3)) We have derived a sequence of all terminal symbols, so no more grammar rules can possibly apply, and the derived sequence is just what we were trying to obtain. ((G.3.2.1.3.1)) So we thus have proved that NAME EQUAL NAME SEMICOLON is a legal sequence of tokens according to the grammar, and so a sequence of characters like 'foo = bar;' is therefore legal too, according to the grammar. (What such a program actually might _output_, however, is a separate question! If that is puzzling, think about what output might be produced by another program that parses into that same legal sequence of tokens: 'foo=foo;'!)