1
0
mirror of https://github.com/golang/go synced 2024-11-25 03:47:57 -07:00

Finish the lexical section.

DELTA=176  (172 added, 0 deleted, 4 changed)
OCL=25182
CL=25222
This commit is contained in:
Rob Pike 2009-02-19 16:20:00 -08:00
parent d1107adb52
commit 3d50b1e0e8

View File

@ -13,6 +13,47 @@ Go is a general-purpose language designed with systems programming in mind. It i
The grammar is simple and regular, allowing for easy analysis by automatic tools such as integrated development environments. The grammar is simple and regular, allowing for easy analysis by automatic tools such as integrated development environments.
</p> </p>
<h2>Notation</h2>
<p>
The syntax is specified using Extended Backus-Naur Form (EBNF):
</p>
<pre>
Production = production_name "=" Expression .
Expression = Alternative { "|" Alternative } .
Alternative = Term { Term } .
Term = production_name | token [ "..." token ] | Group | Option | Repetition .
Group = "(" Expression ")" .
Option = "[" Expression ")" .
Repetition = "{" Expression "}" .
</pre>
<p>
Productions are expressions constructed from terms and the following operators, in increasing precedence:
</p>
<pre>
| alternation
() grouping
[] option (0 or 1 times)
{} repetition (0 to n times)
</pre>
<p>
Lower-case production names are used to identify lexical tokens. Non-terminals are in CamelCase. Lexical symbols are enclosed in double quotes <tt>""</tt> (the
double quote symbol is written as <tt>'"'</tt>).
</p>
<p>
The form <tt>"a ... b"</tt> represents the set of characters from <tt>a</tt> through <tt>b</tt> as alternatives.
</p>
<p>
Where possible, recursive productions are used to express evaluation order
and operator precedence syntactically.
</p>
<h2>Lexical properties</h2> <h2>Lexical properties</h2>
<p> <p>
@ -43,9 +84,14 @@ There are two forms of comments. The first starts at a the character sequence <
<h3>Identifiers</h3> <h3>Identifiers</h3>
<p> <p>
An identifier is a sequence of one or more letters and digits. The meaning of <i>letter</i> and <i>digit</i> is defined by the Unicode properties for the corresponding characters, with the addition that the underscore character <tt>_</tt> (U+005F) is considered a letter. The first character in an identifier must be a letter. <font color=red>(Current implementation accepts only ASCII digits for digits.)</font> An identifier is a sequence of one or more letters and digits. The meaning of <i>letter</i> and <i>digit</i> is defined by the Unicode properties for the corresponding characters, with the addition that the underscore character <tt>_</tt> (U+005F) is considered a letter. The first character in an identifier must be a letter.
</p> </p>
<pre>
letter = unicode_letter | "_" .
identifier = letter { letter | unicode_digit } .
</pre>
<h3>Keywords</h3> <h3>Keywords</h3>
<p> <p>
@ -76,13 +122,139 @@ The following character sequences are tokens representing operators, delimiters,
<h4>Integer literals</h4> <h4>Integer literals</h4>
<p>
An integer literal is a sequence of one or more digits in the corresponding base, which may be 8, 10, or 16. An optional prefix sets a non-decimal base: <tt>0</tt> for octal, <tt>0x</tt> or <tt>0X</tt> for hexadecimal. In hexadecimal literals, letters <tt>a-f</tt> and <tt>A-F</tt> represent values 10 through 15.
</p>
<pre>
int_lit = decimal_lit | octal_lit | hex_lit .
decimal_lit = ( "1" ... "9" ) { decimal_digit } .
octal_lit = "0" { octal_digit } .
hex_lit = "0" ( "x" | "X" ) hex_digit { hex_digit } .
decimal_digit = "0" ... "9" .
octal_digit = "0" ... "7" .
hex_digit = "0" ... "9" | "A" ... "F" | "a" ... "f" .
</pre>
<p>
Integer literals represent values of arbitrary precision, or <i>ideal integers</i>; they have no implicit size or type.
</p>
<h4>Floating-point literals</h4> <h4>Floating-point literals</h4>
<p>
A floating-point literal is a decimal representation of a floating-point number. It has an integer part, a decimal point, a fractional part, and an exponent part. The integer and fractional part comprise decimal digits; the exponent part is an <tt>e</TT> or <tt>E</tt> followed by an optionally signed decimal exponent. One of the integer part or the fractional part may be elided; one of the decimal point or the exponent may be elided.
</p>
<pre>
float_lit = decimals "." [ decimals ] [ exponent ] |
decimals exponent |
"." decimals [ exponent ] .
decimals = decimal_digit { decimal_digit } .
exponent = ( "e" | "E" ) [ "+" | "-" ] decimals .
</pre>
<p>
As with integers, floating-point literals represent values of arbitrary precision, or <i>ideal floats</i>.
</p>
<h4>Character literals</h4> <h4>Character literals</h4>
<p>
A character literal represents an integer value, typically a Unicode code point, as one or more characters enclosed in single quotes. Within the quotes, any character may appear except single quote and newline; a quoted single character represents itself, while multi-character sequences beginning with a backslash encode values in various formats.
</p>
<p>
The simplest form represents the exact character within the quotes; since Go source text is Unicode characters encoded in UTF-8, multiple UTF-8-encoded bytes may represent a single integer value. For instance, the literal <tt>'a'</tt> holds a single byte representing a literal <tt>a</tt>, Unicode U+0061, value <tt>0x61</tt>, while <tt>'ä'</tt> holds two bytes (<tt>0xc3</tt> <tt>0xa4</tt>) representing a literal <tt>a</tt>-dieresis, U+00E4, value <tt>0xe4</tt>.
</p>
<p>
Several backslash escapes allow arbitrary values to be represented as ASCII text. There are four ways to represent the integer value as a numeric constant: <tt>\x</tt> followed by exactly two hexadecimal digits; <tt>\u</tt> followed by exactly four hexadecimal digits; <tt>\U</tt> followed by exactly eight hexadecimal digits, and a plain backslash <tt>\</tt> followed by exactly three octal digits. In each case the value of the literal is the value represented by the digits in the appropriate base.
</p>
<p>
Although these representations all result in an integer, they have different valid ranges. Octal escapes must represent a value between 0 and 255 inclusive. (Hexadecimal escapes satisfy this condition by construction). The `Unicode' escapes <tt>\u</tt> and <tt>\U</tt> represent Unicode code points so within them some values are illegal, in particular those above <tt>0x10FFFF</tt> and surrogate halves.
</p>
<p>
After a backslash, certain single-character escapes represent special values:
</p>
<pre>
\a U+0007 alert or bell
\b U+0008 backspace
\f U+000C form feed
\n U+000A line feed or newline
\r U+000D carriage return
\t U+0009 horizontal tab
\v U+000b vertical tab
\\ U+005c backslash
\' U+0027 single quote (legal within character literals only)
\" U+0022 double quote (legal within interpreted string literals only)
</pre>
<p>
All other sequences are illegal inside character literals.
</p>
<pre>
char_lit = "'" ( unicode_value | byte_value ) "'" .
unicode_value = unicode_char | little_u_value | big_u_value | escaped_char .
byte_value = octal_byte_value | hex_byte_value .
octal_byte_value = "\" octal_digit octal_digit octal_digit .
hex_byte_value = "\" "x" hex_digit hex_digit .
little_u_value = "\" "u" hex_digit hex_digit hex_digit hex_digit .
big_u_value = "\" "U" hex_digit hex_digit hex_digit hex_digit
hex_digit hex_digit hex_digit hex_digit .
escaped_char = "\" ( "a" | "b" | "f" | "n" | "r" | "t" | "v" | "\" | "'" | """ ) .
</pre>
<p>
The value of a character literal is an ideal integer, just as with integer literals.
</p>
<h4>String literals</h4> <h4>String literals</h4>
<p>
String literals represent constant values of type <tt>string</tt>. There are two forms: raw string literals and interpreted string literals.
</p>
<p>
Raw string literals are character sequences between back quotes <tt>``</tt>. Within the quotes, any character is legal except newline and back quote. The value of a raw string literal is the string composed of the uninterpreted bytes between the quotes.
</p>
<p>
Interpreted string literals are character sequences between double quotes <tt>&quot;&quot;</tt>. The text between the quotes forms the value of the literal, with backslash escapes interpreted as they are in character literals. The three-digit octal (<tt>\000</tt>) and two-digit hexadecimal (<tt>\x00</tt>) escapes represent individual <i>bytes</i> of the resulting string; all other escapes represent the (possibly multi-byte) UTF-8 encoding of individual <i>characters</i>. Thus inside a string literal <tt>\377</tt> and <tt>\xFF</tt> represent a single byte of value <tt>0xFF</tt>=255, while <tt>ÿ</tt>, <tt>\u00FF</tt>, <tt>\U000000FF</tt> and <tt>\xc3\xbf</tt> represent the two bytes <tt>0xc3 0xbf</tt> of the UTF-8 encoding of character U+00FF.
</p>
<pre>
string_lit = raw_string_lit | interpreted_string_lit .
raw_string_lit = "`" { unicode_char } "`" .
interpreted_string_lit = """ { unicode_value | byte_value } """ .
</pre>
<p>
During tokenization, two adjacent string literals separated only by the empty string, white space, or comments are implicitly combined into a single string literal whose value is the concatenated values of the literals.
</p>
<pre>
StringLit = string_lit { string_lit } .
</pre>
<h2>Everything else</h2>
<p>
I don't believe this organization is complete or correct but it's here to be worked on and thought about.
</p>
<h2>Types</h2>
<h2>Constants</h2>
<h2>Expressions</h2>
<h2>Declarations</h2>
<h2>Control Structures</h2>
<h2>Program structure</h2>
<h2>Packages</h2>
<h2>Differences between this doc and implementation - TODO</h2>
<p>
<font color=red>
Current implementation accepts only ASCII digits for digits; doc says Unicode.
<br>
</font>
</p>
</div> </div>
<br class="clearboth" /> <br class="clearboth" />