Refactored build + buildTrie into build + buildOrdering.
Note that since the tailoring code is not checked in yet, all tailorings are identical
to root. The table therefore should not and does not grow at this point.
R=r
CC=golang-dev
https://golang.org/cl/6500087
for both locale-specific exemplar characters and tailorings to
the collation table.
Some specifices:
- Moved stringSet to the beginning of the file and added some functionality
to parse command line files.
- openReader now modifies the input URL for localFiles to guarantee that
any http source listed in the generated file is indeed this source.
- Note that the implementation of the Tailoring API used by maketables.go
is not yet checked in. So for now adding tailorings are simply no-ops.
- The generated file of exemplar characters will be used somewhere else.
Here is a snippet of how the body of the generated file looks like:
type exemplarType int
const (
exCharacters exemplarType = iota
exContractions
exPunctuation
exAuxiliary
exCurrency
exIndex
exN
)
var exemplarCharacters = map[string][exN]string{
"af": [exN]string{
0: "a á â b c d e é è ê ë f g h i î ï j k l m n o ô ö p q r s t u û v w x y z",
3: "á à â ä ã æ ç é è ê ë í ì î ï ó ò ô ö ú ù û ü ý",
4: "a b c d e f g h i j k l m n o p q r s t u v w x y z",
},
...
}
R=r
CC=golang-dev
https://golang.org/cl/6501070
- Elements in the array are now sorted as a linked list. This makes it easier to
apply tailorings.
- Added code to sort entries by collation elements.
- Added logical reset points. This is used for tailoring relative to certain
properties, rather than characters.
NOTE: all code for type entry should now be in order.go. To keep the diffs for
this CL reasonable, though, the existing code is left in builder.go. I'll move
this in a separate CL.
R=r
CC=golang-dev
https://golang.org/cl/6493063
Also rename Node.{Add,Remove} to Node.{AppendChild,RemoveChild} to
be consistent with the DOM.
benchmark old ns/op new ns/op delta
BenchmarkParser 4042040 3749618 -7.23%
benchmark old MB/s new MB/s speedup
BenchmarkParser 19.34 20.85 1.08x
BenchmarkParser mallocs per iteration is also:
10495 before / 7992 after
R=andybalholm, r, adg
CC=golang-dev
https://golang.org/cl/6495061
In the regtest data, surrogates are assigned primary weights based on
the surrogate code point value. Go now converts surrogates to FFFD, however,
meaning that the primary weight is based on this code point instead.
This change drops tests with surrogates and lets the tests pass.
R=r
CC=golang-dev
https://golang.org/cl/6461100
with table changes.
NOTE: there is no test for this, but 1) the code has now the same
control flow as scan in exp/locale/collate/contract.go, which is
tested and 2) Builder verifies the generated table so bugs in this
code are quickly and easily found (which is how this bug was discovered).
R=r
CC=golang-dev
https://golang.org/cl/6461082
The main table will need to get a slightly different collation table as the one
used by regtest, as the regtest is based on the standard UCA DUCET, while
the locale-specific tables are all based on a CLDR root table.
This change allows changing the table without affecting the regression test.
R=r
CC=golang-dev
https://golang.org/cl/6453089
Now that the parser passes all tests in the test suite,
it is no longer necessary to keep track of which tests
pass and which don't. So remove the testlogs directory
and the code that uses it.
R=nigeltao
CC=golang-dev
https://golang.org/cl/6453124
All of the remaining tests that had as status of PARSE rather than PASS had
good reasons for not passing the render-and-reparse step: the correct parse tree is
badly formed, so when it is rendered out as HTML, the result doesn't parse into the
same tree. So add them to the list of tests where that step is skipped.
Also, I discovered that it is possible to end up with HTML elements (not just text)
inside a raw text element through reparenting. So change the rendering routines to
handle that situation as sensibly as possible (which still isn't very sensible, but
this is HTML5).
R=nigeltao
CC=golang-dev
https://golang.org/cl/6446137
When generating replacement elements for an <isindex> tag, the old
addSyntheticElement method was producing the wrong nesting. Replace
it with parseImpliedToken.
Pass the one remaining test in the test suite.
R=nigeltao
CC=golang-dev
https://golang.org/cl/6453114
If a tag doesn't have a closing '>', it isn't considered a tag;
it is just ignored and EOF is returned instead.
Pass one additional test in the test suite.
Change tokenizer tests to match correct behavior.
R=nigeltao
CC=golang-dev
https://golang.org/cl/6454131
In HTML content, having a self-closing tag is a parse error unless
the tag would be self-closing anyway (like <img>). The only place a
self-closing tag actually makes a difference is in XML-based foreign
content.
Pass 1 additional test.
R=nigeltao
CC=golang-dev
https://golang.org/cl/6450109
Also simplified parsing of interface
types since they can only contain
methods (and no embedded interfaces)
in the export data.
R=rsc
CC=golang-dev
https://golang.org/cl/6446084
If a table contained whitespace, text nodes would not get foster parented
correctly.
Pass 1 additional test.
R=nigeltao
CC=golang-dev
https://golang.org/cl/6459054
The <title> element was getting removed from the stack of open elements,
when its parent, the <head> element should have been removed instead.
Pass 2 additional tests.
R=nigeltao
CC=golang-dev
https://golang.org/cl/6449101
When an element (like <nobr> or <p>) was implicitly closed by another
start tag, it would keep foster parenting from working because
the check for what was on top of the stack of open elements was
in the wrong place.
Move the check to addChild.
Pass 2 additional tests.
R=nigeltao
CC=golang-dev
https://golang.org/cl/6460045
Since NUL usually terminates strings in underlying syscalls, allowing
it when converting string arguments is a security risk, especially
when dealing with filenames. For example, a program might reason that
filename like "/root/..\x00/" is a subdirectory or "/root/" and allow
access to it, while underlying syscall will treat "\x00" as an end of
that string and the actual filename will be "/root/..", which might
be unexpected. Returning EINVAL when string arguments have NUL in
them makes sure this attack vector is unusable.
R=golang-dev, r, bradfitz, fullung, rsc, minux.ma
CC=golang-dev
https://golang.org/cl/6458050
The content of an HTML <title> element is RCDATA, but the content of an SVG
<title> element is parsed as tags. Now the parser doesn't go into RCDATA
mode in foreign content.
Pass 4 additional tests.
R=nigeltao
CC=golang-dev
https://golang.org/cl/6448111
for dealing with CLDR files:
- Add now taxes a list of indexes of colelems that are variables. Checking and
handling is now done by the Builder. VariableTop is now also properly generated
using the Build method.
- Introduced separate Builder, called Tailoring, for creating tailorings of root
table. This clearly separates the functionality for building a table based on
weights (the allkeys* files) versus tables based on LDML XML files.
- Tailorings are now added by two calls instead of one: SetAnchor and Insert.
This more closely reflects the structure of LDML side and simplifies the
implementation of both the client and library side. It also preserves
some information that is otherwise hard to recover for the Builder.
- Allow the LDML XML element extend to be passed to Insert. This simplifies
both client and library implementation.
R=r
CC=golang-dev
https://golang.org/cl/6454061
Process a package's object in a reproducible
order (rather then in map order) so that we
get error messages in reproducible order.
R=r
CC=golang-dev
https://golang.org/cl/6449076
The text inside <script> tags is not ordinary raw text; there are all sorts
of other complications. This CL implements those complications.
Pass 76 additional tests.
R=nigeltao
CC=golang-dev
https://golang.org/cl/6443070
This is more in sync with the rest of the package;
for instance, we have functions (not methods) to
deref or find the underlying type of a Type.
In the process use a single bytes.Buffer to create
the string representation for a type rather than
the (occasional) string concatenation.
R=r
CC=golang-dev
https://golang.org/cl/6458057
If an end tag has an attribute that is a quoted string containing '>',
the tokenizer would end the tag prematurely. Now it reads the attributes
on end tags just as it does on start tags, but the high-level interface
still doesn't return them, because their presence is a parse error.
Pass 1 additional test.
R=nigeltao
CC=golang-dev
https://golang.org/cl/6457060
- Allow handles into the trie for different locales. Multiple tables share the same
try to allow for reuse of blocks.
- Significantly improved memory footprint and reduced allocations of trieNodes.
This speeds up generation by about 30% and allows keeping trieNodes around
for multiple locales during generation.
- Renamed print method to fprint.
R=r
CC=golang-dev
https://golang.org/cl/6408052
If NUL bytes occur inside certain elements, convert them to U+FFFD
replacement character.
Pass 1 additional test.
R=nigeltao
CC=golang-dev
https://golang.org/cl/6452047
Fixed creation of Func's, taking IsVariadic from parameter list rather
than results.
Updated checkObj to process ast.Fun objects.
R=gri
CC=golang-dev
https://golang.org/cl/6402046
If the body of an HTML document contains text, the <frameset> tag is
ignored. But not if the text is only whitespace.
Pass 4 additional tests.
R=nigeltao
CC=golang-dev
https://golang.org/cl/6442043
Don't unescape entities in attributes when they don't end with
a semicolon and they are followed by '=', a letter, or a digit.
Pass 6 more tests from the WebKit test suite, plus one that was
commented out in token_test.go.
R=nigeltao
CC=golang-dev
https://golang.org/cl/6405073
- Changed the representation of colElem to support a few cases
for some languages not supported by the current format.
- Changed offsets for implicit primary values. This makes the
values both easier to read and debug (last 4 nibbles are identical to
implicit primary value) and also results in better packing.
- Fixed bug in weight conversion code that did not pop up yet by
sheer luck.
Note that tables.go also includes changes to the contraction trie
from CL 6346092.
R=r, mpvl
CC=golang-dev
https://golang.org/cl/6392060
which has a rather large contraction table. The value of the next state
offset now starts after the current block, instead of before. This is
slightly less efficient (on extra addition per state change), but gives
some extra range for the offsets.
Also introduced constants for final (0) and noIndex (0xFF).
tables.go is updated in a separate CL.
R=r
CC=golang-dev
https://golang.org/cl/6346092
This is part 1 of a 2 part changelist. Part 2 contains the mechanical
change to parse.go to compare atoms (ints) instead of strings.
The overall effect of the two changes are:
benchmark old ns/op new ns/op delta
BenchmarkParser 4462274 4058254 -9.05%
BenchmarkRawLevelTokenizer 913202 912917 -0.03%
BenchmarkLowLevelTokenizer 1268626 1267836 -0.06%
BenchmarkHighLevelTokenizer 1947305 1968944 +1.11%
R=rsc
CC=andybalholm, golang-dev, r
https://golang.org/cl/6305053
Use perfect cuckoo hash, to avoid binary search.
Define Atom bits as offset+len in long string instead
of enumeration, to avoid string headers.
Before: 1909 string bytes + 6060 tables = 7969 total data
After: 1406 string bytes + 2048 tables = 3454 total data
benchmark old ns/op new ns/op delta
BenchmarkLookup 83878 64681 -22.89%
R=nigeltao, r
CC=golang-dev
https://golang.org/cl/6262051
Implement the (3-per-family) Noah's Ark clause (i.e. don't put
more than three identical elements on the list of active formatting
elements.
Also, when running tests, sort attributes by name before dumping
them.
Pass 4 additional tests with Noah's Ark clause (including one
that needs attributes to be sorted).
Pass 5 additional, unrelated tests because of sorting attributes.
R=nigeltao, rsc
CC=golang-dev
https://golang.org/cl/6247056
Remove redundant checks for integration points.
Ignore null bytes in text.
Don't break out of foreign content for a <font> tag unless it
has a color, face, or size attribute.
Check for MathML text integration points when breaking out of
foreign content.
Pass two new tests.
R=nigeltao
CC=golang-dev
https://golang.org/cl/6256045
There's no need for the 16-bit arithmetic here,
and it tickles a long-standing compiler bug.
Fix the exp code not to use 16-bit math and
create an explicit test for the compiler bug.
R=golang-dev, r
CC=golang-dev
https://golang.org/cl/6256048
Detect HTML integration points and MathML text integration points.
At these points, process tokens as HTML, not as foreign content.
Pass 33 more tests.
R=nigeltao
CC=golang-dev
https://golang.org/cl/6249044
Import updated test data from the WebKit Subversion repository (SVN revision 118111).
Some of the old tests were failing because we were HTML5 compliant, but the tests weren't.
R=nigeltao
CC=golang-dev
https://golang.org/cl/6228049
Handle text, comment, and doctype tokens in afterBodyIM, afterAfterBodyIM,
and afterAfterFramesetIM.
Pass three more tests.
R=nigeltao
CC=golang-dev
https://golang.org/cl/6231043
Clean up flow of control.
Ignore </table>, </tbody>, </tfoot>, </thead>, </tr> if there is not
an appropriate element in table scope.
Pass 3 more tests.
R=golang-dev, nigeltao
CC=golang-dev
https://golang.org/cl/6206093
Delete cases that just fall down to "anything else" action.
Handle </tbody>, </tfoot>, and </thead>.
R=golang-dev, nigeltao
CC=golang-dev
https://golang.org/cl/6203061
key and simple comparisson. Search is not yet implemented in this CL.
Changed some of the types of table_test.go to allow reuse in the new test.
Also reduced number of primary values for illegal runes to 1 (both map to
the same).
R=r
CC=golang-dev
https://golang.org/cl/6202062
Also set maxContractLen automatically.
Note that the table size is much bigger than it needs to be.
Optimization is best done, though, when the language specific
tables are added.
R=r
CC=golang-dev
https://golang.org/cl/6167044
dictates a CJK rune is only part of a certain specified range if it
is explicitly defined in the Unicode Codepoint Database.
Fixed the code and some of the tests accordingly.
R=r
CC=golang-dev
https://golang.org/cl/6160044
The first bug was that tertiary ignorables had the same colElem as
implicit colElems, yielding unexpected results. The current encoding
ensures that a non-implicit colElem is never 0. This fix uncovered
another bug of the trie that indexed incorrectly into the null block.
This was caused by an unfinished optimization that would avoid the
need to max out the most-significant bits of continuation bytes.
This bug was also present in the trie used in exp/norm and has been
fixed there as well. The appearence of the bug was rare, as the lower
blocks happened to be nearly nil.
R=r
CC=golang-dev
https://golang.org/cl/6127070
context for change lists of lower-level types. The public APIs are defined
in builder.go and collate.go. Type table is the glue between the lower and
higher level code and might be a good starting point for understanding the
collation code.
R=r, r
CC=golang-dev
https://golang.org/cl/5999053
The trie code looks a lot like the trie in exp/norm. It uses different
types, however. Also, there is only a lookup for []byte and the unsafe
lookup methods have been dropped, as well as sparse mode.
There is now a method for generating a trie. To output Go code, one now needs
to first generate a trie and then call print() on it.
R=r, r, mpvl
CC=golang-dev
https://golang.org/cl/5966064
Don't foster-parent text nodes that consist only of whitespace.
(I implemented this entirely in inTableIM instead of creating an
inTableTextIM, because the sole purpose of inTableTextIM seems to be
to combine character tokens into a string, which our tokenizer does
already.)
Use parseImpliedToken to clarify a couple of cases.
Handle <style>, <script>, <input>, and <form>.
Ignore doctype tokens.
Pass 20 additional tests.
R=nigeltao
CC=golang-dev
https://golang.org/cl/6117048
This CL corrects the remaining differences that I could find between the
implementation of inBodyIM and the spec:
Handle <rp> and <rt>.
Adjust SVG and MathML attributes.
Reconstruct active formatting elements in the "any other start tag" case.
Pass 7 additional tests.
R=nigeltao
CC=golang-dev
https://golang.org/cl/6101055
Clean up the flow of control.
Fix the TODO for handling <html> tags.
Add a case to ignore doctype declarations.
Pass one additional test.
R=nigeltao
CC=golang-dev
https://golang.org/cl/6072047
This method will allow us to be explicit about what we're doing when
we insert an implied token, and avoid repeating the logic involved in
multiple places.
R=nigeltao
CC=golang-dev
https://golang.org/cl/6060048
Add a case to ignore doctype tokens.
Clean up the flow of control to more clearly match the spec.
Pass one more test.
R=nigeltao
CC=golang-dev
https://golang.org/cl/6062047