the need to decompose characters for the majority of cases. This considerably
speeds up collation while increasing the table size minimally.
To detect non-normalized strings, rather than relying on exp/norm, the table
now includes CCC information. The inclusion of this information does not
increase table size.
DETAILS
- Raw collation elements are now a struct that includes the CCC, rather
than a slice of ints.
- Builder now ensures that NFD and NFC counterparts are included in the table.
This also fixes a bug for Korean which is responsible for most of the growth
of the table size.
- As there is no more normalization step, code should now handle both strings
and byte slices as input. Introduced source type to facilitate this.
NOTES
- This change does not handle normalization correctly entirely for contractions.
This causes a few failures with the regtest. table_test.go contains a few
uncommented tests that can be enabled once this is fixed. The easiest is to
fix this once we have the new norm.Iter.
- Removed a test cases in table_test that covers cases that are now guaranteed
to not exist.
R=rsc, mpvl
CC=golang-dev
https://golang.org/cl/6971044
compare incrementally. Also modified collation API to be more high-level
by removing the need for an explicit buffer to be passed as an argument.
This considerably speeds up Compare and CompareString. This change also eliminates
the need to reinitialize the normalization buffer for each use of an iter. This
also significantly improves performance for Key and KeyString.
R=r, rsc
CC=golang-dev
https://golang.org/cl/6842050
Tailorings are represented by removing and reinserting entries from a linked list.
After all tailorings are done, missing weights are computed and verified.
This implementation assumes that entries that are used in expansions are not
reinserted at a later point. This considerably simplifies the implementation.
R=r
CC=golang-dev
https://golang.org/cl/6739052
incremental comparisons. Instead, processing is now done directly on colElems.
As a result, the size of the weights array is now reduced by 75%.
Details:
- Primary value of type 1 colElem is shifted by 1 bit so that primaries
of all types can be compared without shifting.
- Quaternary values are now stored in the colElem itself. This is possible
as quaternary values other than 0 or maxQuaternary are only needed when other
values are ignored.
- Simplified processWeights by removing cases that are needed for ICU but not
for us (our CJK primary values fit in a single value).
R=r
CC=golang-dev
https://golang.org/cl/6817054
- Allow secondary values below the default value in second form. This is
to support before tags for secondary values, as used by Chinese.
- Eliminate collation elements that are guaranteed to be immaterial
after a weight increment.
R=r
CC=golang-dev
https://golang.org/cl/6739051
various implementation of collation. The tool provides commands for soring,
regressing one implementation against another, and benchmarking.
Currently it includes collation implementations for the Go collator, ICU,
and one using Darwin's CoreFoundation framework.
To avoid building this tool in the default build, the colcmp tag has been
added to all files. This allows other tools/colcmp in this directory (e.g. it may make
sense to move maketables here) to be put in this directory as well.
R=r, rsc, mpvl
CC=golang-dev
https://golang.org/cl/6496118
instead of variables. Several reasons:
- Encourage users of the API to minimize the number of creations and reuse Collate objects.
- Don't rule out the possibility of using initialization code for collators. For some locales
it will be possible to have very compact representations that can be quickly expanded
into a proper table on demand.
Other changes:
- Change name of root* vars to main*, as the tables are shared between locales.
- Added Locales() method to get a list of supported locales.
R=r
CC=golang-dev
https://golang.org/cl/6498107
Refactored build + buildTrie into build + buildOrdering.
Note that since the tailoring code is not checked in yet, all tailorings are identical
to root. The table therefore should not and does not grow at this point.
R=r
CC=golang-dev
https://golang.org/cl/6500087
for both locale-specific exemplar characters and tailorings to
the collation table.
Some specifices:
- Moved stringSet to the beginning of the file and added some functionality
to parse command line files.
- openReader now modifies the input URL for localFiles to guarantee that
any http source listed in the generated file is indeed this source.
- Note that the implementation of the Tailoring API used by maketables.go
is not yet checked in. So for now adding tailorings are simply no-ops.
- The generated file of exemplar characters will be used somewhere else.
Here is a snippet of how the body of the generated file looks like:
type exemplarType int
const (
exCharacters exemplarType = iota
exContractions
exPunctuation
exAuxiliary
exCurrency
exIndex
exN
)
var exemplarCharacters = map[string][exN]string{
"af": [exN]string{
0: "a á â b c d e é è ê ë f g h i î ï j k l m n o ô ö p q r s t u û v w x y z",
3: "á à â ä ã æ ç é è ê ë í ì î ï ó ò ô ö ú ù û ü ý",
4: "a b c d e f g h i j k l m n o p q r s t u v w x y z",
},
...
}
R=r
CC=golang-dev
https://golang.org/cl/6501070
- Elements in the array are now sorted as a linked list. This makes it easier to
apply tailorings.
- Added code to sort entries by collation elements.
- Added logical reset points. This is used for tailoring relative to certain
properties, rather than characters.
NOTE: all code for type entry should now be in order.go. To keep the diffs for
this CL reasonable, though, the existing code is left in builder.go. I'll move
this in a separate CL.
R=r
CC=golang-dev
https://golang.org/cl/6493063
In the regtest data, surrogates are assigned primary weights based on
the surrogate code point value. Go now converts surrogates to FFFD, however,
meaning that the primary weight is based on this code point instead.
This change drops tests with surrogates and lets the tests pass.
R=r
CC=golang-dev
https://golang.org/cl/6461100
with table changes.
NOTE: there is no test for this, but 1) the code has now the same
control flow as scan in exp/locale/collate/contract.go, which is
tested and 2) Builder verifies the generated table so bugs in this
code are quickly and easily found (which is how this bug was discovered).
R=r
CC=golang-dev
https://golang.org/cl/6461082
The main table will need to get a slightly different collation table as the one
used by regtest, as the regtest is based on the standard UCA DUCET, while
the locale-specific tables are all based on a CLDR root table.
This change allows changing the table without affecting the regression test.
R=r
CC=golang-dev
https://golang.org/cl/6453089
for dealing with CLDR files:
- Add now taxes a list of indexes of colelems that are variables. Checking and
handling is now done by the Builder. VariableTop is now also properly generated
using the Build method.
- Introduced separate Builder, called Tailoring, for creating tailorings of root
table. This clearly separates the functionality for building a table based on
weights (the allkeys* files) versus tables based on LDML XML files.
- Tailorings are now added by two calls instead of one: SetAnchor and Insert.
This more closely reflects the structure of LDML side and simplifies the
implementation of both the client and library side. It also preserves
some information that is otherwise hard to recover for the Builder.
- Allow the LDML XML element extend to be passed to Insert. This simplifies
both client and library implementation.
R=r
CC=golang-dev
https://golang.org/cl/6454061
- Allow handles into the trie for different locales. Multiple tables share the same
try to allow for reuse of blocks.
- Significantly improved memory footprint and reduced allocations of trieNodes.
This speeds up generation by about 30% and allows keeping trieNodes around
for multiple locales during generation.
- Renamed print method to fprint.
R=r
CC=golang-dev
https://golang.org/cl/6408052
- Changed the representation of colElem to support a few cases
for some languages not supported by the current format.
- Changed offsets for implicit primary values. This makes the
values both easier to read and debug (last 4 nibbles are identical to
implicit primary value) and also results in better packing.
- Fixed bug in weight conversion code that did not pop up yet by
sheer luck.
Note that tables.go also includes changes to the contraction trie
from CL 6346092.
R=r, mpvl
CC=golang-dev
https://golang.org/cl/6392060
which has a rather large contraction table. The value of the next state
offset now starts after the current block, instead of before. This is
slightly less efficient (on extra addition per state change), but gives
some extra range for the offsets.
Also introduced constants for final (0) and noIndex (0xFF).
tables.go is updated in a separate CL.
R=r
CC=golang-dev
https://golang.org/cl/6346092
There's no need for the 16-bit arithmetic here,
and it tickles a long-standing compiler bug.
Fix the exp code not to use 16-bit math and
create an explicit test for the compiler bug.
R=golang-dev, r
CC=golang-dev
https://golang.org/cl/6256048
key and simple comparisson. Search is not yet implemented in this CL.
Changed some of the types of table_test.go to allow reuse in the new test.
Also reduced number of primary values for illegal runes to 1 (both map to
the same).
R=r
CC=golang-dev
https://golang.org/cl/6202062
Also set maxContractLen automatically.
Note that the table size is much bigger than it needs to be.
Optimization is best done, though, when the language specific
tables are added.
R=r
CC=golang-dev
https://golang.org/cl/6167044
dictates a CJK rune is only part of a certain specified range if it
is explicitly defined in the Unicode Codepoint Database.
Fixed the code and some of the tests accordingly.
R=r
CC=golang-dev
https://golang.org/cl/6160044
The first bug was that tertiary ignorables had the same colElem as
implicit colElems, yielding unexpected results. The current encoding
ensures that a non-implicit colElem is never 0. This fix uncovered
another bug of the trie that indexed incorrectly into the null block.
This was caused by an unfinished optimization that would avoid the
need to max out the most-significant bits of continuation bytes.
This bug was also present in the trie used in exp/norm and has been
fixed there as well. The appearence of the bug was rare, as the lower
blocks happened to be nearly nil.
R=r
CC=golang-dev
https://golang.org/cl/6127070
context for change lists of lower-level types. The public APIs are defined
in builder.go and collate.go. Type table is the glue between the lower and
higher level code and might be a good starting point for understanding the
collation code.
R=r, r
CC=golang-dev
https://golang.org/cl/5999053
The trie code looks a lot like the trie in exp/norm. It uses different
types, however. Also, there is only a lookup for []byte and the unsafe
lookup methods have been dropped, as well as sparse mode.
There is now a method for generating a trie. To output Go code, one now needs
to first generate a trie and then call print() on it.
R=r, r, mpvl
CC=golang-dev
https://golang.org/cl/5966064