mirror of https://github.com/golang/go synced 2024-10-05 16:01:22 -06:00

44 Commits

Author SHA1 Message Date
Marcel van Lohuizen
9aa70984a9 exp/locale/collate: include composed characters into the table. This eliminates
the need to decompose characters for the majority of cases.  This considerably
speeds up collation while increasing the table size minimally.

To detect non-normalized strings, rather than relying on exp/norm, the table
now includes CCC information. The inclusion of this information does not
increase table size.

DETAILS
 - Raw collation elements are now a struct that includes the CCC, rather
   than a slice of ints.
 - Builder now ensures that NFD and NFC counterparts are included in the table.
   This also fixes a bug for Korean which is responsible for most of the growth
   of the table size.
 - As there is no more normalization step, code should now handle both strings
   and byte slices as input. Introduced source type to facilitate this.

NOTES
 - This change does not yet handle normalization entirely correctly for contractions.
   This causes a few failures with the regtest. table_test.go contains a few
   commented-out tests that can be enabled once this is fixed.  The easiest is to
   fix this once we have the new norm.Iter.
 - Removed test cases in table_test that cover cases that are now guaranteed
   to not exist.

R=rsc, mpvl
CC=golang-dev
https://golang.org/cl/6971044
2012-12-24 16:42:29 +01:00
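
The source type introduced by the commit above is not shown in this log. As an illustration only, a type that lets one code path read from either a string or a byte slice might look like the sketch below. The names source, fromString, fromBytes, next, and runes are hypothetical and are not taken from the package.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// source is a hypothetical wrapper that holds either a string or a byte
// slice, so that callers need only one code path for both kinds of input.
type source struct {
	str   string
	bytes []byte
	isStr bool
}

func fromString(s string) source { return source{str: s, isStr: true} }
func fromBytes(b []byte) source  { return source{bytes: b} }

// next decodes the rune at byte offset p and returns it with its width.
// A width of 0 means the end of the input was reached.
func (s source) next(p int) (rune, int) {
	if s.isStr {
		if p >= len(s.str) {
			return 0, 0
		}
		return utf8.DecodeRuneInString(s.str[p:])
	}
	if p >= len(s.bytes) {
		return 0, 0
	}
	return utf8.DecodeRune(s.bytes[p:])
}

// runes collects all runes of the source; it exists only for demonstration.
func runes(s source) []rune {
	var out []rune
	for p := 0; ; {
		r, n := s.next(p)
		if n == 0 {
			return out
		}
		out = append(out, r)
		p += n
	}
}

func main() {
	fmt.Println(string(runes(fromString("héllo"))))
	fmt.Println(string(runes(fromBytes([]byte("héllo")))))
}
```

Both calls walk the same next method, which is the point of such a wrapper: no copy from string to []byte (or back) is needed before processing.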
Shenghou Ma
d1ef9b56fb all: fix typos
caught by https://github.com/lyda/misspell-check.

R=golang-dev, gri
CC=golang-dev
https://golang.org/cl/6949072
2012-12-19 03:04:09 +08:00
Mikio Hara
78cee46f3a src: gofmt -w -s
R=golang-dev, dsymonds
CC=golang-dev
https://golang.org/cl/6935059
2012-12-15 14:19:51 +09:00
Shenghou Ma
42c8904fe1 all: fix the the typos
Fixes #4420.

R=golang-dev, rsc, remyoudompheng
CC=golang-dev
https://golang.org/cl/6854080
2012-11-22 02:58:24 +08:00
Marcel van Lohuizen
8b7ea6489c exp/locale/collate: changed implementation of Compare and CompareString to
compare incrementally. Also modified collation API to be more high-level
by removing the need for an explicit buffer to be passed as an argument.
This considerably speeds up Compare and CompareString.  This change also eliminates
the need to reinitialize the normalization buffer for each use of an iter. This
also significantly improves performance for Key and KeyString.

R=r, rsc
CC=golang-dev
https://golang.org/cl/6842050
2012-11-15 22:23:56 +01:00
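
To illustrate what comparing incrementally means in the commit above, the sketch below walks two strings in lockstep and returns at the first differing weight instead of first building two complete sort keys. It is a toy under stated assumptions: the primary function is a made-up stand-in for a collation table lookup, only one comparison level is handled, and none of the names come from the package.

```go
package main

import (
	"fmt"
	"unicode"
	"unicode/utf8"
)

// primary is a toy stand-in for a collation table lookup: it maps a rune to
// a weight that ignores case. Real collation tables are far richer.
func primary(r rune) int { return int(unicode.ToLower(r)) }

// compareIncremental compares a and b one element at a time and returns as
// soon as a difference is found, without materializing full sort keys.
func compareIncremental(a, b string) int {
	for len(a) > 0 && len(b) > 0 {
		ra, wa := utf8.DecodeRuneInString(a)
		rb, wb := utf8.DecodeRuneInString(b)
		if pa, pb := primary(ra), primary(rb); pa != pb {
			if pa < pb {
				return -1
			}
			return 1
		}
		a, b = a[wa:], b[wb:]
	}
	switch {
	case len(a) > 0:
		return 1 // b is a proper prefix of a
	case len(b) > 0:
		return -1 // a is a proper prefix of b
	}
	return 0
}

func main() {
	fmt.Println(compareIncremental("apple", "Apricot")) // -1
	fmt.Println(compareIncremental("abc", "ABC"))       // 0
	fmt.Println(compareIncremental("abc", "ab"))        // 1
}
```

For inputs that differ early, this style of comparison touches only a short prefix of each string, which is one plausible source of the speedup the commit message reports.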
Marcel van Lohuizen
e14cf90a8b unicode: move unicode and related packages to Unicode 6.2.0.
R=r, mpvl
CC=golang-dev
https://golang.org/cl/6818067
2012-10-31 17:32:16 +01:00
Marcel van Lohuizen
b8b329451c exp/locale/collate: implementation of tailorings and table generation.
Tailorings are represented by removing and reinserting entries from a linked list.
After all tailorings are done, missing weights are computed and verified.
This implementation assumes that entries that are used in expansions are not
reinserted at a later point.  This considerably simplifies the implementation.

R=r
CC=golang-dev
https://golang.org/cl/6739052
2012-10-31 14:28:44 +01:00
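
The commit above describes tailorings as removing entries from a linked list and reinserting them elsewhere. A minimal sketch of that idea follows, assuming a doubly linked list with a sentinel head. The entry type, its methods, and the build and order helpers are hypothetical and greatly simplified; the computation and verification of weights mentioned in the commit is not modelled.

```go
package main

import "fmt"

// entry is a hypothetical node in an ordered list of collation entries.
type entry struct {
	str        string
	prev, next *entry
}

// remove unlinks e from its list.
func (e *entry) remove() {
	if e.prev != nil {
		e.prev.next = e.next
	}
	if e.next != nil {
		e.next.prev = e.prev
	}
	e.prev, e.next = nil, nil
}

// insertAfter links n directly after e.
func (e *entry) insertAfter(n *entry) {
	n.prev, n.next = e, e.next
	if e.next != nil {
		e.next.prev = n
	}
	e.next = n
}

// build creates a list from strs and returns its sentinel head together
// with an index from string to entry.
func build(strs ...string) (*entry, map[string]*entry) {
	head := &entry{} // sentinel, never removed
	index := map[string]*entry{}
	last := head
	for _, s := range strs {
		e := &entry{str: s}
		last.insertAfter(e)
		index[s] = e
		last = e
	}
	return head, index
}

// order returns the strings in list order, skipping the sentinel.
func order(head *entry) []string {
	var out []string
	for e := head.next; e != nil; e = e.next {
		out = append(out, e.str)
	}
	return out
}

func main() {
	head, index := build("a", "b", "c", "z")
	// Tailor "z" to sort directly after "a": remove it, then reinsert it.
	z := index["z"]
	z.remove()
	index["a"].insertAfter(z)
	fmt.Println(order(head)) // [a z b c]
}
```

With this representation a tailoring is a constant-time pointer update, and weights can be assigned in a single pass over the list once all tailorings have been applied.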
Marcel van Lohuizen
4c1a6f84f8 exp/locale/collate: removed weights struct to allow for faster and easier
incremental comparisons. Instead, processing is now done directly on colElems.
As a result, the size of the weights array is now reduced by 75%.
Details:
- Primary value of type 1 colElem is shifted by 1 bit so that primaries
  of all types can be compared without shifting.
- Quaternary values are now stored in the colElem itself. This is possible
  as quaternary values other than 0 or maxQuaternary are only needed when other
  values are ignored.
- Simplified processWeights by removing cases that are needed for ICU but not
  for us (our CJK primary values fit in a single value).

R=r
CC=golang-dev
https://golang.org/cl/6817054
2012-10-31 14:28:18 +01:00
Marcel van Lohuizen
bc0783dbe5 exp/locale/collate: add context to entry.
R=r
CC=golang-dev
https://golang.org/cl/6727049
2012-10-31 14:02:43 +01:00
Robert Griesemer
465b9c35e5 gofmt: apply gofmt -w src misc
Remove trailing whitespace in comments.
No other changes.

R=r
CC=golang-dev
https://golang.org/cl/6815053
2012-10-30 13:38:01 -07:00
Marcel van Lohuizen
b575e3ca99 exp/locale/collate: slightly changed collation elements:
- Allow secondary values below the default value in second form. This is
  to support before tags for secondary values, as used by Chinese.
- Eliminate collation elements that are guaranteed to be immaterial
  after a weight increment.

R=r
CC=golang-dev
https://golang.org/cl/6739051
2012-10-25 13:02:31 +02:00
Marcel van Lohuizen
6653d76ef6 exp/locale/collate/build: fixed problem where blocks for first byte need
different indexes for values and index blocks. Fixes many regressions.

R=r
CC=golang-dev
https://golang.org/cl/6737057
2012-10-24 11:41:05 +02:00
Marcel van Lohuizen
34f2050626 exp/locale/collate: clarification in comments on use of returned value.
R=r
CC=golang-dev
https://golang.org/cl/6752043
2012-10-24 11:40:32 +02:00
Marcel van Lohuizen
a35f23f34e exp/locale/collate/tools/colcmp: add locale to output of regression failure.
R=r
CC=golang-dev
https://golang.org/cl/6749058
2012-10-24 11:28:18 +02:00
Robert Griesemer
7f710c2de9 exp/locale/collate: use gofmt -w -s (rather than just gofmt -w)
Also: apply gofmt -w -s to existing tables.

R=mpvl, minux.ma, rsc
CC=golang-dev
https://golang.org/cl/6611051
2012-10-07 17:59:33 -07:00
Marcel van Lohuizen
5e47b77990 exp/locale/collate/tools/colcmp: implementation of colcmp tool used for comparing
various implementations of collation.  The tool provides commands for sorting,
regressing one implementation against another, and benchmarking.
Currently it includes collation implementations for the Go collator, ICU,
and one using Darwin's CoreFoundation framework.
To avoid building this tool in the default build, the colcmp tag has been
added to all files. This allows other tools (e.g. it may make
sense to move maketables here) to be put in this directory as well.

R=r, rsc, mpvl
CC=golang-dev
https://golang.org/cl/6496118
2012-09-24 13:22:03 +09:00
Marcel van Lohuizen
a4d08ed5df exp/locale/collate: changed API to allow access to different locales through New(),
instead of variables. Several reasons:
- Encourage users of the API to minimize the number of creations and reuse Collate objects.
- Don't rule out the possibility of using initialization code for collators. For some locales
  it will be possible to have very compact representations that can be quickly expanded
  into a proper table on demand.
Other changes:
- Change name of root* vars to main*, as the tables are shared between locales.
- Added Locales() method to get a list of supported locales.

R=r
CC=golang-dev
https://golang.org/cl/6498107
2012-09-14 19:10:02 +09:00
Marcel van Lohuizen
ef48dfa310 exp/locale/collate: added indices to builder for reusing blocks between locales.
Refactored build + buildTrie into build + buildOrdering.
Note that since the tailoring code is not checked in yet, all tailorings are identical
to root.  The table therefore should not and does not grow at this point.

R=r
CC=golang-dev
https://golang.org/cl/6500087
2012-09-08 10:46:55 +09:00
Marcel van Lohuizen
21d94a22fe exp/locale/collate: switch from DUCET to CLDR for the default root table.
R=r
CC=golang-dev
https://golang.org/cl/6499079
2012-09-08 10:38:11 +09:00
Marcel van Lohuizen
f0a31b5fc2 exp/locale/collate/build: moved some of the code to the appropriate file, as
promised in CL 13985.

R=r
CC=golang-dev
https://golang.org/cl/6503071
2012-09-06 13:16:02 +09:00
Marcel van Lohuizen
5a78e5ea4c exp/locale/collate: Added functionality to parse and process LDML files
for both locale-specific exemplar characters and tailorings to
the collation table.
Some specifics:
- Moved stringSet to the beginning of the file and added some functionality
  to parse command line files.
- openReader now modifies the input URL for localFiles to guarantee that
  any http source listed in the generated file is indeed this source.
- Note that the implementation of the Tailoring API used by maketables.go
  is not yet checked in. So for now, adding tailorings is simply a no-op.
- The generated file of exemplar characters will be used somewhere else.
  Here is a snippet of what the body of the generated file looks like:

type exemplarType int
const (
        exCharacters exemplarType = iota
        exContractions
        exPunctuation
        exAuxiliary
        exCurrency
        exIndex
        exN
)

var exemplarCharacters = map[string][exN]string{
        "af": [exN]string{
                0: "a á â b c d e é è ê ë f g h i î ï j k l m n o ô ö p q r s t u û v w x y z",
                3: "á à â ä ã æ ç é è ê ë í ì î ï ó ò ô ö ú ù û ü ý",
                4: "a b c d e f g h i j k l m n o p q r s t u v w x y z",
        },
        ...
}

R=r
CC=golang-dev
https://golang.org/cl/6501070
2012-09-01 14:15:00 +02:00
Marcel van Lohuizen
18aa55c169 exp/locale/collate: first changes that introduce implementation of tailorings:
- Elements in the array are now sorted as a linked list.  This makes it easier to
  apply tailorings.
- Added code to sort entries by collation elements.
- Added logical reset points.  This is used for tailoring relative to certain
  properties, rather than characters.

NOTE: all code for type entry should now be in order.go.  To keep the diffs for
this CL reasonable, though, the existing code is left in builder.go.  I'll move
this in a separate CL.

R=r
CC=golang-dev
https://golang.org/cl/6493063
2012-09-01 14:13:37 +02:00
Marcel van Lohuizen
c61a185f35 exp/locale/collate: add code to ignore tests with (unpaired) surrogates.
In the regtest data, surrogates are assigned primary weights based on
the surrogate code point value.  Go now converts surrogates to FFFD, however,
meaning that the primary weight is based on this code point instead.
This change drops tests with surrogates and lets the tests pass.

R=r
CC=golang-dev
https://golang.org/cl/6461100
2012-08-24 15:56:07 +02:00
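
The behaviour the commit above refers to can be observed with the standard library alone. The short program below shows that decoding the byte sequence a naive encoder would emit for a surrogate half, or converting a surrogate code point to a string, yields U+FFFD. The helper name decodeFirst is introduced here for illustration.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// decodeFirst returns the first rune decoded from s.
func decodeFirst(s string) rune {
	r, _ := utf8.DecodeRuneInString(s)
	return r
}

func main() {
	// "\xed\xa0\x80" is the three-byte sequence a naive encoder would
	// produce for the surrogate half U+D800; it is not valid UTF-8.
	fmt.Printf("%U\n", decodeFirst("\xed\xa0\x80")) // U+FFFD

	// Converting a surrogate code point to a string also yields U+FFFD.
	fmt.Printf("%U\n", decodeFirst(string(rune(0xD800)))) // U+FFFD
}
```

Because the surrogate value is lost at decoding time, any weight derived from the decoded rune is necessarily that of U+FFFD, which is why test data keyed on the surrogate code point cannot match.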
Marcel van Lohuizen
a8357f0160 exp/locale/collate/build: fixed bug that was exposed by experimenting
with table changes.
NOTE: there is no test for this, but 1) the code now has the same
control flow as scan in exp/locale/collate/contract.go, which is
tested and 2) Builder verifies the generated table so bugs in this
code are quickly and easily found (which is how this bug was discovered).

R=r
CC=golang-dev
https://golang.org/cl/6461082
2012-08-20 10:56:41 +02:00
Marcel van Lohuizen
98883c811a exp/locale/collate: let regtest generate its own collation table.
The main table will need to get a slightly different collation table than the one
used by regtest, as the regtest is based on the standard UCA DUCET, while
the locale-specific tables are all based on a CLDR root table.
This change allows changing the table without affecting the regression test.

R=r
CC=golang-dev
https://golang.org/cl/6453089
2012-08-20 10:56:19 +02:00
Marcel van Lohuizen
2845e5881f exp/locale/collate: changed default AlternateHandling to non-ignorable, the same
default as ICU.

R=r
CC=golang-dev
https://golang.org/cl/6445080
2012-08-20 10:56:06 +02:00
Marcel van Lohuizen
6918357031 exp/locale/collate: Added test flag to maketables tool for comparing newly
generated tables against previously generated ones.

R=r
CC=golang-dev
https://golang.org/cl/6441098
2012-08-20 10:55:40 +02:00
Marcel van Lohuizen
89d40b911c exp/locale/collate: changed API of Builder to be more convenient
for dealing with CLDR files:
- Add now takes a list of indexes of colelems that are variables. Checking and
  handling is now done by the Builder.  VariableTop is now also properly generated
  using the Build method.
- Introduced separate Builder, called Tailoring, for creating tailorings of root
  table.  This clearly separates the functionality for building a table based on
  weights (the allkeys* files) versus tables based on LDML XML files.
- Tailorings are now added by two calls instead of one: SetAnchor and Insert.
  This more closely reflects the structure of LDML side and simplifies the
  implementation of both the client and library side.  It also preserves
  some information that is otherwise hard to recover for the Builder.
- Allow the LDML XML element extend to be passed to Insert.  This simplifies
  both client and library implementation.

R=r
CC=golang-dev
https://golang.org/cl/6454061
2012-08-03 09:01:21 +02:00
Marcel van Lohuizen
601045e87a exp/locale/collate: changed trie in first step towards support for multiple locales.
- Allow handles into the trie for different locales.  Multiple tables share the same
  trie to allow for reuse of blocks.
- Significantly improved memory footprint and reduced allocations of trieNodes.
  This speeds up generation by about 30% and allows keeping trieNodes around
  for multiple locales during generation.
- Renamed print method to fprint.

R=r
CC=golang-dev
https://golang.org/cl/6408052
2012-07-28 18:44:14 +02:00
Marcel van Lohuizen
882b6ef454 exp/locale/collate: This CL includes the following changes:
- Changed the representation of colElem to support a few cases
  for some languages not supported by the current format.
- Changed offsets for implicit primary values. This makes the
  values both easier to read and debug (last 4 nibbles are identical to
  implicit primary value) and also results in better packing.
- Fixed bug in weight conversion code that did not pop up yet by
  sheer luck.
Note that tables.go also includes changes to the contraction trie
from CL 6346092.

R=r, mpvl
CC=golang-dev
https://golang.org/cl/6392060
2012-07-13 11:38:22 +02:00
Marcel van Lohuizen
adc19ac5e3 exp/locale/collate: adjusted contraction trie to support Myanmar (Burmese),
which has a rather large contraction table. The value of the next state
offset now starts after the current block, instead of before.  This is
slightly less efficient (one extra addition per state change), but gives
some extra range for the offsets.
Also introduced constants for final (0) and noIndex (0xFF).
tables.go is updated in a separate CL.

R=r
CC=golang-dev
https://golang.org/cl/6346092
2012-07-13 11:38:00 +02:00
Marcel van Lohuizen
77b1378c3e exp/locale/collate: added regression test for collate package
based on UCA test files.

R=r
CC=golang-dev
https://golang.org/cl/6216056
2012-06-19 11:34:56 -07:00
Marcel van Lohuizen
de0c1c9cf5 exp/locale/collate: somehow an incorrect version of tables was checked in earlier.
Regenerated tables using maketables.

R=r, rsc
CC=golang-dev
https://golang.org/cl/6248067
2012-06-04 18:35:26 +02:00
Marcel van Lohuizen
c633f85f65 exp/locale/collate: avoid double building in maketables.go. Also added check.
R=r
CC=golang-dev
https://golang.org/cl/6202063
2012-05-30 17:47:56 +02:00
Russ Cox
ce69666273 exp/locale/collate: avoid 16-bit math
There's no need for the 16-bit arithmetic here,
and it tickles a long-standing compiler bug.
Fix the exp code not to use 16-bit math and
create an explicit test for the compiler bug.

R=golang-dev, r
CC=golang-dev
https://golang.org/cl/6256048
2012-05-24 14:50:36 -04:00
Marcel van Lohuizen
ec099b2bc7 exp/locale/collate: implementation of main collation functionality for
key and simple comparison. Search is not yet implemented in this CL.
Changed some of the types of table_test.go to allow reuse in the new test.
Also reduced number of primary values for illegal runes to 1 (both map to
the same).

R=r
CC=golang-dev
https://golang.org/cl/6202062
2012-05-17 19:48:56 +02:00
Marcel van Lohuizen
0355a71751 exp/locale/collate: Add maketables tool and generated tables.
Also set maxContractLen automatically.
Note that the table size is much bigger than it needs to be.
Optimization is best done, though, when the language specific
tables are added.

R=r
CC=golang-dev
https://golang.org/cl/6167044
2012-05-09 12:03:55 +02:00
Marcel van Lohuizen
56a76c88f8 exp/locale/collate: from the regression test we derive that the spec
dictates a CJK rune is only part of a certain specified range if it
is explicitly defined in the Unicode Codepoint Database.
Fixed the code and some of the tests accordingly.

R=r
CC=golang-dev
https://golang.org/cl/6160044
2012-05-07 11:51:40 +02:00
Marcel van Lohuizen
10838165d8 exp/locale/collate: fixed two bugs uncovered by regression tests.
The first bug was that tertiary ignorables had the same colElem as
implicit colElems, yielding unexpected results. The current encoding
ensures that a non-implicit colElem is never 0.  This fix uncovered
another bug of the trie that indexed incorrectly into the null block.
This was caused by an unfinished optimization that would avoid the
need to max out the most-significant bits of continuation bytes.
This bug was also present in the trie used in exp/norm and has been
fixed there as well. The appearance of the bug was rare, as the lower
blocks happened to be nearly nil.

R=r
CC=golang-dev
https://golang.org/cl/6127070
2012-05-02 17:01:41 +02:00
Marcel van Lohuizen
fdce27f7b8 exp/locale/collate: Added Builder type for generating a complete
collation table. At this moment, it only implements the generation of
a root table.

R=r
CC=golang-dev
https://golang.org/cl/6039047
2012-04-25 13:19:35 +02:00
Marcel van Lohuizen
52f0afe0db exp/locale/collate: Added skeleton for the higher-level types to provide
context for change lists of lower-level types. The public APIs are defined
in builder.go and collate.go. Type table is the glue between the lower and
higher level code and might be a good starting point for understanding the
collation code.

R=r, r
CC=golang-dev
https://golang.org/cl/5999053
2012-04-25 13:19:00 +02:00
Marcel van Lohuizen
bcf48c7971 exp/locale/collate: added trie for associating colElems to runes.
The trie code looks a lot like the trie in exp/norm. It uses different
types, however.  Also, there is only a lookup for []byte and the unsafe
lookup methods have been dropped, as well as sparse mode.
There is now a method for generating a trie. To output Go code, one now needs
to first generate a trie and then call print() on it.

R=r, r, mpvl
CC=golang-dev
https://golang.org/cl/5966064
2012-04-25 13:16:57 +02:00
Marcel van Lohuizen
bb3f3c9775 exp/locale/collate: added representation for collation elements
(see http://www.unicode.org/reports/tr10/).

R=r, r
CC=golang-dev
https://golang.org/cl/5981048
2012-04-25 13:16:24 +02:00
Marcel van Lohuizen
e456d015fb exp/locale/collate: implementation of trie that is used for detecting contractions.
(See http://www.unicode.org/reports/tr10/#Contractions.)  Each rune that is at the
start of any contraction is associated with a trie. This trie, in turn, may be shared
by other runes that have the same set of suffixes.

R=r, r
CC=golang-dev
https://golang.org/cl/5970066
2012-04-25 13:15:48 +02:00
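
The commit above associates each contraction-starting rune with a trie of suffixes, which runes with the same suffix set may share. The sketch below illustrates that lookup with a simple map-based trie. It is not the compact table-driven structure the package uses, and the names node, insert, longest, and contractionLen are hypothetical.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// node is a hypothetical trie node keyed by rune.
type node struct {
	children map[rune]*node
	final    bool // true if the path to this node is a complete suffix
}

// insert adds suffix to the trie rooted at n.
func (n *node) insert(suffix string) {
	for _, r := range suffix {
		if n.children == nil {
			n.children = map[rune]*node{}
		}
		c := n.children[r]
		if c == nil {
			c = &node{}
			n.children[r] = c
		}
		n = c
	}
	n.final = true
}

// longest returns the byte length of the longest suffix stored in the trie
// that is a prefix of s, or 0 if none matches.
func (n *node) longest(s string) int {
	best := 0
	for i := 0; i < len(s); {
		r, w := utf8.DecodeRuneInString(s[i:])
		n = n.children[r] // indexing a nil map safely yields nil
		if n == nil {
			break
		}
		i += w
		if n.final {
			best = i
		}
	}
	return best
}

// contractionLen returns how many bytes at the start of s form one unit:
// the first rune plus the longest matching suffix, if the rune has a trie.
func contractionLen(tries map[rune]*node, s string) int {
	r, w := utf8.DecodeRuneInString(s)
	t := tries[r]
	if t == nil {
		return w
	}
	return w + t.longest(s[w:])
}

func main() {
	// 'c' and 'C' have the same suffix set, so they share one trie.
	suffixes := &node{}
	suffixes.insert("h")
	suffixes.insert("H")
	tries := map[rune]*node{'c': suffixes, 'C': suffixes}

	fmt.Println(contractionLen(tries, "chata")) // 2: "ch" is one unit
	fmt.Println(contractionLen(tries, "cesta")) // 1: plain "c"
	fmt.Println(contractionLen(tries, "CHATA")) // 2: shared trie matches "H"
}
```

Keeping one trie per distinct suffix set, rather than one per starting rune, is what allows the sharing the commit message describes.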