Motivations:
- Simpler UI. Previous API proved a bit awkward for practical purposes.
- Iter is often used in cases where one want to be able to bail out early.
The old implementaton had too much look-ahead to be efficient.
Disadvantages:
- ASCII performance is bad. This is unavoidable for tiny iterations.
Example is included to show how to work around this.
Description:
Iter now iterates per boundary/segment. It returns a slice of bytes that
either points to the input bytes, the internal decomposition strings,
or the small internal buffer that each iterator has. In many cases, copying
bytes is avoided.
The method Seek was added to support jumping around the input without
having to reinitialize.
Details:
- Table adjustments: some decompositions exist of multiple segments.
Decompositions that are of this type are now marked so that Iter can
handle them separately.
- The old iterator had a different next function for different normal forms
that was assigned to a function pointer called by Next.
The new iterator uses this mechanism to switch between different modes
for handling different type of input as well. This greatly improves
performance for Hangul and ASCII. It is also used for multi-segment
decompositions.
- input is now a struct of sting and []byte, instead of an interface.
This simplifies optimizing the ASCII case.
R=rsc
CC=golang-dev
https://golang.org/cl/6873072
For completeness, we also expose the Canonical Combining Class of a rune.
This does not increase the data size.
R=r
CC=golang-dev
https://golang.org/cl/5931043
by other low-level libraries, like collate. Extra care has been given to optimize the performance
of normalizing to NFD, as this is what will be used by the collator. The overhead of checking
whether a string is normalized vs simply decomposing a string is neglible. Assuming that most
strings are in the FCD form, this iterator can be used to decompose strings and normalize with
minimal overhead.
R=r
CC=golang-dev
https://golang.org/cl/5676057
one trie lookup per rune is needed. See forminfo.go for a description
of the new format. Also included leading and trailing canonical
combining class in decomposition information. This will often avoid
additional trie lookups.
R=r, r
CC=golang-dev
https://golang.org/cl/5616071
- Unified bounary conditions for NFC and NFD and removed some indirections.
This enforces boundaries at the character level, which is typically what
the user expects. (NFD allows a boundary between 'a' and '`', for example,
which may give unexpected results for collation. The current implementation
is already stricter than the standard, so nothing much changes. This change
just formalizes it.
- Moved methods of qcflags to runeInfo.
- Swapped YesC and YesMaybe bits in qcFlags. This is to aid future changes.
- runeInfo return values use named fields in preperation for struct change.
- Replaced some left-over uint32s with rune.
R=r
CC=golang-dev
https://golang.org/cl/5607050
Nothing terribly interesting here. (!)
Since the public APIs are all in terms of UTF-8,
the changes are all internal only.
R=mpvl, gri, r
CC=golang-dev
https://golang.org/cl/5309042
Needed to ensure that finding the last boundary does not result in O(n^2)-like behavior.
Now prevents lookbacks beyond 31 characters across the board (starter + 30 non-starters).
composition.go:
- maxCombiningCharacters now means exactly that.
- Bug fix.
- Small performance improvement/ made code consistent with other code.
forminfo.go:
- Bug fix: ccc needs to be 0 for inert runes.
normalize.go:
- A few bug fixes.
- Limit the amount of combining characters considered in FirstBoundary.
- Ditto for LastBoundary.
- Changed semantics of LastBoundary to not consider trailing illegal runes a boundary
as long as adding bytes might still make them legal.
trie.go:
- As utf8.UTFMax is 4, we should treat UTF-8 encodings of size 5 or greater as illegal.
This has no impact on the normalization process, but it prevents buffer overflows
where we expect at most UTFMax bytes.
R=r
CC=golang-dev
https://golang.org/cl/4963041
maketables.go/tables.go
- Properly set combinesForward flag for JamoL and JamoV.
- Fixed Printf bug.
composition.go
- Make insertString use the same control flow as insert.
- Better Hangul and non-Hangul mixing.
forminfo.go
- Fixed bug in compBoundaryBefore that affected a few esoteric cases.
- Buffer overflow now tested in normalize_test.go (other CL).
R=r
CC=golang-dev
https://golang.org/cl/4924041
forminfo.go:
- Wrappers for table data.
- Per Form dispatch table.
composition.go:
- reorderBuffer type. Implements decomposition, reordering, and composition.
- Note: decompose and decomposeString fields in formInfo could be replaced by
a pointer to the trie for the respective form. The proposed design makes
testing easier, though.
normalization.go:
- Temporarily added panic("not implemented") methods to make the tests run.
These will be removed again with the next CL, which will introduce the
implementation.
R=r, rogpeppe, mpvl, rsc
CC=golang-dev
https://golang.org/cl/4875043