...
Text segmentation works as follows:
IDENTIFY SPLIT POINTS: The process identifies all potential break points in the text, working from right to left. These are the places where the segmenter would “like” to split the text. Once all break points are identified, it checks whether any of them need to be removed (canceled):
CHECK MINIMUM LENGTH: Looking at each segment from right to left, it identifies any segment that is shorter than the minimum length given in parameters.minimumSegmentLength. If a segment is too short, the split point to the left of that segment is canceled, and the tooShort property is set for the canceled split point.
APPLY EXCEPTION RULES: If an exception rule in the SRX configuration matches a remaining split point, that split is also canceled. The details of the exception rule are listed in the exception property.
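The process described above can be illustrated with a small Python sketch. It is not Wordbee's implementation: the names SplitPoint, find_split_points and apply_splits are hypothetical, a single regular expression stands in for the SRX break rules, exception rules are modeled as (name, before, after) regex triples, and segment length is assumed to be counted in raw characters. The too_short and exception fields stand in for the tooShort and exception properties.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class SplitPoint:
    offset: int                      # index in the text where a split would occur
    canceled: bool = False
    too_short: bool = False          # stands in for the tooShort property
    exception: Optional[str] = None  # stands in for the exception property


def find_split_points(
    text: str,
    break_rule: str,                              # assumed: one regex for all break rules
    exception_rules: List[Tuple[str, str, str]],  # assumed: (name, before, after) regexes
    minimum_segment_length: int,
) -> List[SplitPoint]:
    # Step 1: collect every candidate split point, ordered right to left.
    offsets = {m.end() for m in re.finditer(break_rule, text)}
    points = [SplitPoint(o) for o in sorted(offsets, reverse=True) if 0 < o < len(text)]

    # Step 2: walk the segments right to left. A segment that is too short
    # cancels the split point on its left, merging it into its left neighbor.
    right_edge = len(text)
    for point in points:
        if right_edge - point.offset < minimum_segment_length:
            point.canceled = True
            point.too_short = True
        else:
            right_edge = point.offset

    # Step 3: cancel any remaining split point matched by an exception rule.
    for point in points:
        if point.canceled:
            continue
        for name, before, after in exception_rules:
            before_ok = re.search(f"(?:{before})\\Z", text[: point.offset])
            after_ok = re.match(after, text[point.offset :])
            if before_ok and after_ok:
                point.canceled = True
                point.exception = name
                break
    return points


def apply_splits(text: str, points: List[SplitPoint]) -> List[str]:
    # Cut the text at every split point that survived both checks.
    cuts = sorted(p.offset for p in points if not p.canceled)
    edges = [0] + cuts + [len(text)]
    return [text[a:b] for a, b in zip(edges, edges[1:])]


text = "Dr. Smith arrived. Ok. He left early."
rules = [("abbreviation", r"\b(?:Dr|Mr)\.\s+", r"[A-Z]")]
points = find_split_points(text, r"\.\s+", rules, minimum_segment_length=5)
print(apply_splits(text, points))
# prints ['Dr. Smith arrived. Ok. ', 'He left early.']
```

In this example the split before "Ok. " is canceled because that segment is shorter than five characters (too_short is set), and the split after "Dr. " is canceled by the hypothetical abbreviation rule (recorded in exception). Only the split before "He left early." remains.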
If you believe that this process is complicated, then the Wordbee team heartily agrees with you.