Note |
---|
Work in progress, not released. |
This tool lets you submit a short text and see how it is split. The API returns detailed information that shows which breaking rules and exception rules were invoked, which helps when debugging or building SRX configurations.
...
Code Block |
---|
{ "count": 2, "original": "Hallo wie geht es am 20.4. um 3 Uhr. Geht es spaeter?", "segments": [ { "position": 0, "text": "Hallo wie geht es am 20.4. um 3 Uhr." }, { "position": 36, "text": " Geht es spaeter?" } ], "rules": [ { "position": 26, "retained": false, "tooShort": false, "breaking": { "no": 10021, "before": "[\\.\\?\\!\\;\\:]+[\\“\\\"\\'”\\)]?", "after": "\\s" }, "exception": { "no": 10019, "before": "\\.+", "after": "[\\“\\\"\\'”\\)]?\\s\\p{Ll}" } }, { "position": 36, "retained": true, "tooShort": false, "breaking": { "no": 10021, "before": "[\\.\\?\\!\\;\\:]+[\\“\\\"\\'”\\)]?", "after": "\\s" }, "exception": { "no": null, "before": null, "after": null } } ], "parameters": { "locale": "de", "independentRuleId": 5503, "languageRulesId": 5502, "minimumSegmentLength": 5 } } |
The properties are:
count | Total segments into which the text was split | int |
original | The original text. | string? |
segments | The list of segments with start character position and the text | object[] |
rules | An array of the breaking and exception rules that were activated, one entry per candidate split position in the text. See below for details. | object[] |
parameters | Includes information from the original payload. | object |
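As a minimal sketch of how a client might read these properties, the Python snippet below parses a trimmed copy of the sample response and checks each segment against the original text. It assumes that position is a zero-based character offset into original, which matches the sample but is not stated explicitly. The helper name check_segments is hypothetical and not part of the API.

```python
import json

# Trimmed copy of the sample response ("rules" and "parameters" omitted).
response_text = """
{
  "count": 2,
  "original": "Hallo wie geht es am 20.4. um 3 Uhr. Geht es spaeter?",
  "segments": [
    {"position": 0, "text": "Hallo wie geht es am 20.4. um 3 Uhr."},
    {"position": 36, "text": " Geht es spaeter?"}
  ]
}
"""


def check_segments(response: dict) -> list[str]:
    """Return the segment texts after checking them against the original.

    Assumption: "position" is a zero-based character offset into "original".
    """
    original = response["original"]
    segments = response["segments"]
    if response["count"] != len(segments):
        raise ValueError("count does not match the number of segments")
    for seg in segments:
        start = seg["position"]
        if original[start:start + len(seg["text"])] != seg["text"]:
            raise ValueError(f"segment at {start} does not match the original")
    return [seg["text"] for seg in segments]


texts = check_segments(json.loads(response_text))
for text in texts:
    print(repr(text))
```

In the sample, the second segment keeps its leading space, so joining the segment texts reproduces the original string.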
...
position | The text position that the system attempts to split | int |
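retained | Whether the split at this position was kept. Set to false if the split was canceled, for example by an exception rule. | bool |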
tooShort | If the split segment is shorter than an allowed minimum, the split will be canceled. This property is then set to true. | bool |
breaking | The breaking rule that was applied. | object |
exception | The exception rule, if any, that canceled the breaking rule. If there is no exception then the properties will all be null. | object |
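When debugging, it can help to turn each entry of rules into a one-line summary. The sketch below does this for the two entries from the sample response. The function name explain is hypothetical, and the reading of retained (true means the split was kept) is inferred from the sample rather than documented.

```python
def explain(entry: dict) -> str:
    """Summarize one entry of the "rules" array in a single line."""
    pos = entry["position"]
    breaking_no = entry["breaking"]["no"]
    exception_no = entry["exception"]["no"]
    if entry["retained"]:
        return f"position {pos}: kept, breaking rule {breaking_no}"
    reasons = []
    if entry["tooShort"]:
        reasons.append("segment too short")
    if exception_no is not None:
        reasons.append(f"exception rule {exception_no}")
    reason = " and ".join(reasons) or "unknown reason"
    return f"position {pos}: canceled ({reason}), breaking rule {breaking_no}"


# The two entries from the sample response, with the regex fields left out.
rules = [
    {"position": 26, "retained": False, "tooShort": False,
     "breaking": {"no": 10021}, "exception": {"no": 10019}},
    {"position": 36, "retained": True, "tooShort": False,
     "breaking": {"no": 10021}, "exception": {"no": None}},
]

lines = [explain(entry) for entry in rules]
for line in lines:
    print(line)
```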
Text segmentation works as follows:
1. Find a breaking rule. This is where the segmenter would “like” to split the text.
2. If the resulting segment is shorter than the minimum length in parameters.minimumSegmentLength, the split is canceled (and the tooShort property is set).
3. If an exception rule in the SRX configuration matches the split point, the split is also canceled. The details of the exception rule are listed in the exception property.
4. Start over to find more breaking rules.
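The steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the service's implementation: the Rule class, matches_at, and segment are hypothetical names, the two rules are simplified versions of rules 10021 and 10019 from the sample, and the character class [a-zäöüß] stands in for \p{Ll} because Python's built-in re module does not support Unicode property classes.

```python
import re
from dataclasses import dataclass


@dataclass
class Rule:
    no: int
    before: str  # regex that must match the text ending at the split point
    after: str   # regex that must match the text starting at the split point


def matches_at(rule: Rule, text: str, pos: int) -> bool:
    """True if the rule's before/after patterns both match around pos."""
    before_ok = re.search(rf"(?:{rule.before})\Z", text[:pos]) is not None
    after_ok = re.match(rule.after, text[pos:]) is not None
    return before_ok and after_ok


def segment(text, breaking, exceptions, minimum_segment_length):
    segments, log, start = [], [], 0
    for pos in range(1, len(text)):
        # Step 1: find a breaking rule that wants to split at pos.
        rule = next((r for r in breaking if matches_at(r, text, pos)), None)
        if rule is None:
            continue
        # Step 2: cancel the split if the segment would be too short.
        too_short = pos - start < minimum_segment_length
        # Step 3: cancel the split if an exception rule matches here.
        exception = next((e for e in exceptions if matches_at(e, text, pos)), None)
        retained = not too_short and exception is None
        log.append({"position": pos, "retained": retained, "tooShort": too_short,
                    "breaking": rule.no,
                    "exception": exception.no if exception else None})
        if retained:
            segments.append(text[start:pos])
            start = pos
        # Step 4: continue scanning for more breaking rules.
    segments.append(text[start:])
    return segments, log


# Simplified stand-ins for the rules shown in the sample response.
BREAKING = [Rule(10021, r"""[.?!;:]+[“"'”)]?""", r"\s")]
EXCEPTIONS = [Rule(10019, r"\.+", r"""[“"'”)]?\s[a-zäöüß]""")]

segments, log = segment(
    "Hallo wie geht es am 20.4. um 3 Uhr. Geht es spaeter?",
    BREAKING, EXCEPTIONS, minimum_segment_length=5,
)
```

With these simplified rules, the sketch finds the same two candidate positions as the sample response: the split after "20.4." is canceled by the exception rule because a lowercase letter follows, and the split after "Uhr." is kept.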