settings/srx/tools/split
This tool lets you submit a short text and see how it is split. The API returns detailed information showing which breaking rules and exception rules were invoked, which helps when debugging or building SRX configurations.
URL
(POST) /api/settings/srx/tools/split
PARAMETERS
The BODY must include a JSON object with these properties:
locale | The language of the text, such as “en” or “de”. | string, Mandatory |
text | The text to be tested. Up to 1000 characters. | string, Mandatory |
independentRulesId | The language-independent SRX rules. If set to null, no language-independent rules are loaded (not recommended). Use settings/srx/find to find configurations. | int, Optional |
languageRulesId | The language-specific SRX rules. They must match the locale of the text. If set to null, no language-specific rules are loaded (not recommended). Use settings/srx/find to find configurations. | int, Optional |
Example payload:
{
"locale": "de",
"independentRulesId": 234,
"languageRulesId" : 213,
"text": "Hallo geht es am 20.4. um 3 Uhr? Geht es spaeter?"
}
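As an illustration, the request could be assembled in Python roughly as follows. This is a minimal sketch, not an official client: the function name build_split_request, the base URL and any authentication headers are assumptions that do not come from this page. Only the path, the HTTP method and the property names are taken from the description above.

```python
import json
import urllib.request

SPLIT_PATH = "/api/settings/srx/tools/split"  # path as documented above


def build_split_request(base_url, locale, text,
                        independent_rules_id=None, language_rules_id=None,
                        headers=None):
    """Build (but do not send) a POST request for the split tool.

    base_url and headers are hypothetical; supply whatever host and
    authentication headers your installation requires.
    """
    if len(text) > 1000:
        raise ValueError("text must not exceed 1000 characters")

    payload = {"locale": locale, "text": text}
    # In this sketch, None means "leave the property out of the payload".
    # Sending an explicit null instead disables the rule set (see above).
    if independent_rules_id is not None:
        payload["independentRulesId"] = independent_rules_id
    if language_rules_id is not None:
        payload["languageRulesId"] = language_rules_id

    all_headers = {"Content-Type": "application/json"}
    all_headers.update(headers or {})
    return urllib.request.Request(
        base_url.rstrip("/") + SPLIT_PATH,
        data=json.dumps(payload).encode("utf-8"),
        headers=all_headers,
        method="POST",
    )
```

The returned object can be passed to urllib.request.urlopen once the base URL and credentials for your environment are known.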
RESULTS
The JSON result shows segmentation results. A result for the sample above might be:
{
"count": 2,
"original": "Hallo wie geht es am 20.4. um 3 Uhr. Geht es spaeter?",
"segments": [
{
"position": 0,
"text": "Hallo wie geht es am 20.4. um 3 Uhr."
},
{
"position": 36,
"text": " Geht es spaeter?"
}
],
"rules": [
{
"position": 26,
"retained": false,
"tooShort": false,
"breaking": {
"no": 10021,
"before": "[\\.\\?\\!\\;\\:]+[\\“\\\"\\'”\\)]?",
"after": "\\s"
},
"exception": {
"no": 10019,
"before": "\\.+",
"after": "[\\“\\\"\\'”\\)]?\\s\\p{Ll}"
}
},
{
"position": 36,
"retained": true,
"tooShort": false,
,
"breaking": {
"no": 10021,
"before": "[\\.\\?\\!\\;\\:]+[\\“\\\"\\'”\\)]?",
"after": "\\s"
},
"exception": {
"no": null,
"before": null,
"after": null
}
}
],
"parameters": {
"locale": "de",
"independentRuleId": 5503,
"languageRulesId": 5502,
"minimumSegmentLength": 5
}
}
The properties are:
count | Total segments into which the text was split | int |
original | The original text. | string? |
segments | The list of segments with start character position and the text | object[] |
rules | An array of breaking and exception rules that were activated for all the positions in the text. See below for details. | object[] |
parameters | Includes information from the original payload. | object |
The rules array contains positions in the text and describes whether each position was split (breaking rule) or whether the split was undone by a specific exception rule. The properties are:
position | The text position that the system attempts to split | int |
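retained | True if the split at this position was kept. False if the split was canceled, for example by an exception rule or because the segment was too short. | bool |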
tooShort | If the split segment is shorter than an allowed minimum, the split will be canceled. This property is then set to true. | bool |
breaking | The breaking rule that was applied. | object |
exception | The exception rule, if any, that canceled the breaking rule. If there is no exception then the properties will all be null. | object |
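To make the rules array easier to read while debugging, each entry can be turned into a one-line summary. The sketch below is only an illustration: the helper name explain_rules is hypothetical, and it assumes that position is a zero-based character offset into original, which is consistent with the sample result above but is not stated explicitly.

```python
def explain_rules(result):
    """Return one human-readable line per entry in result["rules"]."""
    text = result["original"]
    lines = []
    for rule in result["rules"]:
        pos = rule["position"]
        exception_no = rule["exception"]["no"]
        if rule["retained"]:
            outcome = "split kept"
        elif rule["tooShort"]:
            outcome = "canceled: segment too short"
        elif exception_no is not None:
            outcome = f"canceled by exception rule {exception_no}"
        else:
            outcome = "canceled"
        # Show a few characters on each side of the candidate split point.
        context = f"{text[max(0, pos - 8):pos]}|{text[pos:pos + 8]}"
        lines.append(
            f"{pos} [{context}] breaking rule {rule['breaking']['no']}: {outcome}"
        )
    return lines


# Trimmed copy of the sample result shown above.
sample = {
    "original": "Hallo wie geht es am 20.4. um 3 Uhr. Geht es spaeter?",
    "rules": [
        {"position": 26, "retained": False, "tooShort": False,
         "breaking": {"no": 10021}, "exception": {"no": 10019}},
        {"position": 36, "retained": True, "tooShort": False,
         "breaking": {"no": 10021}, "exception": {"no": None}},
    ],
}
for line in explain_rules(sample):
    print(line)
```

The vertical bar in each summary marks the candidate split position, so it is easy to see which characters the breaking and exception patterns were matched against.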
Text segmentation works as follows:
IDENTIFY SPLIT POINTS: The process identifies all potential break points in the text, working from right to left. Once all break points are identified, it checks whether any of them need to be removed (canceled):
CHECK MINIMUM LENGTH: Looking at each segment from right to left, it identifies any segment that is shorter than the minimum length given in parameters.minimumSegmentLength. If a segment is too short, the split point to the left of that segment is removed and the tooShort property is set to true for the canceled split point.
APPLY EXCEPTION RULES: If an exception rule in the SRX configuration matches a remaining split point, that split is also canceled. The details of the exception rule are listed in the exception property.
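The three steps can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the actual Wordbee implementation: the function names are hypothetical, segment length is measured in raw characters, only a single breaking rule is supported, and because Python's re module does not support \p{Ll}, the lowercase test is approximated with [a-zäöüß].

```python
import re

QUOTE = "[“\"'”)]?"                                # optional closing quote or bracket
BREAKING = (r"[.?!;:]+" + QUOTE, r"\s")            # (before, after), like rule 10021
EXCEPTIONS = [(r"\.+", QUOTE + r"\s[a-zäöüß]")]    # approximates rule 10019


def matches_at(text, pos, before, after):
    """True if `before` matches right up to pos and `after` matches from pos."""
    return (re.search(rf"(?:{before})\Z", text[:pos]) is not None
            and re.match(after, text[pos:]) is not None)


def segment(text, breaking, exceptions, min_len):
    # IDENTIFY SPLIT POINTS: every position where the breaking rule matches.
    rules = [{"position": p, "retained": True, "tooShort": False, "exception": None}
             for p in range(1, len(text)) if matches_at(text, p, *breaking)]

    # CHECK MINIMUM LENGTH: walk right to left; a segment that is too short
    # cancels the split point on its left and merges with its left neighbor.
    right_edge = len(text)
    for rule in reversed(rules):
        if right_edge - rule["position"] < min_len:
            rule["tooShort"] = True
            rule["retained"] = False
        else:
            right_edge = rule["position"]

    # APPLY EXCEPTION RULES: a matching exception cancels a remaining split.
    for rule in rules:
        if not rule["retained"]:
            continue
        for exc in exceptions:
            if matches_at(text, rule["position"], *exc):
                rule["exception"] = exc
                rule["retained"] = False
                break

    # Cut the text at the split points that survived both checks.
    cuts = [0] + [r["position"] for r in rules if r["retained"]] + [len(text)]
    segments = [{"position": a, "text": text[a:b]} for a, b in zip(cuts, cuts[1:])]
    return segments, rules


text = "Hallo wie geht es am 20.4. um 3 Uhr. Geht es spaeter?"
segments, rules = segment(text, BREAKING, EXCEPTIONS, min_len=5)
```

For the sample text, this sketch finds candidate split points at positions 26 and 36, cancels the one at 26 through the exception rule, and produces two segments starting at positions 0 and 36, in line with the sample result shown earlier.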
If you believe that this process is complicated, then the Wordbee team heartily agrees with you.
Copyright Wordbee - Buzzin' Outside the Box since 2008