Customize Rules with XML

Text extraction rules are stored in XML format.

Example of a Microsoft Word rule:

<?xml version="1.0" encoding="utf-8"?>
<!-- Exchange format for text extraction rules -->
<ParserConfigurations xmlns="http://www.wordbee.com/config">
<!-- Rule -->
<ParserConfiguration xmlns="http://www.wordbee.com/config">
  <Name>Microsoft Word</Name>
  <Description>Extracts all contents including header, footer, document properties and user comments.</Description>
  <ParserDomain>MSWORD</ParserDomain>
  <EParser>4</EParser>
  <SegmentationRulesEnabled>true</SegmentationRulesEnabled>
  <SegmentationSplitAtNewlines>true</SegmentationSplitAtNewlines>
  <SegmentationSplitAtInlineTags>true</SegmentationSplitAtInlineTags>
  <VersionPretranslation>CompareTexts</VersionPretranslation>
  <UserTextPatterns xmlns="" />
  <CompactingOption xmlns="">0</CompactingOption>
  <ModulesVersion />
  <MSOfficeConfiguration xmlns="http://www.wordbee.com/config/msoffice">
    <TrimWhitespaces>true</TrimWhitespaces>
    <TrimNoLetterDigit>false</TrimNoLetterDigit>
    <RemoveFormatWhitespaces>true</RemoveFormatWhitespaces>
    <RemoveFormatNoLetterDigit>false</RemoveFormatNoLetterDigit>
 
...

 

You can edit the XML directly, provided you know what the individual options mean. For example:

  • Name: The display name of the configuration.
  • Description: Optional description of the configuration.
  • SegmentationRulesEnabled: Switches segmentation of text on or off.
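
As a rough sketch of such a direct edit, the fragment below repeats the first elements of the Microsoft Word rule shown above with two values changed. The new name is hypothetical, and the lowercase value false is assumed to be the "off" counterpart of the true seen in the example. All other elements of the rule would stay as exported and are left out here.

```xml
<!-- Sketch only: start of an edited copy of the Microsoft Word rule -->
<ParserConfiguration xmlns="http://www.wordbee.com/config">
  <!-- Hypothetical new name, so the copy is easy to tell apart -->
  <Name>Microsoft Word (no segmentation)</Name>
  <Description>Extracts all contents including header, footer, document properties and user comments.</Description>
  <ParserDomain>MSWORD</ParserDomain>
  <EParser>4</EParser>
  <!-- Changed from true to false (assumed "off" value) to switch segmentation off -->
  <SegmentationRulesEnabled>false</SegmentationRulesEnabled>
  <!-- Remaining elements of the exported rule are unchanged and omitted in this sketch -->
</ParserConfiguration>
```

Keeping the element order and namespaces exactly as exported is the cautious choice, since the example above mixes several xmlns declarations.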

Rules may themselves embed further rule definitions. For example, an XML rule may include an HTML rule for processing nodes that contain HTML content. Similarly, a Word rule may contain an Excel or PowerPoint rule to handle those formats when they are embedded in a Word document.

In general, we recommend customizing rules interactively in Wordbee Translator:

Customize Rules

Copyright Wordbee - Buzzin' Outside the Box since 2008