
A file format configuration for Microsoft Word offers many options that control how content is extracted. This page explains the most common ones.

The following file extensions are supported when setting up file format configurations for Microsoft Word: .doc, .docx, .dot, .dotx, .docm, .dotm.
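As a rough illustration, the extension list above can be turned into a quick pre-check before files are assigned to a Word configuration. This is only a sketch: the function name `is_supported_word_file` is hypothetical, and the case-insensitive comparison is an assumption rather than documented behavior.

```python
from pathlib import Path

# Extensions listed on this page as supported for Microsoft Word configurations.
SUPPORTED_WORD_EXTENSIONS = {".doc", ".docx", ".dot", ".dotx", ".docm", ".dotm"}


def is_supported_word_file(filename: str) -> bool:
    """Return True if the file name ends with one of the listed extensions."""
    # Assumption: extensions are compared case-insensitively,
    # so "REPORT.DOCX" is treated like "report.docx".
    return Path(filename).suffix.lower() in SUPPORTED_WORD_EXTENSIONS
```

For example, `is_supported_word_file("contract.docm")` returns `True`, while a `.txt` or `.pdf` file name returns `False`.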

Each section below lists the configuration options for one tab.

General Tab

Configuration Option | Description
Content Section | Extraction rules for document properties, headers, footers, calculated field text, tables of contents, and user comments.
Whitespaces and Symbols | Options to hide leading and trailing whitespace, convert sequences of multiple whitespace characters into markup, hide leading or trailing characters that are not letters or digits, and convert words containing no letters or digits into markup.
Text Segmentation | Options to enable SRX rules for text segmentation and to always split text at line breaks.
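To make the whitespace and line-break options more concrete, here is a minimal sketch of what such rules might do to extracted text. It is not the product's implementation: the function names and the `<ws/>` placeholder tag are hypothetical, and the real markup format may differ.

```python
import re

# Hypothetical placeholder standing in for "markup"; the real tag format is not documented here.
WS_PLACEHOLDER = "<ws/>"


def hide_outer_whitespace(text: str) -> str:
    """Drop leading and trailing whitespace from a piece of extracted text."""
    return text.strip()


def collapse_whitespace_runs(text: str) -> str:
    """Replace each run of two or more whitespace characters with one placeholder."""
    return re.sub(r"\s{2,}", WS_PLACEHOLDER, text)


def split_at_line_breaks(text: str) -> list[str]:
    """Split text into separate pieces at every line break."""
    return text.splitlines()
```

Under these assumptions, `collapse_whitespace_runs("Name:    John")` yields `"Name:<ws/>John"`, so a translator sees a single protected placeholder instead of a run of spaces.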

 

Do Not Translate Tab

Configuration Option | Description
  
  
  

 

Fonts Tab

Configuration Option | Description
  
  
  

 

Reduce Markup Tab

Configuration Option | Description
  
  
  

 

Embedded Files Tab

Configuration Option | Description
  
  
  