Developing a Common Interchange Model and Format for Representing Knowledge Synthesized from HLT Analytic Results

By Ransom Winder, Ph.D., Joseph Jubinski, Dr. Michael Smith

This paper describes a common interchange format and model designed to coordinate the extracted information from raw document sources in order to generate knowledge in the Human Language Technology (HLT) domain.

In the Human Language Technology (HLT) domain, analytic results extracted from raw document sources are captured in varied models and formats because of the depth of what can be revealed and the diversity of interpretation. For multiple analytics to operate together in workflows, however, they must follow a common model and format, one that supports both communication between analytics and the fusion of parallel or complementary results. This data integration problem is exacerbated when the emphasis is on extracting knowledge from text, because the data model must be adaptable and extensible enough to handle both current and emerging content extraction capabilities and technologies. This paper describes a common interchange format and model designed to coordinate the information extracted from raw document sources in order to generate knowledge. The approach adheres to the principles of adaptability and extensibility. It also provides the means to represent the annotation data that serve as the reference for the knowledge and to maintain provenance about these analytic results. Although the data model and format were designed for the HLT domain, the process used to develop them can be applied to other domains as well (e.g., image processing, signal processing).