Automatic semantic knowledge extraction from electronic forms

Author

Listed:
  • Haolin Wu
  • Tim French
  • Wei Liu
  • Melinda Hodkiewicz

Abstract

Electronic tabular forms are an intuitive way for organisations to collect, present and store structured information for human readers. Forms use features such as fonts, colours and cell positioning to help readers navigate and find information. Millions of forms, typically in Portable Document Format (PDF), are generated by businesses as part of routine operations. Unlike human readers, machines are not able to directly ‘understand’ the implicit cues contained in the fonts, colours and use of boxes without explicit processing. In this paper, a supervised computer vision model is proposed to decompose the PDF form document into nested microtables. The cells within these microtables are then processed using a customisable rule bank for meaningful table content and semantic relationship extraction. The process is demonstrated on an industry dataset of 37 maintenance procedure documents containing 373 pages and 1016 unique microtables. A web application EMU (Extracting Machine Understandable Semantics from Forms) demonstrates how data captured in tables with different dimensions in procedural forms can be automatically extracted and stored in JavaScript Object Notation (JSON). Identifying and extracting nested tables is a critical fundamental step for future applications to support machine-automated search and extraction of data at scale for both maintenance and other procedural documentation.
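The rule-bank and JSON steps described in the abstract can be illustrated with a minimal sketch. It is not the EMU implementation: the `Cell` and `Microtable` classes, the two rules and the output layout are all hypothetical names chosen for illustration, and the sketch assumes the vision model has already decomposed a page into nested microtables whose cells carry text, grid position and a bold flag.

```python
import json
import re
from dataclasses import dataclass, field


@dataclass
class Cell:
    """Hypothetical cell: grid position, text, and one visual cue (bold)."""
    row: int
    col: int
    text: str
    bold: bool = False


@dataclass
class Microtable:
    """Hypothetical microtable: its own cells plus any nested microtables."""
    cells: list
    children: list = field(default_factory=list)


def rule_label_colon(cell, table):
    """Illustrative rule: a single cell holding 'Label: value'."""
    match = re.match(r"^\s*([^:]+):\s*(.+)$", cell.text)
    if match:
        return match.group(1).strip(), match.group(2).strip()
    return None


def rule_bold_header_right(cell, table):
    """Illustrative rule: a bold cell labels the non-bold cell to its right."""
    if not cell.bold:
        return None
    for other in table.cells:
        if other.row == cell.row and other.col == cell.col + 1 and not other.bold:
            return cell.text.strip(), other.text.strip()
    return None


# The "rule bank" here is just an ordered list; the first matching rule wins.
RULE_BANK = [rule_label_colon, rule_bold_header_right]


def extract(table):
    """Apply the rule bank to every cell, then recurse into nested microtables."""
    fields = {}
    for cell in table.cells:
        for rule in RULE_BANK:
            hit = rule(cell, table)
            if hit:
                key, value = hit
                fields[key] = value
                break
    return {"fields": fields, "children": [extract(c) for c in table.children]}


# Made-up example form, not taken from the paper's dataset.
form = Microtable(
    cells=[
        Cell(0, 0, "Equipment", bold=True),
        Cell(0, 1, "Pump P-101"),
        Cell(1, 0, "Task ID: 42"),
    ],
    children=[
        Microtable(cells=[Cell(0, 0, "Step", bold=True), Cell(0, 1, "Isolate power")])
    ],
)

result = extract(form)
print(json.dumps(result, indent=2))
```

Keeping the rules as plain functions in a list is one way to make the bank customisable: a new form layout would need a new rule appended, not a change to the traversal.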

Suggested Citation

  • Haolin Wu & Tim French & Wei Liu & Melinda Hodkiewicz, 2024. "Automatic semantic knowledge extraction from electronic forms," Journal of Risk and Reliability, vol. 238(5), pages 933-944, October.
  • Handle: RePEc:sae:risrel:v:238:y:2024:i:5:p:933-944
    DOI: 10.1177/1748006X221098272

    Download full text from publisher

    File URL: https://journals.sagepub.com/doi/10.1177/1748006X221098272
    Download Restriction: no

