This application demonstrates the power of data extraction for filtering an archive of documents. Because the relevant data mentioned in the documents has been extracted, it can be used to filter the archive, so that relevant documents can be retrieved very quickly. Moreover, this data can be used directly to generate statistical reports about the documents in the archive.
The demo contains all published Dutch court rulings in the fields of violent crime and Article 2 of the Opium Act (hard drugs).
Start the demo, filter the rulings, and then plot one of the extracted attributes to quickly visualize its distribution in the archive.
The language technology used for this demo application is very similar to that used for the “Semantic Search and Filtering” demo. Again, the corpus is annotated automatically according to our Jurisprudence ontology, but this demo application does not contain a search box. It does, however, offer many more features that can be used to filter the data set, and a large share of these features are quantitative in nature.
In this demo, too, the section in which a feature is found determines its context and thus its meaning. In the user interface, the features are divided into three groups: those found anywhere in the file/document, those found only in the indictment, and those found only in the statement of evidence. The application also contains a widget that plots a selected feature in a graph, so that its distribution over the (filtered) documents can be seen at a glance.
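The effect of section-scoped features can be illustrated with a small Python sketch. It is not the demo's implementation: the `Feature` record, the section labels, the attribute names and the sample rulings are all hypothetical. It only shows how the same attribute can select different documents depending on the section in which it was found, and how a distribution can be counted over the filtered set.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Feature:
    """One extracted value, tagged with the section it was found in (hypothetical model)."""
    section: str   # assumed labels: "document", "indictment", "evidence"
    name: str
    value: object


# Made-up sample data, for illustration only.
RULINGS: Dict[str, List[Feature]] = {
    "ruling-1": [
        Feature("indictment", "weapon", "knife"),
        Feature("evidence", "weapon", "knife"),
        Feature("document", "sentence_months", 12),
    ],
    "ruling-2": [
        Feature("indictment", "weapon", "firearm"),
        Feature("evidence", "weapon", "knife"),
        Feature("document", "sentence_months", 24),
    ],
    "ruling-3": [
        Feature("indictment", "weapon", "knife"),
        Feature("document", "sentence_months", 12),
    ],
}


def filter_rulings(
    rulings: Dict[str, List[Feature]],
    section: str,
    name: str,
    predicate: Callable[[object], bool],
) -> Dict[str, List[Feature]]:
    """Keep rulings that have a matching feature in the given section."""
    return {
        ruling_id: features
        for ruling_id, features in rulings.items()
        if any(
            f.section == section and f.name == name and predicate(f.value)
            for f in features
        )
    }


def distribution(rulings: Dict[str, List[Feature]], section: str, name: str) -> Counter:
    """Count how often each value of a feature occurs in the given section."""
    return Counter(
        f.value
        for features in rulings.values()
        for f in features
        if f.section == section and f.name == name
    )


# The same attribute selects different rulings depending on the section.
charged_with_knife = filter_rulings(RULINGS, "indictment", "weapon", lambda v: v == "knife")
knife_in_evidence = filter_rulings(RULINGS, "evidence", "weapon", lambda v: v == "knife")

# Distribution of a quantitative feature over the filtered set.
sentence_counts = distribution(charged_with_knife, "document", "sentence_months")
```

Passing `sentence_counts` to any plotting library would give the kind of graph the widget shows; the counting step is the part that depends on the extracted data.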
Data extraction from documents containing “plain text” (so-called natural language) has many applications. Think, for example, of finding the claim amount in e-mail complaints in order to prioritize their handling, or extracting the locations of specific news events in order to plot them geographically.
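As a rough illustration of the e-mail example, the following Python sketch pulls an amount out of each complaint with a regular expression and sorts the complaints by it. The pattern, the function name `extract_claim_amount` and the sample messages are assumptions made for this sketch. It only recognizes amounts written like "EUR 1250" or "€49.95", without thousands separators; real extraction from natural language needs far more than a single pattern.

```python
import re
from typing import List, Optional

# Assumed notation: a euro sign or "EUR", then digits with an optional decimal part.
AMOUNT_RE = re.compile(r"(?:€|EUR)\s*(\d+(?:\.\d{1,2})?)")


def extract_claim_amount(text: str) -> Optional[float]:
    """Return the largest amount mentioned in the text, or None if there is none."""
    amounts = [float(match) for match in AMOUNT_RE.findall(text)]
    return max(amounts) if amounts else None


def prioritize(messages: List[str]) -> List[str]:
    """Sort messages by extracted amount, highest first; messages without an amount go last."""
    return sorted(messages, key=lambda m: extract_claim_amount(m) or 0.0, reverse=True)


# Made-up sample complaints.
emails = [
    "Please refund €49.95 for the missing cable.",
    "Your service was slow last week.",
    "I demand compensation of EUR 1250 for the broken device.",
]

ordered = prioritize(emails)
```

Here the complaint mentioning EUR 1250 ends up first and the one without any amount ends up last.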
Once the data has been extracted from the documents, it becomes quantifiable and therefore suitable for all kinds of automatic processing. In effect, data extraction converts a text into a model that can be stored in a database. This allows text documents to be processed further in the same way as other forms of structured data.
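To make the database step concrete, here is a minimal Python sketch that stores extracted attributes in an in-memory SQLite database and runs a simple statistical query over them. The table layout, the attribute names and the rows are hypothetical and are not taken from the demo.

```python
import sqlite3

# In-memory database; the schema is an assumption made for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE extracted (doc_id TEXT, attribute TEXT, value TEXT)")

# Made-up extraction results: one row per attribute found in a document.
rows = [
    ("doc-1", "drug", "cocaine"),
    ("doc-1", "quantity_grams", "50"),
    ("doc-2", "drug", "heroin"),
    ("doc-3", "drug", "cocaine"),
]
conn.executemany("INSERT INTO extracted VALUES (?, ?, ?)", rows)

# Filtering: which documents mention cocaine?
cocaine_docs = [
    doc_id
    for (doc_id,) in conn.execute(
        "SELECT doc_id FROM extracted WHERE attribute = ? AND value = ? ORDER BY doc_id",
        ("drug", "cocaine"),
    )
]

# Reporting: how often does each drug occur in the archive?
report = conn.execute(
    "SELECT value, COUNT(*) FROM extracted "
    "WHERE attribute = ? GROUP BY value ORDER BY value",
    ("drug",),
).fetchall()

conn.close()
```

Once the values are rows in a table, filtering and reporting become ordinary queries, just as with any other structured data.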