Syntactica Solutions for TEI
The Text Encoding Initiative (TEI) has clearly been one of the most successful standards in the historical community. TEI allows historical documents to be marked up with tags that provide precise meaning for key entities such as dates, locations, people and organizations.
But since the 1980s many of the most historically significant artifacts in our world have moved from the realm of paper to digital documents. Today the digital historian and archivist must not only capture digital artifacts but also make these historically significant documents available through high-quality online search and retrieval tools.
And the expectations of users of our archives are growing. Not only do users want simple keyword search, but they expect relevant documents to appear quickly and they want the most relevant documents to appear first in the result sets.
Thanks to a wide variety of conversion tools, many documents that were previously locked in vendor-specific formats can now be easily converted to TEI documents that retain complex document structure. This document structure is critical for high-precision, high-relevance search.
But many TEI users are working with limited budgets and do not have the resources to purchase complex databases and hire full-time software developers to create and maintain high-precision document markup systems.
TEI users need easy-to-use tools that can be customized to their needs without custom software. They need solutions that do not require complex programming languages such as Java, .Net, C# or Python. They want solutions that allow them to drag-and-drop XML files into collections and allow fast but precise retrieval over very large document collections.
The Syntactica XRX application architecture is perfectly suited to these requirements. With the XRX web application architecture you will never need to learn about creating middle tier objects, relational databases or transforming XML to and from these other formats.
Syntactica offers a full range of training, tools and processes to do the most with limited budgets. We focus on open source software systems that can be quickly customized to the needs of the digital historian. Our solutions make it easy for non-programmers to handle all phases of the import, analysis and export of TEI documents from a wide variety of sources. At Syntactica our philosophy is to empower digital historians with tools that make them more productive without breaking the bank.
Key Benefits of TEI XQuery Frameworks
There are several benefits to using the Syntactica architecture for managing historical documents using TEI XML standards. This architecture includes:
- Open Source (free) Native XML database (based on eXist-db.org release 1.4) that lowers total cost of ownership.
- RESTful web service interfaces to make it easy to use, test and integrate with other systems.
- A library of tools customized for transforming TEI documents to keep your staff productive.
- A library of tools to manage databases of extracted entities so that your inline TEI entities reference consistent IDs.
Together, this architecture (the XRX Web Application Architecture) combined with a library of TEI entity managers allows organizations to gain rich functionality at very low cost. The entire system will be released using less restrictive (non-viral) open source licenses (LGPL and Apache 2.0 style licenses).
Conversion-Free Data Formats
Traditional Relational Database Systems require users to chop up XML documents into rows so that they can be inserted into tables for performing search and retrieval. With the Syntactica architecture:
- All XML documents are added to the eXist system using a simple drag-and-drop process.
- Documents do not have to be “shredded” into SQL tables for indexing and lookup.
- Documents stay in their native XML format and can be quickly indexed for fast analysis.
- Any well-formed XML files (documents or data sets) can be imported into these databases with very little effort.
Library of TEI XQuery Transformations
Many XQuery transformations have already been written to transform TEI into other formats such as HTML, PDF, timelines and other structures. Because TEI has consistent guidelines for how key entities like people, locations and dates are coded within documents, libraries of TEI functions can be easily shared with and enhanced by other members of the TEI community.
Ease of Reporting with XQuery
It is easy to extract reports on entities from TEI documents. For example, a query to find all dates or places in a TEI document is only a few lines of code. Syntactica provides a library of reporting templates to list items, search items and view items of many different types.
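As a rough illustration of how short such a report can be, the sketch below lists the dates and places found in a small TEI fragment. The inline sample document, its content, and the `report` and `item` element names are made up for this example, and the query assumes TEI P5 markup in the `http://www.tei-c.org/ns/1.0` namespace. It is not taken from the Syntactica template library.

```xquery
xquery version "1.0";
declare namespace tei = "http://www.tei-c.org/ns/1.0";

(: Made-up sample document; in practice this would be a TEI file stored in the database. :)
let $doc :=
  <TEI xmlns="http://www.tei-c.org/ns/1.0">
    <text>
      <body>
        <p>The letter was written on <date when="1850-05-01">1 May 1850</date>
           in <placeName>Boston</placeName>.</p>
        <p>A reply was sent from <placeName>Albany</placeName> and later
           forwarded to <placeName>Boston</placeName>.</p>
      </body>
    </text>
  </TEI>
return
  <report>
    <dates>{
      (: one item per date element, keeping its normalized @when value if present :)
      for $d in $doc//tei:date
      return <item when="{$d/@when}">{normalize-space($d)}</item>
    }</dates>
    <places>{
      (: each distinct place name once, in alphabetical order :)
      for $p in distinct-values($doc//tei:placeName/normalize-space(.))
      order by $p
      return <item>{$p}</item>
    }</places>
  </report>
```

To run the same report over stored documents, the inline sample would be replaced by a call such as `doc()` or `collection()` with a path that exists in your own database.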
Consistent URL (REST) Interfaces
All TEI documents and TEI extracted entities can have consistent interfaces with external sites using simplified URLs. For example you can create an interface that allows a person, location or term to be stored in a bookmark so that all future documents that reference this entity can be quickly identified.
High Quality Fulltext Search and Retrieval with Integrated Structured Search Rules
The eXist 1.4 release (October 2009) included a full library of search and retrieval tools built around the Apache Lucene indexing framework. These tools allow for very fast fulltext index management with highly customizable document scoring systems. Syntactica has also worked to make these indexing tools easier for non-technical users to access.
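A minimal sketch of what such a search might look like in eXist is shown below. It only runs inside eXist, and it assumes that a Lucene index has already been configured for `tei:sp` elements in the collection's index configuration. The `q` URL parameter, the `/db/tei/plays` collection path and the `results` and `hit` element names are hypothetical choices made for this illustration.

```xquery
xquery version "1.0";
import module namespace kwic = "http://exist-db.org/xquery/kwic";
declare namespace tei = "http://www.tei-c.org/ns/1.0";

(: Hypothetical names: the "q" URL parameter and the collection path. :)
let $q := request:get-parameter("q", "")[1]
let $hits :=
  if ($q = "") then ()
  else collection("/db/tei/plays")//tei:sp[ft:query(., $q)]
return
  <results query="{$q}" count="{count($hits)}">{
    for $hit in $hits
    let $score := ft:score($hit)
    (: most relevant speeches first :)
    order by $score descending
    return
      <hit score="{$score}">{
        (: keyword-in-context snippet, about 40 characters either side of the match :)
        kwic:summarize($hit, <config width="40"/>)
      }</hit>
  }</results>
```

Because the query reads its search term from a URL parameter, the same file would act as both the search page back end and a reusable web service.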
Document Indexing for Fast Data Access
The eXist system uses information within the TEI XML files to create highly efficient storage and indexing of large collections of documents. Native XML systems leverage the metadata within documents to create very efficient indexing systems so even large collections of hundreds of thousands of documents can be managed quickly and cost effectively.
Easy Faceted (Drilldown) Searching
When any TEI document is displayed in a web page, a simple XQuery over the document's entities can be used to show the key items referenced anywhere in the document. For example, the right side of a page can include lists of people, places, dates or events mentioned in the document. These lists can include links to other documents that also mention these entities. This makes it easy for researchers to quickly navigate to similar documents.
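One way such a sidebar list could be produced is sketched below: the query collects the people named in a document, counts the mentions, and links each name to a search page. The sample document, the `search.xq` page and its `person` parameter are hypothetical, and TEI P5 namespaced markup is assumed.

```xquery
xquery version "1.0";
declare namespace tei = "http://www.tei-c.org/ns/1.0";

(: Made-up sample; in practice this is the document currently being displayed. :)
let $doc :=
  <TEI xmlns="http://www.tei-c.org/ns/1.0">
    <text>
      <body>
        <p><persName>Hamlet</persName> speaks with <persName>Horatio</persName>.</p>
        <p><persName>Hamlet</persName> then leaves.</p>
      </body>
    </text>
  </TEI>
let $names := $doc//tei:persName/normalize-space(.)
return
  <ul class="facet-people">{
    for $n in distinct-values($names)
    let $count := count($names[. = $n])
    (: most frequently mentioned people first, ties broken alphabetically :)
    order by $count descending, $n
    return
      <li><a href="search.xq?person={encode-for-uri($n)}">{$n}</a> ({$count})</li>
  }</ul>
```

The same pattern applies to `tei:placeName` or `tei:date` by changing the element name in the path expression.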
Data Quality Reporting
XQuery makes it easy to create reports that look for data quality problems. For example a very simple XQuery program can be used to check for valid date formats in an entire collection of documents. We find that a flexible library of data quality reporting tools makes it easier for organizations with limited funding to create tools to quickly identify and correct inconsistencies in complex data sets that have been collected from many different sources.
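The date check mentioned above might look something like the following sketch. It assumes dates are recorded in the TEI `when` attribute and accepts only full dates, year-month values and plain years, which is a subset of the formats TEI permits. The sample documents, the `local:valid-when` function and the `invalid-date` element are invented for illustration.

```xquery
xquery version "1.0";
declare namespace tei = "http://www.tei-c.org/ns/1.0";

(: Hypothetical helper: true if the value is a full date, a year-month or a year. :)
declare function local:valid-when($v as xs:string) as xs:boolean {
  $v castable as xs:date
    or $v castable as xs:gYearMonth
    or $v castable as xs:gYear
};

(: Made-up sample documents; a real report would use collection() instead. :)
let $docs := (
  <TEI xmlns="http://www.tei-c.org/ns/1.0" xml:id="doc1">
    <text><body><p><date when="1599-09-21"/> and <date when="1599"/></p></body></text>
  </TEI>,
  <TEI xmlns="http://www.tei-c.org/ns/1.0" xml:id="doc2">
    <text><body><p><date when="1599-02-30"/> and <date when="21/09/1599"/></p></body></text>
  </TEI>
)
return
  <date-quality-report>{
    for $d in $docs//tei:date[@when]
    where not(local:valid-when($d/@when))
    return <invalid-date doc="{$d/ancestor::tei:TEI/@xml:id}" when="{$d/@when}"/>
  }</date-quality-report>
```

With this sample data, only the two dates in the second document are reported: one is an impossible calendar date and the other is not in year-month-day order.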
Customization by Non-Programmers
Because XQuery is based on simple XPath expressions, we feel that it is much easier for non-programmers to create and maintain customized reports. Syntactica provides a library of template applications that allow non-programmers with some training to build and maintain their own web applications. With minimal training, many non-programmers and subject-matter experts can become key contributors to development projects.
High Developer Productivity
One of the central benefits of using this architecture is very high developer productivity. Developers who are familiar with XML structures and XQuery can create new web applications much faster than with other technologies. There are several reasons for this exceptionally high productivity, including the ability to avoid data translation and to leverage a large base of sample software that can be quickly modified. This productivity is growing each month as the XQuery/TEI community grows and more open-source applications are shared.
Taxonomy Management Tools
Syntactica has developed frameworks of tools to manage databases of key entities that can be consistently referenced within TEI documents. Examples of these entities include people, locations, terms, products or events. These entity managers can be quickly customized to meet your needs.
Data Quality Management Functions
Syntactica has developed a library of tools to perform many data cleanup functions. These functions are usually used to standardize or transform various documents into standard formats with consistent references to defined entities.
XML Service Enablement
All reports and queries are inherently web services. This means that these services can be re-used to create new web applications with very little effort. These rapidly built applications, also known as mashups, allow researchers to quickly create new reports or visualizations of large and complex data sets.
Automated Document Versioning
The eXist-db 1.4 system also includes a full XML document versioning system that records changes whenever documents are updated. The stored versions provide a view into prior versions of XML documents, with reports that show line-by-line differences between versions as well as who changed each document and when the changes were made.
Workflow and Publishing Functions
Syntactica has developed, and continues to develop, tools that make it easy to create and manage the overall software lifecycle, including requirements management, use cases, business terminology, role-based access control, task management, workflow management and publishing to external web sites.
Roadmap of Enhancements
Syntactica is actively working with several organizations to make it easier for them to have full turn-key solutions to manage TEI documents. The list of enhancements includes:
- Site role management application – defining site-wide roles and associating individuals with content collection roles (authoring, editing, approving etc.)
- Role based access control – assigning permissions based on a person’s roles
- Remote publishing – allowing publishing to a public web site with a single click
- Database synchronization – allows developers to automatically sync with central systems
- Source-code versioning integration – better integration with version control systems such as Subversion
- Streamlined web content management tools (news updates, FAQs, Glossary of Terms etc.)
TEI Demonstration Application
We have been working with Dr. Martin Mueller at Northwestern University to create a series of Beginners Guides that help TEI users understand the simplicity, elegance and power behind using native XML databases and the XQuery language. Dr. Mueller has kindly provided us with around 40 plays and poems of William Shakespeare that have been coded in TEI. The following is a link to a demonstration application:
http://www.syntactica.com/rest/db/org/northwestern/apps/tei/index.xq
Please note some of the following features in this demonstration:
- We created a simple navigation page by writing a simple "collection" query of all plays. This query simply displays all the titles of the plays and presents users with a link to each play or other reports for that play. You can also view the TEI files directly with this link. We recommend using the Firefox web browser to view the XML files.
- We created TEI to HTML Viewers for each Act and Scene. Please note that each of these viewers takes an ID as a URL parameter. This demonstrates that each XQuery is a complete REST service that uses a URL parameter to select which act or scene is viewed.
- We created a simple rule-based TEI to HTML transform using the powerful XQuery typeswitch expression. This transform defines a function for each TEI tag that you would like to convert to HTML. With a day of training, users can quickly learn to modify this transform to include or exclude TEI elements in the conversion to HTML or XSL-FO for printing.
- We created a simple search and retrieval tool using the Lucene fulltext indexing system.
- The search results use the eXist Key Word in Context (KWIC) module to highlight matches.
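The typeswitch-based transform mentioned in the list above follows a well-known XQuery pattern, which can be sketched roughly as follows. This is not the source code of the demonstration application: the `local:transform` function name, the chosen HTML output and the sample scene markup are assumptions, and for brevity each tag is handled inline in a `case` clause rather than in its own function.

```xquery
xquery version "1.0";
declare namespace tei = "http://www.tei-c.org/ns/1.0";

(: Hypothetical rule-based transform: one case clause per TEI tag of interest. :)
declare function local:transform($nodes as node()*) as item()* {
  for $node in $nodes
  return
    typeswitch ($node)
      case text() return $node
      case element(tei:sp) return
        <div class="speech">{local:transform($node/node())}</div>
      case element(tei:speaker) return
        <b>{local:transform($node/node())}</b>
      case element(tei:stage) return
        <i>{local:transform($node/node())}</i>
      case element(tei:l) return
        (local:transform($node/node()), <br/>)
      (: any other element: drop the tag but keep processing its children :)
      case element() return local:transform($node/node())
      (: comments and processing instructions are ignored :)
      default return ()
};

(: Small sample scene; a real viewer would select the scene by an ID URL parameter. :)
let $scene :=
  <div xmlns="http://www.tei-c.org/ns/1.0" type="scene">
    <stage>Enter Bernardo and Francisco</stage>
    <sp><speaker>Bernardo</speaker><l>Who's there?</l></sp>
    <sp><speaker>Francisco</speaker><l>Nay, answer me: stand, and unfold yourself.</l></sp>
  </div>
return
  <div class="scene">{local:transform($scene/node())}</div>
```

Adding support for another TEI element means adding one more `case` clause, and removing a clause lets that element fall through to the generic rule, which is what makes this style of transform approachable for newcomers.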
We feel that the way that TEI documents are coded can have a dramatic impact on the ability to create simple XQueries that navigate the document structures. We feel that anyone who is charged with coding TEI files should have some experience creating XQuery transformations of these files.
Please contact us if you are interested in seeing the full source code for this application. We are currently in discussions concerning our ability to distribute the TEI files in a training system.
We are currently working with Dr. Mueller and Dr. Joe Wicentowski of the US State Department to create additional documentation for beginners based on this example material.