The aim of the overall project is to present a concept for using OCR to process digitised versions of Germany’s printed cultural heritage from the 16th to the 19th century. Additionally, the existing prototype OCR software will be developed further during Phase III.

In recent years libraries – especially research libraries – have undertaken a comprehensive programme of digitisation to convert their holdings into image form. However, access to the full electronic text is often impossible or inadequate, and OCR procedures are necessary in order to automatically generate searchable full texts from the image data.

Digital full texts are now essential to numerous academic disciplines, especially in the humanities. The OCR-D funding initiative seeks to advance the development of full text recognition and to optimise it for bulk digitisation in libraries.

The project is a collaboration coordinated jointly by the Herzog August Bibliothek, the Berlin-Brandenburgische Akademie der Wissenschaften (Berlin-Brandenburg Academy of Sciences and Humanities, BBAW), the Gesellschaft für wissenschaftliche Datenverarbeitung Göttingen (Göttingen Society for Scientific Data Processing, GWDG), the Staatsbibliothek zu Berlin (Berlin State Library, SBB) and the Niedersächsische Staats- und Universitätsbibliothek Göttingen (Göttingen State and University Library of Lower Saxony, SUB).

During Phase I (2015–2018) the state of current OCR techniques was evaluated and development needs were identified. On this basis, the eight project modules of Phase II (2018–2020) developed OCR-D tools dedicated to the specific challenges of full text recognition in historical documents. The prototype results integrated by the coordination project are freely accessible on GitHub.

Figure: Page from Johannes Praetorius's »Eine nützliche Spiel-Karte für die Flucher«, published in 1671 (M: Tg 117), correctly segmented at region and line level and with its full text recognised.

Phase III launched in 2021. The aim of this phase is to implement the OCR-D software in institutions maintaining and processing text holdings and to advance the development of selected tools. Four implementation projects and three project modules have been approved by the Deutsche Forschungsgemeinschaft (German Research Foundation, DFG):

Implementation projects

Integration of Kitodo and OCR-D for productive mass digitisation (UB Braunschweig, SLUB Dresden, UB Mannheim)

OPERANDI: OCR-D Performance Optimisation and Integration (SUB Göttingen, GWDG)

OCR4all libraries – full text recognition for historical collections (GEI Braunschweig, HCI and ZPD at Universität Würzburg)

ODEM: OCR-D extension for mass digitisation (ULB Sachsen-Anhalt)

Project modules

Workflow for work-specific training based on generic models using OCR-D and Ground Truth enhancement (UB Mannheim)

Font Group Recognition for Improved OCR (JGU Mainz, FAU Erlangen-Nürnberg)

OLA-HD Service – a generic service for the long-term archiving of historical prints (SUB Göttingen, GWDG)

The coordination project supports the implementation projects and project modules in their work. Phase III will also optimise the OCR-D software for bulk digitisation and develop a concept for continuing the work.

The HAB is responsible for coordinating the project: this includes project management, organising workshops, documentation, scholarly publication and preparation of a concept for the full-text transformation of VD 16, VD 17 and VD 18.

Website: https://ocr-d.de/de/

In cooperation with BBAW, GWDG, SBB and SUB

PURL: http://diglib.hab.de/?link=068

Funding: DFG
Duration: October 2015 – June 2024
Project participant: Lena Hinrichsen (team member)