Synthetic Renaissance text generation with generative models

Description

Transliteration of text from centuries-old works is a research area that is underserved by current tools, such as Adobe Acrobat’s OCR. While these tools can recognize text in clearly printed modern sources, they cannot reliably extract text from early forms of print, much less from manuscripts. This project aims to enhance text recognition capabilities by using generative models to simulate Renaissance-era printing imperfections and to augment OCR training datasets. The project will focus on Spanish printed sources from the 17th century.

Duration

Total project length: 175 hours

Task ideas

Use generative models (GANs or diffusion models) to generate synthetic Renaissance-style printed text with realistic printing defects, ink bleed, and paper degradation.
Simulate historical printing techniques and fine-tune models with historical text datasets.
Use the synthetic printed text to improve current models.
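
As a point of reference for the task ideas above, the following is a minimal procedural sketch of the kinds of defects mentioned (ink bleed, uneven inking, paper degradation). It is not a generative model and not the project's method. It is a hand-written NumPy baseline that a GAN or diffusion model would be expected to improve on. The function names (`box_blur`, `degrade_page`) and all parameter values are hypothetical and chosen only for illustration.

```python
import numpy as np


def box_blur(img: np.ndarray, radius: int) -> np.ndarray:
    """Blur a 2-D float image with a (2*radius+1)^2 box filter, edge-padded."""
    if radius <= 0:
        return img.astype(np.float64).copy()
    k = 2 * radius + 1
    padded = np.pad(img.astype(np.float64), radius, mode="edge")
    h, w = img.shape
    out = np.zeros((h, w), dtype=np.float64)
    # Sum all shifted copies of the image that fall inside the k x k window.
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + h, dx:dx + w]
    return out / (k * k)


def degrade_page(ink: np.ndarray, seed: int = 0, bleed_radius: int = 1,
                 dropout: float = 0.05, paper_noise: float = 0.08) -> np.ndarray:
    """Turn a clean ink mask into a degraded grayscale page.

    ink: 2-D array in [0, 1], where 1 means full ink.
    Returns an array of the same shape in [0, 1], where 1 means white paper.
    All default values are illustrative assumptions, not calibrated settings.
    """
    rng = np.random.default_rng(seed)
    h, w = ink.shape

    # Ink bleed: spread ink into neighbouring pixels, then boost and clip.
    bled = np.clip(box_blur(ink, bleed_radius) * 1.5, 0.0, 1.0)

    # Uneven inking: randomly drop a small fraction of pixels.
    bled = bled * (rng.random((h, w)) >= dropout)

    # Paper degradation: blurred coarse blotches (stains) plus fine grain.
    coarse = rng.random((h // 8 + 1, w // 8 + 1))
    stains = box_blur(np.kron(coarse, np.ones((8, 8)))[:h, :w], 4)
    paper = 0.9 - paper_noise * stains + rng.normal(0.0, 0.02, (h, w))

    # Ink darkens the paper multiplicatively.
    return np.clip(paper * (1.0 - bled), 0.0, 1.0)


# Usage: a solid rectangle stands in for a rendered line of text.
clean = np.zeros((32, 64))
clean[10:20, 10:50] = 1.0
page = degrade_page(clean, seed=42)
```

In practice the ink mask would come from rendering a period-appropriate typeface, and pairs of clean and degraded images like these could serve as a comparison baseline for the learned models.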

Expected results

Develop a generative model capable of producing Renaissance-style printed text with realistic degradation effects.
Generate a dataset of synthetic Renaissance text.
Improve OCR model performance on historical Spanish texts using the synthetic text, achieving at least 80% text extraction accuracy.
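
The accuracy target above does not specify a metric. One common choice, assumed here purely for illustration, is character-level accuracy defined as one minus the character error rate (edit distance divided by the length of the reference transcription). The sketch below shows that computation; the function names and the sample strings are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def char_accuracy(reference: str, hypothesis: str) -> float:
    """Character accuracy = 1 - CER, floored at 0 (assumed metric, see lead-in)."""
    if not reference:
        return 1.0 if not hypothesis else 0.0
    cer = levenshtein(reference, hypothesis) / len(reference)
    return max(0.0, 1.0 - cer)


# Usage: compare an OCR output against a ground-truth transcription.
truth = "En un lugar de la Mancha"
ocr_output = "En vn lugar de la Mancba"
accuracy = char_accuracy(truth, ocr_output)
meets_target = accuracy >= 0.80
```

Word-level accuracy is an equally plausible reading of the target and would give lower numbers on the same output, so the chosen metric should be stated alongside any reported result.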

Requirements

Python and some previous experience in Machine Learning.

Difficulty level

Medium

Mentors

Sergei Gleyzer (University of Alabama)
Xabier Granja (University of Alabama)
Nicholas Jones (Yale University)
Harrison Meadows (University of Tennessee Knoxville)
Emanuele Usai (University of Alabama)

Please DO NOT contact mentors directly by email. Instead, please email human-ai@cern.ch with the project title, and include your CV and test results. The mentors will then get in touch with you.

Corresponding Project

RenAIssance