Transcribing text from centuries-old works is a research area that is underserved by current tools, such as Adobe Acrobat’s OCR. While these tools can recognize text in clearly printed modern sources, they cannot extract textual data from early forms of print, much less manuscripts. This project will focus on applying hybrid end-to-end models based on transformers (e.g., ViT-RNN, CNN-TF, or ViT-TF) to recognize text in Spanish printed sources from the seventeenth century. Transformer-based models have been fine-tuned to improve transcription accuracy, particularly for degraded and complex historical texts. Training on a diverse dataset that combines expert transcriptions and synthetic data has enabled better generalization across typographical styles.
For GSoC 2025, this project aims to expand the dataset so that the model can be fine-tuned to handle handwritten documents as well. The goal is to extend fine-tuning to larger datasets that incorporate diverse typographical styles, both printed and handwritten.
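The description above refers to transcription accuracy without naming a metric. One common way to measure it in OCR work is the character error rate (CER): the edit distance between the model output and an expert transcription, divided by the length of the expert transcription. The following is a minimal sketch under that assumption. The function names and the sample strings are illustrative and are not taken from the project's codebase.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    # prev[j] holds the distance between the processed prefix of
    # `reference` and the first j characters of `hypothesis`.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            curr.append(min(
                prev[j] + 1,         # delete ref_char
                curr[j - 1] + 1,     # insert hyp_char
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = edit distance / number of characters in the reference."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Made-up example strings: one substituted character out of four.
    print(character_error_rate("dixo", "dijo"))  # 0.25
```

Lower values are better, and a CER of 0.0 means the output matches the reference exactly. The same routine applies unchanged to printed and handwritten samples, since it compares only the resulting strings.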
Total project length: 175 hours
Python and some previous experience in Machine Learning.
Advanced
Please DO NOT contact mentors directly by email. Instead, please email human-ai@cern.ch with Project Title and include your CV and test results. The mentors will then get in touch with you.