[WIP]

Datasets


Medieval Dataset

The following datasets are included in the current version of the HuggingFace dataset, as described in the CATMuS Medieval paper.

All data were passed through Choco Mufin. Any attempt to replicate the HuggingFace dataset should use the conversion table provided with each dataset. The conversion table for the modified dataset will be made available.
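As a rough illustration of what applying a conversion table involves, the sketch below replaces each character that has an entry in a mapping and leaves every other character untouched. The function name `apply_conversion_table`, the dictionary-based table, and the two sample entries are hypothetical. They are not Choco Mufin's actual file format or API, and they are not taken from the CATMuS conversion tables.

```python
def apply_conversion_table(text: str, table: dict) -> str:
    """Replace every character listed in `table` by its mapped string.

    Characters without an entry are kept unchanged.
    """
    return "".join(table.get(ch, ch) for ch in text)


# Hypothetical entries, for illustration only (not from the CATMuS tables):
# long s -> "s", Tironian et -> "et".
EXAMPLE_TABLE = {"ſ": "s", "⁊": "et"}

normalized = apply_conversion_table("⁊ ſic", EXAMPLE_TABLE)
```

In practice the table shipped with each dataset should be used with Choco Mufin itself; this sketch only shows the character-level substitution idea.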

| Language | URL | Version |
| --- | --- | --- |
| Latin | DEEDS-Project/htr-dataset | v0.0.8 |
| Latin | HTRomance-Project/medieval-latin | v0.0.8 |
| Spanish Languages | HTRomance-Project/middle-ages-in-spain | v0.0.6 |
| Old/Middle French | HTRomance-Project/medieval-french | v0.0.9 |
| Italian Languages | HTRomance-Project/medieval-italian | v1.0.2 |
| Old/Middle French | HTR-United/cremma-medieval | v2.0.1 |
| Latin | HTR-United/cremma-medieval-lat | v0.1.2 |
| Old/Middle French | Gallicorpora/HTR-imprime-gothique-16e-siecle | v0.0.19 |
| Old/Middle French | Gallicorpora/HTR-MSS-15e-Siecle | v0.0.37 |
| Old/Middle French | Gallicorpora/HTR-incunable-15e-siecle | v0.0.29 |
| Old/Middle French | ciham-htr/fabliaux | v0.0.22 |
| Old/Middle French | ciham-htr/liber | v0.0.5 |
| Latin/Italian/French | Reorganized from HN2021 Boccace | last |
| Old/Middle French | Reorganized [Decameron-Fr] | last |
| Latin | Reorganized malamatenia/Eutyches | last |
| Spanish Languages | Reorganized & Augmented Gille-Levenson's PhD Data | last |
| Latin | Adapted rescribe/carolineminuscule-groundtruth | 2023-10-03 |
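For anyone trying to pin the same versions, the sketch below records a few of the entries listed above and derives a download URL for each. It assumes that each `owner/name` value is a GitHub repository path and that the listed version is a git tag in that repository. Neither assumption is confirmed by this page, and the names `Source`, `SOURCES`, and `archive_url` are illustrative.

```python
from typing import NamedTuple


class Source(NamedTuple):
    language: str
    repo: str      # "owner/name" as listed in the url column
    version: str   # version string as listed in the version column


# A few rows copied from the listing above.
SOURCES = [
    Source("Latin", "DEEDS-Project/htr-dataset", "v0.0.8"),
    Source("Old/Middle French", "HTR-United/cremma-medieval", "v2.0.1"),
    Source("Latin", "HTR-United/cremma-medieval-lat", "v0.1.2"),
]


def archive_url(source: Source) -> str:
    """Build a GitHub tag-archive URL.

    Assumption: the repository lives on GitHub and `version` is a tag name.
    """
    return (
        f"https://github.com/{source.repo}"
        f"/archive/refs/tags/{source.version}.zip"
    )


urls = [archive_url(s) for s in SOURCES]
```

Entries marked "last" or listed as Reorganized, Adapted, or Augmented do not fit this pattern and would need to be handled separately.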

Vocabulary:

  • Adapted: Images were replaced or rescaled.
  • Reorganized: Data were reorganized in a way that made them more usable.
  • Augmented: Data whose digitizations were not publicly available were used to extract lines.