Text digitization

This document groups my notes from the task of digitizing the book "les mots de l'écrivain".

Redigitization

I was no longer sure about the quality of the markdown files I had, which were a mix of manual digitization and digitization from PNG images. I decided to extract the PNG images from the book again. This time I used the one-column format, which improved the quality of the digitization.

I regenerated the markdown files and manually added headers before calling the perchance-tools. I have completed the digitization and the word corrections, and I stored everything on Dropbox.

Cache mechanism

I am using a local cache mechanism to store results previously computed by the prompts.
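A minimal sketch of what such a cache could look like, assuming results are stored in a single JSON file keyed by a hash of the model name and the prompt text. The class name `PromptCache`, the file layout, and the `compute` callback are hypothetical and only illustrate the idea; they are not the actual implementation used for this project.

```python
import hashlib
import json
from pathlib import Path


class PromptCache:
    """File-backed cache keyed by a hash of (model, prompt). Hypothetical sketch."""

    def __init__(self, path):
        self.path = Path(path)
        # Load previously stored results if the cache file already exists.
        if self.path.exists():
            self.data = json.loads(self.path.read_text(encoding="utf-8"))
        else:
            self.data = {}

    @staticmethod
    def key(model, prompt):
        # The same model and prompt always map to the same key.
        raw = f"{model}\n{prompt}".encode("utf-8")
        return hashlib.sha256(raw).hexdigest()

    def get_or_compute(self, model, prompt, compute):
        """Return the cached result, calling compute(prompt) only on a miss."""
        k = self.key(model, prompt)
        if k not in self.data:
            self.data[k] = compute(prompt)
            # Persist after every miss so results survive between runs.
            self.path.write_text(
                json.dumps(self.data, ensure_ascii=False, indent=2),
                encoding="utf-8",
            )
        return self.data[k]
```

Here `compute` stands for whatever function actually sends the prompt to the model, so a repeated prompt costs nothing after the first call.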

GPT-4o prompt

At first I tried to split the correction task into several prompts because GPT-3.5 was not performing well. I could have used GPT-4, whose results were indeed very good, but it was 20x more expensive.

However, I decided that for this very specific task, which is word correction from the digitization of a thesaurus book, I would rely on the most advanced GPT model available. That decision came at a good time because OpenAI has just launched GPT-4o, which is half the price of GPT-4.

ChatGPT Correction

The prompt needs to change: for compound words or expressions, it returned only the corrected portion, e.g. "hotesse de l'air" -> "hôtesse". Although that is the intended behaviour of the original prompt, it is not well suited to this case.
It is too expensive. It would be better to have a special prompt that deals with a list of words.
In some cases, it would be beneficial to pass a little bit of context. Giving a single word is not the best way to ask for corrections. We need to say, for example, that the word is supposed to be a piece of clothing, or an adjective that qualifies the nose.
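The three points above could be combined in a single prompt builder. The sketch below is only an illustration: the function name `build_correction_prompt`, the `context` parameter, and the exact wording of the instructions are assumptions, not the prompt actually used.

```python
def build_correction_prompt(words, context=None):
    """Build one prompt that asks for corrections of a whole list of entries.

    words   -- list of digitized words or expressions, possibly misspelled
    context -- optional short description of what the entries are
               (hypothetical parameter, e.g. "pieces of clothing")
    """
    lines = [
        "Correct the spelling of each French word or expression below.",
        "Return one corrected entry per line, in the same order.",
        # Addresses the compound-word issue: ask for the full expression.
        "For compound words or expressions, return the whole expression, "
        "not only the corrected part.",
    ]
    if context:
        lines.append(f"Context: the entries are {context}.")
    lines.append("")
    # Number the entries so the answers can be matched back to the inputs.
    lines.extend(f"{i}. {word}" for i, word in enumerate(words, start=1))
    return "\n".join(lines)
```

Sending one such prompt per category instead of one request per word should reduce the number of calls, and the category name itself can serve as the context.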

Automate digitization process

Pictures taken on the left side of the book are not digitized with quality as good as those on the right side.
It would also help to automatically put one word per line.
Sorting the entries of a category helps to check for errors. It is important to use the locale (French) in order to sort words with diacritics correctly.
There are some typos. There are not many, but they could easily go unnoticed. It would be a good idea to use ChatGPT to correct them automatically.
I noticed that the book has a Kindle version. Although I can't copy the selected text, I can take a screenshot. Nowadays it is very practical to take a screenshot of just the area of interest. I took all the pictures, converted them to a single PDF file, and then used my tool to digitize it to text. This was much more efficient because the digitization contained almost no errors.
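For the sorting point above, here is a minimal sketch of an accent-aware sort. It is an approximation, not true French collation: it strips diacritics with `unicodedata` so that "école" sorts next to "eau" instead of after "z". The helper names `french_sort_key` and `sort_entries` are hypothetical. A fully locale-aware alternative is `locale.strxfrm` after `locale.setlocale(locale.LC_COLLATE, "fr_FR.UTF-8")`, but that depends on the French locale being installed on the machine.

```python
import unicodedata


def french_sort_key(word):
    """Approximate French ordering by comparing words without their accents."""
    # NFD splits "é" into "e" plus a combining accent, which is then dropped.
    decomposed = unicodedata.normalize("NFD", word)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # The original word breaks ties between entries that differ only by accents.
    return (base.casefold(), word)


def sort_entries(text):
    """Take one entry per line and return the non-empty entries, sorted."""
    words = [line.strip() for line in text.splitlines() if line.strip()]
    return sorted(words, key=french_sort_key)
```

Ligatures such as "œ" are not decomposed by NFD, so entries containing them would still need special handling.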

Structured format

After manually cleaning the data, I am going to transform it into a structured format. I will try to use ChatGPT to help me with that.

Given a markdown text, I want you to transform it into an equivalent YAML file.

Examples:

Input:

# Cinq sens

## Vue

### Verbes

admirer
voir
zieuter

### Noms

agencement
uniformité
vision
visualisation

### Adjectifs

affreux
versicolore
vif
vilain
visible

## Toucher

### Verbes

affleurer
appuyer
attoucher
caresser

Output:

root:
  cinq_sens:
    vue:
      verbes:
        - admirer
        - voir
        - zieuter
      noms:
        - agencement
        - uniformité
        - vision
        - visualisation
      adjectifs:
        - affreux
        - versicolore
        - vif
        - vilain
        - visible
    toucher:
      verbes:
        - affleurer
        - appuyer
        - attoucher
        - caresser
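The same transformation can also be done deterministically, which may be useful as a cross-check of what the model returns. The sketch below assumes that words only appear under the deepest headings and that keys are the lowercased heading titles with spaces replaced by underscores, as in the example above. The function names are hypothetical, and the emitter is deliberately naive: it does no quoting, so a library such as PyYAML would be safer for entries containing characters like ":".

```python
import re


def slug(title):
    """Turn a heading title such as 'Cinq sens' into a key such as 'cinq_sens'."""
    return re.sub(r"\s+", "_", title.strip().lower())


def markdown_to_tree(text):
    """Build nested dicts from markdown headings, with word lists as leaves."""
    root = {}
    stack = [root]  # stack[d] is the dict of the open heading at depth d
    last = None     # (parent dict, key) of the most recent heading
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = re.match(r"^(#+)\s+(.+)$", line)
        if match:
            depth = len(match.group(1))
            del stack[depth:]  # close headings at this depth or deeper
            parent = stack[-1]
            key = slug(match.group(2))
            parent[key] = {}
            stack.append(parent[key])
            last = (parent, key)
        elif last is not None:
            parent, key = last
            # The first word under a heading turns that heading into a list.
            if not isinstance(parent[key], list):
                parent[key] = []
            parent[key].append(line)
    return {"root": root}


def to_yaml(node, indent=0):
    """Return the YAML lines for a tree of dicts and lists (no quoting)."""
    pad = "  " * indent
    lines = []
    if isinstance(node, dict):
        for key, value in node.items():
            lines.append(f"{pad}{key}:")
            lines.extend(to_yaml(value, indent + 1))
    else:
        lines.extend(f"{pad}- {item}" for item in node)
    return lines
```

Calling `"\n".join(to_yaml(markdown_to_tree(text)))` on the example input should give a nested structure with the same shape as the example output.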