Text digitization
This document groups my notes from digitizing the book "les mots de l'écrivain".
Redigitization
I was no longer sure about the quality of the markdown files I had. They were a mix of manual digitization and digitization from PNG images. I decided to extract the PNG images from the book again. This time I used the one-column format, which improved the quality of the digitization.
I regenerated the markdown files and manually added headers before calling the perchance-tools. I have completed the digitization and the word corrections, and stored everything on Dropbox.
Cache mechanism
I am using a local cache mechanism to store results previously computed by the prompts.
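A minimal sketch of what such a cache could look like, assuming one JSON file per request, keyed by a hash of the model name and the prompt text. The class name `PromptCache`, the file layout, and the `compute` callback are illustrative assumptions, not the code actually used for these notes.

```python
import hashlib
import json
from pathlib import Path


class PromptCache:
    """Tiny on-disk cache keyed by a hash of the model name and prompt text."""

    def __init__(self, directory):
        self.directory = Path(directory)
        self.directory.mkdir(parents=True, exist_ok=True)

    def _path(self, model, prompt):
        # Hash model + prompt so that changing either one gives a new entry.
        digest = hashlib.sha256(f"{model}\n{prompt}".encode("utf-8")).hexdigest()
        return self.directory / f"{digest}.json"

    def get_or_compute(self, model, prompt, compute):
        """Return the cached result, or call compute(prompt) and store it."""
        path = self._path(model, prompt)
        if path.exists():
            return json.loads(path.read_text(encoding="utf-8"))["result"]
        result = compute(prompt)
        payload = {"model": model, "prompt": prompt, "result": result}
        path.write_text(json.dumps(payload, ensure_ascii=False), encoding="utf-8")
        return result
```

Here `compute` stands for whatever function actually sends the prompt to the model. Because the key includes the prompt text, editing a prompt invalidates its old entries automatically.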
GPT-4o prompt
At first I tried to split the correction task into several prompts because GPT-3.5 was not performing well. I could have used GPT-4, whose results were indeed very good, but it cost 20x more.
However, I decided that for this very specific task, word correction from the digitization of a thesaurus book, I would rely on the most advanced GPT model available. That decision came at a good time because OpenAI had just launched GPT-4o, which is half the price of GPT-4.
ChatGPT Correction
- Need to change the prompt: in compound words or expressions, it returned only the corrected portion, e.g. "hotesse de l'air" -> "hôtesse". Although that is the intended behaviour of the original prompt, it is not well suited to this case.
- It is too expensive. It would be better to have a special prompt that handles a list of words.
- In some cases, it would be beneficial to pass a little context. Giving a single word is not optimal when asking for corrections. We need to say, for example, that the word is supposed to be a piece of clothing, or an adjective describing the nose.
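The three points above could be combined into one batched prompt. The sketch below is only an illustration: the function names, the prompt wording, and the numbered reply format are assumptions, and no API call is made. It sends a list of words with a short context line, asks for full entries back, and checks that the reply has as many lines as the request.

```python
def build_correction_prompt(words, context):
    """Build one prompt asking for corrections of many words at once."""
    numbered = "\n".join(f"{i}. {w}" for i, w in enumerate(words, start=1))
    return (
        "The following French entries come from the OCR of a thesaurus.\n"
        f"Context: {context}\n"
        "Correct spelling and diacritics. Return every entry in full, "
        "including the unchanged parts of compound expressions, "
        "one per line, with the same numbering.\n\n"
        f"{numbered}"
    )


def parse_numbered_reply(reply, expected_count):
    """Parse '1. word' lines back into a list; fail loudly on a count mismatch."""
    corrected = []
    for line in reply.splitlines():
        line = line.strip()
        if not line:
            continue
        number, sep, text = line.partition(". ")
        # Keep the raw line if the model dropped the numbering.
        corrected.append(text.strip() if sep and number.isdigit() else line)
    if len(corrected) != expected_count:
        raise ValueError(f"expected {expected_count} entries, got {len(corrected)}")
    return corrected
```

Batching reduces the number of requests, and the count check catches replies where the model merged or skipped entries.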
Automate digitization process
- Pictures taken on the left side of the book are not digitized as well as those on the right side.
- It would also help to automatically put one word per line.
- Sorting the entries of a category helps to check for errors. It is important to use the French locale in order to sort words with diacritics correctly.
- There are some typos. There are not many, but they could easily go unnoticed. It would be a good idea to use ChatGPT to correct those automatically.
- I noticed that the book has a Kindle version. Although I can't copy the selected text, I can take a screenshot. Nowadays it is very practical to take a screenshot of the area of interest. I took all the pictures, converted them to a single PDF file and then used my tool to digitize it to text. This was much more efficient because the digitization contained almost no errors.
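The "one word per line" and sorting steps from the list above could be sketched as follows. This is an approximation rather than true locale-aware collation: instead of relying on `locale.strxfrm` with a French locale (which requires that locale to be installed on the system), it strips diacritics and expands the ligatures "œ" and "æ" to build the sort key. The function names and the choice of separators (commas, semicolons, newlines) are assumptions.

```python
import re
import unicodedata


def one_word_per_line(text):
    """Split comma-, semicolon- or newline-separated entries into a clean list."""
    parts = re.split(r"[,;\n]+", text)
    return [p.strip() for p in parts if p.strip()]


def french_sort_key(entry):
    """Approximate French ordering: compare base letters, ignoring accents."""
    folded = entry.casefold().replace("œ", "oe").replace("æ", "ae")
    decomposed = unicodedata.normalize("NFD", folded)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # The original entry breaks ties between words that differ only by accents.
    return (base, entry)


def sort_entries(entries):
    """Sort entries so that accented words land next to their unaccented peers."""
    return sorted(entries, key=french_sort_key)
```

With a plain `sorted`, a word such as "écouter" ends up after "zieuter" because "é" has a higher code point than "z"; with this key it is ordered as if it started with "e".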
Structured format
After manually cleaning the data, I am going to transform it into a structured format. I will try to use ChatGPT to help me with that.
Given a markdown text, I want you to transform it into an equivalent yml file.
Examples:
Input:
# Cinq sens
## Vue
### Verbes
admirer
voir
zieuter
### Noms
agencement
uniformité
vision
visualisation
### Adjectifs
affreux
versicolore
vif
vilain
visible
## Toucher
### Verbes
affleurer
appuyer
attoucher
caresser
Output: