
Text digitization

This document groups my notes from the task of digitizing the book "les mots de l'écrivain".

Redigitization

I was no longer sure about the quality of the markdown files I had, which were a mix of manual digitization and digitization from PNG images. I decided to extract the PNG images from the book again. This time I used the one-column format, which improved the quality of the digitization.

I regenerated the markdown files and manually added headers before calling the perchance-tools. I have completed the digitization and the word corrections, and I stored everything on Dropbox.

Cache mechanism

I am using a local cache mechanism to store results previously computed by the prompts.
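A minimal sketch of what such a cache could look like, assuming one JSON file per (model, prompt) pair stored under a local directory. The names `CACHE_DIR`, `cache_key` and `cached_call` are hypothetical and are not taken from my actual tool; `compute` stands for whatever function really sends the prompt to the model.

```python
import hashlib
import json
from pathlib import Path

# Hypothetical cache location; any writable directory works.
CACHE_DIR = Path(".prompt_cache")


def cache_key(model: str, prompt: str) -> str:
    """Build a stable key from the model name and the exact prompt text."""
    payload = json.dumps({"model": model, "prompt": prompt},
                         sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def cached_call(model, prompt, compute, cache_dir=CACHE_DIR):
    """Return the stored result if present, otherwise call compute() and store it."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / f"{cache_key(model, prompt)}.json"
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))["result"]
    result = compute(model, prompt)  # the expensive call happens only on a miss
    path.write_text(json.dumps({"result": result}, ensure_ascii=False),
                    encoding="utf-8")
    return result
```

Because the key includes the model name, switching models does not reuse results computed by another model.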

GPT-4o prompt

At first I tried to split the correction task into several prompts because GPT-3.5 was not performing well. I could have used GPT-4, whose results were indeed very good, but it cost 20x more.

However, I decided that for this very specific task, correcting words from the digitization of a thesaurus book, I would rely on the most advanced GPT model available. That decision was well timed because OpenAI has just launched GPT-4o, which is half the price of GPT-4.

ChatGPT Correction

  • Need to change the prompt: for compound words or expressions, it returned only the corrected portion, e.g. "hotesse de l'air" -> "hôtesse". Although that is the intended behaviour of the original prompt, it is not well suited to this case.
  • It is too expensive. It would be better to have a dedicated prompt that handles a list of words.
  • In some cases, it would be beneficial to pass a little bit of context. Giving a single word is not the best way to ask for corrections. We need to say, for example, that the word is supposed to be a piece of clothing, or an adjective describing the nose.
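The three points above could be combined in one batch prompt. The sketch below is only an illustration: the prompt wording and the helper names `build_correction_prompt` and `parse_corrections` are assumptions, and the actual call to the model is deliberately left out.

```python
def build_correction_prompt(words, context):
    """Build one prompt for a whole list of words, with a short context hint."""
    numbered = "\n".join(f"{i}. {word}" for i, word in enumerate(words, start=1))
    return (
        "The following French entries come from the OCR of a thesaurus.\n"
        f"Context: {context}\n"
        "Correct the spelling and diacritics of each entry. "
        "Always return the full entry, including for compound words "
        "and expressions. Answer with one entry per line, keeping the "
        "same order and numbering.\n\n"
        f"{numbered}"
    )


def parse_corrections(reply, expected_count):
    """Parse a numbered reply back into a list, failing loudly on a mismatch."""
    corrected = []
    for line in reply.splitlines():
        line = line.strip()
        if not line:
            continue
        number, separator, text = line.partition(". ")
        if not separator or not number.isdigit():
            raise ValueError(f"unexpected line in reply: {line!r}")
        corrected.append(text.strip())
    if len(corrected) != expected_count:
        raise ValueError(
            f"expected {expected_count} entries, got {len(corrected)}")
    return corrected
```

Checking the number of returned entries guards against the model silently dropping or merging words, which would otherwise shift every later correction by one position.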

Automate digitization process

  • Pictures taken on the left side of the book are not digitized as well as those on the right side.
  • It would also help to automatically put one word per line.
  • Sorting the entries of a category helps to check for errors. It is important to use the French locale so that words with diacritics sort correctly.
  • There are some typos. There are not many, but they could easily go unnoticed. It would be a good idea to use ChatGPT to correct them automatically.
  • I noticed that the book has a Kindle version. Although I can't copy the selected text, I can take a screenshot, and nowadays it is very practical to capture just the area of interest. I took all the pictures, converted them to a single PDF file, and then used my tool to digitize it to text. This was much more efficient because the digitization contained almost no errors.
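A small sketch of the "one word per line" and sorting steps, under a few assumptions: entries are separated by commas, semicolons or newlines (not by spaces, so that expressions stay whole), and the sort key only approximates French collation by comparing accent-stripped forms first. Where a French locale is installed, `locale.strxfrm` would be a more faithful key. The function names are hypothetical.

```python
import re
import unicodedata


def one_word_per_line(text):
    """Split raw OCR text into entries, keeping multi-word expressions whole."""
    entries = re.split(r"[,;\n]+", text)
    return [entry.strip() for entry in entries if entry.strip()]


def french_sort_key(word):
    """Approximate French ordering: compare without accents, then with them."""
    decomposed = unicodedata.normalize("NFD", word.casefold())
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return (base, decomposed)


words = one_word_per_line("vif, écharpe; zieuter\neau, été, etage")
print("\n".join(sorted(words, key=french_sort_key)))
```

With a plain `sorted(words)`, "écharpe" and "été" would land after "zieuter" because "é" has a higher code point than every unaccented letter, which makes a sorted list much harder to check by eye.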

Structured format

After manually cleaning the data, I am going to transform it into a structured format. I will try to use ChatGPT to help me with that.

Given a markdown text, I want you to transform it into an equivalent yml file.

Examples:

Input:

# Cinq sens

## Vue

### Verbes

admirer
voir
zieuter

### Noms

agencement
uniformité
vision
visualisation

### Adjectifs

affreux
versicolore
vif
vilain
visible

## Toucher

### Verbes

affleurer
appuyer
attoucher
caresser

Output:

root:
    cinq_sens:
        vue:
            verbes:
                - admirer 
                - voir 
                - zieuter 
            noms:
                - agencement 
                - uniformité 
                - vision 
                - visualisation 
            adjectifs:
                - affreux 
                - versicolore 
                - vif 
                - vilain
                - visible 
        toucher:
            verbes:
                - affleurer 
                - appuyer 
                - attoucher 
                - caresser
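Since the mapping in this example is purely mechanical, it could also be done without a model. The sketch below is one possible way to do it, assuming that every heading becomes a key, that plain lines under the deepest heading become list items, and that a heading never mixes words and sub-headings. The helpers `slugify`, `markdown_to_tree` and `to_yaml` are hypothetical, and the tiny emitter does no quoting or escaping; a real YAML library would be safer for entries containing special characters.

```python
import re
import unicodedata


def slugify(title):
    """Turn a heading such as 'Cinq sens' into a key such as 'cinq_sens'."""
    decomposed = unicodedata.normalize("NFD", title.strip().lower())
    plain = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return re.sub(r"[^a-z0-9]+", "_", plain).strip("_")


def markdown_to_tree(markdown):
    """Build nested dicts from headings; leaf headings hold lists of words."""
    root = {}
    stack = [root]   # stack[i] is the dict open at heading depth i
    leaf = None      # (parent dict, key) of the most recent heading
    for raw in markdown.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = re.match(r"(#+)\s+(.+)", line)
        if match:
            level = len(match.group(1))
            key = slugify(match.group(2))
            del stack[level:]          # close deeper or sibling headings
            parent = stack[-1]
            parent[key] = {}
            stack.append(parent[key])
            leaf = (parent, key)
        elif leaf is not None:
            parent, key = leaf
            if isinstance(parent[key], dict):
                parent[key] = []       # first word: the heading is a leaf
            parent[key].append(line)
    return root


def to_yaml(node, indent=0):
    """Emit the tree as YAML lines with 4-space indentation (no escaping)."""
    pad = " " * indent
    lines = []
    if isinstance(node, dict):
        for key, value in node.items():
            lines.append(f"{pad}{key}:")
            lines.extend(to_yaml(value, indent + 4))
    else:
        lines.extend(f"{pad}- {word}" for word in node)
    return lines


sample = "# Cinq sens\n\n## Vue\n\n### Verbes\n\nadmirer\nvoir\n"
print("\n".join(to_yaml({"root": markdown_to_tree(sample)})))
```

A deterministic converter like this can also serve as a check on the output produced by ChatGPT, since both should yield the same structure for the same input.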