Journal

TODO: General

  • Understand the differences between the types of variable assignment in Make.

TODO: correct-markdown

  • I do not want the corrector to replace correct expressions merely because they do not seem appropriate.

Examples:

  • Nickel -> Parfait
  • Se marie mieux avec -> s'accorde mieux avec
  • Quel crampon -> Quelle râleuse
  • tapais du pied -> pratiquait la danse

  • Add a comment in the code (MarkdownView) explaining why I am using a MarkdownView and not a plain-text view for find-and-replace operations. A complete explanation can be found in the design document.

  • Tackle the TODOs in the code.

  • Write a function to print the segments of StringView in a helpful format such as:

[H](<span>)[N](Claude: )[H](</span>)[N]( Que mangerons-nous\ncette après-midi?\n\n)[H](<span>)[N](Boris: )[H](</span>)
  • It will be much easier if I have a way to go interactively over a correction and then decide if I will keep it or not.

  • The corrected version of the text (without the strikethroughs) tends to remove some spaces between words. For example: (il doit yavoir)

  • If the markdown file contains a word enclosed in double quotes, the creation of the correct.json file may fail. In the instance I observed, the step that creates correct.json produced a string representation of a JSON object in which the double quotes inside the message were not properly escaped: "{ \"message\":\" tralala, blablablha, \"error occurs here\" blablablah \"}". A workaround is to use single quotes instead of double quotes in the markdown text.

  • The explanation step is too slow, although it is a step I can run in parallel. I need to improve the workflow to accept different LLM-Agent designs, such as the ones described here
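Regarding the double-quote item above: a minimal sketch of how the escaping could be delegated to a JSON library instead of assembling the string by hand. It assumes the message is available as a plain Python string; the variable names are illustrative and are not taken from the correct-markdown code.

```python
import json

# Message containing literal double quotes, as in the failing case.
message = 'tralala, blablablha, "error occurs here" blablablah'

# First level: the JSON object itself; json.dumps escapes the inner quotes.
obj_text = json.dumps({"message": message}, ensure_ascii=False)

# Second level: the JSON object embedded as a string value (a "string
# representation of a JSON object"); the quotes are escaped once more.
wrapped = json.dumps(obj_text, ensure_ascii=False)

# Decoding twice recovers the original message unchanged.
recovered = json.loads(json.loads(wrapped))["message"]
print(recovered == message)
```

If the string is currently built inside the Makefile, the same idea would apply there: let a JSON-aware tool produce the string rather than concatenating it.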

2025-01-26

  • Pushed v2.0.0 of word-definition prompt to consider the context in which the word appears.
  • Updated correct-markdown collect-bold-segments to collect the context as well.
  • Updated Makefile of correct-markdown to comply with changes above.
  • Finished refactoring of Makefile of correct-markdown.
  • Update correct-markdown Application.

2025-01-25

  • I started working on v2.0.0 of the word-definition prompt. Modifications are in the correct-markdown project.

I need to update how the prompt is called in the Makefile. In particular, I need to pass the context surrounding the word in bold, and not only the bold word itself, as is currently the case.

To test, I am using la-voyage-reduit.md which is in the languages Journals under Drafts.

  • Re-ran correct-markdown for my last text: La voyage de Alice. Everything went well.
  • Move get_plain_text to utils instead of having it in api.

2025-01-24

  • Understand whether I can really rely on the diff to compute the segments of StringView. At least add a comment in the code explaining that I assume I get the separation of HTML segments for free by using difflib with autojunk set to False. (DONE!)

  • I am now supporting both pure-markdown and plain-text in MarkdownView. I finished updating MarkdownView to support both modes. I am currently going through the tests in test_markdown_view to figure out whether I can obtain equivalent results using the plain-text strategy. (DONE!)

One thing to look at is that the string returned by get_full_content should respect the original formatting as much as possible with respect to spaces and new lines.

During tests there was an inconsistency between pure-markdown and plain-text modes. Nonetheless, the documents are equivalent. Therefore, I am going to accept both versions and the test will check if the underlying html document is equivalent.
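A minimal sketch of what such an equivalence check could look like, using only the standard library. The notion of equivalence here is an assumption (same tag sequence and same text once whitespace is collapsed and whitespace-only text is dropped); html_tokens and html_equivalent are hypothetical helpers, not functions from the project.

```python
from html.parser import HTMLParser


class _TokenCollector(HTMLParser):
    """Collect start tags, end tags and whitespace-normalized text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(("start", tag, tuple(attrs)))

    def handle_endtag(self, tag):
        self.tokens.append(("end", tag))

    def handle_data(self, data):
        # Collapse runs of whitespace; ignore whitespace-only text.
        text = " ".join(data.split())
        if text:
            self.tokens.append(("text", text))


def html_tokens(html):
    parser = _TokenCollector()
    parser.feed(html)
    parser.close()
    return parser.tokens


def html_equivalent(first, second):
    """True when both documents yield the same normalized token stream."""
    return html_tokens(first) == html_tokens(second)


pure = "<span>Daniel: On mange quoi\ncette après-midi?\n\n</span><span>Boris: </span>"
plain = "<span>Daniel: On mange quoi\ncette après-midi?\n</span>\n<span>Boris: </span>"
print(html_equivalent(pure, plain))
```

Comparing token streams rather than raw strings is what lets the two modes differ in where the newlines land while still counting as the same document.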

Given the following document

<span>Claude: </span> Que mangerons-nous
cette après-midi?

<span>Boris: </span>

Pure markdown mode segments it like this:

[H](<span>)[N](Claude: )[H](</span>)[N]( Que mangerons-nous\ncette après-midi?\n\n)[H](<span>)[N](Boris: )[H](</span>)

Plain text mode segments it like this:

[H](<span>)[N](Claude: )[H](</span>)[N]( Que mangerons-nous\ncette après-midi?\n)[H](\n<span>)[N](Boris: )[H](</span>)

Therefore, for the replacement:

mv.replace(0, "Claude:  Que mangerons-nous", "Daniel: On mange quoi")

We obtain for the pure markdown mode:

[H](<span>)[N](Daniel: On mange quoi\ncette après-midi?\n\n)[H](</span><span>)[N](Boris: )[H](</span>)
<span>Daniel: On mange quoi
cette après-midi?

</span><span>Boris: </span>

whereas for plain text we obtain:

[H](<span>)[N](Daniel: On mange quoi\ncette après-midi?\n)[H](</span>)[H](\n<span>)[N](Boris: )[H](</span>)
<span>Daniel: On mange quoi
cette après-midi?
</span>
<span>Boris: </span>
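
The [H]/[N] notation used in this entry can be produced with a small helper. The following is a sketch under assumptions: the tag-stripping regex stands in for the project's real no-HTML view, and segment and format_segments are hypothetical names, not the StringView API. It relies on a character-level difflib diff with autojunk disabled, where text present only in the original is treated as an HTML segment.

```python
import re
from difflib import SequenceMatcher


def segment(original, plain):
    """Split `original` into ("H", text) and ("N", text) segments."""
    matcher = SequenceMatcher(None, original, plain, autojunk=False)
    segments = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Text shared with the plain view: a normal segment.
            segments.append(("N", original[i1:i2]))
        elif tag == "delete":
            # Text missing from the plain view: an HTML segment.
            segments.append(("H", original[i1:i2]))
        else:
            # "insert" or "replace" would mean the plain view is not a
            # pure deletion of markup; surface it instead of guessing.
            raise ValueError(
                f"unexpected diff item: {tag} "
                f"{original[i1:i2]!r} -> {plain[j1:j2]!r}"
            )
    return segments


def format_segments(segments):
    """Render segments as [H](...)[N](...) with visible newlines."""
    return "".join(
        f"[{kind}]({text})".replace("\n", "\\n") for kind, text in segments
    )


original = (
    "<span>Claude: </span> Que mangerons-nous\n"
    "cette après-midi?\n\n<span>Boris: </span>"
)
plain = re.sub(r"<[^>]+>", "", original)  # stand-in for the no-HTML view
print(format_segments(segment(original, plain)))
```

For this document the printed line matches the pure markdown segmentation shown above.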

2025-01-14

  • I removed the OUTPUT_FOLDER parameter of the Makefile. Instead, I should call make from the intended build folder and pass the path to the Makefile using -f.

2025-01-13

  • Experimented with improving the correct-text prompt so that it outputs a JSON object. This was an item in the TODO list. I removed it because I realized that I can obtain more consistent results if I simply ask the model to correct the given text while preserving the format. JSON manipulation can easily be done with jq.

2025-01-12

The error occurs during the creation of the Segments of StringView. I am using character mode to compute the diff between the original markdown and the no-HTML view, and this is giving me some unexpected diff items.

Test whether it is related to the autojunk mode, or whether I can use word mode instead, since I am doing a smart find-and-replace that ignores spaces.

TODO: It seems that I had not set autojunk to False. Setting it seemed to make it work. (DONE! Confirmed that it works)
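
A small sketch of how the autojunk heuristic can be observed. In difflib, when the second sequence has at least 200 elements, elements that occur in more than about 1% of it are treated as "popular" and skipped when looking for matches; in character mode on ordinary text this covers most letters. The sample text is made up for illustration.

```python
from difflib import SequenceMatcher

# 30 characters repeated 10 times: long enough (>= 200 elements)
# for the heuristic to apply in character mode.
sentence = "je les ai achetées hier même. "
corrected = sentence * 10
original = corrected.replace("achetées", "achétaient", 1)

with_heuristic = SequenceMatcher(None, original, corrected)  # autojunk=True
without_heuristic = SequenceMatcher(None, original, corrected, autojunk=False)

# Characters the heuristic ignores when searching for matches.
print(sorted(with_heuristic.bpopular))
# With autojunk disabled no character is discarded.
print(sorted(without_heuristic.bpopular))
```

Because almost every character of a long text ends up in bpopular, the default setting is a poor fit for character-level diffs of whole documents.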

2025-01-08

    raise ValueError({"message": "Value not found", "search_value": search_value})
ValueError: {'message': 'Value not found', 'search_value': "**se  marie mieux avec** la montagne. Thomas: Je suis d'accord! Et"}

I was expecting the correction to be done on the plain text of the markdown. That is, the stars should not be part of the text I send to the LLM for correction.

However, as explained in the design and architecture document, I decided to pass to the LLM a pure-markdown document instead in order to better apply corrections involving segments across markdown markup such as the double stars.

For example:

Incorrect text: il a marce ver la forêt
Corrected text with plain-text: il a marché vers la forêt
Corrected text with pure-markdown: il a marché vers la forêt.

2024-12-29

  • Pipeline error: Found an error while executing for text: "la-maitresse". I have to create a minimal example to reproduce the error, which I believe is captured by the diff items below, in particular the third one.
  {
    "context": "bières que tu avalises. ruben: on ne va pas recommencer, ___ pas? tu fais ce que tu veux et mois je",
    "original_value": "ne c'est",
    "new_value": "n'est-ce",
    "operation": "replace"
  },
  {
    "context": "recommencer, n'est-ce pas? tu fais ce que tu veux et ___ je fais pareil. où sont les pizzas que j'en ai",
    "original_value": "mois",
    "new_value": "moi",
    "operation": "replace"
  },
  {
    "context": "et moi je fais pareil. où sont les pizzas que ___ ai acheté? alice: oh, je les ai jétés. ils n'étaient",
    "original_value": "j'en",
    "new_value": "j'ai achetées? alice: oh, je les",
    "operation": "replace"
  },
  {
    "context": "les pizzas que j'ai achetées? alice: oh, je les ai ___ les ai jétés. ils n'étaient pas bonne. ruben: impossible! je",
    "original_value": "acheté? alice: oh, je",
    "new_value": "jetées. elles n'étaient pas bonnes. ruben: impossible! je",
    "operation": "replace"
  },
  {
    "context": "jetées. elles n'étaient pas bonnes. ruben: impossible! je les ai ___ hier même. alice: de toute façon, les surgélés sont toujours",
    "original_value": "jétés. ils n'étaient pas bonne. ruben: impossible! je les ai achétaient",
    "new_value": "achetées",
    "operation": "replace"
  },
  {
    "context": "les ai achetées hier même. alice: de toute façon, les ___ sont toujours mauvais. ruben: d'accord! ce n'est pas grave. je",
    "original_value": "surgélés",
    "new_value": "surgelés",
    "operation": "replace"
  }
  • I found that the error above was located in the text_diff function.
  • In this function, there is an option to update the first sequence iteratively. That is, every time we traverse a DiffItem, we apply the diff to the first sequence. This is done to recover updated contexts.
  • However, when I do the iterative update, I must break the DiffSequence generated by diff_lib, and I might break it at a bad point. For example, sometimes we have compound actions, such as a replace followed by a delete. I think the issue occurs when a break falls inside such a compound action.
  • The cause was the autojunk heuristic of SequenceMatcher. Setting it to False solved the issue.

2024-09-22

  • Structured the enhance-markdown pipeline.
  • Create an example folder
  • Add create venv step

2024-09-10

toml library potential error

In a multiline literal string ('''), when I have three non-consecutive single quotes, I get an error. I am not sure whether this is the expected behaviour.

It seems to be a bug in the Python library.

Mkdocs footnotes plugin

features:
  - content.footnote.tooltips

markdown_extensions:
  - footnotes

Ignore markdown during correction

Use the markdown library to parse to HTML and BeautifulSoup to get the text.

Word segmenter

I detect differences between the original and corrected versions by comparing them word by word. However, corrections can be close to each other (as little as one word apart), and the justification for change n+1 may depend on the change made at n. Therefore, I need to construct each context from the most recent version of the corrected text.
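
A minimal sketch of this idea, assuming a word-level difflib diff: the words to the left of a change are taken from the corrected text (so earlier corrections are already applied), and the words to the right are taken from the original text (not yet corrected). word_diff_with_context and its window parameter are hypothetical and are not the project's text_diff function.

```python
from difflib import SequenceMatcher


def word_diff_with_context(original, corrected, window=10):
    """Return diff items whose context reflects earlier corrections."""
    a = original.split()
    b = corrected.split()
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    items = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        left = b[max(0, j1 - window):j1]   # already corrected words
        right = a[i2:i2 + window]          # words still to be examined
        items.append({
            "context": " ".join(left + ["___"] + right),
            "original_value": " ".join(a[i1:i2]),
            "new_value": " ".join(b[j1:j2]),
            "operation": tag,
        })
    return items


original = ("on ne va pas recommencer, ne c'est pas? "
            "tu fais ce que tu veux et mois je fais pareil.")
corrected = ("on ne va pas recommencer, n'est-ce pas? "
             "tu fais ce que tu veux et moi je fais pareil.")

for item in word_diff_with_context(original, corrected):
    print(item)
```

In this example the context of the second item already contains "n'est-ce", the result of the first correction, which is the behaviour the paragraph above asks for.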

  {
    "context": "justificou o asilo político de González dizendo que, na Venezuela, ___ vida corria perigo, e as crescentes ameaças, citações judiciais, ordens",
    "original_value": "\\\"sua",
    "new_value": "a",
    "operation": "replace"
  },
  {
    "context": "asilo político de González dizendo que, na Venezuela, \\\"sua vida ___ perigo, e as crescentes ameaças, citações judiciais, ordens de apreensão",
    "original_value": "corria",
    "new_value": "dele estava em",
    "operation": "replace"
  },