Design

This section discuss design decisions took during the project

Brick

A Brick is a data structure inspired in the design of a Trie. A Trie is prefix tree and it is extensively used to store words. However, the Brick data structure is more adapted for the goals we aim.

In the game, a puzzle is composed by a pair: a list of characters and a list of words. The list of words are formed by using only the characters listed in the list of characters.

A Brick data structure stores puzzles. Each path in the tree forms a puzzle. That is, the sequence of characters starting from the root node until one of the leaf nodes forms a valid puzzle.

To insert a word in a Brick you have to sort its characters in ascending order and eliminate the repeated ones. Next, insert the resulting string as you would if it was in a Trie.

There is another difference. We need to explicit store the words. In a Trie, the path to a leaf is the word itself. In a Brick, the path is a sorted combination of the unique characters of a set of words. Indeed, many words may be associated to a unique path. Therefore, at the end of a path we should append the list of words that can be formed by that path.

Yet another difference. We may have paths ending at intermediate nodes, differently from a Trie in which the paths always end at a leaf.

Example

Here is what looks like the brick for the following puzzle:

{ "puzzle": { "letters": "deioprv", "words": [ "prior", "proper", "error", "driver", "deed", "video", "over", "period", "pipe", "divide", "drive", "deep", "derive", "river", "poor", "order", "pride", "dive", "drop", "deer", "deprive", "provider", "ever", "provide", "door", "pepper", "prove", "peer", "ride", "revive", "rope" ] }

Brick properties

Unbalanceness

Left subtrees tends to be bigger than their right siblings. That's because it sufficies to have one letter 'a' in the word w in order to that word be positioned at the a subtree. Therefore, the data structure is unbalanced to the left.

Brick file format

               |-------BRICK_LIST------||--------OFFSET_LIST-------| 
<ALPHABET_SIZE>[[KEY]#<BACKTRACK_COUNT>][<OFFSET><NUM_WORDS>[WORD#]]<BRICK_LENGTH>

1. ALPHABET_SIZE: The number of bytes needed to encode the alphabet underlying the words to be stored by the Brick. 2. KEY: A single character composing a path in the Brick. 3. BACKTRACK_COUNT: Indicates that a path has ended. The BACKTRACK_COUNT tells how many levels up in the current path we have to climb in order to insert the next KEY. 4. OFFSET: Indicates how many keys were inserted before that a list of words must be appended to a path. 5. NUM_WORDS: Indicates the number of words to append at the end of some path. 6. WORD: A word in the Brick collection. 7. BRICK_LENGTH: Indicates the number of bytes between in BRICK_LIST.

A simpler format could have been adopted. For example, we could have simply stored the words contained in the Brick in the order they were inserted. However, I suspect that the adopted format is superior for the following reasons.

It is independent of the algorithm defined in insert_word and independent of the data structure used to store the words at the end of a path.
It is faster if the Brick has several words. Executing one insertion per word would mean to traverse the Brick several times. The current format traverses the Brick only once. (This one needs to be tested in a benchmark).

Segmenter

Modular brick

Puzzle generator algorithm

Input: n := Number of letters in the puzzle b := brick

def get_combinations(brick, kn_tuple, current_sum=0, previous_path=""):
    if kn_tuple.empty():
        yield tuple() 

    j = kn_tuple.front()

    for m in range(j,current_sum+1):
        '''
        The m-flat brick representation consists of a string in which 
        every non-empty path of length j (i.e. with at least one word) 
        are concatenated one after the other separated by some token, 
        for example $.

        Example:

        The 3-flat representation for the brick in the brick image 
        example is:

        dep$der$dor$eip$eor$epr$erv$opr    
        '''
        m_flat = m_flat_brick(brick,m)


        ''' 
        Given a m-letter path, we will find all paths with `m-j` letters
        in common. 

        For example:            

        j := 4
        current_sum := 6
        previous_path := de

        m := 4 (0 letters in common)
            (?<=\$)[^\$de]*(?=\$)

        m := 5 (1 letter in common)
            (?<=\$)[^\$de]*[de]{1}[^\$de]*(?=\$)

        m := 6 (2 letters in common)
            (?<=\$)[^\$de]*[de]{2}[^\$de]*(?=\$)


        Here are the results for the brick in the image above:
            get_combinations(brick,(2,4),0,"de")
                deoprv (de + eoprv) 
                deiprv (de + iprv)
                deiopr (de + iopr)
        '''
        r = build_regex(previous_path,m)

        candidate_paths = [path for path in m_flat if path.match(r) ]
        for path in candidate_paths:
            yield (path,) + get_combinations(brick,kn_tuple,current_sum+j,previous_path)


# A kn-tuple is a set of k numbers such that their sum equals to n 
kn_tuple = get_kn_tuple()

# This is a list of valid paths in the brick that are composed by n letters 
valid_puzzle_paths = list(get_all_combinations(brick,kn_tuple))
puzzle_path = shuffle_and_pick(valid_puzzle_paths)

'''
The last step consists to collect all words in the brick that are made composed
by the letters in each of the paths.

Notice that we have to explore every subtring in the path. For example,
puzzle_path := deoprv

    subpaths:
    dopr    (drop)
    dor     (door)
    deor    (order)
    der     (deer)
    dep     (deep)
    de      (deed)
    eoprv   (prove)
    eopr    (rose, proper)
    eorv    (over)
    eor     (error)
    epr     (pepper, peer)
    erv     (ever)
    opr     (poor)
'''
word_collection = collect_all_words_subpaths(puzzle_path)

In the english-5K corpora there are 17948 puzzles of 7 letters.