Design
This section discuss design decisions took during the project
Brick
A Brick is a data structure inspired in the design of a Trie. A Trie is prefix tree and it is extensively used to store words. However, the Brick data structure is more adapted for the goals we aim.
In the game, a puzzle is composed by a pair: a list of characters and a list of words. The list of words are formed by using only the characters listed in the list of characters.
A Brick data structure stores puzzles. Each path in the tree forms a puzzle. That is, the sequence of characters starting from the root node until one of the leaf nodes forms a valid puzzle.
To insert a word in a Brick you have to sort its characters in ascending order and eliminate the repeated ones. Next, insert the resulting string as you would if it was in a Trie.
There is another difference. We need to explicit store the words. In a Trie, the path to a leaf is the word itself. In a Brick, the path is a sorted combination of the unique characters of a set of words. Indeed, many words may be associated to a unique path. Therefore, at the end of a path we should append the list of words that can be formed by that path.
Yet another difference. We may have paths ending at intermediate nodes, differently from a Trie in which the paths always end at a leaf.
Example
Here is what looks like the brick for the following puzzle:
{ "puzzle": { "letters": "deioprv", "words": [ "prior", "proper", "error", "driver", "deed", "video", "over", "period", "pipe", "divide", "drive", "deep", "derive", "river", "poor", "order", "pride", "dive", "drop", "deer", "deprive", "provider", "ever", "provide", "door", "pepper", "prove", "peer", "ride", "revive", "rope" ] }
Brick properties
Unbalanceness
Left subtrees tends to be bigger than their right siblings. That's because it sufficies to have
one letter 'a' in the word w in order to that word be positioned at the a subtree. Therefore,
the data structure is unbalanced to the left.
Brick file format
|-------BRICK_LIST------||--------OFFSET_LIST-------|
<ALPHABET_SIZE>[[KEY]#<BACKTRACK_COUNT>][<OFFSET><NUM_WORDS>[WORD#]]<BRICK_LENGTH>
A simpler format could have been adopted. For example, we could have simply stored the words contained in the Brick in the order they were inserted. However, I suspect that the adopted format is superior for the following reasons.
- It is independent of the algorithm defined in
insert_wordand independent of the data structure used to store the words at the end of a path. - It is faster if the Brick has several words. Executing one insertion per word would mean to traverse the Brick several times. The current format traverses the Brick only once. (This one needs to be tested in a benchmark).
Segmenter
Modular brick
Puzzle generator algorithm
Input: n := Number of letters in the puzzle b := brick
def get_combinations(brick, kn_tuple, current_sum=0, previous_path=""):
if kn_tuple.empty():
yield tuple()
j = kn_tuple.front()
for m in range(j,current_sum+1):
'''
The m-flat brick representation consists of a string in which
every non-empty path of length j (i.e. with at least one word)
are concatenated one after the other separated by some token,
for example $.
Example:
The 3-flat representation for the brick in the brick image
example is:
dep$der$dor$eip$eor$epr$erv$opr
'''
m_flat = m_flat_brick(brick,m)
'''
Given a m-letter path, we will find all paths with `m-j` letters
in common.
For example:
j := 4
current_sum := 6
previous_path := de
m := 4 (0 letters in common)
(?<=\$)[^\$de]*(?=\$)
m := 5 (1 letter in common)
(?<=\$)[^\$de]*[de]{1}[^\$de]*(?=\$)
m := 6 (2 letters in common)
(?<=\$)[^\$de]*[de]{2}[^\$de]*(?=\$)
Here are the results for the brick in the image above:
get_combinations(brick,(2,4),0,"de")
deoprv (de + eoprv)
deiprv (de + iprv)
deiopr (de + iopr)
'''
r = build_regex(previous_path,m)
candidate_paths = [path for path in m_flat if path.match(r) ]
for path in candidate_paths:
yield (path,) + get_combinations(brick,kn_tuple,current_sum+j,previous_path)
# A kn-tuple is a set of k numbers such that their sum equals to n
kn_tuple = get_kn_tuple()
# This is a list of valid paths in the brick that are composed by n letters
valid_puzzle_paths = list(get_all_combinations(brick,kn_tuple))
puzzle_path = shuffle_and_pick(valid_puzzle_paths)
'''
The last step consists to collect all words in the brick that are made composed
by the letters in each of the paths.
Notice that we have to explore every subtring in the path. For example,
puzzle_path := deoprv
subpaths:
dopr (drop)
dor (door)
deor (order)
der (deer)
dep (deep)
de (deed)
eoprv (prove)
eopr (rose, proper)
eorv (over)
eor (error)
epr (pepper, peer)
erv (ever)
opr (poor)
'''
word_collection = collect_all_words_subpaths(puzzle_path)
In the english-5K corpora there are 17948 puzzles of 7 letters.
