Leo Tree Model - rewritten | Art of computing

The new Leo tree data model has been evolving during last few weeks. Some things has been changed since my last post. I guess now may be the right time to explain how it works and why it was designed the way it is.

But before I start to explain model, let me give short summary of what features new model supports.

super fast tree drawing
selecting nodes
moving nodes left, right, up and down as well as moving from one position in the outline to any other position
clone node
copy and paste node/subtree
delete node/subtree
promote/demote node
reading XML leo files
reading external files
writing external files
reading at-auto python files
writing at-auto files
unlimitted universal undo/redo functionality (with a single method entry handles any kind of tree/node modification including changes to body and/or headline)

All these features are achieved in one python module with some 1669 LOC (with comments included 2299 lines).

All these commands for modification of outline run much faster than analog Leo commands. For example executing promote/demote pair of commands takes Leo on average 113.6ms (measured 100 times with one clone node and three following sibling nodes). The new data model performs this in 3.7ms which means about 30 times faster than Leo. Comparing all other outline commands give similar speed ratio. Also all outline traversals are faster than Leo’s iterators. For example executing:

import timeit
def tf():
    res = []
    for p in c.all_positions():
        res.append(('----'*p.level()) + p.h)
    return '\n'.join(res)
t1 = timeit.timeit(tf, number=100)*1000/100
g.es('Average: %.1fms\n'%t1)

inside LeoPyRef.leo, prints Average: 66.8ms on my machine, while computing the same value from new model prints Average: 5.4ms. New model traversal in this case is 12 times faster.

The new data model is very well tested using excellent hypothesis framework. Testing involves creating 90 random outlines and performing series of 1000 random tree operations, checking everytime certain invariants. This kind of testing revealed several bugs that would be very hard to encounter using hand written test cases.

An interesting way to understand how code works is to comment out any line and run tests again. Failure is almost certain and in case of failure, hypothesis will display minimal sequence of commands that lead to the failure.

How can Leo benefit best from this code

There may be several ways for Leo to benefit from this code. It may be used as an inspiration for changing Leo’s VNode and Position classes and possibly some other as well like undoer, importCommands, leoTree, … But, there is no guarantee that it would be possible to implement all these features by changing Leo’s original classes in backward compatible way. Plus, this also would mean lots of additional work.

A better way would be to add this single module in leo/core/ and use it internally. Redirect top-level commands (like cut/copy/paste, clone, delete-outline, move node, expand, collapse, promote, demote, undo, redo) to delegate their work to the new data model. Also, selecting nodes and drawing tree should be delegated to the new model. All these changes can be done with minimal changes to the current Leo.

For example:

def clone_node(self):
    if g.USE_NEW_MODEL:
        return self.ltm.clone_node()
    ... the rest of method remains unchanged

The same approach can be used for all other methods.

Finally, execute-script command can be changed to do the following:

if g.USE_NEW_MODEL:
    c.ltm.store_to_vnode(c.hiddenRootNode)
    old_USE_NEW_MODEL = True
    g.USE_NEW_MODEL = False
else:
    old_USE_NEW_MODEL = False
try:
    ... old body of execute script method
finally:
    g.USE_NEW_MODEL = old_USE_NEW_MODEL
    ltm.restore_from_vnode(c.hiddenRootNode)

All scripts will continue to work as they used to, without any change. If one day, LeoTreeModel class, gets its C or Rust implementation, the only thing that Leo should do to use power of such extension is to change one import so that LeoTreeModel class comes from that extension and not from ordinary Python module.

Helper data classes

There are two new data classes that the new model uses. While prototyping I was using tuples and lists for holding necessary data. In the end I have converted those tuples and lists into specialized data classes.

At the lowest level there is the NData class, which contain data related to a single node in the outline. It has at least following five fields:

h - headline
b - body
children - is a list of gnx-es of children nodes
parents - is a list of gnx-es of parent nodes
size - is the total number of cells this node and its subtree occupies in the outline lists. Basicaly it is the same value that you would get from the Leo position class with the following code: len(list(p.self_and_subtree())).

One could easily imagine adding more fields to this class. For example unknownAttributes or u is a good candidate. But for the sake of simplicity, I left this for later.

The other data class is a named tuple LTMData which contains all relevant data for the outline. It has following fields:

positions is a list of unique doubles. It could be list of unique items of any type. I have used doubles and random.random() to populate it, but really it can be anything hashable and immutable. In some testing that I have done using hypothesis framework, it turned out that there was some interference between hypothesis using random module and the model itself, which made some tests to fail. To avoid this intereference I have changed usage of random.random() to function that calculates (1 - 1/x) where x is an integer increased everytime new position is generated. And all tests have passed with this.
nodes is a list of gnx-es of all nodes in the outline order. It is the same value you would get from Leo using: list(x.gnx for x in c.all_positions())
levels is a bytearray containing levels of each node in the outline order. The same thing you would get from: bytearray(x.level() for x in c.all_positions()). This restricts the depth of the outline to 255 levels, but I don’t expect that anyone would mind this restriction. Using bytearray here is esential, because it implements rfind method for searching backwards, and plain python lists do not imlement this method. It is a pitty that python lists do not have this method. One possible solution would be to store values in reversed order and then use index method for searching list. In case restriction to 255 levels should be relaxed, this is something that must be done.
attrs is a dict: keys are gnx-es and values are NData instances. Perhaps name attrs is not very intuitive. It started as a place to store attributes of nodes such as headline, body,… Later all these attributes were joined into a single NData instance. It may be better if this field was named nodes, and have nodes renamed to gnxes, gnxlist or even outline.
expanded is a set of positions that should be drawn as expanded.
marked is a set of gnxes of all marked nodes.

One thing to keep in mind is that in the new Leo tree data model, root node is considered to be a part of the outline. It is the first item in nodes, positions and levels, and its size is equal to size of the length of these lists. In the above comprehension expressions this rule is not allways obeyed. These expressions were shown just as an illustration.

In the previous version of LTMData there were few other fields like parPos which was list of positions. For each item in positions list at the same index in parPos was position of the parent node. Later I have noticed that this position can be deduced from the levels list. The parent position is at the same index that has level for one less than this level. For example let i be the index of any given node in the outline. Index of its parent pi must satisfy the following equation:

levels[pi] == levels[i] - 1

# and for all j > pi, j < i
levels[j] >= levels[i]

There is also a special case, when levels[i] <= 1, index of parent node is zero, i.e. parent node is in fact root node which has fixed index so we don’t need to search for it.

The other field that was removed is gnx2pos. It was used to keep the track of all occurrences of a single node in the outline. Keys were gnxes and values were lists of positions. However, later I have realized that in every usage of this field, positions were looked up in the list of positions by index method of list class. So, there were no real benefit from this index. The same value is retrieved just by searching for gnx in the list of nodes.

Here is how outline of an experimental outline is represented inside LTMData object.

To make this image more readable, actual values of gnx-es are replaced by single upper case letter and positions are represented as P<index> instead of their float values. Also for each node after headline all positions the node is located are listed, followed by children and parents lists.

For example let’s look at the node in position P24. From the above picture we can see its headline is clone 1, its gnx is C, its at the level 3, and its size (self_and_subtree has total of 2 nodes). Reading text to the right of headline we can see that this same node can be found on positions: P2, P6, P8, P24 and P26. It has a single child node with the gnx K, and its parents are nodes with the gnxes: B, B and N.

In the following post, I’ll try to explain how some of the tree operations work.