The new Leo tree data model has been evolving during last few weeks. Some things has been changed since my last post. I guess now may be the right time to explain how it works and why it was designed the way it is.
But before I start to explain model, let me give short summary of what features new model supports.
- super fast tree drawing
- selecting nodes
- moving nodes left, right, up and down as well as moving from one position in the outline to any other position
- clone node
- copy and paste node/subtree
- delete node/subtree
- promote/demote node
- reading XML leo files
- reading external files
- writing external files
- reading at-auto python files
- writing at-auto files
- unlimitted universal undo/redo functionality (with a single method entry handles any kind of tree/node modification including changes to body and/or headline)
All these features are achieved in one python module with some 1669 LOC (with comments included 2299 lines).
All these commands for modification of outline run much faster than analog Leo commands. For example executing promote/demote pair of commands takes Leo on average 113.6ms (measured 100 times with one clone node and three following sibling nodes). The new data model performs this in 3.7ms which means about 30 times faster than Leo. Comparing all other outline commands give similar speed ratio. Also all outline traversals are faster than Leo’s iterators. For example executing:
Average: 66.8ms on my machine, while computing the same value from new model prints
Average: 5.4ms. New model traversal in this case is 12 times faster.
The new data model is very well tested using excellent hypothesis framework. Testing involves creating 90 random outlines and performing series of 1000 random tree operations, checking everytime certain invariants. This kind of testing revealed several bugs that would be very hard to encounter using hand written test cases.
An interesting way to understand how code works is to comment out any line and run tests again. Failure is almost certain and in case of failure, hypothesis will display minimal sequence of commands that lead to the failure.
How can Leo benefit best from this code
There may be several ways for Leo to benefit from this code. It may be used as an inspiration for changing Leo’s VNode and Position classes and possibly some other as well like undoer, importCommands, leoTree, … But, there is no guarantee that it would be possible to implement all these features by changing Leo’s original classes in backward compatible way. Plus, this also would mean lots of additional work.
A better way would be to add this single module in leo/core/ and use it internally. Redirect top-level commands (like cut/copy/paste, clone, delete-outline, move node, expand, collapse, promote, demote, undo, redo) to delegate their work to the new data model. Also, selecting nodes and drawing tree should be delegated to the new model. All these changes can be done with minimal changes to the current Leo.
The same approach can be used for all other methods.
Finally, execute-script command can be changed to do the following:
All scripts will continue to work as they used to, without any change. If one day, LeoTreeModel class, gets its C or Rust implementation, the only thing that Leo should do to use power of such extension is to change one import so that LeoTreeModel class comes from that extension and not from ordinary Python module.
Helper data classes
There are two new data classes that the new model uses. While prototyping I was using tuples and lists for holding necessary data. In the end I have converted those tuples and lists into specialized data classes.
At the lowest level there is the
NData class, which contain data related to a single node in the outline. It has at least following five fields:
- h - headline
- b - body
- children - is a list of
gnx-es of children nodes
- parents - is a list of
gnx-es of parent nodes
- size - is the total number of cells this node and its subtree occupies in the outline lists. Basicaly it is the same value that you would get from the Leo position class with the following code:
One could easily imagine adding more fields to this class. For example
u is a good candidate. But for the sake of simplicity, I left this for later.
The other data class is a named tuple
LTMData which contains all relevant data for the outline. It has following fields:
positions is a list of unique doubles. It could be list of unique items of any type. I have used doubles and
random.random()to populate it, but really it can be anything hashable and immutable. In some testing that I have done using
hypothesisframework, it turned out that there was some interference between
randommodule and the model itself, which made some tests to fail. To avoid this intereference I have changed usage of
random.random()to function that calculates
(1 - 1/x)where x is an integer increased everytime new position is generated. And all tests have passed with this.
nodes is a list of
gnx-es of all nodes in the outline order. It is the same value you would get from Leo using:
list(x.gnx for x in c.all_positions())
- levels is a
bytearraycontaining levels of each node in the outline order. The same thing you would get from:
bytearray(x.level() for x in c.all_positions()). This restricts the depth of the outline to 255 levels, but I don’t expect that anyone would mind this restriction. Using
bytearrayhere is esential, because it implements
rfindmethod for searching backwards, and plain python lists do not imlement this method. It is a pitty that python lists do not have this method. One possible solution would be to store values in reversed order and then use
indexmethod for searching list. In case restriction to 255 levels should be relaxed, this is something that must be done.
- attrs is a
dict: keys are
gnx-es and values are
NDatainstances. Perhaps name
attrsis not very intuitive. It started as a place to store attributes of nodes such as headline, body,… Later all these attributes were joined into a single
NDatainstance. It may be better if this field was named
nodes, and have
- expanded is a set of positions that should be drawn as expanded.
marked is a set of gnxes of all marked nodes.
One thing to keep in mind is that in the new Leo tree data model, root node is considered to be a part of the outline. It is the first item in
levels, and its size is equal to size of the length of these lists. In the above comprehension expressions this rule is not allways obeyed. These expressions were shown just as an illustration.
In the previous version of
LTMData there were few other fields like
parPos which was list of positions. For each item in
positions list at the same index in
parPos was position of the parent node. Later I have noticed that this position can be deduced from the
levels list. The parent position is at the same index that has level for one less than this level. For example let
i be the index of any given node in the outline. Index of its parent
pi must satisfy the following equation:
There is also a special case, when
levels[i] <= 1, index of parent node is zero, i.e. parent node is in fact root node which has fixed index so we don’t need to search for it.
The other field that was removed is
gnx2pos. It was used to keep the track of all occurrences of a single node in the outline. Keys were gnxes and values were lists of positions. However, later I have realized that in every usage of this field, positions were looked up in the list of positions by
index method of list class. So, there were no real benefit from this index. The same value is retrieved just by searching for gnx in the list of nodes.
Here is how outline of an experimental outline is represented inside
To make this image more readable, actual values of gnx-es are replaced by single upper case letter and positions are represented as
P<index> instead of their float values. Also for each node after headline all positions the node is located are listed, followed by children and parents lists.
For example let’s look at the node in position P24. From the above picture we can see its headline is
clone 1, its gnx is
C, its at the level 3, and its size (self_and_subtree has total of 2 nodes). Reading text to the right of headline we can see that this same node can be found on positions: P2, P6, P8, P24 and P26. It has a single child node with the gnx
K, and its parents are nodes with the gnxes:
B, B and N.
In the following post, I’ll try to explain how some of the tree operations work.