In the previous part I wrote about the implementation of the LeoTreeModel class and its fields. The first instances of LeoTreeModel were made from Leo’s VNode instances. Now let’s enable building a LeoTreeModel directly from .leo xml files.
Loading from Leo xml file
We already have a function that builds a LeoTreeModel from a sequence of tuples defining each node in the outline (see nodes2treemodel in the previous part). This function expects as its only argument an array of tuples of the following form:
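(gnx, headline, body, level, size, parents, children)

where, judging from the iterators below: gnx is the unique node id, headline and body are strings, level is the depth of the node in the outline (the hidden root has level 0), size is a one-element list [n] with n being the number of nodes in the node’s subtree (itself included), and parents and children are lists of gnxes.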
These values can be retrieved from the Leo xml file. Iterating over the child elements of the <vnodes> element we can build the required array of tuples. The value of each tuple element can be deduced from the <v> elements themselves, all except the b value, which is the text of the <t> element with the same gnx. All <t> elements are children of one unique <tnodes> element.
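For reference, here is a rough sketch of the relevant parts of a .leo file (the gnx values ab.123 and ab.124 are invented for this illustration):

<leo_file>
  <vnodes>
    <v t="ab.123"><vh>top node</vh>
      <v t="ab.124"><vh>child node</vh></v>
    </v>
    <v t="ab.124"></v> <!-- a clone: same gnx, but no vh and no children -->
  </vnodes>
  <tnodes>
    <t tx="ab.123">body text of top node</t>
    <t tx="ab.124">body text of child node</t>
  </tnodes>
</leo_file>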
For parsing the xml I have used the xml.etree.ElementTree module, which is part of the standard Python library.
from collections import defaultdict
import xml.etree.ElementTree as ET

def loadLeo(fname):
    '''Loads given xml Leo file and returns LeoTreeModel instance'''
    with open(fname, 'rt') as inp:
        s = inp.read()
    xroot = ET.fromstring(s)
    vnodesEl = xroot.find('vnodes')
    tnodesEl = xroot.find('tnodes')
    return xml2treemodel(vnodesEl, tnodesEl)
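Hypothetical usage (the file name is invented):

ltm = loadLeo('workbook.leo')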
The Leo xml file format is not fully symmetric, which requires two different iterators: one for every <v> element, and a second one for iterating over the top-level nodes, which are children of the <vnodes> element.
def xml2treemodel(xvroot, troot):
    '''Returns LeoTreeModel instance from vnodes and tnodes elements of xml Leo file'''
    parDict = defaultdict(list)  # accumulates parent gnxes for each node
    hDict = {}  # accumulates headlines for each node
    # contains body for each node
    bDict = dict((ch.attrib['tx'], ch.text or '') for ch in troot)
    xDict = {}  # will keep references to Element instances for iterating clones
    # here come two utility iterators
    @others
    nodes = tuple(riter())
    return nodes2treemodel(nodes)
And as child nodes (expanded in place of the @others line above) we have the two iterators:
def viter(xv, lev0):
    s = [1]  # the only element of this list counts how many nodes
             # there are in the subtree of this node, i.e. its size
    gnx = xv.attrib['t']
    if len(xv) == 0:
        # clone nodes contain neither a vh element nor any children,
        # so we have to reiterate the first of all the clones, the one
        # that has both vh and children
        for ch in viter(xDict[gnx], lev0):
            yield ch
        return
    # not a clone, we have encountered a new node
    xDict[gnx] = xv
    hDict[gnx] = xv[0].text
    chs = [ch.attrib['t'] for ch in xv if ch.tag == 'v']
    for ch in chs:
        parDict[ch].append(gnx)
    mnode = [gnx, hDict[gnx], bDict.get(gnx, ''), lev0, s, parDict[gnx], chs]
    yield mnode
    for ch in xv:
        if ch.tag != 'v':
            continue
        for x in viter(ch, lev0 + 1):
            s[0] += 1
            yield x
This iterator will be used for every <v> element. However, for the top-level elements, which are children of the <vnodes> element and not of any <v> element, we have to make a different iterator: riter, the r(oot) iterator.
def riter():
    s = [1]
    chs = []
    yield 'hidden-root-vnode-gnx', '<hidden root vnode>', '', 0, s, [], chs
    for xv in xvroot:
        gnx = xv.attrib['t']
        chs.append(gnx)
        parDict[gnx].append('hidden-root-vnode-gnx')
        for ch in viter(xv, 1):
            s[0] += 1
            yield ch
This iterator invokes the first one for each top-level vnode and finally gives us a tuple of node tuples that we can pass to the nodes2treemodel function.
Reading external files
After this first pass we will have the outline only. All children of @file nodes are still missing. In July 2017 I wrote two functions for reading and writing external files in Leo. They relied on VNode and Position methods, so they need to be adjusted for use with the new LeoTreeModel. However, we can keep their overall structure.
load_derived_file(lines) takes as input the lines of text from a derived file and returns a generator of tuples (gnx, h, b, level). It has five distinct phases:
- handling first lines and header of derived file
- creating necessary regexes
- init topnode
- iterate input lines
- yield collected nodes
Phase 1: handling first lines and header
first_lines = []
header_pattern = re.compile(r'''
    ^(.+)@\+leo
    (-ver=(\d+))?
    (-thin)?
    (-encoding=(.*)(\.))?
    (.*)$''', re.VERBOSE)
for i, line in flines:
    m = header_pattern.match(line)
    if m:
        break
    first_lines.append(line)
else:
    raise ValueError('wrong format, not derived file')
delim_st = m.group(1)
delim_en = m.group(8)
Nothing too special about this phase. We simply read lines and collect them in the first_lines list until we encounter the header line. Once we have the header line, we deduce the start and end delimiters from it.
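For example, the header line of a Python external file typically reads #@+leo-ver=5-thin. Matching it gives '#' as the start delimiter and an empty end delimiter:

m = header_pattern.match('#@+leo-ver=5-thin\n')
m.group(1)  # '#' -> delim_st
m.group(8)  # ''  -> delim_en (line comments have no closing delimiter)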
Phase 2: creating regexes
Once we know the start and end delimiters, we can build the patterns used for parsing the remaining lines.
def get_patterns(delim_st, delim_en):
    if delim_en:
        dlms = re.escape(delim_st), re.escape(delim_en)
        ns_src = r'^(\s*)%s@\+node:([^:]+): \*(\d+)?(\*?) (.*?)%s$' % dlms
        sec_src = r'^(\s*)%s@(\+|-)<{2}[^>]+>>(.*?)%s$' % dlms
        oth_src = r'^(\s*)%s@(\+|-)others%s\s*$' % dlms
        all_src = r'^(\s*)%s@(\+|-)all%s\s*$' % dlms
        code_src = r'^%s@@c(ode)?%s$' % dlms
        doc_src = r'^%s@\+(at|doc)?(\s.*?)?%s$' % dlms
    else:
        dlms = re.escape(delim_st)
        ns_src = r'^(\s*)%s@\+node:([^:]+): \*(\d+)?(\*?) (.*)$' % dlms
        sec_src = r'^(\s*)%s@(\+|-)<{2}[^>]+>>(.*)$' % dlms
        oth_src = r'^(\s*)%s@(\+|-)others\s*$' % dlms
        all_src = r'^(\s*)%s@(\+|-)all\s*$' % dlms
        code_src = r'^%s@@c(ode)?$' % dlms
        doc_src = r'^%s@\+(at|doc)?(\s.*?)?' % dlms + '\n'
    return bunch(
        node_start=re.compile(ns_src),
        section=re.compile(sec_src),
        others=re.compile(oth_src, re.DOTALL),
        all=re.compile(all_src, re.DOTALL),
        code=re.compile(code_src),
        doc=re.compile(doc_src),
    )

patterns = get_patterns(delim_st, delim_en)
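Just to illustrate the kind of lines these patterns are meant to recognize, assuming Python-style sentinels ('#' as the start delimiter, no end delimiter); the gnx below is invented:

pats = get_patterns('#', '')
assert pats.others.match('    #@+others\n')
assert pats.section.match('  #@+<< imports >>\n')
assert pats.node_start.match('#@+node:ab.123: ** a headline\n')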
Phase 3: start top node
First we need a place to collect all data.
nodes = bunch(
    # level contains a list of levels for each node, in the order the node
    # appears in the input; this supports the at-all directive, which
    # writes clones several times
    level=defaultdict(list),
    # contains headline for each node
    head={},
    # contains lines of body text for each node
    body=defaultdict(list),
    # this list stores the order of nodes in the derived file, that is the
    # order in which we will dump nodes once we have consumed all input lines
    gnxes=[],
)
topnodeline = flines[len(first_lines) + 1][1]  # line after header line
m = patterns.node_start.match(topnodeline)
topgnx = set_node(m)
# put first lines (if we have some) at the top of the root body
nodes.body[topgnx] = ['@first ' + x for x in first_lines]
assert topgnx, 'top node line [%s] %d first lines' % (topnodeline, len(first_lines))
# this will keep track of current gnx and indent whenever we encounter
# at+others or at+<section> or at+all
stack = []
in_all = False
in_doc = False
# spelling of the at-verbatim sentinel
verbline = delim_st + '@verbatim' + delim_en + '\n'
verbatim = False  # keeps track of whether the next line must be taken verbatim
where set_node is like this:
@ utility function that sets data from the regex match object of a sentinel
line; see the node_start pattern. Groups 1-5 are:
(indent, gnx, level-number, second star, headline).
Returns gnx.
@c
def set_node(m):
    gnx = m.group(2)
    lev = int(m.group(3)) if m.group(3) else 1 + len(m.group(4))
    nodes.level[gnx].append(lev)
    nodes.head[gnx] = m.group(5)
    nodes.gnxes.append(gnx)
    return gnx
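To illustrate the level encoding (again assuming '#' sentinels and an invented gnx): one star means level 1, two stars level 2, and deeper levels are written as a number between stars, e.g. *5*.

m = patterns.node_start.match('#@+node:ab.123: ** some headline\n')
set_node(m)                # returns 'ab.123'
nodes.level['ab.123'][-1]  # 2, computed as 1 + len('*')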
Phase 4: iterating lines
# we need to skip twice the number of first_lines, plus one header line
# and one top node line
start = 2 * len(first_lines) + 2
# keeps track of current indentation
indent = 0
# keeps track of the current node that we are reading
gnx = topgnx
# list of lines for the current node
body = nodes.body[gnx]
for i, line in flines[start:]:
    # child nodes may, if necessary, shortcut this loop using continue,
    # or let the line fall through to the end of the loop
    ... handle verbatim lines
    ... handle indentation
    ... handle at-all
    ... handle at-others
    ... handle at-doc
    ... handle at-code
    ... handle sections
    ... handle node start
    ... handle at-leo line
    ... handle directives
    ... handle in-doc parts
    # nothing special about this line, let's append it to the current body
    body.append(line)
if i + 1 < len(flines):
    # x is an (index, line) pair, so the line itself is x[1]
    nodes.body[topgnx].extend('@last %s' % x[1] for x in flines[i+1:])
All those handle ... parts are subnodes of this for loop. There is nothing special about them: each one starts with a check of whether it applies to the current line. If it does, it does its work and ends with a continue statement. If the current line is not handled by any of the handle ... nodes, it is simply appended to the current body.
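For example, a minimal sketch of what the handle node start subnode might look like (the real subnodes are not listed in this part):

m = patterns.node_start.match(line)
if m:
    # a new node sentinel: register the node and redirect
    # the following lines into its body
    gnx = set_node(m)
    body = nodes.body[gnx]
    continue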
When we encounter the line with the closing leo sentinel (“@-leo”), we break out of the loop, and the remaining lines (if any) are appended to the top-level node as @last lines.
Phase 5: yielding results
Finally we can just dump all the collected data in outline order. The level is decreased by one so that the top node of the derived file ends up at level 0.
for gnx in nodes.gnxes:
    b = ''.join(nodes.body[gnx])
    h = nodes.head[gnx]
    lev = nodes.level[gnx].pop(0)
    yield gnx, h, b, lev - 1
Extending load_derived_file to produce the sequence of node tuples suitable for building a LeoTreeModel is straightforward. We just need to collect data about parent/child relations and calculate the subtree size for each node.
def ltm_from_derived_file(fname):
    '''Reads external file and returns tree model.'''
    with open(fname, 'rt') as inp:
        lines = inp.read().splitlines(True)
    parents = defaultdict(list)
    def viter():
        stack = [None for i in range(256)]
        lev0 = 0
        for gnx, h, b, lev in load_derived_file(lines):
            ps = parents[gnx]
            cn = []
            s = [1]
            stack[lev] = [gnx, h, b, lev, s, ps, cn]
            if lev:
                # add parent gnx to the list of parents
                ps.append(stack[lev - 1][0])
                if lev > lev0:
                    # parent level is lev0
                    # add this gnx to the list of children in parent
                    stack[lev0][6].append(gnx)
                else:
                    # parent level is one above
                    # add this gnx to the list of children in parent
                    stack[lev - 1][6].append(gnx)
            lev0 = lev
            # increase size of every ancestor node in the current stack
            for x in stack[:lev]:
                x[4][0] += 1
            # finally yield this node
            yield stack[lev]
    nodes = tuple(viter())
    return nodes2treemodel(nodes)
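Hypothetical usage (the file name is invented; any external file written by Leo will do):

ltm = ltm_from_derived_file('src/mymodule.py')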
To be continued
In the next part I will write about adding to the data model some methods that implement outline commands.