This is the second part of the story about using functional programming for prototyping (the first part is here). Originally, my intention was to document the whole process of prototyping: not to finish the prototype and then describe it afterwards, but to show every step in my work-flow. However, soon after I started coding I forgot about documenting each step, and after a few hours my prototype was complete.
All JavaScript files are successfully imported in less than a second (on average 626.3 ms over 10 imports). After importing all files ten times, Leo is still responsive. It still remains to clean up and document my prototype. All that I can say now about the prototyping process is that it was a pleasant and easy activity. While developing it I needed to look a few times into the Python documentation for re, bisect, timeit, …
Prototype
Indexing source file
After the initial idea to split the source into lines and operate on them one by one, I ended up with manual splitting; the function is named manual_splitting, as opposed to using the splitlines method. However, now I think a better name for this function would be create_source_index or something similar.
def manual_splitting(src=None):
    src = src or get_source()
    chunks = []
    start = -1
    end = len(src)
    pat = re.compile('[\'"]|/\\*|//')
    @others
    dd = {
        '"' : lambda i: do_string(i, '"'),
        "'" : lambda i: do_string(i, "'"),
        '//': do_comment_sl,
        '/*': do_comment_ml,
    }
    while start < end:
        m = pat.search(src, start + 1)
        if m:
            start = do_code(m.start())
            start = dd[m.group(0)](start)
        else:
            start = do_code(end)
    return (src, tuple(chunks))
test(manual_splitting, number=100)
and the result in the Log pane:
average 13.2 ms
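The test helper itself is not shown in this part; a minimal stand-in based on timeit might look roughly like this (an assumption of mine, not necessarily the exact helper used above):

import timeit

# hypothetical stand-in for the test() helper used above: run the callable
# `number` times and report the average duration in the Log pane
def test(f, number=10):
    total = timeit.timeit(f, number=number)
    g.es('average %.1f ms' % (total * 1000.0 / number))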
This function makes an index of source file chunks of three different types:
- code
- string
- comment
It returns an array of tuples (chunk_type, start_index, end_index). At first I considered chunks to be single lines, but later it became clear that it is better to allow for multi-line chunks.
The helper functions defined inside this function are very simple:
def do_code(i):
    if i > 0 and i > start + 1:
        chunks.append(('code', start, i))
    return i

def do_string(i, q):
    while True:
        i = src.find(q, i + 1)
        # stop at the first unescaped closing quote (or when there is none)
        if i < 0 or src[i - 1] != '\\':
            break
    if i < 0:
        chunks.append(('string', start, end))
        return end
    chunks.append(('string', start, i + 1))
    return i + 1

def do_comment_sl(i):
    i = src.find('\n', i + 2)
    if i < 0:
        i = end
    while i + 1 < end and src[i + 1] == '\n':
        i += 1
    chunks.append(('comment', start, i))
    return i

def do_comment_ml(i):
    i = src.find('*/', i + 2) + 1
    if i < 1:
        i = end
    while i + 1 < end and src[i + 1] == '\n':
        i += 1
    chunks.append(('comment', start, i))
    return i
As you can see, they are very similar; I really just copy/pasted them, adjusting each a bit. That may be considered a violation of the DRY principle, but it is very localized (all of them reside inside one function). The main argument for the DRY principle is that if you later change one part of duplicated code, there is a possibility that you will forget to change the other parts. But since all these repeated parts are very close to each other, the violation becomes less dangerous.
The source file is searched with a regex pattern that catches the beginning of:
- single line comment
- multi line comment
- single quote string
- double quote string
Searching starts in code mode, and at each beginning of another chunk type, the helper function for that type is called with the starting index as its first argument. Every helper function appends the chunk to the chunks array and returns the end of the chunk.
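To make the shape of the index more concrete, here is a small illustration; the snippet is made up, and it assumes manual_splitting from above is in scope:

# illustration only: run a tiny made-up snippet through the indexer
js_src = "var x = 1; // set x\nvar s = 'a';"
src, chunks = manual_splitting(js_src)
# chunks is a tuple of (chunk_type, start_index, end_index) triples in
# source order; for this snippet: a 'code' chunk, a 'comment' chunk,
# another 'code' chunk and a 'string' chunk
for kind, a, b in chunks:
    print(kind, repr(src[max(a, 0):b]))   # max(a, 0): the very first chunk starts at -1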
Searching for chunks
After indexing the source file, we need a function that finds the chunk for any given position in the source file. I was thinking of using bisect from the standard library, but it turned out to be faster to re-implement it than to figure out how to adjust the chunks array to comply with the bisect rules. Now that I write this, I realize it could be achieved very easily: the only thing needed is to change the order of the tuple items from (type, start, end) to (start, end, type), and to use the bisect functions not on an integer position but on a tuple created from the integer argument of this function. But that may be left for the polishing phase.
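For the record, here is a minimal sketch of that bisect-based variant. It is an assumption of how it could look, not part of the prototype, and it expects chunks stored as (start, end, type) tuples sorted by start index:

import bisect

def find_chunk_bisect(i, chunks):
    # insertion point just past the last chunk whose start is <= i
    j = bisect.bisect_right(chunks, (i, float('inf'), '')) - 1
    if j >= 0:
        start, end, kind = chunks[j]
        if start <= i < end:
            return j
    return -1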
def find_chunk(i, chunks, low=0):
    isFound = lambda m: chunks[m][1] <= i and chunks[m][2] > i
    if isFound(low): return low
    upp = len(chunks) - 1
    while upp > low:
        med = (upp + low) // 2
        k, a, b = chunks[med]
        if isFound(med): return med
        if i < chunks[med][1]:
            upp = med - 1
        else:
            low = med + 1
    return upp if isFound(upp) else -1

# and here is a specialized version that finds the nearest code chunk
# after the given index
def find_code_chunk(i, chunks, low=0):
    j = find_chunk(i, chunks, low)
    if j < 0: return j
    nchunks = len(chunks) - 1
    while chunks[j][0] != 'code':
        j += 1
        if j > nchunks:
            return -1
    return j
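A quick usage sketch, with a hand-made chunks tuple in the prototype's (type, start, end) order (the values are invented just for illustration):

chunks = (('code', 0, 10), ('comment', 10, 25), ('code', 25, 40), ('string', 40, 50))
find_chunk(30, chunks)       # -> 2, position 30 falls inside the third chunk
find_chunk(60, chunks)       # -> -1, past the end of the index
find_code_chunk(12, chunks)  # -> 2, the nearest code chunk at or after the comment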
Searching for functions
Now that we have functions to locate chunks, we can pursue our next target: finding structural blocks of source code. In JavaScript syntax the brackets '()', '[]', '{}' must be properly nested. Some of these characters may appear in comment and string chunks, but they don't count; the only ones that matter are those in code chunks. That was the reason for chunking the source code in the first place.
def find_next_block(i, src, chunks):
    # searching for the opening character
    pat = re.compile(r'\(|\{|\[')
    end = len(src)
    j = find_code_chunk(i, chunks)
    if j < 0: return i, end
    c_chunks = lambda: (x for x in chunks[j:] if x[0] == 'code')
    for k, a, b in c_chunks():
        m = pat.search(src, i)
        if m: break
    else:
        return i, end
    opening_ch = m.group(0)
    closing_ch = {'{': '}', '[': ']', '(': ')'}[opening_ch]
    lev = 1
    ii = m.end()
    for k, a, b in c_chunks():
        while ii < b:
            if src[ii] == opening_ch:
                lev += 1
            elif src[ii] == closing_ch:
                lev -= 1
            ii += 1
            if lev: continue
            return m.start(), ii
    return m.start(), end
First we search for the opening character (one of '(', '[', '{'). If there are no block openings, we are finished; in that case the last block spans from the given position to the end of the file.
If we find an opening character, we continue searching from the character following it, counting nesting levels as we encounter brackets and skipping all non-code chunks. When we find the closing character that brings lev back down to zero, we have the end index, so we return the starting and ending index of the code block.
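Here is a small usage sketch of find_next_block, assuming the functions above are in scope; the snippet is made up:

js = "var f = function (a) { return a + 1; };"
src, chunks = manual_splitting(js)
i = js.index('function') + len('function')
a, b = find_next_block(i, src, chunks)   # the argument block: src[a:b] == '(a)'
a, b = find_next_block(b, src, chunks)   # the body block: src[a:b] == '{ return a + 1; }'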
There are several ways to define a function in JavaScript. There are so-called named function declarations, of the form function name(arguments) { body }.
But there are also anonymous functions and function expressions, which don't have a name but are assigned to a variable or passed as an argument to some other function. What is common to both cases is that they contain the keyword 'function', followed by a single parenthesized block of arguments, followed by a function body, which is a block that starts with '{' and ends with '}'. To find all function definitions we need to search all code chunks for the word 'function'. Whenever we find one, we skip the next block (the arguments); the very next block contains the body of the function definition, and the end of this block is the end of the function definition.
def find_next_function(i):
    pat = re.compile(r'\bfunction\b')
    m = pat.search(src, i)
    if m:
        ii = m.end()
        a, b = find_next_block(ii, src, chunks)
        a, b = find_next_block(b, src, chunks)
        yield m.start(), b
The beginning can be the start of the keyword function that we have already found, but it may also be somewhere before it, especially if the definition follows a comment chunk. Also, if it is a function expression, it is often followed by a semicolon. Finally, it would be useful to yield not only the start and end index, but a headline for the node as well.
It would be natural to continue the search right after the end of the found block. Here is the get_all_functions definition:
def get_all_functions(src, chunks):
    pat = re.compile(r'\bfunction\b([^(]*)\(')
    pat2 = re.compile(r'\bObject.definePropert(?:y|ies)\s*\(([^.]+)(?:[.]prototype)?\s*,[^{]*\{')
    end = len(src)
    @others
    i = 0
    m1 = m2 = None
    m1 = pat.search(src, i)
    m2 = pat2.search(src, i)
    m1, m2, m = mi(i)
    for j, x in enumerate(chunks):
        k, a, b = x
        if k != 'code' or i > b: continue
        while m:
            i = start_of_func(m, j)
            ii = m.end() + 1 if m is m1 else m.end() - 1
            a1, b1 = find_next_block(ii, src, chunks)
            if m is m2:
                b2 = src.find(')', b1)
                if b2 > 0:
                    b1 = b2 + 1
            b1 = eat_ch(b1, ';')
            b1 = eat_ch(b1, '\n')
            yield headline_for_f(m, src), i, b1
            i = b1 + 1
            m1, m2, m = mi(i)
            if not m or m.start() > b: break
There is more in this function. I discovered two more kinds of code blocks that should naturally go into separate nodes: one is a call to Object.defineProperty, and the other is Object.defineProperties. I realized they would be best put in separate nodes after trying to import files with the previous version of the function. In the final version displayed above, two different regex patterns are used. When we search the source with one and the other, we can have no match at all, a single match, or two matches. The patterns are designed to return the possible function name in their first group. Here is a helper function that takes both match results and returns a tuple of three matches; the third one is the nearest match, which is the next to be processed.
def mi(i):
    _m1 = pat.search(src, i) if m1 and m1.end() < i else m1
    _m2 = pat2.search(src, i) if m2 and m2.end() < i else m2
    if not _m1:
        m = _m2
    elif not _m2:
        m = _m1
    else:
        m = _m1 if _m1.start() < _m2.start() else _m2
    return _m1, _m2, m
When the search has passed a match (m.end() < i), we search further for the corresponding pattern. If either of the matches is None, the function returns the other one as the nearest match; if both patterns matched, it returns the nearer of the two.
Pay attention to this pattern. Helper functions defined inside a function can access variables in the outer function's scope, but they can't rebind them. As a consequence, the local variables of the top-level function need not be passed to the helper functions, because they are directly visible in the inner scope. However, if a helper function needs to change an outer variable, the only way is to return the new value and let the outer function update the variable itself. This is a very powerful pattern!
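Here is a tiny, generic illustration of that pattern, unrelated to the prototype itself:

def outer(src):
    pos = 0                          # local state of the top-level function

    def skip_spaces(i):
        # reads `src` from the enclosing scope, but cannot rebind `pos`;
        # instead it returns the new value
        while i < len(src) and src[i].isspace():
            i += 1
        return i

    pos = skip_spaces(pos)           # the outer function updates its own variable
    return pos

print(outer('   abc'))               # -> 3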
Here are a few more helper functions:
# advances index if next character matches
eat_ch = lambda i, ch: i + 1 if i < end and src[i] == ch else i

# searches backward for line break
def prev_nl(i):
    while i > 0 and src[i] != '\n':
        i -= 1
    return i

# searches backward for the start of function definition that should be
# in a single vnode.
def start_of_func(m, j):
    i = m.start(0)
    i = prev_nl(i)
    j = find_chunk(i, chunks, j)
    return prev_non_empty_code(i, j)

# helper function that searches backward from given position for the
# previous non-empty code chunk
def prev_non_empty_code(i, cn):
    while cn >= 0:
        k, a, b = chunks[cn]
        if k == 'string':
            i = b + 1
            break
        if k == 'comment':
            cn -= 1
            i = a - 1
            continue
        while i - 1 >= a and src[i - 1].isspace():
            i -= 1
        if i > a: break
        cn -= 1
    return max(0, i)

# helper function that tries to guess best headline for this node
# (prev_ch and prev_ws are two further small helpers of the prototype,
# not shown here)
def headline_for_f(m, src):
    if m.group(0).startswith('Object.defineProperty'):
        s = m.group(0)
        i = max(s.find('"'), s.find("'"))
        if i > -1:
            s = s[i + 1:s.find(s[i], i + 1)]
        return m.group(1).split(',', 1)[0] + ':' + s if m.group(1) else s
    fnm = m.group(1).strip()
    isQ = lambda x: src[x] in ('"', "'")
    if not fnm:
        j = prev_ch(m.start(), src)
        if src[j] in ('=', ':'):
            j = prev_ch(j, src)
        k = prev_ws(j, src) + 1
        if isQ(j): j -= 1
        if isQ(k): k += 1
        fnm = src[k: j + 1]
    return fnm.replace('.prototype.', ':')
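To make the headline guessing more concrete, here is a hedged illustration on a made-up line of the kind found in the RPG Maker sources mentioned at the end of the post; it assumes headline_for_f and pat2 from above are in scope:

sample = "Object.defineProperty(Game_Actor.prototype, 'level', {"
m = pat2.search(sample)
headline_for_f(m, sample)   # -> 'Game_Actor:level'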
Building outline
Now that we have a generator that extracts function definitions from the source code, we can build an outline tree from the extracted regions.
def import_file(_v0, fname):
    # adding top node, the root of this file
    h = g.os_path_basename(fname)
    v0 = insV(_v0, h, '@others\n')
    src = open(fname, 'rt').read()
    src, chunks = manual_splitting(src)
    # we need access to the current and the subsequent function block,
    # that is why we zip the generator with itself advanced one step
    funcs = zip_with_next(get_all_functions(src, chunks))
    # a little bit of cleaning headlines and finding common parts
    # for creating organizer nodes. All consecutive nodes whose
    # headlines share the same beginning go into one organizer node
    pat = re.compile('[^.:]+')
    h1 = ''
    i = 0
    for fcur, fnxt in funcs:
        h, i, j = fcur
        m = pat.search(h)
        if m:
            _h1 = m.group(0)
        else:
            _h1 = h
        b = src[i:fnxt[1]] if fnxt else src[i:]
        if _h1 != h1:
            v = insV(v0, _h1, '')
            h1 = _h1
        # still didn't figure out why this is necessary, but it is
        if b.startswith('\n\n'):
            b = b[2:]
        insV(v, h, b)
    else:
        if i == 0:
            v0._bodyString = src
And as always, there are a few helper functions:
def zip_with_next(it):
    prev = nxt = None
    for nxt in it:
        if prev:
            yield prev, nxt
        prev = nxt
    if nxt:
        yield nxt, None

# and a factory for vnodes
def insV(v0, h, b):
    v = leoNodes.VNode(context=c)
    v._headString = h
    v._bodyString = b
    v0.children.append(v)
    v.parents.append(v0)
    return v
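Finally, to tie things together, here is a hedged sketch of how the importer might be invoked from a Leo script; it assumes the usual Leo script globals c and g, that the prototype functions above are in scope, and the file path is just a placeholder:

# zip_with_next pairs each extracted function with the next one, e.g.
#   list(zip_with_next(iter([1, 2, 3])))  ->  [(1, 2), (2, 3), (3, None)]
# so each node body can extend up to the start of the next function

import_file(c.p.v, '/path/to/rpg_objects.js')
c.redraw()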
And the final profiling test on the very same file rpg_objects.js, the one with which Leo's current implementation of c.importCommands was struggling for more than 30 s, gave the following result:
10 calls, average time 233.1 ms
Isn’t that a success? Not to mention the fact that the files were imported in a far better form than the original Leo import gives.