This is the second part of the story about using functional programming for prototyping (the first part is here). Originally, my intention was to document the whole process of prototyping: not to finish the prototype and then describe it afterwards, but to show every step in my work-flow. However, soon after I started coding I forgot about documenting each step, and after a few hours my prototype was complete.
All JavaScript files are successfully imported in less than a second (on average 626.3 ms over 10 imports). After importing all files ten times, Leo is still responsive. It still remains to clean up and document my prototype. All that I can say now about the prototyping process is that it was a pleasant and easy activity. While developing it I needed to look a few times into the Python documentation for re, bisect, timeit, …
Prototype
Indexing source file
After the initial idea to split the source into lines and operate on them one by one, I ended up with manual splitting; the function is named manual_splitting, as opposed to using the splitlines method. However, now I think a better name for this function would be create_source_index or something similar.
def manual_splitting(src=None):
    src = src or get_source()
    chunks = []
    start = -1
    end = len(src)
    pat = re.compile('[\'"]|/\\*|//')
    @others
    dd = {
        '"' : lambda i: do_string(i, '"'),
        "'" : lambda i: do_string(i, "'"),
        '//': do_comment_sl,
        '/*': do_comment_ml,
    }
    while start < end:
        m = pat.search(src, start + 1)
        if m:
            start = do_code(m.start())
            start = dd[m.group(0)](start)
        else:
            start = do_code(end)
    return (src, tuple(chunks))
test(manual_splitting, number=100)
and the result in the Log pane:
average 13.2 ms
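The test helper itself is not shown in this part; a minimal stand-in based on timeit might look roughly like this (an assumption of mine, not necessarily the exact helper used above):

import timeit

# hypothetical stand-in for the test() helper used above: run the callable
# `number` times and report the average duration in the Log pane
def test(f, number=10):
    total = timeit.timeit(f, number=number)
    g.es('average %.1f ms' % (total * 1000.0 / number))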
This function makes an index of source file chunks of three different types:
- code
- string
- comment
It returns an array of tuples (chunk_type, start_index, end_index). At first I considered chunks to be single lines, but later it became clear that it is better to allow for multi-line chunks.
The helper functions defined inside this function are very simple:
def do_code(i):
    if i > 0 and i > start + 1:
        chunks.append(('code', start, i))
    return i

def do_string(i, q):
    while True:
        i = src.find(q, i + 1)
        # stop at the first unescaped closing quote (or when there is none)
        if i < 0 or src[i - 1] != '\\':
            break
    if i < 0:
        chunks.append(('string', start, end))
        return end
    chunks.append(('string', start, i + 1))
    return i + 1

def do_comment_sl(i):
    i = src.find('\n', i + 2)
    if i < 0:
        i = end
    while i + 1 < end and src[i + 1] == '\n':
        i += 1
    chunks.append(('comment', start, i))
    return i

def do_comment_ml(i):
    i = src.find('*/', i + 2) + 1
    if i < 1:
        i = end
    while i + 1 < end and src[i + 1] == '\n':
        i += 1
    chunks.append(('comment', start, i))
    return i
As you can see, they are very similar; I really just copy/pasted them, adjusting each a bit. That may be considered a violation of the DRY principle, but it is very localized (all of them reside inside one function). The main argument for the DRY principle is that if you later change one part of duplicated code, there is a possibility that you will forget to change the other parts. But since all these repeated parts are very close to each other, the violation becomes less dangerous.
The source file is searched with a regex pattern that catches the beginning of:
- single line comment
- multi line comment
- single quote string
- double quote string
Searching starts in code mode, and at each beginning of another chunk type, the helper function for that type is called with the starting index as its first argument. Every helper function appends the chunk to the chunks array and returns the end of the chunk.
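To make the shape of the index more concrete, here is a small illustration; the snippet is made up, and it assumes manual_splitting from above is in scope:

# illustration only: run a tiny made-up snippet through the indexer
js_src = "var x = 1; // set x\nvar s = 'a';"
src, chunks = manual_splitting(js_src)
# chunks is a tuple of (chunk_type, start_index, end_index) triples in
# source order; for this snippet: a 'code' chunk, a 'comment' chunk,
# another 'code' chunk and a 'string' chunk
for kind, a, b in chunks:
    print(kind, repr(src[max(a, 0):b]))   # max(a, 0): the very first chunk starts at -1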
Searching for chunks
After indexing the source file, we need a function that finds the chunk for any given position in the source file. I was thinking of using bisect from the standard library, but it turned out to be faster to re-implement it than to figure out how to adjust the chunks array to comply with the bisect rules. Now that I write this, I realize it could be achieved very easily: the only thing needed is to change the order of the tuple items from (type, start, end) to (start, end, type), and to use the bisect functions not on an integer position but on a tuple created from the integer argument of this function. But that may be left for the polishing phase.
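For the record, here is a minimal sketch of that bisect-based variant. It is an assumption of how it could look, not part of the prototype, and it expects chunks stored as (start, end, type) tuples sorted by start index:

import bisect

def find_chunk_bisect(i, chunks):
    # insertion point just past the last chunk whose start is <= i
    j = bisect.bisect_right(chunks, (i, float('inf'), '')) - 1
    if j >= 0:
        start, end, kind = chunks[j]
        if start <= i < end:
            return j
    return -1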
def find_chunk(i, chunks, low=0):
    isFound = lambda m: chunks[m][1] <= i and chunks[m][2] > i
    if isFound(low): return low
    upp = len(chunks) - 1
    while upp > low:
        med = (upp + low) // 2
        k, a, b = chunks[med]
        if isFound(med): return med
        if i < chunks[med][1]:
            upp = med - 1
        else:
            low = med + 1
    return upp if isFound(upp) else -1

# and here is a specialized version that finds the nearest code chunk
# after the given index
def find_code_chunk(i, chunks, low=0):
    j = find_chunk(i, chunks, low)
    if j < 0: return j
    nchunks = len(chunks) - 1
    while chunks[j][0] != 'code':
        j += 1
        if j > nchunks:
            return -1
    return j
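A quick usage sketch, with a hand-made chunks tuple in the prototype's (type, start, end) order (the values are invented just for illustration):

chunks = (('code', 0, 10), ('comment', 10, 25), ('code', 25, 40), ('string', 40, 50))
find_chunk(30, chunks)       # -> 2, position 30 falls inside the third chunk
find_chunk(60, chunks)       # -> -1, past the end of the index
find_code_chunk(12, chunks)  # -> 2, the nearest code chunk at or after the comment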
Searching for functions
Now that we have functions to locate chunks, we can pursue our next target: finding structural blocks of source code. In JavaScript syntax the brackets '()', '[]', '{}' must be properly nested. Some of these characters may appear in comment and string chunks, but they don't count; the only ones that matter are those in code chunks. That was the reason for chunking the source code in the first place.
def find_next_block(i, src, chunks):
    # searching for the opening character
    pat = re.compile(r'\(|\{|\[')
    end = len(src)
    j = find_code_chunk(i, chunks)
    if j < 0: return i, end
    c_chunks = lambda: (x for x in chunks[j:] if x[0] == 'code')
    for k, a, b in c_chunks():
        m = pat.search(src, i)
        if m: break
    else:
        return i, end
    opening_ch = m.group(0)
    closing_ch = {'{': '}', '[': ']', '(': ')'}[opening_ch]
    lev = 1
    ii = m.end()
    for k, a, b in c_chunks():
        while ii < b:
            if src[ii] == opening_ch:
                lev += 1
            elif src[ii] == closing_ch:
                lev -= 1
            ii += 1
            if lev: continue
            return m.start(), ii
    return m.start(), end
First we search for the opening character (one of '(', '[', '{'). If there are no block openings, we are finished; in that case the last block spans from the given position to the end of the file.
If we find an opening character, we continue searching from the character following it, counting nesting levels as we encounter brackets and skipping all non-code chunks. When we find the closing character that brings lev back down to zero, we have the end index, so we return the starting and ending index of the code block.
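Here is a small usage sketch of find_next_block, assuming the functions above are in scope; the snippet is made up:

js = "var f = function (a) { return a + 1; };"
src, chunks = manual_splitting(js)
i = js.index('function') + len('function')
a, b = find_next_block(i, src, chunks)   # the argument block: src[a:b] == '(a)'
a, b = find_next_block(b, src, chunks)   # the body block: src[a:b] == '{ return a + 1; }'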
There are several ways to define a function in JavaScript. There are so-called named function declarations, of the form function name(arguments) { body }.
But there are also anonymous functions and function expressions, which don't have a name but are assigned to a variable or passed as an argument to some other function. What is common to both cases is that they contain the keyword 'function', followed by a single parenthesized block of arguments, followed by a function body, which is a block that starts with '{' and ends with '}'. To find all function definitions we need to search all code chunks for the word 'function'. Whenever we find one, we skip the next block (the arguments); the very next block contains the body of the function definition, and the end of this block is the end of the function definition.
def find_next_function(i):
    pat = re.compile(r'\bfunction\b')
    m = pat.search(src, i)
    if m:
        ii = m.end()
        a, b = find_next_block(ii, src, chunks)
        a, b = find_next_block(b, src, chunks)
        yield m.start(), b
The beginning can be the start of the keyword function that we have already found, but it may also be somewhere before it, especially if the definition follows a comment chunk. Also, if it is a function expression, it is often followed by a semicolon. Finally, it would be useful to yield not only the start and end index, but a headline for the node as well.
It would be natural to continue the search right after the end of the found block. Here is the get_all_functions definition:
def get_all_functions(src, chunks):
    pat = re.compile(r'\bfunction\b([^(]*)\(')
    pat2 = re.compile(r'\bObject.definePropert(?:y|ies)\s*\(([^.]+)(?:[.]prototype)?\s*,[^{]*\{')
    end = len(src)
    @others
    i = 0
    m1 = m2 = None
    m1 = pat.search(src, i)
    m2 = pat2.search(src, i)
    m1, m2, m = mi(i)
    for j, x in enumerate(chunks):
        k, a, b = x
        if k != 'code' or i > b: continue
        while m:
            i = start_of_func(m, j)
            ii = m.end() + 1 if m is m1 else m.end() - 1
            a1, b1 = find_next_block(ii, src, chunks)
            if m is m2:
                b2 = src.find(')', b1)
                if b2 > 0:
                    b1 = b2 + 1
            b1 = eat_ch(b1, ';')
            b1 = eat_ch(b1, '\n')
            yield headline_for_f(m, src), i, b1
            i = b1 + 1
            m1, m2, m = mi(i)
            if not m or m.start() > b: break
There is more in this function. I discovered two more kinds of code blocks that should naturally go into separate nodes: one is a call to Object.defineProperty, and the other is Object.defineProperties. I realized they would be best put in separate nodes after trying to import files with the previous version of the function. In the final version displayed above, two different regex patterns are used. When we search the source with one and the other, we can have no match at all, a single match, or two matches. The patterns are designed to return the possible function name in their first group. Here is a helper function that takes both match results and returns a tuple of three matches; the third one is the nearest match, which is the next to be processed.
def mi(i):
    _m1 = pat.search(src, i) if m1 and m1.end() < i else m1
    _m2 = pat2.search(src, i) if m2 and m2.end() < i else m2
    if not _m1:
        m = _m2
    elif not _m2:
        m = _m1
    else:
        m = _m1 if _m1.start() < _m2.start() else _m2
    return _m1, _m2, m
When the search has passed a match (m.end() < i), we search further for the corresponding pattern. If either of the matches is None, the function returns the other one as the nearest match; if both patterns matched, it returns the nearer of the two.
Pay attention to this pattern. Helper functions defined inside a function can access variables in the outer function's scope, but they can't rebind them. As a consequence, the local variables of the top-level function need not be passed to the helper functions, because they are directly visible in the inner scope. However, if a helper function needs to change an outer variable, the only way is to return the new value and let the outer function update the variable itself. This is a very powerful pattern!
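Here is a tiny, generic illustration of that pattern, unrelated to the prototype itself:

def outer(src):
    pos = 0                          # local state of the top-level function

    def skip_spaces(i):
        # reads `src` from the enclosing scope, but cannot rebind `pos`;
        # instead it returns the new value
        while i < len(src) and src[i].isspace():
            i += 1
        return i

    pos = skip_spaces(pos)           # the outer function updates its own variable
    return pos

print(outer('   abc'))               # -> 3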
Here are a few more helper functions:
# advances index if next character matches
eat_ch = lambda i, ch: i + 1 if i < end and src[i] == ch else i

# searches backward for line break
def prev_nl(i):
    while i > 0 and src[i] != '\n':
        i -= 1
    return i

# searches backward for the start of function definition that should be
# in a single vnode.
def start_of_func(m, j):
    i = m.start(0)
    i = prev_nl(i)
    j = find_chunk(i, chunks, j)
    return prev_non_empty_code(i, j)

# helper function that searches backward from given position for the
# previous non-empty code chunk
def prev_non_empty_code(i, cn):
    while cn >= 0:
        k, a, b = chunks[cn]
        if k == 'string':
            i = b + 1
            break
        if k == 'comment':
            cn -= 1
            i = a - 1
            continue
        while i - 1 >= a and src[i - 1].isspace():
            i -= 1
        if i > a: break
        cn -= 1
    return max(0, i)

# helper function that tries to guess best headline for this node
# (prev_ch and prev_ws are two further small helpers of the prototype,
# not shown here)
def headline_for_f(m, src):
    if m.group(0).startswith('Object.defineProperty'):
        s = m.group(0)
        i = max(s.find('"'), s.find("'"))
        if i > -1:
            s = s[i + 1:s.find(s[i], i + 1)]
        return m.group(1).split(',', 1)[0] + ':' + s if m.group(1) else s
    fnm = m.group(1).strip()
    isQ = lambda x: src[x] in ('"', "'")
    if not fnm:
        j = prev_ch(m.start(), src)
        if src[j] in ('=', ':'):
            j = prev_ch(j, src)
        k = prev_ws(j, src) + 1
        if isQ(j): j -= 1
        if isQ(k): k += 1
        fnm = src[k: j + 1]
    return fnm.replace('.prototype.', ':')
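To make the headline guessing more concrete, here is a hedged illustration on a made-up line of the kind found in the RPG Maker sources mentioned at the end of the post; it assumes headline_for_f and pat2 from above are in scope:

sample = "Object.defineProperty(Game_Actor.prototype, 'level', {"
m = pat2.search(sample)
headline_for_f(m, sample)   # -> 'Game_Actor:level'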
Building outline
Now that we have a generator that extracts function definitions from the source code, we can build an outline tree from the extracted regions.
def import_file(_v0, fname):
    # adding top node, the root of this file
    h = g.os_path_basename(fname)
    v0 = insV(_v0, h, '@others\n')
    src = open(fname, 'rt').read()
    src, chunks = manual_splitting(src)
    # we need access to the current and the subsequent function block,
    # that is why we zip the generator with itself advanced one step
    funcs = zip_with_next(get_all_functions(src, chunks))
    # a little bit of cleaning headlines and finding common parts
    # for creating organizer nodes. All consecutive nodes whose
    # headlines share the same beginning go into one organizer node
    pat = re.compile('[^.:]+')
    h1 = ''
    i = 0
    for fcur, fnxt in funcs:
        h, i, j = fcur
        m = pat.search(h)
        if m:
            _h1 = m.group(0)
        else:
            _h1 = h
        b = src[i:fnxt[1]] if fnxt else src[i:]
        if _h1 != h1:
            v = insV(v0, _h1, '')
            h1 = _h1
        # still didn't figure out why this is necessary, but it is
        if b.startswith('\n\n'):
            b = b[2:]
        insV(v, h, b)
    else:
        if i == 0:
            v0._bodyString = src
And as always, there are a few helper functions:
def zip_with_next(it):
    prev = nxt = None
    for nxt in it:
        if prev:
            yield prev, nxt
        prev = nxt
    if nxt:
        yield nxt, None

# and a factory for vnodes
def insV(v0, h, b):
    v = leoNodes.VNode(context=c)
    v._headString = h
    v._bodyString = b
    v0.children.append(v)
    v.parents.append(v0)
    return v
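Finally, to tie things together, here is a hedged sketch of how the importer might be invoked from a Leo script; it assumes the usual Leo script globals c and g, that the prototype functions above are in scope, and the file path is just a placeholder:

# zip_with_next pairs each extracted function with the next one, e.g.
#   list(zip_with_next(iter([1, 2, 3])))  ->  [(1, 2), (2, 3), (3, None)]
# so each node body can extend up to the start of the next function

import_file(c.p.v, '/path/to/rpg_objects.js')
c.redraw()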
And the final profiling test on the very same file rpg_objects.js, the one with which Leo's current implementation of c.importCommands was struggling for more than 30 s, gave the following result:
10 calls, average time 233.1 ms
Isn’t that a success? Not to mention the fact that the files were imported in a far better form than the original Leo import gives.