Celandine part 1 – Basics and Basis

Current code base available here

“I’d quite like to write a very simple computational chemistry program in Julia,” thought lots of people all the time, “I wonder how I’d go about that.”

Step one is to find out if the problem has been solved before. Not quite, but there is an open-source computational chemistry program in Python. So I’m not so much re-inventing the wheel as re-branding the Rotating Axle-Based Object Transport Circle. Still, it might be fun to try, and more importantly, we can follow the existing methodology as a guide.

It would make sense for step two to be “look up a bunch of sources on how to do computational chemistry,” and indeed this is what I did first. But vague memories of university would have been sufficient to provide the answer: there are a whole bunch of different approaches, and each will need its own algorithmic implementation (you could probably get that much just by guessing, even). However, they will share common features: input format, possible outputs, basis sets, and the such. Therefore this first post will, very conveniently for a lazy blogger with a day job and other hobbies, not in fact contain any computational chemistry. Step two, then, is to implement a basic structure that can read and display molecular geometries.

moleculeReader.jl: module for reading molecular structures from files into arrays
A molecular structure is, fundamentally, a 4-column 2D array, the first column containing atomic number and the other three giving (x,y,z) coordinates

function convertTextBlockToArray(text::Array)
    #Given the subset of text in the file which constructs molecules,
    #convert to an array of integers
    #This creates a 1D array of vectors
    A = [lineConvert(l) for l in text]
    #This converts it to a 2D array
    A2 = permutedims(hcat(A...), [2,1])
    #Obviously there are other ways to reach this result
    return A2
end

To explain how this works: the line creating A uses a comprehension to create a vector of whatever lineconvert is returning. The structure “hcat(A…)” concatenates all the contents of A together into a single 2D array instead. permutedims is Julia’s generalisation of transposition; permutedims(Array, [2,1]) is equivalent to transpose(Array). The result is a 2D array of converted lines from the text. We could also obtain this result by iterating over the array, converting and concatenating each line. But this would look a lot uglier, and since pretty much all reasonable molecules have less than 100 atoms, this slower method is fine.

function lineConvert(line)
    #Converts a single line from string to numerics
    a = split(line, ',')
    b = map(strip, a)
    c = map(x->parse(Float64, x), b)
    return c
end

This can be done in one line of stacked maps, but there’s not really any reason to and it’s very hard to read it that way. We have to use the somewhat ugly lambda expression “x->parse(Float64,x)” rather than just “parse” in order to force parse to treat all numbers as Float64. This is still somewhat annoying since we’d ideally like to treat the atomic numbers as integers, but that would mean using a mixed array. It seems acceptable to use floating-point for atomic numbers here, since they’re not actually used for any arithmetic until much later.

This function could still stand to be improved. Ideally it ought to be possible to freely switch between writing the atomic numbers in the input file, and writing the symbols (H, He, C, Br, etc). And a proper version of this would probably find some way to use integers for atomic numbers.

function readFileToUniversalInformation(filepath)
    #Read the file in the given location
    #Return an object containing the information set in the header
end

This is the other big TODO in moleculeReader. The top paragraph of an input file will be used to specify how the program should execute. It ought to be able to contain information like calculation type (RHF, UHF, DFT…), basis set to use (sto3G, p321, etc), specify the unit of measurement being used in input/output (angstroms, nanometers, bohr radii) (though note that internal calculations will all be done in nm. No sense making things needlessly hard), and probably other things I haven’t thought of yet. A custom type will be used to store this information accessibly.

moleculeDrawer.jl: module for creating 3D graphs of molecular structure
Sadly, Gadfly does not support 3D graphs yet. Since there’s not really any good way to visualise a molecular structure using only 2D graphs, we’ll just have to deal with using Plotly.

function plotMolecule(atomArray::Array)
    #create a graph displaying the atoms in the positions given by the array
    trc = scatter3d(
                    ;x=atomArray[:, 2], y=atomArray[:, 3], z=atomArray[:, 4],
                    mode="markers+text", opacity=0.5, marker_size=12,
                    text=map(x->getElementSymbol(x), atomArray[:, 1]),
                    textposition="center")
    #TODO make some of these things into inputs to this fn
    #TODO make different elements different colours :3yes
    layout = Layout(margin=attr(l=0, r=0, t=0, b=0))
    p = plot(trc, layout)
    return p
end

This is how you plot a 3D graph using Plotly. Important parts: x=atomArray[:, 2], y=atomArray[:, 3], z=atomArray[:, 4] gets the x, y and z columns of the array of atom positions. text=map(x->getElementSymbol(x), atomArray[:, 1]) creates the 4th column, the text of each atom. The text is just a label giving the elemental symbol.

function getElementSymbol(atomicNumber::Int32)
    #Retrieve the symbol for a given atomic number

    #The file is properly ordered, so the nth element is the nth line #EZPZ
    l = open(readlines, eledata)[atomicNumber]
    #Note that this loads the entire file into an array. May result in large
    #memory usage if the number of chemical elements increases,
    #or if using a table with very large amounts of info for each element
    elemInf = map(strip, split(l, ','))
    return elemInf[2]
end

Next TODO: create a better importer for the pt-data csv files. It should create a dictionary of Symbol:Number, and a simple list of symbols. That will make accessing symbols as we need them easier, and make it possible for moleculeReader to read symbols. Also find some similar files with a proper source, the source on these ones has vanished.

PlottingCodeTest
Fig. 1: Incredibly hard-to-make-out plot of atoms in non-meaningful positions. TODO: learn how to make this less awful to look at (in Atom? Which has a dark background? Or, learn how to make it display outside Atom, on a pale background? Or, make it export images without including the background?)

basis.jl: module for importing basis sets.
For the conversion from the PyQuante Python basis sets to the simpler, if more in need of interpretation, csv basis sets, see the python script basisReformatter.py. The reason for the conversion is to have a more language-neutral, but still somewhat readable version of the basis sets to hand.

function addLine!(b::BasisSet, line::String)
    l = map(strip, split(line, ','))
    linenum = parse(l[1])
    linearr = formArray(l[2:end])
    b.basis[linenum] = linearr
end

Note that the function!() format indicates a function that alters its arguments in-place rather than returning a new variable (though it is not a required syntax of doing so). This is the typical behaviour in Julia, so I’m not sure why it merits being highlighted. Opinions are divided over whether this is a good way of doing things; personally I quite like it.

function formArray(line::Vector)
    #In the input list, elements alternate between letters and strings of
    #numbers separated by spaces
    A = Array{Any}(2,0)
    for i = 1:2:length(line)
        #i indexes letters, i+1 indexes blocks of numbers
        a = line[i]
        b = listPairedNumbers(line[i+1])
        A = hcat(A, [a,b])
    end
    return A
end

The array we end up forming is not particularly easy to visualise – overall the structure is a hashmap relating numbers to (lists of (pairs of letters and (lists of pairs of numbers))). Uhh… Yeah. Think that’s right.

So far basis.jl is the part that probably needs least work. It does a single job quite effectively, and probably/hopefully doesn’t need any massive extension.

tester.jl: file for testing other modules.
Python is uncommon in having the wonderful if __name__ == "__main__": structure for testing modules by simply running them as main. In Julia, a more dedicated approach is required. So this file simply contains a set of functions for testing each other module.

try
    testBasis()
finally
    #This undoes the inclusions and uses, except when it doesn't
    workspace()
end

Annoyingly, the workspace() command doesn’t seem to do what it’s supposed to. To explain, consider the following development workflow in the Julia add-in for Atom, Juno:

  • Write a function in a module.
  • Add some code to the tester script’s function for the changed module to check the new function, execute the tester script.
  • Either part of the new code fails in some way.
  • Make further changes, run the tester code again.
  • Best case scenario: output overwhelmed by hundreds of warnings
  • Worst case scenario: console errors

Wait, what?

As I understand it, you’re Not Supposed to import things into the same workspace twice in Julia. In the good case (only overwriting my own code), the warnings just make things really hard to work with. In the bad case (Plotly), the workspace just dies. What the workspace() command should be doing is refreshing the workspace, replacing the main module with a new, empty one. In practice it doesn’t seem to prevent these problems. The only thing to do is to stop the Julia REPL and restart it.

JuliaWhy
Fig. 2: Second-run output interrupted by dozens of warnings. I have no idea how to prevent this “properly.” There doesn’t seem to be any way to suppress warnings.

Major improvements to make: Have input file set program behaviour. Allow element symbols in input file. Make graph output readable.

Next step: Add restricted Hartree-Fock algorithm. Add some test cases. Create some documentation on usage.

Celandine part 1 – Basics and Basis

Code Diary 03 – AlphaVEN highlights and lowlights

The code this time is too long to reproduce here in its entirety, so check it out here if you’re interested. I’ll be going over some of the more interesting and ugly parts here.

Keyword Arguments Everywhere

Example:

    def __init__(self, **kwargs):
        """Create a new Maker."""
        
        self.title = kwargs["title"] if "title" in kwargs else DEFAULTOUTPUT
        self.fadetime = kwargs["fadetime"] if "fadetime" in kwargs else DEFAULTFADETIME
        self.format = kwargs["format"] if "format" in kwargs else DEFAULTFORMAT      
        #Not yet implemented functionally speaking:
        self.resolution = kwargs["res"] if "res" in kwargs else DEFAULTRES

kwargs are great! Important usage note:

    def createMakers(self):
        """Create a Maker for each of the outputs."""
        self.makers = []
        for para in self.paralist:
            #Extract the inputs and transitions
            paratitle = para.split('\n')[0]
            lines = para.split('\n')[1:]
            
        paramaker = maker.Maker(title=paratitle, **self.gensetdict)

If gensetdict is empty, this calls maker.Maker() without any keyword arguments. This lets me avoid having separate code for calling functions when I don’t have any arguments for them. Very convenient!

The Worst Function I’ve Written (so far)

Speaking of createMakers, it might be the worst function I’ve ever written. I won’t reproduce it here, because it’s huge, and also because it should probably be quarantined. It went through about 5 versions and used to be much worse, but is still 81 lines long and definitely not performing just a single action. It should be set on fire restructured into several functions. However, those would probably have to have a whole bundle of inputs and outputs.

Object Orientation
        if "fadein" in kws:
            if kws["fadein"]:
                self.fadein = kws["fadein"]
            elif not (kws["fadein"] == None or kws["fadein"] == 0):
                #If the fade time isn't explicitly 0
                self.fadein = self.videosegment.parentvideo.maker.fadetime
            else:
                self.fadein = None
        else:
            self.fadein = False

Man, look at that chain of objects! I have no idea if this is a good or bad idea, but it certainly looks cool. And hey, imagine if it were Java:
this.getVideoSegment().getParentVideo().getParentMaker().getFadeTime()
Ewwww. And that’s assuming you don’t need to cast the fadetime to the right type of numeric, heh.

Abusing Truthiness

Notice how in the above code, I check if kws["fadein"] is exactly equal to “None” or “0” rather than just checking if it counts as True (both None and 0 count as False in Python). This is to distinguish three cases:

  • There is a fade-in, and it has a time.
  • There is a fade-in, and it should use the default time.
  • There is a fade-in, and it has 0 duration.

(ok, maybe this is actually the worst code I’ve ever written)

Why Doesn’t This Come With The Package?
def timediff(t1, t2):
    """Find the absolute difference between two datetime Times.
    
    Returns a timedelta object.
    """
    
    delta1 = dt.timedelta(hours=t1.hour, \
                          minutes=t1.minute, \
                          seconds=t1.second, \
                          microseconds=t1.microsecond)
    delta2 = dt.timedelta(hours=t2.hour, \
                          minutes=t2.minute, \
                          seconds=t2.second, \
                          microseconds=t2.microsecond)
    return abs(delta2 - delta1)

Seriously, why did I have to write this? Is there some reason to not define a difference between two times?

Overall Thoughts

On the one hand, AlphaVEN was hideous before this. It didn’t even assemble the videos itself, which meant rendering every video twice – a massive timesink. On the other hand, this was a pain to write, isn’t particularly impressively done, and not necessarily fully functional. There are some definite improvements to be made.

Python is 100% the right language to do this. A powerful, flexible language that still functions like a shell script when needed. (I do use other languages. Sometimes. But usually, I find myself wondering why I would want to make life hard for myself).

This ought to make assembling streams into videos really easy, if I get round to actually doing so.

Code Diary 03 – AlphaVEN highlights and lowlights

Animal Naming

My brother introduced me to a “game” he plays while walking to work: if you see an animal, think of a person’s name for it that follows the same vowel-consonant pattern as that animal’s name. For instance, a duck has the pattern C-V-C-C, which is matched by John, among other names. This can become quite difficult for certain animals (e.g. kestrel) and some seem to strain the bounds of possibility (e.g. squirrel).

Since fun exists to be destroyed, here is some code that plays this “game” far better than any person. I’ve broken it up to add some commentary on the development.

# -*- coding: utf-8 -*-

import re

re is Python's Regular Expressions module. Regexes are the obvious choice for tackling a string pattern matching problem.

vowel = '[aeiouAEIOU]'
consonant = '[^aeiouAEIOU \d]'
#Note: assuming y is typically a consonant, which isn't quite true - 
# but there's no good way to tell 'vowel y' (eg 'my') 
# from 'consonant y' (eg 'your') in such a simple program
# To use a vowel y instead, add yY to both the above 
# To use y as always both a vowel and a consonant, add yY to only the first one

def getVowelPattern(word):
    '''Returns the indices of the vowels in a word'''
    vwls = re.compile(vowel).finditer(word)
    positions = []
    for v in vwls:
        pos = v.start()
        positions.append(pos)
    return positions

We start by taking an animal name and decomposing it into a pattern of vowels and consonants, here represented by the positions of the vowels. Note that this assumes all letters are exclusively either a vowel or a consonant, which adds to the difficulty of handling 'y'.

def createPattern(vwPositions, length):
    '''Create a pattern to search for, from the positions of vowels in a word
    '''
    pattern = ""
    for i in range(0,length):
        if i in vwPositions:
            pattern += vowel
        else:
            pattern += consonant
    pattern += " "
    return pattern          

This composites a bunch of the 'vowel' and 'consonant' strings together to form a pattern matching the whole word. Fairly straightforward. This pattern will be used to search for any names matching this pattern.

def searchText(pattern, text):
    # Given a file of text, search for instances of the pattern
    p = re.compile(pattern)
    for line in text:
        m = p.match(line)
        if m:
            yield m.group()
    return
        
if __name__ == "__main__":
    animal = raw_input("What animal are we naming? \n")
    vwp = getVowelPattern(animal)
    ptr = createPattern(vwp, len(animal))
    # list of names from 
    # http://deron.meranda.us/data/census-derived-all-first.txt
    with open("firstnames.txt") as f:
        srh = searchText(ptr, f)
        cont = True
        n = 0
        while cont:
            n += 1
            try:
                name = srh.next()
                print name
                if n % 3 == 0:           
                    cont = (raw_input("Continue? y/n ") == "y")
            except StopIteration:
                print "End of file reached"
                cont = False

We take a list of first names (male and female) sorted by population frequency, and apply the regular expression created above to find matches. Three matches are returned at a time with the option to continue or stop; this seemed like a good compromise between simplicity and the conflicting desires to see many names, to see the most common matches without scrolling through a list of results, and just to see if there's any matches at all.

Well, that was a lot of fun, and I got to find out what the two matches for 'squirrel' are.

Animal Naming

Code diary 01 – Mutagen for Recovered Files

So after an incredibly stupid hard drive reformatting incident, I was left with my music collection reduced to recovered files – better than nothing, but nameless and unsorted, like so:
recoveredMusic
Luckily, most music files include metadata identifying the song title, album name, artist etc. A package such as Python’s Mutagen can read this metadata and this can be used to re-organise a large quantity of files. The below is my undeniably shoddy code to do so:

# -*- coding: utf-8 -*-
"""
Created on Wed Jun 10 14:39:13 2015

@author: Oscar
"""
forbiddenChars = ['\\','/','?','*',':','>','<','|', '"', '"'] 
import mutagen 
import os 
import sys 

def genNames():
     i = 0
     while True:
         yield "unknown track {:0>2d}".format(i)
         i += 1

def identify(path):
    fileData = {}
    namer = genNames()
    initFiles = os.listdir(path)
    for i in initFiles:
        if os.path.isfile(path + i):
            d = getData(path + i, namer)
            if d != None:
                fileData[i] = d
    return fileData

def getData(path, namer):
    try:
        rawdata = mutagen.File(path, easy=True)
    except:
        return None
    print "Getting data for {}".format(path)
    if rawdata == None:
        return None
    if rawdata.has_key('title'):
        title = str(rawdata['title'][0].strip())
        # Strip removes trailing whitespace, which windows doesn't like
    else:
        title = namer.next()
    if rawdata.has_key('album'):        
        album = str(rawdata['album'][0].strip())
    else:
        album = "Unknowns"
    ext = '.' + path[-3:]
    for i in forbiddenChars:
        if i in title:
            k = title.split(i)
            k.insert(1, ' - ')
            title = ''.join(k)
        if i in album:
            k = album.split(i)
            k.insert(1, ' - ')
            album = ''.join(k)
    return (title, album, ext)

if __name__ == "__main__":
    loc = sys.argv[1]
    print "Identifying tracks..."
    music = identify(loc)
    print "Moving files..."
    print "Found: ", music
    for f in music:
        destination = loc + music[f][1] + '/' + music[f][0] + music[f][2]
        if os.path.isdir(loc + music[f][1]):
            # If there is already a folder for this track's album name            
            if not os.path.isfile(destination):
                # If this track hasn't already been moved
                print "Moving {} to {}".format(loc + f, destination)
                os.rename(loc + f, destination)
            else:
                # If the track does already exist:
                os.remove(loc + f)
        else:
            # If there's not already a folder
            os.mkdir(loc + music[f][1])
            print "Moving {} to {}".format(loc + f, destination)
            os.rename(loc + f, destination)
    print "Finished"

Overall this wasn’t especially fun to create – every time I thought I’d included all the possible test cases, it would turn out there was something new hiding in the real directory of files. Stuff like album names ending in whitespace or random .zip files that Mutagen isn’t designed to handle. Much faster than moving them all by hand, though!

Code diary 01 – Mutagen for Recovered Files