Celandine part 1 – Basics and Basis

Current code base available here

“I’d quite like to write a very simple computational chemistry program in Julia,” thought lots of people all the time, “I wonder how I’d go about that.”

Step one is to find out if the problem has been solved before. Not quite, but there is an open-source computational chemistry program in Python. So I’m not so much re-inventing the wheel as re-branding the Rotating Axle-Based Object Transport Circle. Still, it might be fun to try, and more importantly, we can follow the existing methodology as a guide.

It would make sense for step two to be “look up a bunch of sources on how to do computational chemistry,” and indeed this is what I did first. But vague memories of university would have been sufficient to provide the answer: there are a whole bunch of different approaches, and each will need its own algorithmic implementation (you could probably get that much just by guessing, even). However, they will share common features: input format, possible outputs, basis sets, and the such. Therefore this first post will, very conveniently for a lazy blogger with a day job and other hobbies, not in fact contain any computational chemistry. Step two, then, is to implement a basic structure that can read and display molecular geometries.

moleculeReader.jl: module for reading molecular structures from files into arrays
A molecular structure is, fundamentally, a 4-column 2D array, the first column containing atomic number and the other three giving (x,y,z) coordinates

function convertTextBlockToArray(text::Array)
    #Given the subset of text in the file which constructs molecules,
    #convert to an array of integers
    #This creates a 1D array of vectors
    A = [lineConvert(l) for l in text]
    #This converts it to a 2D array
    A2 = permutedims(hcat(A...), [2,1])
    #Obviously there are other ways to reach this result
    return A2

To explain how this works: the line creating A uses a comprehension to create a vector of whatever lineconvert is returning. The structure “hcat(A…)” concatenates all the contents of A together into a single 2D array instead. permutedims is Julia’s generalisation of transposition; permutedims(Array, [2,1]) is equivalent to transpose(Array). The result is a 2D array of converted lines from the text. We could also obtain this result by iterating over the array, converting and concatenating each line. But this would look a lot uglier, and since pretty much all reasonable molecules have less than 100 atoms, this slower method is fine.

function lineConvert(line)
    #Converts a single line from string to numerics
    a = split(line, ',')
    b = map(strip, a)
    c = map(x->parse(Float64, x), b)
    return c

This can be done in one line of stacked maps, but there’s not really any reason to and it’s very hard to read it that way. We have to use the somewhat ugly lambda expression “x->parse(Float64,x)” rather than just “parse” in order to force parse to treat all numbers as Float64. This is still somewhat annoying since we’d ideally like to treat the atomic numbers as integers, but that would mean using a mixed array. It seems acceptable to use floating-point for atomic numbers here, since they’re not actually used for any arithmetic until much later.

This function could still stand to be improved. Ideally it ought to be possible to freely switch between writing the atomic numbers in the input file, and writing the symbols (H, He, C, Br, etc). And a proper version of this would probably find some way to use integers for atomic numbers.

function readFileToUniversalInformation(filepath)
    #Read the file in the given location
    #Return an object containing the information set in the header

This is the other big TODO in moleculeReader. The top paragraph of an input file will be used to specify how the program should execute. It ought to be able to contain information like calculation type (RHF, UHF, DFT…), basis set to use (sto3G, p321, etc), specify the unit of measurement being used in input/output (angstroms, nanometers, bohr radii) (though note that internal calculations will all be done in nm. No sense making things needlessly hard), and probably other things I haven’t thought of yet. A custom type will be used to store this information accessibly.

moleculeDrawer.jl: module for creating 3D graphs of molecular structure
Sadly, Gadfly does not support 3D graphs yet. Since there’s not really any good way to visualise a molecular structure using only 2D graphs, we’ll just have to deal with using Plotly.

function plotMolecule(atomArray::Array)
    #create a graph displaying the atoms in the positions given by the array
    trc = scatter3d(
                    ;x=atomArray[:, 2], y=atomArray[:, 3], z=atomArray[:, 4],
                    mode="markers+text", opacity=0.5, marker_size=12,
                    text=map(x->getElementSymbol(x), atomArray[:, 1]),
    #TODO make some of these things into inputs to this fn
    #TODO make different elements different colours :3yes
    layout = Layout(margin=attr(l=0, r=0, t=0, b=0))
    p = plot(trc, layout)
    return p

This is how you plot a 3D graph using Plotly. Important parts: x=atomArray[:, 2], y=atomArray[:, 3], z=atomArray[:, 4] gets the x, y and z columns of the array of atom positions. text=map(x->getElementSymbol(x), atomArray[:, 1]) creates the 4th column, the text of each atom. The text is just a label giving the elemental symbol.

function getElementSymbol(atomicNumber::Int32)
    #Retrieve the symbol for a given atomic number

    #The file is properly ordered, so the nth element is the nth line #EZPZ
    l = open(readlines, eledata)[atomicNumber]
    #Note that this loads the entire file into an array. May result in large
    #memory usage if the number of chemical elements increases,
    #or if using a table with very large amounts of info for each element
    elemInf = map(strip, split(l, ','))
    return elemInf[2]

Next TODO: create a better importer for the pt-data csv files. It should create a dictionary of Symbol:Number, and a simple list of symbols. That will make accessing symbols as we need them easier, and make it possible for moleculeReader to read symbols. Also find some similar files with a proper source, the source on these ones has vanished.

Fig. 1: Incredibly hard-to-make-out plot of atoms in non-meaningful positions. TODO: learn how to make this less awful to look at (in Atom? Which has a dark background? Or, learn how to make it display outside Atom, on a pale background? Or, make it export images without including the background?)

basis.jl: module for importing basis sets.
For the conversion from the PyQuante Python basis sets to the simpler, if more in need of interpretation, csv basis sets, see the python script basisReformatter.py. The reason for the conversion is to have a more language-neutral, but still somewhat readable version of the basis sets to hand.

function addLine!(b::BasisSet, line::String)
    l = map(strip, split(line, ','))
    linenum = parse(l[1])
    linearr = formArray(l[2:end])
    b.basis[linenum] = linearr

Note that the function!() format indicates a function that alters its arguments in-place rather than returning a new variable (though it is not a required syntax of doing so). This is the typical behaviour in Julia, so I’m not sure why it merits being highlighted. Opinions are divided over whether this is a good way of doing things; personally I quite like it.

function formArray(line::Vector)
    #In the input list, elements alternate between letters and strings of
    #numbers separated by spaces
    A = Array{Any}(2,0)
    for i = 1:2:length(line)
        #i indexes letters, i+1 indexes blocks of numbers
        a = line[i]
        b = listPairedNumbers(line[i+1])
        A = hcat(A, [a,b])
    return A

The array we end up forming is not particularly easy to visualise – overall the structure is a hashmap relating numbers to (lists of (pairs of letters and (lists of pairs of numbers))). Uhh… Yeah. Think that’s right.

So far basis.jl is the part that probably needs least work. It does a single job quite effectively, and probably/hopefully doesn’t need any massive extension.

tester.jl: file for testing other modules.
Python is uncommon in having the wonderful if __name__ == "__main__": structure for testing modules by simply running them as main. In Julia, a more dedicated approach is required. So this file simply contains a set of functions for testing each other module.

    #This undoes the inclusions and uses, except when it doesn't

Annoyingly, the workspace() command doesn’t seem to do what it’s supposed to. To explain, consider the following development workflow in the Julia add-in for Atom, Juno:

  • Write a function in a module.
  • Add some code to the tester script’s function for the changed module to check the new function, execute the tester script.
  • Either part of the new code fails in some way.
  • Make further changes, run the tester code again.
  • Best case scenario: output overwhelmed by hundreds of warnings
  • Worst case scenario: console errors

Wait, what?

As I understand it, you’re Not Supposed to import things into the same workspace twice in Julia. In the good case (only overwriting my own code), the warnings just make things really hard to work with. In the bad case (Plotly), the workspace just dies. What the workspace() command should be doing is refreshing the workspace, replacing the main module with a new, empty one. In practice it doesn’t seem to prevent these problems. The only thing to do is to stop the Julia REPL and restart it.

Fig. 2: Second-run output interrupted by dozens of warnings. I have no idea how to prevent this “properly.” There doesn’t seem to be any way to suppress warnings.

Major improvements to make: Have input file set program behaviour. Allow element symbols in input file. Make graph output readable.

Next step: Add restricted Hartree-Fock algorithm. Add some test cases. Create some documentation on usage.

Celandine part 1 – Basics and Basis