The most portable code: Fortran!

The title might not be fully correct, but it is pretty close. To start with, I consider here only purely compiled programming languages, i.e. no scripting (and no web browser). And since the C programming language is used to build most of the tools, it probably deserves the first position.

Indeed, this post is here to advertise the publication of an old library and Fortran framework of mine. A bit of shouting cannot harm. Still, I came up with this title for a few reasons. For the impatient, the code is here: https://github.com/mariotti/freemol.

The first, obvious reason for this title is that I revived a 10 to 15 year old Fortran framework and it basically compiled with minor changes (which were actually in the Perl code used for processing the makefiles). And not only on my up-to-date Fedora Linux box: it also worked right away on my 2011 MacBook Air.

The second reason is that some of the code deals with subjects which were difficult to handle in Fortran. It is nice to see that these tools are indeed still working.

The third reason is that the full code of the framework counts about 59000 lines (of which about 10000 are “help” comments and another 10000 are “free” comments), with 39000 lines of pure Fortran code. The size in itself is of course not the reason. The fundamental reason is that most of the code has been ported from Fortran IV/77 to Fortran 90, and not all of it was really normalised to the new specification. Yet it works.

I will later publish a full description of the framework, mainly for historical reasons, but also to let others grab some ideas. Indeed the code per se is of high quality and was tested over a period of about 10 years, at least for the parts I (and my coworkers) used most. 😉

The fourth reason is that I am now coding in Java, so I do know what it takes to match the version of Java with the huge available choice of libraries and tools, and with their own versions. In my personal opinion the portability of Java is eroded by its fast development. This is especially true if you try to build a secure system which uses a few new libraries: you always need to stay well behind the current main version and patch for the new code.

But just to give an idea, let me point out a couple of things about this Fortran framework.

In the first instance I call it a framework because there are a few shell and Perl scripts to help you build your Fortran application.

  • After machine configuration, you can select the available fortran compilers
  • The makefiles are autogenerated
  • There is a full cleanup facility
  • There are basic libraries

It almost goes this way:

  • Create a project directory
  • Put the code inside with a main (program)
  • Write one line of library dependencies: -Letc
  • Run make from the main folder

It also has a facility to collect all the lines starting with ‘!H’ as help comments for the given function. It includes a directory which will compile LaTeX documentation, should you happen to write any.
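To make the ‘!H’ idea concrete, here is a minimal sketch in Python (the framework itself uses shell and Perl scripts, so this is not its implementation). The function name `collect_help_lines` and the sample Fortran text are hypothetical, and allowing leading blanks before the marker is my assumption.

```python
def collect_help_lines(source_text, marker="!H"):
    """Return the text of every line that starts with the help marker.

    Assumption: leading blanks before the marker are allowed, and one
    blank after the marker is treated as a separator and dropped.
    """
    help_lines = []
    for line in source_text.splitlines():
        stripped = line.lstrip()
        if stripped.startswith(marker):
            text = stripped[len(marker):]
            if text.startswith(" "):
                text = text[1:]
            help_lines.append(text.rstrip())
    return help_lines


# Hypothetical Fortran source, only for illustration.
EXAMPLE_SOURCE = """\
!H CROSS computes the cross product of two 3-vectors.
!H Usage: call cross(a, b, c)
      subroutine cross(a, b, c)
! an ordinary comment, not collected
      end
"""

if __name__ == "__main__":
    for text in collect_help_lines(EXAMPLE_SOURCE):
        print(text)
```

Ordinary ‘!’ comments are left alone, so the help text can live next to the code it documents.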

I am sure there are better frameworks today. But I designed it to be “bare”, using only a few basic shell commands (see portability: it did work!), and fully local, with no external tools. For this reason a copy of the BLAS and LAPACK libraries comes with the distribution. (For these in particular I suggest NOT compiling the local copies, as your machine usually has very well optimised versions of them. But in case you run out of resources, they are there.)

The fortran libraries included in the package handle different problems:

  • Strings, ASCII file lines, and the command line, for example.
  • It has a fully featured library to handle files.
  • Within these, there is a library to read Molden-like format files
  • One part in particular is dedicated to reading the “molecule” section of the Molden file.

Just to give an idea, the code reads this (almost) without any problem:

[molecule]
1   H         1.0 1.0 1.0
2 1 H         1.0 0.0 0.0
3   H 1       0.0 1.0 0.0
    H         0.0 0.0 1.0
    H 1.00794 1.0 0.0 1.0

“Almost” because the code will warn you that it is reading different line formats, just in case the file is messed up 😉
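The tolerant reading plus the warning can be sketched as follows. This is Python, not the Fortran library, and the interpretation of the columns is my assumption: the last three tokens are taken as coordinates, the first purely alphabetic token as the element symbol, and whatever else is on the line is kept as uninterpreted extra fields. All names are hypothetical.

```python
def parse_molecule_line(line):
    """Split one atom line into symbol, coordinates and extra fields."""
    tokens = line.split()
    if len(tokens) < 4:
        raise ValueError(f"too few fields: {line!r}")
    xyz = tuple(float(token) for token in tokens[-3:])
    head = tokens[:-3]
    index = next((i for i, tok in enumerate(head) if tok.isalpha()), None)
    if index is None:
        raise ValueError(f"no element symbol found: {line!r}")
    return {
        "symbol": head[index],
        "before": head[:index],      # fields found before the symbol
        "after": head[index + 1:],   # fields between symbol and coordinates
        "xyz": xyz,
    }


def parse_molecule(text):
    """Parse a [molecule] section; warn when the line layouts differ."""
    atoms, layouts = [], set()
    for line in text.splitlines():
        if not line.strip() or line.strip().lower() == "[molecule]":
            continue
        atom = parse_molecule_line(line)
        layouts.add((len(atom["before"]), len(atom["after"])))
        atoms.append(atom)
    warnings = []
    if len(layouts) > 1:
        warnings.append(f"{len(layouts)} different line formats found")
    return atoms, warnings


SECTION = """\
[molecule]
1   H         1.0 1.0 1.0
2 1 H         1.0 0.0 0.0
3   H 1       0.0 1.0 0.0
    H         0.0 0.0 1.0
    H 1.00794 1.0 0.0 1.0
"""

if __name__ == "__main__":
    atoms, warnings = parse_molecule(SECTION)
    print(len(atoms), "atoms;", warnings)
```

The point of the design is that a mixed file is still read, but the user is told about it.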

Then, besides these basic libraries or modules, there are already a few programs you can play with.

The most famous is probably ADFrom. It is not supported anymore, I am sorry. The reason is not that it was bad code, but simply licensing problems and time. I changed my activities and could no longer get a licence for ADF (Amsterdam Density Functional). For my part, I did not ask for a free licence just to maintain the code, as I would not have been able to keep up anyway. Had it been open source, who knows.

The second in this short list is probably the CSM (Continuous Symmetry Measure) code. You can use this code to relate molecules to symmetry and, if you are brave enough, to each other.

Then there is a very fancy and powerful 1D fitting machinery. It really is 1D. I used it to test the merging of potential energy surfaces. (I will update this post with the published reference to the actual 9- and 10-dimensional merging.)

You can fit with different functions, and you can even add your own if you can code. The fancy thing is that it reads a single file with multiple data columns and can perform basic operations on the data before fitting. The obvious case might be: fit -> col1 + col2*3.14. But the code can also take parts of the columns, like col1[1:200],col2[202:400]*1.01, and fit those. As I said, I used it to test surface merging, so it is fancy. In theory you can do the same in Excel, but in practice not as fast, and in particular it is not as easy to script and save each fit.
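As a rough illustration of “take pieces of columns, scale them, then fit”, here is a small Python sketch. It is not the Fortran machinery: the helper names `piece` and `fit_line` are hypothetical, the row ranges are assumed to be 1-based and inclusive, the two pieces are assumed to be joined end to end, and a straight line stands in for whatever fit function you would really use.

```python
def piece(column, start, stop, scale=1.0):
    """Rows start..stop (1-based, inclusive) of a column, times scale."""
    return [scale * value for value in column[start - 1:stop]]


def fit_line(xs, ys):
    """Least-squares straight line; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Two synthetic data columns describing the same line y = 2x + 1,
# the second one off by a constant factor (as when merging two surfaces).
X = [float(i) for i in range(1, 11)]
COL1 = [2.0 * x + 1.0 for x in X]
COL2 = [(2.0 * x + 1.0) / 1.01 for x in X]

# Analogue of: col1[1:5],col2[6:10]*1.01
MERGED = piece(COL1, 1, 5) + piece(COL2, 6, 10, scale=1.01)

if __name__ == "__main__":
    intercept, slope = fit_line(X, MERGED)
    print(f"intercept={intercept:.6f} slope={slope:.6f}")
```

Because the selection and scaling are plain function calls, each fit can be saved as a small script and rerun, which is the part that is awkward in a spreadsheet.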

All the rest of the code deals with polyspherical coordinates, mainly for the methane molecule.

I hope I gave a nice overview of this old but still pretty current code.

The code is here: https://github.com/mariotti/freemol.

 

The quest for the ID

The ID. The IDentification number (or code) is used everywhere, and most of the time it is crucial for software applications and for business; “most of the time” it is also crucial for personal identification.

I would start by noting that the email address is a major example of an ID which is easily understood by most people. It is unique across the Internet world.

We might need to exclude a couple of friends of mine who are sharing a “family” address. The email address is not personal in this case.

What most people might not know is that there is a worldwide effort to “unify” most IDs and make them unique. For products, an example of such an effort is the Global Trade Item Number. It is a good example because you need to pay to be included in the database. The fact that you have to pay shows that there is a non-zero effort behind keeping an ID unique and working worldwide.

The best ID I have ever used is the “sequential” ID: 1, 2, 3, ..., n, because it makes its own use clean and neat. As you might have guessed, this post is about everything but sequential IDs, but, please, do not forget that you have it and use it every time: it is still the best ID ever!

The ID is used within databases to identify a single unique entry. SQL databases have historically used a sequential number. Some programmers split the DB record ID from the software record ID even when the DB can provide unique IDs. They feel they do not have control over the ID from the software perspective. This lack of trust did pay off, as new IDs can include global IDs and/or simply cope with a failure of the DB. But do we really need bright programmers for that?

Let’s see some properties of the IDs or DB indexes which might be required by the applications.

  • Security:
    • The ID should not be guessable
    • The ID should not be guessable from a previous ID
    • The ID should not be guessable by a number crunching program
  • Uniqueness:
    • The ID should be unique within the application
    • The ID should be unique worldwide
    • The ID should be unique… see discussion later
  • Readability:
    • The ID should be readable by a human being
    • The ID should be readable by a human being and possibly easy to memorize
  • Explanatory:
    • The ID should contain information which can give indications to a human being
    • The ID should contain some classification information which can be partially processed
  • Speed:
    • The ID should contain some sortable values which speed up searches and listings

 

These are some example requirements which I have come across in my “ID life”. There might be more.

Before going further you might want to read these more general documents:

Examples

I describe here a few examples which I hope summarise the problem of finding a good ID.

Invoices

A typical invoice would have a simple sequential number like “32451”. A typical conversation today might go like this:

VEND: Hello, I am from Company A and I am calling about invoice 32451, which we sent to you.

CUST: Yes, no problem, but I cannot find any order on my side from Company A with ID 32451. Are you sure you called the right person?

VEND: Yes, it was an order made on 12/12/2012 for 123.45 $

CUST: Let me look… yes, I have such an order but it is from X-Bay with ID: sxAs%rtmnlh-32451

VEND: Sure, sorry. What are the first 5 chars?

CUST: sx lowercase, A capital, s lowercase and the percent symbol.

VEND: …. sure, you paid it twice!

Here you see a few problems in action. In the first place, an invoice is typically an administrative document which has to be referenced by its own ID. The final customer, however, registers the order and the payment due under the name of the main web site and not under Company A. Both do agree on a date and an amount. Then they exchange an ID by voice, which might generate errors.

Invoices solution

Our solution runs this way:

IN-2016-08-04-<EmittingCompanyID>-<SequentialNumber>

This is human readable, and unique for the given target. You might need to make sure that it complies with the law of the given country.
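The format above can be built and taken apart with a few lines of Python. This is only a sketch of the naming scheme, not the Apache OFBiz implementation; the function names and the company ID “COMPANYA” are hypothetical, and the date is assumed to be the issue date in ISO form.

```python
from datetime import date


def invoice_id(issue_date, company_id, sequence):
    """Build IN-<yyyy-mm-dd>-<EmittingCompanyID>-<SequentialNumber>."""
    return f"IN-{issue_date.isoformat()}-{company_id}-{sequence}"


def parse_invoice_id(text):
    """Split an invoice ID back into (date, company ID, sequence number)."""
    if not text.startswith("IN-") or len(text) < 16 or text[13] != "-":
        raise ValueError(f"not an invoice ID: {text!r}")
    issue_date = date.fromisoformat(text[3:13])
    # The company ID may itself contain '-', so split on the last one.
    company_id, _, sequence = text[14:].rpartition("-")
    return issue_date, company_id, int(sequence)


if __name__ == "__main__":
    example = invoice_id(date(2016, 8, 4), "COMPANYA", 32451)
    print(example)                    # IN-2016-08-04-COMPANYA-32451
    print(parse_invoice_id(example))
```

On the phone, the vendor and the customer can now check the date and the company name from the ID itself before comparing the sequence number.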

This solution works on Apache OFBiz, thanks to Jacques Le Roux. As easy as it might look, it took a long discussion between me, Jacques and Pramod prasanth.

Mongo concurrency on probability

You could consider the mongoID as an example of uniqueness, but that is not quite the case. While it has a high level of randomness, it is not 100% guaranteed to generate unique numbers for the “worldwide” part.

Technically we are left with the last “3-byte counter, starting with a random value”. Why? Because in a production system I might have multiple MongoDB clusters running and, in particular for backup or fault tolerance, I might “duplicate” these clusters. MongoDB IDs work if Mongo is the only parallel application.

Let us make things a bit more complicated: 1 million requests on 1 million unique items concurrently.

UUID, GUID

For these, please read the Wikipedia documentation. It has enough information about uniqueness.

There are several versions. Some, such as version 4, basically work on random number generators; others use timestamps, node identifiers or name hashes.

A few links are given above, but please try to google it yourself.

The ID

The idea of this post is to give directions. And of course my personal opinion.

For example, the invoice ID is unique. But it is public. I would not use public IDs as DB IDs. Sorry for the strong contraction. The point is simple: do not use public IDs for your database. This is the most basic drawback of sequential IDs: you need a very good programmer who can hide the backend process.

Changing subject. The best ID I know about is the mongoID. It has many sub-features. The only problem is the machine ID, which might not always be available.

The generator

You might not have realised it from the lines above, but the main problem is: who generates the IDs?

You can have a totally random number, but you will need to keep track of it for any purpose.

You can have a paid service which guarantees unique IDs.

For general purpose IDs I suggest the mongoID with a few small changes.

The machineID might not always be available, and generating a random number for it might make little sense. Consider also the case where the processID might be a random number.

The best idea I came up with is a generator key distributor.

In practice the key distributor gives a unique string to each generator, which uses it to generate IDs in MongoDB style, with the machineID replaced by that string.

This way we can be sure that a single generator makes unique IDs. A second generator (or thousands of them, by design) will not share the same key.

The exercise is to make a good key distributor. It will hold the list of allowed ID generators.
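A minimal sketch of the idea in Python, under stated assumptions: it runs in a single process, nothing is persisted, and the class names, the key format and the textual layout of the ID (timestamp, key, process ID, 3-byte counter, separated by “-”) are all hypothetical choices of mine, not the real ObjectId byte layout.

```python
import itertools
import os
import time


class KeyDistributor:
    """Hands out one unique key per registered ID generator."""

    def __init__(self):
        self._issued = {}                 # generator name -> key
        self._numbers = itertools.count(1)

    def register(self, generator_name):
        if generator_name in self._issued:
            raise ValueError(f"{generator_name!r} already has a key")
        key = f"g{next(self._numbers):04x}"
        self._issued[generator_name] = key
        return key

    def allowed_generators(self):
        return sorted(self._issued)


class IdGenerator:
    """Mongo-style IDs with the machine ID replaced by a distributed key."""

    def __init__(self, key, process_id=None):
        self._key = key
        self._pid = os.getpid() if process_id is None else process_id
        # 3-byte counter starting from a random value.
        self._counter = itertools.count(int.from_bytes(os.urandom(3), "big"))

    def next_id(self, now=None):
        seconds = int(time.time() if now is None else now)
        count = next(self._counter) % 2 ** 24
        return f"{seconds:08x}-{self._key}-{self._pid & 0xFFFF:04x}-{count:06x}"


if __name__ == "__main__":
    distributor = KeyDistributor()
    first = IdGenerator(distributor.register("billing"))
    second = IdGenerator(distributor.register("shipping"))
    print(first.next_id())
    print(second.next_id())
```

Two generators can produce the same timestamp, process ID and counter value, and the IDs still differ because the keys differ. Everything therefore rests on the distributor never issuing the same key twice, which in a real deployment needs persistent, shared storage.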

 

I will shortly publish a GitHub demo in CoffeeScript.

Solved problems

The first is concurrency. We assume that a generator can spawn processes with a given processID and control a sequential number within each process. Within this picture there is no concurrency problem.

Uniqueness: every ID is indeed unique, as long as the machineID generator works with unique numbers.

 

Continuous Symmetry Measure: an old work

I finally decided to post an old piece of work I did in science. The actual working paper is CSM_paper. It never got peer reviewed, for various reasons, but it still deserves some notice. It is a bit off topic from the usual Unix and/or programming subjects you find in this blog. Nevertheless you can find some patterns, techniques or the like.

It is also interesting for the more IT-oriented people, for subjects like image recognition and fuzzy comparison. The approach here is purely scientific, with the exception that there is no scientific connection or background besides a purely statistical approach. For the IT people I suggest following the work of D. Avnir, who has more dedicated papers.

The main subject is the Continuous Symmetry Measure (CSM). The term, I think, was defined in 1992 by D. Avnir in J.Am.Chem.Soc. 7843 (114) 1992 (Site PDF). Please follow D. Avnir's publications on the subject. A second important publication I deal with in this post and in the paper dates from 1998, by S. Grimme in Chem.Phys.Lett. 15 (297) 1998, where he includes some quantum properties in the measure.

Below I present a very condensed commentary on the paper. An important question I was trying to answer is whether a reference object exists when we introduce a (or the) measure, or whether the measure is an intrinsic property of the object. From this work it turns out that we do indeed have an intrinsic reference object, which depends on our mathematical construction and definition of symmetry and, more generally, of measure. If you are interested, just read on.

Extracts from the paper.

Abstract

In this contribution we address the problem of near symmetry and present a Continuous Symmetry Measure (CSM) based on the electron density function. In particular we propose an algorithm which generalizes the formalism proposed by Avnir (Encyclopedia of Computational Chemistry, pp 2890; J.Am.Chem.Soc. 7843 (114) 1992) toward an exploitation of the properties of point symmetry groups. Correlations with existing definitions of CSM are discussed, and an implementation of the proposed approach which uses a simplified electron density is presented. Advantages and disadvantages of the different approaches are reported.

Introduction

You will need to read the PDF version for a proper introduction. The main points are the definition of “near” symmetry and/or a distance measure from symmetry. A side problem, but a very important one in life science, is chirality, which is discussed within the method.

Description of the CSM model

The paper introduces the “symmetriser”. It works perfectly with any continuous 3D quantity.

The idea is simply to apply all the symmetry operators of a given symmetry group to an arbitrary, continuously defined 3D density. If the system is continuous, the resulting object is simply “symmetrised”.

This object is then used to set up a measure of symmetry. There is more background information in the PDF paper.
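A toy version of the symmetriser can be written in a few lines of Python. It is a sketch under my own assumptions, not the definition used in the paper or the freemol code: the density is a sum of equal spherical Gaussians, the group is C2 (identity plus a 180 degree rotation about z), and “applying all the operators” is taken to mean averaging the transformed densities. All names are hypothetical.

```python
import math


def gaussian_density(centres, width=1.0):
    """Continuous 3D density: one spherical Gaussian per centre."""
    def rho(point):
        total = 0.0
        for centre in centres:
            dist2 = sum((p - c) ** 2 for p, c in zip(point, centre))
            total += math.exp(-dist2 / (2.0 * width ** 2))
        return total
    return rho


def apply(matrix, point):
    """Apply a 3x3 matrix to a 3D point."""
    return tuple(sum(m * p for m, p in zip(row, point)) for row in matrix)


def transpose(matrix):
    return tuple(zip(*matrix))


def symmetrise(rho, operations):
    """Average rho over a finite group of orthogonal operations.

    For an orthogonal matrix the inverse is the transpose, so the
    transformed density (g.rho)(r) is rho(g^T r).
    """
    inverses = [transpose(op) for op in operations]

    def rho_sym(point):
        return sum(rho(apply(inv, point)) for inv in inverses) / len(inverses)
    return rho_sym


IDENTITY = ((1, 0, 0), (0, 1, 0), (0, 0, 1))
C2_Z = ((-1, 0, 0), (0, -1, 0), (0, 0, 1))   # 180 degree rotation about z
C2_GROUP = [IDENTITY, C2_Z]

if __name__ == "__main__":
    rho = gaussian_density([(1.0, 0.2, 0.0), (-0.9, 0.0, 0.1)])
    rho_sym = symmetrise(rho, C2_GROUP)
    point = (0.3, -0.7, 0.5)
    print(rho_sym(point), rho_sym(apply(C2_Z, point)))
```

Because the operations form a group, the averaged density takes the same value at a point and at its rotated image, whatever the starting density was. A symmetry measure can then compare the original density with this symmetrised one.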

 

Relation with previous CSM definitions

There are 2 other approaches. One is discrete, the other uses a wave function.

The discrete approach does not differ from this one, except that using discrete values needs a classification of the “points” (the discrete part). The classification can be minimised and defined in advance via algorithms, but it also leaves room for specific patterns (in which case it is disjoint from the approach described here).

The wave function approach has to amount to the same thing, as the density is defined by the wave function. The problem is that the wave function itself has no 3D properties unless it is expressed in some measure. Symmetry (the 3D version) is so far not a measure; or better, this is what we are trying to find out. The Grimme approach lacks a scientific approach to what is a 3D geometrical problem, as we are all far away from any Hamiltonian form of measure.

 

Implementation

Besides a solid scientific background, there are some statistical data which drive us to try an implementation. As we want a fast approach, I propose this method: Gaussians. We define the object to be made of the best Gaussian fit: 3D and continuous, but with single points also represented as Gaussians. The algorithm works in continuous space. The integration of Gaussians is a simple numerical problem.
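To show why Gaussians make the integration cheap: for two spherical Gaussians the overlap integral over all space even has a closed form. The sketch below is Python and uses unnormalised Gaussians exp(-a|r-A|^2); whether the freemol code uses exactly this expression is an assumption on my part, and the function name is hypothetical.

```python
import math


def gaussian_overlap(a, centre_a, b, centre_b):
    """Integral over all 3D space of exp(-a|r-A|^2) * exp(-b|r-B|^2).

    Closed form: (pi / (a + b))**1.5 * exp(-a*b/(a + b) * |A - B|^2).
    """
    dist2 = sum((p - q) ** 2 for p, q in zip(centre_a, centre_b))
    return (math.pi / (a + b)) ** 1.5 * math.exp(-a * b / (a + b) * dist2)


if __name__ == "__main__":
    same = gaussian_overlap(1.0, (0.0, 0.0, 0.0), 1.0, (0.0, 0.0, 0.0))
    apart = gaussian_overlap(1.0, (0.0, 0.0, 0.0), 1.0, (2.0, 0.0, 0.0))
    print(same, apart)
```

The overlap only depends on the two exponents and on the distance between the centres, so comparing two objects made of Gaussians reduces to a double sum over pairs of centres.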

The code is given within the freemol framework on github.

Applications

Read D. Avnir's work. My current applications are trends. You can always read the paper.

 

 

 

Praat: Extract data

I post here a very basic script which I use to extract data from some given (stress on given) Praat data.

It runs on files and folders (directories), and it can accumulate data. It has options to save in different formats, even if there are just 2 for now, with a little control over the output. It can be scripted for large amounts of data.

For newcomers: Praat is a linguistics tool; more at praat. (You can also try praat.org, which links back here.)

Continue reading

Less Unix, more linguistics and phonetics: a script to automate Praat

Praat (the home page at http://www.praat.org/ and http://www.fon.hum.uva.nl/praat/):

I needed a simple script to quickly change the input data for the command line execution. The interesting part is at the end. It is an initial script to handle big data.

So this is a first attempt (let's call the script: run_praat.sh):

Continue reading

A small script: taritdate.sh tars directories with dates

This is a very simple script which I often use to create temporary backups of working directories. As complexity grew, I needed a script to save the state of my work locally, so that I could easily revert to a previous state. It turned out that I also use it for backups. See it:

Continue reading