Bioinformatics News and Reviews: programming

Showing posts with label programming. Show all posts

Saturday, December 28, 2013

If the code is not open, you're not doing science

This is a list of articles explicitly suggesting or tangentially indorsing Open Source Software in the context of academic publications. The list is far from complete and in no particular order, but it creates a strong defense of source code as an integral part of the scientific process (code is data after all!).

http://dx.doi.org/10.1016/j.tree.2012.03.009

The difference between the RF and the NNI distances

Just to complement my answer to a blog post, where I maintain that the Nearest-Neighbor Interchange (NNI) distance is not equivalent to the Robinson-Foulds (RF) distance, a simple example:

Where we can see that trees T1 and T2 differ only in the location of nodes A and B -- on these trees, we can naturally think of the nodes A, B, 1,..., 6 as representing leaves, but they might also be large subtrees.

The RF distance is the number of edges (=branches) that are unique to each tree (that's why it's also called the symmetric difference), and it may be normalized to one. If we highlight the unique edges on trees T1 and T2

We see that the (unnormalized) RF distance is 10. For dichotomic trees, the number of unique edges is the same on both trees.

The NNI distance is the minimum number of NNIs that must be applied to one tree such that it becomes equal to the other. One NNI branch swap will change exactly one edge, thus is very tempting to assume that the NNI distance can be found by looking at the distinct edges.

But the problem is when the same branch is involved in more than one path of the "NNI walk". The RF distance (divided by two, for fully resolved trees) is then a lower bound on the minimum number of NNIs. In our example:

The NNI distance between T1 and T2 is 6, one more than the RF distance since the edge splitting (1,2,3) and (4,5,6) is used twice in the NNI computation. The problem, as explained by Liam, is that simulating trees with a specified distance is hard, and the solution of using very large trees masks the cases where the distances disagree...

Reference:
Bryant D. (2004). The Splits in the Neighborhood of a Tree, Annals of Combinatorics, 8 (1) 1-11. DOI: 10.1007/s00026-004-0200-z

Saturday, July 14, 2012

pointerception in C

This is something that I still have a hard time, so I'll leave here two clear explanations from stackoverflow user Charles Bailey

c++ - Spiral rule and 'declaration follows usage' for parsing C expressions - Stack Overflow

You just have to build it up in steps.
char *X();  // X =~ (*(*a[N])())
Function returning char*
char *(*Y())();  // Y =~ (*a[N])
Function returning pointer to function returning char*.
In a declaration, just as in an expression (declaration follow usage), postfix [] has a higher precedence that unary * so *a[N] is equivalent to *(a[N]), not (*a)[N].
char *(*(*Z)())();  // Z =~ a[N]
Pointer to function returning pointer to function returning char*.
char *(*(*a[N])())();
Array of N pointers to functions returning a pointer to function returning char*.

c - what's the meaning of this piece of code? void (signal(int sig, void (func)(int)))(int); - Stack Overflow

#include <signal.h>
void (*signal(int sig, void (*func)(int)))(int);

It's the declaration of a function taking an int and a pointer to a function (taking int returning void) and returning a pointer to a function (taking int and returning void).

Explanation, or guide to interpretation

You can interpret by treating everything in parentheses as a single entity and then working inwards using the "declaration follows usage" rule.

void (*signal(int sig, void (*func)(int)))(int);

The entity in the brackets looks like a function taking int and returning void.

Stripping away the outer part:
*signal(int sig, void (*func)(int))

So, signal takes some parameters and returns something that can be dereferenced (due to the leading *) to form a function taking int and returning void.

This means signal is a function returning a pointer to a function (taking int and returning void).
Looking at the parameters it takes an int (i.e. sig) and void (*func)(int) which is a pointer to a function (taking int and returning void).

The guy deserves the 128 thousand points of reputation.

Sunday, December 6, 2009

Recent CUDA applications

Taking a look at Nvidia's CUDA zone I found some recent additions - besides important ones that I already knew like the Smith-Waterman and the phylogenetic likelihood algoritihms. Here is a list of these new programs and libraries that take advantage of GPUs:

Expectation Maximization algorithm for Gaussian Mixture Models
Multilevel algorithm for MDS (multidimensional scaling)
Multiclass classifier based on SVMs (support vector machines)
GPU computing in the R statistical environment (BLAS lowlevel routines and dist(), hclust(), cor(), and granger.test() functions)
Smith-Waterman algorithm for sequence database search
Quicksort library
Feature finding algorithm for mass spectrometry
Smith-Waterman algorithm for sequence alignment
MDR (multifactor dimensionality reduction) algorithm for detecting genetic epistatic interactions (white paper)

The last reference is just a two-page summary, and the 8th reference is behind a paywall - I couldn't access it myself - but the slides are available here.

Bioinformatics News and Reviews

Pages