by any other name. The personal blog of Leonardo de Oliveira Martins. "quod gratis asseritur, gratis negatur"
Showing posts with label programming. Show all posts
Showing posts with label programming. Show all posts
Saturday, December 28, 2013
If the code is not open, you're not doing science
This is a list of articles explicitly suggesting or tangentially indorsing Open Source Software in the context of academic publications. The list is far from complete and in no particular order, but it creates a strong defense of source code as an integral part of the scientific process (code is data after all!).

Thursday, February 7, 2013
The difference between the RF and the NNI distances
Just to complement my answer to a blog post, where I maintain that the Nearest-Neighbor Interchange (NNI) distance is not equivalent to the Robinson-Foulds (RF) distance, a simple example:
Where we can see that trees T1 and T2 differ only in the location of nodes A and B -- on these trees, we can naturally think of the nodes A, B, 1,..., 6 as representing leaves, but they might also be large subtrees.
The RF distance is the number of edges (=branches) that are unique to each tree (that's why it's also called the symmetric difference), and it may be normalized to one. If we highlight the unique edges on trees T1 and T2
We see that the (unnormalized) RF distance is 10. For dichotomic trees, the number of unique edges is the same on both trees.
The NNI distance is the minimum number of NNIs that must be applied to one tree such that it becomes equal to the other. One NNI branch swap will change exactly one edge, thus is very tempting to assume that the NNI distance can be found by looking at the distinct edges.
But the problem is when the same branch is involved in more than one path of the "NNI walk". The RF distance (divided by two, for fully resolved trees) is then a lower bound on the minimum number of NNIs. In our example:
The NNI distance between T1 and T2 is 6, one more than the RF distance since the edge splitting (1,2,3) and (4,5,6) is used twice in the NNI computation. The problem, as explained by Liam, is that simulating trees with a specified distance is hard, and the solution of using very large trees masks the cases where the distances disagree...
Reference:
Bryant D. (2004). The Splits in the Neighborhood of a Tree, Annals of Combinatorics, 8 (1) 1-11. DOI: 10.1007/s00026-004-0200-z
Where we can see that trees T1 and T2 differ only in the location of nodes A and B -- on these trees, we can naturally think of the nodes A, B, 1,..., 6 as representing leaves, but they might also be large subtrees.
The RF distance is the number of edges (=branches) that are unique to each tree (that's why it's also called the symmetric difference), and it may be normalized to one. If we highlight the unique edges on trees T1 and T2
We see that the (unnormalized) RF distance is 10. For dichotomic trees, the number of unique edges is the same on both trees.
The NNI distance is the minimum number of NNIs that must be applied to one tree such that it becomes equal to the other. One NNI branch swap will change exactly one edge, thus is very tempting to assume that the NNI distance can be found by looking at the distinct edges.
But the problem is when the same branch is involved in more than one path of the "NNI walk". The RF distance (divided by two, for fully resolved trees) is then a lower bound on the minimum number of NNIs. In our example:
The NNI distance between T1 and T2 is 6, one more than the RF distance since the edge splitting (1,2,3) and (4,5,6) is used twice in the NNI computation. The problem, as explained by Liam, is that simulating trees with a specified distance is hard, and the solution of using very large trees masks the cases where the distances disagree...
Reference:
Bryant D. (2004). The Splits in the Neighborhood of a Tree, Annals of Combinatorics, 8 (1) 1-11. DOI: 10.1007/s00026-004-0200-z
Saturday, July 14, 2012
pointerception in C
This is something that I still have a hard time, so I'll leave here two clear explanations from stackoverflow user Charles Bailey
c++ - Spiral rule and 'declaration follows usage' for parsing C expressions - Stack Overflow
c++ - Spiral rule and 'declaration follows usage' for parsing C expressions - Stack Overflow
You just have to build it up in steps.c - what's the meaning of this piece of code? void (signal(int sig, void (func)(int)))(int); - Stack Overflow
Function returningchar *X(); // X =~ (*(*a[N])())
char*
Function returning pointer to function returningchar *(*Y())(); // Y =~ (*a[N])
char*
.
In a declaration, just as in an expression (declaration follow usage), postfix[]
has a higher precedence that unary*
so*a[N]
is equivalent to*(a[N])
, not(*a)[N]
.
Pointer to function returning pointer to function returningchar *(*(*Z)())(); // Z =~ a[N]
char*
.
Array of N pointers to functions returning a pointer to function returningchar *(*(*a[N])())();
char*
.
#include <signal.h>
void (*signal(int sig, void (*func)(int)))(int);
It's the declaration of a function taking anint
and a pointer to a function (takingint
returning void) and returning a pointer to a function (takingint
and returning void).
Explanation, or guide to interpretation
You can interpret by treating everything in parentheses as a single entity and then working inwards using the "declaration follows usage" rule.
void (*signal(int sig, void (*func)(int)))(int);
The entity in the brackets looks like a function takingint
and returningvoid
.
Stripping away the outer part:
*signal(int sig, void (*func)(int))
So,signal
takes some parameters and returns something that can be dereferenced (due to the leading*
) to form a function takingint
and returningvoid
.
This meansThe guy deserves the 128 thousand points of reputation.signal
is a function returning a pointer to a function (takingint
and returningvoid
).
Looking at the parameters it takes anint
(i.e.sig
) andvoid (*func)(int)
which is a pointer to a function (takingint
and returningvoid
).
Sunday, December 6, 2009
Recent CUDA applications
Taking a look at Nvidia's CUDA zone I found some recent additions - besides important ones that I already knew like the Smith-Waterman and the phylogenetic likelihood algoritihms. Here is a list of these new programs and libraries that take advantage of GPUs:
The last reference is just a two-page summary, and the 8th reference is behind a paywall - I couldn't access it myself - but the slides are available here.

- Expectation Maximization algorithm for Gaussian Mixture Models
- Multilevel algorithm for MDS (multidimensional scaling)
- Multiclass classifier based on SVMs (support vector machines)
- GPU computing in the R statistical environment (BLAS lowlevel routines and dist(), hclust(), cor(), and granger.test() functions)
- Smith-Waterman algorithm for sequence database search
- Quicksort library
- Feature finding algorithm for mass spectrometry
- Smith-Waterman algorithm for sequence alignment
- MDR (multifactor dimensionality reduction) algorithm for detecting genetic epistatic interactions (white paper)
The last reference is just a two-page summary, and the 8th reference is behind a paywall - I couldn't access it myself - but the slides are available here.
Powered by ScribeFire.
Subscribe to:
Posts (Atom)