Monday, October 31, 2011

Using the T-Coffee package to build multiple sequence alignments of protein, RNA, DNA sequences and 3D structures

T-Coffee (Tree-based consistency objective function for alignment evaluation) is a versatile multiple sequence alignment (MSA) method suitable for aligning most types of biological sequences. The main strength of T-Coffee is its ability to combine third party aligners and to integrate structural (or homology) information when building MSAs. The series of protocols presented here show how the package can be used to multiply align proteins, RNA and DNA sequences. The protein section shows how users can select the most suitable T-Coffee mode for their data set. Detailed protocols include T-Coffee, the default mode, M-Coffee, a meta version able to combine several third party aligners into one, PSI (position-specific iterated)-Coffee, the homology extended mode suitable for remote homologs and Expresso, the structure-based multiple aligner. We then also show how the T-RMSD (tree based on root mean square deviation) option can be used to produce a functionally informative structure-based clustering. RNA alignment procedures are described for using R-Coffee, a mode able to use predicted RNA secondary structures when aligning RNA sequences. DNA alignments are illustrated with Pro-Coffee, a multiple aligner specific to promoter regions. We also present some of the many reformatting utilities bundled with T-Coffee. The package is an open-source freeware available from

Sunday, October 30, 2011

calibre - eBook reader


PDF to EPUB converter
PDF to TXT converter

Unsupervised Feature Learning and Deep Learning

Course Description

Machine learning has seen numerous successes, but applying learning algorithms today often means spending a long time hand-engineering the input feature representation. This is true for many problems in vision, audio, NLP, robotics, and other areas. In this course, you'll learn about methods for unsupervised feature learning and deep learning, which automatically learn a good representation of the input from unlabeled data. You'll also pick up the "hands-on," practical skills and tricks-of-the-trade needed to get these algorithms to work well.

Basic knowledge of machine learning (supervised learning) is assumed, though we'll quickly review logistic regression and gradient descent.



Batch gradient descent
Gradient descent in practice
Stochastic gradient descent
Exponentially weighted average
Shuffling data
Exercise 1: Implementation


Examples and intuitions #1
Examples and intuitions #2
Parameter learning
Gradient checking
Random initialization
Vectorized implementation
Activation function derivative












Adopt an open posture. No crossed legs or folded arms.
Make your neck tall and shoulders relaxed, as if you were trying to see over a wall that was very slightly taller than your eye level. Like a meerkat who is looking for a predator. You know what I mean.
Speak clearly and with volume; remember that what you're saying is worth hearing.
Don't take yourself too seriously; humour is the most universal language and can help prevent conflict with alpha-male and attention-envy types.
Don't be judgemental to others - but let yourself be open to judgement from others. This relaxes people around you, and helps bring down the barriers between you.

Coming across as confident is often a result of two things -- body language and tone of voice. No doubt you already know about sitting up straight and making eye contact to show confidence, therefore I am going to focus on how your tone of voice can get you that next job.

Your tone of voice has a big effect on how people are going to both perceive you and respond to you. In fact, your tone of voice is more important than the words you choose; it says to people, "This is how I am really feeling."

-Practise speaking in a slightly lower octave; deeper voices have more credibility than higher-pitched voices.

-Pause before saying a meaningful word or idea you are sharing to emphasize its importance.

-Pronounce every word; don't mumble.

-Record your voice and listen to it.

e-Books should be free

Free Audio Books from the public domain
Download a free audiobook in mp3, iPod, or iTunes format

Friday, October 28, 2011

ToppGene Suite - gene list enrichment analysis and candidate gene prioritization

* ToppFun: Transcriptome, ontology, phenotype, proteome, and pharmacome annotations based gene list functional enrichment analysis

Detect functional enrichment of your gene list based on Transcriptome, Proteome, Regulome (TFBS and miRNA), Ontologies (GO, Pathway), Phenotype (human disease and mouse phenotype), Pharmacome (Drug-Gene associations), literature co-citation, and other features.

* ToppGene: Candidate gene prioritization

Prioritize or rank candidate genes based on functional similarity to training gene list.

* ToppNet: Relative importance of candidate genes in networks

Prioritize or rank candidate genes based on topological features in protein-protein interaction network.

* ToppGenet: Prioritization of neighboring genes in protein-protein interaction network

Identify and prioritize the neighboring genes of the seeds in protein-protein interaction network based on functional similarity to the "seed" list (ToppGene) or topological features in protein-protein interaction network (ToppNet).

Ten Simple Rules for Teaching Bioinformatics at the High School Level


  1. Am I energized to be enthusiastic about this class?
  2. Is the classroom arranged properly for the day's activities?
  3. Are my name, course title, and number on the chalkboard?
  4. Do I have an ice-breaker planned?
  5. Do I have a way to start learning names?
  6. Do I have a way to gather information on student backgrounds, interests, expectations for the course, questions, concerns?
  7. Is the syllabus complete and clear?
  8. Have I outlined how students will be evaluated?
  9. Do I have announcements of needed information ready?
  10. Do I have a way of gathering student feedback?
  11. When the class is over, will students want to come back? Will you want to come back?

Ten Simple Rules for Getting involved in your scientific community


Activities such as organizing conferences and workshops, answering questions and discussing scientific ideas online, contributing to a scientific blog, or participating in open source software projects are typically thought of as outside classic research activity. Having scientists involved in those activities, however, is very important for the community to be dynamic and to promote fruitful discussions and collaborations.

encourage your colleagues to play an active role in the scientific community
want to maintain a balance with the activities directly related to your research projects
remember that you are not alone
If you know why you are doing it and if you enjoy it, you will take the time to do it, and you will do it well

Wednesday, October 26, 2011

Quanta Plus -- XML Editor for Ubuntu


# select all descendants of node parent

$ sudo apt-get install python-4suite-xml
$ 4xpath --string book.xml /catalog/book/author

Result (XPath string):
Gambardella, Matthew

Simply Python code

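import libxml2

# Parse the file and print the value of every Url attribute in the document
doc = libxml2.parseFile('foo.xml')
for url in doc.xpathEval('//@Url'):
    print(url.content)
doc.freeDoc()  # libxml2 documents must be freed explicitly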
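The same two lookups (element text by path, attribute values anywhere in the tree) can also be sketched with Python's standard library, which supports a limited XPath subset. This is only a minimal sketch: the inline XML is a hypothetical stand-in for book.xml, and the second author and the id attributes are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for book.xml; only the first author appears in the notes.
BOOK_XML = """<catalog>
  <book id="bk101"><author>Gambardella, Matthew</author></book>
  <book id="bk102"><author>Example, Author</author></book>
</catalog>"""


def authors(xml_text):
    """Text of every /catalog/book/author element."""
    root = ET.fromstring(xml_text)  # root is the <catalog> element
    return [a.text for a in root.findall("./book/author")]


def attribute_values(xml_text, name):
    """Values of the given attribute on any element (similar to //@name)."""
    root = ET.fromstring(xml_text)
    return [e.get(name) for e in root.iter() if e.get(name) is not None]


if __name__ == "__main__":
    print(authors(BOOK_XML))
    print(attribute_values(BOOK_XML, "id"))
```

ElementTree cannot evaluate attribute-only expressions such as //@Url directly, which is why the second helper walks the tree instead.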

Tuesday, October 25, 2011

Peter Norvig - The Unreasonable Effectiveness of Data

Collect data, use probability (Bayes' rule) to write a simple model in code, and let the data do all the work.

good vs. bad data; over time, you might pick up your own data?

word sense disambiguation
spelling correction

Models are useful

Essentially, all models are wrong, but some are useful.
--George Box

Brown Bag Lunch

This discussion was sparked by a question: "ideas for a short (e.g. 45min) brown bag lunch type session, aiming to share information about a particular piece of work ongoing within a large (newly formed) team, in a way that encourages discussion and thought about potential internal synergies, during the lunch break."

* Called ‘Brown Bag’ because people often bring their food in one, the term refers to informal discussions around a topic (e.g. ongoing research, first ideas for a project) at lunch time, with lunch brought (or sometimes provided). In an organization, some lunchtime meetings are catered, while in others you're expected to bring your own lunch. For organizers, a nice way to set the expectation that no lunch will be served is to call it a 'brown bag'. That way, participants will bring their own. The equivalent in South Asia might be a 'tiffin box' lunch!

Differentially Expressed Genes in Major Depression Reside on the Periphery of Resilient Gene Coexpression Networks

However, we found that the small-world connectivity characteristics of coexpression networks are resilient to the effects of depression (and of other neuropsychiatric diseases), and that the related pathology is not mediated by network disintegration via attack on hub nodes.

Monday, October 24, 2011

Free Android Apps


Complete, up-to-date travel guides for over 8000 destinations. Triposo is the most comprehensive guide available.
Triposo apps don’t need an internet connection, even the detailed maps are all stored on your device.

WhatsApp - WhatsApp Messenger is a cross-platform mobile messaging app which allows you to exchange messages without having to pay for SMS

Call Meter NG - Keep track of your mobile voice / data / text bills

Offline Browser - save webpage



Google shopper


Angry Birds

Custom Rom - CyanogenMod 7.1.0 (Android 2.3.7)

For Optimus One running official 2.3.3, root using SuperOneClick.

Why install custom ROM?

Super Manager - File manager, Remove stock applications

Benefits of rooting

Titanium Backup

ClockworkMod ROM Manager

GingerBreak (works on LG OPTIMUS ONE V 2.2.2)

Z4Root (use Permanent Root) (or use GingerBreak-v1.20.apk) (for Telus, P500H, Android firmware v2.2 NOT 2.2.1)
- Enable USB debugging from Menu->Settings->Applications->Development->USB Debugging.
- Make sure the versions / firmware eg. V10B are correct (To find out, go to Settings, then About Phone)

Telus V10S stock:

How to Root LG Optimus One

Opera Mini

Adobe Reader

Bluetooth File Transfer


Advanced Task Killer (Free)

Barcode Scanner

Dolphin Browser™ HD (Play Flash)

Skyfire - (Play Flash) Skyfire Browser makes your mobile web experience richer, smarter and more fun!
Skyfire is the world’s smartest & most social mobile browser!

AdFree Android - removes all ads

Android System Info - Explore all features of your android device!

App 2 SD Free (move app to SD) - Are you running out of application storage?

Easy Uninstaller - Simplest & fastest uninstall tool for android.

Spare Parts Plus! - Allows you to enable and change some hidden settings of your Android device.

RockPlayer - RockPlayer is high performance, almost all formats media player with a lot of functions

Ectopic expression

Ectopic expression is the expression of a gene in an abnormal place in an organism. This can be caused by a disease, or it can be artificially produced as a way to help determine what the function of that gene is.

Similar gene expression profiles do not imply similar tissue functions

Although similarities in gene expression among tissues are commonly inferred to reflect functional constraints, this has never been formally tested. Furthermore, it is unclear which evolutionary processes are responsible for the observed similarities. When examining genome-wide expression data in mouse, we found that patterns of expression similarity between tissues extend to genes that are unlikely to function in the tissues. Thus, ectopic expression can seem coordinated across tissues. This indicates that knowledge of gene expression patterns per se is insufficient to infer gene function. Ectopic expression is possibly explained as expression leakage, caused by spreading of chromatin modifications or the transcription apparatus into neighboring genes.

Sunday, October 23, 2011

Normalize music volume recursively with mp3gain

$ find . -name "*.mp3" -print0 | xargs -0 mp3gain -r

-print0 together with xargs -0 correctly handles file names that contain spaces

Saturday, October 22, 2011

Call landline



Create flash cards online

Gene set enrichment analysis (GSEA) made simple


Among the many applications of microarray technology, one of the most popular is the identification of genes that are differentially expressed in two conditions. A common statistical approach is to quantify the interest of each gene with a p-value, adjust these p-values for multiple comparisons, choose an appropriate cut-off, and create a list of candidate genes. This approach has been criticized for ignoring biological knowledge regarding how genes work together. Recently a series of methods that do incorporate biological knowledge have been proposed. However, many of these methods seem overly complicated. Furthermore, the most popular method, Gene Set Enrichment Analysis (GSEA), is based on a statistical test known for its lack of sensitivity. In this paper we compare the performance of a simple alternative to GSEA. We find that this simple solution clearly outperforms GSEA. We demonstrate this with eight different microarray datasets.

There are currently two major types of procedure for incorporating biological knowledge into differential expression analysis. We will refer to these as the over-representation and the aggregate score approaches.

Over-representation analysis can be summarized as follows: First, form a list of candidate genes using the marginal approach. Then, for each gene set, we create a two-by-two table comparing the number of candidate genes that are members of the category to those that are not members. The significance of over-representation can be assessed, for example, using the hypergeometric distribution or its binomial approximation.
A limitation of the over-representation approach is that it ignores all the genes that did not make the list of candidate genes.
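As a rough sketch of the over-representation test described above, the upper tail of the hypergeometric distribution gives the chance of seeing at least the observed overlap between the candidate list and a gene set. The function name, its parameters and the toy numbers are assumptions made for illustration; they do not come from the paper.

```python
from math import comb


def over_representation_pvalue(n_genes, set_size, n_candidates, overlap):
    """P(X >= overlap) when n_candidates genes are drawn without replacement
    from n_genes genes, of which set_size belong to the gene set."""
    total = comb(n_genes, n_candidates)
    upper = min(set_size, n_candidates)
    tail = sum(
        comb(set_size, k) * comb(n_genes - set_size, n_candidates - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


if __name__ == "__main__":
    # Toy example: 10 genes on the array, a gene set of 4, and 3 candidates,
    # all 3 of which fall in the gene set.
    print(over_representation_pvalue(10, 4, 3, 3))
```

Only the candidate genes enter the calculation, which is exactly the limitation noted above.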

The aggregate score approach (e.g. GSEA) does not have this limitation. The basic idea is to assign scores to each gene set based on all the gene-specific scores for that gene set.

In this paper we compare GSEA to the one-sample z-test and the χ² test.

that 7 or so genes is
sufficient to uniquely determine a gene set, -- Jesse

Hypergeometric (draws w/o replacement) and Binomial / Bernoulli (draws with replacement) distributions

In probability theory and statistics, the binomial distribution is the discrete probability distribution of the number of successes in a sequence of n independent yes/no experiments, each of which yields success with probability p. Such a success/failure experiment is also called a Bernoulli experiment or Bernoulli trial; when n = 1, the binomial distribution is a Bernoulli distribution. The Binomial distribution is an n times repeated Bernoulli trial. The binomial distribution is the basis for the popular binomial test of statistical significance.

The binomial distribution is frequently used to model the number of successes in a sample of size n drawn with replacement from a population of size N. If the sampling is carried out without replacement, the draws are not independent and so the resulting distribution is a hypergeometric distribution, not a binomial one. However, for N much larger than n, the binomial distribution is a good approximation, and widely used.
Draw 5 cards from the deck, what are the chances that 4 are red?

> tot <- 52; m <- 26; n <- tot-m; k <- 5; q <- 4; dhyper(q,m,n,k)
[1] 0.1495598

> tot <- 52; m <- 26; n <- tot-m; k <- 5; q <- 4; phyper(q,m,n,k)
[1] 0.9746899
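The two R calls above can be cross-checked with a short sketch in Python. dhyper gives the probability of exactly 4 red cards and phyper the probability of at most 4. The helper names simply mirror the R functions and are defined here; they are not an existing Python API. The binomial value is what the same question gives if the cards were drawn with replacement.

```python
from math import comb


def dhyper(q, m, n, k):
    """P(X = q): exactly q successes in k draws without replacement
    from m successes and n failures."""
    return comb(m, q) * comb(n, k - q) / comb(m + n, k)


def phyper(q, m, n, k):
    """P(X <= q), the cumulative version of dhyper."""
    return sum(dhyper(i, m, n, k) for i in range(q + 1))


def dbinom(q, size, p):
    """P(X = q) for size independent draws with success probability p."""
    return comb(size, q) * p**q * (1 - p) ** (size - q)


exactly_four_red = dhyper(4, 26, 26, 5)   # without replacement
at_most_four_red = phyper(4, 26, 26, 5)
with_replacement = dbinom(4, 5, 0.5)      # binomial counterpart

if __name__ == "__main__":
    print(exactly_four_red, at_most_four_red, with_replacement)
```

With only 52 cards the two models differ visibly (about 0.150 against 0.156); the gap shrinks as N grows relative to n.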

Friday, October 21, 2011

Steve Jobs

“Remembering that I'll be dead soon is the most important tool I've ever encountered to help me make the big choices in life. Almost everything — all external expectations, all pride, all fear of embarrassment or failure — these things just fall away in the face of death, leaving only what is truly important. Remembering that you are going to die is the best way I know to avoid the trap of thinking you have something to lose. You are already naked. There is no reason not to follow your heart.” — Steve Jobs, at a Stanford University commencement ceremony in 2005.

“People don’t know what they want until you show it to them.”

“I was in the parking lot [after the lecture], with the key in the car,” Jobs said. “I thought to myself, If this is my last night on earth, would I rather spend it at a business meeting or with this woman? I ran across the parking lot, asked her if she'd have dinner with me. She said yes, we walked into town and we've been together ever since.”

Thursday, October 20, 2011

SWAN (Semantic Web Applications in Neuromedicine)

Welcome to the SWAN project!

SWAN (Semantic Web Applications in Neuromedicine) is a Web-based collaborative program that aims to organize and annotate scientific knowledge about Alzheimer disease (AD) and other neurodegenerative disorders. Its goal is to facilitate the formation, development and testing of hypotheses about the disease. The ultimate goal of this project is to create tools and resources to manage the evolving universe of data and information about AD in such a way that researchers can easily comprehend their larger context ("what hypothesis does this support or contradict?"), compare and contrast hypotheses ("where do these two hypotheses agree and disagree?"), identify unanswered questions and synthesize concepts and data into ever more comprehensive and useful hypotheses and treatment targets for this disease.


NIF Interoperation Capabilities
What is DISCO?

DISCO is an information integration approach designed to facilitate interoperation among Internet resources. It consists of a set of tools and services that allows resource providers who maintain information to share it with automated systems such as NIF. NIF is then able to “harvest” the information and keep those sets of information up-to-date.
How is this accomplished?

By using a series of files and/or scripts which are then placed in the root directory of the resource developer’s resource. (NIF can also host the files on its servers and crawl for changes there.) Once the files of the resource providers are in place, and DISCO is notified, the DISCO server can then recognize and "consume" the information shared, providing machine understandable information to NIF Integrator Servers (also known as Aggregators) about your resource.


DOMEO - Document Metadata Organizer

Ciccarese P, Ocana M, Clark, T. DOMEO: a web-based tool for semantic annotation of online documents. Paper at Bio-Ontologies 2011, Vienna, Austria. Accepted

So highlight text in the web (eg. Pubmed article) and hit Annotate. Loads ontology data when annotating as well. Also lets you share annotations.

PLoS Computational Biology Guidelines for Reviewers

Research articles modeling aspects of biological systems should demonstrate both scientific novelty and profound new biological insights. Research articles describing improved or routine methods, models, software, and databases will not be considered by PLoS Computational Biology, and may be more appropriate for PLoS ONE.

To be considered for publication in PLoS Computational Biology, any given manuscript must satisfy the following criteria:

* Originality
* High importance to researchers in computational biology
* Significant biological insight and general interest to life scientists
* Rigorous methodology
* Substantial evidence for its conclusions

Manuscripts also must be well written to ensure clear and effective presentation of the work and key findings.

The best possible review of a Research Article would answer the following questions:

* What are the main claims of the paper and how significant are they? Is this paper important in its discipline?
* Have the authors provided adequate proof for their claims?
* Are these claims novel? If not, please specify papers that weaken the claims of originality of this one.
* Would additional work improve the paper? How much better would the paper be if this work were performed and how difficult would it be to do this work?
* Are the claims properly placed in the context of the previous literature? Have the authors treated the literature fairly?
* Do the data and analyses support the claims? If not, what other evidence is required?
* Are original data deposited in appropriate repositories and accession/version numbers provided for genes, proteins, mutants, diseases, etc.?
* Does the study conform to any relevant guidelines such as CONSORT, MIAME, QUORUM, STROBE, and the Fort Lauderdale agreement?
* Are details of the methodology sufficient to allow the experiments to be reproduced?
* Is any software created by the authors freely available?
* PLoS Computational Biology encourages authors to publish detailed protocols and algorithms as supporting information online. Do any particular methods used in the manuscript warrant such treatment?
* Is the manuscript well organized and written clearly enough to be accessible to non-specialists? Would you recommend the author seek the services of a professional science writer?
* Have any parts of the paper been published elsewhere? Are there any copyright issues associated with this that conflict with the PLoS license?
* Does the paper use standardized scientific nomenclature and abbreviations? If not, are these explained at the first usage?

Oxford Journals

Wednesday, October 19, 2011

p-value, q-value (FDR)

For example, if there are 200 spots on a gel and we apply an ANOVA or t-test to each at a significance threshold of 0.05, then we would expect to get 10 false positives by chance alone (200 × 0.05 = 10) if none of the spots truly differ. This is known as the multiple testing problem.

Another way to look at the difference is that a p-value of 0.05 implies that 5% of all tests will result in false positives. An FDR adjusted p-value (or q-value) of 0.05 implies that 5% of significant tests will result in false positives. The latter is clearly a far smaller quantity.

a p-value of 0.01 implies a 1% chance of false positives

To interpret the q-values, you need to look at the ordered list of q-values. There are 839 spots in this experiment. If we take spot 52 as an example, we see that it has a p-value of 0.01 and a q-value of 0.0141. Recall that a p-value of 0.01 implies a 1% chance of false positives, and so with 839 spots, we expect 8 or 9 false positives on average, i.e. 839*0.01 = 8.39. In this experiment, there are 52 spots with a p-value of 0.01 or less, and so 8 or 9 of these are expected to be false positives. On the other hand, the q-value is a little greater at 0.0141, which means we should expect 1.41% of all the spots with a q-value less than this to be false positives. This is a much better situation. We know that 52 spots have a q-value less than 0.0141 and so we should expect 52*0.0141 = 0.7332 false positives, i.e. less than one false positive. Just to reiterate, the expected number of false positives according to p-values takes all 839 tests into account, while q-values take into account only those tests with q-values less than the threshold we choose.
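The arithmetic above, plus one common way of obtaining FDR-adjusted p-values, can be sketched as follows. The source does not say which q-value procedure was used, so the Benjamini-Hochberg adjustment here is an assumption chosen for illustration, and both function names are hypothetical.

```python
def expected_false_positives(n_tests, rate):
    """Expected number of false positives among n_tests at the given rate."""
    return n_tests * rate


def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])  # indices, smallest p first
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so the adjusted values stay monotone.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    print(expected_false_positives(839, 0.01))    # p-value view: all 839 spots
    print(expected_false_positives(52, 0.0141))   # q-value view: the 52 selected spots
    print(bh_adjust([0.01, 0.04, 0.03, 0.005]))
```

The first call scales with every test performed, the second only with the tests that were called significant, which is the distinction the paragraph draws.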

Oliver Sacks

The Mind's Eye
The Island of the Colorblind
An Anthropologist on Mars
The Man Who Mistook His Wife for a Hat

Saturday, October 15, 2011

Common Latex mistakes

Using underscores in text, e.g. hello_world should be hello\_world
Using math symbols in text, e.g. cutoff > 0.8 should be $cutoff > 0.8$

Working with tables:

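\begin{table}[ht]
\centering
{\bf my table title} \\
% table information
\begin{tabular}{|c|c|c|c|} \hline
1 & 2 & 3 & 4 \\ \hline
a & b & c & d \\ \hline
\end{tabular}
\begin{flushleft} my table caption \end{flushleft}
\end{table}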

7 ways to improve your conversation skills

1. Talk slowly
2. Hold more eye contact
3. Notice the details
4. Give unique compliments
5. Express your emotions
6. Offer interesting insights
7. Use the best words

Friday, October 14, 2011


Hemoglobin is also found outside red blood cells and their progenitor lines. Other cells that contain hemoglobin include the A9 dopaminergic neurons in the substantia nigra, macrophages, alveolar cells, and mesangial cells in the kidney. In these tissues, hemoglobin has a non-oxygen-carrying function as an antioxidant and a regulator of iron metabolism.

Hemoglobin variants are a part of the normal embryonic and fetal development,

N50 - a length-weighted summary statistic for a set of sequences (not a simple average of their lengths).
N50 scaffold/contig length is calculated by summing lengths of scaffolds/contigs from the longest to the shortest and determining at what point you reach 50% of the total assembly size. The length of the scaffold/contig at that point is the N50 length.
A sequence contig is a contiguous, overlapping sequence read resulting from the reassembly of the small DNA fragments generated by bottom-up sequencing strategies.
The N50 size is computed by sorting all contigs from largest to smallest and by determining the minimum set of contigs whose sizes total 50% of the entire assembly. For example, for a genome of 600Mb, if the assembled sequences add up to 500Mb, the N50 would be calculated by sorting the contigs from largest to smallest and finding the length of the contig where the cumulative size reaches 250Mb.

Given a set of sequences of varying lengths, the N50 length is defined as the length N for which 50% of all bases in the sequences are in a sequence of length L ≥ N.

The N50 is therefore a length-weighted median, not the plain median of the list of contig lengths; L50 is the number of contigs needed to reach the 50% point.

N50 of {2, 2, 2, 3, 3, 4, 8, 8} is 8 by the cumulative definition above (total 32; 8 + 8 = 16 reaches 50%). The alternative weighted-median convention, which averages the 16th and 17th of the 32 bases ranked by the length of their contig (4 and 8), gives 6.
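A minimal sketch of the cumulative definition (sort from longest to shortest, stop when the running total reaches half of the assembly size). The function name is an illustrative choice, and it returns the contig count at that point (L50) alongside the N50 length.

```python
def n50(lengths):
    """Return (N50 length, L50 count) for a list of contig lengths."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            return length, count
    raise ValueError("lengths must be non-empty")


if __name__ == "__main__":
    print(n50([2, 2, 2, 3, 3, 4, 8, 8]))
```

For the example set the two longest contigs already cover 16 of the 32 bases, so the sketch reports a length of 8 reached after 2 contigs.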

Taking Shelter 2011

Plagued by a series of apocalyptic visions, a young husband and father questions whether to shelter his family from a coming storm, or from himself.

Wednesday, October 12, 2011

Download GEO files using R and GEOquery

source("") # download BioC installation routines
biocLite() # install the core packages
biocLite("GEOquery") # install the GEO libraries

library(GEOquery)
getGEOSuppFiles("GSE20987") # download the supplementary files into ./GSE20987/
untar("GSE20987/GSE20987_RAW.tar", exdir="data") # unpack the raw archive
cel_files <- list.files("data/", pattern = "gz")
cel_files_qualified <- paste("data", cel_files, sep="/")
sapply(cel_files_qualified, gunzip) # decompress each file

Mouse embryology

Day 21 (E21.0)

Hearing and Balance - Fully mature morphological and physiological innervation of vestibular system (P28)

PCW - postconceptional weeks
GL - greatest length

List of images in Gray's Anatomy: IX. Neurology


Diploid short read assembly

Assemblathon 1,
Earl et al, Assemblathon 1: A competitive assessment of de novo short
read assembly methods. Genome Res. 2011 Sep 16. PubMed PMID: 21926179

Simpson et al, ABySS: a parallel assembler for short read sequence data.
Genome Res. 2009. Jun;19(6):1117-23. PubMed PMID: 19251739; PubMed
Central PMCID: PMC2694472

Test if two samples are different

visually using boxplot()

assuming normal distribution, use

t.test(A, B) #default does not assume equality of variance, small p-value = diff. means, Welch Two sample t-test (unpaired)

var.test(A, B) # test for equality of variance, big p-value = same variance

t.test(A, B, var.equal=TRUE) # Two sample t-test

wilcox.test(A, B) # does not assume normality, only a common continuous distribution under the null hypothesis; small p-value = the distributions differ in location
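To make clear what the default t.test(A, B) computes, here is a rough sketch of the Welch statistic and its approximate degrees of freedom using only the standard library. The function name is hypothetical, and the p-value is left out because it needs the t-distribution CDF, which the standard library does not provide.

```python
from statistics import mean, variance


def welch_t(a, b):
    """Welch two-sample t statistic and Welch-Satterthwaite degrees of freedom."""
    va = variance(a) / len(a)   # squared standard error of mean(a)
    vb = variance(b) / len(b)   # squared standard error of mean(b)
    t = (mean(a) - mean(b)) / (va + vb) ** 0.5
    df = (va + vb) ** 2 / (va**2 / (len(a) - 1) + vb**2 / (len(b) - 1))
    return t, df


if __name__ == "__main__":
    print(welch_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10]))
```

Because the two sample variances are kept separate, the degrees of freedom are usually not an integer, which is what the R output shows for the Welch test.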

Tests of agreement with normality, comparing distributions

Kolmogorov-Smirnov test (KS test) for normality, "Do x and y come from the same distribution?"

( see ks.test() in R )

x <- rnorm(50)
y <- runif(30)
ks.test(x, y)

you get a small p-value, so reject the hypothesis that the distributions are the same: the two distributions are different.

small p-value = different
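The statistic behind the two-sample KS test is the largest vertical gap between the two empirical distribution functions. A minimal sketch follows (statistic only, no p-value); the function name is an illustrative choice and not part of any library.

```python
from bisect import bisect_right


def ks_statistic(x, y):
    """Largest absolute difference between the empirical CDFs of x and y."""
    xs, ys = sorted(x), sorted(y)
    d = 0.0
    # The gap can only change at observed values, so checking those is enough.
    for v in xs + ys:
        fx = bisect_right(xs, v) / len(xs)   # fraction of x that is <= v
        fy = bisect_right(ys, v) / len(ys)   # fraction of y that is <= v
        d = max(d, abs(fx - fy))
    return d


if __name__ == "__main__":
    print(ks_statistic([1, 2, 3], [4, 5, 6]))   # fully separated samples
    print(ks_statistic([1, 3], [2, 4]))         # interleaved samples
```

A statistic near 1 means the samples barely overlap; a statistic near 0 means the two empirical distributions track each other closely.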


Shapiro-Wilk normality test

( see shapiro.test() in R )


visually using QQ plot (Q-Q plot, see qqnorm() and qqplot() )  x: empirical data, y: theoretical data, ideally, data should lie close to the diagonal

Contingency table in R using table()

An Introduction to R

Frequency, contingency table in R using table()


You can generate frequency tables using the table( ) function, tables of proportions using the prop.table( ) function, and marginal frequencies using margin.table( ).

a <- rep(c(NA, 1/0:3), 10)

> a
[1]        NA       Inf 1.0000000 0.5000000 0.3333333        NA       Inf
[8] 1.0000000 0.5000000 0.3333333        NA       Inf 1.0000000 0.5000000
[15] 0.3333333        NA       Inf 1.0000000 0.5000000 0.3333333        NA
[22]       Inf 1.0000000 0.5000000 0.3333333        NA       Inf 1.0000000
[29] 0.5000000 0.3333333        NA       Inf 1.0000000 0.5000000 0.3333333
[36]        NA       Inf 1.0000000 0.5000000 0.3333333        NA       Inf
[43] 1.0000000 0.5000000 0.3333333        NA       Inf 1.0000000 0.5000000
[50] 0.3333333

> table(a)
0.333333333333333               0.5                 1               Inf
               10                10                10                10


# 2-Way Frequency Table
A <- letters[1:3]
B <- sample(A) # random permutation of A
mytable <- table(A,B) # A will be rows, B will be columns

> mytable # print table
#   B
#A   a b c
#  a 0 1 0
#  b 1 0 0
#  c 0 0 1

margin.table(mytable, 1) # A frequencies (summed over B)
margin.table(mytable, 2) # B frequencies (summed over A)
prop.table(mytable) # cell proportions (sum to 1 over the whole table)
prop.table(mytable, 1) # row proportions (each row sums to 1)
prop.table(mytable, 2) # column proportions (each column sums to 1)
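For comparison, the same ideas (frequency table, two-way table, margins, proportions) can be sketched in Python with collections.Counter. All function names below are illustrative choices rather than an existing API, and None plays the role of NA.

```python
from collections import Counter


def frequency_table(values):
    """Counts per distinct value, skipping None (as table() skips NA)."""
    return Counter(v for v in values if v is not None)


def two_way_table(rows, cols):
    """Counts per (row value, column value) pair."""
    return Counter(zip(rows, cols))


def margin(table, axis):
    """Sum a two-way table over the other axis (0 = rows, 1 = columns)."""
    out = Counter()
    for key, n in table.items():
        out[key[axis]] += n
    return out


def proportions(table):
    """Cell proportions that sum to 1 over the whole table."""
    total = sum(table.values())
    return {key: n / total for key, n in table.items()}


if __name__ == "__main__":
    a = [None, float("inf"), 1.0, 0.5, 1 / 3] * 10
    print(frequency_table(a))
    t = two_way_table(list("abc"), list("bac"))
    print(margin(t, 0), proportions(t))
```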

QR-decomposition and least squares

Least Squares process

Linear least squares regression is by far the most widely used modeling method. It is what most people mean when they say they have used "regression", "linear regression" or "least squares" to fit a model to their data.

Linear least squares regression also gets its name from the way the estimates of the unknown parameters are computed

f(x;\vec{\beta}) = \beta_0 + \beta_1x + \beta_{11}x^2

Least Squares Problem: Given an inconsistent system of equations, Ax = b, we want to find a vector, x, from R^m so that the error ||b - Ax|| is the smallest possible error. The vector x is called the least squares solution.


see lm() and lsfit() in R for least squares fitting procedure


QR-Decomposition (see qr() in R)

There is a nice application of the QR-Decomposition to the Least Squares Process.

Theorem 3 Suppose that A has linearly independent columns. Then the normal system associated with Ax=b can be written as, Rx = t(Q)b

Theorem 1 Suppose that A is an n x m matrix with linearly independent columns then A can be factored as, A = QR

where Q is an n x m matrix with orthonormal columns and R is an invertible m x m upper triangular matrix.
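The two theorems can be put together in a short sketch: factor A = QR, form t(Q)b, then solve the triangular system Rx = t(Q)b by back substitution. The modified Gram-Schmidt factorization used here is just one simple way to obtain Q and R and is an assumption of this sketch; R's qr() uses a different algorithm internally. Function names are hypothetical.

```python
def qr_gram_schmidt(A):
    """Factor A (n rows, m linearly independent columns) as Q (n x m) times R (m x m)."""
    n, m = len(A), len(A[0])
    Q = [[0.0] * m for _ in range(n)]
    R = [[0.0] * m for _ in range(m)]
    for j in range(m):
        v = [A[i][j] for i in range(n)]           # column j of A
        for k in range(j):                         # remove components along earlier q_k
            R[k][j] = sum(Q[i][k] * v[i] for i in range(n))
            v = [v[i] - R[k][j] * Q[i][k] for i in range(n)]
        R[j][j] = sum(x * x for x in v) ** 0.5     # length of what remains
        for i in range(n):
            Q[i][j] = v[i] / R[j][j]
    return Q, R


def least_squares(A, b):
    """Solve Rx = t(Q)b by back substitution."""
    Q, R = qr_gram_schmidt(A)
    n, m = len(A), len(A[0])
    t = [sum(Q[i][j] * b[i] for i in range(n)) for j in range(m)]  # t(Q) b
    x = [0.0] * m
    for j in reversed(range(m)):
        s = sum(R[j][k] * x[k] for k in range(j + 1, m))
        x[j] = (t[j] - s) / R[j][j]
    return x


if __name__ == "__main__":
    # Fit intercept and slope to the points (0, 1), (1, 2), (2, 2): an inconsistent system.
    print(least_squares([[1, 0], [1, 1], [1, 2]], [1, 2, 2]))
```

Because R is upper triangular and invertible, no matrix inverse is formed; the solution comes from one pass of back substitution.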

Tuesday, October 11, 2011

Illumina BeadChips

mouseWG-6 v2 BeadChip®

The BeadChip Mouse Sentrix-6 V2 offers comprehensive analysis of genome-wide expression on a single array.

* probes are defined gene-specific 50mer oligonucleotides
* >700,000 oligonucleotides per bead/spot
* on average 30x redundancy for each transcript
* Provides comprehensive coverage of the transcribed mouse genome on a single array
* Analyzes the expression level of 45,281 mouse transcripts, variants, and EST clusters
* comprised of more than 1,600,000 beads per chip
* Up-to-date gene list and annotation of Mouse Sentrix-6 V2 BeadChip

For more information please download the Mouse Sentrix-6 V2 Whole Genome BeadChip datasheet and the corresponding technical bulletin from Illumina. One physical array is made up of 6 identical, but independent chips.

The Illumina BeadChip is a proprietary method of performing multiplex gene expression and genotyping analysis. The essential element of BeadChip technology is the attachment of oligonucleotides to silica beads. The beads are then randomly deposited into wells on a substrate (for example, a glass slide). The resultant array is decoded to determine which oligonucleotide-bead combination is in which well. The decoded arrays may be used for a number of applications, including gene expression analysis and genotyping.
Expression array overview

Gene expression analysis is performed using a 79-base oligonucleotide that has two segments. The 5′ 50-base segment of the oligonucleotide is designed to hybridize to sequences available in the public data repositories. It is this segment that will bind to the labeled target derived from the poly(A) component of the total RNA. The 3′ 29-base segment of the oligonucleotide is the address. The address is a unique sequence created by Illumina specifically to allow unambiguous identification of the oligonucleotide after it has been deposited on the array.


Arrays may have as many as 44,000 unique oligonucleotides. Each oligonucleotide is synthesized in a large batch using standard technologies. The oligonucleotides are then attached to the surface of a 3-micron silica bead. Each bead has only one type of oligonucleotide attached to it, but it has hundreds of thousands of copies of this oligonucleotide.

Standard lithographic techniques are used to create a honeycomb pattern of wells on the surface of glass slides. Each well can hold one bead. The beads for a given array are mixed in equal amounts and deposited on the slide surface. The beads occupy the wells in a random distribution. Each bead is represented by, on average, about 20 instances within the array. The identity of each bead is determined by decoding using the address sequence. A unique array layout file is then associated with each array and used to decode the data during scanning of the array.
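
To get a feel for why the roughly 20-fold redundancy matters, here is a small Python sketch. It assumes (my simplification, not Illumina's stated model) that every well is filled independently and uniformly at random from the pooled bead types, and it uses the figures quoted above: up to 44,000 bead types and about 20 copies of each on average. The function names are made up for the illustration.

```python
import math
import random
from collections import Counter

def prob_type_missing(n_types, n_wells):
    """Chance that one given bead type ends up in none of the wells,
    assuming each well independently receives a uniformly random type."""
    return (1.0 - 1.0 / n_types) ** n_wells

def simulate_counts(n_types, n_wells, seed=0):
    """Randomly fill n_wells wells and return the copy count per bead type."""
    rng = random.Random(seed)
    counts = Counter(rng.randrange(n_types) for _ in range(n_wells))
    return [counts.get(t, 0) for t in range(n_types)]

# Figures taken from the text above; the uniform-deposition model is an assumption.
n_types = 44_000
mean_copies = 20
n_wells = n_types * mean_copies

p_missing = prob_type_missing(n_types, n_wells)
expected_missing = n_types * p_missing
print("P(a given type is absent):", p_missing)
print("expected number of absent types:", expected_missing)

# A much smaller simulated array (hypothetical sizes) to see the spread of counts.
small = simulate_counts(n_types=500, n_wells=500 * mean_copies, seed=1)
print("min / max copies per type in the small simulation:", min(small), max(small))
```

Under this model the chance that a given bead type is absent is close to exp(-20), so even with tens of thousands of types a missing probe should be very rare, while the number of copies per type still varies from one type to the next.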

International Genetically Engineered Machine competition (iGEM)

The International Genetically Engineered Machine competition (iGEM) is the premier undergraduate Synthetic Biology competition. Student teams are given a kit of biological parts at the beginning of the summer from the Registry of Standard Biological Parts. Working at their own schools over the summer, they use these parts and new parts of their own design to build biological systems and operate them in living cells. This project design and competition format is an exceptionally motivating and effective teaching method.

Erroneous analyses of interactions in neuroscience: a problem of significance

Whatever the reasons for the error, its ubiquity and potential effect suggest that researchers and reviewers should be more aware that the difference between significant and not significant is not itself necessarily significant.

A fictive example would be “Hippocampal firing synchrony correlated with memory performance in the placebo condition (r = 0.43, P = 0.01), but not in the drug condition (r = 0.19, P = 0.21)”. When making a comparison between two correlations, researchers should directly contrast the two correlations using an appropriate statistical method.
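
One common way to contrast two correlations directly is Fisher's r-to-z transformation. The Python sketch below applies it to the fictive numbers above. It assumes the two conditions are independent groups, and the sample sizes (35 per group) are hypothetical, since the excerpt does not give them; the function name is also made up.

```python
import math

def compare_independent_correlations(r1, n1, r2, n2):
    """Two-sided test for a difference between two independent Pearson
    correlations using Fisher's r-to-z transformation.

    Returns the z statistic and the two-sided p-value."""
    z1 = math.atanh(r1)  # Fisher transform of the first correlation
    z2 = math.atanh(r2)  # Fisher transform of the second correlation
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (z1 - z2) / se
    # Two-sided p-value from the standard normal distribution.
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p

# r values from the fictive example; n = 35 per group is an assumption.
z_stat, p_value = compare_independent_correlations(0.43, 35, 0.19, 35)
print("z =", round(z_stat, 3), "p =", round(p_value, 3))
```

With these assumed sample sizes the difference between the two correlations is not itself significant at the 0.05 level, even though one correlation is significant and the other is not, which is exactly the point of the paper.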


A friend is one of the nicest things you can have, and one of the best things you can be.
~Douglas Pagels

Trainees in bioinformatics and computational biology should seek depth of knowledge over breadth.

Virginia Gewin

“We don't know where we will be in ten years because the technologies and ideas are moving so fast,” he says. As Cleaver notes: “Perhaps the best career strategy is to stay flexible and curious.”

38 tips on writing an academic CV

Posted by Rachel Bowden on Sep 27, 2011

"[Academia] seems to be the only field where you can make it as long as you want it to be,"

The most important information should be on the first half of the first page, says Baker, and the very first thing should be your name, not the words 'curriculum vitae'.

Content: the basics
The three main sections that should form the bulk of your academic CV are:
* Research
* Teaching
* Administration

business and vision

You need to have a good vision to succeed in business.
--Sangdo / Merchant k-drama

Monday, October 10, 2011

Best places to work


trep·i·da·tion (trĕp′ĭ-dā′shən)
1. A state of alarm or dread; apprehension. See Synonyms at fear.
2. An involuntary trembling or quivering.

trepidation [ˌtrɛpɪˈdeɪʃən]
1. a state of fear or anxiety
2. a condition of quaking or palpitation, esp one caused by anxiety

Non-coding RNAs: could they be the answer?

Brief Funct Genomics. 2010 Dec 22.
Non-coding RNAs: could they be the answer?
Costa FF.

Despite a considerable amount of effort by different groups to evaluate the genetic traits associated with complex diseases by genome-wide association studies (GWAS), just a few regions, mainly linked to protein-coding genes, were identified. Recently, studies from different groups have implicated new classes of long non-coding RNAs (ncRNAs) to important molecular mechanisms. Additionally, high-throughput transcriptome analyses of different cell types have shown that an unexpected amount of genomic DNA is transcribed. I am writing to propose that the majority of the regions that do not clearly correspond to a 'gene' controlling certain traits might be ncRNAs or other regulatory transcripts that are still unknown. These regions will need to be carefully examined in the future.

Sunday, October 9, 2011

Early worm

"I think we consider too much the good luck of the early bird and not enough the bad luck of the early worm."
--Theodore Roosevelt

Friday, October 7, 2011

Chaos and stillness

"In the midst of movement and chaos keep stillness inside of you."
--Deepak Chopra

Mapping gene IDs, microarray probeset probe IDs

You can't fool God!

"God knows what you've been doing, everything you've been doing. You may fool me, but you can't fool God!"
--Great Gatsby

Thursday, October 6, 2011

Computational and statistical approaches to analyzing variants identified by exome sequencing

Nathan O Stitziel, Adam Kiezun and Shamil Sunyaev

New sequencing technology has enabled the identification of thousands of single nucleotide polymorphisms in the exome, and many computational and statistical approaches to identify disease-association signals have emerged.

Here we review the computational and statistical approaches that have emerged for managing these data in this rapidly exploding field. First, we briefly review the process for identifying variants in next-generation sequencing (NGS) studies and then discuss strategies for identifying the causal variant in Mendelian disorders among the total number of variants identified. We also discuss strategies for identifying the causal gene(s) in complex diseases among all genes in the genome, before outlining some challenges facing current exome sequencing studies.

Waltz through hippocampal neuropil

Reconstruction of a block of hippocampus from a rat approximately 5 micrometers
on a side from serial section transmission electron microscopy in the lab
of Kristen Harris at the University of Texas at Austin in collaboration with
Terry Sejnowski at the Salk Institute and Mary Kennedy at Caltech. Josef Spacek,
Daniel Keller, Varun Chaturvedi, Chandrajit Bajaj, Justin Kinney and Tom Bartol
made major contributions to the reconstruction and the video.


TIOBE Programming Community Index for September 2011

Position   Position   Programming      Rating      Delta from   Status
Sep 2011   Sep 2010   Language         Sep 2011    Sep 2010
1          1          Java             18.761%     +0.85%       A
2          2          C                18.002%     +0.86%       A
3          3          C++              8.849%      -0.96%       A
4          6          C#               6.819%      +1.80%       A
5          4          PHP              6.596%      -1.77%       A
6          8          Objective-C      6.158%      +2.79%       A
7          5          (Visual) Basic   4.420%      -1.38%       A
8          7          Python           4.000%      -0.58%       A
9          9          Perl             2.472%      +0.03%       A
10         11         JavaScript       1.469%      -0.20%       A
11         10         Ruby             1.434%      -0.47%       A

Brain disorders cost Europe 800 billion euros a year: study

The cost of brain disorders in Europe soared to 798 billion euros last year, double the figure for 2005 and equating to 1,550 euros per capita, says a new report.

The bill will continue to rise as people live longer, and this represents "the number one economic challenge for European health care now and in the future," says the study, which was commissioned by the European Brain Council (EBC).

But without urgent action, the situation can only worsen, given the continuing rise in life expectancy in Europe, the authors warn. Because of the ageing population, degenerative disorders such as dementia, Parkinson's and stroke are particularly destined to become more common, but anxiety and mood disorders are also very prevalent in older populations, they add.


"A teacher affects eternity; he can never tell where his influence stops."
--Henry B. Adams

Monday, October 3, 2011

download wikipedia

Department of Numbers

The Department of Numbers contextualizes public data so that individuals can form independent opinions on everyday social and economic matters.

Sunday, October 2, 2011

Best Korean Dramas of 2010

k3b and brasero problems, can read DVD but not CD

# install cdrtools as a replacement of cdrkit
dpkg -i ./libscg1*.deb
dpkg -i ./cdda2wav*.deb
dpkg -i ./cdrecord*.deb
dpkg -i ./mkisofs*.deb
dpkg -i ./cdrtools*.deb
$ wodim --version
Cdrecord-ProDVD-ProBD-Clone 3.01a03 (x86_64-unknown-linux-gnu) Copyright (C) 1995-2010 Joerg Schilling

Another reason the CD/DVD drive might not read a CD properly is that the drive is not compatible with that type of disc. My drive can read Memorex CD-Rs (650MB), Fujifilm CD-Rs (700MB) and DVDs (including Maxell DVDs), but not Maxell CD-Rs (700MB). Maybe it has something to do with the speed at which the CD was written (16x, 24x, 40x, 48x)?

$ sudo chmod +s /usr/bin/wodim
$ sudo k3b

November 7th, 2010, 04:41 AM
I had this problem with k3b 1.91.0 on ubuntu 10.04.
I've had to purge and reinstall some packages:
hal libhal1 libhal-storage1 k3b wodim
(apt-get wants to remove many packages in addition to these, you have to reinstall all of them)

k3b always says cdrecord has no permission to open the device, and brasero knows there's a blank disc in but says it has 0 bytes free

LG GSA-H55N Firmware

Saturday, October 1, 2011

Cell-Specific Mechanism-Based Gene Therapy Approach to Treat Retinitis Pigmentosa

Furthermore, they used a tissue-specific promoter to achieve cell-specific expression of the transduced genes, which is unusual for shRNA delivery.

ALLEN INSTITUTE FOR BRAIN SCIENCE 2011 Annual Symposium: Open Questions in Neuroscience

Sacha B. Nelson, Brandeis University
Defining the mammalian neurome

Nathaniel Heintz, Investigator, Howard Hughes Medical Institute
Research in Dr. Heintz’s laboratory aims to identify the genes, circuits, cells, macromolecular assemblies and individual molecules that contribute to the function and dysfunction of the mammalian brain. Dr. Heintz and his colleagues have developed a suite of novel approaches based on the manipulation of bacterial artificial chromosomes (BACs) to investigate the histological and functional complexities of the mammalian brain in vivo and to understand how these mechanisms become dysfunctional in disease.

Pamela Sklar, M.D., Ph.D.
Genomics and psychiatry