Tuesday, November 30, 2010

2010 Genome Sciences Centre Forum

miRNA / mRNA regulation

Talk by Phil Sharp at the 2010 Genome Sciences Centre Forum

- seed region (nucleotides 2-7) near the 5' end of the miRNA
- miR-290-295, miR-21, let-7
- fibroblast (extracellular matrix, ECM, connective tissue) converted to iPS (induced pluripotent stem) cells via Oct4, Sox2, Nanog, Tcf3 (http://en.wikipedia.org/wiki/Induced_pluripotent_stem_cell)
- Hanahan and Weinberg 2000, The hallmarks of cancer.
- loss in miRNA leads to increase in tumor formation
- there's a threshold when miRNA stops working ...
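
A minimal Python sketch of the seed idea in the first bullet: take nucleotides 2-7 of the miRNA and look for their reverse complement in a target 3' UTR. The let-7a sequence is the commonly cited mature sequence; the UTR fragment and the function names are made up for illustration, and real target prediction considers much more than a perfect seed match.

```python
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def seed(mirna, start=2, end=7):
    """Return nucleotides start..end (1-based, inclusive) of the miRNA."""
    return mirna[start - 1:end]


def reverse_complement(rna):
    """Reverse complement of an RNA string (Watson-Crick pairs only)."""
    return "".join(COMPLEMENT[base] for base in reversed(rna))


def seed_sites(mirna, utr):
    """0-based UTR positions that pair perfectly with the miRNA seed."""
    site = reverse_complement(seed(mirna))
    return [i for i in range(len(utr) - len(site) + 1)
            if utr[i:i + len(site)] == site]


LET7A = "UGAGGUAGUAGGUUGUAUAGUU"      # mature let-7a
TOY_UTR = "AAGCUACCUCAGGAUUACCUCUU"   # hypothetical 3' UTR fragment

print(seed(LET7A))                 # GAGGUA
print(seed_sites(LET7A, TOY_UTR))  # [4, 15]
```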

--------------
Talk by Angie Brooks-Wilson (G3: Genetics, Genomics, Gerontology)
- GWAS
- super seniors, healthy >85 year-olds
- ~20% genetics
- APOE4 - Alzheimer, heart disease (rs429358 SNP)
- BECN1 - lifespan in C. elegans (rs10512488 SNP)
- increase in cytokines -> increase in inflammation, tendency to age?

Monday, November 29, 2010

anisotropic - not the same in every direction

thus the origin of the word: "an" for not, "iso" for same, and "tropic" from tropism, relating to direction; anisotropic filtering does not filter the same in every direction

http://en.wikipedia.org/wiki/Anisotropic_filtering

LaTeX Subfigures to insert multiple figures

http://en.wikibooks.org/wiki/LaTeX/Floats,_Figures_and_Captions

Thursday, November 25, 2010

RNAi off-target effects

However, ‘off-target effects’ compromise the specificity of RNAi if sequence identity between siRNA and random mRNA transcripts causes RNAi to knockdown expression of non-targeted genes. The complete off-target effects must be investigated systematically on each gene in a genome by adjusting a group of parameters, which is too expensive to conduct experimentally and motivates a study in silico.

http://nar.oxfordjournals.org/content/33/6/1834
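
A rough in silico version of the off-target check described above, as a sketch only: slide the siRNA's target sequence along every non-target transcript and report windows within a few mismatches. The transcript names, the toy 10-nt sequences, and the mismatch cutoff are assumptions for illustration (real siRNAs are about 21 nt, and the linked paper's method is more involved than this).

```python
def mismatches(a, b):
    """Number of positions where two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def off_target_hits(target_site, transcripts, intended, max_mismatches=2):
    """List (transcript, position, mismatches) for near-identical sites
    found in transcripts other than the intended one."""
    k = len(target_site)
    hits = []
    for name, seq in transcripts.items():
        if name == intended:
            continue
        for i in range(len(seq) - k + 1):
            d = mismatches(target_site, seq[i:i + k])
            if d <= max_mismatches:
                hits.append((name, i, d))
    return hits


# Hypothetical toy data
site = "ACGTACGTAC"
transcripts = {
    "geneA": "TTACGTACGTACTT",   # intended target
    "geneB": "GGACGTACGAACGG",   # one mismatch away at position 2
    "geneC": "TTTTTTTTTTTTTT",
}
print(off_target_hits(site, transcripts, intended="geneA"))
# [('geneB', 2, 1)]
```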

nepotism

Nepotism is favoritism granted to relatives or friends regardless of merit.

Wednesday, November 24, 2010

w3m - a text based Web browser and pager


$ w3m http://localhost:8080


Press 'Insert' to open the menu; press Enter on a hyperlink to follow it.

Sunday, November 21, 2010

Linear Algebra - Eigenvector, Eigenvalue

Some vectors keep their direction, or exactly reverse it, when a matrix acts on them. These vectors are the eigenvectors of the matrix. A matrix acts on an eigenvector by multiplying its magnitude by a factor, which is positive if its direction is unchanged and negative if its direction is reversed. This factor is the eigenvalue associated with that eigenvector.

http://en.wikipedia.org/wiki/Eigenvalue,_eigenvector_and_eigenspace

http://www.mathworks.com/products/statistics/demos.html?file=/products/demos/shipping/stats/cmdscaledemo.html

R's or Matlab's cmdscale(D)
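
To make the definition concrete, here is a small pure-Python check that A v = lambda v for the dominant eigenvector, found by power iteration. The 2x2 matrix and the function names are my own example; classical MDS (cmdscale) relies on the same idea, taking the top eigenvectors of a centered distance matrix, but this sketch does not implement MDS itself.

```python
def matvec(A, v):
    """Multiply matrix A (list of rows) by vector v."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def power_iteration(A, iterations=200):
    """Approximate the dominant eigenvalue/eigenvector of A.
    Assumes the start vector is not orthogonal to that eigenvector."""
    v = [1.0] + [0.0] * (len(A) - 1)
    for _ in range(iterations):
        w = matvec(A, v)
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    # Rayleigh quotient v . (A v); v has unit length
    eigenvalue = sum(x * y for x, y in zip(v, matvec(A, v)))
    return eigenvalue, v


A = [[2.0, 1.0],
     [1.0, 2.0]]
lam, v = power_iteration(A)
print(round(lam, 6))             # 3.0
print([round(x, 4) for x in v])  # [0.7071, 0.7071]
```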

Saturday, November 20, 2010

binding surface, hiv-1, h1n1 influenza, text-mining


Identification of protein binding surfaces using surface triplet propensities.
http://www.ncbi.nlm.nih.gov/pubmed/20819959


Computational Models of HIV-1 Resistance to Gene Therapy Elucidate Therapy Design Principles
http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1000883



Low-dimensional clustering detects incipient dominant influenza strain clusters
http://peds.oxfordjournals.org/content/23/12/935.full

EnvMine: A text-mining system for the automatic extraction of contextual information
http://www.biomedcentral.com/1471-2105/11/294

Thursday, November 18, 2010

SVM, bagging, boosting, normalization

Sequential minimal optimization (SMO), a fast algorithm for
training SVM [26,27], was used to build MC-SVM kernel
function models, as implemented in WEKA.

Bagging vs. Boosting (Freund and Schapire 1996). Bagging (bootstrap resampling) vs. boosting (iterative reweighting) -- both combine many classifiers into an ensemble; bagging mainly reduces variance, while boosting mainly reduces bias.
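
The two sampling schemes side by side, as a sketch: a bootstrap resample (the core of bagging) and one round of AdaBoost-style reweighting (the core of boosting). The function names and toy inputs are mine, and the reweighting assumes the weighted error is strictly between 0 and 1.

```python
import math
import random


def bootstrap_sample(data, rng):
    """Bagging: draw len(data) items with replacement."""
    return [rng.choice(data) for _ in data]


def adaboost_reweight(weights, correct):
    """Boosting: one AdaBoost round.
    weights: current sample weights (summing to 1)
    correct: True where the current classifier got the sample right."""
    err = sum(w for w, ok in zip(weights, correct) if not ok)
    alpha = 0.5 * math.log((1 - err) / err)   # classifier weight
    new = [w * math.exp(-alpha if ok else alpha)
           for w, ok in zip(weights, correct)]
    total = sum(new)
    return alpha, [w / total for w in new]


rng = random.Random(0)
data = list(range(10))
sample = bootstrap_sample(data, rng)   # some items repeat, some are left out

alpha, weights = adaboost_reweight([0.25] * 4, [True, True, True, False])
print([round(w, 4) for w in weights])  # [0.1667, 0.1667, 0.1667, 0.5]
```

After the update the misclassified sample carries half of the total weight, so the next classifier is pushed to get it right.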

The development of PIPA: an integrated and automated pipeline for genome-wide protein function annotation

-- And with microarrays, it seems that the results are largely dependent on the data itself, not so much on the algorithms / classifiers used (so pick and choose which ones, and you might squeeze in a little performance above state of the art).

Nat Genet. 2002 Dec;32 Suppl:496-501.
Microarray data normalization and transformation.
Quackenbush J.

http://www.nature.com/ng/journal/v32/n4s/full/ng1032.html

The goal of most microarray experiments is to survey patterns of gene expression by assaying the expression levels of thousands to tens of thousands of genes in a single assay.

The hypothesis underlying microarray analysis is that the measured intensities for each arrayed gene represent its relative expression level. Biologically relevant patterns of expression are typically identified by comparing measured expression levels between different states on a gene-by-gene basis. But before the levels can be compared appropriately, a number of transformations must be carried out on the data to eliminate questionable or low-quality measurements, to adjust the measured intensities to facilitate comparisons, and to select genes that are significantly differentially expressed between classes of samples.

Total intensity normalization: using this approach, a normalization factor is calculated by summing the measured intensities in both channels and taking their ratio.
Locally weighted linear regression (lowess) analysis [6] has been proposed [4, 5] as a normalization method that can remove such intensity-dependent effects in the log2(ratio) values.
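
A small sketch of the total intensity idea, assuming roughly equal amounts of RNA were labeled in both channels: the factor is the ratio of summed red to summed green intensities, and the green channel is scaled by it before taking log2 ratios. The intensities and function name are invented, and lowess is not shown here.

```python
import math


def total_intensity_normalize(red, green):
    """Return the normalization factor and normalized log2(R/G) ratios."""
    n_total = sum(red) / sum(green)
    log_ratios = [math.log2(r / (g * n_total)) for r, g in zip(red, green)]
    return n_total, log_ratios


red = [100.0, 400.0, 700.0]     # hypothetical Cy5 intensities
green = [100.0, 100.0, 400.0]   # hypothetical Cy3 intensities

n_total, log_ratios = total_intensity_normalize(red, green)
print(n_total)                               # 2.0
print([round(x, 3) for x in log_ratios])     # [-1.0, 1.0, -0.193]
```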

Wednesday, November 17, 2010

Proteomics

Human Proteome Project (HPP)
Human Proteome Organisation (HUPO)

http://www.hupo.org/research/default.asp
http://en.wikipedia.org/wiki/Proteomics

Investigating the correspondence between transcriptomic and proteomic expression profiles using coupled cluster models
http://bioinformatics.oxfordjournals.org/content/24/24/2894

Chris Overall

http://www.clip.ubc.ca/personnel/alumni.html

Leonard Foster
http://www.chibi.ubc.ca/faculty/foster

Tools
PeptideProphet http://peptideprophet.sourceforge.net/
ProteinProphet
Sequence Logo iceLogo
Mascot (Matrix Science)
X! Tandem http://www.thegpm.org/tandem/
MSQuant is a tool for quantitative proteomics/mass spectrometry and processes spectra and LC runs to find quantitative information about proteins and peptides.
MSQuant http://msquant.sourceforge.net/ 

Tuesday, November 16, 2010

Research quote

In research you really have to love and be committed to your work because things have more of a chance of going wrong than right. But when things go right, there is nothing more exciting.

- Dr. Michael Smith

Google Docs has Drawing and Forms

docs.google.com

Google Refine

http://code.google.com/p/google-refine/

Google Refine is a power tool for working with messy data, cleaning it up, transforming it from one format into another, extending it with web services, and linking it to databases like Freebase.

Friday, November 12, 2010

Grad school

Advice for Undergraduates Considering Graduate School, by Phil Agre

 http://polaris.gseis.ucla.edu/pagre/grad-school.html

Graduate school is training in research. It is for people who love research, scholarship, and teaching for their own sake and for the difference they can sometimes make in the world. It is not for people who simply want more undergraduate courses. It is not for people who are in a hurry to get a real job. The eventual goal of many doctoral students is to get a job as a college professor, or perhaps in industrial or government research. Some in technical subjects go on to start companies. But many just do it because they like it.

The best part of graduate school, the part that makes it worthwhile, comes toward the end, when you begin to present your research in public. Suddenly you will begin to join the community of scholars who work in your chosen area; they will take you seriously and you will begin to make numerous professional acquaintances, some of whom you will probably keep for the rest of your life. (I've written another article, similar to this one, about this process of professional networking. It's online at http://dlis.gseis.ucla.edu/pagre/network.html .)

In graduate school, though, your personal identity will almost certainly undergo great change. In particular, you will acquire a particular sort of professional identity: you will become known as the person who wrote such-and-such a paper, who did such-and-such research, who refuted such-and-such theory, or who initiated such-and-such line of inquiry. This process can be tremendously satisfying. But it's not for everyone.

"Hello. I'd like to ask your advice. I am thinking I might want to go to graduate school, but I'm still uncertain about where I would go or what exactly I would study. I do know that I'm pretty interested in such-and-such. How would I find out about graduate schools in that area?" Some common responses to this are as follows:
(1) "I don't actually know much about that area, but you should talk to so-and-so who is really the expert on that." Go talk to so-and-so.
(2) "I think you're going to have to define your interests a little better before I can help you." Ask for help in defining your interests better.
(3) The response you're looking for, namely a list of all the good graduate programs in that area, with as much detailed description of them as you can possibly digest.
What next? Well, let's back up and talk about research.

Getting good grades in your undergraduate classes is important, but it's not really the main thing. The main thing is this: if you want to go to graduate school, you should start getting involved in research as an undergraduate.

Writing a grant proposal may be the single most valuable experience of your project.


Your statement should demonstrate that you know what research is, that you have had at least one idea in your life, and that you have an interesting and tractable idea about your research for the future. The problem, of course, is that you probably have only the sketchiest idea of what your research in graduate school will be about. That doesn't matter. You are not promising to do the research you describe in your statement (although I am told that this is changing in some areas of the hard sciences); you are only spelling out a single plausible scenario, one that fairly reflects your interests. Try to be concrete, but also include a few hedges such as "perhaps" and "these possibilities include". Good writing counts. Project sobriety and maturity. Avoid frivolity, boasting, and self-deprecation. Show that you've read the research literature, but go easy on academic jargon. Minimize adverbs. Eschew the words "interesting" and "important", which say little. Many people start their statements with a paragraph or two of commonplaces; cut this material until you reach a statement that says something non-obvious about the world and your research involvements. Don't talk about your family, your feelings, or your non-professional interests. Don't say anything bad about anyone, including yourself. And make sure that you are not simply describing the year's most fashionable cliche of a research project -- ask for advice about this issue specifically. Put yourself in the shoes of the graduate admissions committee: they're looking at hundreds of applications and they're only going to take a second look at the ones that stand out. If you follow the above advice then your application will make the first cut and receive the serious consideration it deserves.


Meanwhile, apply for fellowships, that is, grants from foundations and other sources that pay your tuition and a small salary so that you can commit yourself full-time to studying. Don't wait until you're accepted somewhere to apply for outside funding! Deadlines typically fall between November and January in the United States and a few months later in many other countries. Ask someone in your department which are the major fellowships in your area and apply for them all. Also, at each university it is usually somebody's job to keep a list, maybe on the Web, of obscure graduate fellowships. It might be called the office of research development. You might also look in the acknowledgements sections of papers written by younger researchers in your field. Find such lists and write away for applications forms for all of the fellowships that seem relevant. Get advice about which ones are worth applying for. When in doubt, apply. Fellowships are good because they give you much more freedom to choose your own research topics. Without a fellowship, you will have to work for someone else as a teaching assistant or research assistant. Assistantships are often perfectly fine, but a fellowship is always better.

One issue that you should definitely be aware of is that people are going to really want to see you have a definite course of research in your statement of purpose. Unless you know what you want to do, pick two or three different topics that you're interested in and write up something short about each of them. Then let them sit for a day or two and see which one you feel best about. Definitely ask a professor to read over them for you if you have someone who would be willing to do so. If you don't feel comfortable asking a professor, ask other people to read them for you. Graduate students you know are a good choice; all of them have been through this process, and they remember how difficult it was.

It almost never hurts to have extra letters, and don't feel bad about asking people for letters; it's part of their job.

http://www.cs.ubc.ca/~rap/crossroads.html

Questions to ask current graduate students:
  1. how they like the department
  2. can they live on their stipend
  3. what is the worst thing about the department
  4. how are the resources (building, computers, etc)
  5. if there is a specific professor who you'd like to work with, find some of her students and ask them how they like working with the faculty member, how many students the professor has, how much interaction they have with her, etc.
  6. how many people who enter the program finish with a Ph.D.
  7. why did the people who don't finish leave
  8. what happens if you decide to leave the program (some places are considering making you pay back all of the tuition if you leave)
  9. are they happy there
  10. how many hours a week they spend at work
  11. what the classes are like
  12. how many classes they have to take, and can you place out of them
  13. if there are no classes, what do you have to do instead
  14. what hurdles (like preliminary exams) do you have to take, and what type are they (oral, written, etc)
  15. anything else that's important to you; for example, if you are female ask the female students how they are treated as females. This is important; don't feel silly for asking. 
1. After a certain point, you cannot make a wrong decision. Chances are good that there is no one perfect place for you to go to, and anywhere that you go will be fine. You're just trying to optimize. This may not make you feel a whole lot better, but keep it in mind; it really is true.

2. No decision that you make will make everyone happy. Someone will think that you've made the wrong decision no matter where you decide to go. Accept that and when the first person expresses that you've made the wrong choice, try not to let it bother you. 

    protein-dna binding

    Discovering protein–DNA binding sequence patterns using association rule mining
    http://nar.oxfordjournals.org/content/early/2010/06/06/nar.gkq500.full.pdf+html

    GOing Bayesian: model-based gene set analysis of genome-scale data
    http://nar.oxfordjournals.org/content/38/11/3523.long

    Mapping the Druggable Allosteric Space of G-Protein Coupled Receptors: a Fragment-Based Molecular Dynamics Approach
    http://onlinelibrary.wiley.com/doi/10.1111/j.1747-0285.2010.01012.x/full

    Identification and Optimization of Classifier Genes from Multi-Class Earthworm Microarray Dataset
    http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0013715

    Thursday, November 11, 2010

    GWAS and health care

    http://www.bioworld.com/servlet/com.accumedia.web.Dispatcher?next=bioWorldHeadlines_article&forceid=53907

    Labs

    http://compbio.bccrc.ca/?page_id=39
    http://molonc.bccrc.ca/?page_id=217
    http://www.chibi.ubc.ca/
    http://www.chibi.ubc.ca/training

    Omics

    http://omics.org/index.php/Degradomics

    The major role of matrix metalloproteinases (MMPs) is for homeostatic regulation of the extracellular environment, not simply to degrade matrix as their name suggests.

    http://www.nature.com/nrm/journal/v3/n7/full/nrm858.html

    Degradomics — the application of genomic and proteomic approaches to identify the protease and protease-substrate repertoires, or 'degradomes', on an organism-wide scale — promises to uncover new roles for proteases in vivo. This knowledge will facilitate the identification of new pharmaceutical targets to treat disease. Here, we review emerging degradomic techniques and concepts.

    Wednesday, November 10, 2010

    GWAS

    Genome-wide association studies for complex traits: consensus, uncertainty and challenges
    http://dx.doi.org/10.1038/nrg2344

    Finding genes underlying human disease
    http://www.ncbi.nlm.nih.gov/pubmed/18783406

    Genomewide Association Studies and Assessment of the Risk of Disease
    http://www.ncbi.nlm.nih.gov/pubmed/20647212
    http://bioinformatics.oxfordjournals.org/content/26/21/2664.full?sid=1f67c073-33fd-40cc-8196-a8ea07ce3e9c

    These results show an increasing proportion of newly determined sequences falling within existing islands, which may indicate an approach to the representative map of the protein universe. If this trend continues, by approximately 2017 at least 80% of new sequences will fall within an existing island (Fig. 4) that is, have a sequence identity >50% with sequences already present in the database.

    Git vs SVN (subversion) version control system (VCS)

    Git
    - Distributed: every user has their own copy of the repository; fast, with no network latency (except for push and fetch/pull) for branch switch, diff, status, commit, merge
    - Better branch handling; every clone is a full repository with its own branches
    - Easily switch branches without creating a separate checkout
    - Takes up less space; only one copy is kept
    - Commits are identified by SHA-1 hashes rather than sequential revision numbers; use a tag for a readable name

    Svn
    - more mature user interfaces, e.g. TortoiseSVN, RapidSVN
    - single repository, know where files are stored
    - access control
    - revision numbers, easy to track



    https://git.wiki.kernel.org/index.php/GitSvnComparison

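    git clone https://github.com/proj/proj my_proj   # copy the remote repository into my_proj
    cd my_proj
    git pull                          # fetch and merge upstream changes
    git add new_file                  # stage a new file
    git commit -m 'Adding new file'   # record the staged changes locally
    git pull                          # merge upstream changes made in the meantime
    git push                          # publish local commits
    git checkout -- revert_file       # discard uncommitted changes to revert_file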

    Tuesday, November 9, 2010

    GWAS

    http://www.genomesunzipped.org/2010/07/how-to-read-a-genome-wide-association-study.php

    The basic GWAS approach is to look at approximately a million positions in the human genome (called ‘SNPs’) where different people carry different versions of the genetic code (so at some particular position I might have an ‘A’ and you might have a ‘C’). I’m going to focus here on the most common GWAS design, called case-control, where the goal is to compare the frequencies of these different versions between a group of healthy individuals (controls) and another group of people with a specific disease (cases). The places where the frequencies between cases and controls are significantly different are therefore associated with risk of developing the disease.
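
For a single SNP, the case-control comparison described above can be sketched as a chi-square test on a 2x2 table of allele counts. The counts are invented and the function name is mine; the p-value uses the identity that a 1-degree-of-freedom chi-square tail equals erfc(sqrt(x/2)). A real GWAS would also handle genotype-level tests, covariates, and quality control.

```python
import math


def allelic_chi_square(case_a, case_c, control_a, control_c):
    """Chi-square test (1 df) on allele counts in cases vs controls."""
    n = case_a + case_c + control_a + control_c
    numerator = n * (case_a * control_c - case_c * control_a) ** 2
    denominator = ((case_a + case_c) * (control_a + control_c)
                   * (case_a + control_a) * (case_c + control_c))
    chi2 = numerator / denominator
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts: 'A' allele at 60% in cases, 50% in controls
chi2, p = allelic_chi_square(600, 400, 500, 500)
print(round(chi2, 2))   # 20.2
print(p < 1e-4)         # True
```

With about a million SNPs tested, a much stricter threshold than 0.05 is needed; 5e-8 is the commonly quoted genome-wide cutoff, which this toy SNP would not reach.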

    Intergenic regions

    An Intergenic region (IGR) is a stretch of DNA sequences located between clusters of genes that contain few or no genes. Occasionally some intergenic DNA acts to control genes close by, but most of it has no currently known function. It is one of the DNA sequences collectively referred to as junk DNA, though it is only one phenomenon labeled such and in scientific studies today, the term is less used. In humans, intergenic regions comprise a large percentage of the genome.
    This could also be where noncoding RNAs are located. Though little is known about them, they are thought to have regulatory functions.
    Intergenic regions are different from intragenic regions (or introns), which are short, non-coding regions that are found within genes, especially within the genes of eukaryotic organisms.
    Scientists have now artificially synthesized proteins from intergenic regions.

    http://en.wikipedia.org/wiki/Intergenic_region

    1000Genomes

    http://www.1000genomes.org/page.php?page=about

    The goal of the 1000 Genomes Project is to find most genetic variants that have frequencies of at least 1% in the populations studied. This goal can be attained by sequencing many individuals lightly. To sequence a person's genome, many copies of the DNA are broken into short pieces and each piece is sequenced. The many copies of DNA mean that the DNA pieces are more-or-less randomly distributed across the genome. The pieces are then aligned to the reference sequence and joined together. To find the complete genomic sequence of one person with current sequencing platforms requires sequencing that person's DNA the equivalent of about 28 times (called 28X). If the amount of sequence done is only an average of once across the genome (1X), then much of the sequence will be missed, because some genomic locations will be covered by several pieces while others will have none. The deeper the sequencing coverage, the more of the genome will be covered at least once. Also, people are diploid; the deeper the sequencing coverage, the more likely that both chromosomes at a location will be included. In addition, deeper coverage is particularly useful for detecting structural variants, and allows sequencing errors to be corrected.
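
The coverage argument can be put into numbers with a simple Poisson model, assuming reads land uniformly at random (a simplification; real coverage is less even). The function name is mine.

```python
import math


def fraction_covered(mean_coverage, min_reads=1):
    """Expected fraction of positions covered by at least min_reads reads,
    assuming per-position depth is Poisson with the given mean."""
    p_fewer = sum(math.exp(-mean_coverage) * mean_coverage ** i / math.factorial(i)
                  for i in range(min_reads))
    return 1.0 - p_fewer


for depth in (1, 4, 28):
    print(depth, round(fraction_covered(depth), 4))
# 1 0.6321
# 4 0.9817
# 28 1.0
```

At 1X about 37% of positions get no read at all under this model, which is the "much of the sequence will be missed" point in the quote.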

    Online LaTeX editor

    http://www.codecogs.com/latex/eqneditor.php

    score = \sum\limits_{i \in \{o,s,it,d\}} w_i \cdot \sum\limits_{j=0}^{n_i}\left(\frac{v_j}{w_j}\right)

    Linkage disequilibrium (LD), linkage study

    When the transmission of genotype at locus A is DEPENDENT on the genotype at another locus B.

    bio.classes.ucsc.edu/bio107/Class%20pdfs/W05_lecture15.pdf
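
The dependence between two loci is usually measured with D and r^2. A short sketch with made-up haplotype frequencies (the function name is mine): D is the difference between the observed AB haplotype frequency and what independence would predict.

```python
def linkage_disequilibrium(p_ab, p_a, p_b):
    """Return (D, r^2) given the AB haplotype frequency and the
    allele frequencies of A and B."""
    d = p_ab - p_a * p_b
    r2 = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, r2


# Hypothetical haplotype counts: AB=40, Ab=10, aB=10, ab=40
d, r2 = linkage_disequilibrium(p_ab=0.4, p_a=0.5, p_b=0.5)
print(round(d, 4), round(r2, 4))   # 0.15 0.36

# Independent loci: the AB frequency equals p_a * p_b, so D is 0
print(linkage_disequilibrium(0.25, 0.5, 0.5))   # (0.0, 0.0)
```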

    Random genetic drift is a stochastic process (by definition). One aspect of genetic drift is the random nature of transmitting alleles from one generation to the next given that only a fraction of all possible zygotes become mature adults. The easiest case to visualize is the one which involves binomial sampling error. If a pair of diploid sexually reproducing parents (such as humans) have only a small number of offspring then not all of the parents' alleles will be passed on to their progeny due to chance assortment of chromosomes at meiosis.


    http://www.talkorigins.org/faqs/genetic-drift.html
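
The binomial sampling point in numbers, as a sketch: each child of a heterozygous parent receives a given allele with probability 1/2, so the chance it reaches none of n children is (1/2)^n. The second function is a Wright-Fisher style resampling step, a standard textbook model that I am assuming here; it is not taken from the linked FAQ.

```python
import random


def prob_allele_lost(n_offspring):
    """Chance that one particular allele of a heterozygous parent
    is passed to none of n offspring."""
    return 0.5 ** n_offspring


def next_generation_frequency(freq, population_size, rng):
    """Draw 2N allele copies at random and return the new frequency."""
    copies = 2 * population_size
    drawn = sum(rng.random() < freq for _ in range(copies))
    return drawn / copies


print(prob_allele_lost(2))   # 0.25
print(prob_allele_lost(5))   # 0.03125

rng = random.Random(1)
freq = 0.5
for generation in range(20):
    freq = next_generation_frequency(freq, population_size=10, rng=rng)
# freq has wandered away from 0.5 by chance alone; small populations drift fastest
```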

    Concordance: agreement in the types of data that occur in natural pairs. For example, in a trait like schizophrenia, a pair of identical twins is concordant if both are affected or both are unaffected; it is discordant if only one of them is affected. Likewise, the pairs might be non-identical twins, or sibs, or husband and wife, etc.


    http://www.mondofacto.com/facts/dictionary?concordance


    www-gene.cimr.cam.ac.uk/clayton/courses/florence05/lectures/linkage-lecture.pdf
    “Genetic linkage analysis is a statistical method that is used to associate functionality of genes to their location on chromosomes.“ 
    http://bioinfo.cs.technion.ac.il/superlink/

    Neighboring genes on the chromosome have a tendency to stick together when passed on to offspring.
    Therefore, if some disease is often passed to offspring along with specific marker genes, then it can be concluded that the gene(s) responsible for the disease are located close to these markers on the chromosome.

    Monday, November 8, 2010

    Feyerabend

    Feyerabend’s “there is no idea that is not capable of improving our knowledge.”

    On ‘remote applicability’: “it is easy to massage a problem in biology, say, until it succumbs to the tricks of our trade -- and is of no use to biologists” - Database metatheory: asking the big queries, Christos Papadimitriou, PODS 95

    It is darkest before the dawn
    http://www.weekdaywisdom.com/mm030705.htm

    "Those whose acquaintance with scientific research is derived chiefly from its practical results easily develop a completely false notion of the mentality of the men who, surrounded by a sceptical world, have shown the way to those like-minded with themselves, scattered through the earth and the centuries." - Albert Einstein, "Religion and Science", published in The World as I See It (1999)

    RIDICULED DISCOVERERS, VINDICATED MAVERICKS, revolutionary science

    http://amasci.com/weird/vindac.html

    Robert L. Folk (existence and importance of nanobacteria)

    Discovered bacteria with diameters far below 200 nm widely present in mineral samples, able to both metabolize metals and to create calcium encrustations. Proposed their large role in creation of "metamorphic" rock and everyday metal corrosion. These ideas were rejected with hostility because the bacterial diameter is too small to include enough genetic material or ribosomes, and they seem immune to common sterilization techniques.

    Galileo (supported the Copernican viewpoint)

    It was not the church authorities who refused to look through his telescope. It was his fellow scientists! They thought that using a telescope was a waste of time, since even if they did see evidence for Galileo's claims, it could only be because Galileo had bewitched them.

    LaTeX Beamer

    http://www.math-linux.com/spip.php?article77

    Basic presentation with Beamer

    Expand Your Professional-Skills Training

    http://sciencecareers.sciencemag.org/career_magazine/previous_issues/articles/2010_10_01/caredit.a1000096

    Saturday, November 6, 2010

    On Negative Results

    "Negativity is to a large extent in the eye of the beholder" - Database metatheory: asking the big queries, Christos Papadimitriou, PODS 95

    Delivering effective presentations

    http://www.methink.com/3-techniques-for-delivering-engaging-presentations/

    1. 10 slides for 20 minutes, with 30-point font
    2. Picture is worth a thousand words - slides are only aids
    3. Tell a story - put the problem in the context of a story that your audience can relate to

    X-Ray for Your Genes: Researcher Takes the Next Step in Personalized Medicine

    http://www.sciencedaily.com/releases/2010/10/101007111459.htm

    Friday, November 5, 2010

    ggplot2

    http://learnr.wordpress.com/2009/08/20/ggplot2-version-of-figures-in-lattice-multivariate-data-visualization-with-r-part-13-2/

    > library("flowViz")
    > data(GvHD, package = "flowCore")
    > pl <- densityplot(Visit ~ `FSC-H` | Patient, data = GvHD)
    > print(pl) 
    # Pie charts of word counts, one panel per journal
    # (assumes `melted`, `myColors` and `fig3b` were created earlier)
    library(ggplot2)
    library(gridExtra)                                           # grid.arrange
    fig3a <- ggplot(melted, aes(x = factor(1), fill = variable)) +  # one stacked bar
                geom_bar(width = 1) +                            # layers
                coord_polar(theta = "y") +                       # pie chart
                facet_wrap(~journal, nrow = 1) +                 # group
                scale_fill_manual(values = myColors) +           # some matching colors
                ggtitle('Journal word frequencies') +
                theme(axis.text.x = element_blank(),             # remove x-axis text
                      axis.title.x = element_blank(),            # remove x-axis title
                      legend.position = "right")                 # legend on the right
    grid.arrange(fig3a, fig3b, nrow = 2, ncol = 1)
    

    Thursday, November 4, 2010

    Highlight syntax color code

    http://www.andre-simon.de/doku/highlight/en/highlight.html

    EST

    to map ESTs and variable reads (multiple fasta-format files) to an already known related prokaryotic genome

    http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2789075/

    http://en.wikipedia.org/wiki/Expressed_sequence_tag

    The current commercially available high-throughput methodologies rely on primers or probes designed to detect each of the current reference miRNA sequences residing in miRBase, which acts as the central repository for known miRNAs (Griffiths-Jones 2006).

    However, probe-based methodologies are generally restricted to the detection and profiling of only the known miRNA sequences previously identified by sequencing or homology searches.

    Sequencing-based applications for identifying and profiling miRNAs have been hindered by laborious cloning techniques and the expense of capillary DNA sequencing (Pfeffer et al. 2005; Cummins et al. 2006).

    In contrast with capillary sequencing, recently available “next-generation” sequencing technologies offer inexpensive increases in throughput, thereby providing a more complete view of the miRNA transcriptome.

    Pluripotent human embryonic stem cells (hESCs) can be cultured under nonadherent conditions that induce them to differentiate into cells belonging to all three germ layers and form cell aggregates termed embryoid bodies (EBs) (Itskovitz-Eldor et al. 2000; Bhattacharya et al. 2004).

    Samples of undifferentiated hESCs and differentiated cells from EBs were chosen for miRNA profiling, first because the pluripotency of ESCs is known to require the presence of miRNAs (Bernstein et al. 2003; Song and Tuan 2006; Wang et al. 2007) and second because specific changes in miRNA expression are thought to accompany differentiation (Chen et al. 2007).

    These reads were mapped to the genome by forcing perfect alignments beginning at the first nucleotide and retaining the longest region of each read that could be aligned to the reference genome, along with all alignment positions. After mapping, a total of 766,199 (hESC) and 724,091 (EB) unique error-free trimmed small RNA sequences were represented by 4,351,479 and 3,886,865 reads.

    Sequences deriving from 334 distinct miRNA genes were identified. The miRNAs were the most abundant class of small RNAs on average, but spanned the entire range of expression, with sequence counts up to ~120,000 (Fig. 1A).

    Virtually no reads aligned to the genome after position 28, so we trimmed all reads at 30 nt to reduce the number of unique sequences.

    For every read, the longest alignment was determined, and this subsequence, as well as the positions for every alignment of this length, was stored in a database (to a maximum of 100 alignments). 

    http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2279248/?tool=pubmed
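
The mapping rule quoted above (perfect match from the first nucleotide, keep the longest alignable prefix and all of its positions, cap at 100, trim reads to 30 nt) can be sketched with plain string search. This is only an illustration on a toy single-strand reference; the paper's actual aligner is not reproduced here and the function name is mine.

```python
def longest_prefix_alignments(read, reference, max_alignments=100, max_length=30):
    """Return (prefix, positions): the longest read prefix that matches the
    reference exactly, and up to max_alignments 0-based start positions."""
    read = read[:max_length]                      # trim long reads first
    for length in range(len(read), 0, -1):        # try longest prefix first
        prefix = read[:length]
        positions = []
        start = reference.find(prefix)
        while start != -1 and len(positions) < max_alignments:
            positions.append(start)
            start = reference.find(prefix, start + 1)
        if positions:
            return prefix, positions
    return "", []


reference = "ACGTTGCAACGTTGGA"   # hypothetical reference
print(longest_prefix_alignments("ACGTTGGTT", reference))  # ('ACGTTGG', [8])
print(longest_prefix_alignments("ACGTTG", reference))     # ('ACGTTG', [0, 8])
```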

    Wednesday, November 3, 2010

    Nanopore sequencing

    http://www.youtube.com/watch?v=HbjAMJehSlg&feature=related

    Perl database interface DBI

    http://www.perl.com/pub/1999/10/DBI.html
    use strict;
    use warnings;
    use DBI;

    # db settings (user, password, table and column names are placeholders)
    my $db   = "mydbname";
    my $host = "127.0.0.1";
    my $port = 3306;
    my ($user, $pass) = ("myuser", "mypass");

    # connect
    my $dsn = "DBI:mysql:database=$db;host=$host;port=$port";
    my $dbh = DBI->connect($dsn, $user, $pass)
                or die "Couldn't connect to database: " . DBI->errstr . "\n";

    # prepare and execute a parameterized query
    my $id  = 1;    # placeholder id to look up
    my $sth = $dbh->prepare('SELECT id, firstname, lastname FROM people WHERE id = ?')
                or die "Couldn't prepare statement: " . $dbh->errstr;
    $sth->execute($id)
                or die "Couldn't execute statement: " . $sth->errstr;

    # Read the matching records and print them out
    while (my @data = $sth->fetchrow_array()) {
        my ($row_id, $firstname, $lastname) = @data;
        print "\t$row_id: $firstname $lastname\n";
    }
    $sth->finish;
    $dbh->disconnect;

    perlconsole - An interactive Perl console like Python's

    Installing the package libterm-readline-gnu-perl should get you readline support. 
     
    $ apt-cache show perlconsole
    ...
    Description: small program that lets you evaluate Perl code interactively
     Perl Console is a light program that lets you evaluate Perl code
     interactively. It uses Readline for grabing input and provides completion
     with all the namespaces loaded during your session.
     .
     This is pretty useful for Perl developers that write modules.
     You can load a module in your session and test a function exported by the
     module.
    
    
    http://www.perlmonks.org/?node_id=816352

    MiPred

    http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1933124/

    In this article, in order to achieve higher performance of distinguishing the real pre-miRNAs from the pseudo ones, a hybrid feature by incorporating the local contiguous structure-sequence composition, the minimum of free energy (MFE) of the secondary structure and the P-value of randomization test was used.

    Genotype calling

    www.cbs.dtu.dk/chipcourse/Lectures/genotype_calling.pdf

    SNP call rate? Plot each SNP's signal for allele A (e.g. A) against its signal for allele B (e.g. C).

    You can either get AA (AA), AB (AC), or BB (CC).
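
A toy version of calling genotypes from the two allele signals, to make "call rate" concrete: compute the allele-B fraction, assign AA / AB / BB when it falls in a clear band, and leave ambiguous samples uncalled. The thresholds, intensities, and function names are hypothetical; real callers cluster the samples rather than using fixed cutoffs.

```python
def call_genotype(intensity_a, intensity_b):
    """Call AA, AB, or BB from allele intensities; None means no call."""
    theta = intensity_b / (intensity_a + intensity_b)   # allele-B fraction
    if theta < 0.2:
        return "AA"
    if 0.4 <= theta <= 0.6:
        return "AB"
    if theta > 0.8:
        return "BB"
    return None


def call_rate(calls):
    """Fraction of samples that received a genotype call."""
    return sum(c is not None for c in calls) / len(calls)


samples = [(950, 50), (500, 480), (40, 900), (700, 300)]   # made-up (A, B) signals
calls = [call_genotype(a, b) for a, b in samples]
print(calls)              # ['AA', 'AB', 'BB', None]
print(call_rate(calls))   # 0.75
```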

    R draw.key positioning

    https://stat.ethz.ch/pipermail/r-help/2009-February/187229.html

    The simplest way to change position is to supply a simple 'vp' argument.

    xyplot(1~1,
           panel = function(...) {
               require(grid)
               panel.xyplot(...)
               draw.key(list(text=list(lab='catch'),
                             lines=list(lwd=c(2)),
                             text=list(lab='landings'),
                             rectangles=list(col=rgb(0.1, 0.1, 0, 0.1))),
                        draw = TRUE,
                        vp = viewport(x = unit(0.75, "npc"), y = unit(0.9, "npc")))
           })

    Tuesday, November 2, 2010

    MiR-107 and MiR-185 Can Induce Cell Cycle Arrest in Human Non Small Cell Lung Cancer Cell Lines


    http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0006677

    Algorithm text books

    • the book "Biological sequence analysis" by Durbin et al. (Cambridge University Press, ISBN-13: 978-0521629713) will serve as our main reference book (the BSA book)
    • if you do not have a strong Biology background, I suggest "Molecular Biology of the Gene" by James Watson et al. (Benjamin Cummings, 6th edition (2007), ISBN-13 978-0805395921) and, to a lesser extent, "Molecular Biology of the Cell" by Bruce Alberts which is also a fine book (Garland, 4th edition (2002), ISBN-13: 978-0815332183) as your reference books. Make sure you are dealing with the latest editions of these books.
    http://www.cs.ubc.ca/~irmtraud/cs_545/

    Analytic Bridge

    Data mining, statistics, quant, operations research, six sigma, econometrics, web analytics, text mining, business intelligence, SAS, biostatistics, machine learning, artificial intelligence, decision sciences, cloud computing, SaaS.

    http://www.analyticbridge.com/

    Monday, November 1, 2010

    Deep Sequencing - sequence coverage

    http://scienceblogs.com/mikethemadbiologist/2010/03/what_is_deep_sequencing.php

    Git basics

    Upload it again: 'git push'
    So, to have a successful work session do the following:
       Sit down at computer
       type 'git pull'
       make changes
       test your changes
       type 'git commit'
       describe your changes
       type 'git pull'
       type 'git push'
       Get up and walk away from the computer

    You can type 'git pull' every minute, if you like.
    You can type 'git commit -a' just after you've made a change and have tested it a little (Any syntax errors? Does it run at all?).
    You can type 'git push' after each commit; pushing has no effect until you have committed.


    ~$ git config --global user.email my@email.com
    ~$ git config --global user.name myusrname