miRNA / mRNA regulation
Talk by Phil Sharp at the 2010 Genome Sciences Centre Forum
- seed region (2-7nt) near the 5' end of the miRNA
- miR-290-295, miR-21, let-7
- fibroblast (extra-cellular matrices, ECM, connective tissue) converted to iPS (Induced pluripotent stem cell) via Oct4, Sox2, Nanog, Tck3 (http://en.wikipedia.org/wiki/Induced_pluripotent_stem_cell)
- Hanahan and Weinberg 2000, The hallmarks of cancer.
- loss in miRNA leads to increase in tumor formation
- there's a threshold when miRNA stops working ...
--------------
Talk by Angie Brooks-Wilson (G3: Genetics, Genomics, Gerontology)
- GWAS
- super seniors, healthy >85 year-olds
- ~20% genetics
- APOE4 - Alzheimer, heart disease (rs429358 SNP)
- BECN1 - lifespan in C. elegans (rs10512488 SNP)
- increase in cytokines -> increase in inflammation, tendency to age?
Just a collection of some random cool stuff. PS. Almost 99% of the contents here are not mine and I don't take credit for them, I reference and copy part of the interesting sections.
Tuesday, November 30, 2010
Monday, November 29, 2010
anisotropic - not the same direction
thus the origin of the word: "an" for not, "iso" for same, and "tropic" from tropism, relating to direction; anisotropic filtering does not filter the same in every direction
http://en.wikipedia.org/wiki/Anisotropic_filtering
Thursday, November 25, 2010
RNAi off-target effects
However, ‘off-target effects’ compromise the specificity of RNAi if sequence identity between siRNA and random mRNA transcripts causes RNAi to knockdown expression of non-targeted genes. The complete off-target effects must be investigated systematically on each gene in a genome by adjusting a group of parameters, which is too expensive to conduct experimentally and motivates a study in silico.
http://nar.oxfordjournals.org/content/33/6/1834
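The in silico screen described in the quote can be pictured as a search for transcripts that contain a near-identical copy of the siRNA target site. Below is a minimal sketch in Python, assuming off-targets are judged only by the number of mismatches over the full siRNA length. The sequences, gene names and mismatch threshold are made up for illustration, and the parameters studied in the paper are not reproduced here.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def off_target_hits(sirna, transcripts, max_mismatches=2):
    """Return (gene, position, mismatches) for every near-identical site.

    sirna: target-sense sequence of the siRNA (hypothetical input).
    transcripts: dict mapping a gene name to its mRNA sequence.
    """
    hits = []
    n = len(sirna)
    for gene, seq in transcripts.items():
        # Slide a window of siRNA length along the transcript.
        for pos in range(len(seq) - n + 1):
            mm = hamming(sirna, seq[pos:pos + n])
            if mm <= max_mismatches:
                hits.append((gene, pos, mm))
    return hits


# Toy data, not real transcripts.
transcripts = {
    "intended": "AAGCUGACCUGAAGGCUCAUU",
    "bystander": "GGGCUGACCUGAAGCCUCAAA",
    "unrelated": "UUUUUUUUUUUUUUUUUUUUU",
}
sirna = "GCUGACCUGAAGGCUCA"
hits = off_target_hits(sirna, transcripts, max_mismatches=2)
for hit in hits:
    print(hit)
```

Any hit on a gene other than the intended one is a candidate off-target. A real screen would run this against every transcript in the genome, which is why the brute-force scan above would need indexing to scale.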
Wednesday, November 24, 2010
w3m - a text based Web browser and pager
$ w3m http://localhost:8080
Press 'Insert' to see the menu, Enter on a hyperlink
Sunday, November 21, 2010
Linear Algebra - Eigenvector, eigenvalue
Some vectors keep their direction, or have it exactly reversed, when a matrix acts on them. These vectors are the eigenvectors of the matrix. A matrix acts on an eigenvector by multiplying its magnitude by a factor, which is positive if its direction is unchanged and negative if its direction is reversed. This factor is the eigenvalue associated with that eigenvector.
http://en.wikipedia.org/wiki/Eigenvalue,_eigenvector_and_eigenspace
http://www.mathworks.com/products/statistics/demos.html?file=/products/demos/shipping/stats/cmdscaledemo.html
R's or Matlab's cmdscale(D)
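A small worked example of the eigenvector definition above, in plain Python. The 2x2 matrix is an arbitrary choice for illustration: it has one positive eigenvalue (direction kept) and one negative eigenvalue (direction reversed).

```python
def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector (list of numbers)."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def scale_factor(m, v, tol=1e-12):
    """Return the factor lam if m*v == lam*v, otherwise None."""
    w = matvec(m, v)
    lam = None
    # Guess the factor from the first non-zero component of v.
    for wi, vi in zip(w, v):
        if vi != 0:
            lam = wi / vi
            break
    if lam is None:
        return None  # the zero vector is not an eigenvector
    if all(abs(wi - lam * vi) < tol for wi, vi in zip(w, v)):
        return lam
    return None


A = [[1.0, 2.0],
     [2.0, 1.0]]

print(scale_factor(A, [1.0, 1.0]))   # 3.0: stretched, same direction
print(scale_factor(A, [1.0, -1.0]))  # -1.0: same length, direction reversed
print(scale_factor(A, [1.0, 0.0]))   # None: direction changes, not an eigenvector
```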
Saturday, November 20, 2010
binding surface, hiv-1, h1n1 influenza, text-mining
Identification of protein binding surfaces using surface triplet propensities.
http://www.ncbi.nlm.nih.gov/pubmed/20819959
Computational Models of HIV-1 Resistance to Gene Therapy Elucidate Therapy Design Principles
http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1000883
Low-dimensional clustering detects incipient dominant influenza strain clusters
http://peds.oxfordjournals.org/content/23/12/935.full
EnvMine: A text-mining system for the automatic extraction of contextual information
http://www.biomedcentral.com/1471-2105/11/294
Thursday, November 18, 2010
SVM, bagging, boosting, normalization
Sequential minimal optimization (SMO), a fast algorithm for
training SVM [26,27], was used to build MC-SVM kernel
function models, as implemented in WEKA.
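Bagging vs. Boosting (Freund and Schapire 1996). Bagging (bootstrap resampling, models trained independently and combined) vs Boosting (iterative reweighting of misclassified examples). -- both are ensemble methods: bagging mainly reduces variance, boosting mainly reduces bias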
The development of PIPA: an integrated and automated pipeline for genome-wide protein function annotation
-- And with microarrays, it seems that the results are largely dependent on the data itself, not so much on the algorithms / classifiers used (so pick and choose which ones, and you might squeeze in a little performance above state of the art).
Nat Genet. 2002 Dec;32 Suppl:496-501.
Microarray data normalization and transformation.
Quackenbush J.
http://www.nature.com/ng/journal/v32/n4s/full/ng1032.html
The goal of most microarray experiments is to survey patterns of gene expression by assaying the expression levels of thousands to tens of thousands of genes in a single assay.
The hypothesis underlying microarray analysis is that the measured intensities for each arrayed gene represent its relative expression level. Biologically relevant patterns of expression are typically identified by comparing measured expression levels between different states on a gene-by-gene basis. But before the levels can be compared appropriately, a number of transformations must be carried out on the data to eliminate questionable or low-quality measurements, to adjust the measured intensities to facilitate comparisons, and to select genes that are significantly differentially expressed between classes of samples.
Using this approach, a normalization factor is calculated by summing the measured intensities in both channels
Locally weighted linear regression (lowess) [6] analysis has been proposed [4, 5] as a normalization method that can remove such intensity-dependent effects in the log2(ratio) values.
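The total-intensity idea quoted above (a factor computed from the summed intensities of both channels) can be sketched in a few lines of Python. This is my reading of the approach, assuming the factor is the ratio of the two channel sums and that it is applied to one channel so that both sums become equal. The intensities are invented, and lowess normalization is not shown.

```python
import math


def total_intensity_normalize(red, green):
    """Scale the green channel so both channels have the same total intensity.

    Returns the normalization factor, the rescaled green values and the
    log2(red/green) ratios after normalization.
    """
    n_total = sum(red) / sum(green)
    scaled_green = [g * n_total for g in green]
    log_ratios = [math.log2(r / g) for r, g in zip(red, scaled_green)]
    return n_total, scaled_green, log_ratios


# Made-up spot intensities for four genes.
red = [200.0, 400.0, 800.0, 600.0]
green = [100.0, 100.0, 400.0, 400.0]

n_total, scaled_green, log_ratios = total_intensity_normalize(red, green)
print(n_total)
print(log_ratios)
```

After scaling, a log2 ratio of 0 means no change between the two samples, so genes are compared on a common footing even if one dye was brighter overall.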
Wednesday, November 17, 2010
Proteomics
Human Proteome Project (HPP)
Human Proteome Organisation (HUPO)
http://www.hupo.org/research/default.asp
http://en.wikipedia.org/wiki/Proteomics
Investigating the correspondence between transcriptomic and proteomic expression profiles using coupled cluster models
http://bioinformatics.oxfordjournals.org/content/24/24/2894
Chris Overall
http://www.clip.ubc.ca/personnel/alumni.html
Leonard Foster
http://www.chibi.ubc.ca/faculty/foster
Tools
PeptideProphet http://peptideprophet.sourceforge.net/
ProteinProphet
Sequence Logo iceLogo
Mascot (Matrix Science)
X! Tandem http://www.thegpm.org/tandem/
MSQuant is a tool for quantitative proteomics/mass spectrometry and processes spectra and LC runs to find quantitative information about proteins and peptides.
MSQuant http://msquant.sourceforge.net/
Tuesday, November 16, 2010
Research quote
In research you really have to love and be committed to your work because things have more of a chance of going wrong than right. But when things go right, there is nothing more exciting.
- Dr. Michael Smith
Google Refine
http://code.google.com/p/google-refine/
Google Refine is a power tool for working with messy data, cleaning it up, transforming it from one format into another, extending it with web services, and linking it to databases like Freebase.
Friday, November 12, 2010
Grad school
Advice for Undergraduates Considering Graduate School Phil Agre
http://polaris.gseis.ucla.edu/pagre/grad-school.html
Graduate school is training in research. It is for people who love research, scholarship, and teaching for their own sake and for the difference they can sometimes make in the world. It is not for people who simply want more undergraduate courses. It is not for people who are in a hurry to get a real job. The eventual goal of many doctoral students is to get a job as a college professor, or perhaps in industrial or government research. Some in technical subjects go on to start companies. But many just do it because they like it.
The best part of graduate school, the part that makes it worthwhile, comes toward the end, when you begin to present your research in public. Suddenly you will begin to join the community of scholars who work in your chosen area; they will take you seriously and you will begin to make numerous professional acquaintances, some of whom you will probably keep for the rest of your life. (I've written another article, similar to this one, about this process of professional networking. It's online at http://dlis.gseis.ucla.edu/pagre/network.html .)
In graduate school, though, your personal identity will almost certainly undergo great change. In particular, you will acquire a particular sort of professional identity: you will become known as the person who wrote such-and-such a paper, who did such-and-such research, who refuted such-and-such theory, or who initiated such-and-such line of inquiry. This process can be tremendously satisfying. But it's not for everyone.
"Hello. I'd like to ask your advice. I am thinking I might want to go to graduate school, but I'm still uncertain about where I would go or what exactly I would study. I do know that I'm pretty interested in such-and-such. How would I find out about graduate schools in that area?" Some common responses to this are as follows:
(1) "I don't actually know much about that area, but you should talk to so-and-so who is really the expert on that." Go talk to so-and-so.
(2) "I think you're going to have to define your interests a little better before I can help you." Ask for help in defining your interests better.
(3) The response you're looking for, namely a list of all the good graduate programs in that area, with as much detailed description of them as you can possibly digest.
What next? Well, let's back up and talk about research.
Getting good grades in your undergraduate classes is important, but it's not really the main thing. The main thing is this: if you want to go to graduate school, you should start getting involved in research as an undergraduate.
Writing a grant proposal may be the single most valuable experience of your project.
Your statement should demonstrate that you know what research is, that you have had at least one idea in your life, and that you have an interesting and tractable idea about your research for the future. The problem, of course, is that you probably have only the sketchiest idea of what your research in graduate school will be about. That doesn't matter. You are not promising to do the research you describe in your statement (although I am told that this is changing in some areas of the hard sciences); you are only spelling out a single plausible scenario, one that fairly reflects your interests. Try to be concrete, but also include a few hedges such as "perhaps" and "these possibilities include". Good writing counts. Project sobriety and maturity. Avoid frivolity, boasting, and self-deprecation. Show that you've read the research literature, but go easy on academic jargon. Minimize adverbs. Eschew the words "interesting" and "important", which say little. Many people start their statements with a paragraph or two of commonplaces; cut this material until you reach a statement that says something non-obvious about the world and your research involvements. Don't talk about your family, your feelings, or your non-professional interests. Don't say anything bad about anyone, including yourself. And make sure that you are not simply describing the year's most fashionable cliche of a research project -- ask for advice about this issue specifically. Put yourself in the shoes of the graduate admissions committee: they're looking at hundreds of applications and they're only going to take a second look at the ones that stand out. If you follow the above advice then your application will make the first cut and receive the serious consideration it deserves.
Meanwhile, apply for fellowships, that is, grants from foundations and other sources that pay your tuition and a small salary so that you can commit yourself full-time to studying. Don't wait until you're accepted somewhere to apply for outside funding! Deadlines typically fall between November and January in the United States and a few months later in many other countries. Ask someone in your department which are the major fellowships in your area and apply for them all. Also, at each university it is usually somebody's job to keep a list, maybe on the Web, of obscure graduate fellowships. It might be called the office of research development. You might also look in the acknowledgements sections of papers written by younger researchers in your field. Find such lists and write away for applications forms for all of the fellowships that seem relevant. Get advice about which ones are worth applying for. When in doubt, apply. Fellowships are good because they give you much more freedom to choose your own research topics. Without a fellowship, you will have to work for someone else as a teaching assistant or research assistant. Assistantships are often perfectly fine, but a fellowship is always better.
One issue that you should definitely be aware of is that people are going to really want to see you have a definite course of research in your statement of purpose. Unless you know what you want to do, pick two or three different topics that you're interested in and write up something short about each of them. Then let them sit for a day or two and see which one you feel best about. Definitely ask a professor to read over them for you if you have someone who would be willing to do so. If you don't feel comfortable asking a professor, ask other people to read them for you. Graduate students you know are a good choice; all of them have been through this process, and they remember how difficult it was.
It almost never hurts to have extra letters, and don't feel bad about asking people for letters; it's part of their job.
http://www.cs.ubc.ca/~rap/crossroads.html
- how they like the department
- can they live on their stipend
- what is the worst thing about the department
- how are the resources (building, computers, etc)
- if there is a specific professor who you'd like to work with, find some of her students and ask them how they like working with the faculty member, how many students the professor has, how much interaction they have with her, etc.
- how many people who enter the program finish with a Ph.D.
- why did the people who don't finish leave
- what happens if you decide to leave the program (some places are considering making you pay back all of the tuition if you leave)
- are they happy there
- how many hours a week they spend at work
- what the classes are like
- how many classes they have to take, and can you place out of them
- if there are no classes, what do you have to do instead
- what hurdles (like preliminary exams) do you have to take, and what type are they (oral, written, etc)
- anything else that's important to you; for example, if you are female ask the female students how they are treated as females. This is important; don't feel silly for asking.
2. No decision that you make will make everyone happy. Someone will think that you've made the wrong decision no matter where you decide to go. Accept that and when the first person expresses that you've made the wrong choice, try not to let it bother you.
protein-dna binding
Discovering protein–DNA binding sequence patterns using association rule mining
http://nar.oxfordjournals.org/content/early/2010/06/06/nar.gkq500.full.pdf+html
GOing Bayesian: model-based gene set analysis of genome-scale data
http://nar.oxfordjournals.org/content/38/11/3523.long
Mapping the Druggable Allosteric Space of G-Protein Coupled Receptors: a Fragment-Based Molecular Dynamics Approach
http://onlinelibrary.wiley.com/doi/10.1111/j.1747-0285.2010.01012.x/full
Identification and Optimization of Classifier Genes from Multi-Class Earthworm Microarray Dataset
http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0013715
Thursday, November 11, 2010
Labs
http://compbio.bccrc.ca/?page_id=39
http://molonc.bccrc.ca/?page_id=217
http://www.chibi.ubc.ca/
http://www.chibi.ubc.ca/training
Omics
http://omics.org/index.php/Degradomics
The major role of matrix metalloproteinases (MMPs) is for homeostatic regulation of the extracellular environment, not simply to degrade matrix as their name suggests.
http://www.nature.com/nrm/journal/v3/n7/full/nrm858.html
Degradomics — the application of genomic and proteomic approaches to identify the protease and protease-substrate repertoires, or 'degradomes', on an organism-wide scale — promises to uncover new roles for proteases in vivo. This knowledge will facilitate the identification of new pharmaceutical targets to treat disease. Here, we review emerging degradomic techniques and concepts.
Wednesday, November 10, 2010
GWAS
Genome-wide association studies for complex traits: consensus, uncertainty and challenges
http://dx.doi.org/10.1038/nrg2344
Finding genes underlying human disease
http://www.ncbi.nlm.nih.gov/pubmed/18783406
Genomewide Association Studies and Assessment of the Risk of Disease
http://www.ncbi.nlm.nih.gov/pubmed/20647212
http://bioinformatics.oxfordjournals.org/content/26/21/2664.full?sid=1f67c073-33fd-40cc-8196-a8ea07ce3e9c
These results show an increasing proportion of newly determined sequences falling within existing islands, which may indicate an approach to the representative map of the protein universe. If this trend continues, by approximately 2017 at least 80% of new sequences will fall within an existing island (Fig. 4) that is, have a sequence identity >50% with sequences already present in the database.
Git vs SVN (subversion) version control system (VCS)
Git
- Distributed, users have their own copy, fast - no network latency (except for push and fetch/pull) for branch switch, diff, status, commit, merge
- Better branch handling, every working directory is a branch
- Easily switch branches without creating a separate checkout
- Takes up less space, only one copy is kept
- Commits are identified by SHA-1 hashes rather than revision numbers; use a tag for a readable name
Svn
- more mature user interfaces, e.g. Tortoise, RapidSVN
- single repository, know where files are stored
- access control
- revision numbers, easy to track
https://git.wiki.kernel.org/index.php/GitSvnComparison
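git clone https://github.com/proj/proj my_proj   # copy the repository into my_proj
cd my_proj
git pull                          # fetch and merge upstream changes
git add new_file                  # stage a new file
git commit -m 'Adding new file'   # record the staged change locally
git pull                          # merge upstream changes before pushing
git push                          # publish local commits
git checkout -- revert_file       # discard uncommitted changes to revert_file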
Tuesday, November 9, 2010
GWAS
http://www.genomesunzipped.org/2010/07/how-to-read-a-genome-wide-association-study.php
The basic GWAS approach is to look at approximately a million positions in the human genome (called ‘SNPs’) where different people carry different versions of the genetic code (so at some particular position I might have an ‘A’ and you might have a ‘C’). I’m going to focus here on the most common GWAS design, called case-control, where the goal is to compare the frequencies of these different versions between a group of healthy individuals (controls) and another group of people with a specific disease (cases). The places where the frequencies between cases and controls are significantly different are therefore associated with risk of developing the disease.
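For a single SNP, the case-control comparison described above comes down to a 2x2 table of allele counts. Here is a minimal sketch in Python, assuming a plain chi-square test with one degree of freedom on allele counts. The counts are invented, and real studies add quality control and population-structure corrections that are not shown.

```python
import math


def allele_chi_square(case_counts, control_counts):
    """Chi-square test on a 2x2 table of allele counts.

    case_counts, control_counts: (count of allele 1, count of allele 2).
    Assumes both alleles and both groups have non-zero totals.
    Returns (chi2, p_value) for one degree of freedom.
    """
    table = [list(case_counts), list(control_counts)]
    row_totals = [sum(row) for row in table]
    col_totals = [table[0][j] + table[1][j] for j in range(2)]
    total = sum(row_totals)
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected
    # Survival function of the chi-square distribution with 1 df.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Invented counts: allele 'A' vs allele 'C' in cases and controls.
chi2, p_value = allele_chi_square((60, 40), (40, 60))
print(chi2, p_value)
```

With about a million SNPs tested, a per-SNP p-value has to be far smaller than 0.05 to count as significant; a threshold around 5e-8 is commonly used for this reason.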
Intergenic regions
An Intergenic region (IGR) is a stretch of DNA sequences located between clusters of genes that contain few or no genes. Occasionally some intergenic DNA acts to control genes close by, but most of it has no currently known function. It is one of the DNA sequences collectively referred to as junk DNA, though it is only one phenomenon labeled such and in scientific studies today, the term is less used. In humans, intergenic regions comprise a large percentage of the genome.
This could also be where noncoding RNAs are located. Though little is known about them, they are thought to have regulatory functions.
Intergenic regions are different from intragenic regions (or introns), which are short, non-coding regions that are found within genes, especially within the genes of eukaryotic organisms.
Scientists have now artificially synthesized proteins from intergenic regions. [1]
http://en.wikipedia.org/wiki/Intergenic_region
1000Genomes
http://www.1000genomes.org/page.php?page=about
The goal of the 1000 Genomes Project is to find most genetic variants that have frequencies of at least 1% in the populations studied. This goal can be attained by sequencing many individuals lightly. To sequence a person's genome, many copies of the DNA are broken into short pieces and each piece is sequenced. The many copies of DNA mean that the DNA pieces are more-or-less randomly distributed across the genome. The pieces are then aligned to the reference sequence and joined together. To find the complete genomic sequence of one person with current sequencing platforms requires sequencing that person's DNA the equivalent of about 28 times (called 28X). If the amount of sequence done is only an average of once across the genome (1X), then much of the sequence will be missed, because some genomic locations will be covered by several pieces while others will have none. The deeper the sequencing coverage, the more of the genome will be covered at least once. Also, people are diploid; the deeper the sequencing coverage, the more likely that both chromosomes at a location will be included. In addition, deeper coverage is particularly useful for detecting structural variants, and allows sequencing errors to be corrected.
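The remark that 1X coverage misses much of the genome can be made concrete. Here is a rough sketch in Python, assuming reads land uniformly at random so that the number of reads covering a position follows a Poisson distribution with mean equal to the average depth. That model is my assumption; the text only says the pieces are "more-or-less randomly distributed".

```python
import math


def fraction_covered(depth, min_reads=1):
    """Expected fraction of positions covered by at least min_reads reads.

    depth: average sequencing depth (e.g. 1.0 for 1X).
    Uses a Poisson model for per-position read counts (an assumption).
    """
    p_fewer = sum(
        math.exp(-depth) * depth ** k / math.factorial(k)
        for k in range(min_reads)
    )
    return 1.0 - p_fewer


for depth in (1, 2, 4, 28):
    print(depth, round(fraction_covered(depth), 4))
```

Under this model about 37% of positions receive no read at 1X, while at 28X essentially every position is covered. Raising `min_reads` gives a feel for how much depth is needed to see a position several times, which matters for sampling both chromosomes and for correcting sequencing errors.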
Online LaTeX editor
http://www.codecogs.com/latex/eqneditor.php
score = \sum\limits_{i=0}^{o,s,it,d}w_i \cdot \sum\limits_{j=0}^{n_i}(\frac{v_j}{w_j})
Linkage disequilibrium (LD), linkage study
When the transmission of genotype at locus A is DEPENDENT on the genotype at another locus B.
bio.classes.ucsc.edu/bio107/Class%20pdfs/W05_lecture15.pdf
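The dependence between two loci is usually quantified with the coefficient D = p_AB - p_A * p_B and the normalized r^2 = D^2 / (p_A (1 - p_A) p_B (1 - p_B)). A small Python sketch, assuming two biallelic loci and known haplotype frequencies. The frequencies are invented.

```python
def linkage_disequilibrium(hap_freqs):
    """Return (D, r_squared) for two biallelic loci.

    hap_freqs: dict with haplotype frequencies for 'AB', 'Ab', 'aB', 'ab'
    that sum to 1. Assumes both loci are polymorphic.
    """
    p_a = hap_freqs["AB"] + hap_freqs["Ab"]   # frequency of allele A
    p_b = hap_freqs["AB"] + hap_freqs["aB"]   # frequency of allele B
    d = hap_freqs["AB"] - p_a * p_b
    r_squared = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, r_squared


# Independent loci: every haplotype occurs at the product of allele frequencies.
print(linkage_disequilibrium({"AB": 0.25, "Ab": 0.25, "aB": 0.25, "ab": 0.25}))

# Complete LD: allele A is only ever seen together with allele B.
print(linkage_disequilibrium({"AB": 0.5, "Ab": 0.0, "aB": 0.0, "ab": 0.5}))
```

D is zero when the loci are independent, and r^2 reaches 1 when knowing the allele at one locus fully determines the allele at the other.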
Random genetic drift is a stochastic process (by definition). One aspect of genetic drift is the random nature of transmitting alleles from one generation to the next given that only a fraction of all possible zygotes become mature adults. The easiest case to visualize is the one which involves binomial sampling error. If a pair of diploid sexually reproducing parents (such as humans) have only a small number of offspring then not all of the parent's alleles will be passed on to their progeny due to chance assortment of chromosomes at meiosis.
http://www.talkorigins.org/faqs/genetic-drift.html
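The binomial sampling case from the quote can be worked out directly. A minimal Python sketch, assuming a heterozygous parent passes a given allele to each offspring independently with probability 1/2. The function names are mine.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


def prob_allele_not_transmitted(n_offspring):
    """Chance that one particular allele of a heterozygous parent
    reaches none of the n offspring."""
    return binomial_pmf(0, n_offspring, 0.5)


for n in (1, 2, 4, 10):
    print(n, prob_allele_not_transmitted(n))
```

With two offspring a given allele is lost a quarter of the time purely by chance, which is the sampling error that drives drift in small populations.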
Concordance: agreement in the types of data that occur in natural pairs. For example, in a trait like schizophrenia, a pair of identical twins is concordant if both are affected or both are unaffected; it is discordant if only one of them is affected. Likewise, the pairs might be non-identical twins, or sibs, or husband and wife, etc.
http://www.mondofacto.com/facts/dictionary?concordance
www-gene.cimr.cam.ac.uk/clayton/courses/florence05/lectures/linkage-lecture.pdf
“Genetic linkage analysis is a statistical method that is used to associate functionality of genes to their location on chromosomes.“
http://bioinfo.cs.technion.ac.il/superlink/
Neighboring genes on the chromosome have a tendency to stick together when passed on to offspring.
Therefore, if some disease is often passed to offspring along with specific marker genes, then it can be concluded that the gene(s) responsible for the disease are located close to these markers on the chromosome.
Monday, November 8, 2010
Feyerabend
Feyerabend’s “there is no idea that is not capable of improving our knowledge.”
‘remote applicability’, “it is easy to massage a problem in biology, say, until it succumbs to the tricks of our trade -- and is of no use to biologists”. Database metatheory: asking the big queries, Christos Papadimitriou, PODS 95.
It is darkest before the dawn
http://www.weekdaywisdom.com/mm030705.htm
"Those whose acquaintance with scientific research is derived chiefly from its practical results easily develop a completely false notion of the mentality of the men who, surrounded by a sceptical world, have shown the way to those like-minded with themselves, scattered through the earth and the centuries." Albert Einstein, "Religion and Science"; this excerpt was published in The World as I See It (1999).
RIDICULED DISCOVERERS, VINDICATED MAVERICKS, revolutionary science
http://amasci.com/weird/vindac.html
Robert L. Folk (existence and importance of nanobacteria)
Discovered bacteria with diameters far below 200 nm widely present in mineral samples, able to both metabolize metals and to create calcium encrustations. Proposed their large role in creation of "metamorphic" rock and everyday metal corrosion. These ideas were rejected with hostility because the bacterial diameter is too small to include enough genetic material or ribosomes, and they seem immune to common sterilization techniques.
Galileo (supported the Copernican viewpoint)
It was not the church authorities who refused to look through his telescope. It was his fellow scientists! They thought that using a telescope was a waste of time, since even if they did see evidence for Galileo's claims, it could only be because Galileo had bewitched them.
Saturday, November 6, 2010
On Negative Results
"Negativity is to a large extent in the eye of the beholder" - Database metatheory: asking the big queries, Christos Papadimitriou, PODS 95
Delivering effective presentations
http://www.methink.com/3-techniques-for-delivering-engaging-presentations/
1. 10 slides, 20 minutes, 30-point font
2. A picture is worth a thousand words - slides are only aids
3. Tell a story - put the problem in the context of a story that your audience can relate to
Friday, November 5, 2010
ggplot2
http://learnr.wordpress.com/2009/08/20/ggplot2-version-of-figures-in-lattice-multivariate-data-visualization-with-r-part-13-2/
> library("flowViz")
> data(GvHD, package = "flowCore")
> pl <- densityplot(Visit ~ `FSC-H` | Patient, data = GvHD)
> print(pl)
# Pie charts of words grouped by journal
# (opts()/theme_blank() are the 2010-era ggplot2 API; newer versions use theme()/element_blank())
library(ggplot2)
library(gridExtra)  # provides grid.arrange()
fig3a <- ggplot(melted, aes(x = factor(1), fill = variable)) +  # x and y
  geom_bar(width = 1) +                     # layers
  coord_polar(theta = "y") +                # pie chart
  facet_wrap(~journal, nrow = 1) +          # group
  scale_fill_manual(values = myColors) +    # some matching colors
  opts(axis.text.x = theme_blank(),         # remove x-label
       axis.title.x = theme_blank(),
       title = 'Journal word frequencies') +
  opts(legend.position = "right")           # legend on the right
# fig3b is a second plot built the same way (not shown here)
grid.arrange(fig3a, fig3b, nrow = 2, ncol = 1)
Thursday, November 4, 2010
EST
to map ESTs and variable reads (multiple fasta-format files) to an already known related prokaryotic genome
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2789075/
http://en.wikipedia.org/wiki/Expressed_sequence_tag
The current commercially available high-throughput methodologies rely on primers or probes designed to detect each of the current reference miRNA sequences residing in miRBase, which acts as the central repository for known miRNAs (Griffiths-Jones 2006).
However, probe-based methodologies are generally restricted to the detection and profiling of only the known miRNA sequences previously identified by sequencing or homology searches.
Sequencing-based applications for identifying and profiling miRNAs have been hindered by laborious cloning techniques and the expense of capillary DNA sequencing (Pfeffer et al. 2005; Cummins et al. 2006).
In contrast with capillary sequencing, recently available “next-generation” sequencing technologies offer inexpensive increases in throughput, thereby providing a more complete view of the miRNA transcriptome.
Pluripotent human embryonic stem cells (hESCs) can be cultured under nonadherent conditions that induce them to differentiate into cells belonging to all three germ layers and form cell aggregates termed embryoid bodies (EBs) (Itskovitz-Eldor et al. 2000; Bhattacharya et al. 2004).
Samples of undifferentiated hESCs and differentiated cells from EBs were chosen for miRNA profiling, first because the pluripotency of ESCs is known to require the presence of miRNAs (Bernstein et al. 2003; Song and Tuan 2006; Wang et al. 2007) and second because specific changes in miRNA expression are thought to accompany differentiation (Chen et al. 2007).
These reads were mapped to the genome by forcing perfect alignments beginning at the first nucleotide and retaining the longest region of each read that could be aligned to the reference genome, along with all alignment positions. After mapping, a total of 766,199 (hESC) and 724,091 (EB) unique error-free trimmed small RNA sequences were represented by 4,351,479 and 3,886,865 reads.
Sequences deriving from 334 distinct miRNA genes were identified. The miRNAs were the most abundant class of small RNAs on average, but spanned the entire range of expression, with sequence counts up to ~120,000 (Fig. 1A).
Virtually no reads aligned to the genome after position 28, so we trimmed all reads at 30 nt to reduce the number of unique sequences.
For every read, the longest alignment was determined, and this subsequence, as well as the positions for every alignment of this length, was stored in a database (to a maximum of 100 alignments).
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2279248/?tool=pubmed
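The mapping rule quoted above (perfect match, anchored at the first nucleotide of the read, keep the longest alignable region plus all of its positions, capped at 100) can be sketched naively as follows. This is not the paper's implementation: it scans a plain string, looks at the forward strand only, and the function name is made up. The 30 nt trim and the cap of 100 come from the quoted text.

```python
def longest_prefix_hits(read, genome, max_hits=100, max_len=30):
    """Find the longest prefix of `read` that occurs exactly in `genome`.

    Returns (prefix, positions): the matched prefix and up to `max_hits`
    0-based start positions. Returns ("", []) if not even the first
    nucleotide occurs in the genome.
    """
    read = read[:max_len]                      # trim reads to max_len nt
    for length in range(len(read), 0, -1):     # try the longest prefix first
        prefix = read[:length]
        positions = []
        start = genome.find(prefix)
        while start != -1 and len(positions) < max_hits:
            positions.append(start)
            start = genome.find(prefix, start + 1)   # allow overlapping hits
        if positions:
            return prefix, positions
    return "", []


# Toy example (hypothetical sequences).
genome = "ACGTACGTTTGACGTAC"
prefix, positions = longest_prefix_hits("ACGTACA", genome)
print(prefix, positions)
```

A real pipeline would use an index instead of repeated string scans; the loop is only meant to make the "longest perfect prefix, all positions" rule concrete.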
Wednesday, November 3, 2010
Perl database interface DBI
http://www.perl.com/pub/1999/10/DBI.html
use strict;
use warnings;
use DBI;
# db settings
my $db = "mydbname";
my $host = "127.0.0.1";
my $port = 3306;
my $user = "myuser";   # placeholder credentials
my $pass = "mypass";
# connect
my $dsn = "DBI:mysql:database=$db;host=$host;port=$port";
my $dbh = DBI->connect($dsn, $user, $pass)
    or die "Couldn't connect to database: " . DBI->errstr . "\n";
# Prepare and execute a parameterised query
my $id = 1;   # example id (placeholder)
my $sth = $dbh->prepare('SELECT age FROM people WHERE id = ?')
    or die "Couldn't prepare statement: " . $dbh->errstr;
$sth->execute($id)
    or die "Couldn't execute statement: " . $sth->errstr;
# Read the matching records and print them out
while (my @data = $sth->fetchrow_array()) {
    my $age = $data[0];   # the only selected column
    print "\t$id: $age\n";
}
$sth->finish;
$dbh->disconnect;
perlconsole - An interactive Perl console like Python's
Installing the package libterm-readline-gnu-perl should get you readline support.
$ apt-cache show perlconsole
...
Description: small program that lets you evaluate Perl code interactively
Perl Console is a light program that lets you evaluate Perl code
interactively. It uses Readline for grabing input and provides completion
with all the namespaces loaded during your session.
.
This is pretty useful for Perl developers that write modules.
You can load a module in your session and test a function exported by the
module.
http://www.perlmonks.org/?node_id=816352
MiPred
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1933124/
In this article, in order to achieve higher performance of distinguishing the real pre-miRNAs from the pseudo ones, a hybrid feature by incorporating the local contiguous structure-sequence composition, the minimum of free energy (MFE) of the secondary structure and the P-value of randomization test was used.
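A rough sketch of what a "local contiguous structure-sequence" feature can look like. It assumes the triplet encoding used in earlier pre-miRNA classifiers: for each position, the middle nucleotide is combined with the paired/unpaired state of the three adjacent positions, giving 4 x 8 = 32 features. Whether MiPred normalises the counts exactly this way is an assumption, the dot-bracket structure is assumed to come from an external folding program, and the function name is made up. The MFE and P-value features are not covered here.

```python
from collections import Counter
from itertools import product


def triplet_features(seq, structure):
    """Return a dict of 32 triplet frequencies.

    seq:       RNA sequence over A, C, G, U
    structure: dot-bracket string of the same length ("(" / ")" paired,
               "." unpaired), e.g. predicted by a folding program
    Keys look like "A(.(": middle nucleotide + states of 3 adjacent positions.
    """
    if len(seq) != len(structure):
        raise ValueError("sequence and structure must have the same length")
    states = structure.replace(")", "(")       # only paired vs unpaired matters
    keys = [nt + "".join(s) for nt in "ACGU" for s in product("(.", repeat=3)]
    valid = set(keys)
    counts = Counter()
    for i in range(1, len(seq) - 1):
        key = seq[i].upper() + states[i - 1:i + 2]
        if key in valid:                       # skip N or other symbols
            counts[key] += 1
    total = sum(counts.values())
    return {k: (counts[k] / total if total else 0.0) for k in keys}


# Tiny toy hairpin (hypothetical), far shorter than a real pre-miRNA.
features = triplet_features("GGGAAACCC", "(((...)))")
print({k: round(v, 3) for k, v in features.items() if v > 0})
```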
Genotype calling
www.cbs.dtu.dk/chipcourse/Lectures/genotype_calling.pdf
SNP call rate? Plot of SNPs along allele A (e.g. A) and allele B (e.g. C)
You can get AA (AA), AB (AC), or BB (CC).
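A toy sketch of the idea: turn the two allele intensities into a B-allele fraction, call AA / AB / BB when the fraction falls in a clear region, and leave the sample uncalled otherwise. The call rate is then the fraction of samples that got a call. The cut-offs, the minimum signal, and the function names are all made-up assumptions; real genotype callers cluster the samples rather than use fixed thresholds.

```python
def call_genotype(a, b, homo_cut=0.2, het_band=(0.4, 0.6), min_signal=100.0):
    """Call a genotype from allele A and allele B intensities.

    Returns "AA", "AB", "BB", or None (no call) when the signal is too
    weak or the B-allele fraction falls between the clusters.
    """
    total = a + b
    if total < min_signal:
        return None                      # too little signal to call
    frac_b = b / total
    if frac_b <= homo_cut:
        return "AA"
    if frac_b >= 1 - homo_cut:
        return "BB"
    if het_band[0] <= frac_b <= het_band[1]:
        return "AB"
    return None                          # ambiguous, between clusters


def call_rate(calls):
    """Fraction of samples with a genotype call."""
    calls = list(calls)
    return sum(c is not None for c in calls) / len(calls)


# Hypothetical (A intensity, B intensity) pairs for one SNP.
samples = [(950, 40), (480, 510), (30, 900), (700, 300), (20, 15)]
calls = [call_genotype(a, b) for a, b in samples]
rate = call_rate(calls)
print(calls, f"call rate = {rate:.2f}")
```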
R draw.key positioning
https://stat.ethz.ch/pipermail/r-help/2009-February/187229.html
The simplest way to change position is to supply a simple 'vp' argument.
library(lattice)
xyplot(1 ~ 1, panel = function(...) {
    require(grid)
    panel.xyplot(...)
    draw.key(list(text = list(lab = 'catch'),
                  lines = list(lwd = c(2)),
                  text = list(lab = 'landings'),
                  rectangles = list(col = rgb(0.1, 0.1, 0, 0.1))),
             draw = TRUE,
             vp = viewport(x = unit(0.75, "npc"), y = unit(0.9, "npc")))
})
Tuesday, November 2, 2010
MiR-107 and MiR-185 Can Induce Cell Cycle Arrest in Human Non Small Cell Lung Cancer Cell Lines
http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0006677
Algorithm text books
- the book "Biological sequence analysis" by Durbin et al. (Cambridge University Press, ISBN-13: 978-0521629713) will serve as our main reference book (the BSA book)
- if you do not have a strong Biology background, I suggest "Molecular Biology of the Gene" by James Watson et al. (Benjamin Cummings, 6th edition (2007), ISBN-13 978-0805395921) and, to a lesser extent, "Molecular Biology of the Cell" by Bruce Alberts which is also a fine book (Garland, 4th edition (2002), ISBN-13: 978-0815332183) as your reference books. Make sure you are dealing with the latest editions of these books.
Analytic Bridge
Data mining, statistics, quant, operations research, six sigma, econometrics, web analytics, text mining, business intelligence, SAS, biostatistics, machine learning, artificial intelligence, decision sciences, cloud computing, SaaS.
http://www.analyticbridge.com/
Monday, November 1, 2010
Git basics
Upload it again: 'git push'
So, to have a successful work session do the following:
Sit down at computer
type 'git pull'
make changes
test your changes
type 'git commit'
describe your changes
type 'git pull'
type 'git push'
Get up and walk away from the computer
You can type 'git pull' every minute, if you like.
You can type 'git commit -a' just after you've made a change and have tested it a little (Any syntax errors? Does it run at all?).
You can type 'git push' after each commit; it has no effect until then.
~$ git config --global user.email my@email.com
~$ git config --global user.name myusrname