Monday, February 4, 2013

TCGA Data Classification

Source: https://wiki.nci.nih.gov/display/TCGA/Data+Classification



This page describes the different ways data can be classified in TCGA. The following topics are included in this section:

Data Type

A data type is a label to categorize the many forms of platform data within the TCGA Network.
Each platform can potentially produce many kinds of data (data types). For example, SNP-based array platforms are the most complex in that the platform yields three data types: Copy Number Results, LOH and SNP. The following table identifies data types produced by the six listed platforms.

Platform                                              Copy Number Results   LOH   SNP
Agilent Human Genome CGH Custom Microarray 2x415K     yes                   -     -
Agilent Human Genome CGH Microarray 244A              yes                   -     -
Agilent SurePrint G3 Human CGH Microarray Kit 1x1M    yes                   -     -
Affymetrix Genome-Wide Human SNP Array 6.0            yes                   yes   yes
Illumina 550K Infinium HumanHap550 SNP Chip           yes                   yes   yes
Illumina Human1M-Duo BeadChip                         yes                   yes   yes
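
As a rough illustration, the platform-to-data-type table above can be held as a simple lookup. The Python sketch below is not part of any TCGA tool: the names `CGH_PLATFORMS`, `SNP_PLATFORMS`, `PLATFORM_DATA_TYPES` and `platforms_producing` are hypothetical, and the assignment of LOH and SNP to the three SNP-based platforms is an assumption taken from the description above rather than from an official code table.

```python
# Hypothetical lookup transcribed from the table above; not an official TCGA code table.
CGH_PLATFORMS = [
    "Agilent Human Genome CGH Custom Microarray 2x415K",
    "Agilent Human Genome CGH Microarray 244A",
    "Agilent SurePrint G3 Human CGH Microarray Kit 1x1M",
]
SNP_PLATFORMS = [
    "Affymetrix Genome-Wide Human SNP Array 6.0",
    "Illumina 550K Infinium HumanHap550 SNP Chip",
    "Illumina Human1M-Duo BeadChip",
]

# Assumption: CGH platforms yield only copy number data, while SNP-based
# platforms yield all three data types.
PLATFORM_DATA_TYPES = {name: {"Copy Number Results"} for name in CGH_PLATFORMS}
PLATFORM_DATA_TYPES.update(
    {name: {"Copy Number Results", "LOH", "SNP"} for name in SNP_PLATFORMS}
)


def platforms_producing(data_type):
    """Return the sorted names of platforms that yield the given data type."""
    return sorted(
        name for name, types in PLATFORM_DATA_TYPES.items() if data_type in types
    )


print(platforms_producing("LOH"))
```

Keeping the data types as sets makes it easy to ask the question in either direction: which data types a platform yields, or which platforms yield a data type.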

Data Level Classification

Data level is a method of data categorization used within the TCGA network to help researchers communicate about and locate their data of interest.
Data levels are assigned for each data type, platform and center. There are four data levels: Level 1 (for Raw Data), Level 2 (for Processed Data), Level 3 (for Segmented or Interpreted Data) and Level 4 (for Region of Interest Data).
The following table outlines and describes the four TCGA data levels.
Data Level  Level Type                          Description
1           Raw                                 • Low-level data for single sample
                                                • Not normalized
2           Processed                           • Normalized single sample data
                                                • Interpreted for presence or absence of specific molecular abnormalities
3           Segmented/Interpreted               • Aggregate of processed data from single sample
                                                • Grouped by probed loci to form larger contiguous regions (in some cases)
4           Summary/Regions of Interest (ROI)   • Quantified association across classes of samples
                                                • Associations based on two or more
                                                    • Molecular abnormalities
                                                    • Sample characteristics
                                                    • Clinical variables
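
The four levels listed above form an ordered scale, which can be sketched as an enumeration. This is only an illustrative Python sketch: the names `DataLevel`, `LEVEL_TYPE` and `label` are hypothetical and do not come from TCGA software.

```python
from enum import IntEnum


class DataLevel(IntEnum):
    """The four TCGA data levels; member names are hypothetical."""
    RAW = 1
    PROCESSED = 2
    SEGMENTED_INTERPRETED = 3
    SUMMARY_ROI = 4


# Level types as given in the table above.
LEVEL_TYPE = {
    DataLevel.RAW: "Raw",
    DataLevel.PROCESSED: "Processed",
    DataLevel.SEGMENTED_INTERPRETED: "Segmented/Interpreted",
    DataLevel.SUMMARY_ROI: "Summary/Regions of Interest (ROI)",
}


def label(level):
    """Return a label such as 'Level 2 (Processed)'.

    Raises ValueError for numbers outside 1-4.
    """
    lvl = DataLevel(level)
    return f"Level {lvl.value} ({LEVEL_TYPE[lvl]})"


print(label(3))
```

Using `IntEnum` keeps the levels comparable, so a check such as `DataLevel.RAW < DataLevel.PROCESSED` reflects the progression from raw to more heavily processed data.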

Relationships Between Data Type and Data Level

Each platform can produce multiple data types. To understand data categorization, it is important to clarify the relationship between data type and data level.
Each data type is associated with sets of data that span one or more data levels. Each center and platform may have a slightly different concept of data level, depending on its data types and the algorithms used for analysis. The table below gives the current description of each data level as it applies to each data type. Data types are listed in the Code Tables Report and data level descriptions are listed above under Data Level Classification.

Data Type and Corresponding Data Level Descriptions

Data Type
Data Subtypes
Level 1
Level 2
Level 3
Important Metadata
Clinical Data
1. Clinical data
n/a
Clinical information for each participant (including demographic information, treatment information, survival data, etc)

File types: tab-delimited "biotab" (.txt) and .xml
n/a
n/a
The BCR data dictionary describes all of the clinical and biospecimen data elements in TCGA
2. Biospecimen data
n/a
Information on how samples from each participant were processed by the Biospecimen Core Resource Center (BCR)

File types: tab-delimited "biotab" (.txt) and .xml
n/a
n/a
The BCR data dictionary describes all of the clinical and biospecimen data elements in TCGA
Tissue Slide Images
1. Diagnostic image
n/a
Tissue images used by the hospital to diagnose participant

File type: .svs (image viewer)
n/a
n/a
Available images are listed in the biospecimen biotab and xml files
2. Tissue image
n/a
Images of tissue samples from each participant that were used for TCGA analyses

File type: .svs (image viewer)
n/a
n/a
Available images are listed in the biospecimen biotab and xml files
Pathology Reports
 
n/a
Pathology reports for a subset of participants

File type: .pdf
n/a
n/a
n/a
Microsatellite Instability (MSI)
 
n/a
Markers indicating presence or absence of an MSI shift, allele homozygosity/heterozygosity, and loss of heterozygosity (LOH) observed in the tumor sample for each participant

File types: fragment analysis trace file (.fsa) and tab-delimited (.txt) file summarizing the trace file
n/a
Classifications of microsatellite instability detected for each participant's tumor sample

File type: auxiliary.xml
Level 1 data are submitted as part of a standard MAGE-TAB archive

Level 3 data are contained in the BCR clinical data archives
DNA Sequencing
1. Whole exome sequence (available at the Cancer Genomics Hub)
IlluminaGA_DNASeq
SOLiD_DNASeq
Whole exome sequence for both tumor and normal sample for each participant

File type: binary alignment file (.bam)
n/a
n/a
Experimental protocol, including primer information, is contained in the metadata.xml file associated with each .bam file
2. Whole genome sequence (available at the Cancer Genomics Hub)
IlluminaGA_DNASeq
SOLiD_DNASeq
Whole genome sequence for both tumor and normal sample for select participants

File type: binary alignment file (.bam)
n/a
n/a
Experimental protocol, including primer information, is contained in the metadata.xml file associated with each .bam file
3. Sequence traces (may be available at the NCBI Trace Archive)¹
n/a
Raw sequence output from older sequencing technologies

File type: sequence chromatogram format (.scf)
n/a
n/a
Trace-sample relationship (.tr) files map NCBI trace IDs to TCGA biospecimen barcodes
4. Mutations
IlluminaGA_DNASeq
SOLiD_DNASeq
Whole genome and exome sequence - see above
Validated and unvalidated DNA variants/mutations for each participant

File types: mutation annotation file (.maf) and variant calling file (.vcf)
Validated DNA variants/mutations for each participant

File type: mutation annotation file (.maf)
The DESCRIPTION file contains a summary of the mutation detection and validation method

The .maf files do not have a standard MAGE-TAB archive associated with them
Expression - miRNA Sequencing
1. miRNA sequence (available at the Cancer Genomics Hub)
IlluminaGA_miRNASeq
IlluminaHiSeq_miRNASeq
miRNA sequence for each participant's tumor sample

File type: binary alignment file (.bam)
n/a
n/a
Experimental protocol, including primer information, is contained in the metadata.xml file associated with each .bam file
2. miRNA
IlluminaGA_miRNASeq
IlluminaHiSeq_miRNASeq
miRNA sequence for each participant's tumor sample - see above
n/a
The calculated expression for all reads aligning to a particular miRNA, per sample

File type: tab-delimited (.txt)
Experimental protocol, including calculation methods, is included in the DESCRIPTION file of the MAGE-TAB archive
3. Isoform
IlluminaGA_miRNASeq
IlluminaHiSeq_miRNASeq
miRNA sequence for each participant's tumor sample - see above
n/a
The calculated expression for each individual miRNA sequence isoform observed, per sample

File type: tab-delimited (.txt)
Experimental protocol, including calculation methods, is included in the DESCRIPTION file of the MAGE-TAB archive
Expression - Protein Array
 
MDA_RPPA_Core
High resolution images of protein array slides (up to 1000 samples per slide) and raw signals per slide

File types: .tiff (image viewer) for images and tab-delimited (.txt) for signals
Dilution curves for each sample

File type: tab-delimited (.txt)
Normalized protein expression for each gene

File type: tab-delimited (.txt)
Array design files, antibody annotations, and the experimental protocol are included in the MAGE-TAB archive
Expression - mRNA Sequencing²
1. mRNA sequence (available at the Cancer Genomics Hub)
IlluminaGA_RNASeq
IlluminaHiSeq_RNASeq
mRNA sequence for each participant's tumor sample

File type: binary alignment file (.bam)
n/a
n/a
Experimental protocol, including primer information, is contained in the metadata.xml file associated with each .bam file
2. Exon
IlluminaGA_RNASeq
IlluminaHiSeq_RNASeq
mRNA sequence for each participant's tumor sample - see above
n/a
The calculated expression signal of a particular composite exon of a gene

File type: tab-delimited (.txt)
Experimental protocol, including calculation methods, is included in the DESCRIPTION file of the MAGE-TAB archive
3. Gene
IlluminaGA_RNASeq
IlluminaHiSeq_RNASeq
mRNA sequence for each participant's tumor sample - see above
n/a
The calculated expression signal of a gene

File type: tab-delimited (.txt)
Experimental protocol, including calculation methods, is included in the DESCRIPTION file of the MAGE-TAB archive
4. Splice Junction
IlluminaGA_RNASeq
IlluminaHiSeq_RNASeq
mRNA sequence for each participant's tumor sample - see above
n/a
The calculated expression signal of a particular composite splice junction of a gene

File type: tab-delimited (.txt)
Experimental protocol, including calculation methods, is included in the DESCRIPTION file of the MAGE-TAB archive
5. Isoform
IlluminaGA_RNASeq
IlluminaHiSeq_RNASeq
mRNA sequence for each participant's tumor sample - see above
n/a
The normalized expression signal of individual isoforms (transcripts)

File type: tab-delimited (.txt)
Experimental protocol, including calculation methods, is included in the DESCRIPTION file of the MAGE-TAB archive
Expression Array¹
1. Gene
AgilentG4502A_07_3
AgilentG4502A_07_2
AgilentG4502A_07_1
HT-HG-U133A
HG-U133-Plus2
Raw signals per probe

File types: binary (.CEL) and tab-delimited (.txt)
Normalized signals per probe or probe set

File type: tab-delimited (.txt)
Expression calls for genes, per sample

File type: tab-delimited (.txt)
Experimental protocol, including calculation methods, is included in the MAGE-TAB archive

Probe information is contained in the Array design files for each platform
2. Exon
HuEx-1_0-st-v2
Raw signals per probe

File type: binary (.CEL)
Normalized signals per probe or probe set

File type: tab-delimited (.txt)
Expression calls for exons/variants, per sample

File type: tab-delimited (.txt)
Experimental protocol, including calculation methods, is included in the MAGE-TAB archive

Probe information is contained in the Array design files for each platform
3. miRNA
H-miRNA_8x15K
H-miRNA_8x15Kv2
Raw signals per probe

File type: tab-delimited (.txt)
Normalized signals per probe or probe set

File type: tab-delimited (.txt)
Expression calls for miRNAs, per sample

File type: tab-delimited (.txt)
Experimental protocol, including calculation methods, is included in the MAGE-TAB archive

Probe information is contained in the Array design files for each platform
DNA Methylation
 
HumanMethylation27
HumanMethylation450

IlluminaDNAMethylation_OMA002_CPI

IlluminaDNAMethylation_OMA003_CPI
Raw signal intensities of probes

File types: tab-delimited (.txt) and binary (.idat; applies to the HumanMethylation450 platform)
Calculated beta values

File type: tab-delimited (.txt)
Calculated beta values mapped to genome, per sample

File type: tab-delimited (.txt)
Experimental protocol, including calculation methods, is included in the MAGE-TAB archive

Probe information is contained in the Array design files for each platform
SNP
 
Genome_Wide_SNP_6
Human1MDuo
HumanHap550
Raw data

File types: binary (.CEL), binary (.idat), and tab-delimited (.txt)
Unnormalized SNP, copy number, and LOH data

File type: tab-delimited (.txt)
Normalized copy number and LOH data, per sample

File type: tab-delimited (.txt)
Experimental protocol, including calculation methods, is included in the MAGE-TAB archive

Probe information is contained in the Array design files for each platform
Copy Number Results
1. Sequencing based
IlluminaHiSeq_DNASeqC
Low pass, whole genome sequence of both tumor and normal samples for each participant and analysis of differences in read counts between the tumor and normal sample

File type: binary alignment file (.bam)
DNA variants/mutations and copy number variation for each participant

File type: variant calling format (.vcf)
Regions with differences in genome coverage (number of reads) between normal and tumor samples for each participant

File type: tab-delimited (.tsv)
Experimental protocol, including calculation methods, is included in the DESCRIPTION file of the MAGE-TAB archive
2. Array based - CGH¹
HG-CGH-244A
HG-CGH-415K_G4124A
CGH-1x1M_G4447A
Raw signals per probe

File type: tab-delimited (.txt)
Normalized signals for copy number alterations of aggregated regions, per probe or probe set

File type: tab-delimited (.tsv and .mat)
Copy number alterations for aggregated/segmented regions, per sample

File type: tab-delimited (.tsv and .txt)
Experimental protocol, including calculation methods, is included in the MAGE-TAB archive

Probe information is contained in the Array design files for each platform
3. Array based - SNP (see above)





¹ Denotes an older data type/platform that applies to TCGA pilot projects (GBM and Ovarian) only.
² RNA sequencing has two versions, V1 and V2. Version 2 differs from Version 1 in the algorithm used to generate the data.
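
The relationship between data type and data level in the table above can be sketched as a lookup from (data type, subtype) to the levels at which data exist. The Python below is an illustration only and covers a subset of rows. The names `AVAILABLE_LEVELS` and `levels_for` are hypothetical, and the level sets assume that the cell following each subtype holds the platform code, with the next three cells being Levels 1 to 3 and "n/a" meaning no data at that level.

```python
# Hypothetical subset of the table above: levels at which each
# (data type, subtype) pair has data. None marks a data type with no subtypes.
# Assumption: "n/a" in a level column means no data at that level.
AVAILABLE_LEVELS = {
    ("Clinical Data", "Clinical data"): {1},
    ("Clinical Data", "Biospecimen data"): {1},
    ("Microsatellite Instability (MSI)", None): {1, 3},
    ("DNA Sequencing", "Whole exome sequence"): {1},
    ("DNA Sequencing", "Mutations"): {1, 2, 3},
    ("Expression - miRNA Sequencing", "miRNA"): {1, 3},
    ("Expression - Protein Array", None): {1, 2, 3},
    ("Expression - mRNA Sequencing", "Gene"): {1, 3},
    ("DNA Methylation", None): {1, 2, 3},
    ("SNP", None): {1, 2, 3},
}


def levels_for(data_type, subtype=None):
    """Return the sorted data levels recorded for a data type and subtype.

    Returns an empty list when the pair is not in the lookup.
    """
    return sorted(AVAILABLE_LEVELS.get((data_type, subtype), set()))


print(levels_for("DNA Sequencing", "Mutations"))
print(levels_for("Expression - miRNA Sequencing", "miRNA"))
```

A lookup like this makes gaps visible at a glance, for example that the sequencing-based expression subtypes listed here skip Level 2.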
