This page describes the different ways data can be classified in TCGA. It covers data types, data levels, and the relationships between the two.
Data Type
A data type is a label used to categorize the many forms of platform data within the TCGA Network.
Each platform can potentially produce many kinds of data (data types). For example, SNP-based array platforms are the most complex because each one yields three data types: Copy Number Results, LOH and SNP. The following table identifies the data types produced by the six listed platforms.
| Data Type | Agilent Human Genome CGH Custom Microarray 2x415K | Agilent Human Genome CGH Microarray 244A | Agilent SurePrint G3 Human CGH Microarray Kit 1x1M | Affymetrix Genome-Wide Human SNP Array 6.0 | Illumina 550K Infinium HumanHap550 SNP Chip | Illumina Human1M-Duo BeadChip |
|---|---|---|---|---|---|---|
| Copy Number Results | yes | yes | yes | yes | yes | yes |
| LOH | no | no | no | yes | yes | yes |
| SNP | no | no | no | yes | yes | yes |
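As a rough illustration, the platform-to-data-type table above can be held in a small lookup structure. The Python sketch below is not part of any TCGA tooling: the names `PLATFORM_DATA_TYPES` and `platforms_producing` are hypothetical, and the contents are simply transcribed from the table.

```python
# Data types produced by each platform, transcribed from the table above.
# The dictionary and function names are illustrative, not a TCGA API.
PLATFORM_DATA_TYPES = {
    "Agilent Human Genome CGH Custom Microarray 2x415K": {"Copy Number Results"},
    "Agilent Human Genome CGH Microarray 244A": {"Copy Number Results"},
    "Agilent SurePrint G3 Human CGH Microarray Kit 1x1M": {"Copy Number Results"},
    "Affymetrix Genome-Wide Human SNP Array 6.0": {"Copy Number Results", "LOH", "SNP"},
    "Illumina 550K Infinium HumanHap550 SNP Chip": {"Copy Number Results", "LOH", "SNP"},
    "Illumina Human1M-Duo BeadChip": {"Copy Number Results", "LOH", "SNP"},
}


def platforms_producing(data_type):
    """Return the sorted names of platforms that yield the given data type."""
    return sorted(
        platform
        for platform, data_types in PLATFORM_DATA_TYPES.items()
        if data_type in data_types
    )


if __name__ == "__main__":
    # List the SNP-based array platforms, which are the only ones yielding LOH.
    for name in platforms_producing("LOH"):
        print(name)
```

Keying the structure by platform mirrors the table layout; inverting it to key by data type would work equally well if lookups by data type dominate.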
Data Level Classification
Data level is a method of data categorization used within the TCGA Network to help researchers communicate about and locate their data of interest.
Data levels are assigned for each data type, platform and center. There are four data levels: Level 1 (for Raw Data), Level 2 (for Processed Data), Level 3 (for Segmented or Interpreted Data) and Level 4 (for Region of Interest Data).
The following table outlines and describes the four TCGA data levels.
| Data Level | Level Type | Description |
|---|---|---|
| 1 | Raw | |
| 2 | Processed | |
| 3 | Segmented/Interpreted | |
| 4 | Summary/Regions of Interest (ROI) | |
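For scripts that need to refer to these levels, one option is a small enumeration. The following Python sketch is only an illustration under assumed names (`DataLevel`, `LEVEL_TYPE`, and `describe` are hypothetical); the numbers and level types come from the table above.

```python
from enum import IntEnum


class DataLevel(IntEnum):
    """The four TCGA data levels; member names are illustrative."""
    RAW = 1
    PROCESSED = 2
    SEGMENTED_INTERPRETED = 3
    SUMMARY_ROI = 4


# Level type labels as they appear in the table above.
LEVEL_TYPE = {
    DataLevel.RAW: "Raw",
    DataLevel.PROCESSED: "Processed",
    DataLevel.SEGMENTED_INTERPRETED: "Segmented/Interpreted",
    DataLevel.SUMMARY_ROI: "Summary/Regions of Interest (ROI)",
}


def describe(level):
    """Return a readable label for a level number; raises ValueError if unknown."""
    lvl = DataLevel(level)
    return f"Level {lvl.value} ({LEVEL_TYPE[lvl]})"


if __name__ == "__main__":
    for lvl in DataLevel:
        print(describe(lvl))
```

An `IntEnum` keeps the levels ordered, so comparisons such as `DataLevel.RAW < DataLevel.PROCESSED` behave as expected.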
Relationships Between Data Type and Data Level
Each platform can produce multiple data types. To understand data categorization, it is important to clarify the relationship between data type and data level.
Each data type is associated with sets of data that span one or more data levels. Each center and platform may have a slightly different concept of data level, depending on its data types and the algorithms used for analysis. The table below lists the data levels as they currently apply to each data type. Data types are listed in the Code Tables Report, and data level descriptions are given above under Data Level Classification.
Data Type and Corresponding Data Level Descriptions
| Data Type | Data Subtypes | Platform(s) | Level 1 | Level 2 | Level 3 | Important Metadata |
|---|---|---|---|---|---|---|
| Clinical Data | 1. Clinical data | n/a | Clinical information for each participant (including demographic information, treatment information, survival data, etc.). File types: tab-delimited "biotab" (.txt) and .xml | n/a | n/a | The BCR data dictionary describes all of the clinical and biospecimen data elements in TCGA |
| Clinical Data | 2. Biospecimen data | n/a | Information on how samples from each participant were processed by the Biospecimen Core Resource Center (BCR). File types: tab-delimited "biotab" (.txt) and .xml | n/a | n/a | The BCR data dictionary describes all of the clinical and biospecimen data elements in TCGA |
| Tissue Slide Images | 1. Diagnostic image | n/a | Tissue images used by the hospital to diagnose the participant. File type: .svs (image viewer) | n/a | n/a | Available images are listed in the biospecimen biotab and xml files |
| Tissue Slide Images | 2. Tissue image | n/a | Images of tissue samples from each participant that were used for TCGA analyses. File type: .svs (image viewer) | n/a | n/a | Available images are listed in the biospecimen biotab and xml files |
| Pathology Reports | | n/a | Pathology reports for a subset of participants. File type: .pdf | n/a | n/a | n/a |
| Microsatellite Instability (MSI) | | n/a | Markers indicating presence or absence of an MSI shift, allele homozygosity/heterozygosity, and loss of heterozygosity (LOH) observed in the tumor sample for each participant. File types: fragment analysis trace file (.fsa) and tab-delimited (.txt) file summarizing the trace file | n/a | Classifications of microsatellite instability detected for each participant's tumor sample. File type: auxiliary.xml | Level 1 data are submitted as part of a standard MAGE-TAB archive. Level 3 data are contained in the BCR clinical data archives |
| DNA Sequencing | 1. Whole exome sequence (available at the Cancer Genomics Hub) | IlluminaGA_DNASeq, SOLiD_DNASeq | Whole exome sequence for both tumor and normal sample for each participant. File type: binary alignment file (.bam) | n/a | n/a | Experimental protocol, including primer information, is contained in the metadata.xml file associated with each .bam file |
| DNA Sequencing | 2. Whole genome sequence (available at the Cancer Genomics Hub) | IlluminaGA_DNASeq, SOLiD_DNASeq | Whole genome sequence for both tumor and normal sample for select participants. File type: binary alignment file (.bam) | n/a | n/a | Experimental protocol, including primer information, is contained in the metadata.xml file associated with each .bam file |
| DNA Sequencing | 3. Sequence traces (may be available at the NCBI Trace Archive)¹ | n/a | Raw sequence output from older sequencing technologies. File type: sequence chromatogram format (.scf) | n/a | n/a | Trace-sample relationship (.tr) files map NCBI trace IDs to TCGA biospecimen barcodes |
| DNA Sequencing | 4. Mutations | IlluminaGA_DNASeq, SOLiD_DNASeq | Whole genome and exome sequence (see above) | Validated and unvalidated DNA variants/mutations for each participant. File types: mutation annotation file (.maf) and variant calling file (.vcf) | Validated DNA variants/mutations for each participant. File type: mutation annotation file (.maf) | The DESCRIPTION file contains a summary of the mutation detection and validation method. The .maf files do not have a standard MAGE-TAB archive associated with them |
| Expression - miRNA Sequencing | 1. miRNA sequence (available at the Cancer Genomics Hub) | IlluminaGA_miRNASeq, IlluminaHiSeq_miRNASeq | miRNA sequence for each participant's tumor sample. File type: binary alignment file (.bam) | n/a | n/a | Experimental protocol, including primer information, is contained in the metadata.xml file associated with each .bam file |
| Expression - miRNA Sequencing | 2. miRNA | IlluminaGA_miRNASeq, IlluminaHiSeq_miRNASeq | miRNA sequence for each participant's tumor sample (see above) | n/a | The calculated expression for all reads aligning to a particular miRNA, per sample. File type: tab-delimited (.txt) | Experimental protocol, including calculation methods, is included in the DESCRIPTION file of the MAGE-TAB archive |
| Expression - miRNA Sequencing | 3. Isoform | IlluminaGA_miRNASeq, IlluminaHiSeq_miRNASeq | miRNA sequence for each participant's tumor sample (see above) | n/a | The calculated expression for each individual miRNA sequence isoform observed, per sample. File type: tab-delimited (.txt) | Experimental protocol, including calculation methods, is included in the DESCRIPTION file of the MAGE-TAB archive |
| Expression - Protein Array | | MDA_RPPA_Core | High resolution images of protein array slides (up to 1000 samples per slide) and raw signals per slide. File types: .tiff (image viewer) for images and tab-delimited (.txt) for signals | Dilution curves for each sample. File type: tab-delimited (.txt) | Normalized protein expression for each gene. File type: tab-delimited (.txt) | Array design files, antibody annotations, and the experimental protocol are included in the MAGE-TAB archive |
| Expression - mRNA Sequencing² | 1. mRNA sequence (available at the Cancer Genomics Hub) | IlluminaGA_RNASeq, IlluminaHiSeq_RNASeq | mRNA sequence for each participant's tumor sample. File type: binary alignment file (.bam) | n/a | n/a | Experimental protocol, including primer information, is contained in the metadata.xml file associated with each .bam file |
| Expression - mRNA Sequencing² | 2. Exon | IlluminaGA_RNASeq, IlluminaHiSeq_RNASeq | mRNA sequence for each participant's tumor sample (see above) | n/a | The calculated expression signal of a particular composite exon of a gene. File type: tab-delimited (.txt) | Experimental protocol, including calculation methods, is included in the DESCRIPTION file of the MAGE-TAB archive |
| Expression - mRNA Sequencing² | 3. Gene | IlluminaGA_RNASeq, IlluminaHiSeq_RNASeq | mRNA sequence for each participant's tumor sample (see above) | n/a | The calculated expression signal of a gene. File type: tab-delimited (.txt) | Experimental protocol, including calculation methods, is included in the DESCRIPTION file of the MAGE-TAB archive |
| Expression - mRNA Sequencing² | 4. Splice Junction | IlluminaGA_RNASeq, IlluminaHiSeq_RNASeq | mRNA sequence for each participant's tumor sample (see above) | n/a | The calculated expression signal of a particular composite splice junction of a gene. File type: tab-delimited (.txt) | Experimental protocol, including calculation methods, is included in the DESCRIPTION file of the MAGE-TAB archive |
| Expression - mRNA Sequencing² | 5. Isoform | IlluminaGA_RNASeq, IlluminaHiSeq_RNASeq | mRNA sequence for each participant's tumor sample (see above) | n/a | The normalized expression signal of individual isoforms (transcripts). File type: tab-delimited (.txt) | Experimental protocol, including calculation methods, is included in the DESCRIPTION file of the MAGE-TAB archive |
| Expression Array¹ | 1. Gene | AgilentG4502A_07_3, AgilentG4502A_07_2, AgilentG4502A_07_1, HT-HG-U133A, HG-U133-Plus2 | Raw signals per probe. File types: binary (.CEL) and tab-delimited (.txt) | Normalized signals per probe or probe set. File type: tab-delimited (.txt) | Expression calls for genes, per sample. File type: tab-delimited (.txt) | Experimental protocol, including calculation methods, is included in the MAGE-TAB archive. Probe information is contained in the Array design files for each platform |
| Expression Array¹ | 2. Exon | HuEx-1_0-st-v2 | Raw signals per probe. File type: binary (.CEL) | Normalized signals per probe or probe set. File type: tab-delimited (.txt) | Expression calls for exons/variants, per sample. File type: tab-delimited (.txt) | Experimental protocol, including calculation methods, is included in the MAGE-TAB archive. Probe information is contained in the Array design files for each platform |
| Expression Array¹ | 3. miRNA | H-miRNA_8x15K, H-miRNA_8x15Kv2 | Raw signals per probe. File type: tab-delimited (.txt) | Normalized signals per probe or probe set. File type: tab-delimited (.txt) | Expression calls for miRNAs, per sample. File type: tab-delimited (.txt) | Experimental protocol, including calculation methods, is included in the MAGE-TAB archive. Probe information is contained in the Array design files for each platform |
| DNA Methylation | | HumanMethylation27, HumanMethylation450, IlluminaDNAMethylation_OMA002_CPI, IlluminaDNAMethylation_OMA003_CPI | Raw signal intensities of probes. File types: tab-delimited (.txt) and binary (.idat; applies to the HumanMethylation450 platform) | Calculated beta values. File type: tab-delimited (.txt) | Calculated beta values mapped to genome, per sample. File type: tab-delimited (.txt) | Experimental protocol, including calculation methods, is included in the MAGE-TAB archive. Probe information is contained in the Array design files for each platform |
| SNP | | Genome_Wide_SNP_6, Human1MDuo, HumanHap550 | Raw data. File types: binary (.CEL), binary (.idat), and tab-delimited (.txt) | Unnormalized SNP, copy number, and LOH data. File type: tab-delimited (.txt) | Normalized copy number and LOH data, per sample. File type: tab-delimited (.txt) | Experimental protocol, including calculation methods, is included in the MAGE-TAB archive. Probe information is contained in the Array design files for each platform |
| Copy Number Results | 1. Sequencing based | IlluminaHiSeq_DNASeqC | Low pass, whole genome sequence of both tumor and normal samples for each participant and analysis of differences in read counts between the tumor and normal sample. File type: binary alignment file (.bam) | DNA variants/mutations and copy number variation for each participant. File type: variant calling format (.vcf) | Regions with differences in genome coverage (number of reads) between normal and tumor samples for each participant. File type: tab-delimited (.tsv) | Experimental protocol, including calculation methods, is included in the DESCRIPTION file of the MAGE-TAB archive |
| Copy Number Results | 2. Array based - CGH¹ | HG-CGH-244A, HG-CGH-415K_G4124A, CGH-1x1M_G4447A | Raw signals per probe. File type: tab-delimited (.txt) | Normalized signals for copy number alterations of aggregated regions, per probe or probe set. File type: tab-delimited (.tsv and .mat) | Copy number alterations for aggregated/segmented regions, per sample. File type: tab-delimited (.tsv and .txt) | Experimental protocol, including calculation methods, is included in the MAGE-TAB archive. Probe information is contained in the Array design files for each platform |
| Copy Number Results | 3. Array based - SNP (see the SNP row above) | | | | | |

¹ Denotes an older data type/platform that applies to TCGA pilot projects (GBM and Ovarian) only.
² RNA sequencing has two versions, V1 and V2. Version 2 differs from Version 1 in the algorithm used to generate the data.
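To make the relationship between data type and data level concrete, the sketch below encodes a handful of rows from the table above and answers two simple questions: at which levels a data type is available, and which data types provide a given file extension at a given level. This is a hedged illustration only. The names `LEVEL_FILES`, `levels_available`, and `entries_with_extension` are hypothetical, only an excerpt of the table is included, and levels marked "n/a" are represented by leaving them out.

```python
# Excerpt of the table above: (data type, subtype) -> {level: file extensions}.
# An empty string stands for "no subtype". Levels listed as "n/a" are omitted.
# For Mutations, Level 1 refers back to the whole genome/exome .bam files.
LEVEL_FILES = {
    ("Clinical Data", "Clinical data"): {1: [".txt", ".xml"]},
    ("Pathology Reports", ""): {1: [".pdf"]},
    ("DNA Sequencing", "Mutations"): {1: [".bam"], 2: [".maf", ".vcf"], 3: [".maf"]},
    ("Expression - Protein Array", ""): {1: [".tiff", ".txt"], 2: [".txt"], 3: [".txt"]},
    ("DNA Methylation", ""): {1: [".txt", ".idat"], 2: [".txt"], 3: [".txt"]},
}


def levels_available(data_type, subtype=""):
    """Return the sorted data levels at which this data type has files."""
    return sorted(LEVEL_FILES[(data_type, subtype)])


def entries_with_extension(extension, level):
    """Return sorted (data type, subtype) pairs offering `extension` at `level`."""
    return sorted(
        key
        for key, levels in LEVEL_FILES.items()
        if extension in levels.get(level, [])
    )


if __name__ == "__main__":
    print(levels_available("DNA Sequencing", "Mutations"))
    print(entries_with_extension(".txt", 2))
```

A nested dictionary is enough for an excerpt of this size; for the full table, loading the rows from a tab-delimited file would avoid hand transcription.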