Data for bulk download at legumeinfo is now stored at the Data Store. The layout looks like:
Genus/species/datatype/
Genotype.DatasetTypeVersion.Subtype.Key/
README.Key.yml
gensp.Genotype.DatasetTypeVersion.Subtype.KeyA.DESCRIPTION1.ext.gz
gensp.Genotype.DatasetTypeVersion.Subtype.KeyA.DESCRIPTION2.ext.gz
Example:
Trifolium/pratense/annotations/
MilvusB.gnm2.ann1.DFgp/
README.DFgp.yml
tripr.MilvusB.gnm2.ann1.DFgp.gene_models_main.gff3.gz
[more files]
Trifolium/pratense/genomes/
MilvusB.gnm2.gNmT/
README.gNmT.yml
tripr.MilvusB.gnm2.gNmT.genome_main.fna.gz
etc...
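The naming scheme above can be taken apart mechanically. The following is a minimal Python sketch, not an official Data Store tool: the names DataStoreFile and parse_filename are hypothetical, and it assumes that the genotype and description fields contain no dots and that every file name ends in a single extension, optionally followed by ".gz".

```python
from typing import NamedTuple, Tuple


class DataStoreFile(NamedTuple):
    """Fields of a Data Store file name (hypothetical helper type)."""
    gensp: str                 # five-letter genus/species code, e.g. "tripr"
    genotype: str              # e.g. "MilvusB"
    versions: Tuple[str, ...]  # type/version tokens, e.g. ("gnm2", "ann1")
    key: str                   # collection key, e.g. "DFgp"
    description: str           # e.g. "gene_models_main"
    extension: str             # e.g. "gff3"


def parse_filename(name: str) -> DataStoreFile:
    """Split a file name on dots, assuming no field contains a dot itself."""
    parts = name.split(".")
    if parts[-1] == "gz":
        parts = parts[:-1]  # drop the compression suffix
    if len(parts) < 6:
        raise ValueError(f"too few fields in file name: {name!r}")
    return DataStoreFile(
        gensp=parts[0],
        genotype=parts[1],
        versions=tuple(parts[2:-3]),  # everything between genotype and key
        key=parts[-3],
        description=parts[-2],
        extension=parts[-1],
    )


if __name__ == "__main__":
    print(parse_filename("tripr.MilvusB.gnm2.ann1.DFgp.gene_models_main.gff3.gz"))
    print(parse_filename("tripr.MilvusB.gnm2.gNmT.genome_main.fna.gz"))
```

Counting fields from both ends lets the same function handle annotation names (two type/version tokens) and genome names (one token).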
Abbreviations in the directory names:
We (generally) use three-letter abbreviations to indicate data types. This avoids the ambiguity of unlabeled version numbers (e.g. how would one indicate assembly version 1, annotation version 2?). Here are the abbreviations and corresponding data types:
- ann => annotation
- gnm => genome assembly
- tcp => transcriptome
- div => diversity (e.g. SNPs)
- map => map
- syn => synteny
- gwas => GWAS
- qtl => QTL
- pan => pan-genes and pan-genomes
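As an illustration of how these labeled version tokens can be read, here is a small Python sketch. The DATA_TYPES table copies the list above; the function name decode_token is hypothetical, and the sketch assumes a token is a lowercase abbreviation followed by an optional integer version (as in "gnm2" or "ann1").

```python
import re

# Abbreviations taken from the list above.
DATA_TYPES = {
    "ann": "annotation",
    "gnm": "genome assembly",
    "tcp": "transcriptome",
    "div": "diversity",
    "map": "map",
    "syn": "synteny",
    "gwas": "GWAS",
    "qtl": "QTL",
    "pan": "pan-genes and pan-genomes",
}

# Assumed token shape: lowercase letters, then an optional version number.
TOKEN = re.compile(r"([a-z]+)(\d*)")


def decode_token(token):
    """Return (data type, version) for a token; version is None if absent."""
    match = TOKEN.fullmatch(token)
    if match is None or match.group(1) not in DATA_TYPES:
        raise ValueError(f"unrecognized data type token: {token!r}")
    abbrev, digits = match.groups()
    version = int(digits) if digits else None
    return DATA_TYPES[abbrev], version


if __name__ == "__main__":
    for tok in ("gnm2", "ann1", "gwas"):
        print(tok, "=>", decode_token(tok))
```

Because each number is attached to its abbreviation, "gnm2.ann1" reads unambiguously as annotation version 1 on genome assembly version 2.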
Dataset key names:
The four-character string in the README and the filenames (e.g. Key or gNmT above) is a unique key that associates the file(s) in a directory (a data collection) with the metadata for the file(s). The keys are also recorded in a Registry, along with other curatorial and status information.
For data types that are usually associated with publications (QTL, GWAS, map, marker), the "key" instead has the form Author_Author_YEAR (rather than a random four-character string). For example:
Phaseolus/vulgaris/genetic/mixed.gwas.Raggi_Caproni_2019/
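The two key styles can be told apart with simple patterns. The Python sketch below is illustrative only: the function names and both regular expressions are assumptions, in particular that random keys are exactly four ASCII letters or digits and that publication keys are underscore-joined author names (letters only) ending in a four-digit year.

```python
import re

# Assumed shapes of the two key styles described above.
RANDOM_KEY = re.compile(r"[A-Za-z0-9]{4}")
PUBLICATION_KEY = re.compile(r"[A-Za-z]+(?:_[A-Za-z]+)*_\d{4}")


def collection_key(dirname):
    """Return the last dot-separated field of a collection directory name."""
    return dirname.rstrip("/").rsplit(".", 1)[-1]


def classify_key(key):
    """Label a key as 'random', 'publication', or 'unknown'."""
    if RANDOM_KEY.fullmatch(key):
        return "random"
    if PUBLICATION_KEY.fullmatch(key):
        return "publication"
    return "unknown"


if __name__ == "__main__":
    for d in ("MilvusB.gnm2.ann1.DFgp/", "mixed.gwas.Raggi_Caproni_2019/"):
        key = collection_key(d)
        print(d, "=>", key, "=>", classify_key(key))
```

Author names containing hyphens or other punctuation would fall into "unknown" under these patterns, so the expressions may need loosening for real collections.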
Searching for information in the Data Store:
Access the search text-entry box by clicking on the magnifying glass in the upper left corner. Then enter, for example, "protein" to find all files with that text in the name, or a key name (e.g. "Qq0N", if you know that from the Registry), or "Lupinus", or "lupan" (the five-letter GENus SPecies code). Also, the data is organized in a standard way, so you can probably find what you want by navigating through the directory tree.
Contributing data:
Yes, please! Contact us. For developers of other sites: if you like this system and would like to host a similar one, please contact us about the installation and configuration details (this uses the h5ai browser package). This file system and metadata schema are part of the Legume Federation project, and we would like to see other instances added.
The relationship of data in this repository to other instances of the data:
In many cases, data in the Data Store is re-hosted from another primary repository. We host it here to provide access through standardized processes at this site. Where there is another primary repository for the data, the metadata file (README.KeyX.yml) in every terminal directory gives the data provenance. This file also contains citation information, information about any file name changes from the previous repository, and any transformations applied to the data contents.
Metadata templates and protocols:
See more information about the Data Store directory, including metadata templates, the Registry, and associated protocols, at the metadata directory. See more about the Legume Federation project, which funds part of the development and curation work on the Data Store.
Go to the Data Store.