BioMart is an integral part of the Ensembl project, designed to facilitate access to and retrieval of biological data. With BioMart, users can easily extract information about genes, proteins, and other genomic features, and link this data across various biological datasets. Its flexibility and ease of use make it an invaluable tool for bioinformaticians and biologists aiming to integrate and analyze large-scale biological data efficiently. BioMart is available through its web page, but there is also the biomaRt R/Bioconductor package, which can be used to access the data programmatically.
In this tutorial, we show how to make a simple query using the biomaRt R package.
Suppose that we have a list of human peptides (Ensembl peptide IDs) and we want to retrieve their corresponding gene symbols.
gene_list <- read.table("~/genes.txt", header = TRUE, sep = "\t")
head(gene_list)
ensembl_peptide_id
1 ENSP00000330918
2 ENSP00000364486
3 ENSP00000436217
4 ENSP00000364699
5 ENSP00000432005
6 ENSP00000436669
With the biomaRt package, we can do that in three simple steps:
library(biomaRt)
- Set the ensembl channel:
ensembl <- useMart("ENSEMBL_MART_ENSEMBL")
Depending on the type of data you want to collect (e.g., variants), other channels are available. To list all available channels, use the listMarts() function.
- Choose a dataset (organism):
For human data, the dataset keyword is hsapiens_gene_ensembl. To access the keywords for other organisms, run the listDatasets() function.
ensembl <- useDataset(dataset = "hsapiens_gene_ensembl", mart = ensembl)
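To find the keyword for another organism, the data frame returned by listDatasets() can be filtered by pattern. The following is a minimal sketch, assuming a working internet connection to Ensembl BioMart; the variable name mart and the mouse pattern "mmusculus" are only illustrations.

```r
library(biomaRt)

# Connect to the Ensembl genes mart (requires internet access)
mart <- useMart("ENSEMBL_MART_ENSEMBL")

# listDatasets() returns a data frame with one row per available dataset
datasets <- listDatasets(mart)

# Keep only the rows whose dataset keyword contains "mmusculus"
mouse <- datasets[grepl("mmusculus", datasets$dataset), ]
print(mouse)
```

The value found in the dataset column is what you would pass to the dataset argument of useDataset().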
- Make the query:
Before making the query, we need to know which attribute keywords are available in the BioMart database. We can list the available attributes with the listAttributes() function. We also need to obtain the filter keywords by using the listFilters() function. In summary, the attributes are the ID keywords you want to retrieve and the filters are the ID keywords you’ll use to make the query.
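The attribute and filter tables are long, so it usually helps to search them rather than read them in full. Here is a minimal sketch of such a search. It uses a small toy data frame in place of the real listAttributes() result so that it runs offline; the helper find_keyword() is hypothetical and is not part of biomaRt.

```r
# Toy stand-in for the data frame returned by listAttributes(ensembl).
# With a live connection you would use: att <- listAttributes(ensembl)
att <- data.frame(
  name = c("ensembl_gene_id", "ensembl_peptide_id",
           "hgnc_symbol", "chromosome_name"),
  description = c("Gene stable ID", "Protein stable ID",
                  "HGNC symbol", "Chromosome/scaffold name"),
  page = "feature_page",
  stringsAsFactors = FALSE
)

# Hypothetical helper: case-insensitive search over keyword and description
find_keyword <- function(tbl, pattern) {
  hit <- grepl(pattern, tbl$name, ignore.case = TRUE) |
    grepl(pattern, tbl$description, ignore.case = TRUE)
  tbl[hit, , drop = FALSE]
}

find_keyword(att, "peptide")   # rows mentioning "peptide"
find_keyword(att, "stable id") # rows whose description mentions a stable ID
```

The same helper works on the listFilters() result, which also has name and description columns. biomaRt additionally ships searchAttributes() and searchFilters(), which perform a similar pattern search directly on the mart.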
With the proper attributes and filters selected, you can pass them to the attributes and filters arguments of the getBM() function. If you have a specific list of identifiers to convert, as we do in the gene_list data frame, pass it to the values argument. If you leave out both the filters and values arguments, the getBM() function will return the requested attributes for all entries of the chosen organism.
# Data frame with attributes
att <- listAttributes(ensembl)
# Data frame with filters
filters <- listFilters(ensembl)
# Make the query with the getBM function
ids <- getBM(attributes = c("hgnc_symbol", "ensembl_peptide_id"),
             filters = "ensembl_peptide_id",
             values = gene_list$ensembl_peptide_id,
             mart = ensembl)
head(ids)
hgnc_symbol ensembl_peptide_id
1 PCK2 ENSP00000494029
2 PCK2 ENSP00000496343
3 PCK2 ENSP00000496102
4 PCK2 ENSP00000494919
5 FOXO1 ENSP00000368880
6 PTPN2 ENSP00000320298
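Note that the rows returned by getBM() are not necessarily in the same order as the input, and identifiers without a match are simply absent from the result. The sketch below shows one way to bring the symbols back onto the original list. It uses toy data frames so that it runs offline; the symbols GENE_A and GENE_B are placeholders, not real annotations.

```r
# Toy input, mimicking the gene_list data frame from the tutorial
gene_list <- data.frame(
  ensembl_peptide_id = c("ENSP00000330918", "ENSP00000364486",
                         "ENSP00000436217"),
  stringsAsFactors = FALSE
)

# Toy stand-in for a getBM() result: different row order, one ID missing.
# The symbols are placeholders, not real annotations.
ids <- data.frame(
  hgnc_symbol = c("GENE_B", "GENE_A"),
  ensembl_peptide_id = c("ENSP00000364486", "ENSP00000330918"),
  stringsAsFactors = FALSE
)

# Look up each input ID in the result; unmatched IDs become NA
gene_list$hgnc_symbol <- ids$hgnc_symbol[
  match(gene_list$ensembl_peptide_id, ids$ensembl_peptide_id)
]

# IDs for which the query returned nothing
unmapped <- gene_list$ensembl_peptide_id[is.na(gene_list$hgnc_symbol)]

print(gene_list)
print(unmapped)
```

match() keeps only the first hit per identifier, so if one peptide ID could map to several rows, merge(gene_list, ids, by = "ensembl_peptide_id", all.x = TRUE) would be the safer choice.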
And that’s it! You can make many different queries on the database identifiers provided by BioMart.