<img src="https://github.com/caufieldjh/awesome-bioie/blob/master/images/abie_head.png" alt="Awesome BioIE Logo"/>
<br>
<a href="https://awesome.re">
    <img src="https://awesome.re/badge-flat2.svg" alt="Awesome">
</a>
<br>
How to extract information from unstructured biomedical data and text.
<br>

What is BioIE? It includes any effort to extract structured information from unstructured (or at least inconsistently structured) biological, clinical, or other biomedical data. The data source is often a collection of text documents written in technical language. If the resulting information is verifiable and consistent across sources, we may then consider it knowledge. Extracting information and producing knowledge from biomedical data requires adapting methods developed for other types of unstructured data.
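As a rough illustration of the definition above, the sketch below turns one sentence into structured records with a plain dictionary lookup. The lexicon, the `extract_mentions` function, and the example sentence are all hypothetical and exist only for this illustration; real BioIE systems rely on curated vocabularies and far more robust matching.

```python
import re

# Hypothetical mini-lexicon; real systems draw on curated vocabularies.
LEXICON = {
    "BRCA1": "gene",
    "TP53": "gene",
    "breast cancer": "disease",
}


def extract_mentions(text, lexicon=LEXICON):
    """Return (start, end, surface form, entity type) for each lexicon match."""
    mentions = []
    for term, entity_type in lexicon.items():
        # Match whole terms only, ignoring case.
        pattern = r"\b" + re.escape(term) + r"\b"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            mentions.append(
                (match.start(), match.end(), match.group(0), entity_type)
            )
    # Sort by character offset so mentions appear in reading order.
    return sorted(mentions)


sentence = "Mutations in BRCA1 are associated with breast cancer."
for mention in extract_mentions(sentence):
    print(mention)
```

Each tuple carries character offsets, the matched text, and a type, which is the kind of structured output that downstream steps such as normalization or relation extraction can build on.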

Resources included here are preferentially those available at no monetary cost and with limited license requirements. Methods and datasets should be publicly accessible and actively maintained.

See also awesome-nlp, awesome-biology and Awesome-Bioinformatics.

Please read the contribution guidelines before contributing. Please add your favourite resource by raising a pull request.

Contents

Research Overviews

Back to Top

Groups Active in the Field

Back to Top

Organizations

Back to Top

Journals and Events

The interdisciplinary nature of BioIE means researchers in this space may share their findings and tools in a variety of ways. They may publish papers in journals, as is common in the biomedical and life sciences. They may publish conference papers and, upon acceptance, give a poster and/or oral presentation at an event; this is common practice in computer science and engineering fields. Conference papers are often published in collections of proceedings. Preprint publication is an increasingly popular and institutionally accepted way to publish findings as well. Surrounding these formal, written products are the ideas of open science, open data, and open source: the code, data, and software BioIE researchers develop are valuable resources to the community.

Journals

For preprints, try arXiv, especially the subjects Computation and Language (cs.CL) and Information Retrieval (cs.IR); bioRxiv; or medRxiv, especially the Health Informatics subject area.

Conferences and Other Events

Challenges

Some events in BioIE are organized around formal tasks and challenges in which groups develop their own computational solutions, given a dataset.

Back to Top

Tutorials

The field changes rapidly enough that tutorials more than a few years old are likely to be missing crucial details. A few more recent educational resources are listed below. A good foundational understanding of text mining techniques is very helpful, as is some basic experience with the Python and/or R languages. Starting with the NLTK tutorials and then trying out the tutorials for the Flair framework will provide excellent examples of natural language processing, text mining, and modern machine learning-driven methods, all in Python. Most of the examples don’t include anything biomedical, however, so the best option may be to learn by doing.

Guides

Video Lectures and Online Courses

Back to Top

Code Libraries

Repos for Specific Datasets

Back to Top

Tools, Platforms, and Services

Annotation Tools

Back to Top

Techniques

Text Embeddings

This paper from Hongfang Liu’s group at Mayo Clinic demonstrates how text embeddings trained on biomedical or clinical text can, but don’t always, perform better on biomedical natural language processing tasks. That being said, pre-trained embeddings may be appropriate for your needs, especially as training domain-specific embeddings can be computationally intensive.

Word Embeddings

Language Models

Back to Top

Datasets

Some of the datasets listed below require a UMLS Terminology Services (UTS) account to access. Please note that the license granted with the UTS account requires users to submit an annual report about their use of UMLS resources. This is less challenging than it sounds.

Biomedical Text Sources

The following resources contain indexed text documents in the biomedical sciences.

* OHSUMED - paper - 348,566 MEDLINE entries (title and sometimes abstract) from between 1987 and 1991. Includes MeSH labels. Primarily of historical significance.
* PubMed Central Open Access Subset - A set of PubMed Central articles usable under licenses other than traditional copyright, though the exact licenses vary by publication and source. Articles are available as PDF and XML.
* CORD-19 - A corpus of scholarly manuscripts concerning COVID-19. Articles are primarily from PubMed Central and preprint servers, though the set also includes metadata on papers without full-text availability.

Annotated Text Data

Protein-protein Interaction Annotated Corpora

Protein-protein interactions are abbreviated as PPI. The following sets are available in BioC format. The older sets (AIMed, BioInfer, HPRD50, IEPA, and LLL) are available courtesy of the WBI corpora repository and were derived from the original sets by a group at Turku University.
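For readers new to BioC, here is a minimal sketch of reading interaction pairs from a BioC-style XML document using only the Python standard library. The sample document, the protein names, the `PPI` relation type, and the role labels are invented for illustration, and the element layout (annotations and relations nested under a passage) is an assumption; individual corpora may place relations at a different level or use different `infon` keys, so check the actual files before relying on this.

```python
import xml.etree.ElementTree as ET

# Hypothetical BioC-style document; not taken from any real corpus.
SAMPLE_BIOC = """<collection>
  <source>hypothetical example</source>
  <document>
    <id>doc1</id>
    <passage>
      <offset>0</offset>
      <text>ProtA binds ProtB.</text>
      <annotation id="T1">
        <infon key="type">protein</infon>
        <location offset="0" length="5"/>
        <text>ProtA</text>
      </annotation>
      <annotation id="T2">
        <infon key="type">protein</infon>
        <location offset="12" length="5"/>
        <text>ProtB</text>
      </annotation>
      <relation id="R1">
        <infon key="type">PPI</infon>
        <node refid="T1" role="Arg1"/>
        <node refid="T2" role="Arg2"/>
      </relation>
    </passage>
  </document>
</collection>"""


def read_interactions(bioc_xml):
    """Return one tuple of annotation texts per relation in each document."""
    root = ET.fromstring(bioc_xml)
    pairs = []
    for document in root.iter("document"):
        # Map annotation ids to their surface text, wherever they are nested.
        names = {
            annotation.get("id"): annotation.findtext("text")
            for annotation in document.iter("annotation")
        }
        for relation in document.iter("relation"):
            # Resolve each node's refid to the annotation text it points at.
            members = tuple(
                names.get(node.get("refid")) for node in relation.iter("node")
            )
            pairs.append(members)
    return pairs


print(read_interactions(SAMPLE_BIOC))
```

Searching within each `document` element rather than each `passage` is a deliberate choice here, so the sketch still works if a corpus stores relations at the document level.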

Other Datasets

Back to Top

Ontologies and Controlled Vocabularies

Back to Top

Data Models

Do you need a data model? If you are working with biomedical data, then the answer is probably “Yes”.

Back to Top

Credits

Credits for curators and sources.

License

CC0
