MedCAT (Medical Concept Annotation Tool) is an open-source Named Entity Recognition and Linking (NER+L) toolkit developed as part of the CogStack ecosystem, primarily by researchers at King's College London and partner NHS trusts. It extracts and structures information from unstructured biomedical documents, such as Electronic Health Records (EHRs), and links the clinical concepts it identifies to major biomedical ontologies like SNOMED CT and UMLS.
MedCAT uses a self-supervised machine learning approach for concept extraction and disambiguation, and its models have been shown to transfer well between hospitals and datasets. It is lightweight, fast, easy to use, and suited to large-scale entity extraction. The software is distributed under the Elastic License 2.0.
Key Capabilities:
- NER+L: Named Entity Recognition and Linking to millions of biomedical concepts.
- MetaCAT: A component for detecting the status of a concept (e.g., affirmed, negated, or hypothetical).
- MedCATtrainer: An accompanying open-source web interface (with a REST API) that allows clinicians and annotators to inspect, improve, and customize MedCAT models through supervised training and active learning.
- Scalability: Supports multiprocessing for handling large datasets (100M+ documents).
- Technology: The core library is written in Python, and the newer MedCAT v2 uses transformer-based models to capture context better and improve robustness.
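
To make the NER+L and concept-status ideas above concrete, here is a minimal, self-contained Python sketch. It is not MedCAT's API or method: it swaps in a plain dictionary lookup for entity linking and a crude keyword rule for negation, whereas MedCAT relies on trained models. The concept IDs (`DEMO-001` and so on), the `CONCEPTS` table, the `Annotation` class, and the `annotate` function are all hypothetical names invented for illustration.

```python
import re
from dataclasses import dataclass
from typing import List

# Hypothetical mini concept table: surface form -> made-up concept ID.
# A real system would link to SNOMED CT or UMLS identifiers instead.
CONCEPTS = {
    "diabetes": "DEMO-001",
    "chest pain": "DEMO-002",
    "fever": "DEMO-003",
}

# Naive negation cues; a toy stand-in for a trained status classifier.
NEGATION_CUES = ("no", "denies", "without")


@dataclass
class Annotation:
    text: str        # the mention as it appears in the document
    concept_id: str  # the (hypothetical) linked concept
    start: int       # character offset where the mention starts
    end: int         # character offset just past the mention
    status: str      # "affirmed" or "negated"


def annotate(document: str, window: int = 5) -> List[Annotation]:
    """Find known surface forms and tag each with a toy negation status."""
    lowered = document.lower()
    found = []
    for name, concept_id in CONCEPTS.items():
        pattern = r"\b" + re.escape(name) + r"\b"
        for match in re.finditer(pattern, lowered):
            # Only consider words earlier in the same sentence.
            sentence_prefix = re.split(r"[.!?]", lowered[:match.start()])[-1]
            recent = re.findall(r"[a-z]+", sentence_prefix)[-window:]
            negated = any(token in NEGATION_CUES for token in recent)
            found.append(
                Annotation(
                    text=document[match.start():match.end()],
                    concept_id=concept_id,
                    start=match.start(),
                    end=match.end(),
                    status="negated" if negated else "affirmed",
                )
            )
    # Return mentions in document order.
    return sorted(found, key=lambda a: a.start)


if __name__ == "__main__":
    for ann in annotate("Patient has diabetes. Denies chest pain or fever."):
        print(ann)
```

Each annotation carries a character span, a linked concept, and a status, which is roughly the kind of structured output an NER+L pipeline with a status component is meant to produce. A keyword rule like this breaks down quickly on real clinical text, which is the gap that learned, context-aware components are intended to close.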
