Authors
- Jonathan Batty (University of Leeds)
- Marlous Hall (University of Leeds)
Abstract
Routinely collected healthcare data (including electronic healthcare records and administrative data) are increasingly available at the whole-population scale, and may span decades of data collection. These data may be analysed as part of clinical, pharmacoepidemiologic and health services research, producing insights that improve future clinical care. However, the analysis of healthcare data on this scale presents a number of challenges. These include the storage of diagnosis, medication and procedure codes using a number of discordant systems (including ICD-9, ICD-10, SNOMED-CT and Read codes) and the inherently relational nature of the data (each patient has multiple clinical contacts, during which multiple codes may be recorded). Pre-processing and analysing these data using optimised methods has a number of benefits, including reduced computational requirements, analytic time, carbon footprint and cost. We will focus on one of the main issues faced by the healthcare data analyst: how to most efficiently collapse multiple, disparate diagnosis codes (stored as strings across a number of variables) into a discrete disease entity, using a pre-defined code list. A number of approaches (including the use of Boolean logic, the inlist function, string functions and regular expressions) will be sequentially benchmarked in a large, real-world healthcare dataset (n = 192 million hospitalisation episodes during a 12-year period; approximately 1 terabyte of data). The time and space complexity of each approach, in addition to its carbon footprint, will be reported. The most efficient strategy has been implemented in our newly developed Stata command, codefinder, which will be discussed.
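As a rough illustration of the four matching strategies named in the abstract, the sketch below flags episodes containing any code from a code list. It is written in Python rather than Stata, it is not the codefinder implementation, and the code list (ICD-10 prefixes I21 and I22), the function names and the toy episodes are all assumptions made for this example.

```python
import re

# Hypothetical code list: ICD-10 three-character prefixes (illustrative only).
CODE_PREFIXES = ("I21", "I22")
CODE_PATTERN = re.compile(r"^I2[12]")


def match_boolean(codes):
    """Explicit Boolean comparisons on the first three characters."""
    return any(c[:3] == "I21" or c[:3] == "I22" for c in codes)


def match_membership(codes):
    """Membership test against the code list (similar in spirit to inlist)."""
    return any(c[:3] in CODE_PREFIXES for c in codes)


def match_string(codes):
    """String function: prefix test against every entry in the code list."""
    return any(c.startswith(CODE_PREFIXES) for c in codes)


def match_regex(codes):
    """Regular expression anchored at the start of each code."""
    return any(CODE_PATTERN.match(c) for c in codes)


# Toy episodes: each row holds the diagnosis variables for one episode,
# with empty strings standing in for unused diagnosis positions.
episodes = [
    ["I210", "E119", ""],
    ["J449", "", ""],
    ["N179", "I229", ""],
]

# Collapse the diagnosis columns of each episode into one disease flag.
flags = [match_regex(e) for e in episodes]
```

All four functions return the same flag for a given episode; they differ only in how the comparison is expressed, which is the kind of difference a benchmark of running time and memory use would measure.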
Handle: RePEc:boc:lsug24:21