Text and Data Mining (TDM) offers a solution for health researchers wishing to analyse a large corpus of resources, including research papers, medical records, and other material, even when the information is held in an unstructured form. The resultant output may be used to identify hidden patterns that emerge over time and across geographic regions, predict and address gaps within the data, and convert content into a form better suited to modern research.
Common outputs produced as a result of TDM activities include:
- Summarization: The key points of a large document are extracted and a shorter version produced.
- Extraction: Specific entities (names, dates, diagnosis terms, or other values) are identified and put into a structured format for analysis.
- Categorization: Information is organised into categories based upon pre-defined criteria, e.g. papers that refer to a specific illness or related illnesses.
- Visualisation: Information contained within one or more papers is represented in graphic form.
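As a rough illustration of extraction and categorization, the sketch below pulls dates and diagnosis terms out of a short piece of text using only Python's standard library. The sample text, the term list and the function names are all invented for this example; real projects would normally rely on a dedicated text mining toolkit and a curated vocabulary rather than simple pattern matching.

```python
import re
from collections import Counter
from typing import Dict, List

# Hypothetical sample text and term list, invented for illustration only.
SAMPLE = (
    "Patient seen on 2014-03-12 with suspected malaria. "
    "Follow-up on 2014-04-02 confirmed malaria; no sign of tuberculosis."
)
DIAGNOSIS_TERMS = ["malaria", "tuberculosis", "cholera"]

# Matches dates written as YYYY-MM-DD (an assumption about the input format).
DATE_PATTERN = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def extract_entities(text: str, terms: List[str]) -> Dict[str, object]:
    """Extraction: return the dates and counts of known diagnosis terms in text."""
    dates = DATE_PATTERN.findall(text)
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(word for word in words if word in terms)
    return {"dates": dates, "diagnoses": dict(counts)}


def categorise(text: str, terms: List[str]) -> List[str]:
    """Categorization: list every term (category) that appears at least once."""
    found = extract_entities(text, terms)["diagnoses"]
    return sorted(found)


if __name__ == "__main__":
    print(extract_entities(SAMPLE, DIAGNOSIS_TERMS))
    print(categorise(SAMPLE, DIAGNOSIS_TERMS))
```

Matching whole words against a fixed list is the simplest possible approach; it will miss misspellings, abbreviations and negated statements such as "no sign of tuberculosis", which is one reason purpose-built tools are usually preferred.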
The value of TDM to health research and practice has been recognised since the 1980s. Notable work in this area includes that by Krallinger, Valencia and Hirschman (2008), which explored the practical application of text mining techniques to research papers in order to extract protein and genomic sequence information, expression profiles, and protein structure coordinates, and a study by Przybyła et al. (2016), which examined tools and services available to perform text mining in the life sciences. Authors such as Durairaj and Ranjani (2013) and Alkhatib (2015) also provide case studies on the practical application of TDM techniques in healthcare projects.
More recently, the European Commission made a substantial investment in the OpenMinTeD project to enhance the infrastructure that underpins the mining of scientific research, and the DTMBio conference was established to encourage research and debate into the use of TDM in biomedical informatics.
But is it legal?
The application of large-scale text and data mining techniques has often been limited by arguments over rights issues. Traditionally, it has been necessary to obtain permission from the rights holders to extract information and convert it into a new, machine-processable form, a potentially time-consuming and expensive activity. However, the legalities of TDM were addressed by the UK government as part of the 2014 reform of the Copyright, Designs and Patents Act (section 29A), which introduced permission for UK researchers to perform text and data mining without having to obtain individual permission in certain circumstances. The UK Intellectual Property Office describes the amendment as follows:
An exception to copyright exists which allows researchers to make copies of any copyright material for the purpose of computational analysis if they already have the right to read the work (that is, they have ‘lawful access’ to the work). This exception only permits the making of copies for the purpose of text and data mining for non-commercial research. Researchers will still have to buy subscriptions to access material; this could be from many sources including academic publishers.
This exception allows UK researchers to copy unpublished and published in-copyright works, including research papers, data, sound, video, and other resources, to which they have *lawful access*, and perform text and data mining as necessary for non-commercial research, without having to gain specific permission from the rights holder. At present this TDM exception has not been introduced in other countries, although the potential for applying it across the European Union has been debated.
The legal implications of applying the TDM exception to international research taking place at LSHTM can still be challenging, however. A 2016 JISC guide discussing the scenario of a UK-affiliated research project that includes project staff located in different countries notes that, although the non-UK researcher may have lawful access to resources they wish to mine through an institutional subscription, data transfer should be performed by a UK-based researcher.
How do open access and Creative Commons fit into this?
OA resources are often easier to obtain and have fewer licence conditions in comparison to their subscription access cousins, making them a prime target for analysis. Many of these resources may be mined by a researcher located anywhere in the world, even in countries where there is no TDM exception, subject to licence conditions being met.
- CC-BY licensed works can be mined for any research purpose.
- CC-NC licensed works can be mined for any non-commercial research.
- CC-ND licensed works cannot be mined (ND stands for NoDerivatives).
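When working through a large collection, the rules of thumb above could be applied automatically to each item's licence label. The following is a minimal sketch that encodes only the three simplified rules listed here; the function name and the way licence labels are parsed are assumptions made for illustration, and the result is no substitute for reading the actual licence terms.

```python
def may_mine(licence: str, commercial: bool) -> bool:
    """Apply the three rules of thumb to a label such as 'CC-BY-NC' or 'CC BY-NC 2.0'.

    commercial: True if the planned research is commercial.
    Raises ValueError for labels that do not look like a Creative Commons licence.
    """
    # Normalise 'CC BY-NC 2.0' and 'cc-by-nc' to the same set of tokens.
    tokens = set(licence.upper().replace(" ", "-").split("-"))
    if "ND" in tokens:
        return False           # NoDerivatives: treated here as not minable
    if "NC" in tokens:
        return not commercial  # NonCommercial: non-commercial research only
    if "BY" in tokens:
        return True            # Attribution only: any research purpose
    raise ValueError(f"Unrecognised licence label: {licence!r}")


if __name__ == "__main__":
    for label in ["CC-BY", "CC BY-NC 2.0", "CC-BY-ND"]:
        print(label, may_mine(label, commercial=False))
```

Checking ND before NC means that a combined licence such as CC-BY-NC-ND is handled by the most restrictive rule first.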
How do I obtain resources for analysis?
Many researchers download resources to their local machine in order to convert them to the correct format and to increase processing speed. Downloads can often take several hours or days because of the size of the files, and there are also resource implications for the host server. Large platforms such as Wikipedia, PubMed Central and ScienceDirect are sufficiently robust to allow researchers to query their websites and download a large amount of data. Elsevier, for instance, provides an API (Application Programming Interface) that allows researchers to query its database and download resources that meet their requirements. However, it’s common for researchers to accidentally crash a smaller website, or have their access blocked, when downloading a large number of files. If you do wish to download a large number of files from a single website, it’s advisable to consult the administrator for advice on how and when this should be performed.
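One simple way to reduce the load on a smaller website is to fetch files one at a time with a pause between requests. The sketch below shows the idea using Python's standard library; the function names, the five-second default delay and the User-Agent text are assumptions chosen for illustration, and an appropriate delay should be agreed with the site's administrator.

```python
import time
import urllib.request
from typing import Callable, Dict, Iterable


def fetch_url(url: str, timeout: float = 30.0) -> bytes:
    """Download a single resource. The User-Agent string is a placeholder."""
    request = urllib.request.Request(
        url, headers={"User-Agent": "tdm-research-sketch (contact: you@example.org)"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


def polite_download(
    urls: Iterable[str],
    delay_seconds: float = 5.0,
    fetch: Callable[[str], bytes] = fetch_url,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, bytes]:
    """Fetch each URL in turn, pausing between consecutive requests."""
    results: Dict[str, bytes] = {}
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # wait before every request except the first
        results[url] = fetch(url)
    return results
```

The `fetch` and `sleep` parameters exist only so that the pacing logic can be tried out without touching the network, for example by passing `fetch=lambda url: b""`.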
Where can I find text mining tools?
Many text and data mining tools exist, both open source and commercial, that can be applied to a wide range of resources (with some configuration). Examples include:
- GATE: General Architecture for Text Engineering
- Apache UIMA
- Apache OpenNLP
- Natural Language Toolkit (NLTK)
Other TDM tools can be found at:
- Stephen Thomas’ Text Mining Resources
- The National Center for Text Mining
- Text and Data Mining Wikipedia page
Image: Amgueddfa Cymru – National Museum Wales. Strike Poster. (CC BY-NC 2.0)