{"id":4194,"date":"2017-10-24T08:39:16","date_gmt":"2017-10-24T08:39:16","guid":{"rendered":"http:\/\/blogs.lshtm.ac.uk\/library\/?p=4194"},"modified":"2017-10-25T08:43:40","modified_gmt":"2017-10-25T08:43:40","slug":"textmining-for-health","status":"publish","type":"post","link":"https:\/\/blogs.lshtm.ac.uk\/library\/2017\/10\/24\/textmining-for-health\/","title":{"rendered":"Gain new insight into your health research using text mining"},"content":{"rendered":"<p>An abundance of scholarly resources is available to the researcher, easily discoverable through use of a few search terms. However, this abundance comes at a price: there is too much literature for a researcher to find and read themselves.<\/p>\n<p><a href=\"https:\/\/en.wikipedia.org\/wiki\/Text_mining\">Text and Data Mining<\/a> (TDM) offers a solution for health researchers wishing to analyse a large corpus of resources, including research papers, medical records, and other material, even when the information is held in an unstructured form. The resultant output may be used to identify hidden patterns that emerge over time and across geographic regions, predict and address gaps within the data, and convert content into a form better suited to modern research.<!--more--><\/p>\n<p>Common outputs of TDM activities include:<\/p>\n<ul>\n<li><em>Summarisation:<\/em>\u00a0The key points of a large document are extracted\u00a0and a shorter version produced.<\/li>\n<li><em>Extraction:<\/em> Specific entities \u2013 names, dates, diagnosis terms, or other values \u2013 are identified and put into a structured format for analysis.<\/li>\n<li><em>Categorisation:<\/em> The organisation of information into categories based upon pre-defined criteria, e.g. 
papers that refer to a specific illness or related illnesses.<\/li>\n<li><em>Visualisation:<\/em> Information contained within one or more papers\u00a0is represented in graphic form.<\/li>\n<\/ul>\n<p>The value of TDM to health research and practice has been recognised since the 1980s. Notable work in this area includes that by\u00a0<a href=\"https:\/\/doi.org\/10.1186\/gb-2008-9-s2-s8\">Krallinger, Valencia and Hirschman<\/a>\u00a0(2008), which\u00a0explored\u00a0the practical application of text mining techniques to research papers, in order to extract protein and genomic sequence information, expression profiles, and protein structure coordinates,\u00a0and a study by\u00a0<a href=\"https:\/\/doi.org\/10.1093\/database\/baw145\">Przyby\u0142a et al.<\/a>\u00a0(2016), which examined tools and services available to perform text mining in the life sciences. Authors such as <a href=\"http:\/\/www.ijstr.org\/final-print\/oct2013\/Data-Mining-Applications-In-Healthcare-Sector-A-Study.pdf\">Durairaj and Ranjani<\/a> (2013)\u00a0and <a href=\"https:\/\/arxiv.org\/ftp\/arxiv\/papers\/1606\/1606.01354.pdf\">Alkhatib<\/a> (2015) also provide case studies on the practical application\u00a0of\u00a0TDM techniques in healthcare projects.<\/p>\n<p>More recently, the European Commission made a\u00a0substantial investment in the <a href=\"http:\/\/openminted.eu\">OpenMinTeD project<\/a>\u00a0to enhance the infrastructure that underpins the mining of scientific research, and the <a href=\"http:\/\/dtmbio.net\/dtmbio2017\/\">DTMBio conference<\/a> was established to encourage research and debate into\u00a0TDM use in biomedical informatics.<\/p>\n<h3>But is it legal?<\/h3>\n<p>Application of large-scale text and data mining techniques has often been limited by arguments over rights issues. 
Traditionally, it has been necessary to obtain permission from the rights holders to extract information and convert it into a new, machine-processable form \u2013 a potentially time-consuming and expensive activity. However, the legalities of TDM were addressed by the UK government as part of the 2014 reform of the Copyright, Designs and Patents Act (<a href=\"http:\/\/www.legislation.gov.uk\/ukpga\/1988\/48\/section\/29\">section 29A<\/a>), which introduced an exception allowing UK researchers to perform text and data mining, in certain circumstances, without having to obtain individual permission. The <a href=\"https:\/\/www.gov.uk\/guidance\/exceptions-to-copyright#text-and-data-mining-for-non-commercial-research\">UK Intellectual Property Office<\/a> describes the amendment as follows:<\/p>\n<blockquote><p>An exception to copyright exists which allows researchers to make copies of any copyright material for the purpose of computational analysis if they already have the right to read the work (that is, they have \u2018lawful access\u2019 to the work). This exception only permits the making of copies for the purpose of text and data mining for non-commercial research. Researchers will still have to buy subscriptions to access material; this could be from many sources including academic publishers.<\/p><\/blockquote>\n<p>This exception allows UK researchers to copy unpublished and published in-copyright works, including research papers, data, sound, video, and other resources, to which they have <em>lawful access<\/em>, and perform text and data mining\u00a0as necessary for non-commercial research, without having to gain specific permission from the rights holder. At present, this TDM exception has not been introduced in other countries, although the potential for applying it across the European Union has been debated.<\/p>\n<p>The legal implications of applying the TDM exception to international research taking place at LSHTM can still be challenging, however. 
A <a href=\"https:\/\/www.jisc.ac.uk\/guides\/text-and-data-mining-copyright-exception\">2016 JISC guide<\/a>\u00a0discussing the scenario of a UK-affiliated research project that includes project staff located in different countries notes that, although the non-UK researcher may have lawful access to resources they wish to mine through an institutional subscription, data transfer should be performed by a UK-based researcher.<\/p>\n<h3><strong>How do open access and Creative Commons\u00a0fit into this?<\/strong><\/h3>\n<p>Open access (OA) resources are often easier to obtain and have fewer licence conditions in comparison to their subscription-access cousins, making them a prime target for analysis. Many of these resources may be mined by a researcher located anywhere in the world, even in countries where there is no TDM exception, subject to licence conditions being\u00a0met.<\/p>\n<ul>\n<li>CC-BY licensed works can be mined for any\u00a0research purpose.<\/li>\n<li>CC-NC licensed works can be mined for any non-commercial\u00a0research.<\/li>\n<li>CC-ND licensed\u00a0works cannot be\u00a0mined (ND stands for NoDerivatives).<\/li>\n<\/ul>\n<h3>How do I obtain resources for analysis?<\/h3>\n<p>Many researchers download resources to their local machine, in order to convert them to the correct format and to increase processing speed. Downloads can often take several hours or days to complete because of the size of the files. There are also resource implications for the host server. Large platforms such as <a href=\"https:\/\/en.wikipedia.org\/wiki\/Wikipedia:Database_download\">Wikipedia<\/a>, PubMed Central and ScienceDirect are sufficiently robust to allow researchers to query their websites and download a large amount of data. 
<a href=\"https:\/\/www.elsevier.com\/about\/our-business\/policies\/text-and-data-mining\">Elsevier<\/a>, for instance, provides an API (<a href=\"https:\/\/en.wikipedia.org\/wiki\/Application_programming_interface\">Application Programming Interface<\/a>) that allows researchers to query its database and download resources that meet their requirements. However, it\u2019s common for researchers to accidentally cause a smaller website to crash, or to have their access blocked, when downloading a large number of files. If you do wish to download a large number of files from a single website, it\u2019s advisable to consult the site administrator for advice on how and when this should be performed.<\/p>\n<h3><strong>Where can I find text mining tools?<\/strong><\/h3>\n<p>Many text and data mining tools exist, both open source and commercial, that can be applied to a wide range of resources (with some configuration).<\/p>\n<ul>\n<li><a href=\"https:\/\/gate.ac.uk\/\">GATE: General Architecture for Text Engineering<\/a><\/li>\n<li><a href=\"https:\/\/uima.apache.org\/\">Apache UIMA<\/a><\/li>\n<li><a href=\"https:\/\/opennlp.apache.org\/\">Apache OpenNLP<\/a><\/li>\n<li><a href=\"http:\/\/www.nltk.org\/\">Natural Language Toolkit (NLTK)<\/a><\/li>\n<\/ul>\n<p>Other TDM tools\u00a0can be found at:<\/p>\n<ul>\n<li><a href=\"https:\/\/github.com\/stepthom\/text_mining_resources\">Stephen Thomas\u2019 Text Mining Resources<\/a><\/li>\n<li><a href=\"http:\/\/www.nactem.ac.uk\/\">The National Centre for Text Mining<\/a><\/li>\n<li><a href=\"https:\/\/en.wikipedia.org\/wiki\/List_of_text_mining_software\">Text and Data Mining Wikipedia page<\/a><\/li>\n<\/ul>\n<p>Image:\u00a0<a href=\"https:\/\/www.flickr.com\/photos\/museumwales\/4051901771\/\">Amgueddfa Cymru &#8211; National Museum Wales. 
Strike Poster.<\/a> (CC BY-NC 2.0)<\/p>\n","protected":false},"excerpt":{"rendered":"<p>An abundance of scholarly resources are available to the researcher, easily discoverable through use of&#8230;<\/p>\n","protected":false},"author":121,"featured_media":4207,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[81946,81948,1],"tags":[64529,35,64528],"class_list":["post-4194","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-open-access","category-research-data-management","category-uncategorised","tag-data-mining","tag-research-papers","tag-text-mining","odd"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v26.9 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>Gain new insight into your health research using text mining - Library, Archive &amp; Open Research Services blog<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/blogs.lshtm.ac.uk\/library\/2017\/10\/24\/textmining-for-health\/\" \/>\n<meta property=\"og:locale\" content=\"en_GB\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Gain new insight into your health research using text mining - Library, Archive &amp; Open Research Services blog\" \/>\n<meta property=\"og:description\" content=\"An abundance of scholarly resources are available to the researcher, easily discoverable through use of...\" \/>\n<meta property=\"og:url\" content=\"https:\/\/blogs.lshtm.ac.uk\/library\/2017\/10\/24\/textmining-for-health\/\" \/>\n<meta property=\"og:site_name\" content=\"Library, Archive &amp; Open Research Services blog\" \/>\n<meta property=\"article:published_time\" content=\"2017-10-24T08:39:16+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2017-10-25T08:43:40+00:00\" 
\/>\n<meta property=\"og:image\" content=\"https:\/\/blogs.lshtm.ac.uk\/library\/files\/2017\/10\/miner-difficulty.gif\" \/>\n\t<meta property=\"og:image:width\" content=\"265\" \/>\n\t<meta property=\"og:image:height\" content=\"312\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/gif\" \/>\n<meta name=\"author\" content=\"Gareth Knight\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@LSHTMlibrary\" \/>\n<meta name=\"twitter:site\" content=\"@LSHTMlibrary\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Gareth Knight\" \/>\n\t<meta name=\"twitter:label2\" content=\"Estimated reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"5 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/blogs.lshtm.ac.uk\/library\/2017\/10\/24\/textmining-for-health\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/blogs.lshtm.ac.uk\/library\/2017\/10\/24\/textmining-for-health\/\"},\"author\":{\"name\":\"Gareth Knight\",\"@id\":\"https:\/\/blogs.lshtm.ac.uk\/library\/#\/schema\/person\/88dbaea1053dd6869c8ea4a17d7bf8ae\"},\"headline\":\"Gain new insight into your health research using text mining\",\"datePublished\":\"2017-10-24T08:39:16+00:00\",\"dateModified\":\"2017-10-25T08:43:40+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/blogs.lshtm.ac.uk\/library\/2017\/10\/24\/textmining-for-health\/\"},\"wordCount\":1010,\"image\":{\"@id\":\"https:\/\/blogs.lshtm.ac.uk\/library\/2017\/10\/24\/textmining-for-health\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/blogs.lshtm.ac.uk\/library\/files\/2017\/10\/miner-difficulty.gif\",\"keywords\":[\"data mining\",\"research papers\",\"text mining\"],\"articleSection\":[\"Open Access\",\"Research Data 
Management\"],\"inLanguage\":\"en-GB\"},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/blogs.lshtm.ac.uk\/library\/2017\/10\/24\/textmining-for-health\/\",\"url\":\"https:\/\/blogs.lshtm.ac.uk\/library\/2017\/10\/24\/textmining-for-health\/\",\"name\":\"Gain new insight into your health research using text mining - Library, Archive &amp; Open Research Services blog\",\"isPartOf\":{\"@id\":\"https:\/\/blogs.lshtm.ac.uk\/library\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/blogs.lshtm.ac.uk\/library\/2017\/10\/24\/textmining-for-health\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/blogs.lshtm.ac.uk\/library\/2017\/10\/24\/textmining-for-health\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/blogs.lshtm.ac.uk\/library\/files\/2017\/10\/miner-difficulty.gif\",\"datePublished\":\"2017-10-24T08:39:16+00:00\",\"dateModified\":\"2017-10-25T08:43:40+00:00\",\"author\":{\"@id\":\"https:\/\/blogs.lshtm.ac.uk\/library\/#\/schema\/person\/88dbaea1053dd6869c8ea4a17d7bf8ae\"},\"inLanguage\":\"en-GB\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/blogs.lshtm.ac.uk\/library\/2017\/10\/24\/textmining-for-health\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-GB\",\"@id\":\"https:\/\/blogs.lshtm.ac.uk\/library\/2017\/10\/24\/textmining-for-health\/#primaryimage\",\"url\":\"https:\/\/blogs.lshtm.ac.uk\/library\/files\/2017\/10\/miner-difficulty.gif\",\"contentUrl\":\"https:\/\/blogs.lshtm.ac.uk\/library\/files\/2017\/10\/miner-difficulty.gif\",\"width\":265,\"height\":312},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/blogs.lshtm.ac.uk\/library\/#website\",\"url\":\"https:\/\/blogs.lshtm.ac.uk\/library\/\",\"name\":\"Library, Archive &amp; Open Research Services blog\",\"description\":\"News &amp; features from 
LAORS\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/blogs.lshtm.ac.uk\/library\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-GB\"},{\"@type\":\"Person\",\"@id\":\"https:\/\/blogs.lshtm.ac.uk\/library\/#\/schema\/person\/88dbaea1053dd6869c8ea4a17d7bf8ae\",\"name\":\"Gareth Knight\",\"description\":\"Research Data Manager at the London School of Hygiene &amp; Tropical Medicine\",\"sameAs\":[\"http:\/\/datacompass.lshtm.ac.uk\/\"],\"url\":\"https:\/\/blogs.lshtm.ac.uk\/library\/author\/alibgkni\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Gain new insight into your health research using text mining - Library, Archive &amp; Open Research Services blog","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/blogs.lshtm.ac.uk\/library\/2017\/10\/24\/textmining-for-health\/","og_locale":"en_GB","og_type":"article","og_title":"Gain new insight into your health research using text mining - Library, Archive &amp; Open Research Services blog","og_description":"An abundance of scholarly resources are available to the researcher, easily discoverable through use of...","og_url":"https:\/\/blogs.lshtm.ac.uk\/library\/2017\/10\/24\/textmining-for-health\/","og_site_name":"Library, Archive &amp; Open Research Services blog","article_published_time":"2017-10-24T08:39:16+00:00","article_modified_time":"2017-10-25T08:43:40+00:00","og_image":[{"width":265,"height":312,"url":"https:\/\/blogs.lshtm.ac.uk\/library\/files\/2017\/10\/miner-difficulty.gif","type":"image\/gif"}],"author":"Gareth Knight","twitter_card":"summary_large_image","twitter_creator":"@LSHTMlibrary","twitter_site":"@LSHTMlibrary","twitter_misc":{"Written by":"Gareth 
Knight","Estimated reading time":"5 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/blogs.lshtm.ac.uk\/library\/2017\/10\/24\/textmining-for-health\/#article","isPartOf":{"@id":"https:\/\/blogs.lshtm.ac.uk\/library\/2017\/10\/24\/textmining-for-health\/"},"author":{"name":"Gareth Knight","@id":"https:\/\/blogs.lshtm.ac.uk\/library\/#\/schema\/person\/88dbaea1053dd6869c8ea4a17d7bf8ae"},"headline":"Gain new insight into your health research using text mining","datePublished":"2017-10-24T08:39:16+00:00","dateModified":"2017-10-25T08:43:40+00:00","mainEntityOfPage":{"@id":"https:\/\/blogs.lshtm.ac.uk\/library\/2017\/10\/24\/textmining-for-health\/"},"wordCount":1010,"image":{"@id":"https:\/\/blogs.lshtm.ac.uk\/library\/2017\/10\/24\/textmining-for-health\/#primaryimage"},"thumbnailUrl":"https:\/\/blogs.lshtm.ac.uk\/library\/files\/2017\/10\/miner-difficulty.gif","keywords":["data mining","research papers","text mining"],"articleSection":["Open Access","Research Data Management"],"inLanguage":"en-GB"},{"@type":"WebPage","@id":"https:\/\/blogs.lshtm.ac.uk\/library\/2017\/10\/24\/textmining-for-health\/","url":"https:\/\/blogs.lshtm.ac.uk\/library\/2017\/10\/24\/textmining-for-health\/","name":"Gain new insight into your health research using text mining - Library, Archive &amp; Open Research Services 
blog","isPartOf":{"@id":"https:\/\/blogs.lshtm.ac.uk\/library\/#website"},"primaryImageOfPage":{"@id":"https:\/\/blogs.lshtm.ac.uk\/library\/2017\/10\/24\/textmining-for-health\/#primaryimage"},"image":{"@id":"https:\/\/blogs.lshtm.ac.uk\/library\/2017\/10\/24\/textmining-for-health\/#primaryimage"},"thumbnailUrl":"https:\/\/blogs.lshtm.ac.uk\/library\/files\/2017\/10\/miner-difficulty.gif","datePublished":"2017-10-24T08:39:16+00:00","dateModified":"2017-10-25T08:43:40+00:00","author":{"@id":"https:\/\/blogs.lshtm.ac.uk\/library\/#\/schema\/person\/88dbaea1053dd6869c8ea4a17d7bf8ae"},"inLanguage":"en-GB","potentialAction":[{"@type":"ReadAction","target":["https:\/\/blogs.lshtm.ac.uk\/library\/2017\/10\/24\/textmining-for-health\/"]}]},{"@type":"ImageObject","inLanguage":"en-GB","@id":"https:\/\/blogs.lshtm.ac.uk\/library\/2017\/10\/24\/textmining-for-health\/#primaryimage","url":"https:\/\/blogs.lshtm.ac.uk\/library\/files\/2017\/10\/miner-difficulty.gif","contentUrl":"https:\/\/blogs.lshtm.ac.uk\/library\/files\/2017\/10\/miner-difficulty.gif","width":265,"height":312},{"@type":"WebSite","@id":"https:\/\/blogs.lshtm.ac.uk\/library\/#website","url":"https:\/\/blogs.lshtm.ac.uk\/library\/","name":"Library, Archive &amp; Open Research Services blog","description":"News &amp; features from LAORS","potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/blogs.lshtm.ac.uk\/library\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-GB"},{"@type":"Person","@id":"https:\/\/blogs.lshtm.ac.uk\/library\/#\/schema\/person\/88dbaea1053dd6869c8ea4a17d7bf8ae","name":"Gareth Knight","description":"Research Data Manager at the London School of Hygiene &amp; Tropical 
Medicine","sameAs":["http:\/\/datacompass.lshtm.ac.uk\/"],"url":"https:\/\/blogs.lshtm.ac.uk\/library\/author\/alibgkni\/"}]}},"_links":{"self":[{"href":"https:\/\/blogs.lshtm.ac.uk\/library\/wp-json\/wp\/v2\/posts\/4194","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blogs.lshtm.ac.uk\/library\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blogs.lshtm.ac.uk\/library\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blogs.lshtm.ac.uk\/library\/wp-json\/wp\/v2\/users\/121"}],"replies":[{"embeddable":true,"href":"https:\/\/blogs.lshtm.ac.uk\/library\/wp-json\/wp\/v2\/comments?post=4194"}],"version-history":[{"count":16,"href":"https:\/\/blogs.lshtm.ac.uk\/library\/wp-json\/wp\/v2\/posts\/4194\/revisions"}],"predecessor-version":[{"id":4242,"href":"https:\/\/blogs.lshtm.ac.uk\/library\/wp-json\/wp\/v2\/posts\/4194\/revisions\/4242"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/blogs.lshtm.ac.uk\/library\/wp-json\/wp\/v2\/media\/4207"}],"wp:attachment":[{"href":"https:\/\/blogs.lshtm.ac.uk\/library\/wp-json\/wp\/v2\/media?parent=4194"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blogs.lshtm.ac.uk\/library\/wp-json\/wp\/v2\/categories?post=4194"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blogs.lshtm.ac.uk\/library\/wp-json\/wp\/v2\/tags?post=4194"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}