Working with high-stakes corpora like legal or clinical texts presents a unique challenge for NLP engineers. Here's how to think about it.
Doctors and lawyers are expected to keep up with a staggering amount of information. Each year, nearly 5 million scientific papers and legal opinions are published, and it is nearly impossible for these professionals to keep up with that deluge. But what if there were a better way to use this wealth of information than expecting each interested reader to skim millions of documents? Natural language processing, a subset of artificial intelligence, focuses on producing a solution to that problem.
When considering a use case for NLP, I often find that potential start-up innovators focus on the headlines without really understanding what NLP can do. This leads to poorly defined problem statements and proposal rejections. It helps to frame your task in the most explicit terms you can. Do you want a summarizer: something that can take in those 5 million articles and give you three or four sentences about each? Do you want a classifier: something that can read all 5 million articles and decide, based on criteria you define, which articles are trash and which are legitimate? Do you want a decision-maker: something that can read all those options, extract the relevant information, and put it to use? These are the questions you should be asking, because if you can't frame the problem as explicitly as possible, artificial intelligence applications aren't going to help your use case.

This is because NLP is a very broad field within the even broader field of artificial intelligence. A summarizer uses mechanisms called 'attention' and 'pointer-generators', which use linear algebra to vectorize information and assign higher weights to more relevant examples, mimicking human attention. A classifier normally falls under the broad category of sentiment analysis: you give the system enough examples of whatever you want to classify, and it learns features of the data that let it distinguish between the classes. A decision-maker typically uses extracted data to inform its decisions, so you'd want a named entity recognition model to pull out that data and a deep reinforcement learning model to put it to use. Underlying all of this is an ambiguity limiter, which determines the meaning of a word from its context. You can see how this begins to spiral out of control, which hopefully illustrates the necessity of a cleanly defined problem statement from the very beginning.
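To make the attention idea above concrete, here is a minimal sketch of dot-product attention in NumPy. The token embeddings and the query vector are made up for illustration; in a real summarizer they would come from a trained encoder, not be hand-written.

```python
import numpy as np

# Hypothetical 4-dimensional embeddings for a few tokens.
# In practice these are learned, not hand-written.
tokens = ["chemotherapy", "dosage", "the", "patient"]
embeddings = np.array([
    [0.9, 0.1, 0.3, 0.7],
    [0.8, 0.2, 0.4, 0.6],
    [0.1, 0.9, 0.1, 0.1],
    [0.5, 0.5, 0.5, 0.5],
])

# A query vector standing in for "what the summary cares about".
query = np.array([1.0, 0.0, 0.2, 0.8])

# Dot-product attention: score each token against the query,
# then softmax the scores into weights that sum to 1.
scores = embeddings @ query
weights = np.exp(scores) / np.exp(scores).sum()

for tok, w in zip(tokens, weights):
    print(f"{tok}: {w:.2f}")
```

Content words like "chemotherapy" end up with higher weights than filler words like "the", which is exactly the behavior the summarizer exploits when deciding what to keep.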
In this post specifically, I'm going to assume the user wants to extract useful data points and connect those outputs to a decision-making process that replicates what a professional naturally does when reading a white paper. So let's break that process down: what do clinicians do when they read white papers? What do they look for? Your standard doctor typically wants to know the most up-to-date treatment options for whatever they've diagnosed you with. They want those options to be reliable, they want to limit the side effects, they want the required materials to be available at the hospital or practice where they work, and they want the procedure to be as low cost as possible. As I mentioned previously, determining whether or not a paper is reliable is a separate task: it involves a fair amount of domain knowledge and a classifier model that can correctly analyze the corpora. You'll need a named entity recognizer, but only after ambiguous terms are properly lemmatized or grouped. You'll need a long-term-memory deep reinforcement learning mechanism that makes decisions by comparing each treatment to the others, which in turn means you'll need some kind of classifier that determines whether the sum total of one paper's approach is better than another's. That is far too much to cover in one post, so I'm going to focus on just the named entity recognition model and the ambiguity reduction model, since they're linked.
Named entity recognition and word sense disambiguation are two major problems in natural language processing. Both are difficult because of the 'chicken and the egg' data problem I've mentioned in a previous post: you essentially need the model working already if you want to train it without significant manual labor. That manual labor isn't easy either. You can't just have an Amazon Mechanical Turk worker do it on the cheap; you need a licensed professional to go through and hand-annotate all the data for you. Obviously, this isn't a good solution. Luckily, there's a lot of work being done in the field. There are already some publicly available datasets that cover this exact problem, through the ShARe project and the i2b2 initiative. There's also plenty of work being done on unsupervised models that learn to annotate documents themselves. Those unsupervised models are what my work focuses on, and here are some of the unique challenges faced by those working with clinical texts.
Clinical documents include the following categories: discharge summaries, electrocardiogram reports, echocardiogram reports, radiology reports, white papers, discussions of a particular treatment, and reviews of a variety of treatments. For the sake of remaining HIPAA compliant, it's probably best to focus on the latter three. One of the major issues with analyzing clinical white papers is their ambiguity. A single concept can not only be represented in a variety of ways, it can also be a multi-token concept (i.e. span multiple words) or appear as a discontinuous mention (i.e. require cross-word or cross-sentence relationship analysis to properly categorize). The objective of all of these NLP approaches is to group each concept under a defined label or identifier. A doctor can easily distinguish between drug names, procedure names, disorder names, and so on; the computer system needs to be able to do the same. To do that, the model needs a method of collecting all of these unique terms into groups. In my next post, I'll go over several approaches to taking all these unique terms and assigning each one a CUI, a concept unique identifier.
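As a preview of what that grouping looks like, here is a minimal sketch of collapsing surface variants onto a single concept identifier. The CUI strings below are placeholders I made up, not real UMLS codes, and the dictionary is a stand-in for the learned mapping the next post will cover.

```python
# Placeholder mapping from surface forms to concept identifiers.
# Multi-token mentions are handled here simply by using phrase keys;
# discontinuous mentions would need real relationship analysis.
SURFACE_TO_CUI = {
    "heart attack": "C0000001",
    "myocardial infarction": "C0000001",
    "mi": "C0000001",
    "high blood pressure": "C0000002",
    "hypertension": "C0000002",
}

def normalize(mention):
    """Map a mention to its CUI, or None if the concept is unknown."""
    return SURFACE_TO_CUI.get(mention.lower())

print(normalize("Myocardial Infarction"))  # C0000001
print(normalize("heart attack"))           # C0000001 (same concept)
print(normalize("hives"))                  # None (not in the dictionary)
```

Once every variant resolves to one identifier, downstream components (the classifier, the decision-maker) can compare papers concept-to-concept instead of string-to-string.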