Engineers have long strived to make machines perform tasks that human beings do, which has led to the birth of the field of machine learning. Understanding the language humans speak constitutes a vital part of this field. The branch of computer science that deals with human-machine interaction, and in particular with computer programs that can process natural language efficiently, is known as Natural Language Processing, usually abbreviated as NLP.
NLP sits at the intersection of computer science, artificial intelligence, and computational linguistics. By utilizing Natural Language Processing algorithms, developers can organize and structure textual data to perform tasks such as automatic summarization, translation, named entity recognition, relationship extraction, sentiment analysis, speech recognition, and topic segmentation. (En.wikipedia.org, 2017)
Natural Language Processing is characterized as a hard problem in computer science since human language is rarely precise or plainly spoken. To understand human language, one must understand not only the words but also their context, and how they interconnect to form meaning. The vagueness and ambiguity of human language make it difficult for computers to learn, even though humans learn it easily.
Components of NLP
NLP has two components:
1. Natural Language Understanding (NLU)
This involves understanding the different aspects of the language and mapping natural-language input text to useful representations. It is the harder of the two components, since it has to deal with the ambiguity and complexity of the language. There are three main levels of ambiguity:
- Word-level or Lexical Ambiguity
- Syntax Level or Parsing Ambiguity
- Referential Ambiguity
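Word-level ambiguity can be illustrated with a small sketch. The example below is a toy version of overlap-based word sense disambiguation (in the spirit of the simplified Lesk approach), not a production method. The `SENSES` table, its cue words, and the `disambiguate` function are hypothetical and hand-written purely for illustration.

```python
import re

# Hypothetical, hand-written sense inventory for illustration only.
# A real system would draw senses from a lexical resource.
SENSES = {
    "bank": {
        "financial institution": {"money", "deposit", "loan", "account", "cash"},
        "river edge": {"river", "water", "shore", "fishing", "boat"},
    }
}


def disambiguate(word, sentence):
    """Return the sense whose cue words overlap most with the sentence."""
    context = set(re.findall(r"[a-z]+", sentence.lower()))
    senses = SENSES[word]
    return max(senses, key=lambda sense: len(senses[sense] & context))


print(disambiguate("bank", "She opened an account at the bank to deposit money."))
print(disambiguate("bank", "We sat on the bank of the river and went fishing."))
```

The same word "bank" resolves to a different sense in each sentence because the surrounding words differ, which is exactly the kind of context a computer has to be taught to use.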
2. Natural Language Generation (NLG)
As evident from the name, NLG is the process of producing or generating meaningful phrases and sentences in the form of natural language. It involves text planning, sentence planning, and text realization.
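The three NLG stages named above can be sketched as a tiny template-based pipeline. This is only an illustrative toy under stated assumptions: the weather record, its field names, and the functions `plan_text`, `plan_sentences`, `realize`, and `generate` are all hypothetical, and real NLG systems are far more elaborate.

```python
def plan_text(record):
    """Text planning: decide which facts to mention and in what order."""
    return [("city", record["city"]), ("sky", record["sky"]), ("temp_c", record["temp_c"])]


def plan_sentences(facts):
    """Sentence planning: group the chosen facts into sentence-sized specifications."""
    data = dict(facts)
    return [
        {"subject": f"the weather in {data['city']}", "verb": "is", "complement": data["sky"]},
        {"subject": "the temperature", "verb": "is", "complement": f"{data['temp_c']} degrees Celsius"},
    ]


def realize(spec):
    """Text realization: turn one specification into a finished sentence."""
    sentence = f"{spec['subject']} {spec['verb']} {spec['complement']}."
    return sentence[0].upper() + sentence[1:]


def generate(record):
    """Run the three stages in order and join the resulting sentences."""
    return " ".join(realize(spec) for spec in plan_sentences(plan_text(record)))


print(generate({"city": "Pune", "temp_c": 31, "sky": "sunny"}))
```

Keeping the stages as separate functions mirrors the description: what to say is decided first, then how to package it into sentences, and only at the end the actual wording.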
Syntax: It refers to the arrangement of words that form a sentence. It also involves the determination of the structural role of each word in the sentence.
Phonology: It is the study of organizing sounds systematically.
Morphology: It is the study of how words are constructed from primitive meaningful units.
Semantics: It deals with the meaning of words and how they can be combined to form meaningful sentences.
Discourse: This determines how the immediately preceding sentence can affect the interpretation of the next sentence.
Pragmatics: This deals with how the interpretation of a sentence changes according to the situation.
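To make the idea of morphology more concrete, here is a minimal sketch that splits a word into a prefix, a stem, and a suffix. It assumes tiny, hand-picked affix lists (`PREFIXES` and `SUFFIXES` are hypothetical) and a naive length check, so it is only an illustration of "primitive meaningful units", not a real morphological analyzer.

```python
# Hypothetical toy affix lists; real analyzers use much richer rules or learned models.
PREFIXES = ("un", "re", "dis")
SUFFIXES = ("ness", "ing", "ed", "ly", "s")


def segment(word):
    """Split a word into (prefix, stem, suffix) using the toy affix lists."""
    word = word.lower()
    # Accept an affix only if at least three characters remain afterwards.
    prefix = next((p for p in PREFIXES if word.startswith(p) and len(word) > len(p) + 2), "")
    rest = word[len(prefix):]
    suffix = next((s for s in SUFFIXES if rest.endswith(s) and len(rest) > len(s) + 2), "")
    stem = rest[: len(rest) - len(suffix)] if suffix else rest
    return prefix, stem, suffix


for example in ("unkindness", "replayed", "cats", "red"):
    print(example, segment(example))
```

A rule set this naive makes mistakes (for instance, it would wrongly strip "un" from "under"), which hints at why morphology is usually handled with dictionaries or statistical models.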
What can developers use NLP algorithms for?
- Summarizing blocks of text to extract the meaningful information from the given text, ignoring the remaining non-relevant text
- Understanding the input and generating the output in Chatbots
- Deriving the sentiment of a piece of text using Sentiment analysis
- Breaking up large text into simpler tokens such as sentences or words
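Two of the tasks listed above, tokenization and sentiment analysis, can be sketched with nothing but the Python standard library. This is a deliberately simple illustration: the regular expressions are rough heuristics, and the `POSITIVE` and `NEGATIVE` word lists are hypothetical mini-lexicons, not a real sentiment resource.

```python
import re

# Hypothetical mini-lexicons for illustration only.
POSITIVE = {"good", "great", "love", "excellent", "happy"}
NEGATIVE = {"bad", "poor", "hate", "terrible", "sad"}


def split_sentences(text):
    """Split text into sentences after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(sentence):
    """Lowercase a sentence and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", sentence.lower())


def sentiment(text):
    """Label text by counting positive and negative lexicon hits."""
    tokens = tokenize(text)
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


review = "I love this phone. The battery is terrible! Overall it is a great buy."
for sentence in split_sentences(review):
    print(tokenize(sentence), sentiment(sentence))
```

Counting lexicon hits ignores negation and sarcasm, so it should be read as a baseline that shows the shape of the task rather than a usable classifier.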
Some Open Source NLP Libraries
- Apache OpenNLP
It is a Java-based machine learning toolkit provided by Apache that supports the most common NLP tasks, such as tokenization, sentence segmentation, part-of-speech tagging, named entity extraction, chunking, parsing, language detection, and coreference resolution. OpenNLP also includes maximum entropy and perceptron-based machine learning. It provides built-in Java classes for each function as well as a command-line interface for testing the pre-built agents.
- Natural Language Toolkit (NLTK)
It is a platform for building Python programs to read and process human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum.
- Stanford CoreNLP
Stanford CoreNLP provides a set of human language technology tools. It can give the base forms of words and their parts of speech, mark up the structure of sentences in terms of phrases and syntactic dependencies, indicate which noun phrases refer to the same entities, indicate sentiment, extract particular or open-class relations between entity mentions, get the quotes people said, etc.
- MALLET
MALLET is a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text. Apart from classification, MALLET includes tools for sequence tagging for applications such as named-entity extraction from text. Algorithms include Hidden Markov Models, Maximum Entropy Markov Models, and Conditional Random Fields.
These are a few of the many open source libraries and toolkits for Natural Language Processing that developers can use in their applications.
In conclusion, Natural Language Processing is an important part of the artificial intelligence field, and anyone who wants to master Machine Learning or Artificial Intelligence should give it due attention.