Video description
Unstructured data in the form of documents, web pages, and social media interactions is an ever-growing, ever more valuable data source for addressing present business problems, from exploring brand sentiment to identifying sensitive information in internal documents. Unfortunately, the classification and annotation algorithms used to solve these problems often require significant amounts of labeled training data to reach the desired accuracy.
Michael Johnson and Norris Heintzelman (Lockheed Martin) share several techniques they’ve implemented to build classification and NER models from scratch. They lead a tour through this space as it applies to NLP and demonstrate their approach and architecture for the following techniques:
- Weak supervision for news documents: Using rule-based classification alongside a deep learning system for text classification
- Active learning and human in the loop: Explaining how breakthroughs in transfer learning for NLP have impacted their active learning framework for building an LSTM-based relevance model
- Creative training sets: Identifying and cleaning already-labeled datasets, training a classifier on “only” positive examples
- NER adjudication: Combining knowledge from several annotation sources in a way that leverages the strengths of each source
For each of these topics, Michael and Norris outline the theoretical foundation, the implementation architecture, and the tools used, and discuss the problems they encountered so you can avoid making the same mistakes.
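As a rough illustration of the weak supervision idea in the first topic, the sketch below combines a few hand-written rules by majority vote to produce noisy labels for unlabeled text. It is not the speakers' implementation: the rule functions, keywords, and label names (`lf_mentions_merger`, `"business"`, `"politics"`, and so on) are hypothetical, and the step of training a deep learning classifier on the resulting labels is not shown.

```python
from collections import Counter

ABSTAIN = None  # a rule returns this when it has no opinion

# Hypothetical keyword rules; a real system would use domain-specific ones.
def lf_mentions_merger(text):
    return "business" if "merger" in text.lower() else ABSTAIN

def lf_mentions_shares(text):
    return "business" if "shares" in text.lower() else ABSTAIN

def lf_mentions_election(text):
    return "politics" if "election" in text.lower() else ABSTAIN

LABELING_FUNCTIONS = [lf_mentions_merger, lf_mentions_shares, lf_mentions_election]

def weak_label(text):
    """Return the majority label across rules, or ABSTAIN on no votes or a tie."""
    votes = [lf(text) for lf in LABELING_FUNCTIONS]
    votes = [v for v in votes if v is not ABSTAIN]
    if not votes:
        return ABSTAIN
    (top, top_count), *rest = Counter(votes).most_common()
    if rest and rest[0][1] == top_count:
        return ABSTAIN  # rules disagree evenly, so leave the document unlabeled
    return top

docs = [
    "The merger lifted shares by 4%.",
    "Election results are due tonight.",
    "Sunny weather is expected this weekend.",
]
noisy_labels = [weak_label(d) for d in docs]
```

Documents where every rule abstains stay unlabeled; in a setup like this, only the labeled subset would be passed on as training data for the downstream model.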
Table of Contents
NLP from scratch: Solving the cold start problem for natural language processing - Michael Johnson (Lockheed Martin), Norris Heintzelman (Lockheed Martin)