Snorkel AI Boosts Success of AI Projects by Reducing Bottlenecks in Training Data
Snorkel AI is shipping an AI platform powered by programmatic data labeling to reduce the bottleneck in training data for ML models. IDN looks at Snorkel Flow.
Now in general availability, Snorkel Flow adds features to help enterprises accelerate AI app development using its “automated data labeling” technology, according to Snorkel AI CEO and co-founder Alex Ratner.
More than 80% of AI development time is spent gathering, organizing, and manually labeling the training data used to train machine learning models, according to a report from Cognilytica.
The “training data bottleneck” is a primary reason a vast majority of AI projects (up to 87%) never make it into production, according to Ratner.
The company’s Snorkel Flow offering is a data-centric AI platform designed to eliminate such bottlenecks by labeling training data programmatically.
Snorkel Flow also uses error analysis to guide training data and model iteration in tandem, so teams can adapt to real-world changes with a few clicks rather than perform time-consuming manual relabeling. Ratner added that organizations using Snorkel Flow are achieving machine learning model accuracy in days rather than weeks or months.
How Snorkel Flow Technologies Break the Training Data Bottleneck
Of all the tasks that can slow AI projects, “manual labeling is notoriously expensive and slow,” Ratner noted.
Snorkel Flow delivers a data-centric development workflow for data science and machine learning practitioners to tackle document intelligence applications, Ratner added.
Snorkel Flow aims to solve the training data bottleneck by providing enterprise teams with a range of functionality:
- Programmatic data labeling: No-code and Python SDK interfaces for programmatic labeling, with state-of-the-art weak supervision algorithms.
- Integrated ML modeling suite: No-code, continuous training of leading, pre-configured models and modeling tools like AutoML available in-platform.
- Collaborative AI application development: Workflows for domain experts to encode labeling insight and rationale at scale and platform tools for real-time troubleshooting.
- Guided data iteration: Actionable error analysis and active learning workflows to improve training data quality and achieve production-worthy model accuracy faster.
- Accelerated document intelligence: Built-in pipeline templates with pre- and post-processing operators, models, and business logic for document classification and extraction applications.
- Pre-built templates: Templates that speed up classification of, and information extraction from, a range of documents.
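To make the idea of programmatic labeling concrete, here is a minimal sketch in plain Python. It is not Snorkel Flow's SDK: the labeling functions, the toy "invoice vs. contract" task, and the `majority_vote` helper are all hypothetical, and a simple majority vote stands in for the weak supervision algorithms the platform is said to use.

```python
from collections import Counter

ABSTAIN = None  # a labeling function returns this when it has no opinion

# Hypothetical labeling functions for a toy "invoice vs. contract" task.
def lf_mentions_invoice(text):
    return "invoice" if "invoice" in text.lower() else ABSTAIN

def lf_mentions_amount_due(text):
    return "invoice" if "amount due" in text.lower() else ABSTAIN

def lf_mentions_hereinafter(text):
    return "contract" if "hereinafter" in text.lower() else ABSTAIN

LABELING_FUNCTIONS = [
    lf_mentions_invoice,
    lf_mentions_amount_due,
    lf_mentions_hereinafter,
]

def majority_vote(text, lfs=LABELING_FUNCTIONS):
    """Combine labeling function votes; abstain on no votes or a tie."""
    votes = Counter(v for v in (lf(text) for lf in lfs) if v is not ABSTAIN)
    if not votes:
        return ABSTAIN
    ranked = votes.most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN  # conflicting evidence, leave unlabeled
    return ranked[0][0]

documents = [
    "Invoice #1: amount due in 30 days.",
    "The supplier, hereinafter 'the Vendor', agrees to the terms.",
    "Meeting notes from Tuesday.",
]
labels = [majority_vote(doc) for doc in documents]
```

Each function encodes one heuristic a domain expert might otherwise apply by hand, and documents that no heuristic covers stay unlabeled rather than receiving a guess.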
Snorkel Flow’s Approach to Data Annotation
One of Snorkel Flow’s enabling technologies is its approach to “data annotation,” according to a company blog post by Snorkel AI’s ML solution engineer Anastassia Kornilova.
Data annotation refers to the process of categorizing and labeling data for training datasets. For a training dataset to be usable, it must be categorized appropriately and annotated for a specific use case. With Snorkel Flow, organizations can produce high-quality labeled training data via labeling functions and rapidly develop and adapt AI applications by iterating on labeled data programmatically.
Teams often overlook the importance of data annotation guidelines and best practices until they’ve run into problems caused by their absence. Supervised machine learning problems require labeled data, whether you are trying to analyze financial documents, build a fact-checking system, or automate other use cases. Snorkel Flow accelerates the process of generating labeled data via programmatic labeling, but teams still need a clear definition of the labels (i.e., ground truth).
Annotation guidelines are the guideposts that annotators, domain experts, and data scientists follow when labeling data. The critical steps for creating these guidelines are:
- Consider your audience (both the annotators and the downstream users of the data).
- Iterate early to refine definitions.
- Consistently keep track of confusing and difficult data examples.
The Snorkel Flow platform supports this process by providing a custom annotation workspace and tagging capabilities for flagging ambiguous data points as a part of the end-to-end, data-centric AI application development platform. Moreover, the interplay of programmatic labeling and hand annotations can surface systematic problems in the annotation guidelines.
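One way to picture how that interplay might surface guideline problems is to count where hand annotations and programmatic labels disagree. The sketch below is an assumption-laden illustration, not a Snorkel Flow feature: the records, the label names, and the `flag_for_review` helper are hypothetical.

```python
from collections import Counter

# Hypothetical records: (example id, hand annotation, programmatic label).
records = [
    ("doc-1", "invoice", "invoice"),
    ("doc-2", "receipt", "invoice"),
    ("doc-3", "receipt", "invoice"),
    ("doc-4", "contract", "contract"),
    ("doc-5", "contract", "invoice"),
]

def disagreement_counts(rows):
    """Count (hand label, programmatic label) pairs that differ."""
    return Counter((hand, prog) for _, hand, prog in rows if hand != prog)

def flag_for_review(rows, min_count=2):
    """Return label pairs that disagree at least min_count times."""
    counts = disagreement_counts(rows)
    return [pair for pair, n in counts.most_common() if n >= min_count]

flagged = flag_for_review(records)
```

A pair that recurs, such as "receipt" annotated by hand but "invoice" assigned programmatically, suggests the guidelines do not separate those two labels clearly, whereas a one-off mismatch is more likely an individual error.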
The Snorkel Flow platform can help with all aspects of designing annotation guidelines in several ways, Kornilova added. Among the notable features she mentioned are:
- An in-platform annotation workspace pre-integrated with the main model development loop.
- Built-in tagging of data points, which tightens the loop for iterating on annotation guidelines.
- Powerful labeling functions that allow users to encode and evaluate guidelines.
- Simple ways to let users edit the label space while preserving existing ground truth and rules.
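As a rough illustration of what encoding and evaluating a guideline with a labeling function could look like, the following sketch measures a rule's coverage and accuracy against a small hand-labeled set. The guideline, the data, and the `evaluate_labeling_function` helper are hypothetical and do not reflect Snorkel Flow's actual interface.

```python
def evaluate_labeling_function(lf, examples):
    """Return (coverage, accuracy) of lf over (text, ground_truth) pairs.

    coverage: fraction of examples where lf votes (does not return None).
    accuracy: fraction of those votes that match ground truth,
              or None if lf abstains on every example.
    """
    votes = [(lf(text), truth) for text, truth in examples]
    fired = [(vote, truth) for vote, truth in votes if vote is not None]
    coverage = len(fired) / len(votes) if votes else 0.0
    accuracy = sum(v == t for v, t in fired) / len(fired) if fired else None
    return coverage, accuracy

# Hypothetical guideline: "messages that mention a refund are complaints".
def lf_refund_is_complaint(text):
    return "complaint" if "refund" in text.lower() else None

ground_truth = [
    ("I want a refund for this order.", "complaint"),
    ("How do I request a refund?", "question"),
    ("The package arrived damaged.", "complaint"),
    ("Thanks for the quick delivery!", "praise"),
]

coverage, accuracy = evaluate_labeling_function(lf_refund_is_complaint, ground_truth)
```

Here the rule votes on half of the examples and is right on only half of those votes, which would prompt a closer look at whether the guideline should distinguish refund questions from refund complaints.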
Snorkel Flow’s architecture offers cloud-agnostic Kubernetes deployment options, role-based access controls, SSO integrations, encryption in transit and at rest, and other features.
One early adopter of Snorkel Flow can attest to efficiencies and speed improvements in its AI/ML projects.
Memorial Sloan Kettering Cancer Center’s Deputy CIO and VP of Digital Solutions Janet Mak said, “We have applied Snorkel Flow to two use cases using pathology reports. We accurately labeled a few thousand pathology reports (95% accuracy, 85% precision) using one SME in days versus weeks. In addition to these material time savings, Snorkel Flow allows our teams to collaborate on the data accuracy and provides time efficiencies for our highly valued physicians and medical professionals.”
Mak also applauded the overall direction of Snorkel AI and its Snorkel Flow technology efforts, noting “a significant need for AI models is labeled data, often tedious and expensive to generate.”