Exploring Amazon SageMaker’s new features — CloudFormation, Data Wrangler
The data tooling and infrastructure space is growing rapidly, and this trend is showing no signs of slowing down. Behemoth data storage firm Snowflake IPOed late last year and became more valuable than IBM, and Databricks recently raised a $1 billion Series G with a $28 billion post-money valuation, to name two examples. The long tail of the data tools space is becoming increasingly crowded, as evidenced by Matt Turck’s 2020 Data & AI Landscape (just look at the image below).
AWS is one of the most prominent players in the space, and SageMaker is its flagship solution for the machine learning development workflow. When AWS announces new SageMaker features, the industry pays attention. Having written two reviews since Sagemaker Studio’s inception, we were interested to see a swathe of new features come across the wire last December and at Swami Sivasubramanian’s Machine Learning Keynote at re:Invent. After spending some time with the new features, we’ve put together a two-part piece on our impressions. This first part covers:
- Better integration with AWS CloudFormation, which allows for easier provisioning of resources
- General ability to use Sagemaker and the platform’s usability
- Data Wrangler, a GUI-based tool for data preparation and feature engineering
The second part covers
- Feature Store, a tool for storing, retrieving, editing, and sharing purpose-built features for ML workflows
- Clarify, which claims to “detect bias in ML models” and to aid in model interpretability
- Sagemaker Pipelines, which help automate and organize the flow of ML pipelines
Let’s get started!
One-click provisioning makes it easier to get started
Overall, we found the experience with SageMaker much smoother than last time. The SageMaker Studio environment would actually start and provision (it embarrassingly refused to launch last time during re:Invent). There overall experience felt much improved, and the tutorials and documentation are better integrated with the platform.
One of the environment’s best features is AWS CloudFormation, which have been around since 2011 but seem to have been better integrated into SageMaker. It’s a significant pain point in computing to get hardware and infrastructure provisioned safely — getting S3 buckets, databases, EC2 instances all up and talking to each other securely. This often meant hours of tinkering with IAM permissions just to get a “Hello World” server going. CloudFormation simplifies that by pre-defining infrastructure configuration “stacks” into YAML files (think Kubernetes Object YAML but for AWS infrastructure), which can be fired up with one click. An AWS spokesperson told us the integration was part of a move “to make SageMaker widely accessible for the most sophisticated ML engineers and data scientists as well as those who are just getting started.” Even better, many of the AWS tutorials now feature buttons to launch stacks with just one click:
(The buttons are reminiscent of a late ’90s Amazon.com One-Click Shopping button and that resemblance may be subliminal marketing. Both distill immensely complex infrastructure, whether e-commerce or cloud, into a single consumer-friendly button that drives sales.)
Sagemaker has improved but usability is still lacking, hindering adoption
Given the interest in deep learning, we wanted to try out deep learning on AWS. These models are on the leading edge of machine learning but are notoriously computationally expensive to train, requiring GPUs, which can be quite spendy. We decided to test out these newfound capabilities by running examples from FastAI’s popular deep learning book to see how easy it is to get started. Fortunately, the Deep Learning models come with convenient launch buttons, so you can get up and running pretty smoothly. The AWS instances were very powerful (for a fairly computationally intensive NLP example their ml.p3.2xlarge ran about 20X faster than the free tier Quadro P5000 available on Gradient), and for only $3.825 an hour.
Nonetheless, the tools were not without their hiccups. On AWS, most of the GPU instances are not automatically available; instead, users must request a quota limit increase. Requesting a limit increase appears to require human approval and usually takes a day, killing momentum. Also, the launch stacks sometimes don’t line up with the tutorial types: e.g., the entity resolution tutorial launches with a CPU instance type, which required 24 hours to approve. When the notebook ran, it required a GPU instance. Users are not given any resource quotas for this by default and must request an increase manually, adding another 24-hour delay. This assumes they are eligible for such increases at all (one of us was not, and only found a workaround after contacting an AWS representative). Some of this may have been due to the fact that we were using a relatively new AWS account. But great software has to work for new users as well as veterans if it hopes to grow and this is what we set out to test. Great software should also work for users who do not have the luxury of a contact at AWS.
Our experience is well-summarized by Jesse Anderson, author of Data Teams. He told us that “AWS’s intent is to offload data engineer tasks to make them more doable for the data scientists. It lowers the bar somewhat but it isn’t a huge shift. There’s still a sizable amount of data engineering needed just to get something ready for SageMaker.”
To be fair to AWS, service quotas are useful in helping control cloud costs, particularly in a large enterprise setting where a CIO might want to enable the rank-and-file to request the services they need without incurring an enormous bill. Yet, one could easily imagine a better world. At a minimum, AWS-related error messages (e.g. resource limit constraints) should come with links to details on how to fix them rather than making users spend time hunting through console pages. For example, GCloud Firebase, which has a similar service quotas, does this well.
Even better, it would be nice if there were single-click buttons that immediately granted account owners a single instance for 24 hours so users don’t have to wait for human approval. In the end, we expected a more straightforward interface. We’ve seen some tremendous improvements over last year, but AWS is still leaving a lot on the table.
Data Wrangler: right problem, wrong approach
There’s a now-old trope (immortalized by Big Data Borat) that data scientists spend 80% of their time cleaning and preparing data:
Industry leaders recognize the importance of tackling this problem well. As Ozan Unlu, Founder and CEO of automated observability startup Edge Delta explained to us, “allowing data scientists to more efficiently surpass the early stages of the project allows them to spend a much larger proportion of their time on significantly higher value additive tasks.” Indeed, one of us previously wrote an article called The Unreasonable Importance of Data Preparation, clarifying the need to automate parts of the data preparation process. SageMaker Studio’s Data Wrangler claims to “provide the fastest and easiest way for developers to prepare data for machine learning” and comes packed with exciting features, including: 300+ data transformation features (including one-hot encoders, which are table stakes for machine learning), the ability to hand-code your own transformations, and upcoming integrations with Snowflake, MongoDB, and Databricks. Users are also able to output their results and workflows to a variety of formats like SageMaker pipelines (more on this in Part 2), Jupyter notebooks, or a Feature Store (we’ll get to this in Part 2 as well).
However, we’re not convinced that most developers or data scientists would find it very useful yet. First off, it’s GUI-based, and the vast majority of data scientists will avoid GUIs like the plague. There are several reasons for this, perhaps the most important being that GUIs are antithetical to reproducible data work. Hadley Wickham, Chief Scientist at RStudio and author of the principles of tidy data, has even given a talk entitled “You can’t do data science in a GUI.”
To be fair to SageMaker, you can export your workflow as Python code, which will help alleviate reproducibility to a certain extent. This approach follows in the footsteps of products such as Looker (acquired last year by Google for $2.6 billion!), which generates SQL code based upon user interactions with a drag and drop interface. But it will probably not appeal to developers or data scientists (if you can already express your ideas in code, why learn to use someone else’s GUI?).
There may be some value in enabling non-technical domain experts (who are presumably less expensive talent resources) to transform data and export the process to code. However, the code generated from recording an iterative exploratory GUI session may not be very clean and could require significant engineering or data scientist intervention. Much of the future of data work will occur in GUIs and drag-and-drop interfaces, but this will be the long tail of data work and not that of developers and data scientists.
Data Wrangler’s abstraction away from code and the abstraction over many other parts of the data preparation workflow are also concerning. Take the “quick model” feature which, according to AWS Evangelist Julien Simon, “immediately trains a model on the pre-processed data,” shows “the impact of your data preparation steps,” and provides insight into feature importance. When building this quick model, it isn’t clear in the product what kind of model is actually trained, so it’s not obvious how any insight could be developed here or whether the “important features” are important at all.
. . .