Serverless Apache Zeppelin on AWS
The serverless infrastructure behind Apache Zeppelin on AWS. Do you want to make it yours? You can start from here.
What is Apache Zeppelin?
First of all, it is worth asking: what is a notebook interface?
A notebook is an interface for running code interactively. It lets you explore and visualize data, and you can mix narrative, rich media, and data in a single space.
Now for the Apache Zeppelin definition. It is a web-based notebook that enables data-driven, interactive data analytics and collaborative documents with Python, Scala, SQL, Spark, and more. You can execute code and even schedule a job (via cron) to run at a regular interval.
It's easy to mix languages in the same notebook: you can write some code and then add Markdown to document it all together. You can also easily convert your notebook into a presentation style, for example to present to management or to use as a dashboard.
What Does Serverless Mean?
The idea behind serverless is that you, as a developer, shouldn't need to care about the server infrastructure. You pay to run your code without worrying about what type of physical infrastructure runs underneath.
There are quite a few advantages to serverless. Scalability essentially comes for free: because you're just paying to run your logic, the cloud provider can easily dedicate more hardware to run your code. Also, you pay per code execution rather than a fixed rate. On top of that, the cloud provider manages the server software and hardware, so you shouldn't need to care about that. Finally, serverless frees up developers to focus on what they're good at: code.
Solution Requirements
Build a serverless infrastructure that runs Apache Zeppelin and persists notebook files. The solution must be publicly available and provide login and logout capability. Also, the compute platform must shut down automatically after 30 minutes of inactivity.
High-level Architecture
The diagram below shows the high-level architecture. As you can see, it is a serverless infrastructure: you interact with Apache Zeppelin through a public endpoint, while Amazon Elastic File System stores the notebook files. An Amazon CloudWatch custom metric counts the log lines, and the AWS Fargate container is shut down after 30 minutes of inactivity.
The only missing feature in this architecture is the login and logout capability. In this case, Apache Zeppelin provides Shiro for notebook authentication. Apache Shiro is a powerful and easy-to-use Java security framework that performs authentication, authorization, cryptography, and session management. Here you can find a step-by-step guide about how Shiro works. This example uses the default configuration.
Infrastructure as Code Description
Amazon API Gateway
AWSTemplateFormatVersion: '2010-09-09'
Globals:
  Function:
    Timeout: 60
    MemorySize: 128
    Architectures:
      - arm64
Parameters:
  [...]
Resources:
  ZeppelinApi:
    Type: AWS::Serverless::Api
    Properties:
      StageName: !Ref ServiceName
Outputs:
  ZeppelinApi:
    Description: "API Gateway endpoint URL for Prod stage for Hello World function"
    Value: !Sub "https://${ZeppelinApi}.execute-api.${AWS::Region}.amazonaws.com/${ServiceName}/"
The solution uses the AWS Serverless Application Model (SAM). The template above shows the global configuration for the Lambda functions and the public API you can use to access Apache Zeppelin. The stack deployment provides the URL as an output value.
Elastic File System
AWSTemplateFormatVersion: '2010-09-09'
Globals:
  Function:
    Timeout: 60
    MemorySize: 128
    Architectures:
      - arm64
Parameters:
  [...]
Resources:
  ZeppelinApi:
    [...]
  AccessPoint:
    Type: 'AWS::EFS::AccessPoint'
    Properties:
      FileSystemId: !Ref FileSystem
      PosixUser:
        Uid: "500"
        Gid: "500"
        SecondaryGids:
          - "2000"
      RootDirectory:
        CreationInfo:
          OwnerGid: "500"
          OwnerUid: "500"
          Permissions: "0777"
        Path: !Sub "/${ServiceName}"
  FileSystem:
    Type: AWS::EFS::FileSystem
    Properties:
      PerformanceMode: generalPurpose
      FileSystemTags:
        - Key: ServiceName
          Value: !Ref ServiceName
  MountTarget1:
    [Availability Zone A Configuration]
  MountTarget2:
    [Availability Zone B Configuration]
  MountTarget3:
    [Availability Zone C Configuration]
When provisioned, each Amazon ECS task hosted on AWS Fargate receives ephemeral storage for bind mounts, and everything on that disk is lost when the container terminates. To persist notebook files, the solution uses Amazon Elastic File System (EFS), so all notebooks on EFS are preserved after the container terminates. The access point configuration gives Apache Zeppelin write permissions on the file system.
Amazon CloudWatch Custom Metric
AWSTemplateFormatVersion: '2010-09-09'
Globals:
  Function:
    Timeout: 60
    MemorySize: 128
    Architectures:
      - arm64
Parameters:
  [...]
Resources:
  ZeppelinApi:
    [...]
  AccessPoint:
    [...]
  FileSystem:
    [...]
  ShutdownSnsTopic:
    [description later in this post]
  ZeppelinLogGroup:
    Type: AWS::Logs::LogGroup
    Properties:
      LogGroupName: !Sub "/ecs/fargate-${ServiceName}"
      RetentionInDays: 1
  ActivityMetricFilter:
    Type: AWS::Logs::MetricFilter
    Properties:
      LogGroupName: !Ref ZeppelinLogGroup
      FilterPattern: "INFO"
      MetricTransformations:
        - MetricValue: "1"
          MetricNamespace: !Sub "${ServiceName}/Actions"
          MetricName: "ActionsCount"
  ZeppelinActionsCountAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: ZeppelinActionsCountAlarm
      MetricName: ActionsCount
      Namespace: !Sub "${ServiceName}/Actions"
      Statistic: SampleCount
      Period: '300'
      EvaluationPeriods: '6'
      TreatMissingData: breaching
      Threshold: '1'
      ComparisonOperator: LessThanOrEqualToThreshold
      AlarmActions:
        - !Ref ShutdownSnsTopic
To provide the auto-shutdown feature, the solution uses a custom metric. AWS Fargate sends the container logs to an Amazon CloudWatch log group, and a CloudWatch metric filter counts the log lines that contain INFO. The alarm evaluates six consecutive 5-minute periods, so if the metric shows no activity for about 30 minutes, the alarm publishes a message to Amazon Simple Notification Service to terminate the cluster.
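The arithmetic behind the 30-minute window can be sketched in a few lines of Python. This is a simplified model of the alarm settings shown in the template, not CloudWatch's actual evaluation logic; the function names are hypothetical and exist only for illustration. A period with no matching log lines produces no datapoint, which is why `TreatMissingData: breaching` matters here.

```python
from typing import Optional, Sequence

# Values copied from the alarm definition in the template.
PERIOD_SECONDS = 300      # Period
EVALUATION_PERIODS = 6    # EvaluationPeriods
THRESHOLD = 1             # Threshold


def inactivity_window_minutes() -> int:
    """Length of the window the alarm looks at, in minutes."""
    return PERIOD_SECONDS * EVALUATION_PERIODS // 60


def is_breaching(sample_count: Optional[int]) -> bool:
    """A period breaches when it has no data or at most THRESHOLD samples."""
    if sample_count is None:
        # TreatMissingData: breaching
        return True
    # ComparisonOperator: LessThanOrEqualToThreshold
    return sample_count <= THRESHOLD


def should_shut_down(period_counts: Sequence[Optional[int]]) -> bool:
    """True when the most recent EVALUATION_PERIODS periods all breach.

    period_counts holds one entry per 5-minute period, oldest first;
    None means the metric filter emitted no datapoint in that period.
    """
    recent = list(period_counts)[-EVALUATION_PERIODS:]
    if len(recent) < EVALUATION_PERIODS:
        return False
    return all(is_breaching(count) for count in recent)
```

With these settings, a single active period inside the last six is enough to keep the container alive, while six quiet periods in a row (6 x 300 seconds) trigger the shutdown message.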
AWS Fargate
AWSTemplateFormatVersion: '2010-09-09'
Globals:
  Function:
    Timeout: 60
    MemorySize: 128
    Architectures:
      - arm64
Parameters:
  [...]
Resources:
  ZeppelinApi:
    [...]
  AccessPoint:
    [...]
  FileSystem:
    [...]
  ZeppelinLogGroup:
    [...]
  Cluster:
    Type: AWS::ECS::Cluster
    Properties:
      ClusterName: !Join ['', [!Ref ServiceName, Cluster]]
  ZeppelinTaskDefinition:
    Type: AWS::ECS::TaskDefinition
    Properties:
      RequiresCompatibilities:
        - "FARGATE"
      Cpu: !Ref ContainerCPU
      Memory: !Ref MemoryHardLimit
      NetworkMode: "awsvpc"
      TaskRoleArn: !GetAtt ZeppelinTaskRole.Arn
      ExecutionRoleArn: !GetAtt ZeppelinTaskRole.Arn
      ContainerDefinitions:
        - Name: !Ref ServiceName
          Image: "apache/zeppelin:0.10.0"
          EntryPoint:
            - /bin/bash
            - -c
            - |
              cp conf/shiro.ini.template conf/shiro.ini
              /usr/bin/tini -- bin/zeppelin.sh
          Command: ["echo", "done!"]
          MemoryReservation: !Ref MemorySoftLimit
          Memory: !Ref MemoryHardLimit
          PortMappings:
            - ContainerPort: !Ref ContainerPort
              Protocol: tcp
            - ContainerPort: 4040
              Protocol: tcp
          LogConfiguration:
            LogDriver: awslogs
            Options:
              awslogs-group: !Ref ZeppelinLogGroup
              awslogs-region: !Ref AWS::Region
              awslogs-stream-prefix: !Sub 'ecs-${ServiceName}-awsvpc'
          MountPoints:
            - ContainerPath: !Ref ZeppelinPersistNotebookPath
              SourceVolume: !Sub "${ServiceName}"
              ReadOnly: false
      Volumes:
        - Name: !Sub "${ServiceName}"
          EFSVolumeConfiguration:
            AuthorizationConfig:
              IAM: ENABLED
              AccessPointId: !Ref AccessPoint
            FilesystemId: !Ref FileSystem
            TransitEncryption: ENABLED
This section defines the Amazon ECS cluster and the AWS Fargate task definition. The solution uses Shiro to enable login and logout capability. As stated here, you can create a shiro.ini file with the cp command, which you can find in the EntryPoint property of the container definition.
AWS Lambda | Workflow
The AWS Lambda function implements the workflow that the high-level architecture introduces.
First, it checks whether the Apache Zeppelin container is running.
- If yes, AWS Lambda returns a 302 redirect to the Apache Zeppelin public IP.
- If no, AWS Lambda executes the next step.
Then, it checks whether the Apache Zeppelin container exists.
- If yes, AWS Lambda returns static web content: a loading page that auto-refreshes every 20 seconds.
- If no, AWS Lambda starts a new Apache Zeppelin container and returns the loading page.
Every 20 seconds, the client checks Apache Zeppelin provisioning: it gets the notebook interface if the container is running and the loading page otherwise. Once you have the notebook interface, you must provide your user credentials to use Apache Zeppelin.
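The decision logic above can be sketched as a small pure function. This is a minimal illustration under stated assumptions, not the code from the repository: `plan_response`, `LOADING_PAGE`, and the default port 8080 (Zeppelin's usual default, while the template takes the port from the `ContainerPort` parameter) are hypothetical choices, and the lookup of the task status and public IP through the ECS API is deliberately left out.

```python
from typing import Optional, Tuple

REFRESH_SECONDS = 20  # auto-refresh interval of the loading page

# Hypothetical loading page; the meta tag makes the browser retry by itself.
LOADING_PAGE = (
    "<html><head>"
    f'<meta http-equiv="refresh" content="{REFRESH_SECONDS}">'
    "</head><body>Apache Zeppelin is starting...</body></html>"
)


def plan_response(
    task_status: Optional[str],
    public_ip: Optional[str],
    port: int = 8080,  # assumption: the real value comes from ContainerPort
) -> Tuple[dict, bool]:
    """Return (API Gateway proxy response, whether to start a new task).

    task_status is None when no Zeppelin task exists; otherwise it is the
    task's last known status, for example "PENDING" or "RUNNING".
    """
    if task_status == "RUNNING" and public_ip:
        # Container is up: redirect the browser to the notebook interface.
        redirect = {
            "statusCode": 302,
            "headers": {"Location": f"http://{public_ip}:{port}/"},
            "body": "",
        }
        return redirect, False

    # Container is missing or still provisioning: serve the loading page,
    # and ask the caller to start a task only when none exists yet.
    loading = {
        "statusCode": 200,
        "headers": {"Content-Type": "text/html"},
        "body": LOADING_PAGE,
    }
    return loading, task_status is None
```

Keeping the decision separate from the AWS calls makes the workflow easy to test; the Lambda handler would only need to look up the task, call `plan_response`, and start a task when the second value is true.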
AWS Lambda | Shutdown
If the custom metric is zero for about 30 minutes, the alarm publishes a message to Amazon Simple Notification Service (SNS), and an AWS Lambda function terminates the cluster. The Amazon SNS topic is the trigger for the AWS Lambda function.
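A shutdown handler along these lines could look like the sketch below. It is an assumption about how such a function might be written, not the repository's code: `handle_sns_event` and `stop_running_tasks` are hypothetical names, and the ECS client is passed in as a parameter (in a real Lambda it would typically be `boto3.client("ecs")`). The sketch stops the running tasks rather than deleting the cluster, and it ignores pagination of `list_tasks`.

```python
import json
from typing import Any, Dict, List


def stop_running_tasks(ecs: Any, cluster: str, reason: str) -> List[str]:
    """Stop every running task in the cluster and return their ARNs."""
    task_arns = ecs.list_tasks(cluster=cluster, desiredStatus="RUNNING")["taskArns"]
    for arn in task_arns:
        ecs.stop_task(cluster=cluster, task=arn, reason=reason)
    return task_arns


def handle_sns_event(event: Dict[str, Any], ecs: Any, cluster: str) -> List[str]:
    """Process an SNS-triggered Lambda event carrying a CloudWatch alarm.

    The alarm notification arrives as a JSON string in Records[].Sns.Message;
    tasks are stopped only when the alarm moved to the ALARM state.
    """
    stopped: List[str] = []
    for record in event.get("Records", []):
        message = json.loads(record["Sns"]["Message"])
        if message.get("NewStateValue") == "ALARM":
            reason = f"Inactivity alarm: {message.get('AlarmName', 'unknown')}"
            stopped.extend(stop_running_tasks(ecs, cluster, reason))
    return stopped
```

Checking `NewStateValue` guards against stopping the container if the same topic ever receives a message for a different state transition.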
Usage Suggestions & Improvements
Apache Zeppelin also supports Amazon S3 to persist notebook files. As stated here, you can use ZEPPELIN_NOTEBOOK_STORAGE, ZEPPELIN_NOTEBOOK_S3_BUCKET, and ZEPPELIN_NOTEBOOK_S3_USER as environment variables.
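As a rough illustration, the three variables could be assembled into the name/value list that an ECS container definition expects. The helper name and the bucket and user values are hypothetical; the storage class `org.apache.zeppelin.notebook.repo.S3NotebookRepo` is the one the Zeppelin storage documentation names for S3, so verify it against the Zeppelin version you deploy. The task role would also need permission to access the bucket, which is not shown here.

```python
from typing import Dict, List

# Storage backend class named in the Zeppelin documentation for S3.
S3_NOTEBOOK_REPO = "org.apache.zeppelin.notebook.repo.S3NotebookRepo"


def s3_storage_environment(bucket: str, user: str) -> List[Dict[str, str]]:
    """Build the environment entries for S3 notebook storage.

    The result uses the lowercase name/value shape of the ECS API;
    a CloudFormation template would use Name/Value instead.
    """
    settings = {
        "ZEPPELIN_NOTEBOOK_STORAGE": S3_NOTEBOOK_REPO,
        "ZEPPELIN_NOTEBOOK_S3_BUCKET": bucket,
        "ZEPPELIN_NOTEBOOK_S3_USER": user,
    }
    return [{"name": name, "value": value} for name, value in settings.items()]
```

For example, `s3_storage_environment("my-notebooks", "zeppelin")` yields three entries that could be added to the container definition in place of the EFS mount.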
Amazon Elastic File System, on the other hand, gives you a generic solution that can be used for different purposes; the only limit is your imagination. Because Amazon EFS is a file system, your application does not have to deal with Amazon S3 object storage: you can package your application in a Docker container and run it on AWS Fargate in place of Apache Zeppelin.
For example, you can run a serverless Visual Studio Code; check the container here.
Another improvement, related to Serverless Apache Zeppelin on AWS, is to configure Amazon DynamoDB as an external database for Shiro users.
You can find the solution on GitHub here.
What will be your next application to deploy as Serverless?