Serverless Apache Zeppelin on AWS

1 May 2022

The serverless infrastructure behind Apache Zeppelin on AWS. Do you want to make it yours? You can start from here.

What is Apache Zeppelin?
What Does Serverless Mean?
Solution Requirements
High-level Architecture
Infrastructure as Code Description
Usage Suggestions & Improvements

What is Apache Zeppelin?

First of all, it is worth asking: what is a notebook interface?
A notebook is an interface for running code interactively; it lets you explore and visualize data. You can mix narrative, rich media, and data in a single space.

Now for the Apache Zeppelin definition. It is a web-based notebook that enables data-driven, interactive data analytics and collaborative documents with Python, Scala, SQL, Spark, and more. You can execute code and even schedule a job (via cron) to run at a regular interval.

It is easy to mix languages in the same notebook. You can write some code and then add Markdown to document it, all in one place. You can also easily convert your notebook into a presentation style, for example to present to management or to use as a dashboard.

Personalize your analysis result — Apache Zeppelin multi-user support

What Does Serverless Mean?

The idea behind serverless is that you, as a developer, shouldn't need to care about the server infrastructure. You pay to run your code, regardless of what type of physical infrastructure is running underneath.

There are quite a few advantages to serverless. Scalability essentially comes for free: because you're just paying to run your logic, the cloud provider can easily dedicate more hardware to run your code. Also, you pay per code execution rather than a fixed rate. In addition, the cloud provider manages the server software and hardware, so you don't need to care about them. Finally, serverless frees up developers to focus on what they're good at: code.

Serverless Computing — Serverless Compute on AWS (Video)

Solution Requirements

Build a serverless infrastructure to run Apache Zeppelin and persist notebook files. The solution must be publicly available and provide login and logout capability. Also, the compute platform must shut down automatically after 30 minutes of inactivity.

High-level Architecture

The diagram below shows the high-level architecture. As you can see, it is a serverless infrastructure: you reach Apache Zeppelin through a public endpoint, while Amazon Elastic File System stores the notebook files. An Amazon CloudWatch custom metric counts the log lines, and an alarm on that metric shuts down the AWS Fargate container after 30 minutes of inactivity.

The only feature missing from this architecture is the login and logout capability. For this, Apache Zeppelin provides Shiro for notebook authentication. Apache Shiro is a powerful and easy-to-use Java security framework that performs authentication, authorization, cryptography, and session management. Here you can find a step-by-step guide about how Shiro works. This example uses the default configuration.

Infrastructure as Code Description

Amazon API Gateway

AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31
Globals:
  Function:
    Timeout: 60
    MemorySize: 128
    Architectures: 
      - arm64
Parameters:
  [...]
    
Resources:
  ZeppelinApi:
    Type: AWS::Serverless::Api
    Properties:
      StageName: !Ref ServiceName
    
Outputs:
  ZeppelinApi:
    Description: "API Gateway endpoint URL for Apache Zeppelin"
    Value: !Sub "https://${ZeppelinApi}.execute-api.${AWS::Region}.amazonaws.com/${ServiceName}/"

The solution uses AWS SAM. Here you can see the global configuration for the Lambda functions and the public API you use to access Apache Zeppelin. The stack deployment provides the URL as an output value.

Elastic File System

AWSTemplateFormatVersion: '2010-09-09'
Globals:
  Function:
    Timeout: 60
    MemorySize: 128
    Architectures: 
      - arm64
Parameters:
  [...]

Resources:
  ZeppelinApi:
    [...]
    
  AccessPoint:
    Type: 'AWS::EFS::AccessPoint'
    Properties:
      FileSystemId: !Ref FileSystem
      PosixUser:
        Uid: "500"
        Gid: "500"
        SecondaryGids:
          - "2000"
      RootDirectory:
        CreationInfo:
          OwnerGid: "500"
          OwnerUid: "500"
          Permissions: "0777"
        Path: !Sub "/${ServiceName}"
  FileSystem:
    Type: AWS::EFS::FileSystem
    Properties:
      PerformanceMode: generalPurpose
      FileSystemTags:
      - Key: ServiceName
        Value: !Ref ServiceName
  MountTarget1:
    [Availability Zone A Configuration]
  MountTarget2:
    [Availability Zone B Configuration]
  MountTarget3:
    [Availability Zone C Configuration]

When provisioned, each Amazon ECS task hosted on AWS Fargate receives ephemeral storage for bind mounts; everything on that disk is lost when the container terminates. To persist notebook files, the solution uses Amazon Elastic File System: all notebooks on EFS are preserved after the container terminates. The Access Point configuration gives Apache Zeppelin write permissions on the file system.

Amazon CloudWatch Custom Metric

AWSTemplateFormatVersion: '2010-09-09'
Globals:
  Function:
    Timeout: 60
    MemorySize: 128
    Architectures: 
      - arm64
Parameters:
  [...]

Resources:
  ZeppelinApi:
    [...]
  AccessPoint:
    [...]
  FileSystem:
    [...]
  ShutdownSnsTopic:
    [description later in this post]
  
  ZeppelinLogGroup:
    Type: AWS::Logs::LogGroup
    Properties: 
      LogGroupName: !Sub "/ecs/fargate-${ServiceName}"
      RetentionInDays: 1
  ActivityMetricFilter: 
    Type: AWS::Logs::MetricFilter
    Properties: 
      LogGroupName: !Ref ZeppelinLogGroup
      FilterPattern: "INFO"
      MetricTransformations: 
        - 
          MetricValue: "1"
          MetricNamespace: !Sub "${ServiceName}/Actions"
          MetricName: "ActionsCount"
  ZeppelinActionsCountAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: ZeppelinActionsCountAlarm
      MetricName: ActionsCount
      Namespace: !Sub "${ServiceName}/Actions"
      Statistic: SampleCount
      Period: '300'
      EvaluationPeriods: '6'
      TreatMissingData: breaching
      Threshold: '1'
      ComparisonOperator: LessThanOrEqualToThreshold
      AlarmActions:
      - !Ref ShutdownSnsTopic

To provide the auto-shutdown feature, the Serverless Apache Zeppelin solution uses a custom metric. AWS Fargate sends the container logs to an Amazon CloudWatch log group, and a CloudWatch metric filter counts the log lines that contain INFO. If the custom metric is zero for about 30 minutes (six consecutive 5-minute periods), the alarm publishes a message to Amazon Simple Notification Service to terminate the cluster.
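The arithmetic behind the 30-minute window can be checked with a small Python sketch. It is a simplified model of the alarm defined above, not CloudWatch's actual evaluation algorithm; the function names are hypothetical, and the constants mirror the Period, EvaluationPeriods, Threshold, and TreatMissingData values in the template.

```python
from typing import Optional, Sequence

PERIOD_SECONDS = 300      # Period in the alarm definition
EVALUATION_PERIODS = 6    # EvaluationPeriods in the alarm definition
THRESHOLD = 1             # Threshold, compared with LessThanOrEqualToThreshold


def inactivity_window_minutes() -> int:
    """Length of the window the alarm looks at, in minutes."""
    return PERIOD_SECONDS * EVALUATION_PERIODS // 60


def is_breaching(sample_count: Optional[int]) -> bool:
    """A period is idle if it has no data point or at most THRESHOLD log lines."""
    if sample_count is None:
        # TreatMissingData: breaching -> no matching log lines counts as idle
        return True
    return sample_count <= THRESHOLD


def should_alarm(periods: Sequence[Optional[int]]) -> bool:
    """True when the last EVALUATION_PERIODS periods are all idle (simplified model)."""
    recent = list(periods)[-EVALUATION_PERIODS:]
    return len(recent) == EVALUATION_PERIODS and all(is_breaching(c) for c in recent)


if __name__ == "__main__":
    print(inactivity_window_minutes())
    print(should_alarm([None] * 6))
```

Note that with a threshold of 1, a period containing a single matching log line is still treated as idle in this model.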

AWS Fargate

AWSTemplateFormatVersion: '2010-09-09'
Globals:
  Function:
    Timeout: 60
    MemorySize: 128
    Architectures: 
      - arm64
Parameters:
  [...]

Resources:
  ZeppelinApi:
    [...]
  AccessPoint:
    [...]
  FileSystem:
    [...]
  ZeppelinLogGroup:
    [...]
  Cluster:
    Type: AWS::ECS::Cluster
    Properties:
      ClusterName: !Join ['', [!Ref ServiceName, Cluster]]
  ZeppelinTaskDefinition:
    Type: AWS::ECS::TaskDefinition
    Properties:
      RequiresCompatibilities:
        - "FARGATE"
      Cpu: !Ref ContainerCPU
      Memory: !Ref MemoryHardLimit
      NetworkMode: "awsvpc"
      TaskRoleArn: !GetAtt ZeppelinTaskRole.Arn
      ExecutionRoleArn: !GetAtt ZeppelinTaskRole.Arn
      ContainerDefinitions:
        - Name: !Ref ServiceName
          Image: "apache/zeppelin:0.10.0"
          EntryPoint: 
            - /bin/bash
            - -c
            - |
              cp conf/shiro.ini.template conf/shiro.ini 
              /usr/bin/tini -- bin/zeppelin.sh
          Command: ["echo", "done!"]
          MemoryReservation: !Ref MemorySoftLimit
          Memory: !Ref MemoryHardLimit
          PortMappings:
            - ContainerPort: !Ref ContainerPort
              Protocol: tcp
            - ContainerPort: 4040
              Protocol: tcp
          LogConfiguration:
            LogDriver: awslogs
            Options:
              awslogs-group: !Ref ZeppelinLogGroup
              awslogs-region: !Ref AWS::Region
              awslogs-stream-prefix: !Sub 'ecs-${ServiceName}-awsvpc'
          MountPoints:
            - ContainerPath: !Ref ZeppelinPersistNotebookPath
              SourceVolume: !Sub "${ServiceName}"
              ReadOnly: false
      Volumes:
        - Name: !Sub "${ServiceName}"
          EFSVolumeConfiguration:
            AuthorizationConfig: 
              IAM: ENABLED
              AccessPointId: !Ref AccessPoint
            FilesystemId: !Ref FileSystem
            TransitEncryption: ENABLED

Here are the AWS Fargate cluster and task definition. The Serverless Apache Zeppelin solution uses Shiro to enable login and logout capability. As stated here, you can create a shiro.ini file with the cp command, which you can find in the EntryPoint property of the container definition.

AWS Lambda | Workflow

This AWS Lambda function implements the workflow introduced in the high-level architecture.

First, it checks whether the Apache Zeppelin container is running.

If yes, AWS Lambda returns an HTTP 302 redirect to the Apache Zeppelin public IP.
If no, AWS Lambda moves on to the next step.

Then, it checks whether the Apache Zeppelin container exists.

If yes, AWS Lambda returns static web content: a loading page that refreshes automatically every 20 seconds.
If no, AWS Lambda starts a new Apache Zeppelin container and returns the loading page.

Every 20 seconds the client checks the Apache Zeppelin provisioning: it gets the notebook interface if the container is running; otherwise, it gets the loading page. Once you have the notebook interface, you must provide your user credentials to use Apache Zeppelin.
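The decision steps above can be sketched in Python as a pure function. This is a minimal illustration, not the code from the repository: the function name, the action labels, and the default port 8080 are assumptions (the template takes the port from the ContainerPort parameter), and the caller is assumed to look up the task status and public IP through the ECS API before calling it.

```python
from typing import Optional, Tuple

REFRESH_SECONDS = 20  # auto-refresh interval of the loading page

LOADING_PAGE = (
    "<html><head>"
    f'<meta http-equiv="refresh" content="{REFRESH_SECONDS}">'
    "</head><body>Apache Zeppelin is starting...</body></html>"
)


def decide(task_status: Optional[str], public_ip: Optional[str],
           port: int = 8080) -> Tuple[str, dict]:
    """Return (action, HTTP response) for the current state of the Zeppelin task.

    task_status is the ECS lastStatus of the task, or None when no task exists.
    The action tells the caller what to do next: "redirect", "wait" or "start".
    """
    if task_status == "RUNNING" and public_ip:
        # Container is up: send the browser to the notebook interface.
        response = {
            "statusCode": 302,
            "headers": {"Location": f"http://{public_ip}:{port}/"},
            "body": "",
        }
        return "redirect", response

    # No task (or only a stopped one) means a new container must be started;
    # any other status means provisioning is still in progress.
    action = "start" if task_status in (None, "STOPPED") else "wait"
    response = {
        "statusCode": 200,
        "headers": {"Content-Type": "text/html"},
        "body": LOADING_PAGE,
    }
    return action, response
```

Keeping the decision separate from the AWS calls makes it easy to test without an AWS account; in the Lambda handler, the "start" action would be followed by a call that launches the task.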

AWS Lambda | Shutdown

If the custom metric is zero for about 30 minutes, the alarm publishes a message to Amazon Simple Notification Service, and an AWS Lambda function terminates the cluster. The Amazon SNS topic is the trigger for the AWS Lambda function.
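A shutdown function along these lines could look like the Python sketch below. It is an assumption about the shape of the handler, not the repository's code: the function name and the reason text are hypothetical, and the ECS client is passed in as a parameter (in a real Lambda it would be boto3.client("ecs")) so the logic can be exercised with a stub. It stops the running tasks rather than deleting the cluster itself.

```python
import json


def shutdown_on_alarm(event: dict, ecs, cluster: str) -> list:
    """Stop every task in the cluster when an SNS record carries an ALARM state.

    event   -- the SNS event that triggers the Lambda function
    ecs     -- an ECS client exposing list_tasks and stop_task
    cluster -- name or ARN of the ECS cluster that runs Apache Zeppelin
    Returns the list of task ARNs that were stopped.
    """
    in_alarm = False
    for record in event.get("Records", []):
        # CloudWatch alarm notifications arrive as a JSON string in Sns.Message
        message = json.loads(record["Sns"]["Message"])
        if message.get("NewStateValue") == "ALARM":
            in_alarm = True
            break

    if not in_alarm:
        return []

    stopped = []
    for task_arn in ecs.list_tasks(cluster=cluster)["taskArns"]:
        ecs.stop_task(cluster=cluster, task=task_arn,
                      reason="No activity for about 30 minutes")
        stopped.append(task_arn)
    return stopped
```

Checking NewStateValue guards against stopping the container if the topic ever receives a notification for another state, such as OK.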

Usage Suggestions & Improvements

Apache Zeppelin also supports Amazon S3 to persist notebook files. As stated here, you can use ZEPPELIN_NOTEBOOK_STORAGE, ZEPPELIN_NOTEBOOK_S3_BUCKET, and ZEPPELIN_NOTEBOOK_S3_USER as environment variables.
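As a rough illustration, the three variables could be assembled into the Name/Value entries that an ECS container definition expects. The helper name is hypothetical, and the storage class name is the S3 notebook repository class as I recall it from the Zeppelin storage documentation, so verify it against the version you deploy.

```python
def s3_notebook_environment(bucket: str, user: str) -> list:
    """Build container environment entries that switch Zeppelin to S3 storage."""
    return [
        {
            "Name": "ZEPPELIN_NOTEBOOK_STORAGE",
            # Assumed class name of the S3 notebook repository; check your Zeppelin docs
            "Value": "org.apache.zeppelin.notebook.repo.S3NotebookRepo",
        },
        {"Name": "ZEPPELIN_NOTEBOOK_S3_BUCKET", "Value": bucket},
        {"Name": "ZEPPELIN_NOTEBOOK_S3_USER", "Value": user},
    ]


if __name__ == "__main__":
    # Hypothetical bucket and user names
    for entry in s3_notebook_environment("my-zeppelin-notebooks", "zeppelin"):
        print(entry["Name"], "=", entry["Value"])
```

The task role would also need permission to read and write the bucket, which this sketch does not cover.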

Amazon Elastic File System, on the other hand, gives you a very generic solution that can be used for different purposes; the only limit is your imagination. Since Amazon EFS is a file system, your application doesn't have to deal with Amazon S3 object storage. You can simply package your application in a Docker container and run it on AWS Fargate in place of Apache Zeppelin.

For example, you can run a serverless Visual Studio Code; check the container here.

Another improvement, related to Serverless Apache Zeppelin on AWS, is to configure Amazon DynamoDB as an external database for Shiro users.

You can find the solution on GitHub here.
What will be your next application to deploy as Serverless?