Data Lake Training Using Python

May 22, 2017 | Posted by: Melissa Richardson

In this three-day training course, we’ll describe how a data lake is much more than a few servers cobbled together: it takes planning, discipline, and governance to make an effective data lake. And we’ll do it using Python.

Data lakes are emerging as an increasingly viable solution for extracting value from big data at the enterprise level, and represent the logical next step for early adopters and newcomers alike. The flexibility, agility, and security of having structured, unstructured, and historical data readily available in segregated logical zones bring a bevy of transformational capabilities to businesses.

What many potential users fail to understand, however, is what defines a usable data lake. Often, those new to big data, and even well-versed Hadoop veterans, will attempt to stand up a few clusters and piece them together with different scripts, tools, and third-party vendors. This method is neither cost-effective nor sustainable.

What the Course Outline Includes: 

1. Introduction to Data Lakes

2. Introduction to Python and Other Languages

3. Lake Basics

4. Extract, Transform, and Load (ETL) vs. Extract, Load, and Transform (ELT)

5. Transformers and Provisioners

6. Working with Basic Zones

a. Transient Zone

b. Raw Zone

c. Trusted Zone

d. Refined Zone

7. Source Connection Manager

a. Source Type, Credentials, Owner

8. Data Feed Configuration

a. Feed Name, Type (RDBMS/File/Streaming)

b. Mode: Incremental / Full / CDC

c. Expected Latency

d. Structure Information, PII

9. Workflows of Core Components

a. Hadoop APIs for files

b. Sqoop for RDBMS

c. Kafka and Flume for streaming

10. Operational Stats

a. What, Who, When, Why

b. Failures and Notifications

c. SLA Monitoring

11. Application Development Platform

a. Hadoop components: Spark, MapReduce, Pig, Hive

b. Abstract and build reusable workflows for common problems

12. Business Rules Integration

a. Rules provided by the business

13. Workflow Scheduling / Management

a. Scheduling, dependencies

14. Destination Connections

a. Destination Type, Credentials, Owner

b. Provisioning Metadata

c. Type (RDBMS/File/Streaming)

d. Filters, if applicable

e. Mode: Full / Incremental

f. Frequency: daily / hourly / per message


15. Scaling Python

16. Debugging and Unit Testing Python
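As a small taste of the hands-on work in the zone and ELT modules above, here is a minimal sketch of landing a feed file and promoting it through the lake zones. The directory layout and function names are illustrative assumptions for this post, not a standard or the exact course code:

```python
import shutil
from pathlib import Path
from tempfile import mkdtemp

# Hypothetical zone layout mirroring the course's transient -> raw ->
# trusted -> refined progression (an assumption for illustration).
LAKE_ROOT = Path(mkdtemp())
for zone in ["transient", "raw", "trusted", "refined"]:
    (LAKE_ROOT / zone).mkdir()

def land_file(name: str, payload: str) -> Path:
    """Land an incoming feed file in the transient zone."""
    path = LAKE_ROOT / "transient" / name
    path.write_text(payload)
    return path

def promote(path: Path, target_zone: str) -> Path:
    """Copy a file forward one zone, keeping the upstream copy for audit."""
    dest = LAKE_ROOT / target_zone / path.name
    shutil.copy2(path, dest)
    return dest

# ELT style: load first (transient -> raw), transform later.
landed = land_file("orders.csv", "id,amount\n1,9.99\n\n2,5.00\n")
raw = promote(landed, "raw")

# A trivial "transform" on promotion to trusted: drop blank lines.
trusted = LAKE_ROOT / "trusted" / raw.name
clean_lines = [l for l in raw.read_text().splitlines() if l.strip()]
trusted.write_text("\n".join(clean_lines) + "\n")

print(sorted(p.name for p in (LAKE_ROOT / "trusted").iterdir()))  # ['orders.csv']
```

Copying rather than moving between zones keeps every upstream version available for audit, which is one reason the course treats zones as append-forward stages rather than a single mutable store.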


Installing Python and writing basic scripts

Language features needed in all applications

Basic features of Python

Core of Python
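A warm-up script of the kind written in these introductory sessions might parse a few records using only core language features, with no third-party libraries. The record format here is invented for illustration:

```python
# Parse 'date,zone,file_count' records with core Python only:
# functions, comprehensions, and plain dicts.
records = [
    "2017-05-01,raw,128",
    "2017-05-02,trusted,96",
    "2017-05-02,raw,64",
]

def parse(line):
    """Split a 'date,zone,file_count' line into a dict."""
    date, zone, count = line.split(",")
    return {"date": date, "zone": zone, "count": int(count)}

parsed = [parse(line) for line in records]

# Aggregate file counts per zone with a plain dict.
totals = {}
for row in parsed:
    totals[row["zone"]] = totals.get(row["zone"], 0) + row["count"]

print(totals)  # {'raw': 192, 'trusted': 96}
```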


Hands-on datasets and exercises:

Basketball dataset

Homework challenge

Demographic data

Movie ratings
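The course datasets aren't published with this post, so as an illustration only, here's a sketch of the kind of exercise done against the movie-ratings data, assuming a simple `user_id,movie,rating` CSV shape:

```python
import csv
import io
from statistics import mean

# Inline sample standing in for the course's movie-ratings file
# (the real dataset's columns and contents are not published here).
SAMPLE = """user_id,movie,rating
1,Hoop Dreams,5
2,Hoop Dreams,4
1,Space Jam,3
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))

# Average rating per movie, using only the standard library.
movies = {row["movie"] for row in rows}
averages = {
    m: mean(int(r["rating"]) for r in rows if r["movie"] == m)
    for m in movies
}
print(averages["Hoop Dreams"])  # 4.5
```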

