Data Lake Training using Python
May 22, 2017 | Posted by: Melissa Richardson
In this three-day training course, we’ll describe how a data lake is much more than a few servers cobbled together: it takes planning, discipline, and governance to make an effective data lake. And we’ll do it using Python.
Data lakes are emerging as an increasingly viable solution for extracting value from big data at the enterprise level, and represent the logical next step for early adopters and newcomers alike. The flexibility, agility, and security of having structured, unstructured, and historical data readily available in segregated logical zones bring a bevy of transformational capabilities to businesses.
What many potential users fail to understand, however, is what defines a usable data lake. Often, those new to big data, and even well-versed Hadoop veterans, will attempt to stand up a few clusters and piece them together with different scripts, tools, and third-party vendors. This method is neither cost-effective nor sustainable.
What the Course Outline Includes:
1. Introduction to Data Lakes
2. Introduction to Python and other languages
3. Lake Basics
4. Extract, Transform, and Load vs. Extract, Load, and Transform
5. Transformers and Provisioners
6. Working with Basic Zones
   a. Transient Zone
   b. Raw Zone
   c. Trusted Zone
   d. Refined Zone
7. Source Connection Manager
   a. Source Type, Credentials, Owner
8. Data Feed Configuration
   a. Feed Name, Type (RDBMS/File/Streaming)
   b. Mode: Incremental / Full / CDC
   c. Expected Latency
   d. Structure information, PII
9. Workflows of core components
   a. Hadoop APIs for files
   b. Sqoop for RDBMS
   c. Kafka, Flume for streaming
10. Operational Stats
   a. What, Who, When, Why
   b. Failures and Notifications
   c. SLA monitoring
11. Application Development Platform
   a. Hadoop components: Spark, MapReduce, Pig, Hive
   b. Abstract and build reusable workflows for common problems
12. Business Rules Integration
   a. Rules provided by the business
13. Workflow Scheduling / Management
   a. Scheduling, dependency
   b. Logging
14. Destination Connections
   a. Destination Type, Credentials, Owner
   b. Provisioning Metadata
   c. Type (RDBMS/File/Streaming)
   d. Filters, if applicable
   e. Mode: Full / Incremental
   f. Frequency: daily / hourly / message
15. Scripts
16. Scaling Python
17. Debugging and Unit Testing Python
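To give a flavor of how the zone and data feed topics in the outline might look in Python, here is a minimal sketch. It is not taken from the course materials: the names Zone, FeedType, LoadMode, FeedConfig, and next_zone are hypothetical, and the assumption that data moves through the four zones in the order listed is ours.

```python
from dataclasses import dataclass, field
from enum import Enum


class Zone(Enum):
    # Zones in the order the outline lists them (assumed to be the flow order).
    TRANSIENT = "transient"
    RAW = "raw"
    TRUSTED = "trusted"
    REFINED = "refined"


class FeedType(Enum):
    RDBMS = "rdbms"
    FILE = "file"
    STREAMING = "streaming"


class LoadMode(Enum):
    INCREMENTAL = "incremental"
    FULL = "full"
    CDC = "cdc"


@dataclass
class FeedConfig:
    """Hypothetical record holding the feed attributes named in the outline."""
    name: str
    feed_type: FeedType
    mode: LoadMode
    expected_latency_minutes: int
    pii_columns: list = field(default_factory=list)  # columns flagged as PII


def next_zone(zone):
    """Return the zone after `zone`, or None once the refined zone is reached."""
    order = list(Zone)  # Enum members iterate in definition order
    idx = order.index(zone)
    return order[idx + 1] if idx + 1 < len(order) else None


# Example: a nightly full load from a relational source.
orders_feed = FeedConfig(
    name="orders",
    feed_type=FeedType.RDBMS,
    mode=LoadMode.FULL,
    expected_latency_minutes=60,
    pii_columns=["customer_email"],
)
```

Keeping feed settings in one typed record like this is only one possible design; a real data lake would usually store them in a metadata catalog rather than in code.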
Labs
Installing Python and writing basic scripts
Language features needed in all applications
Basic features of Python
Core of Python
Matrices
Basketball Dataset
Homework challenge
Demographic data
Movie ratings