
Creating data processing pipelines

Schedule:
Mon, Wed, Fri 19:00 - 22:00


Address:
MegaFon office,
41 Oruzheyniy Lane, Moscow
March 11 - April 27, 2020

DATA ENGINEER 6.0

Data must be accessible
And also complete, accurate, timely, interconnected, consistent, and relevant
Behind any product or service – whether it is a recommender system on a website, personalized offers, or a customer retention campaign – there is data. The quality of decisions depends on the quality of this data: garbage in, garbage out. A data engineer is responsible for delivering quality data from various sources (for example, a company's website, CRM, or social networks). Employers cannot fill vacancies for these specialists for months.
What is included in the program
6 labs
Almost every week you will need to complete a lab task and an advanced version of it. The labs are combined into two projects: lambda architecture and kappa architecture.
10+ instruments
Some of the tools you will work with in depth: Kafka, HDFS, ClickHouse, Spark, Airflow. Others you will practice less: ELK, Flink, Docker, Grafana, Kubernetes, etc.
21 lessons
With live broadcasts and videos on the portal. Classes are more like workshops, where instructors demonstrate different use cases for the tools, showing pitfalls and best practices.
We developed the program for
-1-
Data engineers
Do you have experience with some tools and want to gain experience with others? You can do this by solving our labs and exercises and asking our instructors questions.
-2-
Database admins
Do you know how to work with classic relational databases and want to gain experience with other data storage tools? During the program you can work with HDFS, ClickHouse, Kafka, and Elasticsearch.
-3-
Managers
Are you developing a product or leading a division? In the program you will get an understanding of which tools can be used for which tasks, along with their advantages and disadvantages.
What will you learn
Our program has three components
Installation
You will learn how to install all the tools used on the program by following our detailed manuals.
Customization
You will learn how to connect tools to each other, creating pipelines and getting a baseline solution.
Tuning
You will learn to improve the performance and fault tolerance of both individual tools and entire pipelines.
Project 1. Lambda-architecture
Before starting the project, you will complete a preparatory phase: deploying your own cluster in the cloud. Then you will organize the collection of data about users visiting various pages of the site and about their purchases.
In this lab, you will organize the batch layer of the lambda architecture. You will read the data from Kafka and put it into HDFS. Using Airflow, you will transfer pre-processed data to ClickHouse.
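To give a feel for the orchestration step, here is a minimal sketch of an Airflow DAG that moves a daily batch of pre-processed data from HDFS into ClickHouse. The DAG name, file paths, table name, and the Spark job it calls are illustrative assumptions, not the lab solution.

# Illustrative sketch only: the DAG name, paths, table, and Spark job are assumptions,
# not the actual lab solution.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="batch_layer_to_clickhouse",   # hypothetical name
    start_date=datetime(2020, 3, 11),
    schedule_interval="@daily",
    catchup=False,
) as dag:

    # Pre-process the raw page-view events that have already landed in HDFS.
    aggregate = BashOperator(
        task_id="aggregate_events",
        bash_command=(
            "spark-submit /opt/jobs/aggregate_visits.py "      # hypothetical job
            "--input hdfs:///data/visits/{{ ds }} "
            "--output hdfs:///data/visits_agg/{{ ds }}"
        ),
    )

    # Load the aggregated partition into ClickHouse for analytical queries.
    load = BashOperator(
        task_id="load_to_clickhouse",
        bash_command=(
            "hdfs dfs -cat hdfs:///data/visits_agg/{{ ds }}/*.csv | "
            "clickhouse-client --query 'INSERT INTO visits_daily FORMAT CSV'"
        ),
    )

    aggregate >> load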

Using Spark Streaming, you will build a speed layer that processes data in real time, filling in the information that the batch layer does not yet have.
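To illustrate what such a speed layer can look like, below is a minimal PySpark Structured Streaming sketch that reads events from Kafka and keeps a running aggregate. The broker address, topic name, event schema, and console sink are assumptions made for the example; running it also requires the spark-sql-kafka package.

# Illustrative speed-layer sketch: broker, topic, schema, and sink are assumptions.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import DoubleType, StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("speed-layer-sketch").getOrCreate()

# Hypothetical schema of the purchase events published to Kafka.
schema = StructType([
    StructField("user_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("ts", TimestampType()),
])

events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")   # assumed broker address
    .option("subscribe", "purchases")                  # assumed topic
    .load()
    .select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
    .select("e.*")
)

# Real-time revenue per 10-minute window: the part the batch layer has not processed yet.
revenue = (
    events
    .withWatermark("ts", "15 minutes")
    .groupBy(F.window("ts", "10 minutes"))
    .agg(F.sum("amount").alias("revenue"), F.count("*").alias("purchases"))
)

# In the lab this would write to a real serving store; the console sink keeps the sketch simple.
query = revenue.writeStream.outputMode("update").format("console").start()
query.awaitTermination()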


The first project ends with connecting one of the BI tools to both layers, batch and speed, to perform analytical queries about the average check and other metrics.
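The kind of query that ends up behind such a dashboard might look like the following, shown here through the Python clickhouse-driver client. The host and table names are assumptions, and in practice the BI tool would issue the SQL itself.

# Sketch of an analytical query combining the two layers; host and table names are assumptions.
from clickhouse_driver import Client   # pip install clickhouse-driver

client = Client(host="clickhouse")     # assumed host

# Average check over history (batch layer) plus the most recent events (speed layer).
rows = client.execute("""
    SELECT avg(amount) AS avg_check
    FROM
    (
        SELECT amount FROM purchases_batch      -- filled by Airflow from HDFS
        UNION ALL
        SELECT amount FROM purchases_realtime   -- filled by Spark Streaming
    )
""")
print(rows)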

Project 2. Kappa-architecture
As part of this project, you will need to build a machine learning model using Spark ML, and then use it to predict the gender and age of users who visit the website.
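A minimal PySpark ML sketch of the kind of model this involves is shown below; the training table, its columns (a list of visited domains plus a known gender label), and the file paths are all hypothetical.

# Illustrative Spark ML sketch: paths and column names are assumptions, not the lab solution.
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import CountVectorizer, StringIndexer
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gender-model-sketch").getOrCreate()

# Hypothetical training set: for each user, the list of visited domains and a known gender.
train = spark.read.parquet("hdfs:///data/train_users.parquet")

# Turn the string label into the numeric column the classifier expects.
indexer = StringIndexer(inputCol="gender", outputCol="label").fit(train)

pipeline = Pipeline(stages=[
    CountVectorizer(inputCol="visited_domains", outputCol="features"),
    LogisticRegression(maxIter=20),
])
model = pipeline.fit(indexer.transform(train))

# The fitted pipeline then scores visitors whose gender is unknown.
unknown = spark.read.parquet("hdfs:///data/unknown_users.parquet")
model.transform(unknown).select("user_id", "prediction").show(5)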

The second project ends with connecting a BI tool that, on request, can produce the requested audience segments from the full event history without using a batch layer.
Our instructors – industry practitioners who can explain complex things in simple words
Anton Pilipenko
Data Engineer, Lamoda
Nikolay Markov
Senior Data Engineer, Aligned Research Group

Andrey Titov
Senior Spark Engineer, NVIDIA
Egor Mateshuk
Head of Analytics, Data Science and Data Engineering Department, MaximaTelecom
Pavel Tarasov
Head of ML department, CIAN
Victor Egorov
Senior DBA, Data Egret
Igor Mosyagin
R&D-developer, Lamoda
Vadim Madison
Head of development, M-Tech
Program infrastructure
What you will be working with every day
Cluster
Our program is about big data, which is why you will work with a Hadoop cluster that we administer, configure, and support.
GitHub
We upload all the presentations, Jupyter notebooks, labs, and manuals to a private repository on GitHub. This tool has become the standard among programmers and data professionals.
Our portal
Here you can check your lab solutions using automatic checkers, and also watch live broadcasts and recordings of previous classes.
Slack
All communication during the program happens in Slack, a convenient messenger for teams. There you can ask questions during the live broadcasts, communicate with instructors, organizers, and each other, follow updates on GitHub, and stay informed of program news.
You need to know
Program prerequisites
Python 3
This is the main programming language used within the program. It's good if you are already familiar with the basic syntax, loops, conditional statements, functions, and reading and writing files.
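For reference, that roughly means being comfortable reading and writing a short script like this hypothetical one:

# Roughly the Python level expected at the start of the program (illustrative example).
def count_events(path):
    """Count how many times each event type appears in a CSV file."""
    counts = {}
    with open(path) as f:
        for line in f:
            event = line.strip().split(",")[0]
            counts[event] = counts.get(event, 0) + 1
    return counts

if __name__ == "__main__":
    for event, n in count_events("events.csv").items():   # hypothetical input file
        print(event, n)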
Linux basic knowledge
You'll spend a lot of time in the Linux command line working with the cluster. It's great if you are already familiar with navigating directories, creating and editing files, and connecting to a remote server via ssh.
SQL
During the program you will use tools such as Hive and Apache Spark. To work with them, you need to know how to write SQL queries: selects, joins, filters, subqueries.
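As a rough benchmark, you should be able to read a query like the one below without difficulty. It is shown as it might be run through Spark SQL, with hypothetical tables assumed to be registered in the metastore.

# Illustrative query at the expected SQL level; table and column names are assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().appName("sql-prereq-example").getOrCreate()

top_buyers = spark.sql("""
    SELECT u.user_id,
           count(*)      AS purchases,
           sum(p.amount) AS total_spent
    FROM purchases p
    JOIN users u ON u.user_id = p.user_id
    WHERE p.amount > 0
    GROUP BY u.user_id
    HAVING count(*) >= 3
    ORDER BY total_spent DESC
    LIMIT 10
""")
top_buyers.show()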
Hadoop
During the program, you will deploy your own Hadoop cluster and work with YARN and HDFS. It's good if you are already familiar with these tools and understand what they are needed for.
Where our alumni work
Here they live and work

Our principles in teaching
To make learning effective and interesting we use andragogy, the practice of teaching adult learners
-1-
The material is focused on specific tasks
Our goal is to teach you to solve real-life problems, not just cover a list of topics. Theory is only a tool for solving problems, not a goal in itself.
-2-
The ability to apply new knowledge immediately
After the first week you will know how to deploy your own Hadoop cluster in the cloud and will be able to use this knowledge for a pilot project at work.
-3-
Autonomy in lab tasks

Our lab tasks are composed in such a way that you will often need to google things yourself. After the program, you will have your own stock of quality resources for dealing with different tasks.
We will be happy to answer your questions
Leave your question and contact details below
We will get back to you
I have read and accept your Privacy Policy.