
Algorithms. Technologies. Business cases

Schedule:
- Tue, Thu 19:00–22:00
- Sat 11:00–14:00

Address:
Office of MegaFon,
41 Oruzheyniy Lane, Moscow
March 24 – June 25, 2020

BIG DATA SPECIALIST 12.0

Big Data is no longer hype
but a real need for many companies and professionals
The amount of data in organizations is growing exponentially, and analyzing it with standard tools is becoming increasingly difficult. This is where distributed processing technologies come to the rescue: the Hadoop ecosystem (HDFS, MapReduce, Hive, HBase) and Apache Spark.
What is included in the program?
10 labs
Each week you complete a lab task and an advanced one
2 projects
In addition to the labs, you work in teams on two projects, six weeks each
36 lessons
With live broadcasts and videos on our portal
We developed the program for
-1-
Programmers
You have a solid programming background but lack knowledge and skills in data analysis? In the program, you will learn how to use various machine learning algorithms, including in Apache Spark.
-2-
Analysts
You already know how to analyze data but are looking for new tools? After the first week, you will know how to deploy a Hadoop cluster in the cloud and will be able to use this knowledge for a pilot project at work.
-3-
Managers
Are you developing a product or a division? During the program you will get hands-on experience in big data analysis, trying many things yourself.
What will you learn?
Our program has three components
Algorithms
Learn to process data in Pandas, build machine learning models (logistic regression, decision trees, random forests) in Scikit-learn, analyze text data, and apply various recommender-system algorithms.
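For instance, a first model of the kind covered in the Algorithms track might look like this (a minimal sketch: the file users.csv and its columns are hypothetical):

    # Minimal Pandas + Scikit-learn sketch; the data file and column
    # names are hypothetical.
    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("users.csv")                  # load tabular data with Pandas
    X, y = df.drop(columns=["target"]), df["target"]
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print("accuracy:", model.score(X_test, y_test))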
Technologies
Learn to write MapReduce jobs in Python using Hadoop Streaming, write SQL-like queries in Hive to solve analytical problems, work with the HBase column-oriented database, access data on HDFS, and analyze data in Apache Spark.
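To give a sense of the Technologies track: a Hadoop Streaming job is just a pair of Python scripts that read stdin and write stdout. A minimal counting job might look like this (a sketch; the tab-separated log layout is an assumption):

    # mapper.py -- emit (key, 1) for every log line; the tab-separated
    # layout with a user id in the first column is an assumption.
    import sys
    for line in sys.stdin:
        uid = line.rstrip("\n").split("\t")[0]
        print(f"{uid}\t1")

    # reducer.py -- sum the counts per key; Hadoop Streaming delivers
    # the mapper output sorted by key, so a running total suffices.
    import sys
    current, total = None, 0
    for line in sys.stdin:
        key, value = line.rstrip("\n").split("\t")
        if key != current and current is not None:
            print(f"{current}\t{total}")
            total = 0
        current = key
        total += int(value)
    if current is not None:
        print(f"{current}\t{total}")

The two scripts are separate files, submitted to the cluster with the hadoop-streaming jar via its -mapper and -reducer options.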
Business
Learn to choose the right metrics for your tasks, collect requirements before starting a project, evaluate the financial effect of introducing a model, and use storytelling to present your results.
Module 1. Building a DMP system
Project: predicting users' gender and age from their logs
After only the first week of the program, you will know how to deploy a Hadoop cluster in the cloud using the HortonWorks distribution, and you will write your first MapReduce job using Hadoop Streaming and Python.
In this lab, you will need to filter the logs stored on HDFS (a distributed file system) and load them into an HBase table (a column-oriented database) using a map-only job.
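One possible shape of that map-only step, sketched with the happybase HBase client (the Thrift host, table name, column family, and log layout are all assumptions):

    # Map-only filter: read log lines from stdin, keep the well-formed
    # ones, and write them to HBase. All names here are illustrative.
    import sys
    import happybase

    connection = happybase.Connection("hbase-thrift-host")  # assumed gateway
    table = connection.table("user_logs")

    for line in sys.stdin:                       # Hadoop Streaming feeds stdin
        uid, url, ts = line.rstrip("\n").split("\t")
        if url.startswith("http"):               # filter out malformed records
            table.put(uid.encode(), {b"data:url": url.encode(),
                                     b"data:ts": ts.encode()})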


Using simple heuristics, you will classify users by interests (drivers, entrepreneurs, housewives, etc.). This time you will use Hive.
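The heuristics could be as simple as counting visits to characteristic domains. A sketch of such a Hive query, launched from Python (the table, columns, and thresholds are made up):

    # Classify users by interest with a visit-count heuristic in Hive.
    # Table name, columns, and thresholds are illustrative.
    import subprocess

    query = """
    SELECT uid,
           CASE WHEN SUM(IF(url LIKE '%auto%',   1, 0)) > 10 THEN 'drivers'
                WHEN SUM(IF(url LIKE '%recipe%', 1, 0)) > 10 THEN 'housewives'
                ELSE 'other'
           END AS interest
    FROM user_logs
    GROUP BY uid
    """
    subprocess.run(["hive", "-e", query], check=True)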


Using anonymized data on a bank's customers, you will need to predict the likelihood of each of them leaving the bank within the next few months.
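Since the lab asks for a likelihood rather than a yes/no label, the model's probability output is what matters. A sketch (the file bank_clients.csv and its churned column are hypothetical):

    # Churn prediction sketch: output a probability of leaving, not a class.
    # The data file and column names are hypothetical.
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier

    df = pd.read_csv("bank_clients.csv")
    X, y = df.drop(columns=["churned"]), df["churned"]

    clf = RandomForestClassifier(n_estimators=200, random_state=42).fit(X, y)
    proba = clf.predict_proba(X)[:, 1]           # P(client leaves the bank)
    print(proba[:5])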

In this lab you will need to find similar job descriptions. The advanced task is to take part in a Kaggle competition on determining the sentiment of online reviews.
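A common approach to "similar texts" is TF-IDF vectors compared by cosine similarity; a sketch with toy descriptions standing in for the lab's data:

    # Find similar texts with TF-IDF + cosine similarity; the toy
    # descriptions stand in for the lab's real job postings.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    descriptions = [
        "Senior Python developer with Hadoop and Spark experience",
        "Data engineer: Python, Hive, HBase, Spark",
        "Office manager, no technical skills required",
    ]
    vectors = TfidfVectorizer().fit_transform(descriptions)
    sim = cosine_similarity(vectors)             # pairwise similarity matrix
    print(sim.round(2))                          # the two technical ads score highest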
Module 2. Development of a recommender system
Project: a product recommender system for an online store
The task is to build various top lists of films to recommend to users about whom we have no data.
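With no information about a user, a natural baseline is global popularity. A sketch in Pandas (ratings.csv and its columns are hypothetical):

    # Cold-start baseline: recommend the films that are popular overall.
    # The ratings file and its columns are hypothetical.
    import pandas as pd

    ratings = pd.read_csv("ratings.csv")         # user_id, film_id, rating
    stats = ratings.groupby("film_id")["rating"].agg(["mean", "count"])
    top10 = (stats[stats["count"] >= 50]         # drop rarely rated films
                  .sort_values("mean", ascending=False)
                  .head(10))
    print(top10)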


By calculating the similarity of online course descriptions, you will need to identify courses that can be recommended alongside a given one.

Using matrix factorizations, develop recommendations that take into account the genre, style, and other implicit factors of a film.
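One standard tool for this is ALS in Apache Spark, which learns latent (implicit) factors for every user and film. A sketch (the ratings file and its integer id columns are assumptions):

    # Latent-factor recommender with ALS in Spark; file and column names
    # are assumptions, and ALS expects integer user/film ids.
    from pyspark.sql import SparkSession
    from pyspark.ml.recommendation import ALS

    spark = SparkSession.builder.appName("als-demo").getOrCreate()
    ratings = spark.read.csv("ratings.csv", header=True, inferSchema=True)

    als = ALS(userCol="user_id", itemCol="film_id", ratingCol="rating",
              rank=10, coldStartStrategy="drop")  # 10 latent factors
    model = als.fit(ratings)
    model.recommendForAllUsers(5).show()          # top-5 films per user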

A competition in which you will need to achieve the best score by combining different recommender-system algorithms.
Using data on the TV shows watched by different users, make recommendations for films available by subscription.

Our instructors are industry practitioners who can explain complex things in simple words
Anton Pilipenko
Data Engineer,
Lamoda
Nikolay Markov
Senior Data Engineer,
Aligned Research Group
Organizer of the PyData conference and Data Science breakfasts
Andrey Zimovnov
Senior developer,
Yandex.Zen
Alexander Ulyanov
Data Science Executive Director, Sberbank
Oleg Khomyuk
Head of R&D,
Lamoda
Alexander Filatov
Product Analytics Manager,
VISA
Vladimir Opanasenko
Executive Director, Gazprombank
Kirill Danilyuk
Engineering Manager,
Self-Driving Car, Yandex
Program infrastructure
What you will be working with every day
Cluster
Our program is about big data, which is why you will work with a Hadoop cluster that we administer, configure, and support.
GitHub
We upload all presentations, Jupyter notebooks, labs, and manuals to a private repository on GitHub. This tool has become the standard among programmers and data professionals.
Our portal
Here you can check the correctness of your lab solutions with automatic checkers, and also watch live broadcasts and recordings of previous classes.
Slack
All communication during the program happens in Slack, a convenient team messenger. There you can ask questions during live broadcasts, talk to instructors, organizers, and each other, follow GitHub updates, and stay informed of program news.
Infrastructure partner
You need to know
Program prerequisites
Python 3
This is the main programming language used in the program. It's good if you are already familiar with the basic syntax: loops, conditional statements, functions, and reading and writing files.
Basic Linux knowledge
You'll spend a lot of time in the Linux command line working with the cluster. It's great if you are already familiar with navigating directories, creating and editing files, and connecting to a remote server via SSH.
SQL
During the program you will use tools such as Hive and Apache Spark. To work with them, you need to be able to write SQL queries: selects, joins, filters, and subqueries.
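Roughly the level expected, here expressed through Spark SQL from Python (the logs and users tables are assumed to be already registered):

    # The kind of SQL assumed: a join, a filter, and an aggregate.
    # The logs and users tables are assumed to exist in the metastore.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("sql-demo").getOrCreate()
    spark.sql("""
        SELECT u.uid, COUNT(*) AS visits
        FROM logs l JOIN users u ON l.uid = u.uid
        WHERE l.url LIKE '%shop%'
        GROUP BY u.uid
        HAVING COUNT(*) > 10
    """).show()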
Statistics and linear algebra
In the program we will cover advanced methods of data analysis, so it's good if you know the basics of statistics and linear algebra: mean, variance, probability, Bayes' theorem, correlation, matrix rank.
Reviews
Where our alumni work
Here they live and work

Our principles in teaching
To make learning effective and interesting, we use andragogy, the principles of adult learning
-1-
The material is focused on specific tasks
Our goal is to teach you to solve real-life problems, not just cover a list of topics. Theory is only a tool for solving problems, not a goal in itself.
-2-
The ability to apply new knowledge immediately
After the first week you will know how to deploy your own Hadoop cluster in the cloud and will be able to use this knowledge for a pilot project at work.
-3-
Independence in lab tasks

Our lab tasks are composed so that you often need to google things yourself. By the end of the program, you will have built up your own stock of quality resources for tackling a variety of tasks.
We will be happy to answer your questions
Please leave your questions and contact details below