Talks

Arató Bence

Managing director, BI Consulting

Managing Director of BI Consulting Hungary. He has been in the BI industry since 1995 as an analyst, architect and consultant. He advises companies on general BI strategy, project and architecture planning, and vendor and tool selection. Also provides QA and on-the-job mentoring services.
He leads the research activities of the yearly BI-TREK and DW-TREK surveys, which collect information and user feedback about the local BI & DW market. He also teaches several BI and DW related classes for the Hungarian BI Academy.
He has been writing about business analytics on the BI.hu website since 1998 and tweets as @bencearato.

The State of Data

The last few years have been a very intense period in the data world, with numerous new technologies emerging and taking a central place in many applications. A vast number of new data sources have also become easily available, from detailed activity logs to sensor data.
Based on these new data sources and technologies, we are now collecting, using, and analyzing data in new, creative, inspiring and sometimes alarming ways.
The talk gives an overview of the current technology trends in data warehousing and Big Data, and gives a few examples of how these trends affect our everyday lives.

Tóth Zoltán

Tech Lead Data Services, Prezi

Prior to joining Prezi Zoltán worked as a developer/architect for pharmaceutical market research companies. Now, as Senior Data Engineer, he helps Prezi build and operate a world-class data infrastructure.

Lessons learned building a petabyte-scale data infrastructure

Back in 2011 at Prezi we started off with a single SQL query that worked on a few megabytes of data and produced somewhat accurate numbers satisfying basic business needs. This used to be our BI platform. Today we run a data infrastructure with around 70 high-performance servers that crunch hundreds of gigabytes of data and feed hundreds of reports day by day.
Along this journey we used standard Unix and statistical software, later on-premise Hadoop clusters, NoSQL databases and third-party BI tools. Learning from our mistakes we rebuilt our data infrastructure and ETL systems many times.
I’ll share the successes and misses we encountered throughout this journey with a special focus on our current experiences with managed solutions such as Amazon’s Elastic MapReduce Hadoop solution and Redshift, Amazon’s hosted data warehouse solution.

Luis Moreno Campos

EMEA Big Data Solutions Lead, Oracle

Luis is a Big Data Solutions director at Oracle for EMEA, doing Business Development, Marketing Campaigns, Partner Development and Sales Enablement.
A regular speaker at industry and technology events, CIO roundtables, technology user groups, university seminars, and marketing events, Luis has international experience in all phases of Big Data projects.
Prior to joining Oracle he held roles in consulting, training and business development, with expertise in Telco, Financial Services and Media.
An enthusiast of what data-driven solutions can do to trigger innovation in every sector, as well as of the social impact of technology, Luis travels the globe helping businesses innovate and gain a competitive edge from Big Data.
When not traveling, Luis is in Lisbon with his family, dogs and friends. A tennis lover, stand-up comedy fan, and passionate about cooking, Luis is also known as a regular blogger, book publisher and Twitter addict!

Big Data at Work

Big Data is a phenomenon. Big Data is the datafication of everything in business, government and even private life.
We are in the very early stages of datafication, and already we have seen big changes. But there is a basic issue, as you know: the world's ability to produce data has outstripped most companies' ability to use it.
Companies need not only more processing power to get value out of big data, but a new way of thinking about what value they can get.
To change the business, you have to take the data available to you and figure out what you can learn from it. And as data grows exponentially, you need new technologies to dramatically reduce the time, cost and effort of forming and testing hypotheses. But these two approaches are more powerful together than either alone.
Combining these two approaches into a seamless deployment is Big Data at work.
This session will show how Big Data is forcing a top-to-bottom re-evaluation of several industries, and how organisations using Oracle are currently reaping value from the investments they have made.

Wouter de Bie

Team Lead Data Infrastructure, Spotify

Wouter started his career at an early age as a Linux consultant in the Netherlands during the dot-com era. The company, Stone-IT, was the largest Linux integrator in the Netherlands at the time, and Wouter integrated Linux in different environments, ranging from SMB to enterprise.
When the dot-com bubble burst, Wouter decided to get a bachelor's degree in computer science while working for various companies as a developer and system administrator. After finishing his studies he worked at McNolia, where he was responsible for their hosting environment and gradually became a project manager.
In 2007 Wouter decided to pursue a more technical career as a freelance developer and technical lead. Late 2008 he co-founded Jewel Labs, a company focussed on developing software for cinema and festival ticketing, where he acted as CTO.
In 2009 Wouter decided to move to Sweden for personal reasons and worked as a Ruby developer and system administrator at Delta Projects, one of Sweden’s biggest online ad serving companies, before he decided to join Spotify in 2011.
Wouter is currently working as a team lead for Data Infrastructure at Spotify where he manages a team of 10 data engineers that take care of building the infrastructure that enables the rest of Spotify to work with data. Next to that, he’s responsible for Spotify’s data architecture.

Using Big Data for fast product iterations to drive user growth

In this talk we'll look at how Spotify uses data and Big Data technology to make fast iterations on the Spotify product. Some of the questions we'll try to answer are "Why is fast product iteration important for us?", "How does data tie into this?" and "What is it we do to achieve this?"

Stephen Brobst

Chief Technology Officer, Teradata

Stephen performed his graduate work in Computer Science at the Massachusetts Institute of Technology where his Masters and PhD research focused on high-performance parallel processing. He also completed an MBA with joint course and thesis work at the Harvard Business School and the MIT Sloan School of Management. Stephen has been on the faculty of The Data Warehousing Institute since 1996. During Barack Obama’s first term he was also appointed to the Presidential Council of Advisors on Science and Technology (PCAST) in the working group on Networking and Information Technology Research and Development (NITRD).

Best Practices in data warehouse architecture

Proper architecture of a data warehouse has a significant impact on the return on investment obtained from its deployment. This seminar provides a taxonomy of data warehouse topologies and discussion of best practices for enterprise data warehouse deployment. Implementation techniques using integrated, federated, and data mart architectures are discussed along with rules of thumb for when and how to implement these structures as required by analytic applications. A framework for understanding cost and value implications of the various approaches will be described.

  • Learn about the performance tradeoffs of different data warehouse architectures.
  • Learn about the speed-of-delivery tradeoffs between different data warehouse architectures.
  • Learn about the cost tradeoffs between different data warehouse architectures.
  • Learn how to use business requirements to choose best-of-breed architecture.

Todd Goldman

Vice president for Data Integration, Informatica

Todd Goldman is the Vice President and General Manager for Enterprise Data Integration at Informatica, the world’s number one independent provider of data integration software. Under his direction, Informatica delivers software that allows organizations to increase business agility by accelerating data integration agility and by managing data integration as a strategic process and competitive differentiator. Before joining Informatica Mr. Goldman held leadership roles at Nlyte Software, Exeros, ScaleMP, America Online/Netscape and Hewlett-Packard. Todd has an M.B.A. from the Kellogg Graduate School of Management and a B.S.E.E. from Northwestern University.

Great Data Isn't an Accident, It Happens by Design

Great data isn’t an accident. Great data happens by design. The challenge is that data is becoming increasingly fragmented. The explosion of technologies around the Internet of things and Hadoop, along with the ability of lines of business to purchase their own applications in the cloud, makes managing data and achieving great insights a challenge. This session will focus on:

  • The effects of the evolving data landscape
  • How achieving great insight happens by integrating across fragmented data boundaries and automating data quality process
  • The characteristics of companies that are leading in use of quality information to transform their organizations (vs those that don’t)

Marcin Bednarski

CRM & Retail Controlling Dep. Director, PKO BP

Marcin has been active in the financial/banking sector for more than 20 years. Working in several different positions, he has gathered vast experience in Accounting, Retail Controlling, Market Price Risk Controlling, Call Center management and Internet banking development. Since 2010 he has been the Head of the CRM and Retail Controlling Department at PKO BP. He is married and has 2 daughters. The whole family is a big fan of Apple and modern technology.

Interactive CRM at PKO Bank

This presentation tells the story of the real-time interactive CRM journey of PKO Bank Polski, the leading bank in Poland and CEE. It is a real business case of a company that believes in the power of modern technology while being a traditional bank with a long history. It is the story of how new approaches merge with traditions and bring value to everybody. The presentation also includes some good examples of innovative banking solutions from all over the world. The aim is to send a message to the entire banking world: banks must change or die, because customer expectations are changing.

Stephan Ewen

Ph.D. Student, TU Berlin

Stephan Ewen is a Ph.D. student at the Berlin University of Technology. He is working on the Stratosphere project, which is creating a next-generation system for Big Data analysis. Stephan has architected and co-architected many components of Stratosphere, including its programming abstractions, compiler and optimizer, query runtime and support for iterative algorithms. Aside from his work on Stratosphere, Stephan was an intern in the field of data analytics at Microsoft Research, IBM Research, and IBM Germany Development. He holds a master's degree in computer science from the University of Stuttgart.

The Stratosphere big data analytics platform

Stratosphere is a next-generation, Apache-licensed platform for Big Data analysis, and the only one originated in Europe. Stratosphere offers an alternative runtime engine to Hadoop MapReduce, but uses HDFS for data storage and runs on top of YARN. Stratosphere features a scalable and very efficient backend, architected using the principles of MPP databases but not restricted to the relational model or SQL. Stratosphere's runtime streams data rather than processing it in batches, and uses out-of-core implementations for data-parallel processing tasks, gracefully degrading to disk if main memory is not sufficient. Stratosphere is programmable via a Java or Scala API similar to Cascading, which includes common operators such as map, reduce, join, cogroup, and cross. Analysis logic is specified without the need to manually link user-defined functions. Stratosphere includes a cost-based program optimizer that automatically picks data shipping strategies and reuses prior sorts and partitions. Finally, Stratosphere features end-to-end, first-class support for iterative programs, achieving performance similar to Giraph while still being a general (not graph-specific) system. Stratosphere is a mature codebase, developed by a growing developer community, and is currently seeing its first commercial installations and use cases.

Papp Lajos

DevOps, SequenceIQ

Lajos is a co-founder of SequenceIQ. He is an automation fetishist, be it cluster provisioning or sending a birthday SMS. Lately he has become a Docker evangelist, promoting virtualization in every aspect of a software project. His topics of interest are continuous integration, service discovery, and dockerizing everything. Lajos has a master's degree from the Technical University of Budapest.

Provisioning a Hadoop cluster on … anywhere

Installing a Hadoop cluster is far from trivial. In this talk we show the approach, processes and toolsets we use for all the environments where we provision Hadoop. This covers provisioning targeted at a developer laptop, QA environments, the cloud and large production systems running on physical hardware – all done the same way.
We have built a provisioning framework based on Docker and Apache Ambari, and provide a simple REST API to create a cluster of any size on different cloud providers (Amazon EC2, Rackspace), VMs and physical hardware. We will also speak about best practices for managing and monitoring Hadoop clusters and about dynamic cluster resizing.
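
To picture the REST-driven flow described above, here is a minimal sketch of calling such a cluster-creation API from Python. The endpoint URL and payload fields are illustrative assumptions for the sake of the example, not SequenceIQ's actual interface.

    # Hypothetical call to a cluster-provisioning REST API of the kind described
    # in the talk; the endpoint and payload fields are illustrative assumptions.
    import requests

    PROVISIONER_URL = "https://provisioner.example.com/clusters"  # assumed endpoint

    cluster_request = {
        "name": "demo-cluster",
        "nodeCount": 4,                # desired cluster size
        "provider": "EC2",             # e.g. EC2, Rackspace or bare metal
        "blueprint": "hdp-multinode",  # an Ambari blueprint describing the services
    }

    response = requests.post(PROVISIONER_URL, json=cluster_request, timeout=30)
    response.raise_for_status()
    print("cluster request accepted:", response.json())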

Simon Gregory

Director - Business Development & Strategic Alliances - EMEA, Hortonworks

Simon is responsible for supporting Hortonworks' strategic partners in driving the broad adoption of Hadoop across EMEA. In addition, he is helping build out the wider supporting community of partners who deliver value-added services to the Hadoop ecosystem, from education partners and systems integrators to visualisation vendors and pure-play analytics platform companies.

Apache Hadoop, current state and future prospects

Apache Hadoop is moving rapidly, just as this session will need to, as there's a lot to cover. I'll provide a tour of the current Hadoop ecosystem, an update on customer adoption patterns, what's being worked on, and ultimately what all of that means to you. I'll also try to cover the latest trends in interactive query, security and data governance, and some of the new areas of interest in projects such as Spark and Storm.

Papp Tamás

BI Architect, Ustream

Tamás works for Ustream in its Hungarian office in Budapest as a Business Intelligence Architect, which mostly means having fun with data processing. He has about 10 years of experience gathered in several BI projects at consulting companies as a developer, consultant and project manager. Tamás also likes to describe himself in the third person.

How we didn't build a traditional DW for viewership numbers at Ustream

Live video platform Ustream has about 100 million live viewers per month. For several years we used a 3rd-party tool to gather and display information about viewership metrics (that is, breakdowns of view numbers by geography, device, etc.). This system was fast and reliable but expensive and not flexible enough for us, so you can bet we wanted a replacement.
When we started to think about building our own solution based on our multiple TBs of viewership logs per day, we had two choices: build a "traditional" data warehouse that could give us both near-real-time and historical reports, or build something that… well, is not that expensive ("do more with less…", as managers like to put it).
In October 2013 we had no idea about lambda architectures at all, but we designed and built a similar architecture that is (relatively) cost-effective, using the Redis key-value store, Elastic MapReduce, MySQL and Tableau. Does it have disadvantages compared to a near-real-time data warehouse? Sure it does. In this case-study talk we'll go through the whole solution and see what compromises we had to make to reach our goal.
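
To make the "speed layer" side of such a setup concrete, here is a minimal sketch of keeping live viewership counters in Redis while a batch side (Elastic MapReduce into MySQL) recomputes historical numbers. The key names and dimensions are assumptions, not Ustream's actual schema.

    # Minimal speed-layer sketch: per-channel view counters kept in Redis.
    # Key names and dimensions are assumptions, not Ustream's actual schema.
    import redis

    r = redis.StrictRedis(host="localhost", port=6379)

    def record_view(channel_id, country, device):
        """Increment the real-time counters for a single view event."""
        r.hincrby("views:%s:by_country" % channel_id, country, 1)
        r.hincrby("views:%s:by_device" % channel_id, device, 1)
        r.incr("views:%s:total" % channel_id)

    record_view("channel-42", "HU", "mobile")
    print(r.hgetall("views:channel-42:by_country"))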

Dr. Lóránd Balázs

Senior Business Consultant, T-Systems

Balázs Lóránd works as a senior consultant at the Business Intelligence and Information Management Solutions Division of T-Systems Hungary Ltd, and currently supports the Location-Based Mobile Advertising project as project manager and project coordinator. Balázs Lóránd has extensive experience in the field of business analytics. He previously took an active part in many research, consulting and training projects for multinational companies and in the public sector as well.
Balázs Lóránd holds a PhD in Economics from the University of Pécs, Hungary, and has published several research articles with analytical results in recent years.

Location-based Mobile Advertising at Magyar Telekom

The greatest challenge in connection with Big Data solutions is the real-time processing and utilization of large amounts of data. The tools available – which are able to send instant advertising offers right after data analysis – provide several opportunities for realizing location-based mobile advertising. In this presentation I will outline a project in which we developed a system for sending real-time, location-based offers in collaboration with several partners (TSI, EMC, MI6App, Magyar Telekom).

Fábián Zsolt

Database/Security Engineer, Spil Games

He has been working in the IT security industry for 10+ years and in information retrieval for 5 years, and is a Big Data enthusiast. He works with Spil Games to make the Internet a safe place to play, and also writes loads of Map/Reduce jobs in Python for Disco and Hive. Spil Games is a rapidly growing online game publisher that has transformed itself from an Internet startup into a world leader in casual gaming.

Experiencing migration among Map/Reduce platforms

SpilGames is a leading publisher of HTML5, Flash and mobile games. Our main revenue driver is advertising, where our systems rely heavily on Big Data processing. This session will explain how our map/reduce systems matured and how we took on the challenge of migrating Python-based Disco map/reduce jobs to Hive. The talk will tell the story of the developers who were involved in the transition between Disco and Hadoop, and how we kept our business online while we changed tires in the pit lane.
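
As an illustration of what such a migration involves, the sketch below shows a tiny Disco-style map/reduce job in Python (counting events per game) and, in a closing comment, the roughly equivalent HiveQL it would become. The DDFS input tag, the tab-separated field layout and the table name are assumptions, not Spil Games' actual jobs.

    # Tiny Disco-style map/reduce job: count events per game. The input tag and
    # the tab-separated field layout are assumptions for illustration only.
    from disco.core import Job, result_iterator

    def map_fn(line, params):
        yield line.split("\t")[0], 1          # emit (game_id, 1) per log line

    def reduce_fn(iterator, params):
        from disco.util import kvgroup
        for game_id, counts in kvgroup(sorted(iterator)):
            yield game_id, sum(counts)

    if __name__ == "__main__":
        job = Job().run(input=["tag://data:plays"], map=map_fn, reduce=reduce_fn)
        for game_id, total in result_iterator(job.wait(show=True)):
            print(game_id, total)

    # After migration, the same aggregation becomes a one-line HiveQL query
    # (table name assumed): SELECT game_id, COUNT(*) FROM plays GROUP BY game_id;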

Alex Dean

Co-founder, Snowplow Analytics Ltd

Alex Dean is the co-founder and technical lead at Snowplow Analytics. Snowplow is a web and event analytics platform with a difference: rather than tell our users how they should analyze their data, we deliver their event-level data in their own data warehouse, on their own Amazon Redshift or Postgres database, so they can analyze it any way they choose.
At Snowplow Alex is responsible for Snowplow’s technical architecture, stewarding the open source community and evaluating new technologies such as Amazon Kinesis. Prior to Snowplow, Alex was a partner at technology consultancy Keplar, where the idea for Snowplow was conceived. Before Keplar Alex was a Senior Engineering Manager at OpenX, the open source ad technology company.
Alex lives in London, UK.

Continuous data processing with Kinesis at Snowplow

Since its inception, the Snowplow open source event analytics platform (https://github.com/snowplow/snowplow) has always been tightly coupled to the batch-based Hadoop ecosystem, and Elastic MapReduce in particular. With the release of Amazon Kinesis in late 2013, we set ourselves the challenge of porting Snowplow to Kinesis, to give our users access to their Snowplow event stream in near real time.
With this porting process nearing completion, Alex Dean, Snowplow Analytics co-founder and technical lead, will share Snowplow’s experiences in adopting stream processing as a complementary architecture to Hadoop and batch-based processing.
In particular, Alex will explore:

  • “Hero” use cases for event streaming which drove our adoption of Kinesis
  • Why we waited for Kinesis, and thoughts on how Kinesis fits into the wider streaming ecosystem
  • How Snowplow achieved a lambda architecture with minimal code duplication, allowing Snowplow users to choose which platform (or both) to use
  • Key considerations when moving from a batch mindset to a streaming mindset – including aggregate windows, recomputation, backpressure
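
For readers unfamiliar with the raw Kinesis API, the sketch below shows the minimal polling loop a Kinesis consumer performs, using boto3. The stream name is an assumption, and this is not Snowplow's actual consumer code.

    # Minimal Kinesis polling loop with boto3; the stream name is an assumption
    # and this is not Snowplow's actual consumer code.
    import time
    import boto3

    kinesis = boto3.client("kinesis", region_name="eu-west-1")
    stream = "snowplow-enriched-events"        # hypothetical stream name

    shard_id = kinesis.describe_stream(StreamName=stream)[
        "StreamDescription"]["Shards"][0]["ShardId"]
    iterator = kinesis.get_shard_iterator(
        StreamName=stream, ShardId=shard_id,
        ShardIteratorType="TRIM_HORIZON")["ShardIterator"]

    while True:
        batch = kinesis.get_records(ShardIterator=iterator, Limit=100)
        for record in batch["Records"]:
            print(record["Data"])              # one serialized event per record
        iterator = batch["NextShardIterator"]
        time.sleep(1)                          # respect per-shard read limits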

Gábor Zoltán

Research Engineer, Falkstenen AB Hungarian Branch Office

Zoltán Gábor has worked as a research engineer at Falkstenen AB Hungarian Branch Office (formerly known as GusGus Capital LLC) for 10 years. He took part in most of the development of the Company's purpose-built tools (trading platforms, data management tools, distributed computation tools). He has been working with Hadoop for the last three years. His main responsibilities are to develop, integrate and maintain the Hadoop-based technologies used in the Company's ETL.

Managing Financial Big Data on Hadoop

Falkstenen AB Hungarian Branch Office (formerly known as GusGus Capital LLC) is a subsidiary of a family office that trades financial assets in different asset classes (e.g. Foreign Exchange, Equity, Commodities, and their derivatives). We process large amounts of exchange-generated data using Hadoop technologies.
The exponential growth of the data, its lack of complicated structure and the huge number of records make traditional databases unsuitable (and unnecessary) for storing and processing market data. Hadoop and its ecosystem fit naturally into the typical processing scheme of this field.
In this talk we will present our (Big!) data and the problems that we had before moving to Hadoop. Insight into the structure of exchange-generated events will be given.
The hardware architecture of our 480-node Hadoop cluster will be shown. You will also learn something about our ETL and the tools that we have built or integrated to handle hundreds of TBs of data.
By the end of this talk you will know something about the problems that we are facing today, and the future development plans that we have as well.

Christoph Boden

Research Associate, TU Berlin

Christoph Boden is currently a research associate and Ph.D. student at the Database Systems and Information Management group at TU Berlin. His research foci are scalable machine learning and parallel data processing systems. He received his master's degree ("Diplom-Ingenieur") from Technische Universität Berlin in 2011, where he designed and implemented a fact prediction system using supervised learning techniques. He studied Industrial Engineering at Technische Universität Dresden, Technische Universität Berlin and UC Berkeley. He is also a laureate of the Software Campus program. He has published several refereed scientific papers at international conferences and workshops and has held graduate lectures on Scalable Data Analytics and Text Mining at TU Berlin.

Cooccurrence-based recommendations with Mahout, Scala & Spark

This talk will give a preview of the latest developments and future plans in Apache Mahout. Mahout features a new Scala DSL for linear algebraic computations. Programs written in this DSL are automatically parallelized and executed on Apache Spark. I will give an introduction to the DSL and show how Mahout uses it to implement a cooccurrence-based recommender system.
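
To show the idea the talk builds on, independently of Mahout's Scala DSL, here is a plain numpy sketch of cooccurrence-based recommendation: build an item-item cooccurrence matrix from a user-item interaction matrix and score unseen items for one user. The toy matrix is made up, and Mahout additionally filters and downsamples the cooccurrences (e.g. with log-likelihood ratio tests), which is omitted here.

    # Toy numpy illustration of cooccurrence-based recommendation; not Mahout's
    # Scala DSL, and without Mahout's LLR-based filtering of the cooccurrences.
    import numpy as np

    # rows = users, columns = items; 1 means the user interacted with the item
    A = np.array([[1, 1, 0, 0],
                  [1, 0, 1, 0],
                  [0, 1, 1, 1]], dtype=float)

    C = A.T @ A              # item-item cooccurrence counts
    np.fill_diagonal(C, 0)   # an item should not recommend itself

    user_history = A[0]      # what user 0 has already interacted with
    scores = C @ user_history
    print(scores)            # higher score = stronger recommendation candidate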

Enrico Berti

UI Engineer, Cloudera

Enrico Berti is a Vienna-based UI engineer working for Cloudera, with more than fifteen years of experience in several IT fields, from very critical banking applications to non-profit. He's the main UI engineer of the open source Hue project (http://gethue.com).

Open up interactive big data analysis for your enterprise

Hadoop brings many data crunching possibilities but also comes with a lot of complexity: the ecosystem is large and continuously changing, interactions happen on the command line, interfaces are built for engineers…
This talk describes how Hue can be integrated with existing Hadoop deployments with minimal changes/disturbances. Enrico covers details on how Hue can leverage the existing authentication system and security model of your company.
Through an interactive demo and dialogue based on the open source Hue, this talk shows how users can get started with Hadoop. We will detail how one can set up Hue on a new or existing Hadoop cluster, share best practices for integrating your company directory and security, and cover the underlying technical details of interacting with the ecosystem.
The presentation will continue with real-life analytics business use cases. It will show how data can be imported and loaded into the cluster and then queried interactively with SQL or a search dashboard. All through your web browser!
To sum up, attendees of this talk will learn how Hadoop can be made more accessible and why Hue is the ideal gateway for quickly getting started or using the platform more efficiently.

Balassi Márton

Big Data Solutions Developer, MTA - SZTAKI

Márton Balassi is a big data enthusiast working for the Datamining and Search Group of the Hungarian Academy of Sciences (MTA SZTAKI). His expertise includes experience with numerous distributed processing frameworks, such as Hadoop, Spark and Storm, to name a few. Besides big data architecture, his main interest is distributed graph processing.
Currently he is mainly working on designing and implementing an open-source, low-latency distributed stream processing framework called Stratosphere Streaming.

Challenges of real-time distributed stream processing

Real-time (i.e. low-latency) processing is one of the main challenges of the big data community. A variety of frameworks have been proposed for distributed stream processing, including S4, Storm, Spark Streaming and Samza, all trying to respond in their own way.
The talk sketches the current stream processing scene, supported by examples from the field of recommender systems, and highlights how stream processing can be applied and how it can complement batch processing.
The context of the research behind this talk was the planning phase of a streaming framework augmenting the European big data analytics platform Stratosphere. Stratosphere Streaming aims to support both very low-latency processing and high-throughput mini-batch processing in a fault-tolerant fashion. The basic stream processing engine is the contribution of the Budapest team.

Sander Kieft

Manager Core Services, Sanoma Media

Sander is responsible for the common services within Sanoma, where his teams design and build (web) services for some of the largest websites and most popular mobile applications in the Netherlands and Finland. Sanoma is among the largest media and learning companies in Europe, with key markets in Finland and the Netherlands and titles ranging from Libelle, Margriet, Kieskeurig.nl, Autotrader.nl and Startpagina.nl to donaldduck.nl and the national news brands Helsingin Sanomat, Ilta Sanomat and NU.nl.
Sander has been working with large-scale data in media for 15 years, working for the largest websites in the Netherlands as a developer, architect and technology manager.

Does Big Data self service scale as well as Hadoop?

This talk will take you through the past, present and future of the data platform in use at Sanoma. A few years ago Sanoma set out to build a self-service data platform, using a mixture of open source and commercial technology. Central to the platform are Hadoop, Hive and Python, but nowadays the platform also includes real-time data ingestion, real-time data processing and integration with other enterprise BI and data systems.
The central question: how did this platform come to be, and does Big Data self-service deliver on the promise of freeing the data for everyone to use?

Kasler Lóránd Péter

Tech Lead Architect, Virgo Systems

The speaker has more than 10 years of experience in IT. He worked as the architect of IWIW, the biggest Hungarian social networking site before Facebook came along. He is interested in highly scalable distributed systems and big data platforms.
When he is not coding or architecting, he spends all his time with his 2-year-old daughter.

Building a recommendation engine using the Lambda Architecture

We will present our hybrid recommendation engine based on collaborative filtering and text retrieval techniques.
Our goal is to present a valid use case of the Lambda Architecture and how we tailored it to our needs. We will show how we are leveraging Hadoop and related technologies (Avro, Pail, Cascading, Mahout, SOLR) to bring a highly scalable and customisable recommendation platform to our customers and pave the way for further applications of big data. We will share our learnings from deploying the Cloudera Hadoop stack on Amazon Web Services, and also the shortcomings of the cloud-based approach that we have found so far. We will also highlight our usage of Amazon Auto Scaling Groups together with key parts of our platform.
Besides the technological part, we will also present our novel usage of collaborative filtering (powered by Mahout) combined with the SOLR search engine, so that we can cover a wide range of recommendation-related needs.
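
As a rough illustration of the search side of combining collaborative filtering with SOLR, the sketch below queries a Solr core whose documents carry a precomputed "indicators" field (items that co-occur with the document's item) through Solr's standard /select handler. The core name, field names and item ids are assumptions, not Virgo's actual schema.

    # Hedged sketch of querying a Solr core whose documents hold precomputed
    # cooccurrence indicators; core, fields and item ids are assumptions.
    import requests

    SOLR_SELECT = "http://localhost:8983/solr/items/select"   # assumed core

    recent_items = ["item17", "item42", "item99"]              # user's history

    params = {
        "q": "indicators:(%s)" % " ".join(recent_items),       # OR-match indicators
        "rows": 10,
        "wt": "json",
    }
    docs = requests.get(SOLR_SELECT, params=params).json()["response"]["docs"]
    for doc in docs:
        print(doc["id"])                                       # recommended items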

Claudio Martella

PhD candidate, VU University Amsterdam

Claudio Martella is a fetishist of graphs. He is a researcher at the Large-scale Distributed Systems group of the VU University Amsterdam. His topics of interest are large-scale distributed systems, graph processing, and complex networks. He has been a contributor to Apache Giraph since its incubation, where he is a committer and a member of the PMC. He is also lead-author of Giraph in Action for Manning.

Apache Giraph: large-scale graph processing on Hadoop

We are surrounded by graphs. Graphs are used in various domains, such as the Internet, social networks, transportation networks, bioinformatics, etc. They are successfully used to discover communities, to detect frauds, to analyse the interactions between proteins, and to uncover social behavioural patterns. As these graphs grow larger and larger, no single computer can timely process this data anymore. Apache Giraph is a large-scale graph processing system that can be used to process Big Graphs. Giraph is part of the Hadoop ecosystem, and it is a loose open-source implementation of the Google Pregel system. Originally developed at Yahoo!, it is now a top-level project at the Apache Foundation, and it enlists contributors from companies such as Facebook, LinkedIn, and Twitter. In this talk we will present the programming paradigm and the features of Giraph. In particular, we focus on how to write Giraph programs and run them on Hadoop.
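
Giraph programs themselves are written in Java against a vertex-centric API, but the superstep model is easy to illustrate: vertices receive messages, update their value and send messages to their neighbours, superstep after superstep, until no messages remain. The following single-machine Python simulation of single-source shortest paths only illustrates this "think like a vertex" model; it is not the Giraph API.

    # Single-machine illustration of the Pregel/Giraph superstep model, using
    # single-source shortest paths; this is not the Giraph Java API.
    INF = float("inf")

    graph = {"a": {"b": 1, "c": 4},            # adjacency list with edge weights
             "b": {"c": 2, "d": 5},
             "c": {"d": 1},
             "d": {}}
    value = {v: INF for v in graph}            # current best distance per vertex
    inbox = {v: [] for v in graph}
    inbox["a"] = [0]                           # the source receives distance 0

    while any(inbox.values()):                 # run supersteps until quiescence
        outbox = {v: [] for v in graph}
        for v, messages in inbox.items():
            if messages and min(messages) < value[v]:
                value[v] = min(messages)       # the vertex updates its own state...
                for neighbour, weight in graph[v].items():
                    outbox[neighbour].append(value[v] + weight)  # ...and sends messages
        inbox = outbox

    print(value)                               # {'a': 0, 'b': 1, 'c': 3, 'd': 4}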

Dionysios Logothetis

Associate Researcher, Telefonica Research

Dionysios is an Associate Researcher with the Telefonica Research lab in Barcelona, Spain. His research interests lie in the areas of large scale data management with a focus on graph mining, cloud computing and distributed systems. He holds a PhD in Computer Science from the University of California, San Diego and a Diploma in Electrical and Computer Engineering from the National Technical University of Athens.

Grafos.ML: Tools for large-scale graph mining and machine learning

Large-scale graph mining and machine learning is becoming an increasingly important area of big data analytics with applications from Online Social Network analysis to recommendations. In this talk, I will describe grafos.ml, an umbrella project with the goal of building tools and systems for graph mining and ML analytics.
In the first part of the talk, I will describe Okapi, an open source library of graph mining and ML algorithms built on top of the Giraph graph processing system. The goal of Okapi is to provide a rich toolkit of graph mining algorithms that will simplify the development of applications, such as OSN analysis at scale.
In the second part, I will talk about RT-Giraph, a system for mining large dynamic graphs. In many real-world scenarios, graphs are naturally dynamic and several applications, such as sybil detection in OSNs, require real-time updates upon changes in the underlying graph. However, existing graph processing systems are designed for batch, offline processing, making the analysis of dynamic graphs hard and costly. RT-Giraph is explicitly designed for dynamic graphs, allowing fast updates and making the deployment of real-time applications easier.

Balogh György

CTO, LogDrill

György Balogh is the CTO of LogDrill Kft. György received his computer science degree from the University of Szeged and has 20 years of data mining and machine learning experience. György spent 6 years at Vanderbilt University in Tennessee as a researcher and developed the sensor fusion algorithms of the first distributed shooter localization system. Currently György is working on the LogDrill product family specialized for log and big data analytics.

Introduction to Modern Big Data Technologies

Big Data platforms such as Hadoop have evolved and matured in recent years. We provide a brief history of the evolution of Big Data, focusing on the reasons for the current paradigm shift in data processing.
Next we present the latest open source technologies and their capabilities, such as Hadoop 2.0, Cloudera Impala and Apache Spark.
Finally, we show how these technologies compare to traditional data warehouse systems.

Domaniczky Lajos

Independent Expert

I'm a Software Architect who has been working with Java and related technologies, mainly in Western Europe, for the last 14 years. My goal is to conquer new IT technologies. I have worked on standardising the Enterprise Java development stack at various companies. I have mastered most of the relevant technologies (including JDBC and EJB, and then went lightweight with Spring and Hibernate).
After some side projects I have specialised in UI technologies and mobile devices (JavaScript and Android UIs).
I have also immersed myself in big data technologies for a big multinational IT company, enabling them to implement a sustainable system and to align their KPIs with management expectations, using all of the technologies discussed in this presentation.

Analyzing Big Data using Hadoop and Hive

Writing map/reduce programs using Hadoop to analyze your Big Data can get complex and cumbersome. Hive can help make querying your data easier: it is a data warehouse system that facilitates easy data summarisation, ad-hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems. Hive provides a mechanism to project structure onto this data and query it using HiveQL, an SQL-like language.
This presentation will show you how to get started with Hive and HiveQL. We'll discuss Hive's various file and record formats, partitioned tables, etc. We'll start with a simple dataset stored in Hadoop HDFS, a set of files with a well-defined structure, and show how to map these files to a schema using DDL, as well as how to query this schema with some useful HiveQL queries.
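
As a taste of what "mapping files to a schema using DDL" looks like, the sketch below runs an illustrative CREATE EXTERNAL TABLE statement and an ad-hoc query through the Hive CLI's -e option. The table name, columns and HDFS path are assumptions, not the dataset used in the session.

    # Illustrative Hive DDL plus an ad-hoc HiveQL query, executed via the Hive
    # CLI's -e flag; table name, columns and HDFS path are assumptions.
    import subprocess

    ddl = """
    CREATE EXTERNAL TABLE IF NOT EXISTS page_views (
      view_time  STRING,
      user_id    STRING,
      url        STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'
    STORED AS TEXTFILE
    LOCATION '/data/page_views';
    """

    query = ("SELECT url, COUNT(*) AS views FROM page_views "
             "GROUP BY url ORDER BY views DESC LIMIT 10;")

    subprocess.check_call(["hive", "-e", ddl])      # project structure onto the files
    subprocess.check_call(["hive", "-e", query])    # run an ad-hoc HiveQL query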

Dr. Horváth Gábor

Professional Services Manager, Teradata

Gabor Horvath has been involved in data warehousing in various roles for more than 15 years. Since 2012 he has been leading the Professional Services organization of Teradata Hungary. Prior to joining Teradata he gained experience in location intelligence projects in Europe, and was involved in building up and running the data warehousing practice of Oracle Hungary Consulting.

Data Warehousing in the Big Data era

In the traditional data-intensive industries, like banking or telecommunications, data warehouses were established a long time ago. These DW systems have evolved over time; however, they face a number of problems that must be solved in order to effectively support the decision making of these companies. On top of these problems, the IT/DW organizations are also facing the changes of the data management industry, and must cope with the technology- and process-related requirements of the emerging "Big Data" paradigm.
This presentation attempts to summarize the situation that typical Hungarian telco companies and banks are in, and suggests an approach that would, on the one hand, address some of the issues with their current DW systems and, on the other, start the journey towards gaining the business benefits of "Big Data".
Some real-life, practical examples will also be shown to support this approach.

Luis Moreno Campos

EMEA Big Data Solutions Lead, Oracle

Luis is a Big Data Solutions director at Oracle for EMEA, doing Business Development, Marketing Campaigns, Partner Development and Sales Enablement. A regular speaker at industry and technology events, CIO roundtables, technology user groups, university seminars, and marketing events, Luis has international experience in all phases of Big Data projects.
Prior to joining Oracle he held roles in consulting, training and business development, with expertise in Telco, Financial Services and Media.
An enthusiast of what data-driven solutions can do to trigger innovation in every sector, as well as of the social impact of technology, Luis travels the globe helping businesses innovate and gain a competitive edge from Big Data.
When not traveling, Luis is in Lisbon with his family, dogs and friends. A tennis lover, stand-up comedy fan, and passionate about cooking, Luis is also known as a regular blogger, book publisher and Twitter addict!

Data First Framework - How to build a Big Data architecture

The Data First approach is a fundamental shift in data management mindset away from the Model First approach: two environments, one difference.
Organize data to do something specific (Run the Business), and figure out what the data can do for you (Change the Business). In this session we will learn the architectural components and implications of this new approach.

Nagy Zoltán

Senior Analyst, Data Solutions

Senior analyst at Data Solutions Ltd. Having graduated and worked briefly as a macroeconomic analyst, he stepped into the data mining field in 2008. Since then he has worked on and led many different domestic and international projects, covering various topics from customer-level predictive analytical modelling and its applications, through system maintenance, to product demand forecasting. In this way he has gained expertise mostly in telco, but also in the banking, pharmaceutical, insurance and retail industries. He currently works as the lead telco analyst of Data Solutions.

Big methods on not-so-big data - a telco churn case study

Big data methods can be useful even without big data.
This is a lesson learnt by a telco wanting to handle MNP churn. Traditional methods leave such migrating customers unidentified and beyond reach, but a fresh perspective can turn the tide altogether. Big data methods developed mainly for clickstream and weblog analysis have been applied to the – in many cases previously completely unutilized – data of a traditional data warehouse.
As a result, more precise and up-to-date target group selection could be achieved. Timing remains a crucial factor, though. The results also suggest that further improvements could be gained by including traditional churn prediction methods and by utilizing "real" big data.

Borsodi Szilárd

Senior BI/DW consultant, T-Systems

Small data: the burden of manual DWH input

There is a common feature of all data warehouse projects: manual and disparate sources of business data – all missing from the core systems. If the project fails to comfortably accommodate these data in the BI framework, the delivered solution is less likely to win the sympathy of the analysts and operators. This talk will give an overview of the real-world issues and possible solutions.

Pocsarovszky Károly

Research manager, eNet

Big Data in Mobile Network Analysis

We are going to present the highlights and key findings of working with Big Data in a mobile network environment. Our project's aim was to identify the mobile cells' data transfer capacity and load by measuring a given set of mobile towers' activity solely through the air interface. The estimated raw data is more than 3 TB. In order to analyze such an amount of data, a proper processing framework is needed. We will present our experiences with different Big Data tools (Hadoop, MongoDB, low-level scripting) to show how they compete with the traditional approaches and also how they complement each other.

Biró Attila

Business Solutions Director, Areus

Attila Biró has been the development director of Areus Infokommunikációs Zrt since 2009. In this role he primarily oversees the company's banking development work and data integration projects at a high level, and also takes part in planning and tracking the division's strategic direction. He previously worked at Telenor and Oracle, where he gained significant experience in both technical and management areas. He has been giving talks at various professional events for years.

The top 3 data integration problems at large enterprises

Large enterprises (e.g. banks, insurers, telcos, government and other organizations) that have an extensive and heterogeneous application landscape and large databases usually face serious difficulties in the following areas:

  • Data migration and ensuring data quality: When new enterprise applications are introduced, applications are upgraded or application instances are consolidated, data migration can typically account for as much as 30-40% of the total project effort. Companies often underestimate the scope of such a project, even though a migration project is on average about 10 times the size of a data warehouse project. In many cases these problems lead to bad data ending up in the target system, to the migration project slipping, or to it failing to meet expectations.
  • Building small, compact test environments: At large enterprises, the systems reserved for testing are roughly 5-7-10 times the size of the production databases. The reason is that testing usually takes place on several levels and most companies work with full copies. Test environments therefore take up a huge amount of space, take longer to create, and are more complicated to maintain.
  • Effective protection of test data: Even at the most advanced companies, the data used for testing is real, live customer data taken from the production system. This is a problem in several respects: it violates regulations, and the risk of data theft and data leakage is greatly increased. As a result, the risk of losing customers and prestige is also high.

Professional Informatica solutions addressing the above problems will be presented; these have already proven themselves at several large Hungarian enterprises. Informatica is represented in Hungary by Areus, so local expertise as well as implementation and project experience are also available.

Daume Zénó

Head of Application Support, Erste Bank

Zénó Daume has been leading the application support department at Erste Bank since 2008. During this time he has gained significant experience in operating and supervising databases, data warehouses, WebLogic- and Java-based applications, and card systems. During his 10 years at Erste he has brought numerous migration projects to success, among them the Posta Bank migration, which received special attention at the time and was completed in record time. In his free time he likes to cycle, do DIY and garden.

Test data management challenges and trends at Erste Bank

The presentation will cover Erste Bank's experience with test data management, and the data management and environment challenges that large enterprises typically have to face. Building on the lessons of the road travelled so far, it will also outline a forward-looking vision on the topic.

Otti Levente

Emarsys

A data warehouse expert and Data Scientist, he has spent 15 years on software design and optimization and on designing and applying data mining methods in practice. He has wide-ranging experience in the design, optimization and development of data warehouses, BI tools and ETL processes. His project experience is diverse: in addition to the public administration, financial, banking, insurance, marketing and telecommunications sectors, he has also taken part in delivering oil industry solutions.

Emarsys Technologies - Data Warehouse As a Service

Emarsys Technologies has grown from an initial Email Service Provider, through Marketing Automation, into a fully integrated Customer Engagement Platform provider with a SaaS solution. This covers data-driven support for marketing decisions, the simple execution of those decisions and the measurement of their results, as well as the tracking of marketing campaigns.
In the talk we outline the architecture of a data warehouse operated as a service, highlighting the challenges that arose during development and operation and the experience and lessons gathered from them, focusing primarily on the questions of data integration and knowledge extraction.

Gollnhofer Gábor

Data Warehouse Business Unit Director, Jet-Sol

An experienced data warehouse professional, he has been building Hungarian and international DW/BI systems and providing related consulting since 1996. His key areas of expertise are system design and data modelling, both for data warehouses and for traditional IT systems.
He is a member of The Data Warehouse Institute (TDWI) and the Association for Computing Machinery (ACM), and a Certified Data Vault Data Modeler.

Introduction to DW automation

The presentation explores data warehouse automation. It shows what is worth automating, why and how, and what is not worth automating.

Csonka Zoltán

Data Warehouse Architect, Generali Biztosító

Zoltán Csonka is a Data Warehouse Architect at Generali Biztosító. During his 9 years at Generali he has gathered a lot of experience with data warehouse optimization projects, a data warehouse technology change, and defining development tools and directions. His experience is enriched by time spent in England developing data warehouse and BI solutions. He spends his free time with his 7-month-old daughter, his 4-year-old son and his wife, and whatever time remains he most gladly spends on learning languages.

Data warehouse automation experiences at Generali Biztosító

The presentation focuses on the automated development of the loading processes of a data warehouse built with Oracle Warehouse Builder.
The talk covers:

  • why we decided to automate the development processes, and why we chose a custom solution
  • to what extent it is worth automating the processes, and where the gains are
  • challenges and pitfalls during the project
  • based on our experience, what we would do differently

Csippán János

IT Director, Partner in Pet Food

An SME data warehouse

The presentation describes the experience gained while building the regional data warehouse of the Partner in Pet Food group. It covers what is specific to an SME data warehouse, and in what ways it resembles and differs from "classic" large-enterprise data warehouses.
In the presentation I will show the most important elements of the solution built at PPF, and explain why we chose this kind of solution.

Dr. Nizalowski Attila, Fekszi Csaba

Senior Chief Advisor, NFM; Managing Director, Omnit

Dr. Nizalowski Attila
A lawyer, specialized in IT law and legislative drafting. From 1990 he worked at ELTE's Faculty of Law, its Institute of Postgraduate Legal Studies and its Rector's Office as a lecturer and IT specialist. In the 2000s he developed legal databases, for the longest time as the responsible editor of the market-leading Complex Jogtár. Since 2011 he has been designing public procurement applications at the Ministry of National Development (NFM), and also takes part in their delivery as a project manager.
Fekszi Csaba
A 40-year-old data warehouse and BI expert. After completing the IT programme at KKVMF and the accountancy programme at the Budapest PSZF, he earned a master's degree in informatics at the University of Veszprém. During his career he has taken part in the design and development of numerous systems serving mainly banking and financial processes (card systems, Internet payments, banking data warehouses and BI solutions, applications, portals). In 2007 he founded Omnit Solutions Kft, of which he is still the managing director and majority owner.

A public procurement data warehouse on open source foundations

Building a data warehouse and business intelligence system in government IT, on open source foundations.
Is it worth trying, and is it even possible, to build a system that is serious even by industrial standards using free tools? What advantages does using a free product bring, and what drawbacks might it have? What lies beyond the rabbit? Beyond the product itself, what is needed for a DW/BI project to succeed?

Kővári Attila

Managing Director, BI Projekt

Attila Kővári is the business intelligence expert of BI Projekt Kft. He has been designing data warehouses and BI systems and building and mentoring BI teams since 1997. His broad business experience and his degree in economics give him a solid basis for translating business problems into the language of technology. His professional work is also recognized by Microsoft in Redmond: he is one of the few who have received the Most Valuable Professional award for the eighth time. His expert blogs (BI projekt blog, BI jegyzetek) are read by more than 4,000 people a month on average. He has been speaking at universities and major professional events for years.

Quality assurance of data warehouses

The presentation is about the quality assurance of data warehouses, and looks for answers to questions such as why data warehouse quality assurance is needed, how to carry it out, and how much should be spent on it.

Rékasi László, Szücs Imre

Dataminer/Analyst, Erste; Head of Research and Development, United Consult

László is an Analyst/Data Miner at Erste Bank, Retail CRM, working on a near real-time marketing engine for NBO and Event Driven Marketing. Before Erste he gained experience in the financial and IT services sector as an analyst/data miner. He graduated in Computer Science.

Imre is acting as a director at United-Consult Ltd., one of the leading Hungarian consultancy companies. He has more than 10 years of experience in business intelligence and data mining. Before starting his consulting career he worked in the financial and FMCG sectors.
Imre has a strong academic background: he holds an MSc in Physics, Astronomy and Computer Science, and he is also a PhD candidate at Eötvös Loránd University.

Analyzing customer behaviour with social networks at Erste Bank

ERSTE Bank Hungary Zrt. was looking for an alternative solution for analyzing customer behaviour, after traditional BI solutions failed to adequately support building near real-time, multi-channel marketing activities on demand. In a pilot project, OrientDB – a NoSQL database engine – and a complete social network analysis framework were built, integrating numerous technologies – such as Gephi, R, Gremlin, ... – to reach the goal. In the talk, ERSTE Bank's representative will present the business background of the project, while United Consult will describe the technological aspects of the graph-based analytical system.

Kóspál Eszter Sára

Data Warehouse Analyst and Modeller, CIB Bank

She has been working on data warehouse analysis and data modelling since 2002. She gained her first experience with KFKI Isys Kft and then IQSYS Rt at MKB, HVB, OTP and Raiffeisen banks, and since 2004 she has been working as a data warehouse systems analyst at CIB Bank.
She took part in the further development of the existing data warehouse, and from 2006 in the model-based implementation of CIB Bank's new data warehouse. Since 2013 she has been the professional lead of the data warehouse analyst team.

Data modelling in practice

The presentation deals with the questions and problems that arise in the everyday practice of data modelling.
The topics covered will include, among others:

  • How does an application/data warehouse get its data model?
  • Purchased versus in-house developed data models
  • Localization techniques and logic
  • Quality assurance in data modelling
