Tuesday, 24 November 2015 00:00
Let me start with a twist on the saying: I am the “new kid on the blog”. This is my first blog post after joining Widas. In the four short months since I joined the company I have learned a lot about how the Big Data industry works. Then Widas offered me the opportunity to take part in the data2day conference, which gave me first-hand experience of Big Data and Data Science technologies.
In this blog post I would like to share my experience at the conference. I had a great time, made new acquaintances, found out what’s new in the Big Data domain and attended a lot of really good presentations. Because I can’t possibly cover all the interesting talks of the two-day conference, I sat down and shortlisted three favorites that were not only engaging but also enlightening.
Note: The talks are listed in no particular order. I picked each one because they all had different things to offer.
Keynote – Machine Learning at Amazon
In this talk, Dr. Ralph Herbrich presented how a services company such as Amazon uses data analytics and machine learning in its daily business. He gave examples of how machine learning helps Amazon deliver benefits to users of its retail, digital and AWS cloud services.
Dr. Herbrich discussed how, in the retail sector, machine learning enables Amazon to forecast demand and offer products at the lowest possible prices. Machine learning algorithms compare customers’ pre-orders and the prices at which the same products are sold by other online retailers.
Machine learning algorithms are also used to link digital content. The X-Ray feature, originally offered on the Kindle e-book reader, is now available for video content as well. It gives direct on-screen access to actor bios, background information and more from the Internet Movie Database (IMDb).
Machine learning is also harnessed for machine translation of product descriptions into various languages. This lets customers search for products and read product descriptions without having to translate them manually themselves.
Amazon Machine Learning is also offered as a service. It is designed to analyse large amounts of data stored in the AWS cloud and make predictions from them. The service lets developers visualize the statistical properties of the datasets used to “train” a model to find patterns in the data. Amazon Machine Learning then uses this training to optimize its algorithms and return the best possible predictive models.
Dr. Herbrich then took a deep dive into the science of machine learning. He showed how machine learning techniques bring scientific methods of learning to artificial intelligence problems, and explained how probability is the central concept of machine learning: it comes into play at each step of the infer-predict-decide cycle of the machine learning process.
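The infer-predict-decide cycle can be made concrete with a minimal sketch (my own illustration, not code from the talk; the “buyer vs. browser” scenario and all numbers are assumptions for the example): infer a belief from data with Bayes’ rule, predict the next outcome under that belief, then decide by weighing expected costs.

```python
# Minimal illustration of the infer-predict-decide cycle.
# All hypotheses, probabilities and costs here are made-up example values.

def infer(prior, likelihoods, observation):
    """Bayes' rule: update the belief over hypotheses given one observation."""
    posterior = {h: prior[h] * likelihoods[h][observation] for h in prior}
    total = sum(posterior.values())
    return {h: p / total for h, p in posterior.items()}

def predict(belief, likelihoods, outcome):
    """Probability of the next outcome, averaged over all hypotheses."""
    return sum(belief[h] * likelihoods[h][outcome] for h in belief)

def decide(p_event, cost_act, cost_ignore):
    """Act only if the expected cost of ignoring exceeds the cost of acting."""
    return "act" if p_event * cost_ignore > cost_act else "ignore"

# Two hypotheses about a visitor ("buyer" vs "browser") and their behaviour.
prior = {"buyer": 0.5, "browser": 0.5}
likelihoods = {"buyer":   {"click": 0.8, "skip": 0.2},
               "browser": {"click": 0.3, "skip": 0.7}}

belief = infer(prior, likelihoods, "click")                 # infer
p_click = predict(belief, likelihoods, "click")             # predict
action = decide(p_click, cost_act=1.0, cost_ignore=3.0)     # decide
print(belief, p_click, action)
```

Each step is probabilistic, which is exactly the point made in the talk: uncertainty is carried through the whole cycle rather than discarded after the first estimate.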
Machine learning algorithms are also widely used in projects at Widas, especially in the fraud detection domain. It was interesting to find out how we could widen the spectrum of possibilities at Widas.
In the next part, I would like to present another talk I picked, especially because of a technology that has been depicted as the answer to many data analytics workflow bottlenecks.
Clickstream Analysis with Apache Spark: Understanding Website Visitors in Real Time
Apache Spark is the new catchword in the Big Data world today, and this was reflected at the conference too: there were quite a few talks on Apache Spark and its different use cases. In this talk, Andreas Zitzelsberger and Josef Adersberger of QAware presented how Apache Spark was used in a project whose goal was to react to website traffic in real time, using clickstream analysis to understand different user journeys and control the placement of advertisements accordingly. For example: should the ad go on the home page, should ads for women’s products be shown, or should more money be invested in a popular campaign?
In the talk the presenters showed how they traveled through the Big Data wonderland before arriving at Apache Spark, which turned out to be the best solution for their target architecture. The first approach considered was a data warehouse with an SQL database; it was rejected because replays were inflexible and cumbersome and performance with large quantities of data (>> 1 TB) was poor. The second approach, based on a Hadoop batch processor and a Hive analytics DB, was also given up due to its intricate programming model and non-interactive interfaces. A solution based on a kappa architecture with the Storm stream processor, Cassandra and Impala could not prove its benefits in full either, due to the complex programming model and the classic persistence issue of stateful, long-running aggregations. The lambda architecture was also rejected, as its complexity and redundancy were too high.
Then Apache Spark came to the rescue! The Apache Spark programming model supports both batch and stream processing: at heart it is a batch framework that models streams as micro-batches. The final architecture implemented by the presenters is a series connection of stream and batch processing on the basis of Apache Spark. Raw event streams are collected and queued using Kafka, then ingested as atomic event frames into the data lake via Spark Streaming. The Spark micro-batches ensure high throughput and restart automatically in case of failures. Spark APIs process the data and push it into an SQL database that is queried using Spark SQL. The various Spark components offer high processing speed, a uniform and simple programming and operations model, fast retrieval times and good connectivity. Spark packages also include an R interface, machine learning and NLP libraries and more, which really extends its horizon of possible use cases.
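The micro-batch idea itself is simple enough to sketch without a cluster. The following is a pure-Python simulation (not actual Spark code; the event format and page names are invented for illustration): the stream is cut into small batches, each batch is processed with the same aggregation you would use for a bulk job, and the results are merged into a running state.

```python
# Simplified pure-Python illustration of micro-batch stream processing.
# Events and page names are made-up examples, not real Spark APIs.

from collections import Counter
from itertools import islice

def micro_batches(events, batch_size):
    """Cut an event stream into fixed-size micro-batches."""
    it = iter(events)
    while batch := list(islice(it, batch_size)):
        yield batch

def process_batch(batch):
    """Batch-style aggregation: clicks per page within one micro-batch."""
    return Counter(event["page"] for event in batch)

# Clickstream events as they might arrive from a queue such as Kafka.
stream = [{"page": "home"}, {"page": "women"}, {"page": "home"},
          {"page": "campaign"}, {"page": "home"}, {"page": "women"}]

state = Counter()                        # running totals across batches
for batch in micro_batches(stream, batch_size=2):
    state.update(process_batch(batch))   # merge micro-batch result into state

print(state.most_common(1))
```

The appeal of Spark’s model is exactly this symmetry: `process_batch` is ordinary batch code, and streaming falls out of applying it to many small batches with managed state.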
In conclusion, Apache Spark seems to deliver on its promises. There were also other talks about projects where Apache Spark was part of the workflow pipeline. Developers strongly recommend and advertise the use of Spark technologies, and at Widas we already have some in-house projects running on Spark.
In the last part of this blog, I would like to share my thoughts on an interesting talk about the pitfalls of data analytics.
How smart is Football Data Analytics today?
In the world of data analytics, sports analytics is a big business opportunity in itself. It was probably kick-started by the book Moneyball by Michael Lewis, published in 2003, a sports business biography of sorts that introduced analytics to sports data. Since then, businesses have sprung up that offer sports clubs detailed statistics and predictions, particularly around recruitment. Interest in the topic extends well beyond the clubs themselves: each year MIT holds a Sports Analytics conference, and it gets bigger and more prestigious every year.
I particularly liked this talk because Dr. Stefan Kühn of codecentric had very interesting insights into how, in data analytics, ignorance can lead to analyses that may be commercially viable but are totally wrong. Under pressure to produce impressive reports on tight deadlines, data analysts often overlook fundamental questions about the data itself.
He drove the point home using examples from the world of football analytics. The first example was based on the bestseller “The Numbers Game: Why Everything You Know About Soccer Is Wrong”. In the book the authors argue that soccer relies on luck far more than people think, and that statistical analysis of games could help derive winning decisions.
However, Dr. Kühn pointed out that some of the conclusions drawn in the book are largely ignorant and do not take alternatives into account. For example, the authors claim that long corners are overrated and short corners are better, based on the statistic that the average corner is worth about 0.022 goals. Dr. Kühn countered that the same kind of statistic looks equally bad for penalties: the average team scores from a penalty only once every ten games. Should teams give up on penalties as well?
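A quick back-of-the-envelope calculation shows why the 0.022-goals-per-corner figure proves little on its own (the corners-per-game number below is my own rough assumption, not from the talk): a small per-attempt value only means something once you account for how often the event occurs, and penalties fare no better under the same treatment.

```python
# Back-of-the-envelope comparison of corners vs. penalties.
# CORNERS_PER_GAME is an assumed rough figure for illustration.

GOALS_PER_CORNER = 0.022       # figure quoted from "The Numbers Game"
CORNERS_PER_GAME = 5           # assumed corners per team per game
PENALTY_GOALS_PER_GAME = 0.1   # "one penalty goal every ten games"

corner_goals_per_game = GOALS_PER_CORNER * CORNERS_PER_GAME
corners_per_goal = 1 / GOALS_PER_CORNER   # roughly 45 corners per goal

print(f"goals per game from corners:   {corner_goals_per_game:.2f}")
print(f"goals per game from penalties: {PENALTY_GOALS_PER_GAME:.2f}")
# Both rates are similarly small; by the book's logic teams should
# abandon penalties too, which is clearly absurd.
```

The point is not the exact numbers but the reasoning error: comparing a per-attempt statistic against nothing, without a baseline or an alternative, invites the wrong conclusion.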
Another interesting example was based on a blog post by Dan Altman, founder of North Yard Analytics.
The claim here is that substitutes score more goals than expected, and that coaches should therefore substitute forwards every 45 minutes of the game. Dr. Kühn showed that even if the claim is correct, it is too weak to be taken seriously. He pointed out that the statistics were measured only for forwards in winning teams; the opponents could have scored more as well. Fatigue is assumed to be the cause of the effect, yet a closer look reveals that fatigue cannot be controlled for: in the study it was measured as time spent on the field, which cannot be accepted as a direct measure of fatigue.
The insights presented are relevant for any form of analytics. They are a reminder that the pitfalls of raw statistics, such as preconceptions, confirmation bias and accepting results without questioning the data behind them, can easily lead to incorrect reports.
To sum it all up
At the end of the first day there was a nice get-together with live music and fresh ‘Flammkuchen’. It provided a great atmosphere for getting to know fellow conference attendees and exchanging information. The conference was well organized, and on the last day I got the chance to take part in a workshop on ‘Introduction to Data Science’. The workshop walked us through various Python-based machine learning libraries using the Anaconda distribution. The hands-on training with a decision tree algorithm, using handwritten digit recognition as an example, was very useful. All in all, it was a successful conference, and I am sure that everything I learned will help me in my work at Widas.
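The workshop exercise is easy to reproduce at home. A minimal sketch, assuming the scikit-learn stack that ships with Anaconda (the exact code used in the workshop may have differed): a decision tree classifying handwritten digits from scikit-learn’s built-in 8x8 digits dataset.

```python
# Decision tree on handwritten digits, using scikit-learn's built-in
# digits dataset (1797 images of 8x8 grayscale pixels, labels 0-9).

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0)

tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_train, y_train)            # learn pixel-threshold splits

accuracy = tree.score(X_test, y_test)
print(f"test accuracy: {accuracy:.2f}")
```

A single tree is a modest baseline on this dataset, which is precisely what makes it a good workshop example: the whole fit-predict-evaluate loop fits on one screen.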