Nfq Sponsored PyData Barcelona 2017
For yet another year, we have been proud sponsors of PyData, one of the most important conferences on data analysis tools in Python and other languages.
Several members of our team attended the PyData conference held in Barcelona from the 19th to the 21st of May 2017. The topics covered by the conference are of great interest: statistical methods, machine learning, artificial intelligence, distributed computing, neuroscience, genomics…
Nfq sponsored the event, but we were not just a logo on the website: our team was involved in organizing the conference and was among the key speakers as well.
The conference was dominated by deep neural networks using Google’s TensorFlow, one of the trending topics for data analysis worldwide. However, other machine learning techniques and libraries were also covered, particularly random forests and gradient boosting with xgboost.
The PyData community is particularly welcoming, and we felt part of a huge family from the beginning. The level of the talks and keynotes was impressive, and the keynote speakers were key people within the data analysis ecosystem. Travis Oliphant is the Chief Data Scientist of Continuum Analytics, one of the fastest-growing companies in analytics.
These are the main conclusions we drew from the event, and they can be highly relevant to the further development of the Data Science & Machine Learning practices:
- The Open Source community in Data Science is an area of great interest, with people all around the world innovating and using languages like Python to build new solutions in the Data Science field.
- Python is a high-level language with libraries that cover the entire Data Science project lifecycle: data ingestion, data cleaning, data analysis, advanced data mining, visualization and DevOps. The community is growing at a huge pace, and new libraries, approaches and techniques appear almost daily. Besides, Python’s learning curve is gentler than that of other languages, and it is widely used in universities, closing the gap between the Data Science skills taught in college and those needed in the job market.
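As a toy illustration of that lifecycle, the ingestion, cleaning and analysis steps can be sketched with Python’s standard library alone (the dataset and values below are invented for the example):

```python
import csv
import io
import statistics

# Hypothetical inline dataset standing in for a real CSV file.
raw = io.StringIO(
    "city,temperature\n"
    "Barcelona,21.5\n"
    "Madrid,\n"  # missing value, to be cleaned out
    "Valencia,23.0\n"
)

# Ingestion: read rows into dictionaries.
rows = list(csv.DictReader(raw))

# Cleaning: drop rows with a missing temperature and parse the numbers.
clean = [
    {"city": r["city"], "temperature": float(r["temperature"])}
    for r in rows
    if r["temperature"]
]

# Analysis: a simple aggregate over the cleaned data.
mean_temp = statistics.mean(r["temperature"] for r in clean)
print(f"{len(clean)} valid rows, mean temperature {mean_temp:.2f}")
```

In a real project, pandas would typically replace the manual `csv` handling, but the shape of the pipeline — ingest, clean, analyze — stays the same.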
- Data Science products are developed using agile approaches, with full interaction between Data Science teams (data scientists, data engineers, DevOps) and domain experts. Tools like Jupyter notebooks are well suited to building prototypes and assembling pieces of code that cover the different steps of the project lifecycle.
- There is hype around terms like Deep Learning and TensorFlow, but it is still unclear how to do good hyperparameter tuning without contaminating the training set.
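The usual safeguard against that contamination is to split the data once up front, tune only on a validation split, and touch the held-out test set a single time at the end. A minimal stdlib-only sketch of that discipline, using an invented one-parameter threshold “model”:

```python
import random

random.seed(42)

# Hypothetical toy data: x is uniform in [0, 1), and the true label is x > 0.5.
xs = [random.random() for _ in range(300)]
labeled = [(x, int(x > 0.5)) for x in xs]

# Split once, up front: the test set is never touched during tuning.
random.shuffle(labeled)
train, valid, test = labeled[:180], labeled[180:240], labeled[240:]

def accuracy(threshold, subset):
    """Accuracy of the one-parameter threshold 'model' on a subset."""
    return sum(int(x > threshold) == y for x, y in subset) / len(subset)

# Hyperparameter search: candidate thresholds are scored on the
# validation split only, never on the test set.
candidates = [i / 20 for i in range(1, 20)]
best = max(candidates, key=lambda t: accuracy(t, valid))

# Only the final, chosen model sees the test set, once.
print(f"best threshold {best:.2f}, test accuracy {accuracy(best, test):.2f}")
```

With real models the same idea scales up to cross-validation, where the validation role rotates over folds while the test set still stays untouched.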
- The main driver of the big data approach in Data Science has been parallel computing, with technologies like Hadoop and Spark topping the list of the most used. Other technologies, like data compression, have been relegated to a secondary role.
- Feature engineering is one of the most complex problems in building a machine learning model, so it is of the utmost importance to find a good approach to it. Before feeding a machine learning algorithm lots of feature variables, some filtering should be done using PCA, boosted trees or the like. A promising way of choosing the right parameters and the right model is TPOT, a Python tool that automatically creates and optimizes machine learning pipelines using genetic programming.
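As a minimal stand-in for that filtering idea (the talks pointed to PCA or boosted-tree importances; the variance threshold and the data below are our own simplification), a feature filter can look like this:

```python
import statistics

# Hypothetical feature matrix: rows are samples, columns are features.
# Feature 0 varies, feature 1 is nearly constant, feature 2 varies.
samples = [
    [1.0, 0.50, 10.0],
    [2.0, 0.50, 12.0],
    [3.0, 0.51, 11.0],
    [4.0, 0.50, 13.0],
]

def filter_features(rows, min_variance=0.01):
    """Keep only the columns whose population variance exceeds min_variance."""
    columns = list(zip(*rows))
    keep = [
        i for i, col in enumerate(columns)
        if statistics.pvariance(col) > min_variance
    ]
    return keep, [[row[i] for i in keep] for row in rows]

kept, reduced = filter_features(samples)
print("kept feature indices:", kept)  # the near-constant feature 1 is dropped
```

PCA or tree-based importances refine this idea: instead of keeping raw columns, they rank (or recombine) features by how much signal they actually carry.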