
Transactional Machine Learning with Data Streams and AutoML: Build Frictionless and Elastic Machine Learning Solutions with Apache Kafka in the Cloud Using Python
Ebook · 477 pages · 3 hours


About this ebook

Understand how to apply auto machine learning to data streams and create transactional machine learning (TML) solutions that are frictionless (requiring minimal to no human intervention) and elastic (able to scale up or down by controlling the number of data streams, algorithms, and users of the insights). This book will strengthen your knowledge of the inner workings of TML solutions that use data streams with auto machine learning integrated with Apache Kafka.

Transactional Machine Learning with Data Streams and AutoML introduces the industry challenges of applying machine learning to data streams. You will learn a framework for choosing business problems that are best suited for TML and see how to measure the business value of TML solutions. You will then learn the technical components of TML solutions, including the reference and technical architecture of a TML solution.

This book also presents a TML solution template that makes it easy to quickly start building your own TML solutions. Specifically, you are given access to a TML Python library and integration technologies for download. You will also learn how TML will evolve and why organizations have a growing need for deeper insights from data streams.

By the end of the book, you will have a solid understanding of TML. You will know how to build TML solutions with all the necessary details, and all the resources at your fingertips.    

What You Will Learn

  • Discover transactional machine learning
  • Measure the business value of TML
  • Choose TML use cases
  • Design technical architecture of TML solutions with Apache Kafka
  • Work with the technologies used to build TML solutions
  • Build transactional machine learning solutions with hands-on code together with Apache Kafka in the cloud

Who This Book Is For 

Data scientists, machine learning engineers and architects, and AI and machine learning business leaders.

Language: English
Publisher: Apress
Release date: May 19, 2021
ISBN: 9781484270233

    Book preview

    Transactional Machine Learning with Data Streams and AutoML - Sebastian Maurice

    © The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2021

    S. Maurice, Transactional Machine Learning with Data Streams and AutoML, https://doi.org/10.1007/978-1-4842-7023-3_1

    1. Introduction: Big Data, Auto Machine Learning, and Data Streams

    Sebastian Maurice, Toronto, ON, Canada

    Data streams are a class of data that is continuously captured and updated, grows in volume, and is largely unbounded [Aggarwal, 2007; Wrench et al., 2016]. Consider how our everyday lives contribute to data streams. Every time we purchase something with a credit card, the purchasing event information (our name, purchase amount, product purchased, time and date of purchase, location of purchase, quantity, product code, and so on) is captured in real time and stored in a data storage platform capable of holding large amounts of data. Browsing the Web also results in enormous amounts of data flowing through IP networks, captured by our Internet service providers (ISPs). Even the cars we drive are becoming more connected to the Internet, and car manufacturers capture and store all of the telemetry and GPS data.

    Data continues to seep into all facets of our lives. Everyday items that we use today, such as refrigerators, cars, washing machines, and TVs, create massive amounts of data each day. By some estimates, we create 2.5 quintillion bytes of data each day, and most of the world’s data was created in just the past few years. This is impressive in terms of scale and shows that data is flooding our world in ways we never imagined 10 or 15 years ago. Most of us are familiar with data that exist in database tables, flat files, and dataframes, but a new category of data that is creating new challenges for data engineers, scientists, and analysts is massive, fast-moving streams of data, driven by a digitally connected world.

    We are all aware of the growth of data and its value for organizations [Read et al., 2019; Read et al., 2020; Guzy and Wozniak, 2020; Lang et al., 2020], but we are still in the early stages of managing and analyzing fast-moving streams of data, along with handling the load from spikes of events that trigger large data flows, which can also affect the performance of machine learning models. Data streams, or continuous flows of data, discussed in detail later, are generated from multiple sources such as humans or machines, and the technology to manage these streams is growing. Effectively managing and analyzing streaming data is becoming a necessary capability in high-volume transaction industries like financial services, social technology, retail, media, health care, and manufacturing. Think of all the data generated each second (or faster) by Facebook, Twitter, LinkedIn, Netflix, IoT devices, financial technologies, and the like. These fast-flowing data accumulate quickly and, if permitted, can grow to an unlimited size; data stream scientists can analyze them using transactional machine learning (TML) in the following ways:

    1. Data streams can be rolled back and joined to form a consolidated dataset in real time that can be used as a training dataset for TML. By analyzing windows of training datasets in real time, we avoid the need to analyze all the data at once; rather, we can analyze all the data in transactions of time, continuously (a minimal sketch follows this list). For example, if you want to analyze every retail transaction for credit card fraud, applying TML to streams of credit card transactions helps you make timely decisions at the point of purchase.

    2. Data streams can be named to help form a machine learning model. Specifically, by naming data streams, we can identify a dependent variable stream and independent variable streams to construct a model that can be estimated by TML.

    3. Data streams can be repurposed to store information on the optimal algorithm chosen by TML. These algorithms can then be used for predictive analytics and optimization, whose results can be stored in other data streams and used by humans or machines in reports and dashboards for decision-making.
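
    As a rough illustration of the windowed-training idea in point 1, the sketch below keeps a rolling window of recent credit card transactions and retrains on each full window. It is a minimal sketch, not the TML library: the Transaction fields, the window size, and the train_fraud_model placeholder are all hypothetical.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Transaction:            # hypothetical stream record
    amount: float
    merchant: str
    is_fraud: int             # label, available after the fact

WINDOW_SIZE = 1_000           # number of recent events kept for training

window = deque(maxlen=WINDOW_SIZE)   # oldest events fall off automatically

def train_fraud_model(batch):
    # Placeholder for any AutoML / model-fitting call.
    print(f"training on {len(batch)} recent transactions")

def on_new_event(tx: Transaction):
    # Called for every event arriving on the stream.
    window.append(tx)
    # Retrain whenever the window is full, so the model always
    # reflects the most recent "transaction of time".
    if len(window) == WINDOW_SIZE:
        train_fraud_model(list(window))
```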

    We will further show how TML leads to frictionless machine learning which can accelerate conventional machine learning approaches that operate on nontransactional data. Specifically, a conventional machine learning process requires human intervention when preparing data, formulating a mathematical model with the dependent and independent variables, estimating the model and fine-tuning the hyperparameters in the model, and finally deploying the model for real-world use. All of these processes cause friction that can add days or weeks to the machine learning process [Yao et al., 2019]. We show in this book how TML can significantly reduce this friction when dealing with data streams using AutoML.

    We will also show how TML solutions are elastic. TML solutions are elastic because you can adjust the number of data streams and machine learning models that are created, as well as adjust the number of producers of data and consumers of insights from the machine learning models. This is important for several reasons:

    • Allows organizations to quickly meet the analytic needs of a fast-changing business area

    • Allows organizations to control costs by quickly deactivating TML solutions that are no longer being used

    • Allows organizations to scale solutions up or down based on user demand
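
    As one concrete illustration of this elasticity (not a prescription from the TML library), scaling the number of data streams up or down can amount to creating or deleting Kafka topics programmatically. The sketch below uses the third-party kafka-python package; the broker address, topic name, and partition count are assumptions.

```python
from kafka.admin import KafkaAdminClient, NewTopic

# Connect to a Kafka broker (the address is an assumption for this sketch).
admin = KafkaAdminClient(bootstrap_servers="localhost:9092")

# Scale up: create an additional data stream (topic) for a new TML solution.
admin.create_topics(new_topics=[
    NewTopic(name="tml-demand-forecast", num_partitions=3, replication_factor=1)
])

# Scale down: deactivate a solution that is no longer used by removing its stream.
admin.delete_topics(["tml-demand-forecast"])

admin.close()
```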

    A core component of any machine learning process is data. Traditionally, a central concern for a CIO or CDO, before even doing machine learning, is developing a data strategy. But the increase in the speed of data adds another layer of complexity to data management and analysis that is not easily incorporated into conventional data strategies. This book will provide ways to address this challenge and show how data streams can be incorporated into data strategies that align with the goals and objectives of your organization. Before we discuss that, a question we need to ask is: what is data? A quick search on Google will bring up millions of hits. In this book, we consider data that are digitally created. There are three forms of data:

    1. Structured data

    2. Semi-structured data

    3. Unstructured data

    Structured Data

    Structured data are data that are neatly organized in some database. This structure enables developers or users to access data in a way that can be standardized and repeated for use in various types of technological solutions. Structured data is a common type of data because it makes accessing data easier for analysis and reporting. To impose structure on data, we have to do the following:

    1. Classify it – Is it a number or text or image?

    2. Size it – How big is the data? Or how big is it likely to get?

    3. Name it – What name should we give it? Let’s assume that all data with names are called variables.

    By classifying, sizing, and naming data, we are not only structuring the data but making it easier for others to use it and access it. This is important for analyzing and visualizing the data in reports and dashboards.
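
    To make the classify/size/name steps concrete, here is a small sketch that declares a structured record with an explicit type for each field (classify), a declared bound (size), and a field name (name). The record and its limits are invented purely for illustration.

```python
from dataclasses import dataclass

MAX_NAME_LENGTH = 100          # "size it": an explicit bound on the field

@dataclass
class Purchase:                # "name it": the record and every variable have names
    customer_name: str         # "classify it": text
    amount: float              # "classify it": number
    product_code: str

    def __post_init__(self):
        # Enforce the declared size when a record is created.
        if len(self.customer_name) > MAX_NAME_LENGTH:
            raise ValueError("customer_name exceeds the declared size")

row = Purchase(customer_name="A. Singh", amount=42.50, product_code="SKU-123")
print(row)
```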

    Semi-structured Data

    These types of data have some structure to them, but not all of it is structured. Think of data that cannot completely fit into a tabular form but can be tagged and identified by keys and values. Examples include JSON¹ and XML.² These are generally accepted, industry-standard forms of labeling data by keys and values, but they do not fit in a standard, structured, relational database. Semi-structured data is an important form because it does not require a database schema for storage. The storage of the data can be defined at the application level in the form of JSON or XML. This makes semi-structured data very flexible to use and exchange between diverse applications, which in turn makes it easier to consume and visualize in reports and dashboards. TML solutions use JSON data formats.
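
    Since TML solutions exchange data as JSON, a semi-structured stream record might look like the sketch below. The field names and values are hypothetical and only illustrate the key-value labeling described above.

```python
import json

# A key-value record that needs no database schema to be stored or exchanged.
event = {
    "stream": "credit_card_purchases",            # hypothetical stream name
    "timestamp": "2021-05-19T10:15:00Z",
    "payload": {"amount": 42.50, "currency": "USD", "product_code": "SKU-123"},
}

message = json.dumps(event)                       # serialize for transport
print(json.loads(message)["payload"]["amount"])   # 42.5
```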

    Unstructured Data

    These types of data have no structure. They cannot be easily classified in tabular form or put in a key-value format like JSON or XML. Unstructured data are probably the most abundant form of data because they can be created by almost any digital device. Think of emails, data from websites, video, images, sensor readings, and so on. There can be value in imposing a structure on unstructured data. For example, say you have thousands of emails and you want to structure the emails by keywords. Assigning a keyword to an email allows you to classify all emails with that keyword and improve the speed of searching through emails. The common thread between all these types of data is volume. The enormous growth in these types of data is referred to as big data, discussed in the next section.
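
    A toy version of the email-keyword example might look like the following sketch; the keyword list and matching rule are invented and far simpler than a production text classifier.

```python
# Impose minimal structure on unstructured text by tagging emails with keywords.
KEYWORDS = ["invoice", "meeting", "refund"]       # hypothetical tag set

def tag_email(body: str) -> list:
    """Return the keywords found in an email body."""
    lowered = body.lower()
    return [kw for kw in KEYWORDS if kw in lowered]

emails = [
    "Please find the attached invoice for April.",
    "Can we move the meeting to 3pm?",
]
for e in emails:
    print(tag_email(e))       # ['invoice'], then ['meeting']
```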

    A Quick Take on Big Data

    The term Big Data has been used since the 1990s to describe data that are too large to be analyzed using conventional methods. Specifically, data that are not big data can easily be curated, prepped, and analyzed on a laptop or a home computer. The origin of the term Big Data is commonly attributed to John Mashey [Mashey, 1999; Lohr, 2013]. His use of the term in the context of computers was the first time that someone had documented data growth and its growing demands on computer hardware such as disk space, CPU, and infrastructure, which he referred to as InfraStress.

    Big data was then characterized as follows [Sagiroglu, 2013]:

    1) Volume – This refers to the size of data, measured in growing terabytes, petabytes, and beyond. This volume will impact the choice of storage hardware used to store these data and the types of analysis that can be done.

    2) Variety – This refers to the different types of data that can be classified as big data. The types of data will impact not only the storage choice but also how data are analyzed, prepped, and curated for analysis. For example, if big data are textual, then performing analysis on these data with machine learning techniques will require converting the data to numerical form.

    3) Velocity – This refers to the creation speed of data. The speed of data creation will directly impact the volume of these data. It will also impact how data are processed and how they can be analyzed for insights.

    4) Veracity – This refers to the quality of data. The quality of data will impact the quality of the insights that are extracted from these data. While it is difficult to gauge data quality, statistical methods and algorithms are available to determine it; more on data quality later.

    5) Value – This refers to the value of the insights extracted from data. The value of data should be gauged within the context of the problem or area of investigation. For example, if one is trying to answer the question of why car insurance premiums are higher for younger people than for older people, then using data that capture the driving patterns of different demographics could add a lot of value in answering this question and in optimally pricing insurance premiums for different age groups. Choosing the right data to address the right problem can offer considerable value in many business domains.

    The preceding characteristics are not a complete list, but they give us guidance in understanding, characterizing, and classifying big data. Specifically, volume, variety, and velocity of data present challenges in ensuring data quality and finding ways to analyze high-speed data with machine learning that offers quality insights for decision-making. These challenges with high-speed data are exactly the ones that are addressed and resolved by TML.

    Data streams can lead to big data, but big data does not necessarily lead to data streams [Jayanthiladevi et al., 2018]. Specifically, continuous flows of data will accumulate in your storage platform, leading to big data. However, big data does not need to flow continuously and can be static and disk resident. Within the context of data streams (discussed later), the big data characteristics that apply to data streams are velocity, volume, veracity, and variety. The value characteristic further applies when performing TML, that is, when using data streams together with auto machine learning, as we will discuss in Chapter 2.

    From a TML solution and infrastructure perspective, it will be important to establish an environment where data can grow and not be limited in any way. Organizations that embrace a limitless data mentality will promote a greater emphasis on data analysis to extract insights to make better data-driven decisions [Sagiroglu, 2013]. However, making good data-driven decisions will be dependent on using data with high quality. We will discuss data quality in the next section.

    Data Quality

    Insights that are extracted from data are directly dependent on the quality of the data. Data quality concerns do not change between conventional, static data and continuously flowing data streams. However, how quality issues are determined and identified does vary between the two types of data, and this is directly a function of the velocity of the data. For example, higher-velocity data gives rise to faster changes in the underlying structure of the data. Detecting and improving data quality in data streams presents further challenges. Given the continuous flow of data, assessing quality requires automated and real-time processing [Gudivada et al., 2017]. How can you perform data imputation or detect duplicate data in a data stream? This is still an outstanding issue but is starting to get more attention [Gudivada et al., 2017]. TML can offer some help in this area, as discussed in Chapter 2. Specifically, when trying to identify outliers, or anomalies, in the data, conventional anomaly detection mechanisms may not pick up all outliers. The issue with conventional approaches is that they do not take into account transactional data that vary with time. We will show how TML uses unsupervised learning algorithms to detect outliers, or anomalies, in fast-flowing data streams that vary quickly with time.
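
    For a sense of what real-time outlier detection on a stream can look like (this is a generic sketch, not the unsupervised algorithm TML itself uses), the code below keeps running mean and variance with Welford's method and flags values that sit far from the recent mean. The threshold of 3 standard deviations is an arbitrary assumption.

```python
import math

class StreamingZScore:
    """Online mean/variance (Welford's method) with a simple z-score anomaly flag."""

    def __init__(self, threshold: float = 3.0):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0
        self.threshold = threshold

    def update(self, x: float) -> bool:
        # Flag the point against the statistics seen so far, then fold it in.
        is_outlier = False
        if self.n > 1:
            std = math.sqrt(self.m2 / (self.n - 1))
            if std > 0 and abs(x - self.mean) / std > self.threshold:
                is_outlier = True
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        return is_outlier

detector = StreamingZScore()
for amount in [10, 12, 11, 9, 10, 500, 11]:
    if detector.update(amount):
        print("possible anomaly:", amount)   # flags 500
```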

    The adage of garbage in, garbage out is true. Good data, as opposed to bad data, is a critical requirement for good insights. But how does one determine whether the data at hand is of good quality? The International Organization for Standardization (ISO) defines quality as the totality of the characteristics of an entity that bear on its ability to satisfy stated and implied needs [Heravizadeh et al., 2009]. These stated and implied needs will vary based on the environment and context in which the data are being used. This further implies that data quality thresholds will vary with the environment and context. For example, it is likely that the data quality threshold needed to measure someone’s risk of cancer would be higher than the quality threshold needed to measure the likelihood of people preferring Coca-Cola over Pepsi.

    In fact, the dimensions of data quality will vary as well. For example, in accounting and auditing, accuracy, relevancy, and timeliness are three important data quality dimensions [Sidi et al., 2012]. In the area of Information Systems, reliability, precision, relevancy, usability, and independency are important. Table 1-1 shows a consolidated list of data quality dimensions [Sidi et al., 2012, p. 302].

    Table 1-1  Data Quality Dimensions

    Using data mining and statistical techniques, together with finding dependencies between dimensions, allows us to determine the level of data quality [Sidi et al., 2012]. But assessing the quality of big data presents challenges such as dealing with complex factors, missing data, data duplication, and data heterogeneity, all resulting from data being drawn from multiple sources [Gudivada et al., 2017] and generated by both humans and machines. Several data mining and statistical techniques are available to improve data quality, such as data imputation to fill in missing data, outlier detection using machine learning algorithms like regression analysis, and duplicate data detection using natural language processing [Gudivada et al., 2017].
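
    As a small, generic illustration of two of these techniques, imputation and duplicate detection (using pandas rather than any TML-specific tooling; the column names and fill strategy are assumptions):

```python
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "amount": [10.0, None, 20.0, 30.0],     # a missing value to impute
})

# Simple imputation: fill missing numeric values with the column mean.
df["amount"] = df["amount"].fillna(df["amount"].mean())

# Basic duplicate detection: drop exact duplicate rows.
df = df.drop_duplicates()

print(df)
```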

    Finding dependencies between dimensions and then using data mining and statistical techniques to objectively measure the level of quality offers promise. We can apply a framework that shows how dimensions are related to data variables that would lead to a data quality improvement [McGilvray, 2008]. However, while it
