Ebook, 491 pages, 6 hours

Mastering Java for Data Science

Brand: Packt Publishing
Rating: 5.0 (1 review)

By Alexey Grigorev

Rating: 5 out of 5 stars

About this ebook

About This Book

An overview of modern Data Science and Machine Learning libraries available in Java.
Coverage of a broad set of topics, going from the basics of Machine Learning to Deep Learning and Big Data frameworks.
Easy-to-follow illustrations and the running example of building a search engine.

Who This Book Is For

This book is intended for software engineers who are comfortable developing Java applications and are familiar with the basic concepts of data science. It will also be useful for data scientists who do not yet know Java but want or need to learn it.

If you want to build efficient data science applications and bring them into an enterprise environment without changing the existing stack, this book is for you!

Language: English

Publisher: Packt Publishing

Release date: Apr 27, 2017

ISBN: 9781785887390

Author

Alexey Grigorev

Related to Mastering Java for Data Science

Related ebooks

Distributed Computing in Java 9
Ebook
by Raja Malleswara Rao Pattamsetti
Rating: 0 out of 5 stars
0 ratings
Building a Recommendation System with R
Ebook
by Gorakala Suresh K.
Rating: 0 out of 5 stars
0 ratings
Getting Started with Python Data Analysis
Ebook
by Vo.T.H Phuong
Rating: 0 out of 5 stars
0 ratings
R Machine Learning By Example
Ebook
by Dipanjan Sarkar
Rating: 0 out of 5 stars
0 ratings
Mastering Scala Machine Learning
Ebook
by Alex Kozlov
Rating: 0 out of 5 stars
0 ratings
Apache Spark Graph Processing
Ebook
by Ramamonjison Rindra
Rating: 0 out of 5 stars
0 ratings
Java 9 Programming By Example
Ebook
by Peter Verhas
Rating: 4 out of 5 stars
4/5
Learning Data Mining with Python - Second Edition
Ebook
by Robert Layton
Rating: 0 out of 5 stars
0 ratings
Machine Learning with Spark - Second Edition
Ebook
by Nick Pentreath
Rating: 0 out of 5 stars
0 ratings
Troubleshooting PostgreSQL
Ebook
by Hans-Jürgen Schönig
Rating: 5 out of 5 stars
5/5
PostgreSQL Development Essentials
Ebook
by Manpreet Kaur
Rating: 5 out of 5 stars
5/5
Apache Cassandra Essentials
Ebook
by Padalia Nitin
Rating: 4 out of 5 stars
4/5
Everyday Data Structures
Ebook
by William Smith
Rating: 0 out of 5 stars
0 ratings
Python for Google App Engine
Ebook
by Massimiliano Pippi
Rating: 0 out of 5 stars
0 ratings
JasperReports 3.5 for Java Developers
Ebook
by David R. Heffelfinger
Rating: 0 out of 5 stars
0 ratings
Deep Learning with TensorFlow
Ebook
by Md. Rezaul Karim
Rating: 5 out of 5 stars
5/5
R Object-oriented Programming
Ebook
by Kelly Black
Rating: 3 out of 5 stars
3/5
Learn D3.js: Create interactive data-driven visualizations for the web with the D3.js library
Ebook
by Helder da Rocha
Rating: 0 out of 5 stars
0 ratings
Python Data Structures and Algorithms
Ebook
by Benjamin Baka
Rating: 5 out of 5 stars
5/5
Java for Data Science
Ebook
by Jennifer L. Reese
Rating: 0 out of 5 stars
0 ratings
Scala for Machine Learning
Ebook
by Nicolas Patrick R.
Rating: 0 out of 5 stars
0 ratings
Hadoop MapReduce v2 Cookbook - Second Edition
Ebook
by Thilina Gunarathne
Rating: 0 out of 5 stars
0 ratings
Python: Deeper Insights into Machine Learning
Ebook
by John Hearty
Rating: 0 out of 5 stars
0 ratings
Apache Spark for Data Science Cookbook
Ebook
by Padma Priya Chitturi
Rating: 0 out of 5 stars
0 ratings
Mastering Spark for Data Science
Ebook
by Andrew Morgan
Rating: 0 out of 5 stars
0 ratings
MongoDB Cookbook - Second Edition
Ebook
by Dasadia Cyrus
Rating: 0 out of 5 stars
0 ratings
Mastering Apache Cassandra - Second Edition
Ebook
by Nishant Neeraj
Rating: 0 out of 5 stars
0 ratings
Practical Full Stack Machine Learning: A Guide to Build Reliable, Reusable, and Production-Ready Full Stack ML Solutions
Ebook
by Alok Kumar
Rating: 0 out of 5 stars
0 ratings
Python Data Science Essentials
Ebook
by Boschetti Alberto
Rating: 0 out of 5 stars
0 ratings
JUnit Recipes: Practical Methods for Programmer Testing
Ebook
by Scott Stirling
Rating: 4 out of 5 stars
4/5

Intelligence (AI) & Semantics For You

Artificial Intelligence: A Guide for Thinking Humans
Ebook
by Melanie Mitchell
Rating: 4 out of 5 stars
4/5
2084: Artificial Intelligence and the Future of Humanity
Ebook
by John C. Lennox
Rating: 4 out of 5 stars
4/5
ChatGPT For Dummies
Ebook
by Pam Baker
Rating: 0 out of 5 stars
0 ratings
Midjourney Mastery - The Ultimate Handbook of Prompts
Ebook
by Andreea Todinca
Rating: 5 out of 5 stars
5/5
Dark Aeon: Transhumanism and the War Against Humanity
Ebook
by Joe Allen
Rating: 5 out of 5 stars
5/5
Summary of Super-Intelligence From Nick Bostrom
Ebook
by Summary Station
Rating: 5 out of 5 stars
5/5
Mastering ChatGPT: 21 Prompts Templates for Effortless Writing
Ebook
by Cea West
Rating: 5 out of 5 stars
5/5
101 Midjourney Prompt Secrets
Ebook
by Marcus Byrne
Rating: 3 out of 5 stars
3/5
ChatGPT for Beginners: How to Make Money Online and 10x Your Productivity Using ChatGPT Even if You’re an Absolute Beginner (The Complete Up-to-Date ChatGPT Guide)
Ebook
by Matthew Hayes
Rating: 0 out of 5 stars
0 ratings
Impromptu: Amplifying Our Humanity Through AI
Ebook
by Reid Hoffman
Rating: 5 out of 5 stars
5/5
The Secrets of ChatGPT Prompt Engineering for Non-Developers
Ebook
by Cea West
Rating: 5 out of 5 stars
5/5
ChatGPT For Fiction Writing: AI for Authors
Ebook
by Nova Leigh
Rating: 5 out of 5 stars
5/5
Data Science from Scratch: The #1 Data Science Guide for Everything A Data Scientist Needs to Know: Python, Linear Algebra, Statistics, Coding, Applications, Neural Networks, and Decision Trees
Ebook
by Steven Cooper
Rating: 4 out of 5 stars
4/5
Our Final Invention: Artificial Intelligence and the End of the Human Era
Ebook
by James Barrat
Rating: 4 out of 5 stars
4/5
ChatGPT Money Machine 2024 - The Ultimate Chatbot Cheat Sheet to Go From Clueless Noob to Prompt Prodigy Fast! Complete AI Beginner’s Course to Catch the GPT Gold Rush Before It Leaves You Behind
Ebook
by Alec Rowe
Rating: 0 out of 5 stars
0 ratings
Creating Online Courses with ChatGPT | A Step-by-Step Guide with Prompt Templates
Ebook
by Cea West
Rating: 4 out of 5 stars
4/5
Summary of Building a Second Brain: by Tiago Forte - A Proven Method to Organize Your Digital Life and Unlock Your Creative Potential - A Comprehensive Summary
Ebook
by Alexander Cooper
Rating: 1 out of 5 stars
1/5
ChatGPT Ultimate User Guide - How to Make Money Online Faster and More Precise Using AI Technology
Ebook
by Maximus Wilson
Rating: 0 out of 5 stars
0 ratings
Hacking With Linux 2020:A Complete Beginners Guide to the World of Hacking Using Linux - Explore the Methods and Tools of Ethical Hacking with Linux
Ebook
by Joseph Kenna
Rating: 0 out of 5 stars
0 ratings
Python for Beginners. A Smarter Way to Learn Python in 5 Days and Remember it Longer. With Easy Step by Step Guidance and Hands on Examples. (Python Crash Course-Programming for Beginners)
Ebook
by Arthur T. Brooks
Rating: 0 out of 5 stars
0 ratings
The Algorithm of the Universe (A New Perspective to Cognitive AI)
Ebook
by Ancient Philosophy
Rating: 5 out of 5 stars
5/5
Large Language Models
Ebook
by A. Scholtens
Rating: 2 out of 5 stars
2/5
CompTIA Certification: The Ultimate Guide To Discover CompTIA. Certified Quickly And Easily Passing The Certification Exam. Real Practice Test With Detailed Screenshots, Answers And Explanations
Ebook
by David Mayer
Rating: 0 out of 5 stars
0 ratings
A Quickstart Guide To Becoming A ChatGPT Millionaire: The ChatGPT Book For Beginners (Lazy Money Series®)
Ebook
by S M Howard
Rating: 4 out of 5 stars
4/5
Rise of Generative AI and ChatGPT: Understand how Generative AI and ChatGPT are transforming and reshaping the business world (English Edition)
Ebook
by Utpal Chakraborty
Rating: 0 out of 5 stars
0 ratings
Chat-GPT Income Ideas: Pioneering Monetization Concepts Utilizing Conversational AI for Profitable Ventures
Ebook
by The Passive Income Strategist
Rating: 4 out of 5 stars
4/5
ChatGPT: The Future of Intelligent Conversation
Ebook
by Cea West
Rating: 4 out of 5 stars
4/5
THE CHATGPT MILLIONAIRE'S HANDBOOK: UNLOCKING WEALTH THROUGH AI AUTOMATION
Ebook
by Logan Rivers
Rating: 5 out of 5 stars
5/5
AI for Educators: AI for Educators
Ebook
by Matt Miller
Rating: 5 out of 5 stars
5/5
Make Money with ChatGPT: Your Guide to Making Passive Income Online with Ease using AI: AI Wealth Mastery
Ebook
by Ben Preston
Rating: 0 out of 5 stars
0 ratings

Related podcast episodes

Nest, Node.js, & F.Secure - Application Security Weekly #None: In the news, the entire Nest ecosystem of smart home devices goes offline, how Alphabet plans to keep hackers away from this year's election, the Node.js Ecosystem is chaotic and insecure, open-source vulnerabilities plague enterprise codebase...
Podcast episode
by Security Weekly Podcast Network (Video)
0 ratings
0% found this document useful
040: Graph Databases: Traditional relational databases like MySQL or Postgres are really good at providing many solutions to the problem of persisting state. But these types of database are really horrible at querying highly connected models in an efficient way. Graph datab...
Podcast episode
by PHPRoundtable Podcast
0 ratings
0% found this document useful
Renee M. P. Teate, "SQL for Data Scientists: A Beginner's Guide for Building Datasets for Analysis" (John Wiley & Sons, 2021): An interview with Renee M. P. Teate
Podcast episode
by New Books in Science, Technology, and Society
0 ratings
0% found this document useful
55: Go on The Web: Summary Andrew Gerrand (@enneff), Developer Advocate at Google & Go core contributor, talks about GoLang and how it is being used in Web Development today as well as the plans for the future of the Go as a platform for the web. Resources Go...
Podcast episode
by The Web Platform Podcast
100%
100% found this document useful
State In React: In this episode of Syntax, Scott and Wes talk about state in React: local state, global state, UI state, data state, caching, API data and more! LogRocket - Sponsor LogRocket lets you replay what users do on your site, helping you reproduce bugs and...
Podcast episode
by Syntax - Tasty Web Development Treats
0 ratings
0% found this document useful
MLA 015 SageMaker 1: Part 1 of deploying your ML models to the cloud with SageMaker (MLOps) MLOps is deploying your ML models to the cloud. See for an overview of tooling (also generally a great ML educational run-down.) And I forgot to...
Podcast episode
by Machine Learning Guide
0 ratings
0% found this document useful
433: Falling for FastAPI: Mike's falling in love with FastAPI and gives us a hint at the next project he's building.
Podcast episode
by Coder Radio
0 ratings
0% found this document useful
69: Testing Front End Code: Summary Oren Rubin (@Shexman) goes through why it’s important to not only test the back-end code of our applications but also to test our Front End code, the integration points, and the full user experience. Oren also goes through...
Podcast episode
by The Web Platform Podcast
0 ratings
0% found this document useful
The Real E2E RAG Stack // Sam Bean // #217
Podcast episode
by MLOps.community
0 ratings
0% found this document useful
Build Your Second Brain One Piece At A Time: Generative AI promises to accelerate the productivity of human collaborators. Currently the primary way of working with these tools is through a conversational prompt, which is often cumbersome and unwieldy. In order to simplify the integration of AI capabilities into developer workflows Tsavo Knott helped create Pieces, a powerful collection of tools that complements the tools that developers already use. In this episode he explains the data collection and preparation process, the collection of model types and sizes that work together to power the experience, and how to incorporate it into your workflow to act as a second brain.
Podcast episode
by Data Engineering Podcast
0 ratings
0% found this document useful
How Data Platforms Affect ML & AI // Jake Watson // #207
Podcast episode
by MLOps.community
0 ratings
0% found this document useful
Harnessing Generative AI For Creating Educational Content With Illumidesk: Generative AI has unlocked a massive opportunity for content creation. There is also an unfulfilled need for experts to be able to share their knowledge and build communities. Illumidesk was built to take advantage of this intersection. In this episode Greg Werner explains how they are using generative AI as an assistive tool for creating educational material, as well as building a data driven experience for learners.
Podcast episode
by Data Engineering Podcast
0 ratings
0% found this document useful
MLOps Meetup #24 // How to Become a Better Data Scientist: The Definite Guide // Alexey Grigorev
Podcast episode
by MLOps.community
0 ratings
0% found this document useful
Agile Development for Data Scientists, Part 1: The Good: If you're a data scientist at a firm that does a …
Podcast episode
by Linear Digressions
0 ratings
0% found this document useful
Pushing The Limits Of Scalability And User Experience For Data Processing With Jignesh Patel: Data processing technologies have dramatically improved in their sophistication and raw throughput. Unfortunately, the volumes of data that are being generated continue to double, requiring further advancements in the platform capabilities to keep up. As the sophistication increases, so does the complexity, leading to challenges for user experience. Jignesh Patel has been researching these areas for several years in his work as a professor at Carnegie Mellon University. In this episode he illuminates the landscape of problems that we are faced with and how his research is aimed at helping to solve these problems.
Podcast episode
by Data Engineering Podcast
0 ratings
0% found this document useful
Potluck - Courses for Kids × Sub-Components × Recursion × DB Hosting × Frameworks × Data Structures & Algorithms × More!: It’s another potluck! In this episode, Scott and Wes answer your questions about kids learning to code, React sub-components, why it’s so hard to scale, new frameworks, data structures, and more! LogRocket - Sponsor LogRocket lets you replay what...
Podcast episode
by Syntax - Tasty Web Development Treats
0 ratings
0% found this document useful
70: Web Components at Microsoft: Summary Daniel Buchner (@csuwildcat), former Mozillian & Program Manager at Microsoft takes us through the plans for Web Components at Microsoft. Daniel is the creator of the Web Components free open source library, X-Tag which Microsoft is now...
Podcast episode
by The Web Platform Podcast
0 ratings
0% found this document useful
Making Email Better With AI At Shortwave: Generative AI has rapidly transformed everything in the technology sector. When Andrew Lee started work on Shortwave he was focused on making email more productive. When AI started gaining adoption he realized that he had even more potential for a transformative experience. In this episode he shares the technical challenges that he and his team have overcome in integrating AI into their product, as well as the benefits and features that it provides to their customers.
Podcast episode
by Data Engineering Podcast
0 ratings
0% found this document useful
Building Linked Data Products With JSON-LD: A significant amount of time in data engineering is dedicated to building connections and semantic meaning around pieces of information. Linked data technologies provide a means of tightly coupling metadata with raw information. In this episode Brian Platz explains how JSON-LD can be used as a shared representation of linked data for building semantic data products.
Podcast episode
by Data Engineering Podcast
0 ratings
0% found this document useful
The Birth and Growth of Spark: An Open Source Success Story // Matei Zaharia // MLOps Podcast #155
Podcast episode
by MLOps.community
0 ratings
0% found this document useful
Building Cody, an Open Source AI Coding Assistant // Beyang Liu // MLOps Podcast #173
Podcast episode
by MLOps.community
0 ratings
0% found this document useful
Increase Your Odds Of Success For Analytics And AI Through More Effective Knowledge Management With AlignAI: Making effective use of data requires proper context around the information that is being used. As the size and complexity of your organization increases the difficulty of ensuring that everyone has the necessary knowledge about how to get their work done scales exponentially. Wikis and intranets are a common way to attempt to solve this problem, but they are frequently ineffective. Rehgan Avon co-founded AlignAI to help address this challenge through a more purposeful platform designed to collect and distribute the knowledge of how and why data is used in a business. In this episode she shares the strategic and tactical elements of how to make more effective use of the technical and organizational resources that are available to you for getting work done with data.
Podcast episode
by Data Engineering Podcast
0 ratings
0% found this document useful
Safely Test Your Applications And Analytics With Production Quality Data Using Tonic AI: The most interesting and challenging bugs always happen in production, but recreating them is a constant challenge due to differences in the data that you are working with. Building your own scripts to replicate data from production is time consuming and error-prone. Tonic is a platform designed to solve the problem of having reliable, production-like data available for developing and testing your software, analytics, and machine learning projects. In this episode Adam Kamor explores the factors that make this such a complex problem to solve, the approach that he and his team have taken to turn it into a reliable product, and how you can start using it to replace your own collection of scripts.
Podcast episode
by Data Engineering Podcast
0 ratings
0% found this document useful
Barking Up The Wrong GPTree: Building Better AI With A Cognitive Approach: Artificial intelligence has dominated the headlines for several months due to the successes of large language models. This has prompted numerous debates about the possibility of, and timeline for, artificial general intelligence (AGI). Peter Voss has dedicated decades of his life to the pursuit of truly intelligent software through the approach of cognitive AI. In this episode he explains his approach to building AI in a more human-like fashion and the emphasis on learning rather than statistical prediction.
Podcast episode
by Data Engineering Podcast
0 ratings
0% found this document useful
332 — How to choose a learning platform: How do you pick from the hundreds of platforms out there? What questions might you ask to refine your options? If you’re looking for a learning platform, then you’ve got quite the decision to make! Not only is the market huge and complicated, but...
Podcast episode
by The Mind Tools L&D Podcast
0 ratings
0% found this document useful
#08 - Tech stack: Metabase, Superset, Redash, Grafana
Podcast episode
by TOPP - The Open Podcast Podcast
0 ratings
0% found this document useful
Zenlytic Is Building You A Better Coworker With AI Agents: The purpose of business intelligence systems is to allow anyone in the business to access and decode data to help them make informed decisions. Unfortunately this often turns into an exercise in frustration for everyone involved due to complex workflows and hard-to-understand dashboards. The team at Zenlytic have leaned on the promise of large language models to build an AI agent that lets you converse with your data. In this episode they share their journey through the fast-moving landscape of generative AI and unpack the difference between an AI chatbot and an AI agent.
Podcast episode
Zenlytic Is Building You A Better Coworker With AI Agents: The purpose of business intelligence systems is to allow anyone in the business to access and decode data to help them make informed decisions. Unfortunately this often turns into an exercise in frustration for everyone involved due to complex workflows and hard-to-understand dashboards. The team at Zenlytic have leaned on the promise of large language models to build an AI agent that lets you converse with your data. In this episode they share their journey through the fast-moving landscape of generative AI and unpack the difference between an AI chatbot and an AI agent.
byData Engineering Podcast
0 ratings
0% found this document useful
Foundational Embeddings for Transfer Learning in RecSys // Sanket Gupta // #232
Podcast episode
Foundational Embeddings for Transfer Learning in RecSys // Sanket Gupta // #232
byMLOps.community
0 ratings
0% found this document useful
Tackling Real Time Streaming Data With SQL Using RisingWave: Stream processing systems have long been built with a code-first design, adding SQL as a layer on top of the existing framework. RisingWave is a database engine that was created specifically for stream processing, with S3 as the storage layer. In this episode Yingjun Wu explains how it is architected to power analytical workflows on continuous data flows, and the challenges of making it responsive and scalable.
Podcast episode
Tackling Real Time Streaming Data With SQL Using RisingWave: Stream processing systems have long been built with a code-first design, adding SQL as a layer on top of the existing framework. RisingWave is a database engine that was created specifically for stream processing, with S3 as the storage layer. In this episode Yingjun Wu explains how it is architected to power analytical workflows on continuous data flows, and the challenges of making it responsive and scalable.
byData Engineering Podcast
0 ratings
0% found this document useful
API First, Lifecycles and Governance
Podcast episode
API First, Lifecycles and Governance
byThe Cloudcast
0 ratings
0% found this document useful

Skip carousel

Book preview

Mastering Java for Data Science - Alexey Grigorev

Title Page

Mastering Java for Data Science

Building data science applications in Java

Alexey Grigorev

BIRMINGHAM - MUMBAI

Copyright

Mastering Java for Data Science

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: April 2017

Production reference: 1250417

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham

B3 2PB, UK.

ISBN 978-1-78217-427-1

www.packtpub.com

Credits

About the Author

Alexey Grigorev is a skilled data scientist, machine learning engineer, and software developer with more than 7 years of professional experience.

He started his career as a Java developer working at a number of large and small companies, but after a while he switched to data science. Right now, Alexey works as a data scientist at Searchmetrics, where, in his day-to-day job, he actively uses Java and Python for data cleaning, data analysis, and modeling.

His areas of expertise are machine learning and text mining, but he also enjoys working on a broad set of problems, which is why he often participates in data science competitions on platforms such as kaggle.com.

You can connect with Alexey on LinkedIn at https://de.linkedin.com/in/agrigorev.

I would like to thank my wife, Larisa, and my son, Arkadij, for their patience and support while I was working on the book.

About the Reviewers

Stanislav Bashkyrtsev has been working with Java for the last 9 years. In recent years, he has focused on the automation and optimization of development processes.

Luca Massaron is a data scientist and a marketing research director specializing in multivariate statistical analysis, machine learning, and customer insight, with over a decade of experience in solving real-world problems and in generating value for stakeholders by applying reasoning, statistics, data mining, and algorithms. From being a pioneer of Web audience analysis in Italy to achieving the rank of top ten Kaggler, he has always been passionate about everything regarding data and analysis and about demonstrating the potential of data-driven knowledge discovery to both experts and nonexperts. Favoring simplicity over unnecessary sophistication, he believes that a lot can be achieved in data science just by doing the essential. He is the coauthor of five recently published books and is currently working on a sixth. For Packt Publishing he contributed as an author to Python Data Science Essentials (both 1st and 2nd editions), Regression Analysis with Python, and Large Scale Machine Learning with Python.

You can find him on LinkedIn at https://it.linkedin.com/in/lmassaron.

Prashant Verma started his IT career in 2011 as a Java developer at Ericsson, working in the telecom domain. After a couple of years of Java EE experience, he moved into the big data domain and has worked on almost all the popular big data technologies, such as Hadoop, Spark, Flume, Mongo, Cassandra, and so on. He has also played with Scala. Currently, he works with QA Infotech as lead data engineer, solving e-learning domain problems using analytics and machine learning.

Prashant has worked for many companies such as Ericsson and QA Infotech, with domain knowledge of telecom and e-learning. Prashant has also been working as a freelance consultant in his free time.

I want to thank Packt Publishing for giving me the chance to review the book as well as my employer and my family for their patience while I was busy working on this book.

www.PacktPub.com

For support files and downloads related to your book, please visit www.PacktPub.com.

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at service@packtpub.com for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

https://www.packtpub.com/mapt

Get the most in-demand software skills with Mapt. Mapt gives you full access to all Packt books and video courses, as well as industry-leading tools to help you plan your personal development and advance your career.

Why subscribe?

Fully searchable across every book published by Packt

Copy and paste, print, and bookmark content

On demand and accessible via a web browser

Customer Feedback

Thanks for purchasing this Packt book. At Packt, quality is at the heart of our editorial process. To help us improve, please leave us an honest review on this book's Amazon page at https://www.amazon.com/dp/1782174273.

If you'd like to join our team of regular reviewers, you can e-mail us at customerreviews@packtpub.com. We award our regular reviewers with free eBooks and videos in exchange for their valuable feedback. Help us be relentless in improving our products!

Preface

What this book covers

What you need for this book

Who this book is for

Conventions

Reader feedback

Customer support

Downloading the example code

Downloading the color images of this book

Errata

Piracy

Questions

Data Science Using Java

Data science

Machine learning

Supervised learning

Unsupervised learning

Clustering

Dimensionality reduction

Natural Language Processing

Data science process models

CRISP-DM

A running example

Data science in Java

Data science libraries

Data processing libraries

Math and stats libraries

Machine learning and data mining libraries

Text processing

Summary

Data Processing Toolbox

Standard Java library

Collections

Input/Output

Reading input data

Writing output data

Streaming API

Extensions to the standard library

Apache Commons

Commons Lang

Commons IO

Commons Collections

Other commons modules

Google Guava

AOL Cyclops React

Accessing data

Text data and CSV

Web and HTML

JSON

Databases

DataFrames

Search engine - preparing data

Summary

Exploratory Data Analysis

Exploratory data analysis in Java

Search engine datasets

Apache Commons Math

Joinery

Interactive Exploratory Data Analysis in Java

JVM languages

Interactive Java

Joinery shell

Summary

Supervised Learning - Classification and Regression

Classification

Binary classification models

Smile

JSAT

LIBSVM and LIBLINEAR

Encog

Evaluation

Accuracy

Precision, recall, and F1

ROC and AU ROC (AUC)

Result validation

K-fold cross-validation

Training, validation, and testing

Case study - page prediction

Regression

Machine learning libraries for regression

Smile

JSAT

Other libraries

Evaluation

MSE

MAE

Case study - hardware performance

Summary

Unsupervised Learning - Clustering and Dimensionality Reduction

Dimensionality reduction

Unsupervised dimensionality reduction

Principal Component Analysis

Truncated SVD

Truncated SVD for categorical and sparse data

Random projection

Cluster analysis

Hierarchical methods

K-means

Choosing K in K-Means

DBSCAN

Clustering for supervised learning

Clusters as features

Clustering as dimensionality reduction

Supervised learning via clustering

Evaluation

Manual evaluation

Supervised evaluation

Unsupervised Evaluation

Summary

Working with Text - Natural Language Processing and Information Retrieval

Natural Language Processing and information retrieval

Vector Space Model - Bag of Words and TF-IDF

Vector space model implementation

Indexing and Apache Lucene

Natural Language Processing tools

Stanford CoreNLP

Customizing Apache Lucene

Machine learning for texts

Unsupervised learning for texts

Latent Semantic Analysis

Text clustering

Word embeddings

Supervised learning for texts

Text classification

Learning to rank for information retrieval

Reranking with Lucene

Summary

Extreme Gradient Boosting

Gradient Boosting Machines and XGBoost

Installing XGBoost

XGBoost in practice

XGBoost for classification

Parameter tuning

Text features

Feature importance

XGBoost for regression

XGBoost for learning to rank

Summary

Deep Learning with DeepLearning4J

Neural Networks and DeepLearning4J

ND4J - N-dimensional arrays for Java

Neural networks in DeepLearning4J

Convolutional Neural Networks

Deep learning for cats versus dogs

Reading the data

Creating the model

Monitoring the performance

Data augmentation

Running DeepLearning4J on GPU

Summary

Scaling Data Science

Apache Hadoop

Hadoop MapReduce

Common Crawl

Apache Spark

Link prediction

Reading the DBLP graph

Extracting features from the graph

Node features

Negative sampling

Edge features

Link Prediction with MLlib and XGBoost

Link suggestion

Summary

Deploying Data Science Models

Microservices

Spring Boot

Search engine service

Online evaluation

A/B testing

Multi-armed bandits

Summary

Preface

Data science has become quite an important tool for organizations nowadays: they have collected large amounts of data, and to put it to good use, they need data science, the discipline concerned with methods for extracting knowledge from data. Every day more and more companies realize that they can benefit from data science and utilize the data that they produce more effectively and more profitably.

This is especially true for IT companies: they already have the systems and the infrastructure for generating and processing the data. These systems are often written in Java, the language of choice for many large and small companies across the world. This is no surprise: Java offers a solid and mature ecosystem of time-proven, reliable libraries, so many people trust Java and use it for creating their applications.

Thus, it is also a natural choice for many data processing applications. Since the existing systems are already in Java, it makes sense to use the same technology stack for data science, and integrate the machine learning model directly in the application's production code base.

This book will cover exactly that. We will first see how we can utilize Java's toolbox for processing small and large datasets, then look into doing initial exploratory data analysis. Next, we will review the Java libraries that implement common machine learning models for classification, regression, clustering, and dimensionality reduction problems. Then we will get into more advanced techniques and discuss Information Retrieval and Natural Language Processing, XGBoost, deep learning, and large-scale tools for processing big datasets such as Apache Hadoop and Apache Spark. Finally, we will also have a look at how to evaluate and deploy the produced models so that other services can use them.

We hope you will enjoy the book. Happy reading!

What this book covers

Chapter 1, Data Science Using Java, provides an overview of the existing tools available in Java and introduces CRISP-DM, a methodology for approaching data science projects. In this chapter, we also introduce our running example, building a search engine.

Chapter 2, Data Processing Toolbox, reviews the standard Java library: the Collection API for storing the data in memory, the IO API for reading and writing the data, and the Streaming API for a convenient way of organizing data processing pipelines. We will look at the extensions to the standard libraries such as Apache Commons Lang, Apache Commons IO, Google Guava, and AOL Cyclops React. Then, we will cover the most common ways of storing data (text and CSV files, HTML, JSON, and SQL databases) and discuss how we can get the data from these data sources. We will finish this chapter by talking about the ways we can collect the data for the running example, the search engine, and how we prepare the data for that.

Chapter 3, Exploratory Data Analysis, performs the initial analysis of data with Java: we look at how to calculate common statistics such as the minimal and maximal values, the average value, and the standard deviation. We also talk a bit about interactive analysis and see which tools allow us to visually inspect the data before building models. For the illustration in this chapter, we use the data we collect for the search engine.

Chapter 4, Supervised Learning - Classification and Regression, starts with machine learning and then looks at the models for performing supervised learning in Java. Among others, we look at how to use the following libraries: Smile, JSAT, LIBSVM, LIBLINEAR, and Encog, and we see how we can use these libraries to solve classification and regression problems. We use two examples here. First, we use the search engine data to predict whether a URL will appear on the first page of results or not, which illustrates the classification problem. Second, we predict how much time it takes to multiply two matrices on certain hardware given its characteristics, which illustrates the regression problem.

Chapter 5, Unsupervised Learning - Clustering and Dimensionality Reduction, explores the methods for dimensionality reduction available in Java, and we learn how to apply PCA and Random Projection to reduce the dimensionality of data. This is illustrated with the hardware performance dataset from the previous chapter. We also look at different ways to cluster data, including Agglomerative Clustering, K-Means, and DBSCAN, and we use the dataset with customer complaints as an example.

Chapter 6, Working with Text - Natural Language Processing and Information Retrieval, looks at how to use text in data science applications, and we learn how to extract more useful features for our search engine. We also look at Apache Lucene, a library for full-text indexing and searching, and Stanford CoreNLP, a library for performing Natural Language Processing. Next, we look at how we can represent words as vectors, and we learn how to build such embeddings from co-occurrence matrices and how to use existing ones like GloVe. We also look at how we can use machine learning for texts, and we illustrate it with a sentiment analysis problem where we apply LIBLINEAR to classify whether a review is positive or negative.

Chapter 7, Extreme Gradient Boosting, covers how to use XGBoost in Java and applies it to two problems from earlier chapters: classifying whether a URL appears on the first page and predicting the time to multiply two matrices. Additionally, we look at how to solve the learning-to-rank problem with XGBoost and again use our search engine example as illustration.

Chapter 8, Deep Learning with DeepLearning4j, covers Deep Neural Networks and DeepLearning4j, a library for building and training these networks in Java. In particular, we talk about Convolutional Neural Nets and see how we can use them for image recognition--predicting whether it is a picture of a dog or a cat. Additionally, we discuss data augmentation--the way to generate more data, and also mention how we can speed up the training using GPUs. We finish the chapter by describing how to rent a GPU server on Amazon AWS.

Chapter 9, Scaling Data Science, talks about the big data tools available in Java: Apache Hadoop and Apache Spark. We illustrate them by looking at how we can process Common Crawl, the copy of the Internet, and calculate the TF-IDF of each document there. Additionally, we look at the graph processing tools available in Apache Spark and build a recommendation system for scientists that recommends a coauthor for the next possible paper.

Chapter 10, Deploying Data Science Models, looks at how we can expose the models to the rest of the world in such a way that they are usable. Here we cover Spring Boot and talk about how we can use the search engine model we developed to rank the articles from Common Crawl. We finish by discussing the ways to evaluate the performance of the models in online settings and talk about A/B tests and Multi-Armed Bandits.

What you need for this book

You need a reasonably recent system with at least 2 GB of RAM running Windows 7, Ubuntu 14.04, or Mac OS X. You will also need Java 1.8.0 or above and Maven 3.0.0 or above installed.

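Who this book is for

This book is intended for software engineers who are comfortable with developing Java applications and are familiar with the basic concepts of data science. It will also be useful for data scientists who do not yet know Java but want or need to learn it.

If you want to build efficient data science applications and bring them into the enterprise environment without changing the existing stack, this book is for you!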

Conventions

In this book, you will find a number of text styles that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning.

Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: Here, we create SummaryStatistics objects and add all body content lengths.

A block of code is set as follows:

Any command-line input or output is written as follows:

New terms and important words are shown in bold. Words that you see on the screen, for example, in menus or dialog boxes, appear in the text like this: "If, instead, our model outputs some score such that the higher the values of the score the more likely the item is to be positive, then the binary classifier is called a ranking classifier."

Warnings or important notes appear in a box like this.

Tips and tricks appear like this.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book: what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of.

To send us general feedback, simply e-mail feedback@packtpub.com, and mention the book's title in the subject of your message.

If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

You can download the code files by following these steps:

Hover the mouse pointer on the SUPPORT tab at the top.

Click on Code Downloads & Errata.

Enter the name of the book in the Search box.

Select the book for which you're looking to download the code files.

Choose from the drop-down menu where you purchased this book from.

Click on Code Download.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

WinRAR / 7-Zip for Windows

Zipeg / iZip / UnRarX for Mac

7-Zip / PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Mastering-Java-for-Data-Science. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Downloading the color images of this book

We also provide you with a PDF file that has color images of the screenshots/diagrams used in this book. The color images will help you better understand the changes in the output. You can download this file from https://www.packtpub.com/sites/default/files/downloads/MasteringJavaforDataScience_ColorImages.pdf.

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books (maybe a mistake in the text or the code), we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.

To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.

Piracy

Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at copyright@packtpub.com with a link to the suspected pirated material.

We appreciate your help in protecting our authors and our ability to bring you valuable content.

Questions

If you have a problem with any aspect of this book, you can contact us at questions@packtpub.com, and we will do our best to address the problem.

Data Science Using Java

This book is about building data science applications using the Java language. In this book, we will cover all the aspects of implementing projects from data preparation to model deployment.

The readers of this book are assumed to have some previous exposure to Java and data science, and the book will help to take this knowledge to the next level. This means learning how to effectively tackle a specific data science problem and get the most out of the available data.

This is an introductory chapter where we will prepare the foundation for all the other chapters. Here we will cover the following topics:

What are machine learning and data science?

Cross Industry Standard Process for Data Mining (CRISP-DM), a methodology for doing data science projects

Machine learning libraries in Java for medium and large-scale data science applications

By the end of this chapter, you will know how to approach a data science project and what Java libraries to use to do that.

Data science

Data science is the discipline of extracting actionable knowledge from data of various forms. The name data science emerged quite recently: it was coined by DJ Patil and Jeff Hammerbacher and popularized in the article Data Scientist: The Sexiest Job of the 21st Century in 2012. But the discipline itself had existed for quite a while before that and was previously known by other names, such as data mining or predictive analytics. Data science, like its predecessors, is built on statistics and machine learning algorithms for knowledge extraction and model building.

The science part of the term data science is no coincidence: if we look up science, its definition can be summarized as the systematic organization of knowledge in terms of testable explanations and predictions. This is exactly what data scientists do: by extracting patterns from available data, they can make predictions about future unseen data, and they make sure the predictions are validated beforehand.
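
To make the idea of validating predictions beforehand more concrete, here is a minimal sketch in plain Java. It is not taken from the book: the class name HoldOutSketch, the 80/20 split, the fixed seed, and the toy labels are all assumptions made for illustration. It shows one common way to hold part of the data back and measure how often predictions match the known answers.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical illustration: a hold-out split plus an accuracy check.
public class HoldOutSketch {

    // Returns the row indices 0..n-1 in a shuffled order. A fixed seed
    // makes the split reproducible between runs.
    static List<Integer> shuffledIndices(int n, long seed) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            indices.add(i);
        }
        Collections.shuffle(indices, new Random(seed));
        return indices;
    }

    // Fraction of items whose predicted label equals the true label.
    static double accuracy(String[] actual, String[] predicted) {
        int correct = 0;
        for (int i = 0; i < actual.length; i++) {
            if (actual[i].equals(predicted[i])) {
                correct++;
            }
        }
        return (double) correct / actual.length;
    }

    public static void main(String[] args) {
        // Assumed 80/20 split: the first part is for training, the rest is
        // held out and only used to check the predictions.
        List<Integer> indices = shuffledIndices(10, 42L);
        int trainSize = (int) (indices.size() * 0.8);
        System.out.println("train rows:    " + indices.subList(0, trainSize));
        System.out.println("held-out rows: " + indices.subList(trainSize, indices.size()));

        // Made-up true labels and predictions for four held-out items.
        String[] actual = {"cat", "dog", "dog", "cat"};
        String[] predicted = {"cat", "dog", "cat", "cat"};
        System.out.println("accuracy = " + accuracy(actual, predicted));
    }
}
```

The held-out rows play the role of future unseen data: the model never sees them during training, so the accuracy measured on them is an estimate of how well the predictions will hold up later.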

Nowadays, data science is used across many fields, including (but not limited to):

Banking: Risk management (for example, credit scoring), fraud detection, trading

Insurance: Claims management (for example, accelerating claim approval), risk and loss estimation, and fraud detection

Health care: Predicting diseases (such as strokes, diabetes, cancer) and relapses

Retail and e-commerce: Market basket analysis (identifying products that go well together), recommendation engines, product categorization, and personalized searches

This book covers the following practical use cases:

Predicting whether a URL is likely to appear on the first page of a search engine

Predicting how fast an operation will be completed given the hardware specifications

Ranking text documents for a search engine

Checking whether there is a cat or a dog in a picture

Recommending friends in a social network

Processing large-scale textual data on a cluster of computers

In all these cases, we will use data science to learn from data and use the learned knowledge to solve a particular business problem.

We will also use a running example throughout the book: building a search engine. We will use it to illustrate many data science concepts, such as supervised machine learning, dimensionality reduction, text mining, and learning-to-rank models.

Machine learning

Machine learning is a part of computer science, and it is at the core of data science. Raw data, especially in large volumes, is hardly useful by itself, but it hides highly valuable patterns. With the help of machine learning, we can recognize these hidden patterns, extract them, and then apply the learned information to new, unseen items.

For example, given the image of an animal, a machine learning algorithm can say whether the picture shows a dog or a cat; or, given the history of a bank client, it can say how likely the client is to default, that is, to fail to pay the debt.

Often, machine learning models are seen as black boxes that take in a data point and output a prediction for it. In this book, we will look at what is inside these black boxes and see how and when it is best to use them.
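As a rough illustration of this black-box view, the sketch below defines a tiny Model interface (a data point goes in, a prediction comes out) and fills it with a nearest-centroid classifier. None of it comes from the book: the names NearestCentroidDemo, Model, and train, the two features, and the client data are hypothetical and chosen only to mirror the bank default example.

```java
class NearestCentroidDemo {

    /** The black-box view: a data point goes in, a predicted class comes out. */
    interface Model {
        int predict(double[] features);
    }

    /**
     * Learns one centroid (mean feature vector) per class from labeled examples.
     * The returned model predicts the class whose centroid is closest to the point.
     */
    static Model train(double[][] features, int[] labels, int numClasses) {
        int dim = features[0].length;
        double[][] centroids = new double[numClasses][dim];
        int[] counts = new int[numClasses];

        // Sum the feature vectors of each class, then divide by the class size.
        for (int i = 0; i < features.length; i++) {
            counts[labels[i]]++;
            for (int j = 0; j < dim; j++) {
                centroids[labels[i]][j] += features[i][j];
            }
        }
        for (int c = 0; c < numClasses; c++) {
            for (int j = 0; j < dim; j++) {
                centroids[c][j] /= counts[c];
            }
        }

        // What is "inside the box": a comparison of squared Euclidean distances.
        return point -> {
            int best = 0;
            double bestDistance = Double.POSITIVE_INFINITY;
            for (int c = 0; c < numClasses; c++) {
                double distance = 0.0;
                for (int j = 0; j < dim; j++) {
                    double diff = point[j] - centroids[c][j];
                    distance += diff * diff;
                }
                if (distance < bestDistance) {
                    bestDistance = distance;
                    best = c;
                }
            }
            return best;
        };
    }

    public static void main(String[] args) {
        // Made-up client history: {debt-to-income ratio, number of late payments}.
        // Label 1 means the client defaulted, 0 means the debt was repaid.
        double[][] history = {
            {0.10, 0}, {0.20, 1}, {0.15, 0},
            {0.80, 5}, {0.90, 4}, {0.70, 6}
        };
        int[] defaulted = {0, 0, 0, 1, 1, 1};

        Model model = train(history, defaulted, 2);

        // Apply the learned pattern to clients the model has not seen before.
        System.out.println("risky client   -> " + model.predict(new double[] {0.85, 5}));
        System.out.println("reliable client -> " + model.predict(new double[] {0.12, 0}));
    }
}
```

The features here are deliberately left unscaled, so the late-payment count dominates the distance. A real application would normally normalize the features first and would likely use a more capable model.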

The typical problems that machine learning solves can be categorized into the following groups:

Supervised learning: For each data point, we have a label--extra information that describes the outcome that we want to learn. In the cats versus dogs case, the

Book preview

Mastering Java for Data Science - Alexey Grigorev

Title Page

Mastering Java for Data Science

Building data science applications in Java

Alexey Grigorev

Copyright

Mastering Java for Data Science

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham

B3 2PB, UK.

Credits

About the Author

About the Reviewers

www.PacktPub.com

Why subscribe?

Customer Feedback

Table of Contents

Preface

What this book covers

What you need for this book

Who this book is for

Conventions

Reader feedback

Customer support

Downloading the example code

Errata

Piracy

Questions

Data Science Using Java

Data science

Machine learning