The Future of Big Data: Next-Generation Database Management Systems
In 2009, the U.S. Army Intelligence and Security Command wanted the ability to track national security threats in real time. Potential solutions had to provide instant results and use graphics to provide insight into the command’s extremely large streaming datasets. At the time, nothing available met those needs. Neither NoSQL solutions nor classical relational systems could handle the scaling requirements.
In response, Nima Negahban and Amit Vij, of Kinetica, designed and built a new database. It was built around the Graphics Processing Unit (GPU) and allows its users to explore and present data in real time. Modern GPUs are very efficient at manipulating and organizing computer graphics. Their parallel structure also makes them much more efficient than general-purpose CPUs for algorithms that process large blocks of data in parallel.
As an additional plus, the database was useful for other purposes. The United States Postal Service applied Kinetica’s database to optimize routes and increase accuracy. Then businesses started using the system. As data from the Internet of Things (IoT) increased, businesses started dealing with the challenge of analyzing streaming data in real time. At present, GPUs offer the most cost-effective solution for large amounts of data being streamed in real time, and “for processing Big Data.”
Kinetica has developed an In-Memory Database Management System using Graphics Processing Units. The software it markets is also called Kinetica. Nima Negahban is Kinetica’s CTO and a co-founder. He oversees Kinetica’s technical strategy and also manages the engineering team. Negahban has designed and developed cutting-edge Big Data systems for a variety of market sectors, from high-speed trading systems to biotechnology.
Negahban made four predictions about Big Data research in 2018:
- An emphasis will be placed on Predictive Analytics and Location Intelligence. There will be a dramatic increase in streaming data analysis use cases as businesses demand a return on their IoT investments. “While it is a good start for enterprises to collect and store IoT data,” said Negahban, “what is more meaningful is understanding it, analyzing it and leveraging the insights to improve efficiency.” Common applications such as saving energy, optimizing package delivery routes, and delivering pizzas faster all fit into this trend.
- Investments in Artificial Intelligence “life cycle management” and the technologies housing the data will continue to increase, and the supervision of this process will mature. Enterprises have spent the past few years educating themselves on various Artificial Intelligence frameworks and tools. As AI goes more mainstream, “it will move beyond just small-scale experiments run by Data Scientists in an ad hoc manner to being automated and operationalized.” According to Negahban, the daily workloads of Data Scientists have centered more on Database Management and administration than on coding and developing algorithms. This is changing, though, as more automation enters the market.
- Enterprises will re-think their traditional approach of storing Big Data and consider moving it to a next-generation database (GPU or SIMD). With workloads running up to 100x faster on these newer technologies, this is going to “require a complete re-write of the traditional Data Warehouse.”
- The tracking of data points within advanced systems such as AI frameworks, especially in terms of decision-making impacts and error detection, is increasingly important. Auditing and tracking inputs and following the decision-making path will help determine what ultimately caused an incorrect decision made by such a framework, and this work will continue to gain focus within organizations.
Thinking Outside of the Box
When describing the beginnings of Kinetica, Mr. Negahban said:
“We had hundreds of different real-time data streams coming in, and this desire to have an analytics engine that could do a large query inventory very quickly. The traditional approach to solving that problem was not working. I had a background in GPUs, and I thought there’s this incredible new device where computing is an infinite resource. With approaches of the past, they assumed compute was a really limited resource.”
Negahban and his co-founder worked to build better data structures, so that when a query arrives, the actual amount of compute time spent resolving it is considerably less. “Computing is now an unlimited resource,” said Negahban. “Let’s make a database that’s about leveraging this abundant resource across many nodes, and creating the processing logic to do things in parallel.”
Such an idea was essentially “flipping the database equation on its head.” With that basic vision, and a kind of “usability ethos,” the developer doesn’t have to worry about the scale of the work anymore. Developers can have a world-class database that continues to scale as a company or project grows, “without having to change a thing.”
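The idea of spreading one query across many workers can be illustrated with a small sketch. This is not Kinetica’s implementation. It is a minimal Python example, under the assumption that a query is a simple average, that splits the data into shards, computes a partial aggregate for each shard concurrently, and then merges the partial results. The function names are hypothetical, and the thread pool merely stands in for GPU cores or cluster nodes.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_sum_count(shard):
    """Compute a partial aggregate (sum, count) over one shard of values."""
    return sum(shard), len(shard)


def parallel_mean(values, workers=4):
    """Split values into shards, aggregate each shard concurrently, then merge."""
    if not values:
        raise ValueError("values must not be empty")
    size = -(-len(values) // workers)  # ceiling division: rows per shard
    shards = [values[i:i + size] for i in range(0, len(values), size)]
    # Threads are only a stand-in here for GPU cores or separate nodes.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(partial_sum_count, shards))
    # Merge step: combine the per-shard partial aggregates into one answer.
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count
```

Because each shard is aggregated independently, adding workers changes only how the data is split, not the calling code, which is the sense in which the developer does not have to think about scale.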
Leveraging the Concept
Negahban also described the present and immediate future of where Big Data and databases stand:
“What we saw back then is becoming more and more prevalent, now. One thing that is interesting about V3 data (volume, velocity, and variety), is that a lot of the time it is very geospatially related. And so, what you’re starting to see is this convergence of the classical OLAP workloads with the geospatial world. What they both have in common is the answer to their queries isn’t something that lends itself well to classic data structures, if you want flexibility.”
That’s the biggest thing for analysts. Today they want flexibility. They don’t want to be stuck with a limited query inventory, where if they start to have an insight, that insight process has to be stopped, because someone has to write a smart job or write an index, so they can ask the next question. “You lose that creative cycle and that insight cycle that really drives a return on investments,” remarked Negahban.
“And you really want to get that return. Figuring out how to operate the business better, how to understand your customers better. To do that an analyst must have that creative flexibility to go in different directions and have that kind of real time feedback.”
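To make the flexibility point concrete, here is a minimal Python sketch of an ad hoc geospatial aggregate answered by a plain scan, with no index prepared in advance. The sample records, the bounding box, and the function names are all hypothetical, and a real GPU database would run the scan in parallel rather than in a single loop.

```python
from collections import defaultdict

# Hypothetical sample records: (latitude, longitude, category, value)
EVENTS = [
    (38.90, -77.03, "delivery", 12.0),
    (38.95, -77.10, "delivery", 8.0),
    (40.71, -74.00, "pickup", 5.0),
    (38.88, -77.01, "pickup", 3.0),
]


def scan_bbox(events, min_lat, max_lat, min_lon, max_lon):
    """Full scan: yield (category, value) for every event inside the box."""
    for lat, lon, category, value in events:
        if min_lat <= lat <= max_lat and min_lon <= lon <= max_lon:
            yield category, value


def total_by_category(events, bbox):
    """Sum values per category for the events that fall inside bbox."""
    totals = defaultdict(float)
    for category, value in scan_bbox(events, *bbox):
        totals[category] += value
    return dict(totals)


# The analyst can change the box or the grouping freely between questions.
area = (38.8, 39.0, -77.2, -76.9)  # (min_lat, max_lat, min_lon, max_lon)
result = total_by_category(EVENTS, area)
```

Since nothing is precomputed, the next question (a different area, a different grouping) costs only another scan, which is affordable when compute is abundant.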
The databases of the past use a popular approach that works very well for a certain type of data, with certain characteristics. That may have worked for several years, but now there is a new class of data being created, and there isn’t a clear winner for processing that kind of data.
“Yes, it’s V3,” said Negahban. “But the type of workload being asked of it is also different than your classic data sets. It’s more aggregate, it’s geospatially related, and a lot of the time, you want to generate features that you model against.”
This is causing a convergence of workload and data types, and there is no clear solution where everyone says, “That’s my go-to!” Instead, there is a hodgepodge of solutions that people are throwing together.
“If they want to do Machine Learning against it, they are again moving data to someplace else, a lot of the time,” he commented. “So, I think there’s this hunger for a converged solution that we can provide.”
Profiting from Earlier Investments
Negahban went on to explain the financial importance of GPU-based processing:
“Because we’re at this nexus of data processing capabilities, and having the GPU which is the premier device for machine learning, we’re kind of in a sweet spot. We’re able to provide all the necessary capabilities an enterprise might need as they try to increase or rely on this type of data.”
The amount of investment that went into collecting all this data is tremendous. “From a hardware standpoint, from a training standpoint,” he said. “They’re giving every single carrier a device, and they have huge data lakes to capture all these bread crumbs.”
He said the Hadoop movement should be thanked for making Data Lakes a common operation for enterprises. There has been a tremendous amount of investment in that, and folks are saying:
“Hey! We’re spending millions of dollars a year on licenses and hardware to store all these breadcrumbs (bits of Big Data), and what are we doing with it? People are saying, ‘Let’s get a Return On Investment’ off this, or let’s do something different with it. Let’s not grab it, but of course, everyone is pushing to do something with it.”
Big Data and Machine Learning
Big Data is also directly tied into the arena of Machine Learning (the “process” of teaching machines to make decisions). Machine Learning networks are massively parallel. Consequently, GPUs provide a remarkably good fit when it comes to Machine Learning. The ability to have an analytical, in-memory database that can utilize multiple GPUs across a variety of nodes to accomplish massively parallel queries (both analytical and statistical) is becoming a vital necessity for many organizations grappling with their Big Data assets.
Kinetica supports custom code for analytical processing through user-defined functions. This allows the database to integrate with GPU-accelerated Machine Learning libraries such as BIDMach, Caffe, Torch, and TensorFlow.
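The general pattern of a user-defined function can be sketched as follows. This is not Kinetica’s API. `TinyTable`, `register_udf`, and `apply_udf` are hypothetical names for a toy in-memory column store, written in Python, that hands column data to caller-supplied code in chunks. In a real system the function body is where a call into a Machine Learning library would go.

```python
class TinyTable:
    """Toy in-memory column store that accepts user-defined functions."""

    def __init__(self, columns):
        self.columns = columns  # dict: column name -> list of values
        self.udfs = {}          # dict: UDF name -> callable(chunk) -> list

    def register_udf(self, name, func):
        """Store a function that maps a chunk of values to output values."""
        self.udfs[name] = func

    def apply_udf(self, name, column, chunk_size=2):
        """Run the named UDF over one column, chunk by chunk, in order."""
        func = self.udfs[name]
        data = self.columns[column]
        out = []
        for start in range(0, len(data), chunk_size):
            # Each chunk is independent, so chunks could run in parallel.
            out.extend(func(data[start:start + chunk_size]))
        return out


# Hypothetical stand-in for model inference: flag unusually large amounts.
table = TinyTable({"amount": [10, 250, 40, 900]})
table.register_udf("flag_large", lambda chunk: [v > 100 for v in chunk])
flags = table.apply_udf("flag_large", "amount")
```

Keeping the function next to the data in this way avoids exporting the table to a separate system before scoring it, which is the data movement Negahban describes above.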
The combination of Machine Learning and GPUs is becoming one of the more intriguing areas in Data Science, with systems learning patterns in data and acting on those lessons. The potential applications are wide-ranging and include autonomous robots, image recognition, drug discovery, fraud detection, and much more yet to be imagined. The future of Big Data hinges on the continued development of new technologies that can make use of the growing datasets with ease.