What is Big Data?
One of the frequently used buzzwords of the last decade is big data. To explain what is so different or unique about it, we should probably start with what data is and what happens when it gets big.
Data is a general term: we use it for everything we record in order to remember and use later. Within this broad definition, we require certain specific conditions before a record is recognized as data.
1. We need to know the provenance and chain of custody of the data to trust it. If you don't know where records come from, or if you can't be sure they have not been manipulated, then they are not data.
2. Records have to be consistent in naming, scaling, and units. If records are not compatible with each other, then they are not data.
Recent technological advancements have allowed us to collect more data on everything. We can now collect detailed data on economic activity in almost real time, such as when, where, and what people buy. We can collect data on how, when, and where people go on vacations. We can even collect data on the personal interactions of people on social media platforms. The availability of such detailed data has profound implications for everything from the banking sector to supply chains, public administration, policing, and policymaking. However, the volume of the data we collect also puts hard technological and economic limits on what we can do with it.
Engineering is said to be the art of solving a problem as a balancing act between cost, availability, and compromises. An engineer is someone who comes up with a good enough and cheap enough solution to a given technical problem. From this perspective, solving a problem by running a machine learning algorithm on available data becomes a serious engineering problem, especially when the volume of the data at hand is large and the answer is needed in almost real time. Big data is the term engineers use for the heaps of data that they have to collect, maintain, and feed to a machine learning algorithm through the hardware and software pipelines they construct. During this process, problem solvers have to make compromises depending on economic and technological conditions. In other words, there is data, and whether it is big or not depends on how much money you have available to spend on it.
What is Machine Learning?
Learning is a process by which individuals change their behavior or strategy based on new data or experiences. From this perspective, learning appears to be a profoundly human activity. However, in an age where automated or autonomous agents freely interact with humans and with each other, learning has become something non-human agents do as well. Machine Learning is now used as the umbrella term that covers such phenomena.
Before getting into how machines learn, let us briefly look at how humans approach problem-solving. We humans are excellent problem solvers, mostly because our brains are very good at recognizing and distinguishing patterns and at making generalizations. We also tend to solve problems iteratively. A typical process goes something like this: gather data about a problem, analyze it, propose a solution, test the solution, then go back to the beginning and modify the proposed solution to fit the problem at hand better. Lather, rinse, and repeat.
When it comes to learning by automated or autonomous agents, one has to imagine a system in which a machine adjusts a proposed solution on its own, based on the data it gathers by feeding inputs to that solution. To do that, a typical machine learning process uses two critical ingredients:
1. a class of proposed solutions that can be adjusted according to one or several parameters, and
2. a cost function that assigns a measure that determines how well a proposed solution solves the problem at hand.
With these two ingredients in our pocket, finding the parameters that minimize the cost function over the class of proposed solutions is a well-defined mathematical problem. All machine learning algorithms essentially follow this script.
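To make the script concrete, here is a minimal sketch of the two ingredients in Python. Everything in it is an assumption made for illustration: the class of proposed solutions is taken to be straight lines y = a*x + b, the cost function is the mean squared error, and the minimization is done with plain gradient descent, which is only one of many possible ways to search for good parameters. The function names and the toy data are hypothetical.

```python
# Illustrative sketch only: the model, cost, and optimizer below are
# assumptions chosen for simplicity, not a prescribed method.

def predict(params, x):
    """Ingredient 1: a proposed solution y = a*x + b, adjustable via (a, b)."""
    a, b = params
    return a * x + b

def cost(params, data):
    """Ingredient 2: mean squared error of the proposed solution on the data."""
    return sum((predict(params, x) - y) ** 2 for x, y in data) / len(data)

def gradient(params, data):
    """Partial derivatives of the cost with respect to a and b."""
    n = len(data)
    grad_a = sum(2 * (predict(params, x) - y) * x for x, y in data) / n
    grad_b = sum(2 * (predict(params, x) - y) for x, y in data) / n
    return grad_a, grad_b

def fit(data, steps=5000, rate=0.05):
    """Repeatedly nudge the parameters in the direction that lowers the cost."""
    params = (0.0, 0.0)
    for _ in range(steps):
        grad_a, grad_b = gradient(params, data)
        params = (params[0] - rate * grad_a, params[1] - rate * grad_b)
    return params

# Hypothetical toy records generated from the line y = 2x + 1.
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
fitted = fit(data)
print("fitted parameters:", fitted)
print("cost before:", cost((0.0, 0.0), data), "cost after:", cost(fitted, data))
```

The loop in fit is the machine's version of "lather, rinse, and repeat": it tests the current proposal through the cost function and adjusts the parameters, with no human in the loop. Swapping in a richer class of solutions or a different cost function changes the details but not the script.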
Prof. Dr. Atabey Kaygun