Recap: What is Big Data?
‘Big Data’ is the application of specialized techniques and technologies to process very large sets of data. These data sets are often so large and complex that they become difficult to process using on-hand database management tools. Examples include web logs, call records, medical records, military surveillance, photography archives, video archives and large-scale e-commerce.
By ‘very large’ we’re talking about datasets that require at least one terabyte – if not hundreds of petabytes – of storage (note that 1 petabyte = 1024 terabytes!). Facebook is estimated to store at least 100 petabytes of pictures and videos alone.
Implementing Big Data Techniques: 7 Things to Consider
According to IDC Canada, a Toronto-based IT research firm, Big Data is one of the top three things that will matter in 2013. With that in mind, there are 7 widely used Big Data analysis techniques that we’ll be seeing more of over the next 12 months:
- Association rule learning
- Classification tree analysis
- Genetic algorithms
- Machine learning
- Regression analysis
- Sentiment analysis
- Social network analysis
1. Association rule learning
Are people who purchase tea more or less likely to purchase carbonated drinks?
Association rule learning is a method for discovering interesting correlations between variables in large databases. It was first used by major supermarket chains to discover interesting relations between products, using data from supermarket point-of-sale (POS) systems.
Association rule learning is being used to help:
- place products in better proximity to each other in order to increase sales
- extract information about visitors to websites from web server logs
- analyze biological data to uncover new relationships
- monitor system logs to detect intruders and malicious activity
- identify whether people who buy milk and butter are more likely to buy diapers
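To make the tea-and-carbonated-drinks question concrete, here is a minimal sketch of the three measures usually reported for an association rule: support, confidence and lift. The baskets are invented for illustration, and the function names (`support`, `rule_metrics`) are our own rather than part of any library.

```python
# Hypothetical point-of-sale baskets, invented for illustration only.
baskets = [
    {"tea", "milk"},
    {"tea", "soda"},
    {"soda", "chips"},
    {"tea", "milk", "bread"},
    {"soda", "chips", "bread"},
    {"tea", "bread"},
    {"soda", "milk"},
    {"soda", "chips", "tea"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def rule_metrics(antecedent, consequent, baskets):
    """Support, confidence and lift for the rule antecedent -> consequent.

    Assumes the antecedent and consequent each appear in at least one basket.
    """
    s_a = support(antecedent, baskets)
    s_c = support(consequent, baskets)
    s_both = support(set(antecedent) | set(consequent), baskets)
    confidence = s_both / s_a   # P(consequent | antecedent)
    lift = confidence / s_c     # > 1: more likely together, < 1: less likely
    return {"support": s_both, "confidence": confidence, "lift": lift}

metrics = rule_metrics({"tea"}, {"soda"}, baskets)
print(metrics)
```

In this made-up data the lift comes out below 1, which would suggest that tea buyers are less likely than the average shopper to buy a carbonated drink. Real analyses typically use an algorithm such as Apriori to search many candidate rules rather than checking one by hand.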
2. Classification tree analysis
Which categories does this document belong to?
Statistical classification is a method of identifying which categories a new observation belongs to, and a classification tree is one common way of doing it. It requires a training set of correctly identified observations – historical data in other words.
Statistical classification is being used to:
- automatically assign documents to categories
- categorize organisms into groupings
- develop profiles of students who take online courses
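As a rough illustration of how a classification tree learns from a training set, the sketch below grows a small tree by repeatedly picking the split that gives the lowest Gini impurity. The training data is hypothetical: each document is reduced to two invented features (how often the words "goal" and "election" appear), and the helper names are our own.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means perfectly pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(rows, labels):
    """Return the (feature_index, threshold) that most reduces impurity, or None."""
    best, best_score = None, gini(labels)
    n = len(rows)
    for f in range(len(rows[0])):
        for t in sorted({row[f] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[f] <= t]
            right = [y for row, y in zip(rows, labels) if row[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best_score:
                best, best_score = (f, t), score
    return best

def build_tree(rows, labels, depth=0, max_depth=3):
    """Grow a tree of nested dicts; leaves are the majority class label."""
    split = best_split(rows, labels) if depth < max_depth else None
    if split is None:
        return Counter(labels).most_common(1)[0][0]
    f, t = split
    left = [(row, y) for row, y in zip(rows, labels) if row[f] <= t]
    right = [(row, y) for row, y in zip(rows, labels) if row[f] > t]
    return {
        "feature": f,
        "threshold": t,
        "left": build_tree([r for r, _ in left], [y for _, y in left], depth + 1, max_depth),
        "right": build_tree([r for r, _ in right], [y for _, y in right], depth + 1, max_depth),
    }

def predict(tree, row):
    """Walk from the root to a leaf and return that leaf's label."""
    while isinstance(tree, dict):
        branch = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree

# Hypothetical training set: (count of "goal", count of "election") per document.
rows = [(5, 0), (3, 1), (4, 0), (0, 4), (1, 5), (0, 3)]
labels = ["sports", "sports", "sports", "politics", "politics", "politics"]

tree = build_tree(rows, labels)
print(predict(tree, (4, 1)), predict(tree, (0, 6)))
```

Production systems would normally rely on an established library and far richer features; the point here is only that the tree is learned from correctly labelled historical examples.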
3. Genetic algorithms
Which TV programs should we broadcast, and in what time slot, to maximize our ratings?
Genetic algorithms are inspired by the way evolution works – that is, through mechanisms such as inheritance, mutation and natural selection. These mechanisms are used to “evolve” useful solutions to problems that require optimization.
Genetic algorithms are being used to:
- schedule doctors for hospital emergency rooms
- return combinations of the optimal materials and engineering practices required to develop fuel-efficient cars
- generate “artificially creative” content such as puns and jokes
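The TV scheduling question can be sketched as a tiny genetic algorithm. Everything specific here is an assumption made for illustration: the ratings table is invented, and the population size, mutation rate and number of generations are arbitrary choices rather than recommended settings.

```python
import random

# Hypothetical expected ratings: RATINGS[program][slot], invented for illustration.
RATINGS = {
    "news":   [5, 8, 6, 2],
    "drama":  [2, 6, 9, 7],
    "comedy": [3, 7, 8, 5],
    "sports": [4, 5, 7, 9],
}
PROGRAMS = list(RATINGS)

def fitness(schedule):
    """Total expected rating when schedule[i] airs in time slot i."""
    return sum(RATINGS[program][slot] for slot, program in enumerate(schedule))

def crossover(a, b, rng):
    """Inheritance: keep a slice of parent a, fill the rest in parent b's order."""
    i, j = sorted(rng.sample(range(len(a)), 2))
    kept = a[i:j]
    rest = [program for program in b if program not in kept]
    return rest[:i] + kept + rest[i:]

def mutate(schedule, rng, rate=0.2):
    """Mutation: with probability `rate`, swap the programs in two slots."""
    child = schedule[:]
    if rng.random() < rate:
        i, j = rng.sample(range(len(child)), 2)
        child[i], child[j] = child[j], child[i]
    return child

def evolve(generations=30, pop_size=12, seed=0):
    rng = random.Random(seed)  # seeded so the run is repeatable
    population = [rng.sample(PROGRAMS, len(PROGRAMS)) for _ in range(pop_size)]
    for _ in range(generations):
        # Selection: the fitter half survives and becomes the parent pool.
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]
        children = [
            mutate(crossover(*rng.sample(parents, 2), rng), rng)
            for _ in range(pop_size - len(parents))
        ]
        population = parents + children
    return max(population, key=fitness)

best = evolve()
print(best, fitness(best))
```

With only four programs there are just 24 possible schedules, so brute force would be simpler. Genetic algorithms earn their keep when the search space is far too large to enumerate.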
4. Machine learning
Which movies from our catalogue would this customer most likely want to watch next, based on their viewing history?
Machine learning includes software that can learn from data. It gives computers the ability to learn without being explicitly programmed, and is focused on making predictions based on known properties learned from sets of “training data.”
Machine learning is being used to help:
- distinguish between spam and non-spam email messages
- learn user preferences and make recommendations based on this information
- determine the best content for engaging prospective customers
- determine the probability of winning a case and set legal billing rates
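To show what "learning from training data" can look like for the spam example, here is a minimal sketch of a naive Bayes text classifier, one of many possible machine learning methods. The six training messages are invented, the "ham" label for non-spam is our own shorthand, and the `train` and `classify` functions are illustrative rather than taken from any library.

```python
import math
from collections import Counter, defaultdict

# Hypothetical training messages, invented for illustration.
TRAIN = [
    ("win cash prize now", "spam"),
    ("claim your free prize", "spam"),
    ("free cash offer", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday", "ham"),
    ("project agenda and notes", "ham"),
]

def train(examples):
    """Count how often each word appears under each label."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(text.split())
    vocab = {word for counts in word_counts.values() for word in counts}
    return word_counts, label_counts, vocab

def classify(text, model):
    """Return the label with the highest naive Bayes log-score."""
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    scores = {}
    for label in label_counts:
        # Log prior plus log likelihood of each known word,
        # using add-one (Laplace) smoothing to avoid zero probabilities.
        score = math.log(label_counts[label] / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in text.split():
            if word in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][word] + 1) / denom)
        scores[label] = score
    return max(scores, key=scores.get)

model = train(TRAIN)
print(classify("free prize", model), classify("monday meeting notes", model))
```

Nobody wrote a rule saying "prize means spam"; the classifier inferred it from the labelled examples, which is the sense in which the program was not explicitly programmed.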
5. Regression analysis
How does your age affect the kind of car you buy?
At a basic level, regression analysis involves manipulating some independent variable (e.g. background music) to see how it influences a dependent variable (e.g. time spent in store). It describes how the value of a dependent variable changes when the independent variable is varied. It works best with continuous quantitative data like weight, speed or age.
Regression analysis is being used to determine how:
- levels of customer satisfaction affect customer loyalty
- the number of support calls received may be influenced by the weather forecast given the previous day
- neighbourhood and size affect the listing price of houses
- online dating sites can help you find the love of your life
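The background-music example can be sketched as a simple linear regression fitted by ordinary least squares. The observations below are invented (music tempo in beats per minute against minutes spent in store), and `fit_line` is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical observations, invented for illustration.
tempo_bpm = [60, 80, 100, 120, 140]      # independent variable
minutes_in_store = [31, 27, 24, 19, 14]  # dependent variable

slope, intercept = fit_line(tempo_bpm, minutes_in_store)
predicted = slope * 90 + intercept       # estimate for a tempo of 90 bpm
print(slope, intercept, predicted)
```

The slope is the quantity of interest: it estimates how much the dependent variable changes for each one-unit change in the independent variable. In this made-up data it is negative, meaning faster music goes with shorter visits.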
6. Sentiment analysis
How well is our new return policy being received?
Sentiment analysis helps researchers determine the sentiments of speakers or writers with respect to a topic.
Sentiment analysis is being used to help:
- improve service at a hotel chain by analyzing guest comments
- customize incentives and services to address what customers are really asking for
- determine what consumers really think based on opinions from social media
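One of the simplest ways to approach sentiment analysis is with a word list, or lexicon. The sketch below scores a comment by counting positive and negative words; the word lists are tiny and invented, and the negation handling is deliberately crude, so treat it as an illustration of the idea rather than a usable tool.

```python
import re

# Hypothetical mini-lexicons, invented for illustration.
POSITIVE = {"great", "easy", "fast", "love", "helpful", "clean"}
NEGATIVE = {"slow", "rude", "dirty", "hate", "confusing", "broken"}
NEGATORS = {"not", "never", "no"}

def sentiment_score(text):
    """Positive words add 1, negative words subtract 1.

    A negator flips the polarity of the single word immediately after it.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    score = 0
    negate = False
    for token in tokens:
        if token in NEGATORS:
            negate = True
            continue
        polarity = (token in POSITIVE) - (token in NEGATIVE)
        score += -polarity if negate else polarity
        negate = False
    return score

def label(text):
    """Map a numeric score to a coarse sentiment label."""
    score = sentiment_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

comments = [
    "The new return policy is great and easy",
    "Returns were slow and the staff was rude",
    "The process was not easy",
]
for comment in comments:
    print(label(comment), "|", comment)
```

Real systems have to cope with sarcasm, context and domain-specific vocabulary, which is why many rely on trained models rather than fixed word lists.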
7. Social network analysis
How many degrees of separation are you from Kevin Bacon?
Social network analysis is a technique that was first used in the telecommunications industry, and then quickly adopted by sociologists to study interpersonal relationships. It is now being applied to analyze the relationships between people in many fields and commercial activities. Nodes represent individuals within a network, while ties represent the relationships between the individuals.
Social network analysis is being used to:
- see how people from different populations form ties with outsiders
- find the importance or influence of a particular individual within a group
- find the minimum number of direct ties required to connect two individuals
- understand the social structure of a customer base
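Two of the uses above, finding the minimum number of ties between two individuals and gauging an individual's importance, can be sketched with a breadth-first search and a simple degree count. The network below is invented, and the function names are our own.

```python
from collections import deque

# Hypothetical undirected ties between individuals, invented for illustration.
TIES = [
    ("ana", "ben"), ("ben", "cy"), ("cy", "dee"),
    ("ana", "eve"), ("eve", "dee"), ("dee", "fay"),
]

def build_graph(ties):
    """Map each individual (node) to the set of people they are tied to."""
    graph = {}
    for a, b in ties:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph

def degrees_of_separation(graph, start, goal):
    """Minimum number of ties linking start to goal, or None if unconnected."""
    if start not in graph or goal not in graph:
        return None
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, distance = queue.popleft()
        if node == goal:
            return distance
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, distance + 1))
    return None

def degree_centrality(graph):
    """Share of the other individuals each person is directly tied to."""
    n = len(graph)
    return {node: len(neighbours) / (n - 1) for node, neighbours in graph.items()}

graph = build_graph(TIES)
centrality = degree_centrality(graph)
print(degrees_of_separation(graph, "ana", "fay"))
print(max(centrality, key=centrality.get))
```

Degree centrality is only one of several ways to measure influence; others, such as betweenness centrality, look at how often a person sits on the shortest path between other people.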
Whether your business wants to discover interesting correlations, categorize people into groups, optimally schedule resources, or set billing rates, a basic understanding of the seven techniques mentioned above can help Big Data work for you.
View Part 3 of our Big Data Series, which outlines the most popular big data tools being used.