IoT to Big Data Analytics Open Source Tech Stack
According to Gartner, there will be nearly 26 billion devices on the Internet of Things by 2020. These can be classified broadly into five categories: smart wearables, smart homes, smart cities, smart environment, and smart enterprise.
As you can imagine, IoT creates an opportunity to measure, collect, and analyze increasingly large data sets and to act in real time on the insights they provide. Most IoT data has an extremely short shelf life, because the value it provides decreases with time. This is why big data analytics and IoT suit each other and work well in conjunction.
Though there are quite a few enterprise IoT platforms, here I will describe some of the standards and technologies that are open source, almost free, and popular for different parts of the IoT stack, from embedded sensors to analytics. (The following list is not exhaustive.)
IoT Sensors & Data Collection
Most embedded microprocessors are programmed in assembly or C/C++, and the code is compiled into a binary executable for specific hardware. Microprocessors usually save information to flash memory, where it can be read and interpreted as binary or as text formatted as a log, CSV, JSON, XML, barcode, QR code, etc.
If you need to transfer characters that are not present in ASCII and UTF-8 is not an option, a text format is not going to work, but binary formats would be fine. Protocol Buffers are Google's language-neutral, platform-neutral, extensible mechanism for serializing structured data: like XML, but smaller, faster, and simpler.
CSV is a way to pack a lot of data into an easily parsed format, so that's an extension of the text format. It's very limited in structure, though.
XML is a good option when complex data structures are used and transfer rates are not a problem. JSON is a little lighter on overhead and parsing.
Barcodes are ubiquitous and supported by many existing tools and libraries, which are now extending to QR codes so that larger amounts of information can be carried.
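To make the trade-offs between these formats concrete, here is a minimal sketch that encodes the same sensor reading as CSV, JSON, XML, and a packed binary record, then compares the sizes. The reading, its field names, and the binary layout are assumptions made up for this illustration, and the binary record uses Python's standard struct module as a stand-in for a schema-based format such as Protocol Buffers.

```python
import csv
import io
import json
import struct
import xml.etree.ElementTree as ET

# Hypothetical sensor reading; the field names are illustrative only.
reading = {"device_id": 17, "timestamp": 1700000000, "temperature_c": 21.5}


def to_csv(r):
    # CSV: values only, the column order is an agreement between both sides.
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerow(
        [r["device_id"], r["timestamp"], r["temperature_c"]]
    )
    return buf.getvalue().encode("utf-8")


def to_json(r):
    # JSON: self-describing, every field name travels with the data.
    return json.dumps(r, separators=(",", ":")).encode("utf-8")


def to_xml(r):
    # XML: self-describing, with an opening and closing tag per field.
    root = ET.Element("reading")
    for key, value in r.items():
        ET.SubElement(root, key).text = str(value)
    return ET.tostring(root)


def to_binary(r):
    # Assumed layout: little-endian unsigned 16-bit id,
    # unsigned 32-bit timestamp, 32-bit float temperature.
    return struct.pack("<HIf", r["device_id"], r["timestamp"], r["temperature_c"])


def from_binary(payload):
    device_id, timestamp, temperature_c = struct.unpack("<HIf", payload)
    return {
        "device_id": device_id,
        "timestamp": timestamp,
        "temperature_c": temperature_c,
    }


encoded = {
    "csv": to_csv(reading),
    "json": to_json(reading),
    "xml": to_xml(reading),
    "binary": to_binary(reading),
}
sizes = {name: len(data) for name, data in encoded.items()}

if __name__ == "__main__":
    for name, size in sorted(sizes.items(), key=lambda item: item[1]):
        print(f"{name:>6}: {size} bytes")
```

For this one record, the packed binary form is the smallest and XML the largest, with CSV and JSON in between. The price of the compact forms is that the reader must already know the field order and types.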
Data Transfer
It is very common for large embedded systems to use a real-time operating system (RTOS) along with embedded middleware to talk to the outside world, over cables or wirelessly, via protocols and peripherals such as the following.
SCI (Serial Communication Interfaces such as RS-232) needs a physical cable, and signal quality degrades as cable length increases.
USB (Universal Serial Bus) is the de facto standard for communicating with devices and supplying them with electric power. It comes with different types of connectors and varying speeds. It has largely replaced the older SCI standards and separate power chargers.
Over time, Ethernet has come to rule LANs with higher transfer rates and longer link distances. It now supports data transfer rates of 100 gigabits per second (Gbit/s), which are expected to go up to 400 Gbit/s soon.
RFID (Radio Frequency Identification) tags are attached to a thing and store data on a microchip, which waits to be read wirelessly over a small range. Active tags have their own batteries and transmit on their own, whereas semi-passive tags use a battery to power the chip but rely on the reader's signal to communicate. Both of these types are reserved for costly items that are read over greater distances, since they operate at high frequencies and can be read from about 100 feet away. Passive tags, which are powered entirely by the reader, are less expensive but have a range of only a few feet. Readers are connected to the Internet to transfer and store information in the cloud.
NFC (Near Field Communication) devices can read NFC and passive RFID tags within a distance of about 4 cm and extract the information stored in them. NFC has become popular with smartphones and payment processors because of the security it provides through encryption.
Bluetooth and Wi-Fi have some similar applications: setting up networks, printing, or transferring files. Wi-Fi is intended as a replacement for high-speed cabling for general LAN access, and such networks are sometimes called wireless local area networks (WLANs). Bluetooth is intended for portable equipment and its applications, and is classed as a wireless personal area network (WPAN).
Wi-Fi is usually access point-centered: it uses an asymmetrical client-server connection, with all high-speed traffic routed through the access point for network access, and has a typical range of ~40 m. Bluetooth is usually symmetrical, connecting two Bluetooth devices with quick, minimal configuration. Bluetooth range depends on the power class and runs from about 1 meter to 100 meters.
Apple's operating systems, Microsoft Windows, Linux, and FreeBSD all have native Bluetooth and Wi-Fi support via standard libraries.
ZigBee is an IEEE 802.15.4-based specification for a suite of high-level communication protocols used to create personal area networks with small, low-power digital radios. The technology is intended to be simpler and less expensive than other wireless networking technologies such as Bluetooth or Wi-Fi.
LTE (Long-Term Evolution), marketed as a 4G wireless service, suits stand-alone device applications that send short bursts of data over extended periods where a wired or Wi-Fi Internet connection is either unavailable or not desired. A device can be pre-configured by the manufacturer to connect to a specific carrier, making it ready to ship and work out of the box. Most newer-generation home security devices include this option for safety and a better user experience.
FTP is an old way of sending files between computers. It is not suitable for real-time applications, because setting up an FTP control connection is quite slow: the client must send each required command and await the response, incurring a round-trip delay every time. HTTP avoids most of the shortcomings that made FTP inconvenient to use for IoT.
HTTP is the foundation of data communication for the World Wide Web. The client (device) submits an HTTP request message to the server. The server, which sends or receives resources such as files and other content, or performs other functions on behalf of the client, returns a response message with completion status information about the request. HTTP supports sessions (through cookies) and authentication, and when run over TLS as HTTPS it adds encryption, making the transfer of information safe and secure over a public network.
REST is a software architectural style of the World Wide Web. RESTful systems typically communicate over HTTP with verbs such as GET, POST, PUT, and DELETE to retrieve data from and send data to remote servers. For browsers or IoT devices, REST systems provide a simple interface to external systems as web resources identified by Uniform Resource Identifiers (URIs).
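As a rough illustration of the request/response cycle and the REST verbs described above, the sketch below runs a tiny HTTP server on the local machine and has a "device" store, fetch, and delete a reading. It uses only Python's standard library. The /readings/17 URI, the JSON payload, and the in-memory store are assumptions invented for this example, not part of any particular IoT platform.

```python
import json
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# In-memory store standing in for a real backend (hypothetical).
READINGS = {}

# Opener that ignores any system proxy settings, since we only talk to localhost.
OPENER = urllib.request.build_opener(urllib.request.ProxyHandler({}))


class ReadingHandler(BaseHTTPRequestHandler):
    def _send_json(self, status, payload):
        body = json.dumps(payload).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_PUT(self):
        # PUT stores (or replaces) the resource identified by the URI.
        length = int(self.headers.get("Content-Length", 0))
        READINGS[self.path] = json.loads(self.rfile.read(length))
        self._send_json(200, {"stored": self.path})

    def do_GET(self):
        # GET retrieves the resource, or reports 404 if it does not exist.
        if self.path in READINGS:
            self._send_json(200, READINGS[self.path])
        else:
            self._send_json(404, {"error": "not found"})

    def do_DELETE(self):
        existed = READINGS.pop(self.path, None) is not None
        self._send_json(200 if existed else 404, {"deleted": existed})

    def log_message(self, format, *args):
        pass  # Silence per-request logging for the demo.


def call(method, url, payload=None):
    """Send one request and return (status code, decoded JSON body)."""
    data = None if payload is None else json.dumps(payload).encode("utf-8")
    request = urllib.request.Request(url, data=data, method=method)
    if data is not None:
        request.add_header("Content-Type", "application/json")
    try:
        with OPENER.open(request) as response:
            return response.status, json.loads(response.read())
    except urllib.error.HTTPError as error:
        # Non-2xx responses arrive as exceptions; unwrap them the same way.
        return error.code, json.loads(error.read())


def demo():
    server = HTTPServer(("127.0.0.1", 0), ReadingHandler)  # port 0: any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    base = f"http://127.0.0.1:{server.server_port}"
    try:
        return [
            call("PUT", base + "/readings/17", {"temperature_c": 21.5}),
            call("GET", base + "/readings/17"),
            call("DELETE", base + "/readings/17"),
            call("GET", base + "/readings/17"),
        ]
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    for status, body in demo():
        print(status, body)
```

Each call names a resource by its URI and states the intent with the verb, and the status code in the response tells the device whether the request succeeded. A production setup would sit behind HTTPS and add authentication, both of which are left out here.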
Storage & Analytics
With billions of IoT devices collecting data all the time, traditional file systems and relational databases can’t handle the volume, velocity (throughput or speed), and variety of the data sets. The big data technology stack suits these so-called 3 Vs with its distributed architecture. It reduces the need for conventional backup and restore, because it saves a number of copies of the data on different servers, and if one of them goes down it makes a new copy from the others.
Big data mining and analytics help uncover hidden patterns, unknown correlations, and other useful business information. The big data tools described below can analyze high-volume, high-velocity, and high-variety information assets far better than conventional tools and relational databases, which struggle to capture, manage, and process big data in near real time and at an acceptable cost.
Hadoop encompasses the Hadoop Distributed File System (HDFS) and the MapReduce processing framework, which use the computing power of many servers to store, retrieve, aggregate, or analyze data in parallel. HDFS is the backbone of a number of higher-level frameworks and services.
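The MapReduce idea can be sketched in a few lines of plain Python. This is only an illustration of the map, shuffle, and reduce phases on one machine, with threads standing in for servers. It does not use Hadoop itself, and the "device,temperature" line format and function names are assumptions made up for the example.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_reading(line):
    """Map: turn one 'device,temperature' line into a (key, value) pair."""
    device, temperature = line.split(",")
    return device, float(temperature)


def map_chunk(chunk):
    # Each worker maps its own chunk independently of the others.
    return [map_reading(line) for line in chunk]


def shuffle(pairs):
    """Shuffle: group all values that share a key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_mean(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values) / len(values)


def run_job(chunks, workers=2):
    # Chunks play the role of file blocks stored on different servers.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped_parts = list(pool.map(map_chunk, chunks))
    pairs = [pair for part in mapped_parts for pair in part]
    groups = shuffle(pairs)
    return dict(reduce_mean(key, values) for key, values in groups.items())


chunks = [
    ["a,20.0", "b,30.0"],
    ["a,22.0", "b,34.0", "c,10.0"],
]

if __name__ == "__main__":
    print(run_job(chunks))
```

With the sample chunks, the job returns the mean temperature per device. In a real cluster the map tasks would run next to the data blocks and the shuffle would move data across the network, which is where most of the engineering effort in Hadoop goes.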
HBase is a column-oriented key-value data store that runs on top of HDFS. It is well suited for fast read and write operations on large datasets with high throughput and low latency.
Hive provides data warehousing services on top of an existing Hadoop cluster. It provides an SQL-like interface for experienced SQL programmers.
Cassandra is a distributed database best suited for real-time transactions (single-digit millisecond latency). Every write is fast because of its log-structured storage design, and is persisted with a commit log. Cassandra is an excellent choice for its ability to scale, perform, and offer continuous uptime. It has no single point of failure because it doesn’t work in master-slave mode.
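The reason log-structured writes are fast can be shown with a toy store: each write is an append to a commit log plus an update to an in-memory table, and the table is flushed to a sorted file once it grows past a limit. This is a simplified sketch of the general idea only, not Cassandra's actual implementation. The class name, file names, and flush threshold are all hypothetical.

```python
import json
import os
import tempfile


class TinyLogStructuredStore:
    """Toy log-structured store: commit log + memtable + flushed sorted files."""

    def __init__(self, directory, memtable_limit=3):
        self.directory = directory
        self.memtable_limit = memtable_limit
        self.memtable = {}
        self.sstables = []  # Paths of flushed, immutable files; newest last.
        self.commit_log_path = os.path.join(directory, "commit.log")

    def write(self, key, value):
        # 1. Append to the commit log (cheap sequential I/O) for durability.
        with open(self.commit_log_path, "a", encoding="utf-8") as log:
            log.write(json.dumps([key, value]) + "\n")
        # 2. Update the in-memory table; no existing file is rewritten.
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        # Write the memtable out as one sorted, immutable file.
        path = os.path.join(self.directory, f"sstable-{len(self.sstables)}.json")
        with open(path, "w", encoding="utf-8") as out:
            json.dump(sorted(self.memtable.items()), out)
        self.sstables.append(path)
        self.memtable = {}
        # The flushed data is on disk, so the commit log can start over.
        open(self.commit_log_path, "w", encoding="utf-8").close()

    def read(self, key):
        # Newest data wins: check the memtable, then files from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for path in reversed(self.sstables):
            with open(path, encoding="utf-8") as src:
                table = dict(json.load(src))
            if key in table:
                return table[key]
        return None


def demo():
    with tempfile.TemporaryDirectory() as directory:
        store = TinyLogStructuredStore(directory, memtable_limit=2)
        store.write("sensor-1", 20.5)
        store.write("sensor-2", 19.0)  # Reaches the limit and triggers a flush.
        store.write("sensor-1", 21.0)  # Newer value stays in the memtable.
        return store.read("sensor-1"), store.read("sensor-2"), len(store.sstables)


if __name__ == "__main__":
    print(demo())
```

Writes never seek into or modify an existing data file, which is what keeps them fast. Reads pay for that by possibly checking several files, a cost real systems reduce with compaction and indexes that this sketch omits.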
MongoDB is a NoSQL document-oriented database with high availability. It is suited for applications dealing with highly unstructured data.
S3 is an online file storage web service offered by Amazon Web Services, with scalability, high availability, and low latency at commodity cost. It organizes objects and files into buckets. Buckets and objects can be created, listed, and retrieved using a REST-style interface or a simple HTTP GET.
Redis provides an in-memory data structure store, which can be used as a database, cache, and message broker for better performance in real-time applications.
Spark gives us an advanced execution engine that can be up to ~100 times faster than MapReduce for workloads that benefit from in-memory computing. It offers over 80 high-level operators that make it easy to build parallel apps (to analyze data) in Java, Scala, and Python. It provides MLlib for machine learning on your data sets in Hadoop, Cassandra, HBase, S3, etc.
Flink is a new streaming analytics engine that provides data distribution, communication, and fault tolerance for distributed computations over data streams. Flink also bundles libraries such as CEP (a complex event-processing library), a machine learning library, and a graph processing API.
Kafka is a message broker written in Scala, originally developed by LinkedIn and now under the Apache Software Foundation. It can be used to provide a unified, high-throughput, low-latency message queue for handling real-time data feeds from IoT devices.
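The message-queue model can be pictured as an append-only log per topic, with each group of consumers tracking its own read position. The toy broker below sketches that model in memory. It is not the Kafka client API, and the class, method, topic, and group names are hypothetical.

```python
from collections import defaultdict


class TinyBroker:
    """Toy in-memory broker: one append-only log per topic."""

    def __init__(self):
        self.topics = defaultdict(list)  # topic -> ordered list of messages
        self.offsets = defaultdict(int)  # (group, topic) -> next offset to read

    def publish(self, topic, message):
        # Producers only ever append; the returned offset is the log position.
        self.topics[topic].append(message)
        return len(self.topics[topic]) - 1

    def poll(self, group, topic, max_messages=10):
        # Each consumer group advances its own offset, so groups do not
        # interfere with each other and messages are not removed on read.
        start = self.offsets[(group, topic)]
        batch = self.topics[topic][start:start + max_messages]
        self.offsets[(group, topic)] = start + len(batch)
        return batch


def demo():
    broker = TinyBroker()
    for temperature in (20.5, 21.0, 35.2):
        broker.publish("readings", {"device": "sensor-1", "temperature_c": temperature})

    first = broker.poll("analytics", "readings", max_messages=2)
    alerts = broker.poll("alerts", "readings")
    rest = broker.poll("analytics", "readings")
    return len(first), len(alerts), [m["temperature_c"] for m in rest]


if __name__ == "__main__":
    print(demo())
```

Because reading does not delete anything, an analytics job and an alerting job can consume the same device feed at their own pace. A real broker adds partitioning, replication, and persistence on top of this basic shape.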
Visualization
In-memory computing helps business customers visualize and analyze data sets on the fly. Some of the popular tools in this category are D3.js, Google Maps/Charts, SAS Visual Analytics, R, Tableau, CartoDB, AngularJS, etc. [More on these soon.]
Volume, velocity, variety, and value: thanks to Usama Fayyad and Yasaman Hadjibashi for suggesting the fourth V.