Big Data Architecture, Goals and Challenges

Coupons Jose Christianity
Dakota State University

Abstract

Big data-inspired analysis has matured from proof-of-concept projects into an influential tool that helps decision makers make informed decisions. More and more organizations are applying increasingly complex analysis techniques to their internally and externally available data to derive meaningful insights. This paper addresses some of the architectural goals and challenges for big data architecture in a typical organization.

Overview

In this fast-paced information age, many different sources on corporate networks and the internet are collecting massive amounts of data. This data differs significantly from conventional data: much of it is semi-structured or unstructured and does not reside in conventional databases. “Big data” is essentially a huge data set that scales to multiple petabytes of capacity; it can be created, collected, collaborated on, and stored in real time or in any other way. However, the challenge with big data is that it is not easily handled using traditional database management tools.
It typically consists of unstructured data, which includes text, audio and video files, photographs, and other data (Kavas, 2012). The aim of this paper is to examine the concepts associated with big data architecture, as well as how to handle, process, and effectively utilize big data internally and externally to obtain meaningful and actionable insights.

How Big Data is Different?

Big data is the latest buzzword in the tech industry, but what exactly makes it different from traditional BI or data analysis?
According to MIT Sloan Management Review, big data is described as “data that is either too voluminous or too unstructured to be managed and analyzed through traditional means” (Davenport, Thomas, Berth, & Bean, 2012). Big data is unlike conventional business intelligence, where a simple sum of a known value yields a result, such as order sales becoming year-to-date sales. With big data, the value is discovered through an iterative, refining modeling process: make a hypothesis, create statistical models, validate, and then make a new hypothesis (Oracle, 2012).
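
The contrast between a simple sum and the hypothesis, model, validate cycle can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data set, the column names (`ad_spend`, `temperature`, `sales`), and the helper `refine` are all hypothetical, and a one-variable least-squares line stands in for whatever statistical model an analyst would actually use.

```python
import statistics

# Conventional BI: a known value is simply summed.
order_sales = [1200.0, 850.5, 430.25]
year_to_date_sales = sum(order_sales)


def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_squared_error(model, xs, ys):
    """Average squared prediction error of a fitted line."""
    slope, intercept = model
    return statistics.fmean(
        (slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)
    )


def refine(rows, target, hypotheses, train_fraction=0.75):
    """Try each hypothesis (a candidate predictor), validate it on held-out rows."""
    cut = int(len(rows) * train_fraction)
    train, holdout = rows[:cut], rows[cut:]
    errors = {}
    for feature in hypotheses:                       # 1. make a hypothesis
        model = fit_line([r[feature] for r in train],
                         [r[target] for r in train])  # 2. create a model
        errors[feature] = mean_squared_error(         # 3. validate
            model,
            [r[feature] for r in holdout],
            [r[target] for r in holdout],
        )
    best = min(errors, key=errors.get)               # 4. informs the next hypothesis
    return best, errors


# Hypothetical data: sales depend on ad spend, not on temperature.
rows = [
    {"ad_spend": x, "temperature": (x * 7) % 5, "sales": 3 * x + 10}
    for x in range(1, 13)
]
best, errors = refine(rows, "sales", ["ad_spend", "temperature"])
```

In a real setting the loop would be repeated, with each validation result suggesting which hypothesis to try next.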
Additionally, data sources are another challenging and differentiating factor within big data analytics. Conventional, structured data sources like relational databases, spreadsheets, and logs are extended by social media applications (tweets, blogs, Facebook, LinkedIn posts, etc.), web logs, sensors, RFID tags, photos/videos, information-sensing mobile devices, geographical location information, and other documents. In addition to the unstructured data problem, there are other notable complexities for big data architecture.
First, due to sheer volume, the present system cannot move raw data directly to a data warehouse. Instead, processing systems such as MapReduce can refine the information before moving it to the data warehouse environment, where traditional and familiar BI reporting, statistical, semantic, and correlation applications can be effectively implemented. (Figure: traditional data flow in business intelligence systems. Source: Oracle, 2012, An Oracle white paper in enterprise architecture.)

Architectural Goals

The preeminent goal of architecting big data solutions is to create a reliable, scalable, and capable infrastructure.
At the same time, the analytics, algorithms, tools, and user interfaces will need to facilitate interactions with users, specifically those at the executive level. Enterprise architecture should ensure that the business objectives remain clear throughout the big data technology implementation. It is all about the effective utilization of big data, rather than big architecture. Traditional IT architecture is accustomed to keeping each application within its own space, where it performs its tasks without exposing internal data to the outside world.
Big data, on the other hand, will consider any possible piece of information from any other application as input for analysis. This is aligned with big data’s overall philosophy: the more data, the better.

Big Data Architecture

Big data architecture is similar to any other architecture that originates from, or has its foundation in, a reference architecture. Understanding the complex hierarchical structure of a reference architecture provides a good background for understanding big data and how it complements existing analytics, BI, databases, and other systems.
Organizations usually start with a subset of an existing reference architecture and carefully evaluate every component. Each component may require modifications or alternative solutions based on the particular data set or enterprise environment. Moreover, a successful big data architecture will include many open-source software components; this may present challenges for a typical enterprise architecture, where specialized licensed software systems are the norm.
To further examine big data’s overall architecture, it is important to note that the data being captured is unpredictable and continuously changing. The underlying architecture should be capable enough to handle this dynamic nature. Big data architecture is ineffective when it is not integrated with existing enterprise data; in the same way, an analysis cannot be completed until big data is correlated with other structured and enterprise-wide data. One of the primary obstacles observed in enterprise Hadoop adoption is the lack of integration with the existing BI ecosystem.
Presently, the traditional BI and big data ecosystems are separate entities that use different technologies. As a result, integrated data analyses are not effective for a typical business user or executive. The data architecture of traditional systems also differs from that of big data: big data architectures take advantage of many more inputs than traditional systems (Oracle, 2012, An Oracle white paper in enterprise architecture).

Architectural Cornerstones

Source

In big data systems, data can come from heterogeneous data sources.
Typical data stores (SQL or NoSQL) can supply structured data. Any other enterprise or outside data coming through different application APIs can be semi-structured or unstructured.

Storage

The main organizational challenge in big data architecture is data storage: how and where the data can be stored. There is no one particular place for storage; a few options currently available are HDFS, relational databases, NoSQL databases, and in-memory databases.

Processing

MapReduce, the de facto standard in big data analysis for processing data, is one of many available options.
Architecture should consider other viable options that are available in the market, such as in-memory analytics.

Data Integration

Big data generates a vast amount of data by combining both structured and unstructured data from a variety of sources (through either real-time or incremental loading). Likewise, big data architecture should be capable of integrating various applications within the big data infrastructure. Various Hadoop tools (Sqoop, Flume, etc.) mitigate this problem to some extent.

Analysis

Incorporating various analytical and algorithmic applications will help process this vast amount of data effectively.
Big data architecture should be capable of incorporating any type of analysis for business intelligence requirements. However, different types of analyses require different data formats and have different requirements.

Architectural Challenges

Proliferation of Tools

The market has been bombarded with an array of new tools designed to organize big data effectively and seamlessly. They include open-source platforms such as Hadoop. But most importantly, relational databases have also been transformed: new products have increased query performance by a factor of 1,000 and are capable of managing a wide variety of big data sources.
Likewise, statistical analysis packages are also evolving to work with these new data platforms, data types, and algorithms.

Cloud-friendly Architecture

Although not yet broadly adopted in large corporations, cloud-based computing is well suited to work with big data. This will break existing IT policies, as enterprise data will move from its existing premises to third-party elastic clouds. Challenges are therefore expected, such as educating management about the consequences and realities associated with this type of data movement.

Non-proprietary Data
Traditional systems only consider the data unique to their own system; public data never becomes a source for traditional analytics. This paradigm is changing, though. Many big data applications, such as social network modeling and sentiment analysis, use external information that is not proprietary.

Massive Storage Requirements

Big data analytics depend on extensive storage capacity and processing power, requiring a flexible and scalable infrastructure that can be reconfigured for different needs. Even though Hadoop-based systems work well with commodity hardware, a huge investment is still required on the part of management.

Data Forms

Traditional systems have typically kept their intrinsic data within their own vicinity, meaning that all intrinsic data is moved in a specified format to a data warehouse for further analysis. However, this will not be the case with big data. Each application's and service's data will stay in the format that the specific application requires, as opposed to the preferred format of the data analysis application. This leaves the data in its original format and allows data scientists to share existing data without unnecessarily replicating it.
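
As a minimal sketch of leaving data in its original format, the Python example below reads two hypothetical application outputs in place, one as JSON lines and one as CSV, and combines them only at analysis time. The field names (`user`, `amount`), the sample records, and the helper functions are assumptions made for illustration; in-memory strings stand in for the applications' real files.

```python
import csv
import io
import json

# Hypothetical application outputs, each left in its own native format.
web_events_jsonl = io.StringIO(
    '{"user": "a", "amount": 5.0}\n{"user": "b", "amount": 7.5}\n'
)
pos_sales_csv = io.StringIO("user,amount\na,2.5\nc,1.0\n")


def read_jsonl(handle):
    """Yield (user, amount) pairs from a JSON-lines stream."""
    for line in handle:
        if line.strip():
            record = json.loads(line)
            yield record["user"], float(record["amount"])


def read_csv(handle):
    """Yield (user, amount) pairs from a CSV stream with a header row."""
    for row in csv.DictReader(handle):
        yield row["user"], float(row["amount"])


def total_by_user(*streams):
    """Aggregate across sources without first copying them into one format."""
    totals = {}
    for stream in streams:
        for user, amount in stream:
            totals[user] = totals.get(user, 0.0) + amount
    return totals


totals = total_by_user(read_jsonl(web_events_jsonl), read_csv(pos_sales_csv))
```

The structure is applied when the data is read, so neither source has to be converted or replicated into a warehouse format beforehand.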

Privacy

Without a doubt, privacy is a big concern with big data. Consumers, for example, often want to know what data an organization collects. Big data is making it more challenging to keep secrets and conceal information. Because of this, privacy concerns and conflicts with users are to be expected.

Alternative Approaches

Hybrid Big Data Architecture

As explained earlier, traditional BI tools and infrastructure are expected to integrate seamlessly with the new set of tools and technologies brought by the Hadoop ecosystem.
It is expected that both systems can work together. To further illustrate this concept, the comparison below summarizes what each kind of system is suited for (Arden, 2012):

Relational Database, Data Warehouse: Enterprise reporting of internal and external information for a broad cross section of stakeholders, both inside and beyond the firewall, with extensive security, load balancing, dynamic workload management, and scalability to hundreds of terabytes.

Hadoop: Capturing large amounts of data in native format (without schema) for storage and staging for analysis. Batch processing, primarily for data transformations as well as the investigation of novel, internal and external (though mostly external) data by data scientists who are skilled in programming, analytical methods, and data management, with sufficient domain expertise to communicate the findings.

Hybrid System, SQL-MapReduce: Deep data discovery and investigative analytics by data scientists and business users with SQL skills, integrating typical enterprise data with novel, multi-structured data from web logs, sensors, social networks, etc.

(Arden, N. (2012). Big data analytics architecture)

In-memory Analytics

In-memory analytics, as its name suggests, performs all analysis in memory without relying much on secondary storage, and is a relatively familiar concept. The idea of exploiting RAM speed has been around for many years. Only recently, however, has this notion become a practical reality, as the mainstream adoption of 64-bit architectures enabled a much larger addressable memory space. Also noteworthy was the rapid decline in memory prices. As a result, it is now very realistic to analyze extremely large data sets entirely in memory.
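
The effect of the move to 64-bit architectures can be made concrete with a little arithmetic. The sketch below assumes a byte-addressable machine and compares only the theoretical address-space limits; real 64-bit processors typically implement fewer address bits, and installed RAM is far smaller than either limit. The function names and the 100 GiB data set are hypothetical.

```python
GIB = 2 ** 30  # bytes in one gibibyte
EIB = 2 ** 60  # bytes in one exbibyte


def addressable_bytes(address_bits):
    """Theoretical number of bytes reachable with the given address width."""
    return 2 ** address_bits


def fits_in_address_space(dataset_bytes, address_bits):
    """True if a data set could, in principle, be addressed all at once."""
    return dataset_bytes <= addressable_bytes(address_bits)


limit_32 = addressable_bytes(32)  # 4 GiB
limit_64 = addressable_bytes(64)  # 16 EiB

dataset = 100 * GIB  # hypothetical 100 GiB data set
fits_32 = fits_in_address_space(dataset, 32)
fits_64 = fits_in_address_space(dataset, 64)
```

A 32-bit address space tops out at 4 GiB, so a 100 GiB data set cannot be held in memory at once regardless of how much RAM is affordable; with 64-bit addressing the constraint becomes the cost of memory rather than the architecture.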

The Benefits of In-memory Analytics

One of the best incentives for in-memory analytics is the dramatic performance improvement. Users are constantly querying and interacting with data in memory, which is significantly faster than accessing data from disk. Achieving real-time business intelligence presents many challenges; one of the main hurdles to overcome is slow query performance due to the limitations of traditional BI infrastructure, and in-memory analytics has the capacity to mitigate these limitations.
An additional incentive of in-memory analytics is that it is a cost-effective alternative to data warehouses. SMB companies that lack the expertise and resources to build an appropriate data warehouse can take advantage of the in-memory approach, which provides a sustainable ability to analyze very large data sets (Yellowing, 2010).

Conclusion

Hadoop Challenges

Hadoop may replace some parts of the analytic environment, such as data integration and ETL, in some cases, but Hadoop does not replace relational databases.
Hadoop is a poor choice when the work can be done with SQL and through the capabilities of a relational database. But when there is no existing schema, or no mapping of the data source into the existing schema, and there are very large volumes of unstructured or semi-structured data, Hadoop is the obvious choice. Moreover, a hybrid relational database system that offers all the advantages of a relational database but is also able to process MapReduce requests would appear to be ideal.
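
For readers unfamiliar with the shape of a MapReduce request, the following single-process Python sketch walks through the map, shuffle, and reduce steps on a few schema-less log lines. It is not the Hadoop API; the function names and sample log lines are hypothetical, and a real job would distribute the same three steps across many machines.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one raw, schema-less line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Collapse the values for one key into a single result."""
    return key, sum(values)


def map_reduce(lines):
    """Run map, shuffle, and reduce in sequence over an iterable of lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


raw_logs = ["error disk full", "warning disk slow", "error network down"]
counts = map_reduce(raw_logs)
```

No schema had to be declared before the lines were processed, which is the situation in which the text above recommends Hadoop over a relational database.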