An energy-efficient big data processing algorithm for heterogeneous clusters

It is reported that the electricity cost of operating a cluster may well exceed its acquisition cost, and processing big data requires large-scale clusters running for long periods. Energy-efficient processing of big data is therefore essential for data owners and users. In this paper, we propose a novel algorithm, MinBalance, to process I/O-intensive big data tasks energy-efficiently in heterogeneous clusters. MinBalance works in two steps, node selection and load balancing. In the former step, four greedy policies are used to select proper nodes, taking the heterogeneity of the cluster into account. In the latter step, the workloads of the selected nodes are balanced to avoid the energy wasted by waiting. MinBalance is a general algorithm that is not affected by the data storage strategy. Experimental results indicate that, for large data sets, MinBalance achieves over 60% energy reduction compared with the traditional method of powering down a subset of nodes.


Introduction
With the development and application of information technology, the volume of data produced is growing explosively. How to store, manage, and apply these data has become a general concern of the business community and academia. Big data holds great value, so research based on it is abundant, and many scholars call it the fourth paradigm of scientific research [1]-[2]. Cloud computing, an emerging computing model based on economies of scale, has become the platform of choice for storing and processing big data. Open source cloud computing platforms such as Hadoop, HBase, and HadoopDB have been widely studied and applied. More and more businesses are building their own big data platforms to handle growing business data, and even to offer various services based on big data [3]. Handling big data requires a large amount of hardware, including servers, PCs, and even mobile devices. These devices consume a great deal of energy, mainly electricity, and since electricity worldwide is generated mainly by thermal power, big data also poses great challenges for energy and the environment [4]. A 2005 study showed that the cost of the electricity a server consumes over its service lifetime already exceeds its purchase cost. Research also shows that in 2008 the world's 4400 servers consumed 0.8% of all electricity, and that at this rate the proportion will reach 3.2% by 2020. A report issued by the EPA (US Environmental Protection Agency) states that in 2006 the total electricity consumption of American IT organizations was 61 billion kWh, with an electricity bill of $4.5 billion [5]-[7]. Attention to the performance of big data storage and processing must therefore be accompanied by sufficient attention to energy consumption [8].
This paper mainly discusses I/O-intensive big data processing tasks. Compute-intensive tasks are strongly affected by the real-time running state of the processor, and the processor control mechanisms provided by different hardware and operating systems differ, so this paper does not consider compute-intensive big data tasks. Because I/O-intensive tasks depend little on the processor, the time and power a given server consumes to process each data block can be regarded as basically the same. Suppose a cluster of n heterogeneous nodes processes a MapReduce task, and let C denote the set of nodes involved in processing the task. The total energy consumed during task processing is then given by formula (1), where Ti and pi denote the processing time and power consumption of node i. By formula (1), the total energy is mainly affected by two factors: which nodes perform the task, and the maximum processing time among those nodes. Formula (1) therefore suggests two approaches to energy-efficient data processing: 1) select a suitable subset of nodes to perform the task, reducing total power consumption; 2) balance the load of the nodes, reducing the maximum execution time.
According to the actual condition of each node, the proposed method determines which tasks each node performs, that is, it balances the load of the nodes, reduces task execution time, and further reduces the total energy consumption of the system. The method has three distinct advantages: 1) it fully considers the heterogeneity of the nodes; 2) it is independent of the replica storage strategy; 3) it jointly considers the total number of nodes and load balancing as factors of energy consumption.
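As a rough illustration of the two factors named above, the following Python sketch computes the energy of a task under an assumed model: every selected node stays powered until the slowest selected node finishes, so the energy is the sum of the node powers multiplied by the maximum processing time. This form is an assumption chosen to match the description of formula (1), not a quotation of it; the function name `total_energy` and all numbers are hypothetical.

```python
from typing import Dict, Iterable


def total_energy(selected: Iterable[str],
                 proc_time: Dict[str, float],
                 power: Dict[str, float]) -> float:
    """Energy under the ASSUMED model: all selected nodes stay powered
    until the slowest selected node has finished its share of the task."""
    nodes = list(selected)
    if not nodes:
        return 0.0
    makespan = max(proc_time[n] for n in nodes)       # maximum processing time
    return makespan * sum(power[n] for n in nodes)    # time factor * power factor


# Hypothetical values: processing time in seconds, power in watts.
proc_time = {"a": 10.0, "b": 4.0, "c": 4.0}
power = {"a": 100.0, "b": 120.0, "c": 80.0}

# Unbalanced load: node "a" keeps the other two nodes waiting.
print(total_energy(["a", "b", "c"], proc_time, power))

# Same total work (18 s) spread evenly: shorter makespan, less energy.
balanced = {"a": 6.0, "b": 6.0, "c": 6.0}
print(total_energy(["a", "b", "c"], balanced, power))
```

Under this assumed model, dropping a node lowers the power factor and balancing the load lowers the time factor, which is why both approaches can reduce the total.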

Problem description
The energy-efficient processing problem for I/O-intensive big data tasks in heterogeneous clusters can be formalized as follows. Given a cluster composed of h heterogeneous nodes N = {n1, n2, n3, …, nh}, node ni (1≤i≤h) requires time Ti and power pi to process one data block. Given an I/O-intensive big data processing task J, the data set it needs to process contains m standard data blocks D = {d1, d2, ..., dm}, and each block has r replicas stored on the h heterogeneous nodes. The actual storage location of each data block can be obtained from the metadata of the cluster, and the actual storage layout of the data set is L = {(n1.1, ..., n1.r), (n2.1, ..., n2.r), …, (nm.1, ..., nm.r)}, where ni.j (1≤i≤m, 1≤j≤r) denotes the node on which the j-th replica of data block di is stored. Without loss of generality, assume ni.1 < ni.2 < ... < ni.r (1≤i≤m). The set of data blocks stored on node ni can then be represented as Di = {dj | ni ∈ (nj.1, …, nj.r), 1≤j≤m}. The solution of the problem is to select a subset of the nodes in N to handle task J such that the total energy these nodes consume during processing is minimal. Clearly, every data block in D must have at least one replica stored on a selected node for the task to execute successfully.
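The formalization above can be sketched in Python as follows. The sketch assumes that nodes are identified by the integers 1..h and that data blocks are indexed from 0 in code; the function names `blocks_per_node` and `covers_all` and the sample layout are hypothetical, introduced only for illustration.

```python
from typing import Dict, List, Set, Tuple


def blocks_per_node(layout: List[Tuple[int, ...]]) -> Dict[int, Set[int]]:
    """Invert the storage layout L: layout[j] lists the nodes that hold
    the replicas of block j; the result maps each node to its set Di."""
    stored: Dict[int, Set[int]] = {}
    for block, replicas in enumerate(layout):
        for node in replicas:
            stored.setdefault(node, set()).add(block)
    return stored


def covers_all(selected: Set[int], layout: List[Tuple[int, ...]]) -> bool:
    """True if every block has at least one replica on a selected node."""
    return all(any(node in selected for node in replicas)
               for replicas in layout)


# Hypothetical layout: m = 4 blocks, r = 2 replicas, h = 3 nodes.
layout = [(1, 2), (1, 3), (2, 3), (1, 2)]
stored = blocks_per_node(layout)
print(stored)                       # blocks held by each node
print(covers_all({1, 3}, layout))   # nodes 1 and 3 together hold every block
print(covers_all({1}, layout))      # node 1 alone misses block 2
```

Any candidate node set must pass the `covers_all` check before its energy is worth comparing, since a set that misses a block cannot complete the task.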

Processing algorithm
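Formula (1) and the preceding analysis show that energy-efficient data processing on heterogeneous nodes has two optimization objectives: 1) select appropriate coverage nodes, i.e. reduce the power part of formula (1); 2) reduce the actual processing time of the critical nodes, i.e. reduce the time part of formula (1).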
The analysis in the problem description section shows that the exact solution to the problem is difficult to obtain, and a polynomial-time solution is hard to find.
This paper therefore adopts an approximate solution method that decomposes the optimization objective into two relatively independent sub-problems: first find appropriate coverage nodes, then distribute the work among the selected coverage nodes, that is, balance the load of the nodes to reduce the processing time of the critical nodes and finally achieve the energy efficiency target. The traditional way to achieve energy efficiency is to power down some of the nodes, which in fact corresponds to the first sub-problem, finding suitable coverage nodes. The traditional approach, however, does not take the heterogeneity of the nodes into account, and can therefore be treated as a hypergraph cover problem or a classic set cover problem.
The problem in the node selection phase is to select, from a given set of heterogeneous nodes, appropriate nodes as coverage nodes to perform the I/O-intensive big data task. In this problem, the processing time of each node is determined by the number of new data blocks it contributes, and the number of new data blocks is affected by the order in which nodes are selected, so the traditional solution to the weighted set cover problem (WSCP) does not apply.
In this section, we present a node selection method based on the greedy approach, Greedy, as shown in Algorithm 1.
If the heterogeneity of the nodes is not considered at all, it suffices to select, at each step, the node that currently contains the most new data blocks as a coverage node, so that the final selection approximates the minimum number of coverage nodes and reduces energy consumption to a certain extent. For heterogeneous nodes, however, the selected nodes may be inefficient, so the resulting energy consumption can be very high. The weight of each node in this scheme can be expressed as weight(ni) = |NC(ni)|, where NC(ni) is the set of new data blocks that node ni would cover.
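A minimal Python sketch of greedy coverage-node selection is given below, assuming the weight of a node is the number of new (not yet covered) data blocks it holds, as in the heterogeneity-agnostic scheme just described. The function names, the pluggable `weight` callback, the alphabetical tie-breaking rule, and the sample data are all assumptions made for illustration; the variables `covered` and `selected` play the roles of the sets DS and NS.

```python
from typing import Callable, Dict, List, Set


def greedy_select(stored: Dict[str, Set[int]],
                  all_blocks: Set[int],
                  weight: Callable[[str, Set[int]], float]) -> List[str]:
    """Repeatedly pick the node with the largest weight until every
    block in all_blocks is covered by some selected node."""
    covered: Set[int] = set()    # DS: blocks covered so far
    selected: List[str] = []     # NS: coverage nodes, in selection order
    while covered != all_blocks:
        # NC(n): new blocks each unselected node would contribute.
        new_blocks = {n: (stored[n] & all_blocks) - covered
                      for n in stored if n not in selected}
        new_blocks = {n: nc for n, nc in new_blocks.items() if nc}
        if not new_blocks:
            raise ValueError("some data block has no replica on any node")
        # sorted() makes ties deterministic: the smallest name wins.
        best = max(sorted(new_blocks), key=lambda n: weight(n, new_blocks[n]))
        covered |= new_blocks[best]
        selected.append(best)
    return selected


def most_new_blocks(node: str, nc: Set[int]) -> float:
    """Heterogeneity-agnostic weight: the number of new blocks covered."""
    return len(nc)


# Hypothetical cluster: three nodes holding four blocks with two replicas each.
stored = {"n1": {0, 1, 3}, "n2": {0, 2, 3}, "n3": {1, 2}}
print(greedy_select(stored, {0, 1, 2, 3}, most_new_blocks))
```

Because the weight is passed in as a function, a heterogeneity-aware policy could be tried by supplying a different callback without changing the selection loop.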

MinBalance algorithm
The specific load balancing method, MinBalance, is shown in Algorithm 2. It balances the load of the nodes selected by Algorithm 1 and determines which data each node actually processes, so that the task consumes the least energy. The main idea of MinBalance is to find all the key nodes among the current coverage nodes, that is, the nodes with the longest processing time, and to balance the load of these key nodes against the other coverage nodes. It then proceeds to the next iteration, finding the key nodes again and balancing their load, until the current key nodes can no longer be balanced against the other nodes.
(1) Find the coverage node set NS; (2) Find all key nodes KN in NS; ...
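The following Python sketch is not Algorithm 2 itself; it is a simplified illustration of the key-node idea, under stated assumptions. It assumes that a node's processing time is its per-block time multiplied by the number of blocks assigned to it, and that a block may move from a key node to another coverage node only if that node stores a replica and its new processing time stays strictly below the current maximum. The function name `balance` and all data are hypothetical.

```python
from typing import Dict, Set


def balance(assign: Dict[str, Set[int]],
            stored: Dict[str, Set[int]],
            block_time: Dict[str, float]) -> Dict[str, Set[int]]:
    """Move blocks off the key (slowest) nodes while a move exists that
    keeps the receiver strictly below the current maximum time."""
    assign = {n: set(b) for n, b in assign.items()}   # work on a copy
    if not assign:
        return assign

    def node_time(n: str) -> float:
        return block_time[n] * len(assign[n])

    moved = True
    while moved:
        moved = False
        longest = max(node_time(n) for n in assign)
        key_nodes = [n for n in sorted(assign) if node_time(n) == longest]
        for key in key_nodes:
            for block in sorted(assign[key]):
                # Receivers hold a replica and stay below the current maximum.
                receivers = [n for n in sorted(assign)
                             if n != key and block in stored[n]
                             and block_time[n] * (len(assign[n]) + 1) < longest]
                if receivers:
                    target = min(receivers, key=lambda n:
                                 block_time[n] * (len(assign[n]) + 1))
                    assign[key].remove(block)
                    assign[target].add(block)
                    moved = True
                    break
            if moved:
                break
    return assign


# Hypothetical input: n1 is slow (2 s per block), n2 is fast (1 s per block).
stored = {"n1": {0, 1, 3}, "n2": {0, 2, 3}}
block_time = {"n1": 2.0, "n2": 1.0}
initial = {"n1": {0, 1, 3}, "n2": {2}}     # maximum time 6 s on n1
print(balance(initial, stored, block_time))
```

Each move lowers a key node's time and leaves the receiver strictly below the old maximum, so the loop cannot cycle and stops once no key node can shed a block.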

Energy consumption of the MinBalance algorithm
This section focuses on the energy consumed by the MinBalance algorithm in handling big data tasks. To make them easy to identify in the experimental result figures, the four implementations of the MinBalance algorithm are labeled BMNF, BHPF, BLPF, and BLEF. Figure 1 shows the energy consumed by the algorithms in processing small tasks when the data is randomly distributed. Under the same conditions, MinBalance reduces energy consumption by about 10% when |D| = 100, and by about 60% when |D| = 1000.

Conclusion
This paper addresses I/O-intensive big data processing in heterogeneous clusters and proposes the energy-efficient algorithm MinBalance, which is not affected by the storage strategy. MinBalance decomposes the complex problem into two steps, node selection and load balancing, which are solved separately. For node selection, four different greedy strategies are adopted to fully consider the heterogeneity of the nodes and to select a small number of suitable nodes for task processing.
Load balancing focuses on the key nodes with the longest processing time and, as far as possible without increasing the maximum processing time, migrates their load to nodes with shorter processing times. When the data set is large, the algorithm reduces energy consumption by up to 60% compared with the traditional method of powering down nodes. The experimental results also show that the MNF method, which does not consider node heterogeneity, is the least energy efficient.
However, the proposed algorithm is a greedy heuristic, and neither its exact approximation ratio nor a theoretical lower bound has been proven. The experiments in this paper are simulated, whereas in a real cluster the time nodes consume when processing big data is less predictable, and because of the properties of the storage medium itself, the power the same node consumes to process data of the same size may differ. The next step will therefore be to conduct performance testing and analysis on real platforms such as Hadoop.


(3) Calculate the weight weight(ni) of each node;
(4) Select the node with the maximum weight as the coverage node;
(5) Update: DS = DS ∪ NC(ni), NS = NS ∪ {ni};
(6) If DS ≠ D, go to Step (2);
(7) Else return NS.