- Introduction
This research concerns the performance optimization of computations running on top of a distributed platform, Apache Hadoop in particular. However, the methods and algorithms proposed in this research are expected to be applicable to other platforms as well without loss of generality. The architecture diagram in [1] shows the computation flow on Hadoop, which primarily consists of one name node and multiple slave nodes.
A MapReduce process [2] is composed of two primary functions: 1) the Map function takes a set of input key/value pairs and produces a set of intermediate output key/value pairs; 2) the Reduce function takes an intermediate key (Key2) and the set of values associated with that key, and merges them into a (typically smaller) set of output values, as illustrated below.

Map(Key1, Value1) -> List(Key2, Value2)
Reduce(Key2, List(Value2)) -> List(Value2)
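For concreteness, the canonical word-count job instantiates these two signatures in Java; this is a standard illustrative example, not one of the workloads benchmarked in this research. The mapper emits (word, 1) pairs as the intermediate (Key2, Value2), and the reducer merges all counts for the same word.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Word count: Map(Key1 = offset, Value1 = line) -> List(Key2 = word, Value2 = 1);
// Reduce(Key2 = word, List(Value2)) -> total count for that word.
public class WordCount {

  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);   // emit intermediate (Key2, Value2)
      }
    }
  }

  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();           // merge all Value2 for the same Key2
      }
      context.write(key, new IntWritable(sum));
    }
  }
}
```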
A naïve exhaustive method [7] has been proposed as an initial attempt at the problem of this research; its pros and cons are summarized as follows. The pro of this approach is that it is simple and exhaustive. The cons are that the benchmark time grows exponentially as more parameters are considered, and that the method is incapable of further optimization due to the limited range of parameter values. The cost of this approach is asymptotically a^p, where a is the number of candidate values per parameter and p is the number of parameters; thus, the cost of the method is upper bounded by exponential time. A sketch of this exhaustive enumeration is given below.
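The following is a minimal Java sketch of the naïve exhaustive search under stated assumptions: runBenchmark() stands in for an actual MapReduce benchmark run (e.g., a TeraSort job), and the candidate parameter values are supplied by the caller. None of these identifiers come from the paper itself.

```java
import java.util.*;

// Naive exhaustive search: benchmark every combination in the Cartesian
// product of candidate values, i.e., a^p benchmark runs in total.
public class NaiveExhaustive {

  static double runBenchmark(Map<String, String> config) {
    // Placeholder: would launch a MapReduce job with the given configuration
    // and return its measured CPU time.
    return Math.random();
  }

  public static Map<String, String> search(Map<String, List<String>> candidates) {
    List<String> params = new ArrayList<>(candidates.keySet());
    Map<String, String> best = null;
    double bestTime = Double.MAX_VALUE;
    int[] idx = new int[params.size()];           // odometer over all combinations
    while (true) {
      Map<String, String> config = new LinkedHashMap<>();
      for (int i = 0; i < params.size(); i++) {
        config.put(params.get(i), candidates.get(params.get(i)).get(idx[i]));
      }
      double t = runBenchmark(config);            // one of the a^p runs
      if (t < bestTime) { bestTime = t; best = config; }
      int i = 0;                                  // advance the odometer
      while (i < idx.length && ++idx[i] == candidates.get(params.get(i)).size()) {
        idx[i++] = 0;
      }
      if (i == idx.length) break;                 // all combinations visited
    }
    return best;
  }
}
```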
- Proposed Methods
In addition to the naïve exhaustive method, a 2-Phase Exhaustive Sorting-based algorithm has been developed and implemented [7]. The algorithm consists of two phases of sorting: the first-phase sorting is performed against a small data set in order to obtain an extended list (e.g., the top 50 parameter settings) of top-performing parameter settings as a rough ranking; the second-phase sorting is then performed against a large data set to produce a detailed ranking of a short list (e.g., the top 5 out of the above 50) of top-performing parameter settings. The flow of the approach is illustrated in [8], and a minimal sketch follows.
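A minimal Java sketch of this two-phase idea, under stated assumptions: benchmark() is a placeholder for a real benchmark run, the small/large data sizes are illustrative, and only the list sizes (50 and 5) follow the text.

```java
import java.util.*;

// 2-Phase Exhaustive Sorting [7]: a cheap rough ranking over a small data
// set trims the candidate pool to a top-50 list, then a precise ranking
// over a large data set picks the top 5 survivors.
public class TwoPhaseSearch {

  static double benchmark(Map<String, String> config, long dataSizeBytes) {
    return Math.random();  // placeholder: run the job, return measured CPU time
  }

  // Benchmark every configuration once at the given data size, sort by the
  // measured time, and keep the fastest `keep` configurations.
  static List<Map<String, String>> rank(List<Map<String, String>> configs,
                                        long dataSize, int keep) {
    Map<Map<String, String>, Double> time = new HashMap<>();
    for (Map<String, String> c : configs) {
      time.put(c, benchmark(c, dataSize));        // exactly one run per candidate
    }
    List<Map<String, String>> sorted = new ArrayList<>(configs);
    sorted.sort(Comparator.comparingDouble(time::get));
    return sorted.subList(0, Math.min(keep, sorted.size()));
  }

  public static List<Map<String, String>> search(List<Map<String, String>> all) {
    List<Map<String, String>> top50 = rank(all, 100L << 20, 50);  // phase 1: ~100 MB
    return rank(top50, 10L << 30, 5);                             // phase 2: ~10 GB
  }
}
```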
The random method [8] performs r rounds of random sampling; each round samples one value per parameter from the given range, which is supposed to be larger than the a values considered in the naïve exhaustive method. A sketch of the main routine for the proposed random algorithm follows: each of the p parameters samples one of its candidate values during each of the r rounds of selection.
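A minimal Java rendering of that routine, assuming runBenchmark() stands in for one benchmarked MapReduce run and that ranges maps each parameter to its candidate values; both are illustrative placeholders, not the paper's exact interfaces.

```java
import java.util.*;

// Random method [8]: r rounds; in each round every one of the p parameters
// draws one value uniformly from its (wide) candidate range, the resulting
// combination is benchmarked, and the best configuration so far is retained.
public class RandomSearch {

  static double runBenchmark(Map<String, String> config) {
    return Math.random();  // placeholder for a real MapReduce benchmark run
  }

  public static Map<String, String> search(Map<String, List<String>> ranges, int r) {
    Random rnd = new Random();
    Map<String, String> best = null;
    double bestTime = Double.MAX_VALUE;
    for (int round = 0; round < r; round++) {
      Map<String, String> config = new LinkedHashMap<>();
      for (Map.Entry<String, List<String>> e : ranges.entrySet()) {
        List<String> values = e.getValue();
        config.put(e.getKey(), values.get(rnd.nextInt(values.size())));
      }
      double t = runBenchmark(config);            // one benchmark per round
      if (t < bestTime) { bestTime = t; best = config; }
    }
    return best;                                  // r benchmarks: O(p * r) work
  }
}
```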
- Experimental Results

Round Ω(0):
• i.s.r.p decreased by 20% and then fixed
• CPU time reduced by 17.68%
• Cost: a * p (a = 18, p = 7)

Round Ω(1):
• m.j.s.m.p decreased by 40% and then fixed
• CPU time reduced by 16.73%
• Cost: a * (p - 1) (a = 18, p = 7)

A sketch of this round-by-round greedy procedure, in which each round fixes the single best-performing parameter before the next round begins, is given below.
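The following Java sketch shows one plausible reading of this greedy heuristic, consistent with the per-round costs a * p and a * (p - 1) above: in each round, every still-free parameter is swept over its a candidate values while the others hold their current settings, and the parameter whose best value yields the lowest CPU time is then fixed. All identifiers (runBenchmark(), the maps of candidates and defaults) are illustrative assumptions.

```java
import java.util.*;

// Greedy heuristic: one parameter is fixed per round, so the total cost is
// roughly a*p + a*(p-1) + ... + a*1 benchmark runs.
public class GreedyHeuristic {

  static double runBenchmark(Map<String, String> config) {
    return Math.random();  // placeholder for a real benchmark run
  }

  public static Map<String, String> search(Map<String, List<String>> candidates,
                                           Map<String, String> defaults) {
    Map<String, String> config = new LinkedHashMap<>(defaults);
    Set<String> free = new LinkedHashSet<>(candidates.keySet());
    while (!free.isEmpty()) {                     // one "round" per iteration
      String bestParam = null;
      String bestValue = null;
      double bestTime = Double.MAX_VALUE;
      for (String param : free) {                 // sweep each free parameter
        for (String value : candidates.get(param)) {
          Map<String, String> trial = new LinkedHashMap<>(config);
          trial.put(param, value);
          double t = runBenchmark(trial);
          if (t < bestTime) { bestTime = t; bestParam = param; bestValue = value; }
        }
      }
      config.put(bestParam, bestValue);           // fix the winning parameter
      free.remove(bestParam);
    }
    return config;
  }
}
```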
- Conclusion
This research has proposed various parametric performance optimization algorithms to achieve the best possible performance of MapReduce computing on a Hadoop platform. Three representative algorithms have been proposed, with their pros and cons summarized as follows:
- Naive Exhaustive: O(a^p); the best possible performance, yet infeasibly expensive
- Random: O(p * r) ≈ O(p^2); feasible, with moderate performance, and inexpensive
- Heuristically-based: O(p^3); feasible, with great performance, and reasonably inexpensive
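To make the cost gap concrete, here is a back-of-the-envelope comparison using the reported a = 18 and p = 7, under the assumption (suggested by the per-round costs in the experimental results) that the heuristic sweeps all a values of every not-yet-fixed parameter in each round:

```latex
\begin{align*}
\text{Naive exhaustive: } & a^p = 18^7 \approx 6.1 \times 10^8 \ \text{benchmark runs} \\
\text{Greedy heuristic: } & \sum_{k=0}^{p-1} a\,(p-k) = a \cdot \frac{p(p+1)}{2} = 18 \cdot 28 = 504 \ \text{runs}
\end{align*}
```

With a = O(p), the heuristic's a * p(p+1)/2 cost matches the O(p^3) bound quoted above.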
Naive exhaustive parametric optimization takes O(a^p) time; the resulting performance of the program is expected to be the best, but the benchmarking is infeasible and very expensive. Thus, it is not considered a viable solution.
Random parametric optimization takes approximately quadratic time and is considered the least expensive approach. However, the resulting benchmark performance is limited, remaining close to average.
Lastly, the proposed heuristically-based parametric optimization takes slightly more than quadratic time, or at most cubic time, and its resulting performance is also promising, with a feasible optimization time. The proposed heuristic employs a greedy approach, and the resulting performance is great rather than optimal, since a greedy strategy does not guarantee the optimal configuration.
References
[1]. "Hadoop Distributed File System Architecture." http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/images/hdfsarchitecture.png. Web. 20 Jan. 2015.
[2]. Ghemawat, Sanjay, Howard Gobioff, and Shun-Tak Leung. "The Google file system." ACM SIGOPS operating systems review. Vol. 37. No. 5. ACM, 2003.
[3]. "The Fresh Open Source Software Archive." http://fossies.org/linux/misc/hadoop-1.2.1.tar.gz/hadoop-1.2.1/docs/mapred-default.html. Web. 20 Jan. 2015.
[4]. Noll, Michael G. "Benchmarking and stress testing a Hadoop cluster with TeraSort, TestDFSIO & co." 2011. http://www.michael-noll.com/blog/2011/04/09/benchmarking-and-stress-testing-an-hadoop-cluster-with-terasort-testdfsio-nnbench-mrbench. Web.
[5]. Hadoop 1.2.1 Documentation, mapred-default configuration parameters. https://hadoop.apache.org/docs/r1.2.1/mapred-default.html. Web. 20 Jan. 2015.
[6]. T. White, Hadoop: The Definitive Guide. O'Reilly Media, Inc., 2012, pp. 166-168.
[7]. Kim, Jongyeop, et al. "Performance evaluation and tuning for MapReduce computing in Hadoop distributed file system." Industrial Informatics (INDIN), 2015 IEEE 13th International Conference on. IEEE, 2015.
[8]. Jongyeop Kim and Nohpill Park. "Identification of the Optimal Hadoop Configuration Parameter Sets for MapReduce Computing." 3rd International Conference on Applied Computing & Information Technology, July 2015.