原创

用idea开发我们的spark项目

写在前面

如果你是刚入行的java(或大数据)菜鸟,如果你还不会使用idea这样的“神兵利器”,如果你还对 mvn clean package 这样的命令一知半解。那么,你有必要花点时间,瞧一瞧这篇文章,正所谓,“工欲善其事,必先利其器”,它将指导你一步一步用idea开发出我们的spark程序,用maven编译打包我们的Scala(Scala与Java混合)代码。当然,大神请自动忽略。

开发环境

  • macOs Mojave
  • idea 2019.3.1
  • maven-3.6.4
  • spark-2.4.3
  • hadoop-2.7.4
  • jdk-1.8
  • scala-2.1.8

快速开始

1. 创建一个空的maven项目

maven

2. 使用Scala插件

使用scala插件,Scala插件的安装和使用在此将不做赘述。引入Scala插件之后,我们就能愉快地编辑我们的Scala代码。

3. 编辑pom文件

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.dr.leo</groupId>
    <artifactId>leo-study-spark</artifactId>
    <version>1.0-SNAPSHOT</version>

    <description>跟着leo学习spark</description>
    <url>https://www.jlpyyf.com</url>
    <developers>
        <developer>
            <id>leo</id>
            <name>leo.jie</name>
            <email>leo.jie@datarealglobal.com</email>
            <roles>
                <role>developer</role>
            </roles>
        </developer>
    </developers>

    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <maven.compiler.source>1.8</maven.compiler.source>
        <maven.compiler.target>1.8</maven.compiler.target>
        <spark.version>2.4.3</spark.version>
        <scala.version>2.11</scala.version>
        <!--        <spark.cope>compile</spark.cope>-->
        <spark.cope>provided</spark.cope>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_${scala.version}</artifactId>
            <version>${spark.version}</version>
            <scope>${spark.cope}</scope>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-sql -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_${scala.version}</artifactId>
            <version>${spark.version}</version>
            <scope>${spark.cope}</scope>
        </dependency>

        <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-hive -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-hive_${scala.version}</artifactId>
            <version>${spark.version}</version>
            <scope>${spark.cope}</scope>
        </dependency>

        <!-- JDBC-->
        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
            <version>5.1.32</version>
        </dependency>

        <dependency>
            <groupId>com.microsoft.sqlserver</groupId>
            <artifactId>mssql-jdbc</artifactId>
            <version>6.1.0.jre8</version>
        </dependency>

    </dependencies>

    <build>
        <plugins>
            <plugin>
                <groupId>org.scala-tools</groupId>
                <artifactId>maven-scala-plugin</artifactId>
                <version>2.15.0</version>
                <executions>
                    <execution>
                        <goals>
                            <goal>compile</goal>
                            <goal>testCompile</goal>
                        </goals>
                    </execution>
                </executions>
                <configuration>
                    <scalaVersion>${scala.version}</scalaVersion>
                    <args>
                        <arg>-target:jvm-1.5</arg>
                    </args>
                </configuration>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-eclipse-plugin</artifactId>
                <version>2.10</version>
                <configuration>
                    <downloadSources>true</downloadSources>
                    <buildcommands>
                        <buildcommand>ch.epfl.lamp.sdt.core.scalabuilder</buildcommand>
                    </buildcommands>
                    <additionalProjectnatures>
                        <projectnature>ch.epfl.lamp.sdt.core.scalanature</projectnature>
                    </additionalProjectnatures>
                    <classpathContainers>
                        <classpathContainer>org.eclipse.jdt.launching.JRE_CONTAINER</classpathContainer>
                        <classpathContainer>ch.epfl.lamp.sdt.launching.SCALA_CONTAINER</classpathContainer>
                    </classpathContainers>
                </configuration>
            </plugin>
            <plugin>
                <artifactId>maven-assembly-plugin</artifactId>
                <version>3.1.1</version>
                <configuration>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                    <finalName>leo-study-spark</finalName>
                    <appendAssemblyId>false</appendAssemblyId>
                </configuration>
                <executions>
                    <execution>
                        <id>make-assembly</id>
                        <phase>package</phase>
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.6.0</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                </configuration>
            </plugin>
            <plugin>
                <groupId>org.codehaus.mojo</groupId>
                <artifactId>build-helper-maven-plugin</artifactId>
                <version>1.10</version>
                <executions>
                    <execution>
                        <id>add-source</id>
                        <phase>generate-sources</phase>
                        <goals>
                            <goal>add-source</goal>
                        </goals>
                        <configuration>
                            <!-- 我们可以通过在这里添加多个source节点,来添加任意多个源文件夹 -->
                            <sources>
                                <source>src/main/java</source>
                                <source>src/main/scala</source>
                            </sources>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>

    <reporting>
        <plugins>
            <plugin>
                <groupId>org.scala-tools</groupId>
                <artifactId>maven-scala-plugin</artifactId>
                <configuration>
                    <scalaVersion>${scala.version}</scalaVersion>
                </configuration>
            </plugin>
        </plugins>
    </reporting>

</project>

pom 文件中引入了spark的一些核心依赖以及编译打包Scala和Java代码的插件。

4. 编写WordCount程序

object WordCount {
  def main(args: Array[String]): Unit = {
    println(JavaTest.sayHello())
    val javaTest = new JavaTest
    javaTest.sayHi()
    val conf = new SparkConf().setMaster("local[4]").setAppName("WordCount")
    val sc = new SparkContext(conf)
    sc.textFile(args(0)).flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
      .saveAsTextFile(args(1))
    sc.stop()
  }
}
  • 最终的项目结构:

project

  1. 编译打包,发往集群上运行

  2. 注意一:一般情况下,我们pom中引入的spark相关依赖是不需要打进我们最终的jar中的。所以修改,spark相关依赖的scope属性值为provided。

  3. 注意二:如果jar需要以集群模式运行,则需要注释我们的WordCount程序中的 setMaster("local[4]")
  4. 运行maven的打包命令
mvn clean package
  • 打包结果

maven

  • leo-study-spark.jar就是我们需要在集群上运行的可执行jar包
  • 上传leo-study-spark.jar到集群,执行如下命令:
spark-submit --master yarn --deploy-mode cluster --driver-memory 2G --executor-memory 2g --num-executors 5 --conf spark.yarn.executor.memoryOverhead=2048 --conf spark.network.timeout=30000 --class com.dr.leo.WordCount leo-study-spark.jar hdfs://leo/test/test.txt hdfs://leo/test/leo_word_count
  • 运行结果截图

success

  • WordCount的输出结果

wordcount

  • yarn ui 查看我们刚刚提交的任务

yran

Scala代码中调用Java的方法的结果输出。

小节

以上便完成了spark的项目的搭建,以及用maven编译打包我们的Scala程序。整个流程最核心的其实是pom.xml文件,当然,这个pom.xml文件应该不是最完美的,如有不足之处,还望大家积极指正,与君共勉。

正文到此结束