Example Code for Using Hudi in IDEA

2024-04-02 19:04:59, 139 views, 泡泡鱼


Environment setup

Create a Maven project, then create a remote connection to the server:
Tools -> Deployment -> Browse Remote Host


Configure the connection as follows:

[Screenshot]

Enter the server's account and password here.


Click Test Connection. If it reports Successfully, the connection is configured correctly.


Copy three files into the resources folder: Hadoop's core-site.xml and hdfs-site.xml, plus log4j.properties.
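
A quick way to confirm that the copied files actually ended up on the classpath is to look them up through the class loader. The following is a minimal sketch, not part of the original project; the object and method names are hypothetical.

```scala
object ResourceCheckSketch {
  // Returns the resource names that cannot be found on the classpath.
  def missingResources(names: Seq[String]): Seq[String] = {
    val loader = Thread.currentThread().getContextClassLoader
    names.filter(name => loader.getResource(name) == null)
  }

  def main(args: Array[String]): Unit = {
    val missing = missingResources(
      Seq("core-site.xml", "hdfs-site.xml", "log4j.properties")
    )
    if (missing.isEmpty) println("All configuration files are on the classpath.")
    else println(s"Missing from the classpath: ${missing.mkString(", ")}")
  }
}
```

If a file is reported missing, check that it sits directly under src/main/resources and that the project has been rebuilt.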


Set log4j.properties to print only warnings and errors:

log4j.rootCategory=WARN, console
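
The line above only takes effect if an appender named console is defined in the same file. The article does not show the rest of the file, so the following is a sketch of what a complete log4j 1.x configuration could look like, modeled on the console appender in Spark's log4j.properties template. The target and pattern are assumptions and can be adjusted freely.

```properties
# Root logger: WARN and above, sent to the appender named "console"
log4j.rootCategory=WARN, console

# Console appender definition (assumed; not shown in the article)
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
```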

Add the following to pom.xml:

<repositories>
        <repository>
            <id>aliyun</id>
            <url>http://maven.aliyun.com/nexus/content/groups/public/</url>
        </repository>
        <repository>
            <id>cloudera</id>
            <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
        </repository>
        <repository>
            <id>jboss</id>
            <url>http://repository.jboss.com/nexus/content/groups/public</url>
        </repository>
    </repositories>

    <properties>
        <scala.version>2.12.10</scala.version>
        <scala.binary.version>2.12</scala.binary.version>
        <spark.version>3.0.0</spark.version>
        <hadoop.version>2.7.3</hadoop.version>
        <hudi.version>0.9.0</hudi.version>
    </properties>

    <dependencies>
        <!-- Scala language dependency -->
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>${scala.version}</version>
        </dependency>
        <!-- Spark Core dependency -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_${scala.binary.version}</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <!-- Spark SQL dependency -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_${scala.binary.version}</artifactId>
            <version>${spark.version}</version>
        </dependency>

        <!-- Hadoop Client dependency -->
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>${hadoop.version}</version>
        </dependency>

        <!-- hudi-spark3 -->
        <dependency>
            <groupId>org.apache.hudi</groupId>
            <artifactId>hudi-spark3-bundle_2.12</artifactId>
            <version>${hudi.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-avro_2.12</artifactId>
            <version>${spark.version}</version>
        </dependency>

    </dependencies>

    <build>
        <outputDirectory>target/classes</outputDirectory>
        <testOutputDirectory>target/test-classes</testOutputDirectory>
        <resources>
            <resource>
                <directory>${project.basedir}/src/main/resources</directory>
            </resource>
        </resources>
        <!-- Maven compiler plugins -->
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.0</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                    <encoding>UTF-8</encoding>
                </configuration>
            </plugin>
            <plugin>
                <groupId>net.alchim31.maven</groupId>
                <artifactId>scala-maven-plugin</artifactId>
                <version>3.2.0</version>
                <executions>
                    <execution>
                        <goals>
                            <goal>compile</goal>
                            <goal>testCompile</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>

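Comment out the code below, which is generated when the project is created; otherwise the dependencies keep reporting errors:

Code structure:

[Screenshot]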

Core code

import org.apache.hudi.QuickstartUtils.DataGenerator
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

object HudiSparkDemo {
  
  def insertData(spark: SparkSession, table: String, path: String): Unit = {
    import spark.implicits._
    // Step 1: generate simulated trip data
    import org.apache.hudi.QuickstartUtils._
    val dataGen: DataGenerator = new DataGenerator()
    val inserts = convertToStringList(dataGen.generateInserts(100))
    import scala.collection.JavaConverters._
    val insertDF: DataFrame = spark.read.json(
      spark.sparkContext.parallelize(inserts.asScala, 2).toDS()
    )
//    		insertDF.printSchema()
//    		insertDF.show(10, truncate = false)
    // Step 2: insert the data into the Hudi table
    import org.apache.hudi.DataSourceWriteOptions._
    import org.apache.hudi.config.HoodieWriteConfig._
    insertDF.write
      .mode(SaveMode.Append)
      .format("hudi")
      .option("hoodie.insert.shuffle.parallelism", "2")
      .option("hoodie.upsert.shuffle.parallelism", "2")
      // Hudi table properties
      .option(PRECOMBINE_FIELD.key(), "ts")
      .option(RECORDKEY_FIELD.key(), "uuid")
      .option(PARTITIONPATH_FIELD.key(), "partitionpath")
      .option(TBL_NAME.key(), table)
      .save(path)
  }
  
  def queryData(spark: SparkSession, path: String): Unit = {
    import spark.implicits._
    val tripsDF: DataFrame = spark.read.format("hudi").load(path)
//    tripsDF.printSchema()
//    tripsDF.show(10, truncate = false)
    // Query trips with a fare between 20 and 50 (inclusive)
    tripsDF
      .filter($"fare" >= 20 && $"fare" <=50)
      .select($"driver", $"rider", $"fare", $"begin_lat", $"begin_lon", $"partitionpath", $"_hoodie_commit_time")
      .orderBy($"fare".desc, $"_hoodie_commit_time".desc)
      .show(20, truncate = false)
  }
  def queryDataByTime(spark: SparkSession, path: String):Unit = {
    import org.apache.spark.sql.functions._
    // Option 1: pass the instant as a compact timestamp string (yyyyMMddHHmmss)
    val df1 = spark.read
      .format("hudi")
      .option("as.of.instant", "20220610160908")
      .load(path)
      .sort(col("_hoodie_commit_time").desc)
    df1.printSchema()
    df1.show(numRows = 5, truncate = false)
    // Option 2: pass the instant as a date-time string (yyyy-MM-dd HH:mm:ss)
    val df2 = spark.read
      .format("hudi")
      .option("as.of.instant", "2022-06-10 16:09:08")
      .load(path)
      .sort(col("_hoodie_commit_time").desc)
    df2.printSchema()
    df2.show(numRows = 5, truncate = false)
  }
  
  
  def insertData(spark: SparkSession, table: String, path: String, dataGen: DataGenerator): Unit = {
    import spark.implicits._
    // Step 1: generate simulated trip data
    import org.apache.hudi.QuickstartUtils._
    val inserts = convertToStringList(dataGen.generateInserts(100))
    import scala.collection.JavaConverters._
    val insertDF: DataFrame = spark.read.json(
      spark.sparkContext.parallelize(inserts.asScala, 2).toDS()
    )
    //    		insertDF.printSchema()
    //    		insertDF.show(10, truncate = false)
    // Step 2: insert the data into the Hudi table
    import org.apache.hudi.DataSourceWriteOptions._
    import org.apache.hudi.config.HoodieWriteConfig._
    insertDF.write
      // Use Overwrite mode this time
      .mode(SaveMode.Overwrite)
      .format("hudi")
      .option("hoodie.insert.shuffle.parallelism", "2")
      .option("hoodie.upsert.shuffle.parallelism", "2")
      // Hudi table properties
      .option(PRECOMBINE_FIELD.key(), "ts")
      .option(RECORDKEY_FIELD.key(), "uuid")
      .option(PARTITIONPATH_FIELD.key(), "partitionpath")
      .option(TBL_NAME.key(), table)
      .save(path)
  }
  
  def updateData(spark: SparkSession, table: String, path: String, dataGen: DataGenerator):Unit = {
    import spark.implicits._
    // Step 1: generate simulated trip data
    import org.apache.hudi.QuickstartUtils._
    // Generate updates for previously inserted records
    val updates = convertToStringList(dataGen.generateUpdates(100))
    import scala.collection.JavaConverters._
    val updateDF: DataFrame = spark.read.json(
      spark.sparkContext.parallelize(updates.asScala, 2).toDS()
    )
    // Step 2: write the updates to the Hudi table
    import org.apache.hudi.DataSourceWriteOptions._
    import org.apache.hudi.config.HoodieWriteConfig._
    updateDF.write
      // Append mode
      .mode(SaveMode.Append)
      .format("hudi")
      .option("hoodie.insert.shuffle.parallelism", "2")
      .option("hoodie.upsert.shuffle.parallelism", "2")
      // Hudi table properties
      .option(PRECOMBINE_FIELD.key(), "ts")
      .option(RECORDKEY_FIELD.key(), "uuid")
      .option(PARTITIONPATH_FIELD.key(), "partitionpath")
      .option(TBL_NAME.key(), table)
      .save(path)
  }
  
  def incrementalQueryData(spark: SparkSession, path: String): Unit = {
    import spark.implicits._
    // Step 1: load the Hudi table and collect its commit times to use as the incremental-query threshold
    import org.apache.hudi.DataSourceReadOptions._
    spark.read
      .format("hudi")
      .load(path)
      .createOrReplaceTempView("view_temp_hudi_trips")
    val commits: Array[String] = spark
      .sql(
        """
				  |select
				  |  distinct(_hoodie_commit_time) as commitTime
				  |from
				  |  view_temp_hudi_trips
				  |order by
				  |  commitTime DESC
				  |""".stripMargin
      )
      .map(row => row.getString(0))
      .take(50)
    val beginTime = commits(commits.length - 1) // commit time we are interested in
    println(s"beginTime = ${beginTime}")
    // Step 2: set the commit-time threshold and run the incremental query
    val tripsIncrementalDF = spark.read
      .format("hudi")
      // Set the query type to incremental
      .option(QUERY_TYPE.key(), QUERY_TYPE_INCREMENTAL_OPT_VAL)
      // Set the begin instant for the incremental read
      .option(BEGIN_INSTANTTIME.key(), beginTime)
      .load(path)
    // Step 3: register the incremental result as a temp view and query trips with a fare above 20
    tripsIncrementalDF.createOrReplaceTempView("hudi_trips_incremental")
    spark
      .sql(
        """
				  |select
				  |  `_hoodie_commit_time`, fare, begin_lon, begin_lat, ts
				  |from
				  |  hudi_trips_incremental
				  |where
				  |  fare > 20.0
				  |""".stripMargin
      )
      .show(10, truncate = false)
  }
  
  def deleteData(spark: SparkSession, table: String, path: String): Unit = {
    import spark.implicits._
    // Step 1: load the Hudi table and count the records
    val tripsDF: DataFrame = spark.read.format("hudi").load(path)
    println(s"Raw Count = ${tripsDF.count()}")
    // Step 2: build the records to delete by taking a couple of rows from the Hudi table
    val dataframe = tripsDF.limit(2).select($"uuid", $"partitionpath")
    import org.apache.hudi.QuickstartUtils._
    val dataGenerator = new DataGenerator()
    val deletes = dataGenerator.generateDeletes(dataframe.collectAsList())
    import scala.collection.JavaConverters._
    val deleteDF = spark.read.json(spark.sparkContext.parallelize(deletes.asScala, 2))
    // Step 3: write to the Hudi table with the operation type set to delete
    import org.apache.hudi.DataSourceWriteOptions._
    import org.apache.hudi.config.HoodieWriteConfig._
    deleteDF.write
      .mode(SaveMode.Append)
      .format("hudi")
      .option("hoodie.insert.shuffle.parallelism", "2")
      .option("hoodie.upsert.shuffle.parallelism", "2")
      // Set the write operation to delete (the default is upsert)
      .option(OPERATION.key(), "delete")
      .option(PRECOMBINE_FIELD.key(), "ts")
      .option(RECORDKEY_FIELD.key(), "uuid")
      .option(PARTITIONPATH_FIELD.key(), "partitionpath")
      .option(TBL_NAME.key(), table)
      .save(path)
    // Step 4: reload the Hudi table and count again to check that 2 records are gone
    val hudiDF: DataFrame = spark.read.format("hudi").load(path)
    println(s"Delete After Count = ${hudiDF.count()}")
  }
  def main(args: Array[String]): Unit = {
    System.setProperty("HADOOP_USER_NAME","hty")
    // Create the SparkSession instance and set its properties
    val spark: SparkSession = {
      SparkSession.builder()
        .appName(this.getClass.getSimpleName.stripSuffix("$"))
        .master("local[2]")
        // Use Kryo serialization
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .getOrCreate()
    }
    // Table name and storage path
    val tableName: String = "tbl_trips_cow"
    val tablePath: String = "/hudi_warehouse/tbl_trips_cow"
    // Data generator that simulates business data
    import org.apache.hudi.QuickstartUtils._
    // Task 1: generate data and insert it into the Hudi table (COW mode)
    //insertData(spark, tableName, tablePath)
    // Task 2: snapshot query, using the DSL
    //queryData(spark, tablePath)
    //queryDataByTime(spark, tablePath)
    // Task 3: update data. Step 1: generate data; step 2: generate updates to the
    // field values of the step-1 records; step 3: write the updates to the Hudi table
    val dataGen: DataGenerator = new DataGenerator()
    //insertData(spark, tableName, tablePath, dataGen)
    //updateData(spark, tableName, tablePath, dataGen)
    // Task 4: incremental query, using SQL
    //incrementalQueryData(spark, tablePath)
    // Task 5: delete data
    deleteData(spark, tableName, tablePath)
    // Done: release resources
    spark.stop()
  }
}
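
The insert, update, and delete methods above repeat the same block of write options, which makes copy-paste slips easy. One possible refactoring, sketched below, collects them in a single map using the plain Hudi configuration keys. The object and the helper hudiWriteOptions are hypothetical and not part of Hudi; the field names ts, uuid, and partitionpath are the ones used throughout this article.

```scala
object HudiWriteOptionsSketch {
  // Hypothetical helper (not part of Hudi): the write options shared by the
  // insert, update and delete methods, spelled out as plain config keys.
  def hudiWriteOptions(table: String, parallelism: Int = 2): Map[String, String] = Map(
    "hoodie.insert.shuffle.parallelism" -> parallelism.toString,
    "hoodie.upsert.shuffle.parallelism" -> parallelism.toString,
    "hoodie.datasource.write.precombine.field" -> "ts",
    "hoodie.datasource.write.recordkey.field" -> "uuid",
    "hoodie.datasource.write.partitionpath.field" -> "partitionpath",
    "hoodie.table.name" -> table
  )

  def main(args: Array[String]): Unit = {
    // Print the options in a stable order for inspection
    hudiWriteOptions("tbl_trips_cow").toSeq.sortBy(_._1).foreach {
      case (key, value) => println(s"$key=$value")
    }
  }
}
```

With such a helper, each writer could pass the map through DataFrameWriter's options(...) method in place of the individual option(...) calls, and the delete case would only add the operation setting on top.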

Testing

After running insertData(spark, tableName, tablePath), query the table with a snapshot query:

queryData(spark, tablePath)
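
The time-travel variant, queryDataByTime, passes the same instant to as.of.instant in two formats: the compact form 20220610160908 and the readable form 2022-06-10 16:09:08. The sketch below shows how the two relate, using only java.time. The object and method names are hypothetical.

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter

object InstantFormatSketch {
  private val readable = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")
  private val compact  = DateTimeFormatter.ofPattern("yyyyMMddHHmmss")

  // Converts "yyyy-MM-dd HH:mm:ss" into the compact "yyyyMMddHHmmss" form.
  def toCompactInstant(dateTime: String): String =
    LocalDateTime.parse(dateTime, readable).format(compact)

  def main(args: Array[String]): Unit = {
    println(toCompactInstant("2022-06-10 16:09:08")) // prints 20220610160908
  }
}
```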


Incremental query:

incrementalQueryData(spark, tablePath)
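
incrementalQueryData picks its begin time by sorting the distinct commit times newest first, keeping at most 50, and taking the last one, that is, the oldest of the retained commits. The plain-Scala sketch below mirrors that selection on an in-memory list so the behavior is easy to check. The names are hypothetical, and returning an Option is a deliberate change from the original, which indexes the array directly and would fail on a table with no commits.

```scala
object BeginTimeSketch {
  // Newest first, keep at most `limit` commits, then take the oldest of those.
  // Commit times are fixed-width digit strings, so string order is time order.
  def pickBeginTime(commitTimes: Seq[String], limit: Int = 50): Option[String] =
    commitTimes.distinct.sorted.reverse.take(limit).lastOption

  def main(args: Array[String]): Unit = {
    val commits = Seq("20220610160908", "20220610161530", "20220610160908", "20220610162045")
    println(pickBeginTime(commits))            // Some(20220610160908)
    println(pickBeginTime(commits, limit = 2)) // Some(20220610161530)
    println(pickBeginTime(Seq.empty))          // None
  }
}
```

Hudi's incremental query returns records committed after the begin instant, so rows written by the chosen commit itself are not expected in the result.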


References

https://www.bilibili.com/video/BV1sb4y1n7hK?p=21&vd_source=e21134e00867aeadc3c6b37bb38b9eee
