This article covers several common ways to rename columns in a Spark DataFrame.
Generate the test data
import org.apache.spark.sql.Row

// Each row: a nested name struct (first, middle, last), then dob, gender, salary
val data = Seq(Row(Row("James ","","Smith"),"36636","M",3000),
  Row(Row("Michael ","Rose",""),"40288","M",4000),
  Row(Row("Robert ","","Williams"),"42114","M",4000),
  Row(Row("Maria ","Anne","Jones"),"39192","F",4000),
  Row(Row("Jen","Mary","Brown"),"","F",-1)
)
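Define the schema
import org.apache.spark.sql.types.{IntegerType, StringType, StructType}

// "name" is a nested struct; the remaining columns are flat
val schema = new StructType()
  .add("name",new StructType()
    .add("firstname",StringType)
    .add("middlename",StringType)
    .add("lastname",StringType))
  .add("dob",StringType)
  .add("gender",StringType)
  .add("salary",IntegerType)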
Create the DataFrame
val df = spark.createDataFrame(spark.sparkContext.parallelize(data),schema)
df.printSchema()
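The printed schema shows `name` as a struct with three string fields (firstname, middlename, lastname), followed by the string columns dob and gender and the integer column salary.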
1. Rename a column with withColumnRenamed
**`withColumnRenamed()` is the most direct approach: the first argument is the existing column name and the second is the new name.**
**Syntax:**
def withColumnRenamed(existingName: String, newName: String): DataFrame
**`existingName`: the current name of the column**
**`newName`: the new column name**
Returns a new DataFrame; the original DataFrame is left unchanged.
Example:
df.withColumnRenamed("dob","DateOfBirth")
.printSchema()
The example above renames the dob column to DateOfBirth.
2. Rename multiple columns with withColumnRenamed
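One behaviour worth keeping in mind: `withColumnRenamed` is a no-op when the DataFrame has no column with the given existing name, so a misspelled name goes unnoticed. Below is a minimal sketch of a guard around it. The object `StrictRename` and the helper `renameStrict` are hypothetical names introduced for illustration and are not part of Spark.

```scala
import org.apache.spark.sql.DataFrame

object StrictRename {
  // Hypothetical helper (not a Spark API): fail fast instead of relying on
  // the silent no-op. The check is an exact, case-sensitive match, which is
  // stricter than Spark's default case-insensitive column resolution.
  def renameStrict(df: DataFrame, existingName: String, newName: String): DataFrame = {
    require(
      df.columns.contains(existingName),
      s"Column '$existingName' not found; available: ${df.columns.mkString(", ")}"
    )
    df.withColumnRenamed(existingName, newName)
  }
}
```

With the test DataFrame, `StrictRename.renameStrict(df, "dob", "DateOfBirth")` behaves like the plain call, while a misspelled name raises an `IllegalArgumentException` instead of returning the DataFrame unchanged.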
val df2 = df.withColumnRenamed("dob","DateOfBirth")
.withColumnRenamed("salary","salary_amount")
df2.printSchema()
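When the list of renames grows, chaining calls by hand becomes tedious. One possible alternative is to fold a map of old-to-new names over the DataFrame. This is a sketch only: `BulkRename` and `renameColumns` are hypothetical names, not Spark APIs.

```scala
import org.apache.spark.sql.DataFrame

object BulkRename {
  // Apply each (old -> new) pair in turn, starting from the original DataFrame.
  // Names that do not exist are skipped silently, because withColumnRenamed
  // is a no-op for unknown columns.
  def renameColumns(df: DataFrame, mapping: Map[String, String]): DataFrame =
    mapping.foldLeft(df) { case (acc, (oldName, newName)) =>
      acc.withColumnRenamed(oldName, newName)
    }
}
```

For the example above this would read `BulkRename.renameColumns(df, Map("dob" -> "DateOfBirth", "salary" -> "salary_amount"))`, which yields the same column names as `df2`.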
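3. Rename nested fields with a StructType
import org.apache.spark.sql.functions.col

// Same three string fields as the original struct, with new names for the first and last
val schema2 = new StructType()
  .add("fname",StringType)
  .add("middlename",StringType)
  .add("lname",StringType)
// Casting the struct to schema2 renames its fields by position
df.select(col("name").cast(schema2),
  col("dob"),
  col("gender"),
  col("salary"))
  .printSchema()
The example above keeps name as a nested struct but renames its fields from firstname, middlename, lastname to fname, middlename, lname.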
4. Rename nested fields with select
df.select(col("name.firstname").as("fname"),
col("name.middlename").as("mname"),
col("name.lastname").as("lname"),
col("dob"),col("gender"),col("salary"))
.printSchema()
5. Rename nested fields with withColumn
val df4 = df.withColumn("fname",col("name.firstname"))
.withColumn("mname",col("name.middlename"))
.withColumn("lname",col("name.lastname"))
.drop("name")
df4.printSchema()
6. Rename all or several columns with col()
val old_columns = Seq("dob","gender","salary","fname","mname","lname")
val new_columns = Seq("DateOfBirth","Sex","salary","firstName","middleName","lastName")
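// Pair each old name with its new name and alias the column accordingly
val columnsList = old_columns.zip(new_columns).map { case (oldName, newName) => col(oldName).as(newName) }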
val df5 = df4.select(columnsList:_*)
df5.printSchema()
7. Rename all columns with toDF()
When the DataFrame has no nested columns to rename, passing the full list of new names to **`toDF()`** renames every column at once.
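// toDF needs exactly one name per column; df has four top-level columns
val newColumns = Seq("newCol1","newCol2","newCol3","newCol4")
val df6 = df.toDF(newColumns:_*)
df6.printSchema()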
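Because `toDF()` expects exactly one name per existing column, one way to guarantee a matching count is to derive the new names from `df.columns`. The following is a rough sketch under that assumption; `DerivedNames` and `mapColumnNames` are hypothetical names introduced here, not Spark APIs.

```scala
import org.apache.spark.sql.DataFrame

object DerivedNames {
  // Rename every top-level column by applying `f` to its current name.
  // The name count always matches because it is computed from df.columns.
  // Fields inside nested structs are not touched.
  def mapColumnNames(df: DataFrame)(f: String => String): DataFrame =
    df.toDF(df.columns.map(f): _*)
}
```

For instance, `DerivedNames.mapColumnNames(df4)(_.toUpperCase)` would upper-case all six column names of the flattened DataFrame from section 5.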