Hive Catalog fails to read Spark DataSource tables - incorrectly treats Parquet as SequenceFile #70956

@ever4Kenny

Description

Steps to reproduce the behavior (Required)

  1. Create a Spark DataSource table in Spark SQL:
    CREATE TABLE test_spark_datasource_table (
    id INT,
    name STRING
    )
    USING PARQUET
    LOCATION 'hdfs://lan/user/hive/warehouse/test.db/test_table';

  2. Insert test data into the table in Spark SQL:
    INSERT INTO test_spark_datasource_table VALUES
    (1, 'Alice'),
    (2, 'Bob');

  3. Query the table through Hive Catalog in StarRocks:
    SELECT * FROM hive_catalog.test_db.test_spark_datasource_table;

Expected behavior (Required)

StarRocks should correctly identify the table as a Parquet table and read the actual schema and data from the Parquet files.

Real behavior (Required)

Query fails with the following error:
Failed to open the off-heap table scanner. java exception details: java.io.IOException: Failed to open the hive reader.
    at com.starrocks.hive.reader.HiveScanner.open(HiveScanner.java:218)
Caused by: java.io.IOException: hdfs:/user/hive/warehouse/test.db/test_table/part-00004-b0139440-a3b4-4755-864c-0c1c386da2f8-c000.snappy.parquet not a SequenceFile
    at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:2036)
    at org.apache.hadoop.io.SequenceFile$Reader.initialize(SequenceFile.java:1982)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1931)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1945)
    at org.apache.hadoop.mapred.SequenceFileRecordReader.<init>(SequenceFileRecordReader.java:49)
    at org.apache.hadoop.mapred.SequenceFileInputFormat.getRecordReader(SequenceFileInputFormat.java:64)
    at com.starrocks.hive.reader.HiveScanner.initReader(HiveScanner.java:194)
    at com.starrocks.hive.reader.HiveScanner.open(HiveScanner.java:214)

Root Cause:

Spark DataSource tables store their metadata differently from traditional Hive tables. When a table is created with USING PARQUET (or ORC, etc.), Spark stores:

  • A placeholder SerDe and file format (SequenceFile) in the standard Hive metastore fields
  • A placeholder column schema in the standard cols field
  • The actual file format and schema in TABLE_PARAMS, under keys such as spark.sql.sources.provider and spark.sql.sources.schema (long schemas are split across numbered spark.sql.sources.schema.part.N entries)

StarRocks currently reads only the standard Hive metastore fields, so it attempts to scan the Parquet files with a SequenceFile reader and fails with the error above.
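A minimal sketch of the detection logic described above: given a table's TABLE_PARAMS as a dict, decide whether it is a Spark DataSource table and, if so, recover the real provider and schema. The parameter keys (spark.sql.sources.provider, spark.sql.sources.schema.numParts, spark.sql.sources.schema.part.N) are the ones Spark writes; the helper name and the sample dict are illustrative, not StarRocks code.

```python
import json

def resolve_spark_datasource(table_params):
    """Return (provider, schema) if table_params mark a Spark DataSource
    table, else None. Spark splits long schemas across numbered
    'spark.sql.sources.schema.part.N' keys, so the parts are rejoined
    before parsing the JSON."""
    provider = table_params.get("spark.sql.sources.provider")
    if provider is None:
        return None  # plain Hive table: the SerDe/InputFormat can be trusted
    num_parts = int(table_params.get("spark.sql.sources.schema.numParts", 0))
    if num_parts:
        raw = "".join(
            table_params[f"spark.sql.sources.schema.part.{i}"]
            for i in range(num_parts)
        )
    else:
        # Some Spark versions store the schema under a single key instead.
        raw = table_params.get("spark.sql.sources.schema", "")
    return provider.lower(), json.loads(raw) if raw else None

# Parameters resembling what Spark writes for the reproduction table above.
params = {
    "spark.sql.sources.provider": "parquet",
    "spark.sql.sources.schema.numParts": "1",
    "spark.sql.sources.schema.part.0": json.dumps({
        "type": "struct",
        "fields": [
            {"name": "id", "type": "integer", "nullable": True, "metadata": {}},
            {"name": "name", "type": "string", "nullable": True, "metadata": {}},
        ],
    }),
}
provider, schema = resolve_spark_datasource(params)
print(provider)                                  # parquet
print([f["name"] for f in schema["fields"]])     # ['id', 'name']
```

A fix along these lines would let the connector ignore the placeholder SequenceFile SerDe whenever spark.sql.sources.provider is present and choose the Parquet/ORC reader from that value instead.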

StarRocks version (Required)

  • 3.3.11

Labels

    type/bug (Something isn't working)
