CRUD eval problems

Hi,
I have greatly benefited from reading and reproducing your work. While evaluating on the CRUD dataset recently, I noticed two patterns and would like to ask for your advice.

In my experiments, **the response length of the LLM appears to be positively correlated with the average chunk length**, and this in turn causes metrics such as BLEU to fluctuate. Was the **Dynamic Merging mechanism** proposed in your Meta-Chunk method designed, at least in part, to address this issue?
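
To make the first observation concrete, here is a minimal sketch of how such a correlation could be checked. The function name `pearson` and the example values are hypothetical placeholders, not measurements from my runs; in practice each pair would be (average chunk length, average response length) for one chunking configuration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Unnormalized standard deviations; the normalizers cancel in the ratio.
    dev_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if dev_x == 0 or dev_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (dev_x * dev_y)


# Placeholder values only (assumption): one entry per chunking configuration.
avg_chunk_len = [100, 200, 300, 400]
avg_response_len = [50, 100, 150, 200]
r = pearson(avg_chunk_len, avg_response_len)
```

A value of `r` close to 1 across configurations would support the positive correlation described above; a value near 0 would not.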

In addition, I noticed that **the reference texts in the CRUD dataset are generally short**, whereas the outputs generated by RAG models tend to be much longer. This seems to give shorter generations an advantage in the evaluation metrics. **Does this imply that CRUD, as an evaluation benchmark, may not fully and objectively reflect true performance?**
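
As a toy illustration of the length effect I mean, the sketch below computes clipped unigram precision, the 1-gram component of BLEU, for a short and a long answer against the same short reference. This is a simplified stand-in rather than the CRUD evaluation code: the function name and sentences are made up for illustration, and whitespace tokenization is an assumption (Chinese text would need a proper tokenizer).

```python
from collections import Counter


def clipped_unigram_precision(candidate, reference):
    """Fraction of candidate tokens that also occur in the reference,
    with each token's count clipped to its count in the reference."""
    cand_tokens = candidate.split()
    if not cand_tokens:
        return 0.0
    ref_counts = Counter(reference.split())
    cand_counts = Counter(cand_tokens)
    matched = sum(min(count, ref_counts[tok]) for tok, count in cand_counts.items())
    return matched / len(cand_tokens)


# Made-up example sentences (assumption), not taken from CRUD.
reference = "the cat sat on the mat"
short_answer = "the cat sat on the mat"
long_answer = short_answer + " and then it slept for the whole afternoon"

short_p = clipped_unigram_precision(short_answer, reference)
long_p = clipped_unigram_precision(long_answer, reference)
```

Both answers contain the full reference, but the extra tokens in the longer answer enlarge the denominator without adding matches, so `long_p` is lower than `short_p`. BLEU's brevity penalty only penalizes candidates shorter than the reference, so nothing offsets this drop for long outputs.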

I would be very grateful for your guidance, and I truly appreciate your important contributions in this area.


CRUD eval problems #17
