Update Duplicate Detection #77

Open
Lightning11wins wants to merge 121 commits into master from dups

Conversation

@Lightning11wins
Contributor

@Lightning11wins Lightning11wins commented Nov 14, 2025

The duplicate detection project is ready for review, although even in the best case a couple of things still block it from being ready to merge.

I would appreciate a full review of all changes, as there's quite a lot here. That said, some areas may require additional special attention, so I've compiled a list of all 28 TODO: Greg comments below. (Note: Some of my todos assume the reader understands various pieces of nearby context / has generally read the indicated source code.)

  • 3 TODO: Gregs in objdrv_cluster.c
  • 1 TODO: Greg in mtsession.md.
  • 1 TODO: Greg in xarray.md.
  • 1 TODO: Greg in xstring.md.

Please let me know if you have any questions, comments, or concerns about my changes and design choices.

Israel added 10 commits October 13, 2025 09:53
Improve edge case logic in comparison functions.
Remove unregister driver function.
Clean up exp_functions.c.
Simplify dataqa_duplicates component in preparation for making it the boundary into our new duplicate system.
Add exp functions: sparse_eql(), ln(), and logn().
Fix bugs in comparison functions.
Make minor tweaks to objdrv_cluster.c.
Modify cluster files to use string keys.
Build vectors fully sparsely.
Add ca_fprint_vector().
Add snprint_llu().
Add exp_fn_trim().
Update exp_fn_cmp().
Organize exp function definitions by group.
Add statistics tracking to cluster driver.
Reduce minimum hint threshold.
Add array handling to ci_xaToTrimmedArray().
Update timer to handle multiple starts and stops properly.
Re-add Levenshtein to exp_functions.
Publish edit_dist() in the cluster library.
Fix mistakes in cluster driver function signatures.
Fix spelling mistakes.
Add detail to an error message in the lexer.
Remove unused .cluster files.
Clean up cluster-schema.cluster.
Clean up other unused junk.
Add known issues to string similarity documentation.
Clean up and organize todos.
Clean up testing code in several files.
@Lightning11wins
Contributor Author

Kardia PR.

@Lightning11wins
Contributor Author

I'd probably recommend rebasing these into one commit, but that's up to you.

Israel added 10 commits November 14, 2025 16:10
…ast commit).

Update tests to pass with this modification.
… caches).

Fix a formatting issue with the stat method.
Fix a missing include in the util.c library.
…le hundred bytes.

Add check_double() to handle functions that return NAN on failure.
Clean up.
…rary.

Round similarity results to avoid floating point errors.
Enable caching for memory allocated in get_cluster_size().
Rename edit_dist() to ca_edit_dist() to match format for public functions.
Rename print_diagnostics() to print_err().
@Lightning11wins Lightning11wins requested a review from nboard March 24, 2026 19:06
# Conflicts:
#	centrallix-lib/src/strtcpy.c
@Lightning11wins
Contributor Author

Lightning11wins commented Mar 24, 2026

TODO: Update this to use the new expect.h functionality.

Done.

Fix m4 macros not adding -DHAVE_BUILTIN_EXPECT to CFLAGS.
Fix CHECK_BUILTIN_EXPECT being run too early, causing its CFLAGS to be clobbered by something.
Fix typo in the module line of the expect.h copyright notice.
Remake configure files.
Modify magic functions to print errors to stderr instead of stdout.
Add explicit tree->Alloc = 0; to some exp_functions.
Add error checking.
Add consts.
Re-add typecasts that I thought were optional.
Improve comments.
Clean up.
@Lightning11wins Lightning11wins requested a review from nboard March 26, 2026 17:13
@Lightning11wins
Contributor Author

@greptileai Please rereview.

Comment thread centrallix/expression/exp_functions.c
Comment thread centrallix/expression/exp_functions.c Outdated
Comment thread centrallix/expression/exp_functions.c
@Lightning11wins
Contributor Author

@greptileai I fixed your comments. Review that they are fixed properly. Also, though, I noticed that all of these are in code that wasn't changed recently, so it looks like there are latent bugs we haven't been catching. Please do another thorough re-review of the ENTIRE PR to see if there are any bugs in ANY of the previous work.

Comment thread centrallix/osdrivers/objdrv_cluster.c
Comment thread centrallix/expression/exp_functions.c Outdated
Add some error checking.
Add expect.h to mtask.c.
Add recursion checking.
@Lightning11wins
Contributor Author

@greptileai I fixed your comments. Review that they are fixed properly. Also, though, I noticed that all of these are in code that wasn't changed recently, so it looks like there are latent bugs we haven't been catching. Please do another thorough re-review of the ENTIRE PR to see if there are any bugs in ANY of the previous work.

Comment thread centrallix/osdrivers/objdrv_cluster.c Outdated
Add HAVE_BUILTIN_EXPECT to config.h.in and cxlibconfig.h.in.
Add code to include cxlibconfig.h.in from expect.h.
Add comments to explain what each .h.in file is used for.
Remove checks for __builtin_expect() from centrallix because we only need them in centrallix-lib.
Remove cflags -DHAVE_BUILTIN_EXPECT because using configs is better.
Rebuild configure files.
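
For readers unfamiliar with the pattern these commits configure, here is a minimal sketch of a branch-hint header gated on HAVE_BUILTIN_EXPECT. The macro bodies are an assumption, since the actual contents of expect.h are not shown in this thread; LIKELY and UNLIKELY are the names that appear in the code quoted later in this PR, and count_negative() is a hypothetical usage example.

```c
#include <assert.h>
#include <stddef.h>

/* HAVE_BUILTIN_EXPECT is assumed to be defined by a generated config header
 * when configure detects __builtin_expect(); it is left undefined otherwise. */
#ifdef HAVE_BUILTIN_EXPECT
#define LIKELY(x)   (__builtin_expect(!!(x), 1))
#define UNLIKELY(x) (__builtin_expect(!!(x), 0))
#else
/* Fallback: the hints compile away and only the truth value remains. */
#define LIKELY(x)   (!!(x))
#define UNLIKELY(x) (!!(x))
#endif

/* Hypothetical usage: negative values are expected to be rare. */
static size_t count_negative(const int* values, size_t n)
{
    assert(values != NULL || n == 0);

    size_t count = 0;
    for (size_t i = 0; i < n; i++)
    {
        if (UNLIKELY(values[i] < 0)) count++;
    }
    return count;
}
```

Normalizing the argument with `!!` keeps both variants equivalent in truth value, so code behaves the same whether or not the compiler supports the builtin.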
@Lightning11wins
Contributor Author

@greptileai I fixed your comments. Review that they are fixed properly. Also, though, I noticed that all of these are in code that wasn't changed recently, so it looks like there are latent bugs we haven't been catching. Please do another thorough re-review of the ENTIRE PR to see if there are any bugs in ANY of the previous work.

Comment thread centrallix/expression/exp_functions.c
Comment thread centrallix/osdrivers/objdrv_cluster.c
@Lightning11wins
Contributor Author

@greptileai I fixed your comments. Review that they are fixed properly. Also, though, I noticed that all of these are in code that wasn't changed recently, so it looks like there are latent bugs we haven't been catching. Please do another thorough re-review of the ENTIRE PR to see if there are any bugs in ANY of the previous work.

@Lightning11wins
Contributor Author

Oops, forgot to post my message before...

This PR is cleared for human review.

@Lightning11wins Lightning11wins added the documentation label Apr 13, 2026
Comment on lines +4550 to +4554
goto end_free;
}

exp_fn_levenshtein(tree, objlist, i0, i1, i2);
//!!! I am not checking for errors here, because IN THEORY we have two strings... if we don't, big uh-oh.
int lev_dist = tree->Integer;
/** Allocate space to store metaphone pointers. **/
char* primary = NULL;

P1 Uninitialized primary/secondary after schema-check goto end_free

goto end_free at line 4550 jumps past char* primary = NULL; and char* secondary = NULL; (lines 4554–4555). In C, the initializer is never executed when the declaration is jumped over, so both pointers have indeterminate values at end_free:. Because free_strs is true at that point, if (LIKELY(free_strs && primary != NULL)) nmSysFree(primary) may call nmSysFree with a garbage stack pointer, causing undefined behaviour (likely a crash).

Fix: move both declarations to the top of the function, before any goto:

int ret = -1;
bool free_strs = true;
char* primary = NULL;
char* secondary = NULL;
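
A minimal, self-contained sketch of the pattern behind this fix. The function build_codes() and its fail_early parameter are hypothetical stand-ins, and malloc()/free() substitute for the project's allocator; this is not the actual exp_functions.c code. Because both pointers are initialized before the first goto, the cleanup label always sees either NULL or a valid allocation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the reviewed function: returns 0 on success, -1 on failure. */
static int build_codes(bool fail_early, char* out, size_t out_size)
{
    /* Declare and initialize everything the cleanup label touches BEFORE any goto. */
    int ret = -1;
    char* primary = NULL;
    char* secondary = NULL;

    assert(out != NULL && out_size > 0);
    out[0] = '\0';

    /* Early failure path (for example, a failed schema check): jump to cleanup. */
    if (fail_early) goto end_free;

    primary = malloc(8);
    secondary = malloc(8);
    if (primary == NULL || secondary == NULL) goto end_free;
    strcpy(primary, "KRT");
    strcpy(secondary, "KRTS");

    /* Refuse to write if "primary,secondary" plus the terminator would not fit. */
    if (strlen(primary) + 1 + strlen(secondary) + 1 > out_size) goto end_free;
    strcpy(out, primary);
    strcat(out, ",");
    strcat(out, secondary);
    ret = 0;

end_free:
    /* Safe on every path: free(NULL) is a no-op, and no pointer is indeterminate. */
    free(primary);
    free(secondary);
    return ret;
}
```

With the declarations hoisted, the early-exit path frees two NULL pointers, which is harmless, instead of reading values that were never written.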

/** Store the results. **/
store_data:;
const size_t length = strlen(primary) + 1lu + strlen(secondary) + 1lu;
if (check(exp_fn_i_alloc_result_string(tree, length)) != 0) goto end_free;

P1 Silent success return when exp_fn_i_alloc_result_string fails after metaphone succeeds

When meta_double_metaphone succeeds it sets ret = 0. If exp_fn_i_alloc_result_string then fails, control jumps to end_free: still holding ret = 0. The guard if (UNLIKELY(ret != 0)) mssError(...) is never triggered, no error is printed, and the function returns 0 (success) without having stored anything in tree->String. The caller then treats the expression as a valid string result.

Suggested change
- if (check(exp_fn_i_alloc_result_string(tree, length)) != 0) goto end_free;
+ if (check(exp_fn_i_alloc_result_string(tree, length)) != 0) { ret = -1; goto end_free; }
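
An alternative to resetting ret on each failure branch is to leave it at its failure value until the last fallible step has succeeded. The sketch below illustrates that ordering under stated assumptions: store_result() and compute_and_store() are hypothetical stand-ins, and malloc() substitutes for the project's allocator; none of this is the actual exp_functions.c code.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a result-storing step that can fail. */
static int store_result(char** slot, const char* value, int simulate_failure)
{
    if (simulate_failure) return -1;
    char* copy = malloc(strlen(value) + 1);
    if (copy == NULL) return -1;
    strcpy(copy, value);
    *slot = copy;
    return 0;
}

/* Returns 0 only if every step succeeded; *slot is left untouched on failure. */
static int compute_and_store(char** slot, int simulate_failure)
{
    int ret = -1;                 /* Pessimistic default. */
    const char* computed = "KRT"; /* Pretend an earlier step succeeded. */

    assert(slot != NULL);

    if (store_result(slot, computed, simulate_failure) != 0) goto end;

    ret = 0; /* Set success only after the final fallible step. */

end:
    return ret;
}
```

Either approach closes the silent-success path; deferring the assignment simply means a newly added failure branch cannot forget to reset ret.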

@Lightning11wins
Contributor Author

TODO to self, for later: Add code location support to mssError() using a macro.
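
A rough sketch of what such a macro could look like. report_error() is a hypothetical stand-in that formats into a buffer rather than calling the real mssError(), whose signature is not shown in this thread, and REPORT_ERROR is likewise an illustrative name. The macro captures the call site with the standard __FILE__, __LINE__, and __func__ identifiers.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical stand-in for the real error function: formats into a static
 * buffer so the result can be inspected. Not thread-safe. */
static char last_error[256];

static int report_error(const char* file, int line, const char* func, const char* fmt, ...)
{
    assert(fmt != NULL);

    /* Prefix with the call site, then append the caller's message. */
    int used = snprintf(last_error, sizeof(last_error), "%s:%d (%s): ", file, line, func);
    if (used < 0 || (size_t)used >= sizeof(last_error)) return -1;

    va_list args;
    va_start(args, fmt);
    int rc = vsnprintf(last_error + used, sizeof(last_error) - (size_t)used, fmt, args);
    va_end(args);
    return (rc < 0) ? -1 : 0;
}

/* Captures the call site automatically; requires at least a format string. */
#define REPORT_ERROR(...) report_error(__FILE__, __LINE__, __func__, __VA_ARGS__)
```

A call such as `REPORT_ERROR("bad value %d", 42)` would then record the file, line, and function of the caller ahead of the message, without any change at the call site beyond the macro name.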

Labels

  • ai-review: Request AI review for PRs.
  • documentation: Changes, improvements, or fixes to documentation files.
  • enhancement
  • testing

3 participants