Conversation
aavshr left a comment
Thanks for the PR @satp42
The pull request is doing several things, and without any measurements it's hard to gauge whether we're solving the actual problem.
On config
userConfig is meant to be configured by the user, but exposing batch size, thread, and connection settings to the average user is not useful, especially when the user doesn't know the internals of how the embeddings work.
Lazy embeddings
The lazy embeddings are mainly for write-heavy resources (right now only the note). They could also make sense for other resources, but that leads to a case where, when someone has a notebook in the context, it takes a long time before things are embedded when they ask a question.
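To make the tradeoff concrete, here is a minimal sketch in Rust of an explicit per-resource policy. All names here (`ResourceKind`, `EmbeddingPolicy`, `policy_for`) are hypothetical, not types from the surf codebase:

```rust
// Hypothetical sketch: an explicit embedding policy per resource type,
// instead of applying lazy embeddings across the board.
enum ResourceKind {
    Note,    // write-heavy: re-embedding on every save is wasted work
    Pdf,     // large and mostly static
    Article, // large and mostly static
}

enum EmbeddingPolicy {
    Eager, // embed on ingest so questions over the resource answer immediately
    Lazy,  // tag only; embed on first use
}

fn policy_for(kind: &ResourceKind) -> EmbeddingPolicy {
    match kind {
        // Notes change constantly, so deferring embedding amortizes writes.
        ResourceKind::Note => EmbeddingPolicy::Lazy,
        // Making these lazy would stall the first question over a notebook
        // while the whole corpus embeds.
        ResourceKind::Pdf | ResourceKind::Article => EmbeddingPolicy::Eager,
    }
}
```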
Embeddings batch and chunk size
We could perhaps be even more aggressive on the batch size depending on the user's machine. This should be dynamic with a sane default (32 seems safe); see the sketch below.
On the chunk size, it could perhaps even be lower: embedding models scale with token length, so longer chunks need more compute, and smaller chunks are actually better for parallelization.
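A minimal sketch of what a dynamic default could look like, assuming we key it off core count. The per-core multiplier and the bounds are assumptions to illustrate the idea, not measured numbers:

```rust
use std::thread;

// Sketch: derive the embedding batch size from the machine instead of
// hardcoding it. The `* 8` multiplier and the 32..=128 range are
// illustrative assumptions, not tuned values.
fn default_batch_size() -> usize {
    let cores = thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(4); // conservative fallback when detection fails
    (cores * 8).clamp(32, 128)
}
```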
Rayon & connections limit
We should not use rayon's global builder: the embedding models use fastembed-rs, which uses rayon underneath as well. The unlimited thread spawning does seem like the culprit for the CPU saturation, but it could be solved with a simple thread pool (sketched below), or we should use an async runtime (probably better for the long term).
Also, the default values are not ideal: 4 max threads contested by 8 max connections.
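For reference, a minimal sketch of the scoped-pool approach (function and variable names are illustrative): building a local `rayon::ThreadPool` bounds our own parallelism without touching the global pool that fastembed-rs relies on.

```rust
use rayon::ThreadPoolBuilder;

// Sketch: a local, bounded pool instead of rayon's global builder.
// The global pool (used internally by fastembed-rs) stays untouched.
fn run_embedding_jobs(max_threads: usize) -> Result<(), rayon::ThreadPoolBuildError> {
    let pool = ThreadPoolBuilder::new()
        .num_threads(max_threads)
        .build()?; // a private pool; unlike the global one, this can be built repeatedly

    pool.install(|| {
        // Parallel iterators and spawns in here run on the bounded pool,
        // e.g. chunks.par_iter().map(embed_chunk)...
    });
    Ok(())
}
```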
The code also doesn't compile (culprit: https://github.com/deta/surf/pull/43/files#diff-1e2c482dbe66cf699a1c8731d573227090fb956ca6259f6797e27b551d410d24R156) .
We need to do some actual measurements to find the root cause of the CPU saturation before making all these changes.
Can you please split the PR down to just the embedding batch size change?
For the other changes, we should discuss on this issue to get to the root cause and then settle on the right approach.
Added new settings to control embedding performance in `packages/types/src/config.types.ts`. Specifically:
- `embedding_batch_size` (number, default: 64)
- `embedding_max_threads` (number, default: 4)
- `embedding_max_connections` (number, default: 8)
Modified `packages/backend-server/src/main.rs` to accept additional command-line arguments for batch size, max threads, and max connections.
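Something like the following sketch, with illustrative flag names and plain `std::env` parsing; the actual `main.rs` may use a different argument parser:

```rust
use std::env;

// Sketch: parse the three new flags with defaults matching the config types.
// Flag names are hypothetical; the real main.rs may differ.
fn parse_flag(name: &str, default: usize) -> usize {
    let mut args = env::args();
    while let Some(arg) = args.next() {
        if arg == name {
            if let Some(value) = args.next().and_then(|v| v.parse().ok()) {
                return value;
            }
        }
    }
    default
}

fn main() {
    let batch_size = parse_flag("--embedding-batch-size", 64);
    let max_threads = parse_flag("--embedding-max-threads", 4);
    let max_connections = parse_flag("--embedding-max-connections", 8);
    println!("batch={batch_size} threads={max_threads} connections={max_connections}");
}
```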
Updated `packages/backend-server/src/server/mod.rs`:
- Added a `max_connections` field to `LocalAIServer`
- Configured `rayon::ThreadPoolBuilder` before starting the server
Modified `packages/backend-server/src/embeddings/model.rs`:
- Added a `batch_size` field to the `EmbeddingModel` struct
- Replaced the hardcoded `Some(1)` at line 71 with the configurable `self.batch_size` (see the sketch below)

Passed configuration from the Electron main process.
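A rough sketch of that batch-size change, assuming the struct shape shown here (field and method names are illustrative, not copied from the actual `model.rs`): fastembed's `embed` takes an `Option<usize>` batch size, which was previously pinned to `Some(1)`.

```rust
use fastembed::TextEmbedding;

// Sketch of threading a configurable batch size through to fastembed-rs.
pub struct EmbeddingModel {
    model: TextEmbedding,
    batch_size: usize, // was effectively hardcoded via Some(1)
}

impl EmbeddingModel {
    pub fn embed_chunks(&self, chunks: Vec<String>) -> anyhow::Result<Vec<Vec<f32>>> {
        // Larger batches amortize per-call overhead; Some(1) embedded one
        // chunk at a time and left throughput on the table.
        Ok(self.model.embed(chunks, Some(self.batch_size))?)
    }
}
```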
Implemented lazy embeddings for large document types: `ResourceTextContentType::PDF`, `ResourceTextContentType::Document`, and `ResourceTextContentType::Article` now generate a `LazyEmbeddings` tag instead of immediate embedding generation.

Optimized the chunking strategy:
- Increased `max_chunk_size` from 2000 to 2500 characters (reduces total chunks by ~20% while maintaining quality)
- Kept `overlap_sentences` at 1 for continuity

The expected impact of this PR:
Related to #28