Conversation
aavshr left a comment
Thanks for the PR @satp42
The pull request is doing several things, and without any measurements it's hard to gauge whether we're solving the actual problem.
On config
userConfig is meant to be configured by the user, but exposing batch size, thread, and connection settings to the average user is not useful, especially when the user doesn't know the internals of how the embeddings work.
Lazy embeddings
The lazy embeddings are mainly for write-heavy resources (right now only the note). They could also make sense for other resources, but that leads to a case where, when someone has a notebook in the context, it takes a long time before things are embedded when they ask a question.
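To make the tradeoff concrete, here is a minimal sketch in Rust of an explicit per-resource policy. All names here (`ResourceKind`, `EmbeddingPolicy`, `policy_for`) are hypothetical, not types from the surf codebase:

```rust
// Hypothetical sketch: an explicit embedding policy per resource type,
// instead of applying lazy embeddings across the board.
enum ResourceKind {
    Note,    // write-heavy: re-embedding on every save is wasted work
    Pdf,     // large and mostly static
    Article, // large and mostly static
}

enum EmbeddingPolicy {
    Eager, // embed on ingest so questions over the resource answer immediately
    Lazy,  // tag only; embed on first use
}

fn policy_for(kind: &ResourceKind) -> EmbeddingPolicy {
    match kind {
        // Notes change constantly, so deferring embedding amortizes writes.
        ResourceKind::Note => EmbeddingPolicy::Lazy,
        // Making these lazy would stall the first question over a notebook
        // while the whole corpus embeds.
        ResourceKind::Pdf | ResourceKind::Article => EmbeddingPolicy::Eager,
    }
}
```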
Embeddings batch and chunk size
We could perhaps be even more aggressive on the batch size depending on the user's machine. This should be dynamic with a sane default (32 seems safe); see the sketch below.
On the chunk size, it could perhaps even be lower: embedding models scale with token length, so longer chunks need more compute, and smaller chunks are actually better for parallelization.
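A minimal sketch of what a dynamic default could look like, assuming we key it off core count. The per-core multiplier and the bounds are assumptions to illustrate the idea, not measured numbers:

```rust
use std::thread;

// Sketch: derive the embedding batch size from the machine instead of
// hardcoding it. The `* 8` multiplier and the 32..=128 range are
// illustrative assumptions, not tuned values.
fn default_batch_size() -> usize {
    let cores = thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(4); // conservative fallback when detection fails
    (cores * 8).clamp(32, 128)
}
```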
Rayon & connections limit
We should not use rayon's global builder: the embedding models use fastembed-rs, which uses rayon underneath as well. The unlimited thread spawning does seem like the culprit for the CPU saturation, but it could be solved with a simple thread pool (sketched below), or we should use an async runtime (probably better for the long term).
Also, the default values are not ideal: 4 max threads contested by 8 max connections.
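For reference, a minimal sketch of the scoped-pool approach (function and variable names are illustrative): building a local `rayon::ThreadPool` bounds our own parallelism without touching the global pool that fastembed-rs relies on.

```rust
use rayon::ThreadPoolBuilder;

// Sketch: a local, bounded pool instead of rayon's global builder.
// The global pool (used internally by fastembed-rs) stays untouched.
fn run_embedding_jobs(max_threads: usize) -> Result<(), rayon::ThreadPoolBuildError> {
    let pool = ThreadPoolBuilder::new()
        .num_threads(max_threads)
        .build()?; // a private pool; unlike the global one, this can be built repeatedly

    pool.install(|| {
        // Parallel iterators and spawns in here run on the bounded pool,
        // e.g. chunks.par_iter().map(embed_chunk)...
    });
    Ok(())
}
```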
The code also doesn't compile (culprit: https://github.com/deta/surf/pull/43/files#diff-1e2c482dbe66cf699a1c8731d573227090fb956ca6259f6797e27b551d410d24R156) .
We need to do some actual measurements to find the root cause of the CPU saturation before making all these changes.
Can you please split the PR down to just the embedding batch size change?
For the other changes, we should discuss on this issue to get to the root cause and then settle on the right approach.
Added new settings to control embedding performance in `packages/types/src/config.types.ts`. Specifically:
- `embedding_batch_size` (number, default: 64)
- `embedding_max_threads` (number, default: 4)
- `embedding_max_connections` (number, default: 8)
Modified `packages/backend-server/src/main.rs` to accept additional command-line arguments for batch size, max threads, and max connections.
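Something like the following sketch, with illustrative flag names and plain `std::env` parsing; the actual `main.rs` may use a different argument parser:

```rust
use std::env;

// Sketch: parse the three new flags with defaults matching the config types.
// Flag names are hypothetical; the real main.rs may differ.
fn parse_flag(name: &str, default: usize) -> usize {
    let mut args = env::args();
    while let Some(arg) = args.next() {
        if arg == name {
            if let Some(value) = args.next().and_then(|v| v.parse().ok()) {
                return value;
            }
        }
    }
    default
}

fn main() {
    let batch_size = parse_flag("--embedding-batch-size", 64);
    let max_threads = parse_flag("--embedding-max-threads", 4);
    let max_connections = parse_flag("--embedding-max-connections", 8);
    println!("batch={batch_size} threads={max_threads} connections={max_connections}");
}
```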
Updated `packages/backend-server/src/server/mod.rs`:
- Added a `max_connections` field to `LocalAIServer`
- Configured `rayon::ThreadPoolBuilder` before starting the server
Modified `packages/backend-server/src/embeddings/model.rs`:
- Added a `batch_size` field to the `EmbeddingModel` struct
- Replaced the hardcoded `Some(1)` at line 71 with the configurable `self.batch_size` (see the sketch below)

Passed configuration from the Electron main process.
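A rough sketch of that batch-size change, assuming the struct shape shown here (field and method names are illustrative, not copied from the actual `model.rs`): fastembed's `embed` takes an `Option<usize>` batch size, which was previously pinned to `Some(1)`.

```rust
use fastembed::TextEmbedding;

// Sketch of threading a configurable batch size through to fastembed-rs.
pub struct EmbeddingModel {
    model: TextEmbedding,
    batch_size: usize, // was effectively hardcoded via Some(1)
}

impl EmbeddingModel {
    pub fn embed_chunks(&self, chunks: Vec<String>) -> anyhow::Result<Vec<Vec<f32>>> {
        // Larger batches amortize per-call overhead; Some(1) embedded one
        // chunk at a time and left throughput on the table.
        Ok(self.model.embed(chunks, Some(self.batch_size))?)
    }
}
```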
Implemented lazy embeddings for large document types: `ResourceTextContentType::PDF`, `ResourceTextContentType::Document`, and `ResourceTextContentType::Article` now generate a `LazyEmbeddings` tag instead of immediate embedding generation.

Optimized the chunking strategy:
- Increased `max_chunk_size` from 2000 to 2500 characters (reduces total chunks by ~20% while maintaining quality)
- Kept `overlap_sentences` at 1 for continuity

The expected impact of this PR:
Related to #28