
@milenkovicm
Contributor

@milenkovicm milenkovicm commented May 21, 2025

Which issue does this PR close?

Closes #.

Rationale for this change

Previously the job id was generated randomly, without any ordering guarantees, which makes it rather hard to determine the ordering of jobs and returns jobs in essentially random order from the REST API. Spark generates job ids by incrementing an atomic integer.

What changes are included in this PR?

  • use an atomic integer to generate the job id; I do not see a valid reason not to do it
  • move job id generation from the task manager to the scheduler (if we later want to make it configurable, it will be easier to do)
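
The atomic-counter approach can be sketched as below. This is a minimal illustration, not the actual Ballista scheduler API; the `JobIdGenerator` name and `job-` prefix are made up for the example.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical sketch of an incremental job id generator.
/// Names are illustrative, not the real scheduler types.
pub struct JobIdGenerator {
    next: AtomicU64,
}

impl JobIdGenerator {
    pub fn new() -> Self {
        Self { next: AtomicU64::new(0) }
    }

    /// `fetch_add` is atomic, so concurrent callers always get
    /// unique, monotonically increasing ids.
    pub fn next_id(&self) -> String {
        format!("job-{}", self.next.fetch_add(1, Ordering::Relaxed))
    }
}

fn main() {
    let gen = JobIdGenerator::new();
    assert_eq!(gen.next_id(), "job-0");
    assert_eq!(gen.next_id(), "job-1");
}
```

Note that the counter restarts from zero on scheduler restart, which is exactly the leftover-directory concern raised in the open points below.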

Are there any user-facing changes?

  • no, though users might be confused by the new job id format

Open points

  • job_id is used as the directory name where job (shuffle) data is persisted; if we use an incremental id, we may run into problems with leftover directories in case of a scheduler restart

@milenkovicm milenkovicm marked this pull request as draft May 21, 2025 19:37
@Dandandan
Contributor

How would this work in a multi-scheduler setup? Maybe add the scheduler id as the first item?

@milenkovicm
Contributor Author

hey @Dandandan it does not; the change is focused more on aligning with Spark, which to my knowledge does not have a multi-scheduler setup.

I'm considering using ULIDs, which may be a better fit: they are random and sortable. The issue with ULIDs is that they are quite big strings. They would work with multiple schedulers.
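
For context, a ULID is a 48-bit millisecond timestamp followed by 80 bits of entropy, Crockford-base32 encoded into a 26-character string. A rough std-only sketch (the entropy is passed in by the caller here; a real implementation would draw it from a CSPRNG, and the `ulid_like` name is made up):

```rust
use std::time::{SystemTime, UNIX_EPOCH};

// Crockford base32 alphabet: digits then uppercase letters,
// skipping I, L, O, U; ASCII order, so ids sort lexicographically.
const ALPHABET: &[u8; 32] = b"0123456789ABCDEFGHJKMNPQRSTVWXYZ";

/// Sketch of a ULID-style id: high 48 bits hold the millisecond
/// timestamp, low 80 bits hold caller-supplied entropy.
fn ulid_like(entropy: u128) -> String {
    let millis = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .unwrap()
        .as_millis() as u128;
    let value = (millis << 80) | (entropy & ((1u128 << 80) - 1));
    let mut out = [0u8; 26];
    for i in 0..26 {
        // encode 5 bits per character, most significant first
        let shift = 5 * (25 - i);
        out[i] = ALPHABET[((value >> shift) & 0x1f) as usize];
    }
    String::from_utf8(out.to_vec()).unwrap()
}

fn main() {
    let a = ulid_like(42);
    let b = ulid_like(42);
    assert_eq!(a.len(), 26);
    // an id created later sorts lexicographically at or after an earlier one
    assert!(b >= a);
}
```

The 26-character length is the "quite big strings" downside mentioned above.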

wdyt ?

@Dandandan
Contributor

> hey @Dandandan it does not; the change is focused more on aligning with Spark, which to my knowledge does not have a multi-scheduler setup.
>
> I'm considering using ULIDs, which may be a better fit: they are random and sortable. The issue with ULIDs is that they are quite big strings. They would work with multiple schedulers.
>
> wdyt ?

In the API, it's already possible to sort by time (e.g. job start time) based on the job information.

Why would it need to be in the job id?

@milenkovicm
Contributor Author

When going through logs, for example, sortable ids make it easier to reason about job order.

@milenkovicm
Contributor Author

There is an issue with tying the job id to a physical directory: it can make a mess when the scheduler is restarted without restarting the executors, making it possible for job data directories to overlap (shuffle readers may read the wrong spill in the current implementation). So just using an incremental id in the current implementation may cause problems.

I was looking at ULIDs; they generate random but sortable ids. The downside is that they are too big (IMHO), and not the easiest to reason about ordering either.

Snowflake-like ids are sortable and unique; for them, a machine id should be exposed as a configuration parameter.
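
A Snowflake-style id packs a timestamp, a machine id, and a per-millisecond sequence into one 64-bit integer. The sketch below uses Twitter's original field widths (41/10/12 bits); the `snowflake` function name and widths are illustrative, not a proposal for specific Ballista constants.

```rust
/// Hypothetical Snowflake-style id:
/// [41-bit ms timestamp | 10-bit machine id | 12-bit sequence].
/// The machine id would come from configuration, as suggested above.
fn snowflake(timestamp_ms: u64, machine_id: u64, sequence: u64) -> u64 {
    assert!(machine_id < (1 << 10));
    assert!(sequence < (1 << 12));
    (timestamp_ms << 22) | (machine_id << 12) | sequence
}

fn main() {
    let a = snowflake(1_000, 3, 0);
    let b = snowflake(1_001, 3, 0);
    // a later timestamp yields a numerically larger, hence sortable, id
    assert!(b > a);
    // distinct machine ids keep concurrent schedulers from colliding
    assert_ne!(snowflake(1_000, 3, 0), snowflake(1_000, 4, 0));
}
```

This addresses the multi-scheduler question: the timestamp gives ordering, while the configured machine id gives uniqueness across schedulers.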

The current approach is not sortable, and there is a slight chance of a name collision (though with very low probability).

@Dandandan
Contributor

Dandandan commented Jun 1, 2025

Hm yeah. Nothing against it, just my thoughts :). I think ULID would be preferable over an atomic / incremental id.

