---
title: The Sorter Project
emoji: 📊
colorFrom: purple
colorTo: yellow
sdk: docker
sdk_version: "4.66"
python_version: "3.13"
app_file: server/app.py
pinned: false
app_port: 8000
---
We came up with this idea keeping in mind its applications in factories, warehouses, and storage facilities (and even your coffee table!).

sorter is an OpenEnv environment for warehouse slotting and reslotting, covering both macro- and micro-slotting. It models three tasks that human warehouse operators and inventory planners actually perform, with an eye toward automating the process:
- identify incoming items from known inventory metadata
- reposition one item to a better legal slot
- reorganize an entire layout to improve storage quality
The environment is designed for agent evaluation rather than low level robotics. Collision checks, support constraints, bounds, and stackability are enforced by the environment, while the agent is evaluated on recognition, layout decisions, and improvement over time.
- Companies spend millions, if not billions, on establishing, maintaining, and organizing warehouses and storage facilities. In a densely populated country like India, with increasing demand for land and surging property prices, efficient storage and organization become the need of the hour. This creates demand for an environment, and ultimately an agent, that can give companies and organizations the most efficient and logical storage of their "objects".
- Environments and agents that specialize in the full pipeline of identifying, sorting, stacking, and organizing objects or warehouse material are few in number, and we are here to fill that gap.
With population growth shrinking open spaces, it becomes extremely important to build societies and localities that can house a large share of the population. Though built mainly for industrial application, The Sorter Project becomes an extremely useful tool in such settings, enabling proper space utilization to accommodate more people while taking minimum space. (So in the near future we might not have to move to Mars.)
This environment maps to common operations in warehouses, storage rooms, micro-fulfillment centers (similar to the ones used by Zepto, Blinkit, etc.), and factory-floor inventory areas:
- `segment`: match observed object positions to known SKU metadata
- `adjust`: move a single misplaced or newly arrived item to a better slot
- `place`: perform full reslotting when utilization changes or the area must be reorganized
The world state is a 3D grid:
- `current_grid`: current occupancy of the warehouse volume
- `weighted_grid`: a dense preference field that defines which placements are more valuable
- `objects_present`: ground-truth object placements, exposed selectively by task
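To make the grid representation concrete, here is a minimal sketch of how the two exposed grids might look, assuming NumPy arrays; the dimensions and the noise model are illustrative, not the environment's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative grid dimensions (x, y, z); the real values come from the
# environment's config package.
grid_dims = (6, 6, 4)

# current_grid: 1 where a cell is occupied by an object, 0 where it is free.
current_grid = np.zeros(grid_dims, dtype=np.int8)
current_grid[0, 0, 0] = 1  # e.g. a 1x1x1 object at the origin

# weighted_grid: a dense preference field; higher values mark more
# valuable placements. Here we fake one with uniform random noise.
weighted_grid = rng.random(grid_dims)
```

An agent never sees the ground-truth optimum, only this preference field, so "better" placements are those whose covered cells have higher `weighted_grid` values.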
Important decisions, and the reasons behind them:
- We designed our environment to simulate a warehouse (or any space, for that matter) as a three-dimensional grid. Although not an entirely accurate representation, most companies such as NVIDIA, Siemens, Dassault Systèmes, Microsoft, and IBM use similar or nearly identical systems, as recorded in Warehouse Digital Twins by NVIDIA and Supply Chain 2.0 by Microsoft. However, these designs are mostly used by massive corporations for inventory management, automation, and other aspects of warehouse management; we take inspiration from their logic to build something novel.
- To achieve our goal we created two grids: the main grid, called `current_grid`, which contains a few randomly initialized objects to simulate objects present in the warehouse, and `weighted_grid`, a duplicate of the main grid with added noise that indicates placement preference.
- The weighted grid is exposed because it defines what "better" means, but not the "optimal" positions of objects in the warehouse. This preserves the optimization problem while still keeping the reward landscape interpretable.
- We intentionally expose only small, task-relevant slices of state, providing enough context for the model to perform each action without exposing vital ground truths and destroying the reinforcement-learning aspect of the problem.
- The environment keeps `objects_present` as internal truth but reveals it selectively depending on the task. It is revealed in the `adjust` and `place` tasks, ensuring that each task can still run independently.
- We return the latest scalar reward together with textual feedback and advisory messages. This choice supports both the reinforcement-learning logic and the LLM's iterative self-correction.
- Validity constraints such as bounds, non-overlap, stackability, and support are enforced inside the environment rather than delegated to the agent. This keeps the task focused on decision quality and on finding the best "optimal" position, rather than on simulating low-level mechanics such as obstacle avoidance, which does not fit our pursuit.
The reward is shaped across the whole trajectory:
- partial credit for correct segmentation
- incremental reward for better local moves in `adjust`
- incremental reward for improving the total layout in `place`
- penalties for malformed, illegal, destructive, or clearly unhelpful actions
We have developed this environment with 'ease' thanks to OpenEnv!
Our Sorter Project consists of 3 different parts or tasks:
**Task 1: `segment` (Easy)**

Objective: identify every observed object by name from its visible properties and position.
Why it is easy:
- the inventory catalog is known
- object dimensions and stackability are exposed
- the task is discrete and fully auditable
What the agent sees:
- observed object descriptors
- candidate positions
- grid dimensions
- reward and advisory history
These details are exposed because they are the basic details that warehouses and factories already record today as part of their inventory management.
What the agent must return:
```json
{
  "segment": {
    "book": [0, 0, 0, false],
    "bottle": [2, 1, 0, true]
  }
}
```

Done condition:
- the task ends when every object is labeled at the exact correct position
Grader rule:
- exact match mapping from object name to position
- score normalized to be strictly between `0` and `1`
Reward behavior:
- positive reward for each exact correct mapping
- negative reward for wrong labels, wrong positions, or malformed payloads
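As an illustration of the grading rule above, here is a toy grader sketch, not the project's actual `grade_segment` implementation: it counts exact name-to-position matches and smooths the fraction so the result stays strictly inside (0, 1).

```python
from typing import Dict, Tuple

# (x, y, z, rotated), matching the PositionTuple convention in this README.
Position = Tuple[int, int, int, bool]

def grade_segment_sketch(predicted: Dict[str, Position],
                         truth: Dict[str, Position]) -> float:
    """Toy grader: fraction of exact name -> position matches,
    smoothed so the score is strictly between 0 and 1."""
    if not truth:
        return 0.5
    correct = sum(
        1 for name, pos in truth.items()
        if tuple(predicted.get(name, ())) == tuple(pos)
    )
    # Laplace-style smoothing keeps the score in the open interval (0, 1).
    return (correct + 1) / (len(truth) + 2)

truth = {"book": (0, 0, 0, False), "bottle": (2, 1, 0, True)}
pred = {"book": (0, 0, 0, False), "bottle": (2, 1, 1, True)}
print(grade_segment_sketch(pred, truth))  # 1 of 2 correct -> (1+1)/(2+2) = 0.5
```

The smoothing term is one simple way to satisfy the "strictly between 0 and 1" requirement; the real grader in `graders.py` may normalize differently.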
**Task 2: `adjust` (Medium)**

Objective: improve the location of one movable object using the legal candidate moves exposed by the environment.
Why it is medium:
- the agent must choose among legal moves
- the focus locks onto one object after the first valid move
- reward is based on improvement, not merely the validity of the chosen position
What the agent sees:
- current object placements
- adjustable objects
- ranked legal targets with score deltas
- current adjustment focus and visited positions
What the agent must return:
```json
{
  "adjust": ["book", 0]
}
```

Done condition:
- no legal improving moves remain
- no legal targets remain for the chosen object
- or the task logic marks the episode complete
Grader rule:
- validates that the selected `(object_name, option_index)` is legal for the current state
- grades by realized improvement versus achievable improvement
- returns a normalized score strictly between `0` and `1`
Reward behavior:
- positive reward for better legal moves
- zero for legal but not improving moves
- negative reward for worsening or invalid moves
**Task 3: `place` (Hard)**

Objective: propose a full legal layout for all objects that improves total layout quality.
Why it is hard:
- every object must be placed
- overlap, support, bounds, and stackability must all remain valid
- reward depends on total layout quality, not one local move
- the optimal solution is hidden
What the agent sees:
- current object placements
- full object set
- weighted preference field
- reward and optimizer advisory
What the agent must return:
```json
{
  "place": {
    "book": [0, 0, 0, false],
    "bottle": [2, 1, 0, true],
    "box": [4, 0, 0, false]
  }
}
```

Done condition:
- a valid accepted layout may end the episode when the task determines the reorganization is complete
Grader rule:
- validates completeness and physical legality of the full layout
- compares achieved layout quality against the current state
- returns a normalized score strictly between `0` and `1`
Reward behavior:
- positive reward for improving total layout score
- negative reward for incomplete, overlapping, unsupported, or out-of-bounds layouts
NOTE: In a real-life scenario, all tasks would ideally be run sequentially, in chronological order, so the agent can function independently without any external context on the objects present, acting on its own volition with full freedom.
The three tasks intentionally progress from local recognition to local optimization to global optimization:
| Task | Difficulty | Core skill |
|---|---|---|
| `segment` | Easy | Recognition and exact mapping |
| `adjust` | Medium | Constrained local search |
| `place` | Hard | Global layout optimization |
The environment exposes a typed `SorterAction` Pydantic model composed of three task-specific action fields.
| Field | Type | Used in | Meaning |
|---|---|---|---|
| `segment` | `Dict[str, PositionTuple]` | `segment` | Predicted object name to position mapping |
| `adjust` | `Tuple[str, int]` or empty tuple | `adjust` | Choose one legal move by object name and exposed option index |
| `place` | `Dict[str, PositionTuple]` | `place` | Proposed full layout for every object |
`PositionTuple = (x, y, z, rotated)`

- `x`, `y`, `z` are integer coordinates in the grid
- `rotated` is a boolean indicating whether the object is rotated relative to its default orientation
One action payload should target the active task and leave the other task fields empty.
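The action model described above can be sketched roughly as follows, assuming Pydantic v2; the real class lives in the project's `models/` package and may use stricter field types and validators than this illustration.

```python
from typing import Any, Dict, Tuple

from pydantic import BaseModel, Field

# (x, y, z, rotated); the real project likely defines a typed alias for this.
PositionTuple = Tuple[int, int, int, bool]

class SorterAction(BaseModel):
    """Sketch of the action payload: one field per task, with the
    inactive task fields left empty."""
    segment: Dict[str, PositionTuple] = Field(default_factory=dict)
    # Loosely typed here; the README specifies Tuple[str, int] or empty tuple.
    adjust: Tuple[Any, ...] = ()
    place: Dict[str, PositionTuple] = Field(default_factory=dict)

# One payload targets the active task and leaves the others empty:
action = SorterAction(adjust=("book", 0))
print(action.model_dump())
```

Sending `{"adjust": ["book", 0]}` over the HTTP API corresponds to constructing this model with only the `adjust` field populated.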
The environment exposes a typed `SorterObservation` Pydantic model. Some fields are global, while others are revealed only for the active task.
| Field | Type | Visible in | Meaning |
|---|---|---|---|
| `grid_dims` | `Tuple[int, int, int]` | All tasks | Dimensions of the warehouse grid |
| `weighted_grid` | `NDArray` | All tasks | Preference field defining what counts as a better placement |
| `current_grid` | `NDArray` | All tasks | Current occupancy grid |
| `reward` | `float` | All tasks | Latest scalar reward |
| `reward_details` | `Tuple[List[float], List[str]]` | All tasks | Reward event log and matching textual feedback |
| `advisory` | `List[str]` | All tasks | Guidance, including optimizer messages |
| `done` | `bool` | All tasks | Whether the current episode is finished |
| Field | Type | Task | Meaning |
|---|---|---|---|
| `positions_segment` | `Dict[str, PositionTuple]` | `segment` | Segment-specific internal positions tracked by the task |
| `positions` | `List[PositionTuple]` | `segment` | Candidate positions shown during segmentation |
| `observed_objects` | `List[Dict[str, Any]]` | `segment` | Visible descriptors such as dimensions, stackability, and volume |
| `last_segment_attempt` | `Dict[str, PositionTuple]` | `segment` | Most recent segmentation payload |
| `objects_present` | `Dict[str, PositionTuple]` | `adjust`, `place` | Currently exposed object placements |
| `positions_adjust` | `Dict[str, PositionTuple]` | `adjust` | Positions relevant to adjustment |
| `adjustable_objects` | `List[Dict[str, Any]]` | `adjust` | Objects eligible for movement and their legal targets |
| `adjust_focus_object` | `str` | `adjust` | Object currently locked for multi-step adjustment |
| `adjust_start_position` | `PositionTuple` or empty tuple | `adjust` | Original position of the focused object |
| `adjust_visited_positions` | `List[Tuple[int, int, int]]` | `adjust` | Legal coordinates already explored |
| `adjust_action_options` | `List[List[Any]]` | `adjust` | Currently valid `(object_name, option_index)` choices |
| `positions_place` | `Dict[str, PositionTuple]` | `place` | Placement map used during full-layout optimization |
- `reset()` creates a fresh random layout, clears reward history, clears task-specific caches, and starts a new episode
- `step(action)` applies the task-specific transition and returns a typed `SorterObservation`
- `state` returns the current typed internal state for inspection and grading
`weighted_grid` reveals what the environment prefers without exposing the optimizer answer. This gives agents useful reward-shaping context while preserving the optimization problem.
The environment surfaces legal candidate moves instead of the full coordinate search space. This keeps the task tractable and better matches real planning systems that propose feasible actions before a policy chooses among them.
Bounds checking, overlap prevention, support requirements, and stackability are handled internally so the task evaluates planning quality rather than low-level physics bookkeeping.
Where:

- `N` is the number of objects in the episode
- `score(obj, pos)` is the mean `weighted_grid` value covered by the object at that position
- `L(layout)` is the total layout score across all objects
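The definitions above can be sketched in code. This is an illustrative implementation, assuming NumPy and that an object occupies an axis-aligned box of its dimensions with its corner at `pos`; rotation handling is omitted for brevity, and the function names are ours, not the project's.

```python
import numpy as np

def score(weighted_grid: np.ndarray,
          dims: tuple, pos: tuple) -> float:
    """Mean weighted_grid value over the cells an object of size `dims`
    covers when its corner sits at `pos`."""
    x, y, z = pos
    dx, dy, dz = dims
    region = weighted_grid[x:x + dx, y:y + dy, z:z + dz]
    return float(region.mean())

def layout_score(weighted_grid: np.ndarray, layout) -> float:
    """L(layout): total score across all placed objects,
    where layout is a sequence of (dims, pos) pairs."""
    return sum(score(weighted_grid, dims, pos) for dims, pos in layout)

# Tiny worked example on a 3x3x3 preference field holding 0..26.
wg = np.arange(27, dtype=float).reshape(3, 3, 3)
print(score(wg, (1, 1, 1), (0, 0, 0)))  # single cell -> wg[0, 0, 0] = 0.0
print(score(wg, (3, 3, 3), (0, 0, 0)))  # mean of 0..26 = 13.0
```

Under this reading, `place` is rewarded for raising `L(layout)` relative to the current state, and `adjust` for raising the single moved object's `score`.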
- `segment` rewards exact object identification and penalizes wrong or malformed submissions
- `adjust` rewards local score improvement and penalizes invalid or harmful moves
- `place` rewards global layout improvement and penalizes illegal full-layout proposals
The reward is meaningful across the trajectory, not only at the terminal step. Agents can detect partial progress, plateauing, and regressions from the reward stream and advisory feedback.
Task graders are implemented in `graders.py` and return normalized results strictly between 0 and 1.
- `grade_segment(...)` checks exact segmentation correctness
- `grade_adjust(...)` checks legal adjustment execution and progress
- `grade_adjust_progress(...)` assigns partial credit when the rollout ends before full completion
- `grade_place(...)` checks legality and global layout quality
- `grade_task(...)` dispatches to the appropriate task grader
For a fixed state and action payload, grading is deterministic. The environment itself resets to randomized layouts, but once a specific episode state exists, the same action produces the same grade and reward transitions.
Episode boundaries are task dependent and are designed to be sensible for each workflow:
- `segment`: ends when all objects are labeled correctly
- `adjust`: ends when no improving legal moves remain, no legal targets remain, or the task logic completes
- `place`: ends when a valid full layout is accepted and the task marks the episode done
Invalid actions do not silently pass:
- malformed payloads are penalized
- illegal moves are penalized
- incomplete or unsupported layouts are penalized
The baseline inference rollout is capped by `MAX_STEPS = 8` in `inference.py`.
The FastAPI app is defined in `server/app.py`.
Available endpoints:
- `POST /reset`
- `POST /step`
- `GET /state`
- `GET /schema`
- `WS /ws`
Typical interaction:
- call `POST /reset`
- inspect the task-relevant observation fields
- call `POST /step` with one `SorterAction`
- continue until `done=true`
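The interaction loop above can be sketched as a minimal HTTP client. This is an illustrative sketch using `requests` against the endpoints listed earlier; the fixed `adjust` payload and the exact observation schema are assumptions, and a real agent would derive each action from the latest observation.

```python
import requests

BASE = "http://localhost:8000"  # matches the app_port configured for this Space

def rollout(base: str = BASE, max_steps: int = 8) -> dict:
    """Minimal client loop: reset, then step with a placeholder action
    until the observation reports done=true or max_steps is reached."""
    obs = requests.post(f"{base}/reset", timeout=30).json()
    for _ in range(max_steps):
        if obs.get("done"):
            break
        # Placeholder action; a real agent would choose from the
        # observation's adjust_action_options instead.
        action = {"adjust": ["book", 0]}
        obs = requests.post(f"{base}/step", json=action, timeout=30).json()
    return obs

# Usage (with the server running): final_obs = rollout()
```

The `max_steps=8` default mirrors the `MAX_STEPS = 8` cap used by the baseline in `inference.py`.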
The baseline and submission flow use the following variables:
| Variable | Required | Purpose |
|---|---|---|
| `API_KEY` or `HF_TOKEN` or `OPENAI_API_KEY` | Yes | API key consumed by the OpenAI client |
| `API_BASE_URL` | Yes | Base URL for the LLM provider endpoint |
| `MODEL_NAME` | Yes | Model identifier used by the baseline |
Example `.env`:

```
API_KEY=your-api-key
API_BASE_URL=https://integrate.api.nvidia.com/v1
MODEL_NAME=openai/gpt-oss-120b
```

Install dependencies and start the server locally:

```
uv sync
uv run --project . server --host 0.0.0.0 --port 8000
```

Run the baseline agent:

```
python inference.py
```

Build:

```
docker build -t sorter .
```

Run:

```
docker run --rm -p 8000:8000 sorter
```

The baseline script is `inference.py` in the project root, as required by the submission rules. It emits structured stdout in the required `[START]`, `[STEP]`, and `[END]` format and uses the OpenAI client for model calls.
Baseline scores depend on the configured `API_BASE_URL`, `MODEL_NAME`, and credentials.
```
sorter/
├── config/                  # Grid and object configuration
├── model_types/             # Task-specific types
├── models/                  # Typed action, observation, and state models
├── server/                  # FastAPI app and OpenEnv environment
├── tasks/                   # Segment, adjust, and place task logic
├── utils/                   # Grid and reward helpers
├── client.py                # Client utilities
├── graders.py               # Deterministic task graders
├── inference.py             # Baseline agent
├── openenv.yaml             # OpenEnv metadata
├── Dockerfile               # Container build configuration
├── validate-submission.sh   # Validation script
└── README.md
```
NOTE: `config/objects.py` contains the actual list of objects. Modify it, or use one of the predefined objects, to try out the live server. The objects listed below are mere examples.
Example `segment` action:

```json
{
  "segment": {
    "book": [0, 0, 0, false],
    "bottle": [2, 1, 0, true]
  }
}
```

Example `adjust` action:

```json
{
  "adjust": ["book", 0]
}
```

Example `place` action:

```json
{
  "place": {
    "book": [0, 0, 0, false],
    "bottle": [2, 1, 0, true],
    "box": [4, 0, 0, false]
  }
}
```

Hugging Face Repository: https://huggingface.co/spaces/Jibrann/sorter
Hugging Face Space: https://jibrann-sorter.hf.space
GitHub Repository: https://github.com/jibcamun/The-Sorter-Project