This section provides a complete reference for all UI components, pages, tabs, buttons, dropdowns, and cards in AceML Studio.
- Sidebar Navigation: Main navigation panel to access all sections of the ML pipeline
- Top Bar: Shows current section and provides global actions
- Chat Assistant FAB: Quick access to AI chat assistant
- Dashboard: Overview of your ML project with key statistics and workflow visualization
- Upload Data: Import your dataset and configure target column
- Data Quality: Analyze dataset for issues and quality metrics
- Data Cleaning: Fix data quality issues identified in previous step
- Feature Engineering: Create new, more useful features from existing data
- Transformations: Convert data into ML-friendly formats
- Reduce Dimensions: Reduce number of features while preserving information
- Train Models: Train machine learning models on prepared data
- Evaluation: Assess model performance with detailed metrics
- Visualizations: Create charts to understand data and model performance
- Hyperparameter Tuning: Optimize model settings for better performance
- Experiment Tracking: Track and compare different ML experiments
- AI Insights: Get AI-powered analysis and recommendations
- Interactive AI chat for real-time help throughout ML workflow
- Context toggles: Logs, Data Summary, Tuning, Evaluation
- Remembers conversation history and provides personalized recommendations
- Save Experiment Modal: Save your current experiment for tracking
- Help Panels: Expandable panels with "What does this do?" buttons
- Control Descriptions: Tooltip-like guidance for specific controls
- Stat Cards: Large number with icon and label
- Pipeline Steps: Clickable boxes with icons
- Action Buttons: Color-coded for action type
- 1. Upload Data → 2. Data Quality → 3. Data Cleaning → 4. Feature Engineering → 5. Transformations → 6. Reduce Dimensions → 7. Train Models → 8. Evaluation → 9. Visualizations → 10. Tuning → 11. Experiments → 12. AI Insights
- First Time Users: Start with Dashboard, follow Quick Actions, use help buttons, try AI Chat
- Data Scientists: Use keyboard navigation, bulk operations, save experiments, use AI for suggestions
- Business Users: Focus on help panels, use AI Chat, review visualizations, trust default settings
- Desktop: Sidebar always visible, two-column layouts, large stat cards, interactive charts
- Tablet/Mobile: Collapsible sidebar, stacked layouts
This section provides a quick reference for file upload settings and flow in AceML Studio.
MAX_FILE_UPLOAD_SIZE_MB=256
CHUNK_SIZE_MB=5
LARGE_FILE_THRESHOLD_MB=50
USE_DB_FOR_LARGE_FILES=True
DB_FALLBACK_THRESHOLD_MB=500
Database Used: SQLite
- Location: uploads/large_files.db
- When Used: If a dataset's in-memory size is ≥ 500 MB (DB_FALLBACK_THRESHOLD_MB), it is stored in SQLite instead of RAM.
- How:
  - The backend creates a table per session in the SQLite database and stores the full DataFrame.
  - Metadata (session, filename, shape, dtypes, timestamp) is tracked in a metadata table.
  - Only a small sample (up to 1000 rows) is kept in memory for preview.
  - Retrieval is automatic: if the in-memory DataFrame is missing (e.g., after a server reload), it is reloaded from the database or, for smaller files, from disk.
- Relevant Code: See ml_engine/db_storage.py and app.py (_df() and upload logic).
For datasets < 500 MB:
- Data is kept in memory, but the file path is stored for recovery after server reloads.
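As a rough sketch of the storage path described above; the table and metadata-table names used here are illustrative, not the actual schema in ml_engine/db_storage.py:

```python
import sqlite3
import pandas as pd

def store_large_df(conn, session_id, filename, df, sample_rows=1000):
    """Store a full DataFrame in SQLite, keeping only a preview in memory."""
    # Full DataFrame goes into its own per-session table (name is illustrative).
    df.to_sql(f"df_{session_id}", conn, if_exists="replace", index=False)
    # Metadata (session, filename, shape, dtypes) is tracked in a separate table.
    conn.execute(
        """CREATE TABLE IF NOT EXISTS saved_metadata (
               session TEXT PRIMARY KEY, filename TEXT,
               rows INTEGER, cols INTEGER, dtypes TEXT)"""
    )
    conn.execute(
        "INSERT OR REPLACE INTO saved_metadata VALUES (?, ?, ?, ?, ?)",
        (session_id, filename, len(df), len(df.columns),
         ",".join(str(t) for t in df.dtypes)),
    )
    conn.commit()
    # Only a small sample is kept in memory for preview.
    return df.head(sample_rows)

conn = sqlite3.connect(":memory:")
df = pd.DataFrame({"a": range(5000), "b": range(5000)})
preview = store_large_df(conn, "sess1", "data.csv", df)
```

Retrieval would simply reverse the process with pd.read_sql whenever the in-memory DataFrame is missing.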
User selects file
        ↓
File size < LARGE_FILE_THRESHOLD_MB?
   ↓ YES                ↓ NO
Regular Upload      Chunked Upload
   ↓                    ↓
Load to DataFrame   Split → Upload → Reassemble
   ↓                    ↓
Memory size > DB_FALLBACK_THRESHOLD_MB?
   ↓ YES                ↓ NO
Store in Database   Store in Memory (with file path for reload)
   ↓                    ↓
   └────────────────────┘
            ↓
      Show Preview
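The decision flow above can be summarized as a tiny function. The threshold names mirror config.properties, but the function itself is an illustrative sketch, not the app's actual code:

```python
# Thresholds mirror config.properties defaults.
LARGE_FILE_THRESHOLD_MB = 50
DB_FALLBACK_THRESHOLD_MB = 500

def choose_strategy(file_size_mb, memory_size_mb):
    """Return (upload method, storage backend) for a dataset.

    Illustrative only: the real logic lives in app.py and static/js/app.js.
    """
    method = "regular" if file_size_mb < LARGE_FILE_THRESHOLD_MB else "chunked"
    storage = "sqlite" if memory_size_mb >= DB_FALLBACK_THRESHOLD_MB else "memory"
    return method, storage

print(choose_strategy(10, 30))    # small file: regular upload, in memory
print(choose_strategy(600, 900))  # very large: chunked upload, SQLite
```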
| File Size | Method | Storage | Upload Time* |
|---|---|---|---|
| 10 MB | Regular | Memory | < 1 sec |
| 75 MB | Chunked | Memory | 3-5 sec |
| 150 MB | Chunked | Memory | 5-10 sec |
| 600 MB | Chunked | SQLite DB | 15-30 sec |
* Times are approximate and depend on network and disk speed
- Small Dataset (< 50 MB): Single POST, in-memory, instant preview
- Large Dataset (50-500 MB): Chunked, in-memory, file path stored for reload, few seconds
- Very Large Dataset (≥ 500 MB in memory): Chunked, stored in SQLite, preview sample only
- Larger files: MAX_FILE_UPLOAD_SIZE_MB=512
- Smaller chunks: CHUNK_SIZE_MB=2
- Trigger chunking earlier: LARGE_FILE_THRESHOLD_MB=25
- Use database sooner: DB_FALLBACK_THRESHOLD_MB=200
| Error | Cause | Solution |
|---|---|---|
| "File size (XXX MB) exceeds maximum allowed (256 MB)" | File too large | Increase MAX_FILE_UPLOAD_SIZE_MB |
| "Upload incomplete - N chunks missing" | Network issue | Retry upload |
| "Invalid or expired uploadId" | Session timeout | Restart upload |
| "Failed to reassemble file" | Disk space issue | Free up disk space |
| "No file part in request" | Invalid request | Check file input |
- Frontend Console: Logs upload config
- Backend Logs: Tracks chunked upload, reassembly, completion
- Check database size, cleanup old data, manual reset (see ml_engine/db_storage.py)
- Each session's data is stored in a separate table in SQLite for isolation and efficient retrieval.
- Small file upload (< 50 MB) works
- Large file upload (50-256 MB) uses chunked upload
- Progress messages display correctly
- Database storage triggers for files > 500 MB in memory
- Error handling works (try invalid inputs)
- Configuration endpoint returns correct values
- UI banner shows for large datasets
python test_enhanced_upload.py
python test_chunked_upload.py
python -c "from config import Config; print(f'Max: {Config.MAX_FILE_UPLOAD_SIZE_MB} MB')"
- config.properties # Configuration settings
- config.py # Config loader
- app.py # Backend upload logic
- ml_engine/db_storage.py # Database storage (NEW)
- static/js/app.js # Frontend upload handling
- test_enhanced_upload.py # Test suite (NEW)
- Slow uploads? β Increase CHUNK_SIZE_MB
- Running out of memory? β Lower DB_FALLBACK_THRESHOLD_MB
- Network unreliable? β Decrease CHUNK_SIZE_MB
- Need faster loading? β Use Parquet format instead of CSV
- Database growing large? β Run periodic cleanup
This section summarizes the implementation of enhanced file upload features, chunked uploads, database storage, and exception handling.
- Configurable File Upload Limits: All upload parameters are configurable in config.properties and loaded by config.py.
- Database Storage for Large Files: Automatic SQLite storage for datasets exceeding the RAM threshold, with chunk-based retrieval and metadata tracking.
- Enhanced Chunked Upload System: Robust backend and frontend chunked upload, progress tracking, error handling, and database integration.
- Comprehensive Exception Handling: All upload and data processing endpoints have detailed error handling and user-friendly messages.
- UI Progress Indicators: Progress bar, status messages, and large dataset banners in the UI.
- Update config.properties for your system (see above for recommended settings)
- For low-memory, high-memory, and production environments, adjust limits accordingly
- Run python test_enhanced_upload.py for comprehensive test coverage
- Tests include configuration, endpoints, regular and chunked uploads, and error handling
- Persistent Storage: ml_engine/db_storage.py (SQLite logic), app.py (session and reload logic)
- Config: config.properties, config.py
- Frontend: static/js/app.js
- Tests: test_enhanced_upload.py
- GET /api/config/upload: Returns upload configuration
- POST /api/upload: Regular upload
- POST /api/upload/chunked/init: Chunked upload initialization
- POST /api/upload/chunked/chunk: Chunk reception
- POST /api/upload/chunked/complete: Upload finalization
- POST /api/upload/chunked/cancel: Upload cancellation
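As a sketch of how a client might drive the chunked endpoints: the chunk splitting below is runnable, while the HTTP calls are only indicated in comments (request and response shapes are assumptions, not the documented API contract):

```python
import io

CHUNK_SIZE_MB = 5  # mirrors the config.properties default

def iter_chunks(stream, chunk_mb=CHUNK_SIZE_MB):
    """Yield (index, bytes) pieces of a binary stream, chunk_mb at a time."""
    size = chunk_mb * 1024 * 1024
    index = 0
    while True:
        piece = stream.read(size)
        if not piece:
            break
        yield index, piece
        index += 1

# A client would then:
#   POST /api/upload/chunked/init      -> obtain an uploadId
#   POST /api/upload/chunked/chunk     -> one request per (index, piece)
#   POST /api/upload/chunked/complete  -> server reassembles the file
payload = io.BytesIO(b"x" * (12 * 1024 * 1024))  # stand-in for a 12 MB file
chunks = list(iter_chunks(payload))
```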
- Small files (< 50 MB): Single upload, < 1 second
- Medium files (50-500 MB): Chunked upload, 2-10 seconds, in-memory with file path for reload
- Large files (≥ 500 MB): Chunked upload, automatic SQLite DB storage
- Memory Usage: In-memory for small/medium, SQLite for large
- Database Storage: uploads/large_files.db, auto-cleanup of old data
- File type validation, file size limits, unique filenames, session isolation, automatic cleanup, error message sanitization
- Basic Upload (Small File):
- Frontend detects file size and chooses upload method
- Programmatic Chunked Upload:
- See Python example in summary above
- "File too large" error: Increase MAX_FILE_UPLOAD_SIZE_MB
- Out of memory: Enable USE_DB_FOR_LARGE_FILES, lower DB_FALLBACK_THRESHOLD_MB
- Chunked upload fails: Check disk space, logs, and network; use the cancel endpoint and retry
- Database file grows: Run cleanup, reduce the DB threshold, or delete the DB file
- Resume capability, parallel chunk upload, cloud storage, compression, progress webhooks, multiple file upload, file preview, validation rules
AceML Studio's upload and storage system provides:
- Configurable upload and chunking limits
- Automatic persistent storage for large datasets using SQLite
- In-memory storage for small/medium datasets, with file path fallback for server reloads
- Robust error handling and user feedback
- User-friendly progress and preview features
- All features tested and verified
- Save/Load: ✅ Working perfectly
- Schema Changes: ✅ Successfully adds/removes columns dynamically
- Multiple Datasets: ✅ Stores unlimited datasets
- Search/Filter: ✅ Full-text search by name, description, tags
- Metadata: ✅ Tracks rows, columns, dtypes, size, timestamps
- Zero Configuration: No separate server required
- File-Based: Single .db file, easy backup/transfer
- ACID Compliant: Safe transactions, data integrity
- Python Integration: Built-in sqlite3 module
- Performance: Fast for datasets up to several GB
- Handles Schema Changes: pandas to_sql() with if_exists='replace' works perfectly
- Minimal Dependencies: No external database installation needed
# Each dataset gets its own table
df.to_sql(table_name, conn, if_exists='replace', index=False)
# Metadata stored in saved_datasets table
# Schema changes handled by dropping/recreating tables

MongoDB
Pros:
- Schema-less: No predefined schema required
- Flexible documents: Each record can have different fields
- Powerful query language
- Horizontal scaling for very large datasets
- GridFS for large binary storage
Cons:
- Requires MongoDB server installation
- Additional dependency: pymongo
- Overkill for tabular data (DataFrames are inherently tabular)
- More complex setup and maintenance
- Larger memory footprint
- Not ideal for analytical queries on columns
Use Case Fit: ❌ Poor - DataFrames are tabular, not document-oriented
Redis
Pros:
- Extremely fast (in-memory)
- Simple key-value operations
- Supports complex data types (hashes, lists, sets)
- Pub/Sub for real-time features
Cons:
- Primarily in-memory (limited by RAM)
- Requires Redis server
- Not designed for large datasets
- Data persistence is secondary feature
- Poor for complex queries
- Not columnar - inefficient for DataFrame operations
Use Case Fit: ❌ Poor - Designed for caching, not dataset persistence
Parquet Files
Pros:
- EXCELLENT for DataFrames: Native pandas support
- Columnar storage: Very efficient for analytics
- Compression: Smaller file sizes (often 10-50% of CSV)
- Schema evolution: Can add/remove columns
- Fast column reads: Only read columns you need
- Preserves dtypes perfectly
- No server required
Cons:
- No built-in metadata management (need separate index)
- No transaction support
- Manual file management required
- Not a "database" - just file format
Implementation:
# Save
df.to_parquet('dataset_name.parquet', compression='snappy')
# Load
df = pd.read_parquet('dataset_name.parquet')
# Read specific columns only
df = pd.read_parquet('dataset_name.parquet', columns=['A', 'B'])

Use Case Fit: ✅ Excellent - Purpose-built for DataFrames
HDF5
Pros:
- Designed for scientific/ML datasets
- Fast I/O for large arrays
- Hierarchical structure: Store multiple datasets
- Compression support
- Native pandas support: HDFStore
Cons:
- Binary format (not human-readable)
- File locking issues on Windows
- Can become corrupted if not closed properly
- More complex than Parquet
Implementation:
# Save
df.to_hdf('datasets.h5', key='dataset_name', mode='w')
# Load
df = pd.read_hdf('datasets.h5', key='dataset_name')

Use Case Fit: ✅ Good - But Parquet is simpler and more modern
DuckDB
Pros:
- Optimized for analytics on DataFrames
- Columnar storage engine
- Fast queries on large datasets
- SQL interface
- Zero-config like SQLite
- Excellent pandas integration
- Can query Parquet files directly
Cons:
- Relatively new (less mature than SQLite)
- Additional dependency: duckdb
Implementation:
import duckdb
# Save
conn = duckdb.connect('datasets.duckdb')
conn.execute("CREATE TABLE dataset_name AS SELECT * FROM df")
# Load
df = conn.execute("SELECT * FROM dataset_name").df()
# Query
df = conn.execute("SELECT A, B FROM dataset_name WHERE A > 10").df()

Use Case Fit: ✅ Excellent - Modern analytical database for DataFrames
| Database | Schema Flexibility | Performance | Setup Complexity | Use Case Fit |
|---|---|---|---|---|
| SQLite (Current) | ✅ Good | ✅ Good | ✅ Excellent | ✅ Good |
| MongoDB | ✅ Excellent | ❌ Poor | ❌ Poor | ❌ Poor |
| Redis | ✅ Excellent | ✅ Excellent | ❌ Poor | ❌ Poor |
| Parquet Files | ✅ Excellent | ✅ Excellent | ✅ Excellent | ✅ Excellent |
| HDF5 | ✅ Good | ✅ Good | ✅ Good | ✅ Good |
| DuckDB | ✅ Excellent | ✅ Excellent | ✅ Excellent | ✅ Excellent |
Keep SQLite (Current)
Best for: Simple deployments, small to medium datasets (<1GB)
Why it works:
- Already implemented and tested
- Handles schema changes perfectly (test confirmed)
- No additional dependencies
- Easy to understand and maintain
- Sufficient for most ML workflows
Parquet + SQLite Hybrid
Best for: Large datasets, long-term storage, cloud deployments
Implementation:
# Save dataset as Parquet
df.to_parquet(f'datasets/{dataset_name}.parquet', compression='snappy')
# Save metadata in SQLite
metadata = {
'dataset_name': dataset_name,
'file_path': f'datasets/{dataset_name}.parquet',
'rows': len(df),
'columns': len(df.columns),
'size_mb': file_size / (1024**2),
# ... other metadata
}
# Store in SQLite saved_datasets table

Advantages:
- Parquet: Fast, efficient, schema-flexible, industry standard
- SQLite: Manages metadata, search, indexing
- Best of both worlds
- Easy migration from current implementation
DuckDB
Best for: Complex queries, data aggregation, multi-table joins
When to use:
- Need to run SQL queries on datasets
- Joining multiple datasets
- Complex filtering/aggregation before loading
- Working with datasets too large for memory
- Install: pip install pyarrow
- Add method: save_dataset_parquet()
- Metadata still in SQLite
- Backward compatible with current implementation
- Update save_dataset() to use Parquet
- Keep SQLite for metadata only
- Migration script for existing datasets
- For users needing SQL queries
- DuckDB can read Parquet directly
- No data duplication
Current SQLite implementation is solid ✅
The tests show it handles schema changes perfectly. The concern about "rigid schemas" doesn't apply because:
- We use the if_exists='replace' strategy
- Each dataset gets its own table
- Pandas handles all type conversions
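A small demonstration of why the replace strategy sidesteps schema rigidity:

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")

# First save: columns a and b.
v1 = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
v1.to_sql("dataset", conn, if_exists="replace", index=False)

# Re-save with b removed and c added: the table is dropped and
# recreated, so no migration or ALTER TABLE is needed.
v2 = pd.DataFrame({"a": [1, 2], "c": ["x", "y"]})
v2.to_sql("dataset", conn, if_exists="replace", index=False)

cols = [row[1] for row in conn.execute("PRAGMA table_info(dataset)")]
```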
For improved performance on large datasets:
Consider Parquet + SQLite hybrid approach:
- Parquet for data storage (10x faster, smaller files)
- SQLite for metadata/search
- Minimal code changes required
- Industry best practice for data science workflows
No need for MongoDB/Redis - they're not designed for tabular data and add unnecessary complexity.