
Memory consumed for a larger dataset is not freed post vegalite_to_png call #143

@manurampandit

Description


Issue
While converting a Vega-Lite spec to a PNG image with vegalite_to_png, a memory leak was observed.

Issue description
The overall memory usage inside the container increased and then settled at that level, i.e. it never decreased. Explicitly calling gc.collect() after dropping the variables involved did not help much.
Because this behavior is specific to a particular kind of data, a script that randomly generates similar data is included below.
The steps to reproduce list the two steps required. I will see if I can link a video recording showing the overall memory footprint increase.
My guess is that memory held during the conversion is not released back afterwards (or that there is a leak somewhere in the underlying native code).

Steps to reproduce:

  1. Generate the random data set with the following Python script:
import random
import pandas as pd

def generate_random_ref_id():
    return random.randint(10000000000, 99999999999)

def generate_random_text():
    letters = 'abcdefghijklmnopqrstuvwxyz'
    return ''.join(random.choice(letters) for _ in range(10))

def generate_random_number():
    return random.randint(1, 100)

def generate_row():
    param1 = f"(refId: {generate_random_ref_id()}, refIdR: '{generate_random_text()}:{generate_random_text()}:{generate_random_ref_id()}', paramA: None, Param2: None, time: {generate_random_ref_id()}, address: '{generate_random_text()}.{generate_random_text()}.{generate_random_text()}', param3: '{generate_random_text()}', context: '{generate_random_text()}', locP: b'\\x{generate_random_text()}\\x{generate_random_text()}', id: {generate_random_ref_id()}, paramC: None, paramD: None, paramE: None, paramF: '{generate_random_text()}-{generate_random_text()}', paramF: None, paramG: (ser: {generate_random_ref_id()}, serB: '{generate_random_text()}:{generate_random_text()}', locA: '{generate_random_text()}', locB: '{generate_random_text()}', cvasd: {generate_random_number()}, locD: '{generate_random_text()}', locEParam: '{generate_random_text()}.{generate_random_text()}.{generate_random_text()}-{generate_random_ref_id()}'), paramH: (context: '{generate_random_text()}'), ohterApp: (testLoc: '{generate_random_text()}:(cn{generate_random_text()}-{generate_random_ref_id()})', testV: '{generate_random_ref_id()}.{generate_random_ref_id()}.{generate_random_ref_id()}'), paramJ: None, paramK: '{generate_random_text()}:{generate_random_text()}-{generate_random_text()}-{generate_random_text()}', paramL: None, paramM: None, paramNnumbergoinghere: None, otherTimeDuration: {generate_random_ref_id()}, paramOsdfadfa: None, paramPsdfasdf: None, paramQsdasdfa: None)"

    param8 = f"(XId: '{generate_random_text()}-{generate_random_text()}-{generate_random_text()}-{generate_random_text()}-{generate_random_text()}', parama: None, paramB: None, paramC: '{generate_random_text()}', paramD: 'https://www.google.com/search?q=react+docs&otherparam={generate_random_text()}', paramE: '{generate_random_text()}', paramF: None, paramF: '{generate_random_text()}', paramG: 'https://www.google.com/search?q=react+docs&otherparam={generate_random_text()}', paramH: '{generate_random_text()} {generate_random_text()}. {generate_random_text()} ({generate_random_text()}, {generate_random_text()}) : {generate_random_text()}', paramH: None, paramI: None, paramJ: None, paramK: None, paramL: None, paramM: None)"

    param5 = random.choice(['some', 'thing'])

    param6 = random.choice(['other', 'sense'])
    param7 = random.choice(['cute', 'not_cute'])

    paramXsdfa = str(generate_random_ref_id())

    param9 = random.randint(-1, 10)  # -1 to 10
    param10 = "2024-10-10-08"

    row = [param1, param8, param5, param6, param7, {'paramXsdfa': paramXsdfa}, param9, param10]

    return row

# Generate 50,000 rows
data = []
for _ in range(50000):
    data.append(generate_row())

# Create a DataFrame
df = pd.DataFrame(data, columns=['param1', 'param8', 'param5', 'param6', 'param7', 'paramXsdfa', 'param9', 'param10'])

# Export to CSV
df.to_csv('git/generated_data_50k.csv', index=False)

  2. Run the code below to convert the spec to an image with vegalite_to_png:
import pandas as pd
from vl_convert import vegalite_to_png
import json
spec = '''
{
    "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
    "description": "A simple scatterplot chart with embedded data.",
    "width": "container",
    "height": "container",
    "selection": {
        "grid": {
            "type": "interval",
            "bind": "scales"
        }
    },
    "mark": "point",
    "encoding": {
        "x": {
            "field": "param1",
            "type": "nominal",
            "axis": {
                "labelAngle": 0
            },
            "sort": null
        },
        "y": {
            "field": "param5",
            "type": "nominal",
            "sort": null
        }
    },
    "data": {
        "values": [
            
        ]
    }
}
'''

vl_spec = json.loads(spec)
# the spec uses "container" sizing by default; fixed pixel dimensions are required for static image export
vl_spec['width'] = 800
vl_spec['height'] = 600

data = pd.read_csv('git/generated_data_50k.csv')

json_result = json.loads(data.to_json(orient='table'))
vl_spec['data'] = {'values': json_result['data']}


png_data = vegalite_to_png(vl_spec)
print(png_data)
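A single call only shows the end state; to watch the footprint grow, the conversion can be wrapped in a loop with a peak-RSS probe. Below is a minimal sketch using the stdlib resource module (Unix only); the convert() function is a stand-in marking where vegalite_to_png(vl_spec) would go:

```python
import resource
import sys

def peak_rss_mb():
    # Peak resident set size of this process; ru_maxrss is reported
    # in KiB on Linux and in bytes on macOS.
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return rss / (1024 * 1024) if sys.platform == "darwin" else rss / 1024

def convert():
    # Stand-in allocation; replace with vegalite_to_png(vl_spec)
    # to reproduce the reported growth.
    return b"\x89PNG" + bytes(1_000_000)

before = peak_rss_mb()
for _ in range(5):
    convert()
after = peak_rss_mb()
print(f"peak RSS: {before:.1f} MiB -> {after:.1f} MiB")
```

With the real call in place of the stand-in, a peak that keeps climbing across iterations is the symptom described above.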


Temporary workaround

  1. Create a file (named 'my_vl_process.py') that runs vegalite_to_png in a separate process:
import sys
from vl_convert import vegalite_to_png
import json

def main(vl_spec_file_path):
    with open(vl_spec_file_path, 'r') as file:
        vl_spec = json.load(file)
    
    # You can use the vl_spec dictionary as needed
    png_data = vegalite_to_png(vl_spec)
    print(png_data)

if __name__ == "__main__":
    if len(sys.argv) != 2:
        sys.exit("Usage: python my_vl_process.py <vl_spec_file_path>")
    
    vl_spec_file_path = sys.argv[1]
    main(vl_spec_file_path)
  2. Run the script above via Python's subprocess module:
import subprocess
import json
import tempfile

with tempfile.NamedTemporaryFile(mode='w', delete=False) as temp_file:
    json.dump(vl_spec, temp_file)

# Get the file path of the temporary file
vl_spec_file_path = temp_file.name

# Define the command you want to run as a list
command = ["python", "git/my_vl_process.py", vl_spec_file_path]  # command to run, with the temp spec file as argument

# Use subprocess.run() to run the command and capture its output
result = subprocess.run(command, stdout=subprocess.PIPE, text=True)

png_data = result.stdout
print(png_data)
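One caveat with the workaround as written: print(png_data) in the child emits the repr of a bytes object, and text=True makes the parent capture it as a string, so png_data here is not the raw PNG. A binary-safe sketch is below; the inline -c program is a stand-in for my_vl_process.py writing with sys.stdout.buffer.write(png_data):

```python
import subprocess
import sys

# Stand-in child program: writes raw PNG bytes to stdout, the way
# my_vl_process.py could via sys.stdout.buffer.write(png_data).
child_code = r"import sys; sys.stdout.buffer.write(b'\x89PNG-demo')"

# No text=True: stdout is captured as bytes, unmodified.
result = subprocess.run(
    [sys.executable, "-c", child_code],
    stdout=subprocess.PIPE,
    check=True,
)

png_data = result.stdout
print(png_data[:4])
```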

This approach sidesteps the memory problem by doing the conversion in a separate process: when that process exits, the operating system reclaims all of its memory, whereas in the default in-process case the leaked memory stays with the long-running interpreter.
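The same process isolation can also be had without a temp file by shipping the result back over a pipe with the stdlib multiprocessing module. A minimal sketch, with the body of _convert as a stand-in (in real use it would call vl_convert.vegalite_to_png):

```python
import multiprocessing as mp

def _convert(vl_spec, conn):
    # Stand-in for vl_convert.vegalite_to_png(vl_spec), so this sketch
    # runs without vl_convert installed; swap in the real call here.
    png = b"\x89PNG-stub-" + str(len(vl_spec)).encode()
    conn.send_bytes(png)
    conn.close()

def vegalite_to_png_isolated(vl_spec):
    """Run the conversion in a child process; its memory is freed on exit."""
    parent_conn, child_conn = mp.Pipe(duplex=False)
    proc = mp.Process(target=_convert, args=(vl_spec, child_conn))
    proc.start()
    child_conn.close()  # parent keeps only the read end
    png = parent_conn.recv_bytes()
    proc.join()
    return png

if __name__ == "__main__":
    png = vegalite_to_png_isolated({"mark": "point"})
    print(png[:4])
```

Because the child exits after each conversion, the OS reclaims everything it allocated, regardless of whether the underlying leak is ever fixed. (On Windows/macOS, where the spawn start method is the default, the module-level function and the __main__ guard shown here are required.)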
