Working with Larger Files

If you have very large input files, it may make sense to break them up into smaller ones to reduce the likelihood of memory issues.

dgraphpandas can break up exports into chunks via the CLI tool. If you are using the module directly, the example below shows how to use pandas to break up your file.

Command Line

In the CLI, the chunk_size parameter sets an upper limit on the number of input lines processed at a time.

python -m dgraphpandas \
  -c samples/netflix/dgraphpandas.json \
  -ck title -f samples/netflix/input/netflix_titles.csv \
  -o samples/netflix/output \
  --chunk_size 1000

When you pass this option, at most chunk_size lines are pushed through the RDF generation logic at a time, and each chunk is written to its own indexed output file. For example:

❯ ls -la samples/netflix/output/
total 12M
drwxr-xr-x 2 kiran kiran 4.0K Apr  4 18:13 .
drwxr-xr-x 6 kiran kiran 4.0K Apr  4 16:45 ..
-rw-r--r-- 1 kiran kiran 706K Apr  4 18:13 netflix_titles_edges.gz
-rw-r--r-- 1 kiran kiran 706K Apr  4 18:13 netflix_titles_edges_2.gz
-rw-r--r-- 1 kiran kiran 706K Apr  4 18:13 netflix_titles_edges_3.gz
-rw-r--r-- 1 kiran kiran 706K Apr  4 18:13 netflix_titles_edges_4.gz
-rw-r--r-- 1 kiran kiran 706K Apr  4 18:13 netflix_titles_edges_5.gz
-rw-r--r-- 1 kiran kiran 706K Apr  4 18:13 netflix_titles_edges_6.gz
-rw-r--r-- 1 kiran kiran 706K Apr  4 18:13 netflix_titles_edges_7.gz
-rw-r--r-- 1 kiran kiran 706K Apr  4 18:13 netflix_titles_edges_8.gz
-rw-r--r-- 1 kiran kiran 701K Apr  4 18:13 netflix_titles_intrinsic.gz
-rw-r--r-- 1 kiran kiran 701K Apr  4 18:13 netflix_titles_intrinsic_2.gz
-rw-r--r-- 1 kiran kiran 701K Apr  4 18:13 netflix_titles_intrinsic_3.gz
-rw-r--r-- 1 kiran kiran 701K Apr  4 18:13 netflix_titles_intrinsic_4.gz
-rw-r--r-- 1 kiran kiran 701K Apr  4 18:13 netflix_titles_intrinsic_5.gz
-rw-r--r-- 1 kiran kiran 701K Apr  4 18:13 netflix_titles_intrinsic_6.gz
-rw-r--r-- 1 kiran kiran 701K Apr  4 18:13 netflix_titles_intrinsic_7.gz
-rw-r--r-- 1 kiran kiran 701K Apr  4 18:13 netflix_titles_intrinsic_8.gz
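
The listing shows eight files of each kind. Assuming one output file per chunk, that is what a chunk_size of 1000 produces for any input of 7,001 to 8,000 data rows. The arithmetic is a ceiling division, sketched below with a hypothetical helper name (expected_chunks is not part of dgraphpandas):

```python
def expected_chunks(row_count: int, chunk_size: int) -> int:
    """Number of chunks needed to cover row_count rows, chunk_size rows at a time.

    Hypothetical helper for illustration; not part of dgraphpandas.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    # Integer ceiling division: the last chunk may be smaller than chunk_size.
    return (row_count + chunk_size - 1) // chunk_size


print(expected_chunks(7500, 1000))  # 8
print(expected_chunks(8000, 1000))  # 8
print(expected_chunks(8001, 1000))  # 9
```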

You can then take these exports and live load them as normal.
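
Because a chunked export produces many files, it can help to collect them in a predictable order before loading. The following is a minimal sketch, not part of dgraphpandas: the helper names are hypothetical, the file name pattern is inferred from the listing above, and the dgraph live invocation deliberately leaves out any connection or schema flags your cluster needs.

```python
import re
from pathlib import Path
from typing import List


def chunk_number(path: Path) -> int:
    """Chunk index encoded in the file name; an unsuffixed file counts as chunk 1."""
    match = re.search(r"_(\d+)\.gz$", path.name)
    return int(match.group(1)) if match else 1


def ordered_exports(output_dir: str, kind: str) -> List[Path]:
    """Collect '*_<kind>*.gz' files (kind is 'intrinsic' or 'edges'), sorted by chunk."""
    return sorted(Path(output_dir).glob(f"*_{kind}*.gz"), key=chunk_number)


def live_commands(output_dir: str) -> List[List[str]]:
    """Build one 'dgraph live' command per export file.

    Assumption: only --files is set here; add the flags your setup requires.
    """
    commands = []
    for kind in ("intrinsic", "edges"):
        for path in ordered_exports(output_dir, kind):
            commands.append(["dgraph", "live", "--files", str(path)])
    return commands


if __name__ == "__main__":
    for command in live_commands("samples/netflix/output"):
        print(" ".join(command))
```

Sorting on the numeric suffix rather than the raw name keeps chunk 10 after chunk 2. Putting intrinsic files before edges is only a choice made for this sketch. Each printed command can be run by hand or passed to subprocess.run.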

Module

The chunk_size parameter is also available on to_rdf. If you also provide output_dir and export_rdf, the output will automatically be written to export files on disk.

For example:

import dgraphpandas as dpd

dpd.to_rdf('your_input.csv', config, 'your_input_key', output_dir='.', export_rdf=True, chunk_size=1000)

If you want more control, you can call the underlying methods directly. The transform methods accept a DataFrame, so you can chunk the input yourself before passing it in.

import pandas as pd

from dgraphpandas.strategies.horizontal import horizontal_transform
from dgraphpandas.writers.upserts import generate_upserts

# dgraphpandas_config is your dgraphpandas configuration, defined elsewhere.
# Each chunk is only read into memory when the loop reaches it.
for index, frame in enumerate(pd.read_csv('your_input.csv', chunksize=1000)):

  # Generate for this Chunk
  intrinsic, edges = horizontal_transform(frame, dgraphpandas_config, 'your_input_key')

  # Generate Rdf Upserts for this Chunk
  intrinsic_upserts, edges_upserts = generate_upserts(intrinsic, edges)

  # Then you can do whatever you want with these before the next iteration
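
One thing you might do with each chunk is write it to its own compressed file, so that nothing accumulates in memory across iterations. The sketch below uses only the standard library. The helpers chunk_path and write_chunk are hypothetical, the naming scheme simply mirrors the CLI listing earlier on this page, and it assumes each upserts value can be iterated as RDF statement strings (check the return type of generate_upserts in your installed version).

```python
import gzip
from pathlib import Path
from typing import Iterable


def chunk_path(output_dir: str, stem: str, index: int) -> Path:
    """File for the zero-based chunk index: 'stem.gz', then 'stem_2.gz', 'stem_3.gz', ..."""
    suffix = "" if index == 0 else f"_{index + 1}"
    return Path(output_dir) / f"{stem}{suffix}.gz"


def write_chunk(lines: Iterable[str], path: Path) -> int:
    """Write one statement per line to a gzip file and return the number written.

    Assumption: `lines` yields RDF statements as plain strings.
    """
    path.parent.mkdir(parents=True, exist_ok=True)
    count = 0
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        for line in lines:
            # Normalise line endings so each statement sits on its own line.
            handle.write(line.rstrip("\n") + "\n")
            count += 1
    return count
```

Inside the loop above, this would be called as write_chunk(intrinsic_upserts, chunk_path('.', 'your_input_intrinsic', index)), and likewise for edges_upserts with a stem such as 'your_input_edges'.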