5 changes: 3 additions & 2 deletions CHANGELOG
@@ -1,7 +1,8 @@
1.1.2
-----


* Add support for schema overrides on manual import. (#946)
* Fix openpyxl dependency bug. (#941)
* Fix db column-size bug for large uploads. (#945)

1.1.1
-----
71 changes: 66 additions & 5 deletions docs/manual_imports.rst
@@ -5,21 +5,21 @@ Manually importing large datasets
When you need to manually import
================================

The PANDA web interface may fail when you attempt to upload very large datasets. The exact size at which the uploads will fail depends on the specifics of your server (RAM size, in particular), but anything larger than 100MB may be a problem.
The PANDA web interface may fail when you attempt to upload very large datasets. The exact size at which the uploads will fail depends on the specifics of your server (RAM size, in particular), but anything larger than 100MB may be a problem. PANDA may also experience issues when re-indexing very large datasets for the purpose of enabling field-level search.

If you experience problems uploading large files, this document describes an alternative way of uploading them that bypasses the web interface. This method is much less convenient, but should be accessible for intermediate to advanced PANDA operators.
If you experience either of these problems, this document describes an alternative way of uploading data that bypasses the web interface. This method is much less convenient, but should be accessible for intermediate to advanced PANDA operators.

Uploading a file to your server
-------------------------------
===============================

Manually importing files is a two-step process. First you must upload them to your server, then you can execute the import process.

Uploading files your server requires using a command-line program called ``scp``. This program allows you to send a file to your server over :doc:`SSH <ssh>`. It may help to quickly review the :doc:`SSH <ssh>` documentation now. If you are on Mac/Linux, `scp` comes preinstalled. On Windows it comes as part of `Putty <http://docs.amazonwebservices.com/AWSEC2/latest/UserGuide/putty.html>`_. In either case, the command to upload your file will look like:
Uploading files to your server requires using a command-line program called ``scp``. This program allows you to send a file to your server over :doc:`SSH <ssh>`. It may help to quickly review the :doc:`SSH <ssh>` documentation now. If you are on Mac/Linux, `scp` comes preinstalled. On Windows it comes as part of `Putty <http://docs.amazonwebservices.com/AWSEC2/latest/UserGuide/putty.html>`_. In either case, the command to upload your file will look like:

``scp -i /path/to/my/ec2_key.pem /path/to/my/dataset.csv ubuntu@my_server_domain_name.com:/tmp/``

Executing the manual import
--------------------------
===========================

Once your file has finished copying to your PANDA server, you will need to SSH in to execute the manual import process. Refer to the :doc:`SSH <ssh>` documentation for instructions on how to SSH in. Once you're at the command line on your server, execute the following commands to import your file:

@@ -37,3 +37,64 @@ Once your file has finished copying to your PANDA server, you will need to SSH i
In the example, ``dataset.csv`` is the name of the file you uploaded (not including the path) and ``user@email.com`` is the login of the user you want to "own" the dataset.

Once this script returns, your file will be imported via the normal process and you can review its progress via the web interface. The dataset name and description will be set to the system defaults and should be updated in the web interface. From this point forward, the dataset should be indistinguishable from one uploaded via the normal process.


Enabling field search during bulk load
=======================================

PANDA may have trouble re-indexing "large" datasets, typically those with millions of rows or more. Re-indexing is performed when you add field-level search to a dataset after the initial import.
If you have trouble re-indexing a large dataset, you can supply the bulk import command with a schema override file that enables field-level search during the initial import.

.. code-block:: bash

sudo mv /tmp/dataset.csv /var/lib/panda/uploads/
sudo chown panda:panda /var/lib/panda/uploads/dataset.csv
cd /opt/panda
sudo -u panda -E python manage.py manual_import dataset.csv user@email.com -o /path/to/schema_overrides.csv


Schema override file format
----------------------------

The schema override file provides the ability to enable field-level search and customize the data types for any combination of fields. The override file should be a simple comma-separated CSV with two columns:

- **field name** (required) must precisely match the corresponding field name in the source data file (note: the match is case-sensitive!)
- **data type** (optional) must be one of the valid PANDA data types below; if omitted, PANDA falls back to its default type detection:

- unicode
- int
- float
- bool
- datetime
- date
- time

When defining a schema override file, it's a good idea to test against a smaller sample of data to ensure you have the correct column names and data types.
PANDA will often guess the right data type for a column based on a sampling of its data. However, this may not always work as expected;
for example, a salary field prefixed with a dollar sign will be treated as a string rather than interpreted as a float.

Experimenting with a subset of data will help identify such issues and suggest potential pre-processing steps that might be necessary prior
to final import (e.g. stripping a leading dollar sign from a currency field).

Once you've ironed out such kinks on the smaller data slice, you can apply the schema overrides to the full data set.
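One such pre-processing step can be sketched in a few lines of Python (the ``strip_leading_dollar`` helper and the file names are illustrative, not part of PANDA):

```python
import csv

def strip_leading_dollar(in_path, out_path, column):
    """Copy a CSV, removing a leading '$' from one column's values."""
    with open(in_path) as src, open(out_path, 'w') as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            # lstrip('$') removes any leading dollar signs, leaving digits intact
            row[column] = row[column].lstrip('$')
            writer.writerow(row)
```

For example, ``strip_leading_dollar('raw.csv', 'clean.csv', 'salary')`` would rewrite a ``$55000`` salary as ``55000`` before the file is uploaded.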

Below is a sample data set and schema override file.

.. code-block:: text

# my_sample_data.csv
name,birthdate,salary,zip
John,1990-01-01,55000,20007
Jane,1989-01-01,65000,20007

The related schema override file (below) would add indexes on *birthdate*, *salary* and *zip*.

.. code-block:: text

# schema_overrides.csv
birthdate,
salary,
zip,unicode

In this example, PANDA correctly assigns data types for *birthdate* and *salary*, so we can leave the data type column blank for those fields.
However, we explicitly specify *unicode* for zip code to ensure it is treated as a string rather than an integer.
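To see how these rows are interpreted, the parsing done by the ``manual_import`` command (in its ``_schema_overrides`` helper) can be sketched as follows; ``parse_overrides`` is an illustrative standalone name:

```python
import csv

# The PANDA type names listed earlier in this document
VALID_TYPES = {'unicode', 'int', 'float', 'bool', 'datetime', 'date', 'time'}

def parse_overrides(path):
    """Map each field name to an indexing flag, plus a type when one is given."""
    data = {}
    with open(path) as csvfile:
        for field, dtype in csv.reader(csvfile):
            # Every listed field gets field-level search enabled
            data[field] = {'indexed': True}
            # A non-empty, valid second column overrides the guessed type
            if dtype in VALID_TYPES:
                data[field]['type'] = dtype
    return data
```

Applied to the override file above, this yields ``{'birthdate': {'indexed': True}, 'salary': {'indexed': True}, 'zip': {'indexed': True, 'type': 'unicode'}}``.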
35 changes: 32 additions & 3 deletions panda/management/commands/manual_import.py
@@ -1,25 +1,37 @@
#!/usr/bin/env python

import csv
import os

from django.conf import settings
from django.core.management.base import BaseCommand
from django.utils.translation import ugettext as _
from livesettings import config_value

from optparse import make_option
from panda.models import Dataset, DataUpload, UserProxy
from panda.utils.typecoercion import TYPE_NAMES_MAPPING

class Command(BaseCommand):
    args = '<dataset_filename user_email>'
    help = _('Manually import data for when the web UI fails. See http://panda.readthedocs.org/en/latest/manual_imports.html')

    option_list = BaseCommand.option_list + (
        make_option('-o', '--schema_overrides',
            action='store',
            dest='overrides',
            # Interpolate after translation so the translatable string is stable
            help=_('Full path to CSV containing schema overrides. Field types: %s') % ', '.join(sorted(TYPE_NAMES_MAPPING.keys()))
        ),
    )

    def handle(self, *args, **options):
        if len(args) < 2:
            self.stderr.write(_('You must specify a filename and user.\n'))
            return

        filename = args[0]
        email = args[1]
        overrides = self._schema_overrides(options)

        path = os.path.join(settings.MEDIA_ROOT, filename)

@@ -42,16 +54,33 @@     def handle(self, *args, **options):
            creator=creator,
            dataset=None,
            encoding='utf-8')

        dataset = Dataset.objects.create(
            name=filename,
            creator=creator,
            initial_upload=upload)

        self.stdout.write('%s http://%s/#dataset/%s\n' % (_('Dataset created:'), config_value('DOMAIN', 'SITE_DOMAIN'), dataset.slug))

        dataset.import_data(creator, upload)
        dataset.import_data(creator, upload, schema_overrides=overrides)

        dataset.update_full_text()

        self.stdout.write(_('Import started. Check dataset page for progress.\n'))

    def _schema_overrides(self, opts):
        # optparse always supplies the 'overrides' key, with None when the
        # flag is omitted, so check the value rather than catching KeyError.
        fields_file = opts.get('overrides')

        if not fields_file:
            return {}

        # TODO: error-handling if file doesn't exist or is malformed
        valid_types = set(TYPE_NAMES_MAPPING.keys())
        data = {}

        with open(fields_file) as csvfile:
            for field, dtype in csv.reader(csvfile):
                # Activate indexing
                data[field] = {'indexed': True}

                # Update data type if provided and valid
                if dtype in valid_types:
                    data[field]['type'] = dtype

        return data