5 changes: 3 additions & 2 deletions CHANGELOG
@@ -1,7 +1,8 @@
1.1.2
-----


* Add support for schema overrides on manual import. (#946)
* Fix openpyxl dependency bug. (#941)
* Fix db column-size bug for large uploads. (#945)

1.1.1
-----
71 changes: 66 additions & 5 deletions docs/manual_imports.rst
@@ -5,21 +5,21 @@ Manually importing large datasets
When you need to manually import
================================

The PANDA web interface may fail when you attempt to upload very large datasets. The exact size at which the uploads will fail depends on the specifics of your server (RAM size, in particular), but anything larger than 100MB may be a problem.
The PANDA web interface may fail when you attempt to upload very large datasets. The exact size at which the uploads will fail depends on the specifics of your server (RAM size, in particular), but anything larger than 100MB may be a problem. PANDA may also experience issues when re-indexing very large datasets for the purpose of enabling field-level search.

If you experience problems uploading large files, this document describes an alternative way of uploading them that bypasses the web interface. This method is much less convenient, but should be accessible for intermediate to advanced PANDA operators.
If you experience either of these problems, this document describes an alternative way of uploading data that bypasses the web interface. This method is much less convenient, but should be accessible for intermediate to advanced PANDA operators.

Uploading a file to your server
-------------------------------
===============================

Manually importing files is a two-step process. First you must upload them to your server, then you can execute the import process.

Uploading files your server requires using a command-line program called ``scp``. This program allows you to send a file to your server over :doc:`SSH <ssh>`. It may help to quickly review the :doc:`SSH <ssh>` documentation now. If you are on Mac/Linux, `scp` comes preinstalled. On Windows it comes as part of `Putty <http://docs.amazonwebservices.com/AWSEC2/latest/UserGuide/putty.html>`_. In either case, the command to upload your file will look like:
Uploading files to your server requires using a command-line program called ``scp``. This program allows you to send a file to your server over :doc:`SSH <ssh>`. It may help to quickly review the :doc:`SSH <ssh>` documentation now. If you are on Mac/Linux, `scp` comes preinstalled. On Windows it comes as part of `Putty <http://docs.amazonwebservices.com/AWSEC2/latest/UserGuide/putty.html>`_. In either case, the command to upload your file will look like:

``scp -i /path/to/my/ec2_key.pem /path/to/my/dataset.csv ubuntu@my_server_domain_name.com:/tmp/``

Executing the manual import
--------------------------
===========================

Once your file has finished copying to your PANDA server, you will need to SSH in to execute the manual import process. Refer to the :doc:`SSH <ssh>` documentation for instructions on how to SSH in. Once you're at the command line on your server, execute the following commands to import your file:

@@ -37,3 +37,64 @@ Once your file has finished copying to your PANDA server, you will need to SSH i
In the example, ``dataset.csv`` is the name of the file you uploaded (not including the path) and ``user@email.com`` is the login of the user you want to "own" the dataset.

Once this script returns, your file will be imported via the normal process and you can review its progress via the web interface. The dataset name and description will be set to the system defaults and should be updated in the web interface. From this point forward, the dataset should be indistinguishable from one uploaded via the normal process.


Enabling field search during bulk load
=======================================

PANDA may have trouble re-indexing "large" datasets, typically those with millions of rows or more. Re-indexing is performed when you add field-level search to a dataset after the initial import.
If you have trouble re-indexing a large dataset, you can supply the bulk import command with a schema override file that enables field-level search during the initial import.

.. code-block:: bash

sudo mv /tmp/dataset.csv /var/lib/panda/uploads/
sudo chown panda:panda /var/lib/panda/uploads/dataset.csv
cd /opt/panda
sudo -u panda -E python manage.py manual_import dataset.csv user@email.com -o /path/to/schema_overrides.csv


Schema override file format
----------------------------

The schema override file provides the ability to enable field-level search and customize the data types for any combination of fields. The override file should be a simple comma-separated CSV with two columns:

- **field name** (required) must precisely match the corresponding field name in the source data file (note: the match is case-sensitive!)
- **data type** (optional) must be one of the valid PANDA data types below; if omitted, PANDA falls back to its default type detection:

- unicode
- int
- float
- bool
- datetime
- date
- time

When defining a schema override file, it's a good idea to test against a smaller sample of data to ensure you have the correct column names and data types.
PANDA will often guess the right data type for a column based on a sampling of its data. However, this may not always work as expected;
for example, a salary field prefixed with a dollar sign will be treated as a string rather than interpreted as a float.

Experimenting with a subset of data will help identify such issues and suggest potential pre-processing steps that might be necessary prior
to final import (e.g. stripping a leading dollar sign from a currency field).

Once you've ironed out such kinks on the smaller data slice, you can apply the schema overrides to the full data set.
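One such pre-processing step can be sketched in a few lines of Python (the ``strip_leading_dollar`` helper and the file names are illustrative, not part of PANDA):

```python
import csv

def strip_leading_dollar(in_path, out_path, column):
    """Copy a CSV, removing a leading '$' from one column's values."""
    with open(in_path) as src, open(out_path, 'w') as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            # lstrip('$') removes any leading dollar signs, leaving digits intact
            row[column] = row[column].lstrip('$')
            writer.writerow(row)
```

For example, ``strip_leading_dollar('raw.csv', 'clean.csv', 'salary')`` would rewrite a ``$55000`` salary as ``55000`` before the file is uploaded.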

Below is a sample data set and schema override file.

.. code-block:: text

# my_sample_data.csv
name,birthdate,salary,zip
John,1990-01-01,55000,20007
Jane,1989-01-01,65000,20007

The related schema override file (below) would add indexes on *birthdate*, *salary* and *zip*.

.. code-block:: text

# schema_overrides.csv
birthdate,
salary,
zip,unicode

In this example, PANDA correctly assigns data types for *birthdate* and *salary*, so we can leave the data type column blank for those fields.
However, we explicitly specify *unicode* for zip code to ensure it is treated as a string rather than an integer.
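To see how these rows are interpreted, the parsing done by the ``manual_import`` command (in its ``_schema_overrides`` helper) can be sketched as follows; ``parse_overrides`` is an illustrative standalone name:

```python
import csv

# The PANDA type names listed earlier in this document
VALID_TYPES = {'unicode', 'int', 'float', 'bool', 'datetime', 'date', 'time'}

def parse_overrides(path):
    """Map each field name to an indexing flag, plus a type when one is given."""
    data = {}
    with open(path) as csvfile:
        for field, dtype in csv.reader(csvfile):
            # Every listed field gets field-level search enabled
            data[field] = {'indexed': True}
            # A non-empty, valid second column overrides the guessed type
            if dtype in VALID_TYPES:
                data[field]['type'] = dtype
    return data
```

Applied to the override file above, this yields ``{'birthdate': {'indexed': True}, 'salary': {'indexed': True}, 'zip': {'indexed': True, 'type': 'unicode'}}``.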
35 changes: 32 additions & 3 deletions panda/management/commands/manual_import.py
@@ -1,25 +1,37 @@
#!/usr/bin/env python

import csv
import os

from django.conf import settings
from django.core.management.base import BaseCommand
from django.utils.translation import ugettext as _
from livesettings import config_value

from optparse import make_option
from panda.models import Dataset, DataUpload, UserProxy
from panda.utils.typecoercion import TYPE_NAMES_MAPPING

class Command(BaseCommand):
    args = '<dataset_filename user_email>'
    help = _('Manually import data for when the web UI fails. See http://panda.readthedocs.org/en/latest/manual_imports.html')

    option_list = BaseCommand.option_list + (
        make_option('-o', '--schema_overrides',
            action='store',
            dest='overrides',
            # Interpolate after translation so the translatable string is stable
            help=_('Full path to CSV containing schema overrides. Field types: %s') % ', '.join(sorted(TYPE_NAMES_MAPPING.keys()))
        ),
    )

    def handle(self, *args, **options):
        if len(args) < 2:
            self.stderr.write(_('You must specify a filename and user.\n'))
            return

        filename = args[0]
        email = args[1]
        overrides = self._schema_overrides(options)

        path = os.path.join(settings.MEDIA_ROOT, filename)

@@ -42,16 +54,33 @@     def handle(self, *args, **options):
            creator=creator,
            dataset=None,
            encoding='utf-8')

        dataset = Dataset.objects.create(
            name=filename,
            creator=creator,
            initial_upload=upload)

        self.stdout.write('%s http://%s/#dataset/%s\n' % (_('Dataset created:'), config_value('DOMAIN', 'SITE_DOMAIN'), dataset.slug))

        dataset.import_data(creator, upload)
        dataset.import_data(creator, upload, schema_overrides=overrides)

        dataset.update_full_text()

        self.stdout.write(_('Import started. Check dataset page for progress.\n'))

    def _schema_overrides(self, opts):
        # optparse always supplies the 'overrides' key, with None when the
        # flag is omitted, so check the value rather than catching KeyError.
        fields_file = opts.get('overrides')

        if not fields_file:
            return {}

        # TODO: error-handling if file doesn't exist or is malformed
        valid_types = set(TYPE_NAMES_MAPPING.keys())
        data = {}

        with open(fields_file) as csvfile:
            for field, dtype in csv.reader(csvfile):
                # Activate indexing
                data[field] = {'indexed': True}

                # Update data type if provided and valid
                if dtype in valid_types:
                    data[field]['type'] = dtype

        return data