This task scans the entire data source and classifies the values into known types.
The taskConfiguration field for this task accepts the following structure:
{
"localization": "ISO CODE",
"firstN": 100,
"identifiers": [ "IDENTIFIERS TO BE LOADED" ]
}
Where localization specifies the localization required for this classification process.
firstN limits the processing to the first N values. N must be a number greater than 0. If it is not specified, or if a value less than or equal to 0 is passed, the whole dataset is used. Similarly, if N is greater than the dataset size, no error is produced.
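The firstN rule can be sketched in a few lines of Python. This only illustrates the behaviour described here and is not DPT's implementation; the helper name `limit_values` is hypothetical.

```python
from itertools import islice
from typing import Iterable, Iterator, Optional, TypeVar

T = TypeVar("T")


def limit_values(values: Iterable[T], first_n: Optional[int] = None) -> Iterator[T]:
    """Yield at most first_n values; yield everything when first_n is missing or <= 0."""
    if first_n is None or first_n <= 0:
        # Unspecified or non-positive: the whole dataset is used.
        return iter(values)
    # islice stops quietly at the end of the data, so a first_n larger
    # than the dataset size does not raise an error.
    return islice(values, first_n)


rows = ["a", "b", "c"]
print(list(limit_values(rows, 2)))    # limited to the first two values
print(list(limit_values(rows, 0)))    # whole dataset
print(list(limit_values(rows, 100)))  # whole dataset, no error
```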
identifiers is an optional field that specifies the type identifiers enabled for this specific scan. If it is missing, the default identifiers are used.
This field is an array accepting two types of values:
- strings, each of which must be the fully qualified class name of a Java class. The default list of available identifiers is presented later in this document.
- objects, which define custom identifiers.
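As an illustration, the following Python sketch assembles a taskConfiguration that mixes both kinds of entries and serializes it to JSON. The custom identifier's label (`EMPLOYEE_ID`) and its pattern are made-up values; only the field names and the class name come from this document.

```python
import json

task_configuration = {
    "localization": "en-US",
    "firstN": 100,
    "identifiers": [
        # String entry: fully qualified class name of a built-in identifier.
        "com.ibm.research.drl.dpt.providers.identifiers.EmailIdentifier",
        # Object entry: custom identifier (hypothetical label and pattern).
        {
            "type": "REGEX",
            "providerType": "EMPLOYEE_ID",
            "regex": ["EMP-[0-9]{6}"],
        },
    ],
}

# Serialize the structure so it can be pasted into the task configuration.
config_text = json.dumps(task_configuration, indent=2)
print(config_text)
```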
What follows is the list of identifiers available with the command line version of DPT. The list can be extended with any class implementing the com.ibm.research.drl.dpt.providers.identifiers.Identifier interface that is available in the class path of the executing JVM.
| Full Name | Short Name |
|---|---|
| com.ibm.research.drl.dpt.providers.identifiers.AddressForFreeTextIdentifier | ADDRESS |
| com.ibm.research.drl.dpt.providers.identifiers.AddressIdentifier | ADDRESS |
| com.ibm.research.drl.dpt.providers.identifiers.AgeIdentifier | AGE |
| com.ibm.research.drl.dpt.providers.identifiers.AnimalSpeciesIdentifier | ANIMAL |
| com.ibm.research.drl.dpt.providers.identifiers.ATCIdentifier | ATC |
| com.ibm.research.drl.dpt.providers.identifiers.AustraliaAddressIdentifier | LOCATION |
| com.ibm.research.drl.dpt.providers.identifiers.CityIdentifier | CITY |
| com.ibm.research.drl.dpt.providers.identifiers.ContinentIdentifier | CONTINENT |
| com.ibm.research.drl.dpt.providers.identifiers.CountryIdentifier | COUNTRY |
| com.ibm.research.drl.dpt.providers.identifiers.CountyIdentifier | COUNTY |
| com.ibm.research.drl.dpt.providers.identifiers.CreditCardIdentifier | CREDIT_CARD |
| com.ibm.research.drl.dpt.providers.identifiers.CreditCardTypeIdentifier | CREDIT_CARD_TYPE |
| com.ibm.research.drl.dpt.providers.identifiers.DateTimeIdentifier | DATETIME |
| com.ibm.research.drl.dpt.providers.identifiers.DayIdentifier | DAY |
| com.ibm.research.drl.dpt.providers.identifiers.DependentIdentifier | DEPENDENT |
| com.ibm.research.drl.dpt.providers.identifiers.EmailIdentifier | |
| com.ibm.research.drl.dpt.providers.identifiers.FreeTextNamesIdentifier | NAME |
| com.ibm.research.drl.dpt.providers.identifiers.GenderIdentifier | GENDER |
| com.ibm.research.drl.dpt.providers.identifiers.HospitalIdentifier | HOSPITAL |
| com.ibm.research.drl.dpt.providers.identifiers.IBANIdentifier | IBAN |
| com.ibm.research.drl.dpt.providers.identifiers.ICDv9Identifier | ICDv9 |
| com.ibm.research.drl.dpt.providers.identifiers.IMEIIdentifier | IMEI |
| com.ibm.research.drl.dpt.providers.identifiers.IMSIIdentifier | IMSI |
| com.ibm.research.drl.dpt.providers.identifiers.InternationalPhoneIdentifier | PHONE |
| com.ibm.research.drl.dpt.providers.identifiers.IPAddressIdentifier | IP_ADDRESS |
| com.ibm.research.drl.dpt.providers.identifiers.ItalianFiscalCodeIdentifier | ITALIAN_FISCAL_CODE |
| com.ibm.research.drl.dpt.providers.identifiers.ItalianVATCodeIdentifier | ITALIAN_VAT |
| com.ibm.research.drl.dpt.providers.identifiers.LatitudeLongitudeIdentifier | LATITUDE_LONGITUDE |
| com.ibm.research.drl.dpt.providers.identifiers.LotusNotesIdentifier | PERSON |
| com.ibm.research.drl.dpt.providers.identifiers.MACAddressIdentifier | MAC_ADDRESS |
| com.ibm.research.drl.dpt.providers.identifiers.MaritalStatusIdentifier | MARITAL_STATUS |
| com.ibm.research.drl.dpt.providers.identifiers.MedicalPatternIdentifier | MRN |
| com.ibm.research.drl.dpt.providers.identifiers.MedicineIdentifier | MEDICINE |
| com.ibm.research.drl.dpt.providers.identifiers.MonthIdentifier | MONTH |
| com.ibm.research.drl.dpt.providers.identifiers.NameIdentifier | NAME |
| com.ibm.research.drl.dpt.providers.identifiers.NationalRegistrationIdentityCardIdentifier | NRIC |
| com.ibm.research.drl.dpt.providers.identifiers.NumericIdentifier | NUMERIC |
| com.ibm.research.drl.dpt.providers.identifiers.OccupationIdentifier | OCCUPATION |
| com.ibm.research.drl.dpt.providers.identifiers.OSIdentifier | OS-NAMES |
| com.ibm.research.drl.dpt.providers.identifiers.PatientIDIdentifier | |
| com.ibm.research.drl.dpt.providers.identifiers.PhoneIdentifier | PHONE |
| com.ibm.research.drl.dpt.providers.identifiers.POBOXIdentifier | ADDRESS |
| com.ibm.research.drl.dpt.providers.identifiers.RaceEthnicityIdentifier | RACE |
| com.ibm.research.drl.dpt.providers.identifiers.ReligionIdentifier | RELIGION |
| com.ibm.research.drl.dpt.providers.identifiers.SortCodeIdentifier | SORT_CODE |
| com.ibm.research.drl.dpt.providers.identifiers.SSNUKIdentifier | SSN_UK |
| com.ibm.research.drl.dpt.providers.identifiers.SSNUSIdentifier | SSN_US |
| com.ibm.research.drl.dpt.providers.identifiers.StatesUSIdentifier | STATES_US |
| com.ibm.research.drl.dpt.providers.identifiers.StreetTypeIdentifier | STREET_TYPES |
| com.ibm.research.drl.dpt.providers.identifiers.SWIFTCodeIdentifier | SWIFT |
| com.ibm.research.drl.dpt.providers.identifiers.UKPostCodeIdentifier | POSTCODE |
| com.ibm.research.drl.dpt.providers.identifiers.URLIdentifier | URL |
| com.ibm.research.drl.dpt.providers.identifiers.USPhoneIdentifier | PHONE |
| com.ibm.research.drl.dpt.providers.identifiers.VINIdentifier | VIN |
| com.ibm.research.drl.dpt.providers.identifiers.YOBIdentifier | YOB |
| com.ibm.research.drl.dpt.providers.identifiers.ZIPCodeIdentifier | ZIPCODE |
DPT allows the definition of custom identifiers. This is done by specifying the required identification strategy and customizing the identification patterns.
Currently, two types of custom identifiers are supported: regular expression based and dictionary based. The skeleton for defining a custom identifier is the following:
{
"type":"IDENTIFIER TYPE",
"providerType":"SEMANTIC TYPE"
}
Where IDENTIFIER TYPE is either REGEX or DICTIONARY, and providerType specifies the semantic type that the custom identifier returns when a value matches. In other words, the provider type is the "label" that the user wants to associate with this specific identifier.
If the type of custom identifier is REGEX the following structure can be used:
{
"type":"REGEX",
"providerType":"MY_TYPE",
"regex":[
"PATTERN1",
"PATTERN2"
]
}
Where regex is a list of patterns that follow the syntax of the Java Pattern class (java.util.regex.Pattern).
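One practical detail when writing such patterns by hand is that every backslash must be doubled inside a JSON string. The Python sketch below builds a REGEX identifier and lets `json.dumps` do the escaping. The two patterns are hypothetical examples. The `re.compile` call is only a rough sanity check: DPT interprets the patterns as Java Patterns, whose syntax overlaps with Python's for simple expressions like these but is not identical.

```python
import json
import re

# Hypothetical patterns for illustration; MY_TYPE is the placeholder label used above.
patterns = [r"\d{3}-\d{4}", r"[A-Z]{2}\d{6}"]

# Rough sanity check only: this uses Python's regex engine, not Java's.
for pattern in patterns:
    re.compile(pattern)

identifier = {"type": "REGEX", "providerType": "MY_TYPE", "regex": patterns}

# json.dumps escapes each backslash, so \d is written as \\d in the JSON text.
identifier_json = json.dumps(identifier, indent=2)
print(identifier_json)
```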
If the type of custom identifier is DICTIONARY the following structure can be used:
{
"type":"DICTIONARY",
"providerType":"MY_TYPE",
"terms":[
"TERM1",
"TERM2"
],
"ignoreCase": true
}
Where terms is a list of dictionary terms and ignoreCase is an optional boolean value specifying whether letter case is ignored when matching the terms.
Note that if the list of terms is too long, it is possible to replace terms with paths as follows:
{
"type":"DICTIONARY",
"providerType":"MY_TYPE",
"paths":[
"/my/file/path/file1.txt",
"/my/file/path/file2.txt"
],
"ignoreCase": true
}
Where paths is a list of file paths from which the terms will be loaded.
Each dictionary file is expected to contain a list of terms, one per line.
Duplicate terms are discarded. The files must be accessible from the JVM running the task.
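The file layout can be illustrated with a short Python sketch that writes a dictionary file with one term per line and reads it back while dropping duplicates. The helper `load_terms`, the file name, and the sample terms are hypothetical. How DPT treats blank lines or surrounding whitespace is not stated in this document, so skipping blank lines and comparing exact strings are assumptions of the sketch.

```python
import json
import tempfile
from pathlib import Path


def load_terms(paths):
    """Read one term per line from each file, keeping the first copy of each term."""
    terms = []
    seen = set()
    for path in paths:
        for line in Path(path).read_text(encoding="utf-8").splitlines():
            term = line.strip()
            # Assumption: blank lines are skipped and duplicates compared exactly.
            if term and term not in seen:
                seen.add(term)
                terms.append(term)
    return terms


# Write a small dictionary file, one term per line, with a deliberate duplicate.
dictionary_file = Path(tempfile.mkdtemp()) / "file1.txt"
dictionary_file.write_text("cardiology\nneurology\ncardiology\n", encoding="utf-8")

identifier = {
    "type": "DICTIONARY",
    "providerType": "MY_TYPE",
    "paths": [str(dictionary_file)],
    "ignoreCase": True,
}

terms = load_terms(identifier["paths"])
print(terms)
print(json.dumps(identifier, indent=2))
```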
For each processed file, a report file is generated. The report has the same file name as the original, so make sure that the input and output references do not point to the same file or folder.
The output is a dictionary with two sections: rawResults, which contains the raw counters of the identification process, and bestTypes, which contains the best matching type inferred from the raw counters.
The keys of each section are the field names. In rawResults, the value for each key is an array of the identified types; in bestTypes, it is a single entry. Each entry includes the data type as typeName and the number of instances found in the data as count.
For example, if the input file has two columns named f0 and f1 in the header, and their data types are NAME and EMAIL, the output will look like this:
{
"bestTypes" : {
"f0" : {
"typeName" : "NAME",
"count" : 66
},
"f1" : {
"typeName" : "EMAIL",
"count" : 60
}
},
"rawResults": {
"f0" : [ {
"typeName" : "NAME",
"count" : 66
} ],
"f1" : [
{ "typeName" : "EMAIL", "count" : 60 },
{ "typeName" : "EMPTY", "count" : 6 }
]
}
}
If no header is present (for example, if hasHeader is set to false for a CSV input), the columns are named "Column 0", "Column 1", and so on.
If no type is detected for a column, the type name is UNKNOWN.
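A report with this shape is straightforward to post-process. The Python sketch below parses the example report shown above, summarizes the best type per column, and collects the columns reported as UNKNOWN. The report text is embedded as a string to keep the sketch self-contained; in practice it would be read from the generated report file.

```python
import json

# Report text copied from the example in this document.
report_text = """
{
  "bestTypes": {
    "f0": {"typeName": "NAME", "count": 66},
    "f1": {"typeName": "EMAIL", "count": 60}
  },
  "rawResults": {
    "f0": [{"typeName": "NAME", "count": 66}],
    "f1": [
      {"typeName": "EMAIL", "count": 60},
      {"typeName": "EMPTY", "count": 6}
    ]
  }
}
"""

report = json.loads(report_text)

# Best type and count per column.
summary = {
    column: (best["typeName"], best["count"])
    for column, best in report["bestTypes"].items()
}

# Columns for which no type was detected.
unknown_columns = [
    column for column, (type_name, _) in summary.items() if type_name == "UNKNOWN"
]

for column, (type_name, count) in summary.items():
    # rawResults lists every candidate type that was counted for the column.
    candidates = [entry["typeName"] for entry in report["rawResults"].get(column, [])]
    print(f"{column}: best={type_name} count={count} candidates={candidates}")

print("Columns without a detected type:", unknown_columns)
```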
For the output to be formatted properly, the JVM must be set up to use UTF-8. There are two options:
a) use the -Dfile.encoding=UTF-8 option when invoking the toolkit
b) set the LC_ALL environment variable in the shell where the toolkit will run: export LC_ALL="en_US.UTF-8"
Example of config.json for the Identification of structured data (CSV):
{
"task":"Identification",
"inputFormat":"CSV",
"inputOptions":{
"fieldDelimiter":",",
"quoteChar":"\"",
"hasHeader":true,
"trimFields":true
},
"taskOptions":{
"localization":"en-US",
"firstN":10000,
"identificationPatterns":"",
"identificationLookup":"",
"piList":""
}
}
Example of config.json for the Identification of structured data (JSON):
{
"task":"Identification",
"extension":"json",
"inputFormat":"JSON",
"inputOptions":{
},
"taskOptions":{
"localization":"en-US"
}
}