Dedicated CSV handler with json -> CSV conversion #611

Open

5cover wants to merge 3 commits into p2r3:master from
The current json handler supports JSON to CSV, but the resulting CSV is not as readable as it could be (see the comparisons below).

I implemented a dedicated `csv` handler which accepts JSON data and can be extended to support more input formats in the future. The handler tries to expand nested structures into meaningful columns, avoiding JSON blobs unless strictly necessary.
Principle: JSON is flattened to primitive leaf paths, then projected into a single CSV table by splitting each path into a row fragment and a column fragment.
The converter tries prefix-based and suffix-based global split strategies, scores them by the number of distinct rows plus columns, and picks the most compact readable result.
Comparisons
Simple
Input JSON:
```json
{ "A": { "a": 1, "b": 2 }, "B": { "a": 3, "b": 4 } }
```

Current CSV output:
CSV handler output:
Nested object
Input JSON:
```json
{ "a": [ {} ], "b": [ [] ], "c": [ [{}] ] }
```

Current CSV output:
CSV handler output:
Orders
Input JSON:
```json
{
  "order_1001": {
    "customer": { "name": "Iris Market", "tier": "gold" },
    "shipping": { "city": "Berlin", "country": "DE" },
    "totals": { "subtotal": 120.5, "tax": 22.9, "grand": 143.4 },
    "state": "paid"
  },
  "order_1002": {
    "customer": { "name": "Northwind Labs", "tier": "silver" },
    "shipping": { "city": "Paris", "country": "FR" },
    "totals": { "subtotal": 80, "tax": 16, "grand": 96 },
    "state": "paid"
  },
  "order_1003": {
    "customer": { "name": "Sun Harbor", "tier": "gold" },
    "pickup": { "store": "AMS-04", "window": "10:00-12:00" },
    "totals": { "subtotal": 48, "tax": 0, "grand": 48 },
    "state": "pickup"
  }
}
```

Current CSV output:
CSV handler output:
package.json
Input JSON:
```json
{ "name" : "p2r3-convert", "productName" : "Convert to it!", "author" : "PortalRunner", "description" : "Truly universal browser-based file converter", "private" : true, "version" : "0.0.0", "type" : "module", "main" : "src/electron.cjs", "scripts" : { "dev": "vite", "build": "tsc && vite build", "cache:build": "bun run buildCache.js dist/cache.json --minify", "cache:build:dev": "bun run buildCache.js dist/cache.json", "preview": "vite preview", "docker": "bun run docker:build && bun run docker:up", "docker:build": "docker compose -f docker/docker-compose.yml -f docker/docker-compose.override.yml build --build-arg VITE_COMMIT_SHA=8bdb272720d0b62bd3baab1ee2e7146b6b84a692", "docker:up": "docker compose -f docker/docker-compose.yml -f docker/docker-compose.override.yml up -d", "desktop:build": "tsc && IS_DESKTOP=true vite build && bun run cache:build", "desktop:preview": "electron .", "desktop:start": "bun run desktop:build && bun run desktop:preview", "desktop:dist:win": "bun run desktop:build && electron-builder --win --publish never", "desktop:dist:mac": "bun run desktop:build && electron-builder --mac --publish never", "desktop:dist:linux": "bun run desktop:build && electron-builder --linux --publish never" }, "build" : { "appId" : "com.p2r3.convert", "directories": {"output": "release"}, "files" : ["dist/**/*", "src/electron.cjs"], "win" : {"target": "nsis"}, "mac" : {"target": "dmg"}, "linux" : {"target": "AppImage"} }, "devDependencies": { "@types/hjson" : "^2.4.6", "@types/jszip" : "^3.4.0", "@types/msgpack" : "^0.0.34", "@types/opentype.js" : "^1.3.9", "electron" : "^40.6.0", "electron-builder" : "^26.8.1", "puppeteer" : "^24.36.0", "typescript" : "~5.9.3", "vite" : "^7.2.4", "vite-tsconfig-paths": "^6.0.5" }, "dependencies" : { "@ably/msgpack-js" : "^0.4.1", "@bjorn3/browser_wasi_shim": "^0.4.2", "@bokuweb/zstd-wasm" : "^0.0.27", "@ffmpeg/core" : "^0.12.10", "@ffmpeg/ffmpeg" : "^0.12.15", "@ffmpeg/util" : "^0.12.2", "@flo-audio/reflo" : "^0.1.2", "@imagemagick/magick-wasm" : "^0.0.37", "@shelacek/ubjson" : "^1.1.1", "@sqlite.org/sqlite-wasm" : "^3.51.2-build6", "@stringsync/vexml" : "^0.1.8", "@toon-format/toon" : "^2.1.0", "@types/bun" : "^1.3.9", "@types/meyda" : "^5.3.0", "@types/pako" : "^2.0.4", "@types/papaparse" : "^5.5.2", "@types/three" : "^0.182.0", "bson" : "^7.2.0", "cbor" : "^10.0.12", "hjson" : "^3.2.2", "imagetracer" : "^0.2.2", "js-synthesizer" : "^1.11.0", "json6" : "^1.0.3", "jsonl-parse-stringify" : "^1.0.3", "jszip" : "^3.10.1", "meyda" : "^5.6.3", "mime" : "^4.1.0", "nanotar" : "^0.3.0", "nbtify" : "^2.2.0", "opentype.js" : "^1.3.4", "pako" : "^2.1.0", "papaparse" : "^5.5.3", "pdf-parse" : "^2.4.5", "pdftoimg-js" : "^0.2.5", "pe-library" : "^2.0.1", "svg-pathdata" : "^8.0.0", "three" : "^0.182.0", "three-bvh-csg" : "^0.0.17", "three-mesh-bvh" : "^0.9.8", "tiny-jsonc" : "^1.0.2", "ts-flp" : "^1.0.3", "verovio" : "^6.0.1", "vexflow" : "^5.0.0", "vite-plugin-static-copy" : "^3.1.6", "wavefile" : "^11.0.0", "woff2-encoder" : "^2.0.0", "xml2js" : "^0.6.2", "xz-decompress" : "^0.2.3", "yaml" : "^2.8.2" } }
```

Current CSV output:
CSV handler output:
Detailed explanation
The JSON → CSV conversion treats JSON as a tree of primitive values and tries to project it into a single 2D table.
Since CSV is untyped, "5" is indistinguishable from 5, making the conversion lossy.
It always succeeds, because every primitive value can always be represented as one row in a fallback `key,value` shape.

The algorithm has two phases:
1. Flatten JSON into primitive paths
The input JSON is traversed recursively.
Every primitive leaf becomes a `(path, value)` pair.
Example:
```json
{ "build": { "appId": "com.p2r3.convert", "directories": { "output": "release" } } }
```

becomes:

- `build.appId` → `"com.p2r3.convert"`
- `build.directories.output` → `"release"`

Arrays are handled the same way, using indices as path segments.
Only primitive leaves (including the empty object / empty array) are emitted. Objects and arrays are traversed, not emitted directly.
Cycles are rejected.
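A minimal TypeScript sketch of this flattening step (names like `flatten` and `Leaf` are illustrative, not the handler's actual API):

```typescript
type Leaf = { path: (string | number)[]; value: unknown };

function flatten(node: unknown, path: (string | number)[] = [], seen = new Set<object>()): Leaf[] {
  if (node !== null && typeof node === "object") {
    if (seen.has(node)) throw new Error("cycle detected"); // cycles are rejected
    const entries = Array.isArray(node)
      ? node.map((v, i) => [i, v] as const) // array indices become path segments
      : Object.entries(node);
    // Empty objects and empty arrays are themselves emitted as primitive leaves
    if (entries.length === 0) return [{ path, value: node }];
    seen.add(node);
    const leaves = entries.flatMap(([k, v]) => flatten(v, [...path, k], seen));
    seen.delete(node);
    return leaves;
  }
  return [{ path, value: node }]; // primitive leaf
}
```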
2. Encode paths as strings
Paths are stored as arrays internally, but for row/column labels they are encoded as strings with escaping.

This allows arbitrary property names, including names containing `.` or `\`, so the algorithm can work on path fragments safely without losing path identity.
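A minimal sketch of such an encoder/decoder, assuming `.` as the separator and `\` as the escape character (function names are hypothetical):

```typescript
// Escape "\" and "." inside a segment, then join segments with "."
function encodePath(path: string[]): string {
  return path.map(seg => seg.replace(/\\/g, "\\\\").replace(/\./g, "\\.")).join(".");
}

// Split on unescaped ".", treating "\x" as a literal "x"
function decodePath(encoded: string): string[] {
  const segments: string[] = [];
  let current = "";
  for (let i = 0; i < encoded.length; i++) {
    const c = encoded[i];
    if (c === "\\") current += encoded[++i];           // take next char literally
    else if (c === ".") { segments.push(current); current = ""; }
    else current += c;
  }
  segments.push(current);
  return segments;
}
```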
Examples:
- `a.b` → `['a', 'b']`
- `a\.b` → `['a.b']`
- `a\\.b` → `['a\', 'b']`
- `a\\\.b` → `['a\.b']`

3. Search for a table split
Each primitive path must be split into:
`row_fragment . column_fragment`

The value is then placed in the CSV cell at `(row_fragment, column_fragment)`.

The algorithm does not search all possible splits. Instead, it searches two constrained families of splits.
Both families are tried for all `k` from `1` to `max_path_length - 1`.

A. Prefix splits

Choose a global `k`. For each path:

- the first `k` segments become the row fragment
- if the path is too short, at least one segment is still left on the column side

This means rows group paths that share the same `k`-segment prefix.
B. Suffix splits
Choose a global `k`. For each path:

- the last `k` segments become the column fragment
- the remaining leading segments become the row fragment

This means columns group paths that share the same `k`-segment suffix.
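The two split families can be sketched as follows (helper names are hypothetical, not the handler's API):

```typescript
type Split = { row: string[]; col: string[] };

// Prefix split: first k segments go to the row side,
// but at least one segment must remain on the column side.
function prefixSplit(path: string[], k: number): Split {
  const cut = Math.min(k, path.length - 1);
  return { row: path.slice(0, cut), col: path.slice(cut) };
}

// Suffix split: last k segments go to the column side.
function suffixSplit(path: string[], k: number): Split {
  const cut = Math.max(path.length - k, 0);
  return { row: path.slice(0, cut), col: path.slice(cut) };
}
```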
4. Score each split
For one candidate split, collect the set of distinct row fragments and the set of distinct column fragments.

Then score it by:

`cost = row_count + column_count`

The algorithm prefers the split with the smallest cost.
Tie-breaker
If two splits have the same cost, the one that yields more rows (and therefore fewer columns) is preferred.

This biases the output toward taller, narrower tables, which are usually more CSV-like and more readable than very wide tables.
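The scoring and tie-breaking described above can be sketched as (names illustrative):

```typescript
// A candidate split summarized as its distinct row and column fragments.
type Candidate = { rows: Set<string>; cols: Set<string> };

// cost = number of distinct rows + number of distinct columns
function cost(c: Candidate): number {
  return c.rows.size + c.cols.size;
}

// Pick the cheaper candidate; on a tie, prefer the taller (more rows) one.
function better(a: Candidate, b: Candidate): Candidate {
  const ca = cost(a), cb = cost(b);
  if (ca !== cb) return ca < cb ? a : b;
  return a.rows.size >= b.rows.size ? a : b;
}
```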
5. Build the table
Once the best split is chosen, each distinct row fragment becomes a row, each distinct column fragment becomes a column, and each value is placed in the cell at `(row_fragment, column_fragment)`.

If all row fragments are empty, the `key` column is omitted. Otherwise, the first column is a `key` column containing the row fragment.
Missing cells are emitted as empty strings.
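A sketch of this table-assembly step (a `Map` of row label to column/value map stands in for the chosen split; names are illustrative):

```typescript
function buildTable(cells: Map<string, Map<string, string>>): string[][] {
  const rowLabels = [...cells.keys()];
  // Union of all column labels, in first-seen order
  const colLabels = [...new Set([...cells.values()].flatMap(m => [...m.keys()]))];
  // Omit the "key" column when every row fragment is empty
  const hasKeyColumn = rowLabels.some(r => r !== "");
  const header = hasKeyColumn ? ["key", ...colLabels] : [...colLabels];
  const body = rowLabels.map(r => {
    const row = colLabels.map(c => cells.get(r)?.get(c) ?? ""); // missing cell -> ""
    return hasKeyColumn ? [r, ...row] : row;
  });
  return [header, ...body];
}
```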
6. Fallback behavior
The algorithm allows the structure to collapse into a simple key/value table when no useful 2D projection exists.
That happens naturally when the best split effectively assigns `k = 0` segments to the column side: the column fragment is empty, which the algorithm interprets as a single `value` column indicating "the value at this key".

So highly heterogeneous objects such as `package.json` end up as a key/value table. This is intentional. It is the correct best-effort representation for data that is not meaningfully tabular.
What kind of structures compress well
The algorithm produces good tables when many values share common suffixes or prefixes.
Example:
```json
{ "A": { "a": 1, "b": 2 }, "B": { "a": 3, "b": 4 } }
```

becomes:
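For illustration, the projected table would look roughly like this (the handler's exact quoting may differ):

```
key,a,b
A,1,2
B,3,4
```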
because the paths:

```
A.a
A.b
B.a
B.b
```

can be split compactly as rows `A, B` and columns `a, b`.

What kind of structures do not compress well
Objects that are really just maps or unrelated subtrees do not have a good shared schema.
Example:
```json
{ "dependencies": { "bson": "^7.2.0", "cbor": "^10.0.12" } }
```

is not naturally a wide table. The best representation according to our cost function is:
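For this input, a prefix split (row `dependencies`, columns `bson, cbor`) ties with the key/value shape at cost 3, and the tie-breaker picks the taller key/value table, roughly (illustrative; exact quoting may differ):

```
key,value
dependencies.bson,^7.2.0
dependencies.cbor,^10.0.12
```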
Likewise, heterogeneous top-level documents like `package.json` or `package-lock.json` may only partially compress. The algorithm still produces one table, but falls back to key/value structure where needed.

EDIT: the existing json -> csv conversion wasn't provided by pandoc but by the handwritten json.ts handler.