#777 Add support for writing numeric data types having binary format (comp, comp-4, comp-9) #779
Walkthrough

Adds binary COMP numeric encoding to the writer: new BinaryEncoders, encoder selection for COMP-4/COMP-9, tests and test utilities, writer output-file naming and output-directory checks, README updates listing supported EBCDIC types, removal of an unused variable in BinaryUtils, and a change of Append save-mode behavior to a no-op.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    actor User as Spark Job
    participant DS as DefaultSource
    participant OF as RawBinaryOutputFormat
    participant ES as EncoderSelector
    participant BE as BinaryEncoders
    participant FS as Filesystem
    User->>DS: save(DataFrame, SaveMode, options)
    DS->>OF: configure job (output path, writeJobUUID)
    OF->>OF: checkOutputSpecs()
    note right of OF #DDEBF7: validate output directory
    par Per task
        OF->>OF: getDefaultWorkFile(taskId, jobUUID, attemptId)
        DS->>ES: select encoder (field metadata: COMP4/COMP9, precision, scale, sign, endian)
        ES->>BE: encodeBinaryNumber(BigDecimal, isSigned, size, bigEndian, precision, scale, scaleFactor)
        BE-->>ES: Array[Byte] (encoded)
        ES-->>DS: encoded bytes
        DS->>FS: write bytes to part file
    end
    FS-->>User: per-task part files
```
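For concreteness, a hypothetical invocation at the ES→BE step is sketched below. The argument list follows the diagram (value, isSigned, size, bigEndian, precision, scale, scaleFactor) and may not match the real `BinaryEncoders.encodeBinaryNumber` signature exactly; treat it as an illustration only.

```scala
// Hypothetical call mirroring the ES->BE arrow above; parameter order and
// meanings are taken from the diagram, not from the actual method signature.
val encoded: Array[Byte] = BinaryEncoders.encodeBinaryNumber(
  new java.math.BigDecimal("123.45"), // value from the DataFrame row
  true,                               // isSigned
  4,                                  // size of the target field in bytes
  true,                               // bigEndian: COMP/COMP-4; false for COMP-9
  5,                                  // precision
  2,                                  // scale
  0                                   // scaleFactor
)
```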
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Actionable comments posted: 2
🧹 Nitpick comments (7)
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/RawBinaryOutputFormat.scala (1)

46-49: Harden output path validation.

Also assert it's a directory and accessible; fail fast with a clearer message.

```diff
 override def checkOutputSpecs(job: JobContext): Unit = {
-  val outDir = getOutputPath(job)
-  if (outDir == null) throw new IllegalStateException("Output directory not set.")
+  val outDir = getOutputPath(job)
+  if (outDir == null) throw new IllegalStateException("Output directory not set.")
+  val fs = outDir.getFileSystem(job.getConfiguration)
+  if (fs.exists(outDir) && !fs.getFileStatus(outDir).isDirectory)
+    throw new IllegalStateException(s"Output path '$outDir' is not a directory.")
 }
```

cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/decoders/BinaryUtils.scala (1)
101-113: Unify the 1–2 digit byte-size rule across COMP variants.

IBM mapping (1–2 digits → 1 byte) applies to COMP/COMP-4/COMP-5/COMP-9. The `comp == COMP9()` guard is unnecessary and obscures intent.

```diff
-      case Some(comp) if comp == COMP4() || comp == COMP5() || comp == COMP9() => // || comp == binary2()
+      case Some(comp) if comp == COMP4() || comp == COMP5() || comp == COMP9() => // if native binary follow IBM guide to digit binary length
         precision match {
-          case p if p >= 1 && p <= 2 && comp == COMP9() => 1 // byte
+          case p if p >= 1 && p <= 2 => 1 // byte
           case p if p >= minShortPrecision && p <= maxShortPrecision => binaryShortSizeBytes
           case p if p >= minIntegerPrecision && p <= maxIntegerPrecision => binaryIntSizeBytes
           case p if p >= minLongPrecision && p <= maxLongPrecision => binaryLongSizeBytes
```

README.md (1)
1687-1690: Clarify COMP endianness note.

Since COMP is big-endian and COMP-9 is little-endian (Cobrix extension), add a parenthetical to avoid confusion.

```diff
-  - `PIC S9(n)` numeric (integral and decimal) with `COMP`, `COMP-3`, `COMP-4`, `COMP-9` (little-endian).
+  - `PIC S9(n)` numeric (integral and decimal) with `COMP`/`COMP-4`/`COMP-5` (big-endian), `COMP-3`, and `COMP-9` (Cobrix little-endian).
```

cobol-parser/src/test/scala/za/co/absa/cobrix/cobol/parser/encoding/BCDNumberEncodersSuite.scala (2)
25-28: Avoid `new BigDecimal(double)` in tests.

Construct from a string or use `BigDecimal.valueOf` to prevent binary-float rounding artifacts.

Example pattern:

```diff
- val actual = BCDNumberEncoders.encodeBCDNumber(new java.math.BigDecimal(123.45), 5, 2, 0, signed = true, mandatorySignNibble = true)
+ val actual = BCDNumberEncoders.encodeBCDNumber(new java.math.BigDecimal("123.45"), 5, 2, 0, signed = true, mandatorySignNibble = true)
```

(Apply similarly across cases.)

Also applies to: 34-35, 41-42, 48-49, 55-56, 62-63, 69-70, 76-77, 83-84, 90-91, 97-98, 104-105, 111-112, 118-119, 125-126, 137-139, 145-146, 152-153, 159-160, 166-167, 173-174, 180-181, 187-188, 194-195, 201-202, 208-209
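To see why this matters, a small standalone snippet (not from the PR) showing the difference:

```scala
import java.math.BigDecimal

// new BigDecimal(double) records the exact binary floating-point value,
// which is generally not the decimal literal you wrote:
val fromDouble = new BigDecimal(123.45)   // a long expansion close to, but not equal to, 123.45
val fromString = new BigDecimal("123.45") // exactly 123.45

println(fromDouble.compareTo(fromString) == 0) // false
println(BigDecimal.valueOf(123.45))            // 123.45 (uses Double.toString's shortest form)
```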
95-99: Fix typos in test names.

"nbegative" → "negative"; "prexision" → "precision".

```diff
- "encode a number with nbegative scale" in {
+ "encode a number with negative scale" in {

- "attempt to encode a number with zero prexision" in {
+ "attempt to encode a number with zero precision" in {
```

Also applies to: 130-132
spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/writer/FixedLengthEbcdicWriterSuite.scala (1)
249-249: Consider clarifying the null handling comment.

The comment could be more specific about which field and why.

```diff
- 0x00, 0x00 // null, because -20 cannot fix the unsigned type
+ 0x00, 0x00 // null for field F (COMP-9 unsigned): -20 cannot be encoded as unsigned
```

cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/BinaryEncoders.scala (1)
40-41: Consider adding a comment about BigInteger's byte array format.

It would be helpful to document that `BigInteger.toByteArray()` returns bytes in big-endian format with the sign bit.

```diff
+ // Note: BigInteger.toByteArray() returns bytes in big-endian two's complement format
  val intValue = bigInt.toByteArray
  val intValueLen = intValue.length
```
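For reference, a quick standalone illustration of that behavior (not project code):

```scala
import java.math.BigInteger

// toByteArray() yields the minimal two's complement representation, big-endian:
BigInteger.valueOf(300).toByteArray  // Array(0x01, 0x2C)
BigInteger.valueOf(-1).toByteArray   // Array(0xFF)       -- a single sign byte
BigInteger.valueOf(128).toByteArray  // Array(0x00, 0x80) -- leading zero keeps the value positive

// For a little-endian target (COMP-9) these bytes must be reversed after
// sign-extension padding to the field size.
```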
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (10)
- README.md (1 hunks)
- cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/decoders/BinaryUtils.scala (1 hunks)
- cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/BinaryEncoders.scala (1 hunks)
- cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/EncoderSelector.scala (3 hunks)
- cobol-parser/src/test/scala/za/co/absa/cobrix/cobol/parser/encoding/BCDNumberEncodersSuite.scala (3 hunks)
- cobol-parser/src/test/scala/za/co/absa/cobrix/cobol/parser/encoding/BinaryEncodersSuite.scala (1 hunks)
- cobol-parser/src/test/scala/za/co/absa/cobrix/cobol/testutils/ComparisonUtils.scala (1 hunks)
- spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/DefaultSource.scala (1 hunks)
- spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/RawBinaryOutputFormat.scala (2 hunks)
- spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/writer/FixedLengthEbcdicWriterSuite.scala (4 hunks)
🧰 Additional context used
🧬 Code graph analysis (5)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/BinaryEncoders.scala (1)
- cobol-parser/src/test/scala/za/co/absa/cobrix/cobol/parser/parse/FieldSizeSpec.scala (1)
  - scale (63-74)

cobol-parser/src/test/scala/za/co/absa/cobrix/cobol/parser/encoding/BinaryEncodersSuite.scala (2)
- cobol-parser/src/test/scala/za/co/absa/cobrix/cobol/testutils/ComparisonUtils.scala (2)
  - ComparisonUtils (22-33)
  - assertArraysEqual (23-31)
- cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/BinaryEncoders.scala (2)
  - BinaryEncoders (21-71)
  - encodeBinaryNumber (22-70)

cobol-parser/src/test/scala/za/co/absa/cobrix/cobol/parser/encoding/BCDNumberEncodersSuite.scala (1)
- cobol-parser/src/test/scala/za/co/absa/cobrix/cobol/testutils/ComparisonUtils.scala (2)
  - ComparisonUtils (22-33)
  - assertArraysEqual (23-31)

cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/EncoderSelector.scala (6)
- cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/decoders/BinaryUtils.scala (3)
  - cobol (164-208)
  - BinaryUtils (25-247)
  - getBytesCount (98-125)
- cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/ast/datatype/AlphaNumeric.scala (1)
  - AlphaNumeric (28-36)
- cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/ast/datatype/Usage.scala (4)
  - COMP3 (30-32)
  - COMP3U (34-36)
  - COMP4 (38-40)
  - COMP9 (44-46)
- cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/ast/datatype/Decimal.scala (1)
  - Decimal (32-63)
- cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/ast/datatype/Integral.scala (1)
  - Integral (30-40)
- cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/BinaryEncoders.scala (2)
  - BinaryEncoders (21-71)
  - encodeBinaryNumber (22-70)

spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/writer/FixedLengthEbcdicWriterSuite.scala (2)
- spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/source/fixtures/BinaryFileFixture.scala (1)
  - withTempDirectory (71-78)
- spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/RawBinaryOutputFormat.scala (2)
  - write (71-75)
  - close (76-78)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: Spark 3.4.4 on Scala 2.12.20
- GitHub Check: Spark 3.5.5 on Scala 2.13.16
- GitHub Check: Spark 3.5.5 on Scala 2.12.20
- GitHub Check: test (2.12.20, 2.12, 3.3.4, 0, 80, 20)
- GitHub Check: Spark 2.4.8 on Scala 2.11.12
🔇 Additional comments (10)
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/RawBinaryOutputFormat.scala (1)
44-44: Per-job UUID fallback is fine.

Good fallback for environments that don't set `spark.sql.sources.writeJobUUID`.

spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/DefaultSource.scala (1)
87-87: Append semantics: confirm committer-staged writes.

With Append now a no-op, ensure `RawBinaryOutputFormat` writes under the committer work path (see separate comment) so speculative attempts don't leak extra files.

cobol-parser/src/test/scala/za/co/absa/cobrix/cobol/parser/encoding/BCDNumberEncodersSuite.scala (1)
19-21: Nice: shared byte-array assertion utility.

The switch to `assertArraysEqual` improves failure diagnostics and removes duplication.

Also applies to: 29-29, 36-36, 43-43, 50-50, 57-57, 64-64, 71-71, 78-78, 85-85, 92-92, 99-99, 106-106, 113-113, 120-120, 127-127, 140-140, 147-147, 154-154, 161-161, 168-168, 175-175, 182-182, 189-189, 196-196, 203-203, 210-210
cobol-parser/src/test/scala/za/co/absa/cobrix/cobol/testutils/ComparisonUtils.scala (1)
22-33: LGTM! Well-designed test utility for byte array comparisons.

The utility provides clear hex-formatted output for debugging test failures, making it easy to spot differences between expected and actual byte arrays.
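As an illustration of the pattern (a minimal sketch, not the actual ComparisonUtils code):

```scala
// Minimal sketch of a hex-printing byte-array assertion; the real utility may differ.
object ComparisonUtilsSketch {
  def assertArraysEqual(actual: Array[Byte], expected: Array[Byte]): Unit = {
    def hex(bytes: Array[Byte]): String = bytes.map(b => f"0x$b%02X").mkString(", ")
    if (!actual.sameElements(expected)) {
      throw new AssertionError(s"Byte arrays differ.\nexpected: ${hex(expected)}\nactual:   ${hex(actual)}")
    }
  }
}
```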
cobol-parser/src/test/scala/za/co/absa/cobrix/cobol/parser/encoding/BinaryEncodersSuite.scala (1)
23-89: Test coverage looks comprehensive!

The test suite thoroughly covers positive integers, negative integers, big-endian/little-endian formats, boundary cases, and zero values.
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/EncoderSelector.scala (2)
33-50: Binary encoder integration looks correct!

The pattern matching for COMP4 (big-endian) and COMP9 (little-endian) binary types follows the existing pattern and correctly routes to the new binary encoder.

100-119: getBinaryEncoder implementation is well-structured.

The function properly handles various numeric types and delegates to BinaryEncoders.encodeBinaryNumber with the correct parameters.
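A simplified sketch of the routing idea (names condensed; not the actual EncoderSelector code):

```scala
// Route binary usages to an endianness flag; everything else falls through
// to the existing encoders (DISPLAY, COMP-3, ...). Illustrative only.
def binaryEndianness(usage: String): Option[Boolean] = usage match {
  case "COMP" | "COMP-4" | "BINARY" => Some(true)   // big-endian
  case "COMP-9"                     => Some(false)  // little-endian (Cobrix extension)
  case _                            => None
}
```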
spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/writer/FixedLengthEbcdicWriterSuite.scala (2)
186-259: Excellent test coverage for binary COMP fields!

The test thoroughly validates COMP, COMP-4, BINARY, and COMP-9 encodings with appropriate expected byte arrays for big-endian and little-endian formats.
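For intuition, the same value in both byte orders (standalone illustration, 2-byte field):

```scala
import java.math.BigInteger

val be = BigInteger.valueOf(1000).toByteArray // Array(0x03, 0xE8) -- COMP / COMP-4 / BINARY layout
val le = be.reverse                           // Array(0xE8, 0x03) -- COMP-9 (little-endian) layout
```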
261-287: Append mode test properly validates multi-file output.

The test correctly validates that append mode creates multiple part files rather than overwriting.
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/BinaryEncoders.scala (1)
22-70: Binary encoder implementation is solid and efficient!

The implementation correctly handles:
- Null and invalid parameter guards
- Scale factor adjustments for decimal point positioning
- Two's complement representation for signed numbers
- Both big-endian and little-endian byte ordering
- Proper padding with sign extension
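A minimal sketch of those steps (an illustration under simplifying assumptions, not the Cobrix implementation):

```scala
import java.math.{BigDecimal, BigInteger}

// Scale the decimal to an integer, take its big-endian two's complement bytes,
// pad with sign extension to the field size, and reverse for little-endian output.
def encodeBinarySketch(value: BigDecimal, scale: Int, sizeBytes: Int, bigEndian: Boolean): Option[Array[Byte]] = {
  val unscaled: BigInteger = value.movePointRight(scale).toBigIntegerExact // throws if digits would be lost
  val raw = unscaled.toByteArray
  if (raw.length > sizeBytes) {
    None // the value does not fit the target field
  } else {
    val pad: Byte = if (unscaled.signum() < 0) 0xFF.toByte else 0x00.toByte // sign extension
    val out = Array.fill[Byte](sizeBytes)(pad)
    System.arraycopy(raw, 0, out, sizeBytes - raw.length, raw.length)
    Some(if (bigEndian) out else out.reverse)
  }
}

// encodeBinarySketch(new BigDecimal("123.45"), 2, 4, bigEndian = true)
//   => Some(Array(0x00, 0x00, 0x30, 0x39))   // 12345 = 0x3039
```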
…ther PR fixes (Thanks @coderabbitai).
Actionable comments posted: 0
♻️ Duplicate comments (1)
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/RawBinaryOutputFormat.scala (1)
54-63: Resolved: now writing under the committer work path (no direct writes to the final dir).

This addresses the earlier concern about bypassing the committer's staging. Naming is deterministic and attempt-aware. Nice.
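For context, the pattern being referenced looks roughly like the sketch below; the file-name format is illustrative, not the PR's exact one.

```scala
import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapreduce.TaskAttemptContext
import org.apache.hadoop.mapreduce.lib.output.{FileOutputCommitter, FileOutputFormat}

// Resolve the part file under the committer's task work path so speculative
// attempts stay in staging until commit; the name encodes task id, job id and attempt.
def defaultWorkFileSketch(context: TaskAttemptContext, writeJobId: String): Path = {
  val committer = new FileOutputCommitter(FileOutputFormat.getOutputPath(context), context)
  val attempt   = context.getTaskAttemptID
  val fileName  = f"part-${attempt.getTaskID.getId}%05d-$writeJobId-a${attempt.getId}.dat"
  new Path(committer.getWorkPath, fileName)
}
```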
🧹 Nitpick comments (3)
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/RawBinaryOutputFormat.scala (3)
44-44: Micro: make the fallback UUID lazy.

Avoids creating a UUID when spark.sql.sources.writeJobUUID is always present.

Apply:

```diff
- private val uniqueUuid = UUID.randomUUID().toString
+ private lazy val uniqueUuid = UUID.randomUUID().toString
```
46-52: Delegate to FileOutputFormat's default checks; prefer IOExceptions.

Preserve Hadoop/Spark semantics across versions and use the IOExceptions expected by MR APIs; keep the directory-type guard.

Apply:

```diff
- override def checkOutputSpecs(job: JobContext): Unit = {
-   val outDir = getOutputPath(job)
-   if (outDir == null) throw new IllegalStateException("Output directory not set.")
-   val fs = outDir.getFileSystem(job.getConfiguration)
-   if (fs.exists(outDir) && !fs.getFileStatus(outDir).isDirectory)
-     throw new IllegalStateException(s"Output path '$outDir' is not a directory.")
- }
+ override def checkOutputSpecs(job: JobContext): Unit = {
+   super.checkOutputSpecs(job) // ensures out dir is set and validates committer expectations
+   val outDir = getOutputPath(job)
+   val fs = outDir.getFileSystem(job.getConfiguration)
+   if (fs.exists(outDir) && !fs.getFileStatus(outDir).isDirectory) {
+     throw new java.io.IOException(s"Output path '$outDir' is not a directory.")
+   }
+ }
```
56-57: Optional: prefer JobID as deterministic fallback for writeJobId.

When spark.sql.sources.writeJobUUID is absent, using the TaskAttempt's JobID aids traceability and consistency across tasks/attempts.

Apply:

```diff
- val writeJobId = Option(conf.get("spark.sql.sources.writeJobUUID")).getOrElse(uniqueUuid)
- val attempt = context.getTaskAttemptID
+ val attempt = context.getTaskAttemptID
+ val writeJobId = Option(conf.get("spark.sql.sources.writeJobUUID"))
+   .orElse(Option(attempt.getJobID).map(_.toString))
+   .getOrElse(uniqueUuid)
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (5)
- README.md (1 hunks)
- cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/decoders/BinaryUtils.scala (1 hunks)
- cobol-parser/src/test/scala/za/co/absa/cobrix/cobol/parser/encoding/BCDNumberEncodersSuite.scala (2 hunks)
- cobol-parser/src/test/scala/za/co/absa/cobrix/cobol/parser/encoding/BinaryEncodersSuite.scala (1 hunks)
- spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/RawBinaryOutputFormat.scala (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
- cobol-parser/src/test/scala/za/co/absa/cobrix/cobol/parser/encoding/BCDNumberEncodersSuite.scala
- cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/decoders/BinaryUtils.scala
- README.md
- cobol-parser/src/test/scala/za/co/absa/cobrix/cobol/parser/encoding/BinaryEncodersSuite.scala
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: test (2.12.20, 2.12, 3.3.4, 0, 80, 20)
- GitHub Check: Spark 3.4.4 on Scala 2.12.20
- GitHub Check: Spark 3.5.5 on Scala 2.12.20
- GitHub Check: Spark 2.4.8 on Scala 2.11.12
- GitHub Check: Spark 3.5.5 on Scala 2.13.16
🔇 Additional comments (2)
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/RawBinaryOutputFormat.scala (2)
23-23: LGTM: static import is appropriate.

Used by checkOutputSpecs; keeps the call concise.

26-26: LGTM: UUID import for filename entropy.

Appropriate for per-job/task uniqueness.
…empt’s JobID aids traceability and consistency across tasks/attempts.
Closes #777
Summary by CodeRabbit

- New Features
- Bug Fixes
- Documentation
- Tests