ICC Book Fetcher
Overview
The ICC book fetcher now verifies downloads more thoroughly than the previous shallow check. It reports the download status of each book in detail and detects partially downloaded or corrupted books.
Problem Solved
Previous Behavior
The original isBookAlreadyDownloaded() method only checked for the existence of:
- api/icc/content/info/{documentId}.json (or .raw.json)
- api/icc/content/chapters/{documentId}.json (or .raw.json)
This was insufficient because:
- It didn't verify that all expected chapter XML files were present
- It didn't check if files were corrupted or truncated
- It could miss cases where some chapters failed to download
New Behavior
The enhanced system now performs a comprehensive check that:
- Verifies metadata files exist (same as before)
- Parses the chapters metadata to determine expected chapter content IDs
- Checks each expected chapter file for existence
- Validates XML integrity by checking for well-formed structure
- Provides detailed reporting on missing, corrupted, and valid chapters
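The per-chapter part of this check can be sketched roughly as follows. This is a minimal illustration, not the fetcher's actual code: the class DownloadCheckSketch, its Result holder, and the looksWellFormed predicate are hypothetical names, the expected chapter IDs are passed in directly rather than parsed from the chapters metadata (whose JSON schema is not shown here), and files are assumed to live on the local filesystem as {chapterId}.html.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch; the real check lives in IccBookClient.
final class DownloadCheckSketch {

    /** Simplified stand-in for the chapter lists in BookDownloadStatus. */
    static final class Result {
        final List<String> missing = new ArrayList<>();
        final List<String> corrupted = new ArrayList<>();
        final List<String> valid = new ArrayList<>();

        /** True when no expected chapter is missing or corrupted. */
        boolean chaptersComplete() {
            return missing.isEmpty() && corrupted.isEmpty();
        }
    }

    /**
     * Classifies every expected chapter as missing, corrupted, or valid.
     * Assumption: an unreadable file is reported as corrupted.
     */
    static Result check(Path chapterDir, List<String> expectedChapterIds,
                        Predicate<String> looksWellFormed) {
        Result result = new Result();
        for (String id : expectedChapterIds) {
            Path file = chapterDir.resolve(id + ".html");
            if (!Files.isRegularFile(file)) {
                result.missing.add(id);
                continue;
            }
            try {
                String content = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
                if (looksWellFormed.test(content)) {
                    result.valid.add(id);
                } else {
                    result.corrupted.add(id);
                }
            } catch (IOException e) {
                result.corrupted.add(id);
            }
        }
        return result;
    }
}
```

Passing the well-formedness test in as a predicate keeps the classification loop independent of how strict the XML validation is.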
New Features
1. Thorough Download Status Checking
// New method for detailed status checking
BookDownloadStatus status = IccBookClient.checkBookDownloadStatus(documentId, fileSystemHandler);
// Legacy method still works but now uses the thorough check
boolean isDownloaded = IccBookClient.isBookAlreadyDownloaded(documentId, fileSystemHandler);
2. BookDownloadStatus Class
The new BookDownloadStatus class provides detailed information:
public class BookDownloadStatus {
    private String documentId;
    private boolean metadataFilesPresent;
    private boolean fullyDownloaded;
    private int expectedChapterCount;
    private int missingChapterCount;
    private int corruptedChapterCount;
    private int validChapterCount;
    private List<String> missingChapterIds;
    private List<String> corruptedChapterIds;
    private List<String> validChapterIds;
    private List<String> issues;
}
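To illustrate how these fields might map onto a one-line report, here is a small sketch. StatusLineSketch and summarize are hypothetical names, and the classification rule (no metadata means "not downloaded", zero missing and zero corrupted means "fully downloaded") is an assumption inferred from the sample output in this document rather than taken from the implementation.

```java
// Hypothetical sketch; takes plain values instead of a BookDownloadStatus
// because the real class's accessors are not shown in this document.
final class StatusLineSketch {

    static String summarize(String documentId, boolean metadataFilesPresent,
                            int expected, int missing, int corrupted, int valid) {
        if (!metadataFilesPresent) {
            // \u2717 is the ballot X used in the report
            return "Book " + documentId + ": \u2717 Not downloaded";
        }
        if (missing == 0 && corrupted == 0) {
            // \u2713 is the check mark used in the report
            return "Book " + documentId + ": \u2713 Fully downloaded (" + expected + " chapters)";
        }
        // \u26A0 is the warning sign used in the report
        return String.format(
                "Book %s: \u26A0 Partially downloaded (%d/%d chapters valid, %d missing, %d corrupted)",
                documentId, valid, expected, missing, corrupted);
    }
}
```

With the numbers from the sample report, summarize("3757", true, 100, 3, 2, 95) produces the "Partially downloaded (95/100 chapters valid, 3 missing, 2 corrupted)" line.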
3. Enhanced CLI Output
The command-line interface now provides detailed status reports:
cli/codeproof.sh icc-book-fetcher --search-result-file search-results.json
# Example output:
=== Download Status Report ===
Book 2217: ✓ Fully downloaded (100 chapters)
Book 3757: ⚠ Partially downloaded (95/100 chapters valid, 3 missing, 2 corrupted)
Missing chapters: 35712407, 35712408, 35712409
Corrupted chapters: 35712410, 35712411
Book 3100: ✗ Not downloaded
✓ Skipped 1 fully downloaded book(s): 2217
⚠ Will re-download 1 partially downloaded book(s): 3757
Will download 2 book(s) (new or to fix partial downloads)
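The skip and re-download decisions in this report amount to partitioning the requested books by status. The sketch below shows one way to express that; DownloadPlanSketch, State, and Plan are hypothetical names introduced for illustration, and the assumption that books are processed in the order they were requested is inferred from the sample output.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the planning step, not the fetcher's actual code.
final class DownloadPlanSketch {

    enum State { FULL, PARTIAL, NONE }

    static final class Plan {
        final List<String> skipped = new ArrayList<>();     // fully downloaded
        final List<String> redownload = new ArrayList<>();  // partially downloaded
        final List<String> toDownload = new ArrayList<>();  // partial + new
    }

    /** Iterates in the map's order, so pass an ordered map to keep request order. */
    static Plan plan(Map<String, State> statuses) {
        Plan plan = new Plan();
        for (Map.Entry<String, State> entry : statuses.entrySet()) {
            String id = entry.getKey();
            switch (entry.getValue()) {
                case FULL:
                    plan.skipped.add(id);
                    break;
                case PARTIAL:
                    plan.redownload.add(id);
                    plan.toDownload.add(id);
                    break;
                default:
                    plan.toDownload.add(id);
                    break;
            }
        }
        return plan;
    }
}
```

For the three books in the example, this yields one skipped book (2217), one re-download (3757), and two books to download in total (3757 and 3100).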
4. Status-Only Mode
New --status-only option to check status without downloading:
cli/codeproof.sh icc-book-fetcher --search-result-file search-results.json --status-only
5. Example Outputs
Status Check Output
$ cli/codeproof.sh icc-book-fetcher 2217 3757 3100 --status-only
Checking download status for all books...
=== Download Status Report ===
Book 2217: ✓ Fully downloaded (100 chapters)
Book 3757: ⚠ Partially downloaded (95/100 chapters valid, 3 missing, 2 corrupted)
Missing chapters: 35712407, 35712408, 35712409
Corrupted chapters: 35712410, 35712411
Book 3100: ✗ Not downloaded
Status check completed. Use --status-only to check status without downloading.
Download with Status Report
$ cli/codeproof.sh icc-book-fetcher 2217 3757 3100
Checking download status for all books...
=== Download Status Report ===
Book 2217: ✓ Fully downloaded (100 chapters)
Book 3757: ⚠ Partially downloaded (95/100 chapters valid, 3 missing, 2 corrupted)
Missing chapters: 35712407, 35712408, 35712409
Corrupted chapters: 35712410, 35712411
Book 3100: ✗ Not downloaded
Starting ICC book fetch for 3 book(s) with pause range: 3000-5000 ms
✓ Skipped 1 fully downloaded book(s): 2217
⚠ Will re-download 1 partially downloaded book(s): 3757
Will download 2 book(s) (new or to fix partial downloads)
[1/2] Fetching book ID: 3757
Fetching chapter 1 of 100: Chapter 1: Scope and Administration
Pausing for 4.2 seconds before next chapter fetch
...
✓ Successfully fetched book ID: 3757
[2/2] Fetching book ID: 3100
...
✓ Successfully fetched book ID: 3100
Completed fetching 3 ICC book(s)
XML Validation
The system includes basic XML well-formedness checking that verifies:
- File structure: Files should start with < and end with >
- Root elements: Should contain <section> or <html> as root
- Tag balance: Opening and closing tags should be reasonably balanced
- Proper endings: Files should not end abruptly with incomplete tags
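A lightweight check along these lines could look like the sketch below. It is an approximation under stated assumptions, not the fetcher's actual validator: XmlCheckSketch and the tolerance parameter are hypothetical, tags are counted with regular expressions instead of a real parser, and self-closing tags, comments, and declarations are ignored when counting.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of a regex-based well-formedness heuristic.
final class XmlCheckSketch {

    // Opening tags such as <p> or <div class="x">, excluding self-closing <br/>.
    private static final Pattern OPEN = Pattern.compile("<[A-Za-z][^<>]*(?<!/)>");
    // Closing tags such as </p>.
    private static final Pattern CLOSE = Pattern.compile("</[A-Za-z][^<>]*>");

    /**
     * @param tolerance allowed difference between opening and closing tag
     *                  counts; 0 demands an exact balance (assumed default)
     */
    static boolean looksWellFormed(String content, int tolerance) {
        if (content == null) {
            return false;
        }
        String text = content.trim();
        // File structure and proper endings: must start with '<' and end with '>'.
        if (!text.startsWith("<") || !text.endsWith(">")) {
            return false;
        }
        // Root elements: a <section> or <html> element must be present.
        if (!text.contains("<section") && !text.contains("<html")) {
            return false;
        }
        // Tag balance: compare counts without tracking nesting order.
        int difference = count(OPEN, text) - count(CLOSE, text);
        return Math.abs(difference) <= tolerance;
    }

    private static int count(Pattern pattern, String text) {
        Matcher matcher = pattern.matcher(text);
        int n = 0;
        while (matcher.find()) {
            n++;
        }
        return n;
    }
}
```

Because only counts are compared, this heuristic cannot detect tags that are balanced in number but nested in the wrong order; a full XML parser would be needed for that.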
Validation Examples
<!-- ✅ Well-formed -->
<section><div><p>Content</p></div></section>
<!-- ❌ Malformed (missing closing p tag) -->
<section><div><p>Content</div></section>
<!-- ❌ Malformed (ends abruptly) -->
<section><div><p>Content</p></div>
Usage Examples
Check Status Only
# Check status without downloading anything
cli/codeproof.sh icc-book-fetcher \
--search-result-file api/icc/codes/united-states/california/search-results.json \
--status-only
# Check status of specific books
cli/codeproof.sh icc-book-fetcher 2217 3757 3100 --status-only
# Check status of a non-existent book (for testing)
cli/codeproof.sh icc-book-fetcher 99999 --status-only
Normal Download with Enhanced Checking
# Download all books from search results with enhanced checking
cli/codeproof.sh icc-book-fetcher \
--search-result-file api/icc/codes/united-states/california/search-results.json \
--min-pause 3000 \
--max-pause 5000 \
--filesystem LOCAL
# Download specific books with enhanced checking
cli/codeproof.sh icc-book-fetcher 2217 3757 3100
Demonstration Commands
# 1. Check what books are available in a search results file
cli/codeproof.sh icc-search --file api/icc/codes/united-states/california/search-results.json --document-ids-only
# 2. Check download status of those books
cli/codeproof.sh icc-book-fetcher \
--search-result-file api/icc/codes/united-states/california/search-results.json \
--status-only
# 3. Download only the books that need it
cli/codeproof.sh icc-book-fetcher \
--search-result-file api/icc/codes/united-states/california/search-results.json \
--min-pause 3000 \
--max-pause 5000
# 4. Verify the download was successful
cli/codeproof.sh icc-book-fetcher \
--search-result-file api/icc/codes/united-states/california/search-results.json \
--status-only
Help and Options
# Show all available options
cli/codeproof.sh icc-book-fetcher --help
# Show help for the main CLI
cli/codeproof.sh --help
Benefits
- Reliability: Catches partial downloads and corrupted files
- Transparency: Clear reporting of what's missing or broken
- Efficiency: Only re-downloads what's actually needed
- Debugging: Detailed information helps identify download issues
- Backward Compatibility: Existing code continues to work
Technical Details
File Structure Checked
- api/icc/content/info/{documentId}.json - Book metadata
- api/icc/content/chapters/{documentId}.json - Chapter index
- api/icc/content/chapter-xml/{documentId}/{chapterId}.html - Chapter content files
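As a small illustration of this layout, the paths could be built with helpers like the ones below. IccPathsSketch and its method names are hypothetical, and the paths are assumed to be relative to the working directory; the .raw.json variants are left out.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helpers mirroring the file layout listed above.
final class IccPathsSketch {

    /** Book metadata: api/icc/content/info/{documentId}.json */
    static Path info(String documentId) {
        return Paths.get("api", "icc", "content", "info", documentId + ".json");
    }

    /** Chapter index: api/icc/content/chapters/{documentId}.json */
    static Path chapterIndex(String documentId) {
        return Paths.get("api", "icc", "content", "chapters", documentId + ".json");
    }

    /** Chapter content: api/icc/content/chapter-xml/{documentId}/{chapterId}.html */
    static Path chapterFile(String documentId, String chapterId) {
        return Paths.get("api", "icc", "content", "chapter-xml", documentId, chapterId + ".html");
    }
}
```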
Performance Considerations
- XML validation is lightweight and doesn't require full parsing
- Status checking reads files but doesn't make network calls
- Chapter count is limited to 100 by default (configurable in fetchBookChapters())
Error Handling
- Graceful handling of missing or corrupted files
- Detailed error messages for debugging
- Fallback behavior for parsing errors