LLM Log Sanitization
This document describes the LLM log sanitization feature that automatically removes or redacts sensitive information from log traces.
Overview
The LLM log sanitization system provides configurable filtering of sensitive content from both:
- BigQuery trace logs (via LlmLogTracer)
- Standard application logs (via Logger)
Features
Supported Content Types
- Binary Content Sanitization:
  - PDF files (<pdf-binary-data-redacted>)
  - Images: JPEG, PNG, GIF, WebP (<image-binary-data-redacted>)
  - Videos: MP4, AVI (<video-binary-data-redacted>)
  - Audio: MP3, WAV (<audio-binary-data-redacted>)
- Personally Identifiable Information (PII) Sanitization:
  - Email addresses (<pii-data-redacted>)
  - Phone numbers (<pii-data-redacted>)
  - Social Security Numbers (<pii-data-redacted>)
Configuration
The sanitizer is configured via environment variables and system properties:
Environment Variables
# Master switch for sanitization
export LLM_LOG_SANITIZATION_ENABLED=true
# Individual content type controls
export LLM_LOG_SANITIZE_PDF_CONTENT=true
export LLM_LOG_SANITIZE_PII_CONTENT=false
export LLM_LOG_SANITIZE_IMAGE_CONTENT=true
export LLM_LOG_SANITIZE_VIDEO_CONTENT=true
export LLM_LOG_SANITIZE_AUDIO_CONTENT=true
System Properties
Alternatively, use system properties (environment variables take precedence):
-Dllm.log.sanitization.enabled=true
-Dllm.log.sanitize.pdf.content=true
-Dllm.log.sanitize.pii.content=false
-Dllm.log.sanitize.image.content=true
-Dllm.log.sanitize.video.content=true
-Dllm.log.sanitize.audio.content=true
Default Configuration
- Sanitization Enabled: true
- PDF Content: true (sanitized)
- PII Content: false (not sanitized by default)
- Image Content: true (sanitized)
- Video Content: true (sanitized)
- Audio Content: true (sanitized)
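The lookup order described above (environment variable first, then system property, then the default) can be sketched as follows. This is a minimal illustration, not the actual LlmLogSanitizationConfig code: the class name SanitizationSettingResolver is hypothetical, and treating a blank value as "unset" is an assumption.

```java
import java.util.Map;
import java.util.Properties;

/** Hypothetical helper; not part of the documented API. */
final class SanitizationSettingResolver {

    private SanitizationSettingResolver() {}

    /**
     * Resolves one boolean setting: environment variable first,
     * then system property, then the supplied default.
     * Assumption: blank values are treated as unset.
     */
    static boolean resolve(Map<String, String> env, Properties props,
                           String envName, String propName, boolean defaultValue) {
        String fromEnv = env.get(envName);
        if (fromEnv != null && !fromEnv.trim().isEmpty()) {
            return Boolean.parseBoolean(fromEnv.trim());
        }
        String fromProp = props.getProperty(propName);
        if (fromProp != null && !fromProp.trim().isEmpty()) {
            return Boolean.parseBoolean(fromProp.trim());
        }
        return defaultValue;
    }

    /** Convenience overload that reads the real process environment. */
    static boolean resolve(String envName, String propName, boolean defaultValue) {
        return resolve(System.getenv(), System.getProperties(),
                envName, propName, defaultValue);
    }
}
```

For example, SanitizationSettingResolver.resolve("LLM_LOG_SANITIZE_PII_CONTENT", "llm.log.sanitize.pii.content", false) would return the PII setting with the documented default. Passing the environment as a Map keeps the lookup easy to test without changing the real process environment.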
Preset Configurations
Development Configuration
LlmLogSanitizationConfig.createDevelopmentConfig()
// Sanitizes only PDF content for minimal impact during development
Production Configuration
LlmLogSanitizationConfig.createProductionConfig()
// Sanitizes all content types including PII for maximum security
Disabled Configuration
LlmLogSanitizationConfig.createDisabledConfig()
// Disables all sanitization
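To make the difference between the three presets concrete, the sketch below spells out the flags each one would imply according to the comments above. PresetFlags is a hypothetical illustration class, not part of LlmLogSanitizationConfig, and setting every individual flag to false for the disabled preset is an assumption (the source only says that all sanitization is disabled).

```java
/** Hypothetical flag holder used only to illustrate the three presets. */
final class PresetFlags {
    final boolean enabled;
    final boolean pdf;
    final boolean pii;
    final boolean image;
    final boolean video;
    final boolean audio;

    private PresetFlags(boolean enabled, boolean pdf, boolean pii,
                        boolean image, boolean video, boolean audio) {
        this.enabled = enabled;
        this.pdf = pdf;
        this.pii = pii;
        this.image = image;
        this.video = video;
        this.audio = audio;
    }

    /** Development: only PDF content is sanitized. */
    static PresetFlags development() {
        return new PresetFlags(true, true, false, false, false, false);
    }

    /** Production: every content type, including PII, is sanitized. */
    static PresetFlags production() {
        return new PresetFlags(true, true, true, true, true, true);
    }

    /** Disabled: master switch off (individual flags assumed off as well). */
    static PresetFlags disabled() {
        return new PresetFlags(false, false, false, false, false, false);
    }
}
```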
Usage Examples
Programmatic Usage
// Create a custom sanitizer
LlmLogTraceSanitizer sanitizer = new LlmLogTraceSanitizer.Builder()
.sanitizePdfContent(true)
.sanitizePiiContent(true)
.sanitizeImageContent(false)
.build();
// Sanitize JSON content
String sanitizedJson = sanitizer.sanitizeJson(originalJson);
// Sanitize plain text
String sanitizedText = sanitizer.sanitizeText(originalText);
Automatic Integration
The sanitizer is automatically integrated into:
- BigQuery Logging: All traces logged to BigQuery are automatically sanitized
- Standard Logging: Request/response JSON in application logs is sanitized
Configuration Loading
// Load configuration from environment/properties
LlmLogSanitizationConfig config = new LlmLogSanitizationConfig();
LlmLogTraceSanitizer sanitizer = config.createSanitizer();
// Check configuration
logger.info("Sanitization config: " + config.getSummary());
Security Considerations
What Gets Sanitized
- Base64 Encoded Binary Data: Detected by:
  - Pattern matching for base64 strings > 1000 characters
  - File signature verification (magic numbers)
  - Field name heuristics (data, content, bytes, blob)
- PII in Text: Detected by regex patterns for:
  - Email addresses
  - US phone numbers
  - Social Security Numbers
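The binary detection steps listed above (length threshold, then file signature check) might look roughly like the following. This is a sketch under stated assumptions, not the actual LlmLogTraceSanitizer logic: BinaryPayloadSniffer is a hypothetical name, only the first 16 characters are decoded and validated as base64, the signature set is deliberately small, and field name heuristics are left out.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

/** Hypothetical helper illustrating length + magic-number detection. */
final class BinaryPayloadSniffer {
    // Threshold from the text above: only longer strings are inspected.
    static final int MIN_BASE64_LENGTH = 1000;

    private BinaryPayloadSniffer() {}

    /** Returns a redaction placeholder, or null to leave the value as-is. */
    static String placeholderFor(String value) {
        if (value == null || value.length() <= MIN_BASE64_LENGTH) {
            return null;
        }
        byte[] head;
        try {
            // 16 base64 characters decode to 12 bytes, enough for the
            // signatures below. Only this prefix is validated as base64.
            head = Base64.getDecoder().decode(value.substring(0, 16));
        } catch (IllegalArgumentException notBase64) {
            return null;
        }
        if (hasAscii(head, 0, "%PDF")) {
            return "<pdf-binary-data-redacted>";
        }
        if (isJpeg(head) || isPng(head) || hasAscii(head, 0, "GIF8")
                || isRiff(head, "WEBP")) {
            return "<image-binary-data-redacted>";
        }
        // "ftyp" at offset 4 marks ISO media files; treated as MP4 here.
        if (hasAscii(head, 4, "ftyp") || isRiff(head, "AVI ")) {
            return "<video-binary-data-redacted>";
        }
        // Only ID3-tagged MP3 files are recognized in this sketch.
        if (hasAscii(head, 0, "ID3") || isRiff(head, "WAVE")) {
            return "<audio-binary-data-redacted>";
        }
        return null;
    }

    private static boolean isJpeg(byte[] d) {
        return d.length >= 3 && (d[0] & 0xFF) == 0xFF
                && (d[1] & 0xFF) == 0xD8 && (d[2] & 0xFF) == 0xFF;
    }

    private static boolean isPng(byte[] d) {
        return d.length >= 4 && (d[0] & 0xFF) == 0x89 && hasAscii(d, 1, "PNG");
    }

    /** RIFF containers carry their format tag at byte offset 8. */
    private static boolean isRiff(byte[] d, String formTag) {
        return hasAscii(d, 0, "RIFF") && hasAscii(d, 8, formTag);
    }

    private static boolean hasAscii(byte[] d, int offset, String ascii) {
        byte[] sig = ascii.getBytes(StandardCharsets.US_ASCII);
        if (d.length < offset + sig.length) {
            return false;
        }
        for (int i = 0; i < sig.length; i++) {
            if (d[offset + i] != sig[i]) {
                return false;
            }
        }
        return true;
    }
}
```

Decoding only a short prefix keeps the check cheap for large payloads, at the cost of not validating the rest of the string.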
What Doesn't Get Sanitized
- Small base64 strings (< 1000 characters)
- Binary data that doesn't match known file signatures
- PII in structured formats (when PII sanitization is disabled)
- Non-standard PII formats
Limitations
- Performance: Large content sanitization may impact performance
- False Positives: Some legitimate base64 data might be sanitized
- False Negatives: Sophisticated encoding might bypass detection
- PII Detection: Basic regex patterns may miss complex PII formats
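The PII limitation is easier to see with concrete patterns. The sketch below uses simple regular expressions of the kind described above; the exact patterns are assumptions written for illustration (the real ones in LlmLogTraceSanitizer may differ), and PiiRedactionSketch is a hypothetical class name.

```java
import java.util.regex.Pattern;

/** Hypothetical example of basic regex-based PII redaction. */
final class PiiRedactionSketch {
    private static final String PLACEHOLDER = "<pii-data-redacted>";

    // Assumed patterns, for illustration only.
    private static final Pattern EMAIL = Pattern.compile(
            "[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");
    private static final Pattern SSN = Pattern.compile(
            "\\b\\d{3}-\\d{2}-\\d{4}\\b");
    private static final Pattern US_PHONE = Pattern.compile(
            "(?:\\(\\d{3}\\)\\s?|\\b\\d{3}[-. ])\\d{3}[-. ]\\d{4}\\b");

    private PiiRedactionSketch() {}

    /** Replaces every match of the three patterns with the placeholder. */
    static String sanitize(String text) {
        String result = EMAIL.matcher(text).replaceAll(PLACEHOLDER);
        result = SSN.matcher(result).replaceAll(PLACEHOLDER);
        result = US_PHONE.matcher(result).replaceAll(PLACEHOLDER);
        return result;
    }
}
```

With these patterns, "jane.doe@example.com" and "123-45-6789" are redacted, while obfuscated or non-standard forms such as "jane at example dot com" or "123 45 6789" pass through unchanged, which is the false-negative case noted above.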
Testing
Run the sanitizer test to verify functionality:
mvn compile exec:java -Dexec.mainClass="org.codetricks.construction.code.assistant.ai.model.LlmLogSanitizerTest"
This will test different sanitizer configurations with sample PDF and PII data.
Monitoring
The sanitizer logs its configuration at startup:
INFO: LLM Log Sanitization Configuration loaded:
INFO: Sanitization Enabled: true
INFO: PDF Content: true
INFO: PII Content: false
INFO: Image Content: true
INFO: Video Content: true
INFO: Audio Content: true
Extending the Sanitizer
To add new content types or PII patterns:
- Add new detection methods in LlmLogTraceSanitizer
- Update configuration in LlmLogSanitizationConfig
- Add new patterns to the appropriate detection methods
- Update tests to verify new functionality
Example:
// Add credit card detection
private static final Pattern CREDIT_CARD_PATTERN = Pattern.compile(
"\\b\\d{4}[- ]?\\d{4}[- ]?\\d{4}[- ]?\\d{4}\\b"
);
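One way the pattern above could be applied is sketched below. Because any 16-digit number matches the pattern, this sketch adds a Luhn checksum test to cut down on false positives; the Luhn step is an addition for illustration and is not described in the source. CreditCardRedactor is a hypothetical class, and reusing the <pii-data-redacted> placeholder for card numbers is an assumption.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical example of wiring a new PII pattern into redaction. */
final class CreditCardRedactor {
    private static final Pattern CREDIT_CARD_PATTERN = Pattern.compile(
            "\\b\\d{4}[- ]?\\d{4}[- ]?\\d{4}[- ]?\\d{4}\\b");
    // Assumption: card numbers reuse the generic PII placeholder.
    private static final String PLACEHOLDER = "<pii-data-redacted>";

    private CreditCardRedactor() {}

    /** Redacts 16-digit matches that also pass the Luhn checksum. */
    static String redact(String text) {
        Matcher m = CREDIT_CARD_PATTERN.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String digits = m.group().replaceAll("[- ]", "");
            String replacement = passesLuhn(digits) ? PLACEHOLDER : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    /** Standard Luhn check: double every second digit from the right. */
    static boolean passesLuhn(String digits) {
        int sum = 0;
        boolean doubleIt = false;
        for (int i = digits.length() - 1; i >= 0; i--) {
            int d = digits.charAt(i) - '0';
            if (doubleIt) {
                d *= 2;
                if (d > 9) {
                    d -= 9;
                }
            }
            sum += d;
            doubleIt = !doubleIt;
        }
        return sum % 10 == 0;
    }
}
```

The well-known test number 4111 1111 1111 1111 passes the checksum and is redacted, whereas an arbitrary 16-digit identifier such as 1234 5678 9012 3456 fails it and is left in place.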
Best Practices
- Production: Enable PII sanitization in production environments
- Development: Use minimal sanitization for faster development
- Testing: Verify sanitization doesn't break functionality
- Monitoring: Check logs to ensure configuration is applied correctly
- Performance: Monitor impact on large payloads
Troubleshooting
Issue: Sanitization not working
- Check environment variables are set correctly
- Verify configuration logs at startup
- Test with the provided test class
Issue: False positives
- Adjust detection thresholds
- Customize patterns for your use case
- Consider disabling specific sanitization types
Issue: Performance impact
- Monitor processing time for large payloads
- Consider disabling unnecessary sanitization types
- Optimize detection patterns
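To monitor processing time as suggested above, a sanitization call can be wrapped in a simple timer. The following is a minimal sketch: SanitizerTiming is a hypothetical helper, and it accepts any String-to-String function so that it does not depend on the sanitizer's exact API.

```java
import java.util.function.UnaryOperator;

/** Hypothetical timing wrapper for a sanitization call. */
final class SanitizerTiming {

    private SanitizerTiming() {}

    /** Runs the sanitizer once and prints input size and elapsed time. */
    static String timed(String label, String input,
                        UnaryOperator<String> sanitizer) {
        long start = System.nanoTime();
        String result = sanitizer.apply(input);
        long elapsedMicros = (System.nanoTime() - start) / 1_000;
        System.out.printf("%s: %d chars in %d us%n",
                label, input.length(), elapsedMicros);
        return result;
    }
}
```

With the sanitizer from the Programmatic Usage section, a call might look like SanitizerTiming.timed("text", originalText, sanitizer::sanitizeText). A single measurement is noisy, so compare repeated runs on representative payloads before drawing conclusions.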
Related Classes
- LlmLogTraceSanitizer - Main sanitization logic
- LlmLogSanitizationConfig - Configuration management
- LlmLogTracer - BigQuery logging integration
- GoogleGenAiClient - Standard logging integration