Refactor startpakket processing tool: modularize code into dedicated scripts for argument parsing, logging setup, data processing, and file utilities; enhance documentation and usage examples in README.

This commit is contained in:
bdaneels 2025-07-29 16:43:53 +02:00
parent 8236038f11
commit 067a450bfd
6 changed files with 366 additions and 179 deletions

startpakketten/README.md Normal file

@@ -0,0 +1,105 @@
# Startpakket Processing Tool
A Python tool for processing and comparing student data from predeliberation and dashboard Excel files. The tool identifies students with FAIL status and compares SP (study points) values between different data sources.
## Project Structure
The codebase has been organized into focused modules:
### Core Scripts
- **`script.py`** - Main orchestration script, handles command-line interface and coordinates all processing
- **`data_processor.py`** - Core data processing functions for Excel file handling
- **`cli_args.py`** - Command-line argument parsing and validation
- **`config.py`** - Configuration management and logging setup
- **`file_utils.py`** - File I/O utilities and output formatting
### Processing Modules
- **`checkheaders.py`** - Excel file header processing and normalization
- **`process_predelib_file.py`** - Predeliberation file analysis and FAIL status detection
- **`compare_sp.py`** - SP value comparison between predeliberation and dashboard files (a sketch follows this list)
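`compare_sp.py` itself is not included in this commit. As a rough illustration of the comparison it performs, here is a minimal sketch of `compare_sp_values`; the merge strategy and the exact SP columns (`Totaal aantal SP` versus `Ingeschr. SP (intern)`) are assumptions based on this README, not the committed implementation.

```python
import pandas as pd

def compare_sp_values(predelib_df: pd.DataFrame, dashboard_df: pd.DataFrame) -> list[dict]:
    """Sketch: join both files on student ID and report rows whose SP values differ."""
    merged = predelib_df.merge(
        dashboard_df[["ID", "Ingeschr. SP (intern)"]],
        on="ID",
        how="inner",
    )
    mismatches = []
    for _, row in merged.iterrows():
        if row["Totaal aantal SP"] != row["Ingeschr. SP (intern)"]:
            mismatches.append({
                "ID": row["ID"],
                "Name": f"{row['Voornaam']} {row['Achternaam']}",
                "Predelib_SP": row["Totaal aantal SP"],
                "Dashboard_SP": row["Ingeschr. SP (intern)"],
            })
    return mismatches
```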
## Usage
### Basic Usage
```bash
python script.py --predelib db.xlsx --dashboard dashboard_inschrijvingen.xlsx
```
### Advanced Usage
```bash
# Save results to JSON file
python script.py -p db.xlsx -d dashboard_inschrijvingen.xlsx --output results.json
# Enable verbose logging
python script.py --predelib db.xlsx --dashboard dashboard_inschrijvingen.xlsx --verbose
# Custom log file
python script.py --predelib db.xlsx --dashboard dashboard_inschrijvingen.xlsx --log-file custom.log
```
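The same pipeline can also be driven from Python instead of the CLI; a minimal sketch using only the modules from this commit (the file names are the ones from the examples above):

```python
from config import setup_logging
from data_processor import process_files
from file_utils import print_summary, save_results

# Configure logging the same way script.py does, then run the pipeline in-process
logger = setup_logging("startpakket_processing.log", verbose=True)
results = process_files("db.xlsx", "dashboard_inschrijvingen.xlsx", verbose=True)
print_summary(results)
save_results(results, "results.json")  # optional JSON export
```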
### Command-line Options
- `--predelib, -p`: Path to predeliberation Excel file (required)
- `--dashboard, -d`: Path to dashboard Excel file (required)
- `--output, -o`: Output file path for JSON results (optional)
- `--verbose, -v`: Enable verbose logging
- `--log-file`: Custom log file path (default: startpakket_processing.log)
## Features
1. **File Validation**: Automatically validates that input files exist and are in Excel format (`.xlsx` or `.xls`)
2. **Header Processing**: Detects and normalizes Excel column headers
3. **FAIL Detection**: Identifies students with a FAIL status in the `Adviesrapport code` column (see the sketch after this list)
4. **SP Comparison**: Compares study points between predeliberation and dashboard data
5. **Comprehensive Logging**: Detailed logging with configurable verbosity
6. **Flexible Output**: Console summary with optional JSON export
7. **Error Handling**: Robust error handling with appropriate exit codes
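Feature 3 relies on `check_students_with_fail_adviesrapport` from `process_predelib_file.py`, which is not part of this commit. A minimal sketch of the kind of selection it could perform, assuming a simple substring match on the `Adviesrapport code` column:

```python
import pandas as pd

def check_students_with_fail_adviesrapport(predelib_df: pd.DataFrame) -> list[dict]:
    """Sketch: return the rows whose 'Adviesrapport code' mentions FAIL."""
    mask = (
        predelib_df["Adviesrapport code"]
        .astype(str)
        .str.contains("FAIL", case=False, na=False)
    )
    columns = ["ID", "Achternaam", "Voornaam", "E-mail", "Adviesrapport code", "Waarschuwing"]
    return predelib_df.loc[mask, columns].to_dict(orient="records")
```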
## Installation
1. Ensure Python 3.12+ is installed
2. Install required dependencies:
```bash
pip install pandas openpyxl
```
## Input File Requirements
### Predeliberation File (db.xlsx)
Must contain columns:
- ID, Achternaam, Voornaam, E-mail
- Totaal aantal SP, Aantal SP vereist
- Adviesrapport code, Waarschuwing
### Dashboard File (dashboard_inschrijvingen.xlsx)
Must contain columns:
- ID, Naam, Voornaam
- Ingeschr. SP (intern)
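`checkheaders.py` (also outside this commit) is responsible for verifying these columns and cleaning up the headers. A minimal sketch of what such a check might look like for the predeliberation file, assuming a plain trim-and-verify approach:

```python
import pandas as pd

REQUIRED_PREDELIB_COLUMNS = [
    "ID", "Achternaam", "Voornaam", "E-mail",
    "Totaal aantal SP", "Aantal SP vereist",
    "Adviesrapport code", "Waarschuwing",
]

def check_headers_predelibfile(df: pd.DataFrame) -> pd.DataFrame:
    """Sketch: strip stray whitespace from headers and fail fast on missing columns."""
    df = df.rename(columns=lambda c: str(c).strip())
    missing = [c for c in REQUIRED_PREDELIB_COLUMNS if c not in df.columns]
    if missing:
        raise ValueError(f"Predeliberation file is missing columns: {missing}")
    return df
```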
## Output
The tool provides:
1. **Console Summary**: Overview of processing results
2. **Failed Students Report**: Detailed list of students with FAIL status
3. **SP Mismatch Report**: Any discrepancies between predeliberation and dashboard SP values
4. **Optional JSON Export**: Machine-readable results for further processing (see the example below)
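When `--output results.json` is passed, the exported file mirrors the results dictionary assembled in `data_processor.process_files`. A small example of reading it back (the file name is just the one from the usage examples):

```python
import json

with open("results.json", encoding="utf-8") as f:
    results = json.load(f)

print(f"{results['mismatches_count']} SP mismatches, "
      f"{results['students_with_fail_count']} students with a FAIL adviesrapport")
for mismatch in results["mismatches"]:
    print(mismatch["ID"], mismatch["Name"], mismatch["Predelib_SP"], mismatch["Dashboard_SP"])
```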
## Exit Codes
- `0`: Success (no SP mismatches found)
- `1`: SP mismatches found, or a fatal error occurred during processing
- `2`: Invalid command-line arguments (argparse's default error code)
- `130`: Process interrupted by the user
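These codes make the tool easy to wrap in automation; a hedged example of a wrapper script (hypothetical, not part of this commit):

```python
import subprocess
import sys

# Run the tool and branch on its exit code
proc = subprocess.run(
    [sys.executable, "script.py",
     "--predelib", "db.xlsx",
     "--dashboard", "dashboard_inschrijvingen.xlsx"]
)
if proc.returncode == 0:
    print("All SP values match")
elif proc.returncode == 1:
    print("Mismatches (or a processing error) found, check the log")
else:
    print(f"Tool did not complete normally (exit code {proc.returncode})")
```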
## Logging
All processing activities are logged to `startpakket_processing.log` by default. Use `--verbose` for detailed debug information.
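Because `config.setup_logging` configures the root logger, the other modules only need the standard pattern below for their messages to end up in the same log file and console output:

```python
import logging

logger = logging.getLogger(__name__)  # inherits the handlers set up by config.setup_logging
logger.debug("Only emitted when --verbose is passed")
```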

startpakketten/cli_args.py Normal file

@@ -0,0 +1,65 @@
"""
Command-line argument parsing for the startpakket processing script.
"""
import argparse
import os
def validate_file_path(file_path: str) -> str:
"""Validate that the file exists and is an Excel file"""
if not os.path.exists(file_path):
raise argparse.ArgumentTypeError(f"File '{file_path}' does not exist")
if not file_path.lower().endswith(('.xlsx', '.xls')):
raise argparse.ArgumentTypeError(f"File '{file_path}' is not an Excel file (.xlsx or .xls)")
return file_path
def parse_arguments():
"""Parse command line arguments"""
parser = argparse.ArgumentParser(
description='Process and compare student data from predeliberation and dashboard Excel files',
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog="""
Examples:
%(prog)s --predelib db.xlsx --dashboard dashboard_inschrijvingen.xlsx
%(prog)s -p /path/to/predelib.xlsx -d /path/to/dashboard.xlsx --output results.json
%(prog)s --predelib db.xlsx --dashboard dashboard.xlsx --verbose
"""
)
parser.add_argument(
'--predelib', '-p',
type=validate_file_path,
required=True,
help='Path to the predeliberation Excel file (db.xlsx)'
)
parser.add_argument(
'--dashboard', '-d',
type=validate_file_path,
required=True,
help='Path to the dashboard Excel file (dashboard_inschrijvingen.xlsx)'
)
parser.add_argument(
'--output', '-o',
type=str,
help='Output file path for results (optional, prints to console if not specified)'
)
parser.add_argument(
'--verbose', '-v',
action='store_true',
help='Enable verbose logging'
)
parser.add_argument(
'--log-file',
type=str,
default='startpakket_processing.log',
help='Path to log file (default: startpakket_processing.log)'
)
return parser.parse_args()

startpakketten/config.py Normal file

@@ -0,0 +1,55 @@
"""
Configuration and logging setup for the startpakket processing application.
"""
import logging
import sys
from typing import Optional
def setup_logging(log_file: str = 'startpakket_processing.log', verbose: bool = False) -> logging.Logger:
"""
Configure logging for the application.
Args:
log_file: Path to the log file
verbose: Enable debug logging if True
Returns:
Configured logger instance
"""
# Remove existing handlers to avoid duplicates
for handler in logging.root.handlers[:]:
logging.root.removeHandler(handler)
# Set logging level
level = logging.DEBUG if verbose else logging.INFO
# Configure logging
logging.basicConfig(
level=level,
format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
handlers=[
logging.FileHandler(log_file),
logging.StreamHandler(sys.stdout)
]
)
logger = logging.getLogger(__name__)
if verbose:
logger.debug("Verbose logging enabled")
return logger
def get_exit_code(results: dict) -> int:
"""
Determine the appropriate exit code based on processing results.
Args:
results: Processing results dictionary
Returns:
Exit code (0 for success, 1 for mismatches found)
"""
return 0 if results.get('mismatches_count', 0) == 0 else 1

startpakketten/data_processor.py Normal file

@@ -0,0 +1,73 @@
"""
Core data processing functions for the startpakket processing script.
"""
import pandas as pd
import logging
from typing import Dict, Any, List
from checkheaders import check_headers_dashboard_inschrijvingenfile, check_headers_predelibfile
from process_predelib_file import check_students_with_fail_adviesrapport
from compare_sp import compare_sp_values
logger = logging.getLogger(__name__)
def process_files(predelib_path: str, dashboard_path: str, verbose: bool = False) -> Dict[str, Any]:
"""
Process the Excel files and return results.
Args:
predelib_path: Path to the predeliberation Excel file
dashboard_path: Path to the dashboard Excel file
verbose: Enable verbose logging
Returns:
Dictionary containing processing results
Raises:
Exception: If file processing fails
"""
try:
# Read Excel files
logger.info(f"Reading predeliberation file: {predelib_path}")
df_predelib = pd.read_excel(predelib_path)
logger.info(f"Predelib file loaded successfully. Shape: {df_predelib.shape}")
logger.info(f"Reading dashboard file: {dashboard_path}")
df_dashboard = pd.read_excel(dashboard_path)
logger.info(f"Dashboard file loaded successfully. Shape: {df_dashboard.shape}")
# Process the dataframes
logger.info("Processing predeliberation file headers")
processed_predelib_df = check_headers_predelibfile(df_predelib)
logger.info("Processing dashboard file headers")
processed_dashboard_df = check_headers_dashboard_inschrijvingenfile(df_dashboard)
# Check the predeliberation file for students with a fail in 'Adviesrapport code'
logger.info("Checking for students with FAIL status in predeliberation file")
students_with_fail = check_students_with_fail_adviesrapport(processed_predelib_df)
# Compare SP values
logger.info("Comparing SP values between files")
mismatches = compare_sp_values(processed_predelib_df, processed_dashboard_df)
# Prepare results
results = {
'predelib_file': predelib_path,
'dashboard_file': dashboard_path,
'predelib_records': len(processed_predelib_df),
'dashboard_records': len(processed_dashboard_df),
'students_with_fail_count': len(students_with_fail),
'students_with_fail': students_with_fail,
'mismatches_count': len(mismatches),
'mismatches': mismatches,
'status': 'completed'
}
logger.info(f"Processing completed successfully. Found {len(mismatches)} mismatches.")
return results
except Exception as e:
logger.error(f"Error processing files: {e}")
raise

startpakketten/file_utils.py Normal file

@@ -0,0 +1,49 @@
"""
File I/O utilities and output formatting for the startpakket processing script.
"""
import json
import logging
from typing import Dict, Any
from process_predelib_file import print_students_with_fail_ar_summary
logger = logging.getLogger(__name__)
def save_results(results: Dict[str, Any], output_path: str) -> None:
"""Save results to a JSON file"""
try:
with open(output_path, 'w', encoding='utf-8') as f:
json.dump(results, f, indent=2, ensure_ascii=False)
logger.info(f"Results saved to: {output_path}")
except Exception as e:
logger.error(f"Error saving results to {output_path}: {e}")
raise
def print_summary(results: Dict[str, Any]) -> None:
"""Print a summary of the results to console"""
print(f"\n{'='*60}")
print("STARTPAKKET PROCESSING SUMMARY")
print(f"{'='*60}")
print(f"Predelib file: {results['predelib_file']}")
print(f"Dashboard file: {results['dashboard_file']}")
print(f"Predelib records processed: {results['predelib_records']}")
print(f"Dashboard records processed: {results['dashboard_records']}")
print(f"Students with FAIL adviesrapport found: {results['students_with_fail_count']}")
print(f"Mismatches found: {results['mismatches_count']}")
if results['students_with_fail_count'] > 0:
print_students_with_fail_ar_summary(results['students_with_fail'], results['predelib_file'])
if results['mismatches']:
print(f"\nDetailed mismatches between SP predeliberatierapport and Dashboard Inschrijvingen:")
for mismatch in results['mismatches']:
print(f"Mismatch - ID {mismatch['ID']} ({mismatch['Name']}): "
f"Predeliberatierapport SP={mismatch['Predelib_SP']}, "
f"Dashboard Inschrijvingen SP={mismatch['Dashboard_SP']}")
else:
print("\n✅ All SP values match perfectly!")
print(f"Status: {results['status']}")
print(f"{'='*60}")

startpakketten/script.py

@@ -1,209 +1,49 @@
import pandas as pd
import argparse
import logging
"""
Main script for processing and comparing student data from predeliberation and dashboard Excel files.
"""
import sys
import os
from pathlib import Path
import logging
from checkheaders import check_headers_dashboard_inschrijvingenfile, check_headers_predelibfile
from process_predelib_file import check_students_with_fail_adviesrapport, print_students_with_fail_ar_summary
from compare_sp import compare_sp_values
# Configure logging
logging.basicConfig(
level=logging.INFO,
format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
handlers=[
logging.FileHandler('startpakket_processing.log'),
logging.StreamHandler()
]
)
logger = logging.getLogger(__name__)
def validate_file_path(file_path: str) -> str:
"""Validate that the file exists and is an Excel file"""
if not os.path.exists(file_path):
raise argparse.ArgumentTypeError(f"File '{file_path}' does not exist")
if not file_path.lower().endswith(('.xlsx', '.xls')):
raise argparse.ArgumentTypeError(f"File '{file_path}' is not an Excel file (.xlsx or .xls)")
return file_path
def parse_arguments():
"""Parse command line arguments"""
parser = argparse.ArgumentParser(
description='Process and compare student data from predeliberation and dashboard Excel files',
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog="""
Examples:
%(prog)s --predelib db.xlsx --dashboard dashboard_inschrijvingen.xlsx
%(prog)s -p /path/to/predelib.xlsx -d /path/to/dashboard.xlsx --output results.json
%(prog)s --predelib db.xlsx --dashboard dashboard.xlsx --verbose
"""
)
parser.add_argument(
'--predelib', '-p',
type=validate_file_path,
required=True,
help='Path to the predeliberation Excel file (db.xlsx)'
)
parser.add_argument(
'--dashboard', '-d',
type=validate_file_path,
required=True,
help='Path to the dashboard Excel file (dashboard_inschrijvingen.xlsx)'
)
parser.add_argument(
'--output', '-o',
type=str,
help='Output file path for results (optional, prints to console if not specified)'
)
parser.add_argument(
'--verbose', '-v',
action='store_true',
help='Enable verbose logging'
)
parser.add_argument(
'--log-file',
type=str,
default='startpakket_processing.log',
help='Path to log file (default: startpakket_processing.log)'
)
return parser.parse_args()
def process_files(predelib_path: str, dashboard_path: str, verbose: bool = False):
"""Process the Excel files and return results"""
try:
# Read Excel files
logger.info(f"Reading predeliberation file: {predelib_path}")
df_predelib = pd.read_excel(predelib_path)
logger.info(f"Predelib file loaded successfully. Shape: {df_predelib.shape}")
logger.info(f"Reading dashboard file: {dashboard_path}")
df_dashboard = pd.read_excel(dashboard_path)
logger.info(f"Dashboard file loaded successfully. Shape: {df_dashboard.shape}")
# Process the dataframes
logger.info("Processing predeliberation file headers")
processed_predelib_df = check_headers_predelibfile(df_predelib)
logger.info("Processing dashboard file headers")
processed_dashboard_df = check_headers_dashboard_inschrijvingenfile(df_dashboard)
# Check the predeliberation file for students with a fail in 'Adviesrapport code'
logger.info("Checking for students with FAIL status in predeliberation file")
students_with_fail = check_students_with_fail_adviesrapport(processed_predelib_df)
# Compare SP values
logger.info("Comparing SP values between files")
mismatches = compare_sp_values(processed_predelib_df, processed_dashboard_df)
# Prepare results
results = {
'predelib_file': predelib_path,
'dashboard_file': dashboard_path,
'predelib_records': len(processed_predelib_df),
'dashboard_records': len(processed_dashboard_df),
'students_with_fail_count': len(students_with_fail),
'students_with_fail': students_with_fail,
'mismatches_count': len(mismatches),
'mismatches': mismatches,
'status': 'completed'
}
logger.info(f"Processing completed successfully. Found {len(mismatches)} mismatches.")
return results
except Exception as e:
logger.error(f"Error processing files: {e}")
raise
def save_results(results: dict, output_path: str):
"""Save results to a file"""
try:
import json
with open(output_path, 'w', encoding='utf-8') as f:
json.dump(results, f, indent=2, ensure_ascii=False)
logger.info(f"Results saved to: {output_path}")
except Exception as e:
logger.error(f"Error saving results to {output_path}: {e}")
raise
def print_summary(results: dict):
"""Print a summary of the results to console"""
print(f"\n{'='*60}")
print("STARTPAKKET PROCESSING SUMMARY")
print(f"{'='*60}")
print(f"Predelib file: {results['predelib_file']}")
print(f"Dashboard file: {results['dashboard_file']}")
print(f"Predelib records processed: {results['predelib_records']}")
print(f"Dashboard records processed: {results['dashboard_records']}")
print(f"Students with FAIL adviesrapport found: {results['students_with_fail_count']}")
print(f"Mismatches found: {results['mismatches_count']}")
if results['students_with_fail_count'] > 0:
print_students_with_fail_ar_summary(results['students_with_fail'], results['predelib_file'])
if results['mismatches']:
print(f"\nDetailed mismatches between SP predeliberatierapport and Dashboard Inschrijvingen:")
for mismatch in results['mismatches']:
print(f"Mismatch - ID {mismatch['ID']} ({mismatch['Name']}): Predeliberatierapport SP={mismatch['Predelib_SP']}, Dashboard Inschrijvingen SP={mismatch['Dashboard_SP']}")
else:
print("\n✅ All SP values match perfectly!")
print(f"Status: {results['status']}")
print(f"{'='*60}")
from cli_args import parse_arguments
from config import setup_logging, get_exit_code
from data_processor import process_files
from file_utils import save_results, print_summary
def main():
"""Main function"""
"""Main function - orchestrates the entire processing pipeline"""
try:
# Parse arguments
# Parse command-line arguments
args = parse_arguments()
# Configure logging level
if args.verbose:
logging.getLogger().setLevel(logging.DEBUG)
logger.debug("Verbose logging enabled")
# Set up logging configuration
logger = setup_logging(args.log_file, args.verbose)
logger.info("Starting startpakket processing")
logger.info(f"Predelib file: {args.predelib}")
logger.info(f"Dashboard file: {args.dashboard}")
# Process files
# Process the Excel files
results = process_files(args.predelib, args.dashboard, args.verbose)
# Save results if output path specified
# Save results to file if specified
if args.output:
save_results(results, args.output)
# Print summary
# Print summary to console
print_summary(results)
# Exit with appropriate code
exit_code = 0 if results['mismatches_count'] == 0 else 1
exit_code = get_exit_code(results)
logger.info(f"Processing completed with exit code: {exit_code}")
sys.exit(exit_code)
except KeyboardInterrupt:
logger.info("Processing interrupted by user")
print("\nProcessing interrupted by user")
sys.exit(130)
except Exception as e:
logger.error(f"Fatal error: {e}")
print(f"Error: {e}")
print(f"Fatal error: {e}")
sys.exit(1)