ows-master/examen dubbels
bdaneels b021eabdab Refactor and document code; add new files
Refactored `script.py` by adding detailed docstrings and organizing functions. Created `.idea` configuration files and `gotodashboard.js` for `sisa_crawl` project. Added `readme.md` files with usage instructions and context for multiple scripts, and set up `package.json` for `sisa_crawl` dependencies.
2024-11-18 14:03:25 +01:00
..
.idea Normalize line endings 2024-11-18 11:26:32 +01:00
readme.md Refactor and document code; add new files 2024-11-18 14:03:25 +01:00
script.py Refactor and document code; add new files 2024-11-18 14:03:25 +01:00

Duplicate Detection in Excel Sheets with Pandas

This project provides a Python script to detect and list duplicate values in a specified column of an Excel sheet. The script leverages the pandas library to perform data manipulation and analysis.

It is useful for the OWS because it can easily check Student-ID doubles in exam groups of different exams. For example: the author used it to check whether or not some students had two exams on the same date, an oral and a written one. I pasted the tables of two excel files under each other in one file and then used this script to check if the issue of exam overlaps was resolved or not. But really it can be used to check for doubles in any other situation if you manipulate the variables.

Table of Contents

Requirements

  • Python 3.6 or higher
  • pandas library

Installation

  1. Clone the repository:

    git clone https://github.com/your-username/your-repository.git
    cd your-repository
    
  2. Set up a virtual environment (optional but recommended):

    python -m venv venv
    source venv/bin/activate # On Windows: venv\Scripts\activate
    
  3. Install the required libraries:

    pip install pandas
    

Usage

  1. Ensure your Excel file is in the same directory as the script or provide an absolute path to the file.

  2. Update the file_path, sheet_name, and column_name variables in the script to match your file details.

  3. Run the script:

    python script.py
    

Example

import pandas as pd

# Variables
file_path = 'ps (30).xlsx'
sheet_name = 'ps (30)'
column_name = 'Student-ID'

# Read the Excel file and specified sheet into a DataFrame
df = pd.read_excel(file_path, sheet_name=sheet_name)

# Find duplicated entries in the specified column
duplicate_ids = df[df.duplicated(subset=[column_name], keep=False)][column_name]

# Drop duplicate values to get unique duplicate IDs
unique_duplicate_ids = duplicate_ids.drop_duplicates()

# Count the number of unique duplicate IDs
num_duplicates = len(unique_duplicate_ids)

# Print the results
if not unique_duplicate_ids.empty:
    print(f"Duplicated Student-ID values (count: {num_duplicates}) :")
    print(unique_duplicate_ids)
else:
    print("No duplicates found.")

How It Works

  1. Import the pandas library.
  2. Set file variables: Define the path to the Excel file, the sheet name, and the column to check for duplicates.
  3. Read the Excel file: Load the specified sheet into a DataFrame.
  4. Identify duplicates: Use the df.duplicated() method to find and filter duplicate entries in the specified column.
  5. Get unique duplicates: Remove duplicate values to find unique duplicate IDs.
  6. Count duplicates: Calculate the number of unique duplicate IDs.
  7. Print results: Display the count and the actual duplicate IDs, if any.

Contributing

Contributions are welcome! Please follow these steps:

  1. Fork the repository.
  2. Create a new branch (git checkout -b feature-branch).
  3. Make your changes.
  4. Commit your changes (git commit -m 'Add some feature').
  5. Push to the branch (git push origin feature-branch).
  6. Open a pull request.

License

This project is licensed under the MIT License. See the LICENSE file for details.