# Duplicate Detection in Excel Sheets with Pandas
This project provides a Python script to detect and list duplicate values in a
specified column of an Excel sheet, using the `pandas` library for data manipulation and analysis.
It is useful for the OWS because it can quickly check for duplicate Student-IDs across the groups of different exams.
For example, the author used it to check whether some students had two exams (one oral, one written) on the same date:
the tables from two Excel files were pasted under each other in a single file, and the script then showed
whether the exam-overlap issue had been resolved. By changing the variables, it can check
for duplicates in any other column and situation as well.
## Table of Contents
- [Requirements](#requirements)
- [Installation](#installation)
- [Usage](#usage)
- [How It Works](#how-it-works)
- [Contributing](#contributing)
- [License](#license)
## Requirements
- Python 3.6 or higher
- `pandas` library
## Installation
1. **Clone the repository**:
```bash
git clone https://github.com/your-username/your-repository.git
cd your-repository
```
2. **Set up a virtual environment (optional but recommended)**:
```bash
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
```
3. **Install the required libraries**:
```bash
pip install pandas
```
## Usage
1. Ensure your Excel file is in the same directory as the script or provide an absolute path to the file.
2. Update the `file_path`, `sheet_name`, and `column_name` variables in the script to match your file details.
3. Run the script:
```bash
python script.py
```
### Example
```python
import pandas as pd

# Variables — adjust these to match your file
file_path = 'ps (30).xlsx'
sheet_name = 'ps (30)'
column_name = 'Student-ID'

# Read the specified sheet of the Excel file into a DataFrame
df = pd.read_excel(file_path, sheet_name=sheet_name)

# Find all rows whose value in the column occurs more than once
duplicate_ids = df[df.duplicated(subset=[column_name], keep=False)][column_name]

# Drop repeats to get each duplicated ID exactly once
unique_duplicate_ids = duplicate_ids.drop_duplicates()

# Count the number of distinct duplicated IDs
num_duplicates = len(unique_duplicate_ids)

# Print the results
if not unique_duplicate_ids.empty:
    print(f"Duplicated {column_name} values (count: {num_duplicates}):")
    print(unique_duplicate_ids)
else:
    print("No duplicates found.")
```
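If you also want to know *how often* each ID occurs, not just which ones are duplicated, a small variation using `value_counts` can report per-ID counts. The sketch below uses an in-memory DataFrame with made-up IDs as a stand-in for the Excel sheet; in practice you would build `df` with `pd.read_excel` as above.

```python
import pandas as pd

# In-memory stand-in for the Excel sheet (hypothetical IDs);
# in practice: df = pd.read_excel(file_path, sheet_name=sheet_name)
df = pd.DataFrame({"Student-ID": [101, 102, 103, 101, 104, 102, 101]})

# Count occurrences of each ID, then keep only those appearing more than once
counts = df["Student-ID"].value_counts()
duplicate_counts = counts[counts > 1]

print(duplicate_counts)  # 101 occurs 3 times, 102 occurs 2 times
```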
## How It Works
1. **Import the `pandas` library**.
2. **Set file variables**: Define the path to the Excel file, the sheet name, and the column to check for duplicates.
3. **Read the Excel file**: Load the specified sheet into a DataFrame.
4. **Identify duplicates**: Use the `df.duplicated()` method to find and filter duplicate entries in the specified column.
5. **Get unique duplicates**: Remove duplicate values to find unique duplicate IDs.
6. **Count duplicates**: Calculate the number of unique duplicate IDs.
7. **Print results**: Display the count and the actual duplicate IDs, if any.
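The steps above can also be wrapped in a small reusable function, so the same check can be run on several sheets or columns without editing the script each time. This is a sketch, not part of the shipped script; the function name and the in-memory example DataFrame are illustrative.

```python
import pandas as pd

def find_duplicate_ids(df: pd.DataFrame, column: str) -> pd.Series:
    """Return each value that occurs more than once in `column`, listed once."""
    # keep=False marks *every* row of a duplicated value, not just the repeats
    dupes = df[df.duplicated(subset=[column], keep=False)][column]
    return dupes.drop_duplicates()

# Example with an in-memory DataFrame standing in for an Excel sheet
df = pd.DataFrame({"Student-ID": [1, 2, 2, 3, 3, 3]})
result = find_duplicate_ids(df, "Student-ID")
print(list(result))  # [2, 3]
```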
## Contributing
Contributions are welcome! Please follow these steps:
1. Fork the repository.
2. Create a new branch (`git checkout -b feature-branch`).
3. Make your changes.
4. Commit your changes (`git commit -m 'Add some feature'`).
5. Push to the branch (`git push origin feature-branch`).
6. Open a pull request.
## License
This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.