109 lines
3.4 KiB
Markdown
109 lines
3.4 KiB
Markdown
|
# Duplicate Detection in Excel Sheets with Pandas
|
||
|
|
||
|
This project provides a Python script to detect and list duplicate values in a
|
||
|
specified column of an Excel sheet. The script leverages the `pandas` library to perform data manipulation and analysis.
|
||
|
|
||
|
It is useful for the OWS because it can easily check Student-ID doubles in exam groups of different exams. For example:
|
||
|
the author used it to check whether or not some students had two exams on the same date, an oral and a written one. I pasted
|
||
|
the tables of two excel files under each other in one file and then used
|
||
|
this script to check if the issue of exam overlaps was resolved or not. But really it can be used
|
||
|
to check for doubles in any other situation if you manipulate the variables.
|
||
|
|
||
|
## Table of Contents
|
||
|
|
||
|
- [Requirements](#requirements)
|
||
|
- [Installation](#installation)
|
||
|
- [Usage](#usage)
|
||
|
- [How It Works](#how-it-works)
|
||
|
- [Contributing](#contributing)
|
||
|
- [License](#license)
|
||
|
|
||
|
## Requirements
|
||
|
|
||
|
- Python 3.6 or higher
|
||
|
- `pandas` library
|
||
|
|
||
|
## Installation
|
||
|
|
||
|
1. **Clone the repository**:
|
||
|
```bash
|
||
|
git clone https://github.com/your-username/your-repository.git
|
||
|
cd your-repository
|
||
|
```
|
||
|
|
||
|
2. **Set up a virtual environment (optional but recommended)**:
|
||
|
```bash
|
||
|
python -m venv venv
|
||
|
source venv/bin/activate # On Windows: venv\Scripts\activate
|
||
|
```
|
||
|
|
||
|
3. **Install the required libraries**:
|
||
|
```bash
|
||
|
pip install pandas
|
||
|
```
|
||
|
|
||
|
## Usage
|
||
|
|
||
|
1. Ensure your Excel file is in the same directory as the script or provide an absolute path to the file.
|
||
|
|
||
|
2. Update the `file_path`, `sheet_name`, and `column_name` variables in the script to match your file details.
|
||
|
|
||
|
3. Run the script:
|
||
|
```bash
|
||
|
python script.py
|
||
|
```
|
||
|
|
||
|
### Example
|
||
|
|
||
|
```python
|
||
|
import pandas as pd
|
||
|
|
||
|
# Variables
|
||
|
file_path = 'ps (30).xlsx'
|
||
|
sheet_name = 'ps (30)'
|
||
|
column_name = 'Student-ID'
|
||
|
|
||
|
# Read the Excel file and specified sheet into a DataFrame
|
||
|
df = pd.read_excel(file_path, sheet_name=sheet_name)
|
||
|
|
||
|
# Find duplicated entries in the specified column
|
||
|
duplicate_ids = df[df.duplicated(subset=[column_name], keep=False)][column_name]
|
||
|
|
||
|
# Drop duplicate values to get unique duplicate IDs
|
||
|
unique_duplicate_ids = duplicate_ids.drop_duplicates()
|
||
|
|
||
|
# Count the number of unique duplicate IDs
|
||
|
num_duplicates = len(unique_duplicate_ids)
|
||
|
|
||
|
# Print the results
|
||
|
if not unique_duplicate_ids.empty:
|
||
|
print(f"Duplicated Student-ID values (count: {num_duplicates}) :")
|
||
|
print(unique_duplicate_ids)
|
||
|
else:
|
||
|
print("No duplicates found.")
|
||
|
```
|
||
|
|
||
|
## How It Works
|
||
|
|
||
|
1. **Import the `pandas` library**.
|
||
|
2. **Set file variables**: Define the path to the Excel file, the sheet name, and the column to check for duplicates.
|
||
|
3. **Read the Excel file**: Load the specified sheet into a DataFrame.
|
||
|
4. **Identify duplicates**: Use the `df.duplicated()` method to find and filter duplicate entries in the specified column.
|
||
|
5. **Get unique duplicates**: Remove duplicate values to find unique duplicate IDs.
|
||
|
6. **Count duplicates**: Calculate the number of unique duplicate IDs.
|
||
|
7. **Print results**: Display the count and the actual duplicate IDs, if any.
|
||
|
|
||
|
## Contributing
|
||
|
|
||
|
Contributions are welcome! Please follow these steps:
|
||
|
|
||
|
1. Fork the repository.
|
||
|
2. Create a new branch (`git checkout -b feature-branch`).
|
||
|
3. Make your changes.
|
||
|
4. Commit your changes (`git commit -m 'Add some feature'`).
|
||
|
5. Push to the branch (`git push origin feature-branch`).
|
||
|
6. Open a pull request.
|
||
|
|
||
|
## License
|
||
|
|
||
|
This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.
|