# Duplicate Detection in Excel Sheets with Pandas
This project provides a Python script to detect and list duplicate values in a
specified column of an Excel sheet, using the `pandas` library for data manipulation and analysis.
It is useful for the OWS because it can quickly check for duplicate Student-IDs across the groups of different exams.
For example, the author used it to check whether some students had two exams (one oral, one written) on the same date:
the tables from two Excel files were pasted under each other in a single file, and the script then showed
whether the exam-overlap issue had been resolved. By changing the variables, it can check
for duplicates in any other column and situation as well.
## Table of Contents
- [Requirements](#requirements)
- [Installation](#installation)
- [Usage](#usage)
- [How It Works](#how-it-works)
- [Contributing](#contributing)
- [License](#license)
## Requirements
- Python 3.6 or higher
- `pandas` library
## Installation
1. **Clone the repository**:
```bash
git clone https://github.com/your-username/your-repository.git
cd your-repository
```
2. **Set up a virtual environment (optional but recommended)**:
```bash
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
```
3. **Install the required libraries**:
```bash
pip install pandas
```
## Usage
1. Ensure your Excel file is in the same directory as the script or provide an absolute path to the file.
2. Update the `file_path`, `sheet_name`, and `column_name` variables in the script to match your file details.
3. Run the script:
```bash
python script.py
```
### Example
```python
import pandas as pd

# Variables — adjust these to match your file
file_path = 'ps (30).xlsx'
sheet_name = 'ps (30)'
column_name = 'Student-ID'

# Read the specified sheet of the Excel file into a DataFrame
df = pd.read_excel(file_path, sheet_name=sheet_name)

# Find all rows whose value in the column occurs more than once
duplicate_ids = df[df.duplicated(subset=[column_name], keep=False)][column_name]

# Drop repeats to get each duplicated ID exactly once
unique_duplicate_ids = duplicate_ids.drop_duplicates()

# Count the number of distinct duplicated IDs
num_duplicates = len(unique_duplicate_ids)

# Print the results
if not unique_duplicate_ids.empty:
    print(f"Duplicated {column_name} values (count: {num_duplicates}):")
    print(unique_duplicate_ids)
else:
    print("No duplicates found.")
```
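If you also want to know *how often* each ID occurs, not just which ones are duplicated, a small variation using `value_counts` can report per-ID counts. The sketch below uses an in-memory DataFrame with made-up IDs as a stand-in for the Excel sheet; in practice you would build `df` with `pd.read_excel` as above.

```python
import pandas as pd

# In-memory stand-in for the Excel sheet (hypothetical IDs);
# in practice: df = pd.read_excel(file_path, sheet_name=sheet_name)
df = pd.DataFrame({"Student-ID": [101, 102, 103, 101, 104, 102, 101]})

# Count occurrences of each ID, then keep only those appearing more than once
counts = df["Student-ID"].value_counts()
duplicate_counts = counts[counts > 1]

print(duplicate_counts)  # 101 occurs 3 times, 102 occurs 2 times
```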
## How It Works
1. **Import the `pandas` library**.
2. **Set file variables**: Define the path to the Excel file, the sheet name, and the column to check for duplicates.
3. **Read the Excel file**: Load the specified sheet into a DataFrame.
4. **Identify duplicates**: Use the `df.duplicated()` method to find and filter duplicate entries in the specified column.
5. **Get unique duplicates**: Remove duplicate values to find unique duplicate IDs.
6. **Count duplicates**: Calculate the number of unique duplicate IDs.
7. **Print results**: Display the count and the actual duplicate IDs, if any.
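The steps above can also be wrapped in a small reusable function, so the same check can be run on several sheets or columns without editing the script each time. This is a sketch, not part of the shipped script; the function name and the in-memory example DataFrame are illustrative.

```python
import pandas as pd

def find_duplicate_ids(df: pd.DataFrame, column: str) -> pd.Series:
    """Return each value that occurs more than once in `column`, listed once."""
    # keep=False marks *every* row of a duplicated value, not just the repeats
    dupes = df[df.duplicated(subset=[column], keep=False)][column]
    return dupes.drop_duplicates()

# Example with an in-memory DataFrame standing in for an Excel sheet
df = pd.DataFrame({"Student-ID": [1, 2, 2, 3, 3, 3]})
result = find_duplicate_ids(df, "Student-ID")
print(list(result))  # [2, 3]
```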
## Contributing
Contributions are welcome! Please follow these steps:
1. Fork the repository.
2. Create a new branch (`git checkout -b feature-branch`).
3. Make your changes.
4. Commit your changes (`git commit -m 'Add some feature'`).
5. Push to the branch (`git push origin feature-branch`).
6. Open a pull request.
## License
This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.