Refactor and document code; add new files
Refactored `script.py` by adding detailed docstrings and organizing functions. Created `.idea` configuration files and `gotodashboard.js` for `sisa_crawl` project. Added `readme.md` files with usage instructions and context for multiple scripts, and set up `package.json` for `sisa_crawl` dependencies.
This commit is contained in:
109
examen dubbels/readme.md
Normal file
109
examen dubbels/readme.md
Normal file
@@ -0,0 +1,109 @@
|
||||
# Duplicate Detection in Excel Sheets with Pandas
|
||||
|
||||
This project provides a Python script to detect and list duplicate values in a
|
||||
specified column of an Excel sheet. The script leverages the `pandas` library to perform data manipulation and analysis.
|
||||
|
||||
It is useful for the OWS because it can easily check Student-ID doubles in exam groups of different exams. For example:
|
||||
the author used it to check whether or not some students had two exams on the same date, an oral and a written one. I pasted
|
||||
the tables of two excel files under each other in one file and then used
|
||||
this script to check if the issue of exam overlaps was resolved or not. But really it can be used
|
||||
to check for doubles in any other situation if you manipulate the variables.
|
||||
|
||||
## Table of Contents
|
||||
|
||||
- [Requirements](#requirements)
|
||||
- [Installation](#installation)
|
||||
- [Usage](#usage)
|
||||
- [How It Works](#how-it-works)
|
||||
- [Contributing](#contributing)
|
||||
- [License](#license)
|
||||
|
||||
## Requirements
|
||||
|
||||
- Python 3.6 or higher
|
||||
- `pandas` library
|
||||
|
||||
## Installation
|
||||
|
||||
1. **Clone the repository**:
|
||||
```bash
|
||||
git clone https://github.com/your-username/your-repository.git
|
||||
cd your-repository
|
||||
```
|
||||
|
||||
2. **Set up a virtual environment (optional but recommended)**:
|
||||
```bash
|
||||
python -m venv venv
|
||||
source venv/bin/activate # On Windows: venv\Scripts\activate
|
||||
```
|
||||
|
||||
3. **Install the required libraries**:
|
||||
```bash
|
||||
pip install pandas
|
||||
```
|
||||
|
||||
## Usage
|
||||
|
||||
1. Ensure your Excel file is in the same directory as the script or provide an absolute path to the file.
|
||||
|
||||
2. Update the `file_path`, `sheet_name`, and `column_name` variables in the script to match your file details.
|
||||
|
||||
3. Run the script:
|
||||
```bash
|
||||
python script.py
|
||||
```
|
||||
|
||||
### Example
|
||||
|
||||
```python
|
||||
import pandas as pd
|
||||
|
||||
# Variables
|
||||
file_path = 'ps (30).xlsx'
|
||||
sheet_name = 'ps (30)'
|
||||
column_name = 'Student-ID'
|
||||
|
||||
# Read the Excel file and specified sheet into a DataFrame
|
||||
df = pd.read_excel(file_path, sheet_name=sheet_name)
|
||||
|
||||
# Find duplicated entries in the specified column
|
||||
duplicate_ids = df[df.duplicated(subset=[column_name], keep=False)][column_name]
|
||||
|
||||
# Drop duplicate values to get unique duplicate IDs
|
||||
unique_duplicate_ids = duplicate_ids.drop_duplicates()
|
||||
|
||||
# Count the number of unique duplicate IDs
|
||||
num_duplicates = len(unique_duplicate_ids)
|
||||
|
||||
# Print the results
|
||||
if not unique_duplicate_ids.empty:
|
||||
print(f"Duplicated Student-ID values (count: {num_duplicates}) :")
|
||||
print(unique_duplicate_ids)
|
||||
else:
|
||||
print("No duplicates found.")
|
||||
```
|
||||
|
||||
## How It Works
|
||||
|
||||
1. **Import the `pandas` library**.
|
||||
2. **Set file variables**: Define the path to the Excel file, the sheet name, and the column to check for duplicates.
|
||||
3. **Read the Excel file**: Load the specified sheet into a DataFrame.
|
||||
4. **Identify duplicates**: Use the `df.duplicated()` method to find and filter duplicate entries in the specified column.
|
||||
5. **Get unique duplicates**: Remove duplicate values to find unique duplicate IDs.
|
||||
6. **Count duplicates**: Calculate the number of unique duplicate IDs.
|
||||
7. **Print results**: Display the count and the actual duplicate IDs, if any.
|
||||
|
||||
## Contributing
|
||||
|
||||
Contributions are welcome! Please follow these steps:
|
||||
|
||||
1. Fork the repository.
|
||||
2. Create a new branch (`git checkout -b feature-branch`).
|
||||
3. Make your changes.
|
||||
4. Commit your changes (`git commit -m 'Add some feature'`).
|
||||
5. Push to the branch (`git push origin feature-branch`).
|
||||
6. Open a pull request.
|
||||
|
||||
## License
|
||||
|
||||
This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.
|
||||
@@ -1,20 +1,25 @@
|
||||
import pandas as pd
|
||||
|
||||
#variables
|
||||
file_path = 'ps (30).xlsx'
|
||||
sheet_name = 'ps (30)'
|
||||
column_name = 'Student-ID'
|
||||
# Constants
|
||||
FILE_PATH = 'ps (30).xlsx'
|
||||
SHEET_NAME = 'ps (30)'
|
||||
COLUMN_NAME = 'Student-ID'
|
||||
|
||||
df = pd.read_excel(file_path, sheet_name=sheet_name)
|
||||
def find_duplicates(file_path, sheet_name, column_name):
|
||||
df = pd.read_excel(file_path, sheet_name=sheet_name)
|
||||
duplicate_ids = df[df.duplicated(subset=[column_name], keep=False)][column_name]
|
||||
unique_duplicate_ids = duplicate_ids.drop_duplicates()
|
||||
return unique_duplicate_ids
|
||||
|
||||
duplicate_ids= df[df.duplicated(subset=[column_name], keep=False)][column_name]
|
||||
def main():
|
||||
unique_duplicate_ids = find_duplicates(FILE_PATH, SHEET_NAME, COLUMN_NAME)
|
||||
num_duplicates = len(unique_duplicate_ids)
|
||||
|
||||
unique_duplicate_ids = duplicate_ids.drop_duplicates()
|
||||
if not unique_duplicate_ids.empty:
|
||||
print(f"Duplicated {COLUMN_NAME} values (count: {num_duplicates}):")
|
||||
print(unique_duplicate_ids)
|
||||
else:
|
||||
print("No duplicates found.")
|
||||
|
||||
num_duplicates = len(unique_duplicate_ids)
|
||||
|
||||
if not unique_duplicate_ids.empty:
|
||||
print(f"Duplicated Student-ID values (count: {num_duplicates}) :")
|
||||
print(unique_duplicate_ids)
|
||||
else:
|
||||
print("No duplicates found.")
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
Reference in New Issue
Block a user