Refactor and document code; add new files

Refactored `script.py` by adding detailed docstrings and organizing functions. Created `.idea` configuration files and `gotodashboard.js` for `sisa_crawl` project. Added `readme.md` files with usage instructions and context for multiple scripts, and set up `package.json` for `sisa_crawl` dependencies.
2024-11-18 14:03:25 +01:00
parent e3e65a9c51
commit b021eabdab
6 changed files with 396 additions and 33 deletions
--- a/dubbels/readme.md
+++ b/dubbels/readme.md
@@ -0,0 +1,109 @@
+# Duplicate Detection in Excel Sheets with Pandas
+
+This project provides a Python script to detect and list duplicate values in a 
+specified column of an Excel sheet. The script leverages the `pandas` library to perform data manipulation and analysis.
+
+It is useful for the OWS because it can easily check Student-ID doubles in exam groups of different exams. For example: 
+the author used it to check whether or not some students had two exams on the same date, an oral and a written one. I pasted 
+the tables of two excel files under each other in one file and then used
+this script to check if the issue of exam overlaps was resolved or not. But really it can be used
+to check for doubles in any other situation if you manipulate the variables.
+
+## Table of Contents
+
+- [Requirements](#requirements)
+- [Installation](#installation)
+- [Usage](#usage)
+- [How It Works](#how-it-works)
+- [Contributing](#contributing)
+- [License](#license)
+
+## Requirements
+
+- Python 3.6 or higher
+- `pandas` library
+
+## Installation
+
+1. **Clone the repository**:
+   ```bash
+   git clone https://github.com/your-username/your-repository.git
+   cd your-repository
+   ```
+
+2. **Set up a virtual environment (optional but recommended)**:
+   ```bash
+   python -m venv venv
+   source venv/bin/activate # On Windows: venv\Scripts\activate
+   ```
+
+3. **Install the required libraries**:
+   ```bash
+   pip install pandas
+   ```
+
+## Usage
+
+1. Ensure your Excel file is in the same directory as the script or provide an absolute path to the file.
+
+2. Update the `file_path`, `sheet_name`, and `column_name` variables in the script to match your file details.
+
+3. Run the script:
+   ```bash
+   python script.py
+   ```
+
+### Example
+
+```python
+import pandas as pd
+
+# Variables
+file_path = 'ps (30).xlsx'
+sheet_name = 'ps (30)'
+column_name = 'Student-ID'
+
+# Read the Excel file and specified sheet into a DataFrame
+df = pd.read_excel(file_path, sheet_name=sheet_name)
+
+# Find duplicated entries in the specified column
+duplicate_ids = df[df.duplicated(subset=[column_name], keep=False)][column_name]
+
+# Drop duplicate values to get unique duplicate IDs
+unique_duplicate_ids = duplicate_ids.drop_duplicates()
+
+# Count the number of unique duplicate IDs
+num_duplicates = len(unique_duplicate_ids)
+
+# Print the results
+if not unique_duplicate_ids.empty:
+    print(f"Duplicated Student-ID values (count: {num_duplicates}) :")
+    print(unique_duplicate_ids)
+else:
+    print("No duplicates found.")
+```
+
+## How It Works
+
+1. **Import the `pandas` library**.
+2. **Set file variables**: Define the path to the Excel file, the sheet name, and the column to check for duplicates.
+3. **Read the Excel file**: Load the specified sheet into a DataFrame.
+4. **Identify duplicates**: Use the `df.duplicated()` method to find and filter duplicate entries in the specified column.
+5. **Get unique duplicates**: Remove duplicate values to find unique duplicate IDs.
+6. **Count duplicates**: Calculate the number of unique duplicate IDs.
+7. **Print results**: Display the count and the actual duplicate IDs, if any.
+
+## Contributing
+
+Contributions are welcome! Please follow these steps:
+
+1. Fork the repository.
+2. Create a new branch (`git checkout -b feature-branch`).
+3. Make your changes.
+4. Commit your changes (`git commit -m 'Add some feature'`).
+5. Push to the branch (`git push origin feature-branch`).
+6. Open a pull request.
+
+## License
+
+This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.
--- a/dubbels/script.py
+++ b/dubbels/script.py
@@ -1,20 +1,25 @@
 import pandas as pd

-#variables
-file_path = 'ps (30).xlsx'
-sheet_name = 'ps (30)'
-column_name = 'Student-ID'
+# Constants
+FILE_PATH = 'ps (30).xlsx'
+SHEET_NAME = 'ps (30)'
+COLUMN_NAME = 'Student-ID'

-df = pd.read_excel(file_path, sheet_name=sheet_name)
+def find_duplicates(file_path, sheet_name, column_name):
+    df = pd.read_excel(file_path, sheet_name=sheet_name)
+    duplicate_ids = df[df.duplicated(subset=[column_name], keep=False)][column_name]
+    unique_duplicate_ids = duplicate_ids.drop_duplicates()
+    return unique_duplicate_ids

-duplicate_ids= df[df.duplicated(subset=[column_name], keep=False)][column_name]
+def main():
+    unique_duplicate_ids = find_duplicates(FILE_PATH, SHEET_NAME, COLUMN_NAME)
+    num_duplicates = len(unique_duplicate_ids)

-unique_duplicate_ids = duplicate_ids.drop_duplicates()
+    if not unique_duplicate_ids.empty:
+        print(f"Duplicated {COLUMN_NAME} values (count: {num_duplicates}):")
+        print(unique_duplicate_ids)
+    else:
+        print("No duplicates found.")

-num_duplicates = len(unique_duplicate_ids)
-
-if not unique_duplicate_ids.empty:
-    print(f"Duplicated Student-ID values (count: {num_duplicates})  :")
-    print(unique_duplicate_ids)
-else:
-    print("No duplicates found.")
+if __name__ == "__main__":
+    main()