Refactor and document code; add new files
Refactored `script.py` by adding detailed docstrings and organizing functions. Created `.idea` configuration files and `gotodashboard.js` for `sisa_crawl` project. Added `readme.md` files with usage instructions and context for multiple scripts, and set up `package.json` for `sisa_crawl` dependencies.
This commit is contained in:
parent
e3e65a9c51
commit
b021eabdab
109
examen dubbels/readme.md
Normal file
|
@ -0,0 +1,109 @@
|
|||
# Duplicate Detection in Excel Sheets with Pandas
|
||||
|
||||
This project provides a Python script that detects and lists duplicate values in a specified column of an Excel sheet. The script uses the `pandas` library for data manipulation and analysis.

It is useful for the OWS because it makes it easy to check for duplicate Student-IDs across the exam groups of different exams. For example, the author used it to check whether any students had two exams, an oral and a written one, on the same date: the tables from two Excel files were pasted under each other in one file, and the script was then run to confirm whether the exam overlaps had been resolved. By changing the variables, it can be used to check for duplicates in any other situation.
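
The manual step of pasting two tables under each other could also be done in pandas itself. The following is a minimal sketch under stated assumptions, not part of the script: the two DataFrames, their `Exam` column, and the sample IDs are hypothetical stand-ins for the two exam group exports.

```python
import pandas as pd

# Hypothetical sample data standing in for the two exam group exports.
oral_exam = pd.DataFrame({"Student-ID": [101, 102, 103], "Exam": "oral"})
written_exam = pd.DataFrame({"Student-ID": [103, 104, 101], "Exam": "written"})

# Stack the two tables, as one would by pasting them under each other in Excel.
combined = pd.concat([oral_exam, written_exam], ignore_index=True)

# keep=False marks every occurrence of a repeated ID, not only the later ones.
mask = combined.duplicated(subset=["Student-ID"], keep=False)
overlapping_ids = combined.loc[mask, "Student-ID"].drop_duplicates().tolist()

print(overlapping_ids)  # [101, 103]
```

Students 101 and 103 appear in both groups, so they are the ones to check for overlapping exam dates.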
|
||||
|
||||
## Table of Contents
|
||||
|
||||
- [Requirements](#requirements)
|
||||
- [Installation](#installation)
|
||||
- [Usage](#usage)
|
||||
- [How It Works](#how-it-works)
|
||||
- [Contributing](#contributing)
|
||||
- [License](#license)
|
||||
|
||||
## Requirements
|
||||
|
||||
- Python 3.6 or higher
|
||||
- `pandas` library (plus `openpyxl`, which pandas uses to read `.xlsx` files)
|
||||
|
||||
## Installation
|
||||
|
||||
1. **Clone the repository**:
|
||||
```bash
|
||||
git clone https://github.com/your-username/your-repository.git
|
||||
cd your-repository
|
||||
```
|
||||
|
||||
2. **Set up a virtual environment (optional but recommended)**:
|
||||
```bash
|
||||
python -m venv venv
|
||||
source venv/bin/activate # On Windows: venv\Scripts\activate
|
||||
```
|
||||
|
||||
3. **Install the required libraries**:
|
||||
```bash
|
||||
pip install pandas openpyxl
|
||||
```
|
||||
|
||||
## Usage
|
||||
|
||||
1. Ensure your Excel file is in the same directory as the script or provide an absolute path to the file.
|
||||
|
||||
2. Update the `FILE_PATH`, `SHEET_NAME`, and `COLUMN_NAME` constants at the top of `script.py` (shown as `file_path`, `sheet_name`, and `column_name` in the example below) to match your file details.
|
||||
|
||||
3. Run the script:
|
||||
```bash
|
||||
python script.py
|
||||
```
|
||||
|
||||
### Example
|
||||
|
||||
```python
|
||||
import pandas as pd
|
||||
|
||||
# Variables
|
||||
file_path = 'ps (30).xlsx'
|
||||
sheet_name = 'ps (30)'
|
||||
column_name = 'Student-ID'
|
||||
|
||||
# Read the Excel file and specified sheet into a DataFrame
|
||||
df = pd.read_excel(file_path, sheet_name=sheet_name)
|
||||
|
||||
# Find duplicated entries in the specified column
|
||||
duplicate_ids = df[df.duplicated(subset=[column_name], keep=False)][column_name]
|
||||
|
||||
# Drop duplicate values to get unique duplicate IDs
|
||||
unique_duplicate_ids = duplicate_ids.drop_duplicates()
|
||||
|
||||
# Count the number of unique duplicate IDs
|
||||
num_duplicates = len(unique_duplicate_ids)
|
||||
|
||||
# Print the results
|
||||
if not unique_duplicate_ids.empty:
    print(f"Duplicated Student-ID values (count: {num_duplicates}):")
    print(unique_duplicate_ids)
else:
    print("No duplicates found.")
|
||||
```
|
||||
|
||||
## How It Works
|
||||
|
||||
1. **Import the `pandas` library**.
|
||||
2. **Set file variables**: Define the path to the Excel file, the sheet name, and the column to check for duplicates.
|
||||
3. **Read the Excel file**: Load the specified sheet into a DataFrame.
|
||||
4. **Identify duplicates**: Use the `df.duplicated()` method to find and filter duplicate entries in the specified column.
|
||||
5. **Get unique duplicates**: Remove duplicate values to find unique duplicate IDs.
|
||||
6. **Count duplicates**: Calculate the number of unique duplicate IDs.
|
||||
7. **Print results**: Display the count and the actual duplicate IDs, if any.
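
Steps 4 to 6 depend on the `keep` argument of `duplicated()`. The sketch below illustrates the difference on a small in-memory series. The sample IDs are made up, and `value_counts` is an addition that the script itself does not use.

```python
import pandas as pd

ids = pd.Series([7, 8, 7, 9, 8, 7], name="Student-ID")

# The default keep='first' flags only the repeats after the first occurrence.
later_repeats = ids[ids.duplicated()].tolist()

# keep=False flags every occurrence of a value that appears more than once.
all_occurrences = ids[ids.duplicated(keep=False)].tolist()

# value_counts gives the number of occurrences per ID; keep those above one.
counts = ids.value_counts()
duplicate_counts = counts[counts > 1].to_dict()

print(later_repeats)     # [7, 8, 7]
print(all_occurrences)   # [7, 8, 7, 8, 7]
print(duplicate_counts)  # {7: 3, 8: 2}
```

Because the script calls `drop_duplicates()` afterwards, both settings would yield the same set of unique duplicate IDs; `keep=False` matters mainly when the full duplicated rows are inspected.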
|
||||
|
||||
## Contributing
|
||||
|
||||
Contributions are welcome! Please follow these steps:
|
||||
|
||||
1. Fork the repository.
|
||||
2. Create a new branch (`git checkout -b feature-branch`).
|
||||
3. Make your changes.
|
||||
4. Commit your changes (`git commit -m 'Add some feature'`).
|
||||
5. Push to the branch (`git push origin feature-branch`).
|
||||
6. Open a pull request.
|
||||
|
||||
## License
|
||||
|
||||
This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.
|
|
@ -1,20 +1,25 @@
|
|||
import pandas as pd
|
||||
|
||||
#variables
|
||||
file_path = 'ps (30).xlsx'
|
||||
sheet_name = 'ps (30)'
|
||||
column_name = 'Student-ID'
|
||||
# Constants
|
||||
FILE_PATH = 'ps (30).xlsx'
|
||||
SHEET_NAME = 'ps (30)'
|
||||
COLUMN_NAME = 'Student-ID'
|
||||
|
||||
def find_duplicates(file_path, sheet_name, column_name):
|
||||
df = pd.read_excel(file_path, sheet_name=sheet_name)
|
||||
|
||||
duplicate_ids = df[df.duplicated(subset=[column_name], keep=False)][column_name]
|
||||
|
||||
unique_duplicate_ids = duplicate_ids.drop_duplicates()
|
||||
return unique_duplicate_ids
|
||||
|
||||
def main():
|
||||
unique_duplicate_ids = find_duplicates(FILE_PATH, SHEET_NAME, COLUMN_NAME)
|
||||
num_duplicates = len(unique_duplicate_ids)
|
||||
|
||||
if not unique_duplicate_ids.empty:
|
||||
print(f"Duplicated Student-ID values (count: {num_duplicates}) :")
|
||||
print(f"Duplicated {COLUMN_NAME} values (count: {num_duplicates}):")
|
||||
print(unique_duplicate_ids)
|
||||
else:
|
||||
print("No duplicates found.")
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
|
@ -1,18 +1,48 @@
|
|||
import pandas as pd
|
||||
|
||||
file_path = 'file.xlsx'
|
||||
sheet_name = 'ps (32)'
|
||||
# Constants
|
||||
FILE_PATH = 'file.xlsx'
|
||||
SHEET_NAME = 'ps (32)'
|
||||
OUTPUT_FILE_PATH = 'filtered_grote_lokalen.xlsx'
|
||||
EXAM_FORM_COLUMN = 'Examenvorm'
|
||||
REGISTRATION_COLUMN = 'Aant. inschr.'
|
||||
BEGIN_TIME_COLUMN = 'Beginuur S+'
|
||||
END_TIME_COLUMN = 'Einduur S+'
|
||||
TEACHERS_COLUMN = 'Docenten'
|
||||
LOCATION_COLUMNS = ['Datum S+', BEGIN_TIME_COLUMN, END_TIME_COLUMN, 'Studiegidsnr.', 'Omschrijving', TEACHERS_COLUMN, REGISTRATION_COLUMN]
|
||||
|
||||
df = pd.read_excel(file_path, sheet_name=sheet_name)
|
||||
filtered_df = df[df['Examenvorm'] == 'Schriftelijk' ]
|
||||
filtered_df = filtered_df[filtered_df['Aant. inschr.'] > 65]
|
||||
filtered_df = filtered_df[['Datum S+','Beginuur S+','Einduur S+', 'Studiegidsnr.', 'Omschrijving', 'Docenten', 'Aant. inschr.']]
|
||||
# Read the Excel file
|
||||
def read_excel(file_path, sheet_name):
|
||||
return pd.read_excel(file_path, sheet_name=sheet_name)
|
||||
|
||||
# Filter DataFrame
|
||||
def filter_dataframe(df):
|
||||
df = df[df[EXAM_FORM_COLUMN] == 'Schriftelijk']
|
||||
df = df[df[REGISTRATION_COLUMN] > 65]
|
||||
return df[LOCATION_COLUMNS]
|
||||
|
||||
#formatting the timestrings
|
||||
filtered_df['Beginuur S+'] = filtered_df['Beginuur S+'].apply(lambda x: x.strftime('%H:%M'))
|
||||
filtered_df['Einduur S+'] = filtered_df['Einduur S+'].apply(lambda x: x.strftime('%H:%M'))
|
||||
filtered_df['Docenten'] = filtered_df['Docenten'].str.replace(r'\b(Titularis|Co-Titularis|Medewerker)\b', '',
|
||||
regex=True).str.strip()
|
||||
# Format time strings
|
||||
def format_time_strings(df):
|
||||
df[BEGIN_TIME_COLUMN] = df[BEGIN_TIME_COLUMN].apply(lambda x: x.strftime('%H:%M') if pd.notnull(x) else '')
|
||||
df[END_TIME_COLUMN] = df[END_TIME_COLUMN].apply(lambda x: x.strftime('%H:%M') if pd.notnull(x) else '')
|
||||
return df
|
||||
|
||||
filtered_df.to_excel('filtered_grote_lokalen.xlsx', index=False)
|
||||
# Clean up teacher titles
|
||||
def clean_teacher_titles(df):
|
||||
df[TEACHERS_COLUMN] = df[TEACHERS_COLUMN].str.replace(r'\b(Titularis|Co-Titularis|Medewerker)\b', '', regex=True).str.strip()
|
||||
return df
|
||||
|
||||
# Save DataFrame to Excel
|
||||
def save_to_excel(df, file_path):
|
||||
df.to_excel(file_path, index=False)
|
||||
|
||||
# Main process
|
||||
def main():
|
||||
df = read_excel(FILE_PATH, SHEET_NAME)
|
||||
filtered_df = filter_dataframe(df)
|
||||
filtered_df = format_time_strings(filtered_df)
|
||||
filtered_df = clean_teacher_titles(filtered_df)
|
||||
save_to_excel(filtered_df, OUTPUT_FILE_PATH)
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
79
examen grote lokalen/readme.md
Normal file
|
@ -0,0 +1,79 @@
|
|||
# Excel Filtering and Formatting Script
|
||||
|
||||
The script in this repository selects all written exams that require a 'large room' (more than 65 registrations) and therefore need to be brought to the meeting that assigns large rooms to written exams on campus. The output follows the layout of the master file provided by E-campus and may need changes if the master file changes.
|
||||
|
||||
## Prerequisites
|
||||
|
||||
Make sure you have the following software installed:
|
||||
- Python 3.x
|
||||
- Pip (Python package installer)
|
||||
|
||||
## Required Packages
|
||||
|
||||
The script depends on the following Python packages:
|
||||
- `pandas`
|
||||
|
||||
You can install the required package using pip:
|
||||
```bash
|
||||
pip install pandas openpyxl
|
||||
```
|
||||
|
||||
## Description
|
||||
|
||||
The script performs the following operations:
|
||||
|
||||
1. Reads data from the specified Excel file and sheet.
|
||||
2. Keeps only the rows where 'Examenvorm' is 'Schriftelijk' (written) and 'Aant. inschr.' (number of registrations) is greater than 65.
|
||||
3. Selects specific columns from the filtered DataFrame.
|
||||
4. Formats time strings in the columns 'Beginuur S+' and 'Einduur S+'.
|
||||
5. Cleans the 'Docenten' column by removing specific keywords and trimming whitespace.
|
||||
6. Writes the processed DataFrame to a new Excel file.
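
Steps 2 and 4 can be tried without an Excel file. This is a minimal sketch, assuming made-up rows with the column names listed above; the function name `select_large_written_exams` and its `min_registrations` parameter are hypothetical and do not appear in the script.

```python
import datetime as dt

import pandas as pd

# Hypothetical rows using the column names the script expects.
df = pd.DataFrame({
    "Examenvorm": ["Schriftelijk", "Mondeling", "Schriftelijk"],
    "Aant. inschr.": [120, 200, 40],
    "Beginuur S+": [dt.time(9, 0), dt.time(13, 30), dt.time(14, 0)],
})


def select_large_written_exams(frame, min_registrations=65):
    """Keep written exams with more registrations than the threshold."""
    is_written = frame["Examenvorm"] == "Schriftelijk"
    is_large = frame["Aant. inschr."] > min_registrations
    # .copy() returns an independent frame, so later column assignments
    # do not act on a slice of the original.
    return frame[is_written & is_large].copy()


large = select_large_written_exams(df)
large["Beginuur S+"] = large["Beginuur S+"].apply(lambda t: t.strftime("%H:%M"))

print(large)
```

Only the first row survives: the second is an oral exam and the third has too few registrations. Combining both conditions in one boolean mask is a design choice for this sketch; the script applies them one after the other.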
|
||||
|
||||
## Usage
|
||||
|
||||
1. Place your Excel file in the same directory as the script.
|
||||
2. Update the `FILE_PATH` and `SHEET_NAME` constants at the top of the script (shown as `file_path` and `sheet_name` in the code below) with your file path and sheet name.
|
||||
3. Run the script:
|
||||
|
||||
```bash
|
||||
python script.py
|
||||
```
|
||||
|
||||
## Code
|
||||
|
||||
```python
|
||||
import pandas as pd
|
||||
|
||||
file_path = 'file.xlsx'
|
||||
sheet_name = 'ps (32)'
|
||||
|
||||
# Read the data from the Excel file
|
||||
df = pd.read_excel(file_path, sheet_name=sheet_name)
|
||||
|
||||
# Filter the data based on certain criteria
|
||||
filtered_df = df[df['Examenvorm'] == 'Schriftelijk']
|
||||
filtered_df = filtered_df[filtered_df['Aant. inschr.'] > 65]
|
||||
filtered_df = filtered_df[['Datum S+', 'Beginuur S+', 'Einduur S+', 'Studiegidsnr.', 'Omschrijving', 'Docenten', 'Aant. inschr.']]
|
||||
|
||||
# Format the time strings
|
||||
filtered_df['Beginuur S+'] = filtered_df['Beginuur S+'].apply(lambda x: x.strftime('%H:%M'))
|
||||
filtered_df['Einduur S+'] = filtered_df['Einduur S+'].apply(lambda x: x.strftime('%H:%M'))
|
||||
# Remove role keywords from the teacher names and trim surrounding whitespace
filtered_df['Docenten'] = filtered_df['Docenten'].str.replace(r'\b(Titularis|Co-Titularis|Medewerker)\b', '', regex=True).str.strip()
|
||||
|
||||
# Save the filtered and formatted data to a new Excel file
|
||||
filtered_df.to_excel('filtered_grote_lokalen.xlsx', index=False)
|
||||
```
|
||||
|
||||
## Additional Notes
|
||||
|
||||
- This script assumes that the input Excel file has specific columns like 'Examenvorm', 'Aant. inschr.', 'Datum S+', 'Beginuur S+', 'Einduur S+', 'Studiegidsnr.', 'Omschrijving', and 'Docenten'.
|
||||
- Make sure that the time columns ('Beginuur S+' and 'Einduur S+') are in datetime format in the original Excel file for the `.strftime('%H:%M')` method to work correctly.
|
||||
- The `Docenten` column will be cleaned by removing occurrences of the keywords 'Titularis', 'Co-Titularis', and 'Medewerker'.
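
The last two notes can be checked on small in-memory samples. The sketch below is illustrative only: the teacher names and times are invented, and the empty-string fallback for missing times mirrors the null check in the refactored script.

```python
import datetime as dt

import pandas as pd

teachers = pd.Series([
    "Titularis Jan Peeters",
    "Co-Titularis Els Smet",
    "Titularis Jan Peeters, Medewerker An Maes",
])
pattern = r"\b(Titularis|Co-Titularis|Medewerker)\b"

# str.strip() only trims the outer whitespace; gaps left inside the string stay.
cleaned = teachers.str.replace(pattern, "", regex=True).str.strip().tolist()

# Missing times have no strftime method, so they are mapped to an empty string.
times = pd.Series([dt.time(8, 30), None])
formatted = times.apply(
    lambda t: t.strftime("%H:%M") if pd.notnull(t) else ""
).tolist()

print(cleaned)
print(formatted)  # ['08:30', '']
```

In the third sample, removing 'Medewerker' leaves two spaces after the comma. If that matters for the output file, an extra whitespace-collapsing step would be needed.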
|
||||
|
||||
Feel free to adjust the script according to your specific needs.
|
||||
|
||||
## License
|
||||
|
||||
This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.
|
105
examengegevens template generator/readme.md
Normal file
|
@ -0,0 +1,105 @@
|
|||
# Project Name: Examination Data Processing
|
||||
|
||||
## Overview
|
||||
This project is designed to process examination data from an Excel file and generate filtered output and communication messages for teaching staff. It's developed using Python and pandas, and it provides functionalities such as filtering records, converting time formats, and generating message columns.
|
||||
|
||||
## Features
|
||||
- **Read Excel File**: Reads examination data from an Excel file into a Pandas DataFrame.
|
||||
- **Filter Data**: Keeps records whose 'Studiegidsnummer' contains 'GES' and whose 'Opmerkingen' does not contain '24-25'.
|
||||
- **Convert Time Format**: Converts time columns to 'HH:MM' format.
|
||||
- **Generate Messages**: Creates message and subject columns for email communication.
|
||||
- **Save to Excel**: Saves the processed data to a new Excel file.
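
The two filters can be sketched on an in-memory DataFrame. This is an illustration under assumptions: the sample rows are invented, and combining both conditions into one mask is a simplification of the two separate functions in the script.

```python
import pandas as pd

# Hypothetical rows; None stands for an empty Excel cell.
df = pd.DataFrame({
    "Studiegidsnummer": ["GES101", "WIS202", "GES303", None],
    "Opmerkingen": ["gecontroleerd 24-25", None, None, "24-25"],
})

# na=False treats empty cells as "does not contain" instead of propagating NaN.
has_ges = df["Studiegidsnummer"].str.contains("GES", na=False)
mentions_2425 = df["Opmerkingen"].str.contains("24-25", na=False)

# Keep 'GES' courses whose remarks do not mention '24-25'.
to_process = df[has_ges & ~mentions_2425].copy()

print(to_process["Studiegidsnummer"].tolist())  # ['GES303']
```

Without `na=False`, empty cells would produce missing values in the mask, and the boolean indexing would fail or behave unexpectedly.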
|
||||
|
||||
## Prerequisites
|
||||
- Python 3.12.5
|
||||
- Pandas
|
||||
- openpyxl
|
||||
|
||||
## Installation
|
||||
1. **Clone the repository**:
|
||||
```sh
|
||||
git clone https://github.com/username/examination-data-processing.git
|
||||
cd examination-data-processing
|
||||
```
|
||||
2. **Install the required Python packages**:
|
||||
```sh
|
||||
pip install -r requirements.txt
|
||||
```
|
||||
|
||||
Ensure the `requirements.txt` file contains:
|
||||
```text
|
||||
pandas
|
||||
openpyxl
|
||||
```
|
||||
|
||||
## Usage
|
||||
1. **Place the input Excel file**: Ensure that the Excel file (`examengegevens2425.xlsx`) is placed in the root directory of the project.
|
||||
|
||||
2. **Run the script**:
|
||||
```sh
|
||||
python script.py
|
||||
```
|
||||
|
||||
3. **Output**: The filtered and processed data will be saved in an output Excel file (`filtered_examengegevens2425.xlsx`).
|
||||
|
||||
## Functions
|
||||
|
||||
### `read_excel_file(file_path)`
|
||||
- **Parameters**: `file_path` (str) - Path to the Excel file.
|
||||
- **Returns**: DataFrame or None
|
||||
|
||||
### `filter_studiegidsnummer(df)`
|
||||
- **Parameters**: `df` (DataFrame) - Input DataFrame.
|
||||
- **Returns**: Filtered DataFrame or empty DataFrame
|
||||
|
||||
### `filter_opmerkingen(df)`
|
||||
- **Parameters**: `df` (DataFrame) - Input DataFrame.
|
||||
- **Returns**: Filtered DataFrame or empty DataFrame
|
||||
|
||||
### `create_message_column(df)`
|
||||
- **Parameters**: `df` (DataFrame) - Input DataFrame.
|
||||
- **Returns**: DataFrame with 'Message' and 'subject' columns
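
A row-wise `apply` is one way to build such columns. The sketch below is a shortened illustration, not the script's full template: the sample row, the helper name `build_message`, and the subject format are assumptions, while the opening of the message follows the text visible in the script.

```python
import pandas as pd

# Hypothetical single exam record.
df = pd.DataFrame({
    "Omschrijving": ["Geschiedenis van Europa"],
    "Studiegidsnummer": ["GES101"],
})


def build_message(row):
    """Return a shortened message for one exam record."""
    return (
        "Beste docent,\n\n"
        "Ik ben de examengegevens aan het controleren van "
        f"{row['Omschrijving']} {row['Studiegidsnummer']}."
    )


# axis=1 passes each row to build_message as a Series.
df["Message"] = df.apply(build_message, axis=1)
# Assumed subject format, for illustration only.
df["subject"] = "Examengegevens " + df["Studiegidsnummer"]

print(df.loc[0, "Message"])
```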
|
||||
|
||||
### `save_to_excel(df, output_file_path)`
|
||||
- **Parameters**:
|
||||
- `df` (DataFrame) - DataFrame to save.
|
||||
- `output_file_path` (str) - Path to save the Excel file.
|
||||
- **Returns**: None
|
||||
|
||||
### `convert_time_format(time_str)`
|
||||
- **Parameters**: `time_str` (str) - Time string to convert.
|
||||
- **Returns**: Formatted time string
|
||||
|
||||
### `apply_time_format_conversion(df, columns)`
|
||||
- **Parameters**:
|
||||
- `df` (DataFrame) - DataFrame with time columns.
|
||||
- `columns` (list of str) - List of column names to format.
|
||||
- **Returns**: DataFrame with formatted time columns
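
The column-wise conversion can be illustrated on a small DataFrame. This is a minimal sketch, assuming the times arrive as 'HH:MM:SS' strings; the column name `Beginuur` and the sample values are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"Beginuur": ["09:00:00", "13:30:00", "onbekend"]})

# errors='coerce' turns values that do not match the format into NaT,
# which strftime then renders as a missing value instead of raising.
converted = pd.to_datetime(
    df["Beginuur"], format="%H:%M:%S", errors="coerce"
).dt.strftime("%H:%M")

print(converted.tolist())
```

The first two values become '09:00' and '13:30'. The unparseable third value ends up missing, so values that do not follow 'HH:MM:SS' are silently lost rather than kept as-is.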
|
||||
|
||||
### `main()`
|
||||
- Main function to execute the entire process: reading the Excel file, filtering data, converting time formats, creating message columns, and saving to Excel.
|
||||
|
||||
## Example
|
||||
```python
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
```
|
||||
|
||||
## Contributing
|
||||
1. Fork the repository.
|
||||
2. Create a new branch: `git checkout -b feature-branch`.
|
||||
3. Make your changes and commit: `git commit -m 'Add new feature'`.
|
||||
4. Push to the branch: `git push origin feature-branch`.
|
||||
5. Submit a pull request.
|
||||
|
||||
## License
|
||||
This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for more information.
|
||||
|
||||
## Acknowledgements
|
||||
- [Pandas Documentation](https://pandas.pydata.org/pandas-docs/stable/)
|
||||
- [Openpyxl Documentation](https://openpyxl.readthedocs.io/en/stable/)
|
||||
|
||||
## Author
|
||||
- AI Assistant (Your Name or Contributors)
|
||||
|
||||
> For additional information or support, please contact `your-email@example.com`.
|
|
@ -1,7 +1,10 @@
|
|||
import pandas as pd
|
||||
|
||||
def read_excel_file(file_path):
|
||||
"""Read the Excel file and return a DataFrame."""
|
||||
"""
|
||||
:param file_path: The path to the Excel file to be read.
|
||||
:return: The contents of the Excel file as a DataFrame if successful, otherwise None.
|
||||
"""
|
||||
try:
|
||||
return pd.read_excel(file_path)
|
||||
except Exception as e:
|
||||
|
@ -9,7 +12,10 @@ def read_excel_file(file_path):
|
|||
return None
|
||||
|
||||
def filter_studiegidsnummer(df):
|
||||
"""Filter rows where 'studiegidsnummer' contains 'GES'."""
|
||||
"""
|
||||
:param df: Input DataFrame that contains various columns including 'Studiegidsnummer'.
|
||||
:return: DataFrame filtered to include only rows where the 'Studiegidsnummer' column contains 'GES'. Returns an empty DataFrame if 'Studiegidsnummer' column is not found.
|
||||
"""
|
||||
if 'Studiegidsnummer' not in df.columns:
|
||||
print("Column 'studiegidsnummer' not found in the DataFrame.")
|
||||
print("Available columns:", df.columns)
|
||||
|
@ -17,7 +23,10 @@ def filter_studiegidsnummer(df):
|
|||
return df[df['Studiegidsnummer'].str.contains('GES', na=False)].copy()
|
||||
|
||||
def filter_opmerkingen(df):
|
||||
"""Filter rows where 'Opmerkingen' does NOT contain '24-25'."""
|
||||
"""
|
||||
:param df: The input DataFrame containing various columns including 'Opmerkingen'
|
||||
:return: A filtered DataFrame excluding rows where the 'Opmerkingen' column contains the string '24-25'. If the 'Opmerkingen' column is not found, returns an empty DataFrame and prints available columns.
|
||||
"""
|
||||
if 'Opmerkingen' not in df.columns:
|
||||
print("Column 'Opmerkingen' not found in the DataFrame.")
|
||||
print("Available columns:", df.columns)
|
||||
|
@ -25,7 +34,11 @@ def filter_opmerkingen(df):
|
|||
return df[~df['Opmerkingen'].str.contains('24-25', na=False)].copy()
|
||||
|
||||
def create_message_column(df):
|
||||
"""Create 'Message' and 'subject' columns with the specified format."""
|
||||
"""
|
||||
:param df: A pandas DataFrame containing examination details.
|
||||
:return: A pandas DataFrame with additional 'Message' and 'subject' columns
|
||||
for communication with teaching staff regarding examination details.
|
||||
"""
|
||||
df.loc[:, 'Message'] = df.apply(lambda row: (
|
||||
f"Beste docent,\n\n"
|
||||
f"Ik ben de examengegevens aan het controleren van {row['Omschrijving']} {row['Studiegidsnummer']}. De huidige gegevens zijn als volgt:\n\n"
|
||||
|
@ -38,14 +51,23 @@ def create_message_column(df):
|
|||
return df
|
||||
|
||||
def save_to_excel(df, output_file_path):
|
||||
"""Save the DataFrame to a new Excel file."""
|
||||
"""
|
||||
:param df: The DataFrame to be saved to an Excel file.
|
||||
:type df: pandas.DataFrame
|
||||
:param output_file_path: The path where the Excel file will be saved.
|
||||
:type output_file_path: str
|
||||
:return: None
|
||||
"""
|
||||
try:
|
||||
df.to_excel(output_file_path, index=False)
|
||||
except Exception as e:
|
||||
print(f"Error saving the Excel file: {e}")
|
||||
|
||||
def convert_time_format(time_str):
|
||||
"""Convert time from 'HH:MM:SS' to 'HH:MM'."""
|
||||
"""
|
||||
:param time_str: A string representing the time to be converted.
|
||||
:return: A string representing the time in 'HH:MM' format, or the original string if conversion fails.
|
||||
"""
|
||||
try:
|
||||
return pd.to_datetime(time_str).strftime('%H:%M')
|
||||
except Exception as e:
|
||||
|
@ -53,13 +75,26 @@ def convert_time_format(time_str):
|
|||
return time_str
|
||||
|
||||
def apply_time_format_conversion(df, columns):
|
||||
"""Apply time format conversion to specified columns in the DataFrame."""
|
||||
"""
|
||||
:param df: The DataFrame containing the columns to be formatted.
|
||||
:type df: pandas.DataFrame
|
||||
:param columns: A list of column names in the DataFrame to apply the time format conversion.
|
||||
:type columns: list of str
|
||||
:return: A DataFrame with the specified columns converted to the '%H:%M' format.
|
||||
:rtype: pandas.DataFrame
|
||||
"""
|
||||
for column in columns:
|
||||
df[column] = pd.to_datetime(df[column], format='%H:%M:%S', errors='coerce').dt.strftime('%H:%M')
|
||||
return df
|
||||
|
||||
# Example usage within the main function
|
||||
def main():
|
||||
"""
|
||||
Reads an Excel file, filters data based on specific criteria, converts time formats for specified columns,
|
||||
creates a message column, and saves the filtered data to a new Excel file.
|
||||
|
||||
:return: None
|
||||
"""
|
||||
file_path = 'examengegevens2425.xlsx'
|
||||
output_file_path = 'filtered_examengegevens2425.xlsx'
|
||||
|
||||
|
|