From c805605ecf2dffb59cf7930200f6fd070b6fb080 Mon Sep 17 00:00:00 2001 From: wendtalexander Date: Thu, 17 Oct 2024 18:06:00 +0200 Subject: [PATCH 1/2] [project] adding readmeas to code and data folder --- code/README.md | 41 +++++++++++++++++++++++++++++++++++++++++ data/README.md | 3 +++ 2 files changed, 44 insertions(+) create mode 100644 code/README.md create mode 100644 data/README.md diff --git a/code/README.md b/code/README.md new file mode 100644 index 0000000..393fabe --- /dev/null +++ b/code/README.md @@ -0,0 +1,41 @@ +# Structure of a script + +1. Initially you should specify which packages you use in the scripts + +~~~python +import pathlib # Packages that are provided from python + +import numpy as np # Packages that are downloaded, specified in the requierements.txt + +import myscript # Scripts from your Project/Pipeline +~~~ + + +2. Next your code for the specific problem that you are trying to solve, all written code should be containded in a function/classes + It should contain a main function with is calling all individual function to solve the problem. + +~~~python +def load_data(path): + with open(path, "r") as f: + f.read() + return f + +def main(path): + load_data(path) +~~~ + +3. If the script is a standalone script, it can be run by calling python myscript.py it should contain... + +~~~python +if __name__ == "__main__": + path = "../data/README.md" + main(path) +~~~ + +# Tips and tricks +- Plotting scripts should be named the same as the output figure for easier backtracking +- Plotting scripts should start with plot, so that one can create a bash script for that executes all plot* scripts +- If you use a directory for managing specific task, in python it is called a module, you neeed a __init__.py file in the directory more in [packagehowto](https://whale.am28.uni-tuebingen.de/git/pweygoldt/packagehowto) + + + diff --git a/data/README.md b/data/README.md new file mode 100644 index 0000000..dfa6eb5 --- /dev/null +++ b/data/README.md @@ -0,0 +1,3 @@ +# Tips and Tricks for the data directory + +The filename should contain information about the date From 092ea67f403e8bd5d82f9b34a1f43f1985724281 Mon Sep 17 00:00:00 2001 From: weygoldt <88969563+weygoldt@users.noreply.github.com> Date: Fri, 18 Oct 2024 14:07:53 +0200 Subject: [PATCH 2/2] added to code readme --- code/README.md | 312 ++++++++++++++++++++++++++++++++++++++++++++----- 1 file changed, 286 insertions(+), 26 deletions(-) diff --git a/code/README.md b/code/README.md index 393fabe..dc5ec98 100644 --- a/code/README.md +++ b/code/README.md @@ -1,41 +1,301 @@ -# Structure of a script +# Writing a Good Python Script: A Primer -1. Initially you should specify which packages you use in the scripts +This primer will guide you through best practices to write effective and clean +Python scripts. Whether you're working on a data processing pipeline, a machine +learning model, or a simple utility script, following these guidelines will +help you create maintainable and readable code. -~~~python -import pathlib # Packages that are provided from python +## 1. Use a Declarative and Meaningful Script Name -import numpy as np # Packages that are downloaded, specified in the requierements.txt +Choose a script name that clearly describes its purpose. This makes it easier +for others (and yourself) to understand what the script does without reading +the code. -import myscript # Scripts from your Project/Pipeline -~~~ +**Examples:** +- `data_cleaning.py` instead of `script1.py` +- `generate_report.py` instead of `run.py` -2. Next your code for the specific problem that you are trying to solve, all written code should be containded in a function/classes - It should contain a main function with is calling all individual function to solve the problem. +## 2. Start with a Short Explanation (Docstring) -~~~python -def load_data(path): - with open(path, "r") as f: - f.read() - return f +At the beginning of your script, include a docstring that briefly explains what the script does. This helps users quickly grasp the script's functionality. -def main(path): - load_data(path) -~~~ +```python +""" +This script loads raw data, cleans it by removing null values and duplicates, +and saves the processed data to a new file. +""" +``` -3. If the script is a standalone script, it can be run by calling python myscript.py it should contain... +## 3. Import All Required Packages at the Beginning -~~~python +List all your imports at the top of the script. This makes dependencies clear and simplifies maintenance. + +```python +import sys # Packages that are provided by Python +from pathlib import Path +import numpy as np # Packages that are downloaded, specified in the requierements.txt +import pandas as pd +import my_module # Modules that are written by yourself +``` + +## 4. Encapsulate Code in Functions and Classes + +Organize your code by wrapping functionality within functions or classes. This +promotes code reuse, testing, and readability. Ideally, functions should do one +thing and do it well. Classes can be used for more complex logic or when you need +to maintain state. Clean functions and classes contain type hints and docstrings +to explain their purpose and inputs/outputs. + +**Examples of Functions:** + +```python +def load_data(file_path: str) -> pd.DataFrame: + """Loads data from a CSV file. + + Parameters: + ---------- + file_path : str + Path to the CSV file. + + Returns: + ------- + pd.DataFrame + Loaded data as a DataFrame. + """ + return pd.read_csv(file_path) + +def clean_data(df: pd.DataFrame) -> pd.DataFrame: + """Cleans the DataFrame by removing null values and duplicates. + + Parameters: + ---------- + df : pd.DataFrame + Input DataFrame. + + Returns: + ------- + pd.DataFrame + Cleaned DataFrame. + """ + df = df.dropna() + df = df.drop_duplicates() + return df + +def save_data(df: pd.DataFrame, output_path: str) -> None: + """Saves the DataFrame to a CSV file. + + Parameters: + ---------- + df : pd.DataFrame + DataFrame to save. + output_path : str + Path to save the CSV file. + """ + df.to_csv(output_path, index=False) +``` + +**Example of a Class:** + +```python +class DataProcessor: + """A class for processing data.""" + + def __init__(self, file_path): + self.data = self.load_data(file_path) + + def load_data(self, file_path): + return pd.read_csv(file_path) + + def clean_data(self): + self.data.dropna(inplace=True) + self.data.drop_duplicates(inplace=True) + + def save_data(self, output_path): + self.data.to_csv(output_path, index=False) +``` + +## 5. Define a `main()` Function + +Create a `main()` function that serves as the entry point of your script. This +function should orchestrate the flow of your program. + +```python +def main(): + """Main function that orchestrates the data processing.""" + input_file = 'data/raw/data.csv' + output_file = 'data/processed/clean_data.csv' + + # Using functions + data = load_data(input_file) + clean_data = clean_data(data) + save_data(clean_data, output_file) + + # Or using a class + # processor = DataProcessor(input_file) + # processor.clean_data() + # processor.save_data(output_file) + + print("Data processing complete.") +``` + +## 6. Use the `if __name__ == "__main__":` Statement + +This is a common Python idiom that allows you to check if the script is being +run as the main program. This ensures that the `main()` function is only called +when the script is executed directly. If you execute the `main()` function +directly, it will be executed when the module, or just parts of it, are +imported in another script. + +So at the end of your script, add: + +```python if __name__ == "__main__": - path = "../data/README.md" - main(path) -~~~ + main() +``` + +This checks if the script is being run as the main program and calls `main()` accordingly. + +## Putting It All Together + +Here's how your script might look when you combine all these best practices: + +```python +""" +This script loads raw data, cleans it by removing null values and duplicates, and saves the processed data to a new file. +""" + +import os +import sys +import pandas as pd +import numpy as np + +def load_data(file_path: str) -> pd.DataFrame: + """Loads data from a CSV file. + + Parameters: + ---------- + file_path : str + Path to the CSV file. + + Returns: + ------- + pd.DataFrame + Loaded data as a DataFrame. + """ + return pd.read_csv(file_path) + +def clean_data(df: pd.DataFrame) -> pd.DataFrame: + """Cleans the DataFrame by removing null values and duplicates. + + Parameters: + ---------- + df : pd.DataFrame + Input DataFrame. + + Returns: + ------- + pd.DataFrame + Cleaned DataFrame. + """ + df = df.dropna() + df = df.drop_duplicates() + return df + +def save_data(df: pd.DataFrame, output_path: str) -> None: + """Saves the DataFrame to a CSV file. + + Parameters: + ---------- + df : pd.DataFrame + DataFrame to save. + output_path : str + Path to save the CSV file. + """ + df.to_csv(output_path, index=False) + +def main(): + """Main function that orchestrates the data processing.""" + input_file = 'data/raw/data.csv' + output_file = 'data/processed/clean_data.csv' + + data = load_data(input_file) + clean_data = clean_data(data) + save_data(clean_data, output_file) + + print("Data processing complete.") + +if __name__ == "__main__": + main() +``` + +## Additional Tips + +- **Comment Your Code:** Use comments to explain non-obvious parts of your code. However, strive to write code that is self-explanatory. +- **Follow PEP 8 Guidelines:** Adhere to the [PEP 8](https://www.python.org/dev/peps/pep-0008/) style guide for Python code to improve readability. To make this easy, use an auto-formatter like `black` or `ruff`. +- **Use Meaningful Variable, Function and ClasE Names:** Choose names that convey their purpose. Avoid single-letter variable names except for simple iterators. Instead of `x` and `y` use e.g., `time` and `signal`. +- **Handle Exceptions:** Use try-except blocks to handle potential errors gracefully. + + ```python + try: + data = load_data(input_file) + except FileNotFoundError: + print(f"Error: The file {input_file} was not found.") + sys.exit(1) + ``` + +- **Use Logging Instead of Print Statements:** For larger scripts, consider using the `logging` module for better control over logging levels and outputs. + + ```python + import logging + + logging.basicConfig(level=logging.INFO) + + logging.info("Data processing complete.") + ``` + +- **Parameterize Your Scripts:** Use command-line arguments or a configuration file to make your script more flexible. + + ```python + import argparse + + def parse_arguments(): + parser = argparse.ArgumentParser(description="Process and clean data.") + parser.add_argument('--input', required=True, help='Input file path') + parser.add_argument('--output', required=True, help='Output file path') + return parser.parse_args() + + def main(): + args = parse_arguments() + data = load_data(args.input) + clean_data = clean_data(data) + save_data(clean_data, args.output) + ``` + + - **Make Your Code Modular:** Break down your script into multiple files or + modules for better organization and reusability. For example, move data + processing functions that are used in multiple scripts to a separate module + called `data_processing.py`. + + - **Coding a figure:** If you are coding a figure, you can follow our [coding + a figure + guide](https://github.com/bendalab/plottools/blob/master/docs/guide.md). + Applying the same principles to your figure code will make it easier to + modify and reuse. + + +## Conclusion + +By following these best practices, you'll create Python scripts that are: -# Tips and tricks -- Plotting scripts should be named the same as the output figure for easier backtracking -- Plotting scripts should start with plot, so that one can create a bash script for that executes all plot* scripts -- If you use a directory for managing specific task, in python it is called a module, you neeed a __init__.py file in the directory more in [packagehowto](https://whale.am28.uni-tuebingen.de/git/pweygoldt/packagehowto) +- **Readable:** Clear structure and naming make your code easy to understand. +- **Maintainable:** Encapsulation and modularity simplify updates and debugging. +- **Reusable:** Functions and classes can be imported and used in other scripts. +- **Robust:** Error handling ensures your script can handle unexpected situations gracefully. +Remember, good coding practices not only make your life easier but also help +others who may work with your code in the future. The effort you put into +writing clean and effective scripts will pay off in the long run. +Happy coding!