diff --git a/code/README.md b/code/README.md
new file mode 100644
index 0000000..dc5ec98
--- /dev/null
+++ b/code/README.md
@@ -0,0 +1,301 @@
+# Writing a Good Python Script: A Primer
+
+This primer will guide you through best practices for writing effective and clean
+Python scripts. Whether you're working on a data processing pipeline, a machine
+learning model, or a simple utility script, following these guidelines will
+help you create maintainable and readable code.
+
+## 1. Use a Declarative and Meaningful Script Name
+
+Choose a script name that clearly describes its purpose. This makes it easier
+for others (and yourself) to understand what the script does without reading
+the code.
+
+**Examples:**
+
+- `data_cleaning.py` instead of `script1.py`
+- `generate_report.py` instead of `run.py`
+
+## 2. Start with a Short Explanation (Docstring)
+
+At the beginning of your script, include a docstring that briefly explains what the script does. This helps users quickly grasp the script's functionality.
+
+```python
+"""
+This script loads raw data, cleans it by removing null values and duplicates,
+and saves the processed data to a new file.
+"""
+```
+
+## 3. Import All Required Packages at the Beginning
+
+List all your imports at the top of the script. This makes dependencies clear and simplifies maintenance.
+
+```python
+import sys  # Packages that are provided by Python
+from pathlib import Path
+import numpy as np  # Packages that are downloaded, specified in the requirements.txt
+import pandas as pd
+import my_module  # Modules that are written by yourself
+```
+
+## 4. Encapsulate Code in Functions and Classes
+
+Organize your code by wrapping functionality within functions or classes. This
+promotes code reuse, testing, and readability. Ideally, functions should do one
+thing and do it well. Classes can be used for more complex logic or when you need
+to maintain state. Clean functions and classes contain type hints and docstrings
+to explain their purpose and inputs/outputs.
+
+**Examples of Functions:**
+
+```python
+def load_data(file_path: str) -> pd.DataFrame:
+    """Loads data from a CSV file.
+
+    Parameters:
+    ----------
+    file_path : str
+        Path to the CSV file.
+
+    Returns:
+    -------
+    pd.DataFrame
+        Loaded data as a DataFrame.
+    """
+    return pd.read_csv(file_path)
+
+def clean_data(df: pd.DataFrame) -> pd.DataFrame:
+    """Cleans the DataFrame by removing null values and duplicates.
+
+    Parameters:
+    ----------
+    df : pd.DataFrame
+        Input DataFrame.
+
+    Returns:
+    -------
+    pd.DataFrame
+        Cleaned DataFrame.
+    """
+    df = df.dropna()
+    df = df.drop_duplicates()
+    return df
+
+def save_data(df: pd.DataFrame, output_path: str) -> None:
+    """Saves the DataFrame to a CSV file.
+
+    Parameters:
+    ----------
+    df : pd.DataFrame
+        DataFrame to save.
+    output_path : str
+        Path to save the CSV file.
+    """
+    df.to_csv(output_path, index=False)
+```
+
+**Example of a Class:**
+
+```python
+class DataProcessor:
+    """A class for processing data."""
+
+    def __init__(self, file_path: str) -> None:
+        self.data = self.load_data(file_path)
+
+    def load_data(self, file_path: str) -> pd.DataFrame:
+        return pd.read_csv(file_path)
+
+    def clean_data(self) -> None:
+        self.data.dropna(inplace=True)
+        self.data.drop_duplicates(inplace=True)
+
+    def save_data(self, output_path: str) -> None:
+        self.data.to_csv(output_path, index=False)
+```
+
+## 5. Define a `main()` Function
+
+Create a `main()` function that serves as the entry point of your script. This
+function should orchestrate the flow of your program.
+
+```python
+def main():
+    """Main function that orchestrates the data processing."""
+    input_file = 'data/raw/data.csv'
+    output_file = 'data/processed/clean_data.csv'
+
+    # Using functions
+    data = load_data(input_file)
+    cleaned_data = clean_data(data)
+    save_data(cleaned_data, output_file)
+
+    # Or using a class
+    # processor = DataProcessor(input_file)
+    # processor.clean_data()
+    # processor.save_data(output_file)
+
+    print("Data processing complete.")
+```
+
+## 6. Use the `if __name__ == "__main__":` Statement
+
+This is a common Python idiom that checks whether the script is being run as
+the main program. It ensures that `main()` is only called when the script is
+executed directly. If you instead call `main()` at module level without this
+guard, it will also run whenever the module, or just parts of it, is imported
+in another script.
+
+So at the end of your script, add:
+
+```python
+if __name__ == "__main__":
+    main()
+```
+
+This checks if the script is being run as the main program and calls `main()` accordingly.
+
+## Putting It All Together
+
+Here's how your script might look when you combine all these best practices:
+
+```python
+"""
+This script loads raw data, cleans it by removing null values and duplicates, and saves the processed data to a new file.
+"""
+
+import os
+import sys
+import pandas as pd
+import numpy as np
+
+def load_data(file_path: str) -> pd.DataFrame:
+    """Loads data from a CSV file.
+
+    Parameters:
+    ----------
+    file_path : str
+        Path to the CSV file.
+
+    Returns:
+    -------
+    pd.DataFrame
+        Loaded data as a DataFrame.
+    """
+    return pd.read_csv(file_path)
+
+def clean_data(df: pd.DataFrame) -> pd.DataFrame:
+    """Cleans the DataFrame by removing null values and duplicates.
+
+    Parameters:
+    ----------
+    df : pd.DataFrame
+        Input DataFrame.
+
+    Returns:
+    -------
+    pd.DataFrame
+        Cleaned DataFrame.
+    """
+    df = df.dropna()
+    df = df.drop_duplicates()
+    return df
+
+def save_data(df: pd.DataFrame, output_path: str) -> None:
+    """Saves the DataFrame to a CSV file.
+
+    Parameters:
+    ----------
+    df : pd.DataFrame
+        DataFrame to save.
+    output_path : str
+        Path to save the CSV file.
+    """
+    df.to_csv(output_path, index=False)
+
+def main():
+    """Main function that orchestrates the data processing."""
+    input_file = 'data/raw/data.csv'
+    output_file = 'data/processed/clean_data.csv'
+
+    data = load_data(input_file)
+    cleaned_data = clean_data(data)
+    save_data(cleaned_data, output_file)
+
+    print("Data processing complete.")
+
+if __name__ == "__main__":
+    main()
+```
+
+## Additional Tips
+
+- **Comment Your Code:** Use comments to explain non-obvious parts of your code. However, strive to write code that is self-explanatory.
+- **Follow PEP 8 Guidelines:** Adhere to the [PEP 8](https://www.python.org/dev/peps/pep-0008/) style guide for Python code to improve readability. To make this easy, use an auto-formatter like `black` or `ruff`.
+- **Use Meaningful Variable, Function, and Class Names:** Choose names that convey their purpose. Avoid single-letter variable names except for simple iterators. Instead of `x` and `y`, use names such as `time` and `signal`.
+- **Handle Exceptions:** Use try-except blocks to handle potential errors gracefully.
+
+  ```python
+  try:
+      data = load_data(input_file)
+  except FileNotFoundError:
+      print(f"Error: The file {input_file} was not found.")
+      sys.exit(1)
+  ```
+
+- **Use Logging Instead of Print Statements:** For larger scripts, consider using the `logging` module for better control over logging levels and outputs.
+
+  ```python
+  import logging
+
+  logging.basicConfig(level=logging.INFO)
+
+  logging.info("Data processing complete.")
+  ```
+
+- **Parameterize Your Scripts:** Use command-line arguments or a configuration file to make your script more flexible.
+
+  ```python
+  import argparse
+
+  def parse_arguments():
+      parser = argparse.ArgumentParser(description="Process and clean data.")
+      parser.add_argument('--input', required=True, help='Input file path')
+      parser.add_argument('--output', required=True, help='Output file path')
+      return parser.parse_args()
+
+  def main():
+      args = parse_arguments()
+      data = load_data(args.input)
+      cleaned_data = clean_data(data)
+      save_data(cleaned_data, args.output)
+  ```
+
+- **Make Your Code Modular:** Break down your script into multiple files or
+  modules for better organization and reusability. For example, move data
+  processing functions that are used in multiple scripts to a separate module
+  called `data_processing.py`.
+
+- **Coding a figure:** If you are coding a figure, you can follow our [coding
+  a figure
+  guide](https://github.com/bendalab/plottools/blob/master/docs/guide.md).
+  Applying the same principles to your figure code will make it easier to
+  modify and reuse.
+
+
+## Conclusion
+
+By following these best practices, you'll create Python scripts that are:
+
+- **Readable:** Clear structure and naming make your code easy to understand.
+- **Maintainable:** Encapsulation and modularity simplify updates and debugging.
+- **Reusable:** Functions and classes can be imported and used in other scripts.
+- **Robust:** Error handling ensures your script can handle unexpected situations gracefully.
+
+Remember, good coding practices not only make your life easier but also help
+others who may work with your code in the future. The effort you put into
+writing clean and effective scripts will pay off in the long run.
+
+Happy coding!
+
diff --git a/data/README.md b/data/README.md
new file mode 100644
index 0000000..dfa6eb5
--- /dev/null
+++ b/data/README.md
@@ -0,0 +1,3 @@
+# Tips and Tricks for the data directory
+
+The filename should contain information about the date